GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization Paper • 2601.05242 • Published 2 days ago • 113
Can LLMs Predict Their Own Failures? Self-Awareness via Internal Circuits Paper • 2512.20578 • Published 18 days ago • 68
On GRPO Collapse in Search-R1: The Lazy Likelihood-Displacement Death Spiral Paper • 2512.04220 • Published Dec 3, 2025 • 13
Nemotron-Cascade: Scaling Cascaded Reinforcement Learning for General-Purpose Reasoning Models Paper • 2512.13607 • Published 26 days ago • 30
Article Apriel-1.6-15b-Thinker: Cost-efficient Frontier Multimodal Performance • Published Dec 9, 2025 • 82
Nemotron-Post-Training-v3 Collection Datasets used in the post-training phase of Nemotron Nano v3 • 7 items • Updated 18 days ago • 56
Tiny-A2D Collection Small diffusion language models adapted from autoregressive (AR) models • 4 items • Updated Dec 6, 2025 • 13
Stabilizing Reinforcement Learning with LLMs: Formulation and Practices Paper • 2512.01374 • Published Dec 1, 2025 • 96
HelpSteer3-Preference: Open Human-Annotated Preference Data across Diverse Tasks and Languages Paper • 2505.11475 • Published May 16, 2025 • 4
Skywork-Reward-V2: Scaling Preference Data Curation via Human-AI Synergy Paper • 2507.01352 • Published Jul 2, 2025 • 56
Apertus LLM Collection Democratizing Open and Compliant LLMs for Global Language Environments: 8B and 70B open-data open-weights models, multilingual in >1000 languages • 4 items • Updated Oct 1, 2025 • 320
— Long-context post-training 🧶 — Collection Resources for post-training LLMs with long-context samples • 5 items • Updated Sep 14, 2025 • 6
BeyondWeb: Lessons from Scaling Synthetic Data for Trillion-scale Pretraining Paper • 2508.10975 • Published Aug 14, 2025 • 60
Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach Paper • 2502.05171 • Published Feb 7, 2025 • 151
MaPPO: Maximum a Posteriori Preference Optimization with Prior Knowledge Paper • 2507.21183 • Published Jul 27, 2025 • 14