Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe
Abstract
On-policy distillation dynamics in large language models depend on compatible thinking patterns between teacher and student models, with successful distillation characterized by alignment on high-probability tokens and requiring teachers to provide novel capabilities beyond student training data.
On-policy distillation (OPD) has become a core technique in the post-training of large language models, yet its training dynamics remain poorly understood. This paper provides a systematic investigation of OPD dynamics and mechanisms. We first identify two conditions that govern whether OPD succeeds or fails: (i) the student and teacher should share compatible thinking patterns; and (ii) even with consistent thinking patterns and higher scores, the teacher must offer genuinely new capabilities beyond what the student has seen during training. We validate these findings through weak-to-strong reverse distillation, showing that same-family 1.5B and 7B teachers are distributionally indistinguishable from the student's perspective. Probing the token-level mechanism, we show that successful OPD is characterized by progressive alignment on high-probability tokens at student-visited states, with a small shared token set concentrating most of the probability mass (97%-99%). We further propose two practical strategies to recover failing OPD: off-policy cold start and teacher-aligned prompt selection. Finally, we show that OPD's apparent free lunch of dense token-level reward comes at a cost, raising the question of whether OPD can scale to long-horizon distillation.
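The core loop the abstract refers to can be sketched as a per-token divergence between student and teacher at states the student itself visits. Below is a minimal illustration assuming a reverse-KL token loss; the function names, toy dimensions, and the choice of reverse KL are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def opd_token_loss(student_logits, teacher_logits):
    """Per-token reverse KL(student || teacher) at states reached by the
    student's own rollout -- the 'dense token-level reward' of OPD.
    (Reverse KL is one common choice; this sketch assumes it.)"""
    p_s = softmax(student_logits)
    p_t = softmax(teacher_logits)
    return np.sum(p_s * (np.log(p_s) - np.log(p_t)), axis=-1)

rng = np.random.default_rng(0)
T, V = 4, 8                      # toy rollout length and vocabulary size
s = rng.normal(size=(T, V))      # student logits along its own rollout
t = rng.normal(size=(T, V))      # teacher logits at the same states

loss = opd_token_loss(s, t)      # one supervision signal per generated token
assert loss.shape == (T,) and np.all(loss >= 0)      # KL is non-negative
assert np.allclose(opd_token_loss(s, s), 0.0)        # identical dists -> 0
```

Because the loss is computed at every token of the student's own sample, each rollout yields a dense learning signal, in contrast to a single sequence-level reward.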
Community
We investigate the dynamics and mechanisms of on-policy distillation (OPD) of LLMs, and propose practical strategies to recover failing OPD.
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models (2026)
- Revisiting On-Policy Distillation: Empirical Failure Modes and Simple Fixes (2026)
- Entropy-Aware On-Policy Distillation of Language Models (2026)
- Reinforcement-aware Knowledge Distillation for LLM Reasoning (2026)
- SCOPE: Signal-Calibrated On-Policy Distillation Enhancement with Dual-Path Adaptive Weighting (2026)
- Fast and Effective On-policy Distillation from Reasoning Prefixes (2026)
- Scaling Reasoning Efficiently via Relaxed On-Policy Distillation (2026)
I have a question about Sec. 5.1 and Figure 8. After SFT on 20k prompts, the student model reaches much better performance on the three benchmarks and only improves a little during later OPD training, while pure OPD training improves faster. What if pure OPD were also given the same 20k prompts to train on? Also, the three metrics of pure OPD training seem to stabilize after 120 steps.
Thanks for the great question! We actually ran pure OPD much longer, with up to 500 steps (batch size 64), and it had already converged well before that. As you noted, the metrics stabilize around step 120, and further training brings no additional gains. So training on more prompts may not change the conclusion.
Also worth noting: OPD is far more expensive than SFT per sample (due to on-policy rollouts). In GPU hours, running 200K steps of OPD would cost significantly more than the 200K SFT cold start. Our conclusion that SFT+OPD achieves a higher ceiling is based on training both settings to saturation, not early stopping.
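The per-sample cost gap mentioned above can be made concrete with a back-of-envelope calculation. The GPU-second figures below are purely illustrative assumptions (not measurements from the paper); the point is only that OPD pays for an autoregressive rollout plus teacher scoring on top of the gradient step.

```python
# Hypothetical per-sample GPU-seconds; every number here is an assumption
# chosen for illustration, not a measured value from the paper.
sft_gpu_s_per_sample = 1.0       # forward + backward on a fixed target
opd_rollout_gpu_s = 4.0          # autoregressive student rollout (on-policy)
opd_teacher_score_gpu_s = 1.0    # teacher log-probs over the rollout
opd_update_gpu_s = 1.0           # student gradient step on the KL loss

opd_gpu_s_per_sample = (opd_rollout_gpu_s
                        + opd_teacher_score_gpu_s
                        + opd_update_gpu_s)
ratio = opd_gpu_s_per_sample / sft_gpu_s_per_sample
print(f"OPD costs ~{ratio:.0f}x SFT per sample under these assumptions")
```

Under these toy numbers OPD is 6x more expensive per sample, which is why comparing SFT+OPD against pure OPD at equal prompt counts, rather than equal GPU hours, would understate the cost of the pure-OPD setting.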
Get this paper in your agent:
hf papers read 2604.13016
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
Models citing this paper 1