arxiv:2604.13016

Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe

Published on Apr 14 · Submitted by Bingxiang He on Apr 15 · #2 Paper of the day
AI-generated summary

On-policy distillation dynamics in large language models depend on compatible thinking patterns between teacher and student models, with successful distillation characterized by alignment on high-probability tokens and requiring teachers to provide novel capabilities beyond student training data.

Abstract

On-policy distillation (OPD) has become a core technique in the post-training of large language models, yet its training dynamics remain poorly understood. This paper provides a systematic investigation of OPD dynamics and mechanisms. We first identify two conditions that govern whether OPD succeeds or fails: (i) the student and teacher should share compatible thinking patterns; and (ii) even with consistent thinking patterns and higher scores, the teacher must offer genuinely new capabilities beyond what the student has seen during training. We validate these findings through weak-to-strong reverse distillation, showing that same-family 1.5B and 7B teachers are distributionally indistinguishable from the student's perspective. Probing the token-level mechanism, we show that successful OPD is characterized by progressive alignment on high-probability tokens at student-visited states, where a small shared token set concentrates most of the probability mass (97%-99%). We further propose two practical strategies to recover failing OPD: off-policy cold start and teacher-aligned prompt selection. Finally, we show that OPD's apparent free lunch of dense token-level reward comes at a cost, raising the question of whether OPD can scale to long-horizon distillation.
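The token-level mechanism described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the toy distributions, the choice of reverse KL as the per-token objective, and the top-k intersection used to measure the shared token set are all illustrative assumptions.

```python
import math

def reverse_kl(student, teacher):
    """Token-level reverse KL D(student || teacher): a common dense
    per-token signal in on-policy distillation, evaluated at states
    the student's own rollouts visit (illustrative choice)."""
    return sum(p * math.log(p / teacher[t]) for t, p in student.items() if p > 0)

def shared_topk_mass(student, teacher, k=3):
    """Probability mass each model places on the intersection of their
    top-k tokens -- a proxy for the small shared token set the paper
    reports concentrates 97%-99% of the mass in successful OPD."""
    top = lambda d: set(sorted(d, key=d.get, reverse=True)[:k])
    shared = top(student) & top(teacher)
    return sum(student[t] for t in shared), sum(teacher[t] for t in shared)

# Toy next-token distributions over a 4-token vocabulary at one state.
student = {"a": 0.70, "b": 0.20, "c": 0.06, "d": 0.04}
teacher = {"a": 0.60, "b": 0.30, "c": 0.05, "d": 0.05}

kl = reverse_kl(student, teacher)            # small, positive divergence
s_mass, t_mass = shared_topk_mass(student, teacher, k=3)
```

Here the shared top-3 set {a, b, c} carries 0.96 of the student's mass and 0.95 of the teacher's, mirroring the concentration the paper describes; "progressive alignment" corresponds to this KL shrinking over training on exactly those high-probability tokens.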

Community

Paper author · Paper submitter

We investigate the dynamics and mechanisms of on-policy distillation (OPD) of LLMs, and propose practical strategies to recover failing OPD.


I have a question about Sec. 5.1 and Figure 8. After SFT on 20k prompts, the student model reaches much better performance on the three benchmarks and improves only a little during subsequent OPD training, while pure OPD training improves faster. What if pure OPD were also trained on the same 20k prompts? Also, the three metrics for pure OPD training appear to stabilize after 120 steps.

·
Paper author

Thanks for the great question! We actually ran pure OPD much longer, for up to 500 steps (batch size 64), and it had converged well before that. As you noted, the metrics stabilize around step 120, and further training brings no additional gains, so training on more prompts would likely not change the conclusion.

Also worth noting: OPD is far more expensive than SFT per sample (due to on-policy rollouts). In GPU hours, running 200K steps of OPD would cost significantly more than the 200K SFT cold start. Our conclusion that SFT+OPD achieves a higher ceiling is based on training both settings to saturation, not early stopping.


Get this paper in your agent:

hf papers read 2604.13016
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 1

Datasets citing this paper 0


Spaces citing this paper 0


Collections including this paper 5