Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe
Abstract
On-policy distillation dynamics in large language models depend on compatible thinking patterns between teacher and student models, with successful distillation characterized by alignment on high-probability tokens and requiring teachers to provide novel capabilities beyond student training data.
On-policy distillation (OPD) has become a core technique in the post-training of large language models, yet its training dynamics remain poorly understood. This paper provides a systematic investigation of OPD dynamics and mechanisms. We first identify two conditions that govern whether OPD succeeds or fails: (i) the student and teacher should share compatible thinking patterns; and (ii) even with consistent thinking patterns and higher scores, the teacher must offer genuinely new capabilities beyond what the student has seen during training. We validate these findings through weak-to-strong reverse distillation, showing that same-family 1.5B and 7B teachers are distributionally indistinguishable from the student's perspective. Probing the token-level mechanism, we show that successful OPD is characterized by progressive alignment on high-probability tokens at student-visited states, with a small shared token set concentrating most of the probability mass (97%-99%). We further propose two practical strategies to recover failing OPD: off-policy cold start and teacher-aligned prompt selection. Finally, we show that OPD's apparent free lunch of dense token-level reward comes at a cost, raising the question of whether OPD can scale to long-horizon distillation.
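The core loop the abstract refers to can be sketched as a per-token divergence between student and teacher at states the student itself visits. Below is a minimal illustration assuming a reverse-KL token loss; the function names, toy dimensions, and the choice of reverse KL are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def opd_token_loss(student_logits, teacher_logits):
    """Per-token reverse KL(student || teacher) at states reached by the
    student's own rollout -- the 'dense token-level reward' of OPD.
    (Reverse KL is one common choice; this sketch assumes it.)"""
    p_s = softmax(student_logits)
    p_t = softmax(teacher_logits)
    return np.sum(p_s * (np.log(p_s) - np.log(p_t)), axis=-1)

rng = np.random.default_rng(0)
T, V = 4, 8                      # toy rollout length and vocabulary size
s = rng.normal(size=(T, V))      # student logits along its own rollout
t = rng.normal(size=(T, V))      # teacher logits at the same states

loss = opd_token_loss(s, t)      # one supervision signal per generated token
assert loss.shape == (T,) and np.all(loss >= 0)      # KL is non-negative
assert np.allclose(opd_token_loss(s, s), 0.0)        # identical dists -> 0
```

Because the loss is computed at every token of the student's own sample, each rollout yields a dense learning signal, in contrast to a single sequence-level reward.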
Community
We investigate the dynamics and mechanisms of on-policy distillation (OPD) of LLMs, and propose practical strategies to recover failing OPD.
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models (2026)
- Revisiting On-Policy Distillation: Empirical Failure Modes and Simple Fixes (2026)
- Entropy-Aware On-Policy Distillation of Language Models (2026)
- Reinforcement-aware Knowledge Distillation for LLM Reasoning (2026)
- SCOPE: Signal-Calibrated On-Policy Distillation Enhancement with Dual-Path Adaptive Weighting (2026)
- Fast and Effective On-policy Distillation from Reasoning Prefixes (2026)
- Scaling Reasoning Efficiently via Relaxed On-Policy Distillation (2026)
I have a question about Sec. 5.1 and Figure 8. After SFT on 20k prompts, the student model reaches much better performance on the three benchmarks and only improves a little during later OPD training, while pure OPD training improves faster. What if pure OPD were also given the same 20k prompts to train on? Also, the three metrics of pure OPD training seem to stabilize after 120 steps.
Thanks for the great question! We actually ran pure OPD much longer, with up to 500 steps (batch size 64), and it had already converged well before that. As you noted, the metrics stabilize around step 120, and further training brings no additional gains. So training on more prompts may not change the conclusion.
Also worth noting: OPD is far more expensive than SFT per sample (due to on-policy rollouts). In GPU hours, running 200K steps of OPD would cost significantly more than the 200K SFT cold start. Our conclusion that SFT+OPD achieves a higher ceiling is based on training both settings to saturation, not early stopping.
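The per-sample cost gap mentioned above can be made concrete with a back-of-envelope calculation. The GPU-second figures below are purely illustrative assumptions (not measurements from the paper); the point is only that OPD pays for an autoregressive rollout plus teacher scoring on top of the gradient step.

```python
# Hypothetical per-sample GPU-seconds; every number here is an assumption
# chosen for illustration, not a measured value from the paper.
sft_gpu_s_per_sample = 1.0       # forward + backward on a fixed target
opd_rollout_gpu_s = 4.0          # autoregressive student rollout (on-policy)
opd_teacher_score_gpu_s = 1.0    # teacher log-probs over the rollout
opd_update_gpu_s = 1.0           # student gradient step on the KL loss

opd_gpu_s_per_sample = (opd_rollout_gpu_s
                        + opd_teacher_score_gpu_s
                        + opd_update_gpu_s)
ratio = opd_gpu_s_per_sample / sft_gpu_s_per_sample
print(f"OPD costs ~{ratio:.0f}x SFT per sample under these assumptions")
```

Under these toy numbers OPD is 6x more expensive per sample, which is why comparing SFT+OPD against pure OPD at equal prompt counts, rather than equal GPU hours, would understate the cost of the pure-OPD setting.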
Get this paper in your agent:
hf papers read 2604.13016
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
Models citing this paper 1