arxiv:2602.05400

OPUS: Towards Efficient and Principled Data Selection in Large Language Model Pre-training in Every Iteration

Published on Feb 5 · Submitted by Xuan Ouyang on Feb 11

Abstract

OPUS is a dynamic data selection framework that improves pre-training efficiency by scoring data candidates based on optimizer-induced update projections in a stable proxy-derived target space, achieving superior performance with reduced computational overhead.

AI-generated summary

As high-quality public text approaches exhaustion, a phenomenon known as the Data Wall, pre-training is shifting from more tokens to better tokens. However, existing methods either rely on heuristic static filters that ignore training dynamics, or use dynamic yet optimizer-agnostic criteria based on raw gradients. We propose OPUS (Optimizer-induced Projected Utility Selection), a dynamic data selection framework that defines utility in the optimizer-induced update space. OPUS scores candidates by projecting their effective updates, shaped by modern optimizers, onto a target direction derived from a stable, in-distribution proxy. To ensure scalability, we employ the Ghost technique with CountSketch for computational efficiency, and Boltzmann sampling for data diversity, incurring only 4.7% additional compute overhead. OPUS achieves remarkable results across diverse corpora, quality tiers, optimizers, and model scales. In pre-training of GPT-2 Large/XL on FineWeb and FineWeb-Edu with 30B tokens, OPUS outperforms industrial-level baselines and even full 200B-token training. Moreover, when combined with industrial-level static filters, OPUS further improves pre-training efficiency, even with lower-quality data. Furthermore, in continued pre-training of Qwen3-8B-Base on SciencePedia, OPUS achieves superior performance using only 0.5B tokens compared to full training with 3B tokens, demonstrating significant data efficiency gains in specialized domains.
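Read procedurally, the abstract describes a per-iteration loop: score each candidate by how well its optimizer-shaped update aligns with a proxy-derived target direction, then sample a batch with a Boltzmann rule. The snippet below is a toy, NumPy-only sketch of that loop under simplifying assumptions; the function names (`utility_scores`, `boltzmann_select`), the dimensions, and the use of plain vectors in place of the paper's Ghost/CountSketch machinery are illustrative, not taken from the paper.

```python
# Toy sketch of one OPUS-style selection step (illustrative names and shapes,
# not the paper's implementation).
import numpy as np

rng = np.random.default_rng(0)

def utility_scores(candidate_updates, target_direction):
    """Score each candidate by projecting its effective update onto the
    proxy-derived target direction (higher = better aligned)."""
    target = target_direction / (np.linalg.norm(target_direction) + 1e-12)
    return candidate_updates @ target

def boltzmann_select(scores, k, temperature=1.0):
    """Sample k candidates with probability proportional to exp(score / T),
    trading off utility against diversity."""
    logits = scores / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return rng.choice(len(scores), size=k, replace=False, p=probs)

# Toy example: 1000 candidates with 512-dim (compressed) update representations.
candidates = rng.standard_normal((1000, 512))
proxy_direction = rng.standard_normal(512)

scores = utility_scores(candidates, proxy_direction)
selected = boltzmann_select(scores, k=64, temperature=0.5)
print(selected[:10])
```

Lowering the temperature concentrates selection on the highest-utility candidates; raising it spreads probability mass and preserves diversity.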

Community

Paper submitter

In this paper, we argue that LLM pre-training is entering a “data-wall” regime where readily available high-quality public text is approaching exhaustion, so progress must shift from more tokens to better tokens chosen at the right time. While most existing pipelines either (i) apply static, training-agnostic quality filters or (ii) use dynamic selection criteria defined in raw gradient space, modern LLMs are actually trained with adaptive optimizers like AdamW or Muon whose preconditioning reshapes the effective update direction—creating a fundamental mismatch between “how we score data” and “how training truly updates the model.” To bridge this gap, we introduce OPUS (Optimizer-induced Projected Utility Selection), a dynamic selection framework that defines data utility directly in the optimizer-induced update space: a sample is valuable insofar as its optimizer-shaped effective update aligns with the descent direction of a stable, high-quality target distribution (our proxy).
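To make the "optimizer-induced update space" concrete, here is a minimal sketch of how a candidate's effective update under AdamW-style preconditioning can differ from its raw gradient, and how its alignment with a proxy target direction could be scored. The one-step moment lookahead and the omission of bias correction are simplifying assumptions; the paper's closed-form approximations (and the Muon case) may look different.

```python
# Minimal sketch: AdamW-style effective update vs. raw gradient, and its
# projection onto a proxy target direction. Illustrative only.
import numpy as np

def adamw_effective_update(g, m, v, beta1=0.9, beta2=0.999, eps=1e-8):
    """One-step lookahead of AdamW's update direction if gradient g were applied
    (bias correction omitted for brevity)."""
    m_new = beta1 * m + (1.0 - beta1) * g
    v_new = beta2 * v + (1.0 - beta2) * g * g
    return m_new / (np.sqrt(v_new) + eps)

def projected_utility(g, m, v, target_direction):
    """Alignment of the optimizer-shaped update with the proxy target direction."""
    u = adamw_effective_update(g, m, v)
    t = target_direction / (np.linalg.norm(target_direction) + 1e-12)
    return float(u @ t)

# Toy example in a 1000-dimensional parameter space.
rng = np.random.default_rng(1)
d = 1000
g, m, v = rng.standard_normal(d), np.zeros(d), np.full(d, 1e-4)
target = rng.standard_normal(d)
print(projected_utility(g, m, v, target))
```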


Concretely, OPUS operationalizes this idea through a principled objective, a scalable estimator, and a diversity-preserving selection rule. Our key contributions are: (1) an optimizer-aware utility for dynamic selection, with closed-form approximations for effective update directions under AdamW and Muon, aligning scoring with real training geometry; (2) BENCH-PROXY, an in-distribution proxy construction method that retrieves benchmark-aligned samples from the pre-training corpus to stabilize the target direction; (3) scalable utility estimation using the Ghost technique with CountSketch projections to avoid per-sample gradient materialization; and (4) Boltzmann sampling with redundancy control to prevent diversity collapse under non-stationary streams. Empirically, OPUS delivers strong data and compute efficiency: it incurs only ~4.7% additional compute overhead for selection while achieving large gains across datasets, optimizers, and scales, including a +2.2% average improvement over 10 benchmarks with an 8× compute reduction in one highlighted setting, outperforming industrial static/dynamic baselines, and matching or exceeding much longer-token training in several regimes.
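Contribution (3) relies on sketching per-sample update information into a low-dimensional space where inner products are cheap. Below is a minimal, self-contained CountSketch example, assuming the only goal is to approximate inner products between high-dimensional vectors; the Ghost-technique gradient reconstruction and the exact way OPUS combines the two are not shown.

```python
# Minimal CountSketch sketch: compress high-dimensional vectors so that their
# inner products (and hence projected utilities) can be approximated cheaply.
import numpy as np

rng = np.random.default_rng(2)

def make_countsketch(d_in, d_out, seed=0):
    """Random bucket assignment and sign flip for each input coordinate."""
    r = np.random.default_rng(seed)
    buckets = r.integers(0, d_out, size=d_in)
    signs = r.choice([-1.0, 1.0], size=d_in)
    return buckets, signs

def countsketch(x, buckets, signs, d_out):
    """Project x of shape (d_in,) to a d_out-dimensional sketch."""
    out = np.zeros(d_out)
    np.add.at(out, buckets, signs * x)
    return out

# Inner products in sketch space are unbiased estimates of the originals.
d_in, d_out = 100_000, 4096
buckets, signs = make_countsketch(d_in, d_out, seed=3)
x = rng.standard_normal(d_in)
y = x + 0.1 * rng.standard_normal(d_in)   # correlated pair for a clear signal
exact = float(x @ y)
approx = float(countsketch(x, buckets, signs, d_out) @ countsketch(y, buckets, signs, d_out))
print(exact, approx)
```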
