Large Language Models

OPUS: Towards Efficient and Principled Data Selection in Large Language Model Pre-training in Every Iteration

Shaobo Wang, Xuan Ouyang, Tianyi Xu, Yuzheng Hu, Jialin Liu, Guo Chen, Tianyu Zhang, Junhao Zheng, Kexin Yang, Xingzhang Ren, Dayiheng Liu, Linfeng Zhang
Published: February 5, 2026
Authors: 12
Word Count: 38,572

Dynamic data selection aligned with optimizer geometry for efficient language model pre-training.

Abstract

As high-quality public text approaches exhaustion, a phenomenon known as the Data Wall, pre-training is shifting from more tokens to better tokens. However, existing methods either rely on heuristic static filters that ignore training dynamics or use dynamic yet optimizer-agnostic criteria based on raw gradients. We propose OPUS (Optimizer-induced Projected Utility Selection), a dynamic data selection framework that defines utility in the optimizer-induced update space. OPUS scores candidates by projecting their effective updates, as shaped by modern optimizers, onto a target direction derived from a stable, in-distribution proxy. To ensure scalability, we employ the Ghost technique with CountSketch for computational efficiency and Boltzmann sampling for data diversity, incurring only 4.7% additional compute overhead. OPUS achieves strong results across diverse corpora, quality tiers, optimizers, and model scales. When pre-training GPT-2 Large/XL on FineWeb and FineWeb-Edu with 30B tokens, OPUS outperforms industrial-grade baselines and even full 200B-token training. Moreover, when combined with industrial-grade static filters, OPUS further improves pre-training efficiency, even on lower-quality data. Finally, in continued pre-training of Qwen3-8B-Base on SciencePedia, OPUS achieves superior performance using only 0.5B tokens compared to full training with 3B tokens, demonstrating significant data-efficiency gains in specialized domains.
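
The scoring idea from the abstract, an effective (optimizer-preconditioned) update projected onto a target direction, with CountSketch compression for efficiency, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the AdamW-style preconditioning, the sketch width, and all function names here are assumptions for exposition.

```python
import numpy as np

def adamw_effective_update(grad, m, v, beta1=0.9, beta2=0.999, eps=1e-8):
    """Illustrative AdamW-preconditioned ("effective") update direction."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment (momentum) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2     # second-moment estimate
    return m / (np.sqrt(v) + eps)

def countsketch(x, width, seed=0):
    """Compress a vector via CountSketch: random hash bucket plus random sign."""
    rng = np.random.default_rng(seed)
    buckets = rng.integers(0, width, size=x.shape[0])
    signs = rng.choice([-1.0, 1.0], size=x.shape[0])
    out = np.zeros(width)
    np.add.at(out, buckets, signs * x)          # scatter-add signed entries
    return out

def opus_utility(candidate_grad, m, v, target_dir, sketch_width=256):
    """Score a candidate by the inner product of its sketched effective
    update with the sketched target direction (preserved in expectation)."""
    u = countsketch(adamw_effective_update(candidate_grad, m, v), sketch_width)
    t = countsketch(target_dir, sketch_width)
    return float(u @ t)
```

Because CountSketch is linear and sign-unbiased, inner products in the sketched space approximate inner products in the full update space, which is what makes scoring every candidate at each iteration affordable.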

Key Takeaways

  1. OPUS selects training data dynamically by scoring in the optimizer's actual geometry rather than in raw gradient space.

  2. Static data filtering cannot adapt to the model's changing needs over the course of training, making dynamic selection essential.

  3. Redundancy penalties prevent repeated selection of similar samples, improving diversity and training efficiency.

Limitations

  • OPUS requires computing preconditioners for specific optimizers like AdamW and Muon, limiting generalizability.

  • Scaling OPUS to massive datasets demands significant engineering complexity and computational overhead.

Keywords

data selection, optimizer-induced update space, effective updates, stable in-distribution proxy, Ghost technique, CountSketch, Boltzmann sampling, pre-training, GPT-2, Qwen3-8B-Base, FineWeb, FineWeb-Edu, SciencePedia
