Large Language Models

Data Repetition Beats Data Scaling in Long-CoT Supervised Fine-Tuning

Dawid J. Kopiczko, Sagar Vaze, Tijmen Blankevoort, Yuki M. Asano
Published: February 11, 2026
Authors: 4
Word Count: 6,378

Data repetition outperforms data scaling for fine-tuning language models on chain-of-thought reasoning.

Abstract

Supervised fine-tuning (SFT) on chain-of-thought data is an essential post-training step for reasoning language models. Standard machine learning intuition suggests that training with more unique training samples yields better generalization. Counterintuitively, we show that SFT benefits from repetition: under a fixed update budget, training for more epochs on smaller datasets outperforms single-epoch training on larger datasets. On AIME'24/25 and GPQA benchmarks, Olmo3-7B trained for 128 epochs on 400 samples outperforms the equivalent single-epoch run on 51,200 samples by 12-26 percentage points, with no additional catastrophic forgetting. We find that training token accuracy reliably signals when repetition has saturated; improvements from additional epochs plateau at full memorization, a pattern consistent across all settings. These findings provide a practical approach for reasoning SFT, where scaling epochs with token accuracy as a stopping criterion can replace expensive undirected data scaling. We pose the repetition advantage, where full memorization coincides with improved generalization, as a new open problem for the community in understanding the training dynamics of large language models.
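The recipe the abstract describes (repeat a small chain-of-thought dataset for many epochs and stop once training token accuracy saturates) can be sketched roughly as below. This is a minimal illustration, not the authors' code: the `train_one_epoch` and `token_accuracy` helpers are toy placeholders, and the 0.99 saturation threshold is an assumed value, not one reported in the paper.

```python
def train_one_epoch(model, dataset):
    """Placeholder for one supervised fine-tuning pass over the CoT dataset."""
    model["epochs_seen"] += 1

def token_accuracy(model, dataset):
    """Placeholder: fraction of training tokens predicted correctly.
    Here it simply rises with repetition, to illustrate the stopping rule."""
    return min(1.0, 0.5 + 0.004 * model["epochs_seen"])

def repeat_until_memorized(model, dataset, max_epochs=128, saturation=0.99):
    # 128 epochs on ~400 samples matches the paper's strongest setting;
    # the 0.99 saturation threshold is an illustrative choice, not the authors'.
    for epoch in range(1, max_epochs + 1):
        train_one_epoch(model, dataset)
        acc = token_accuracy(model, dataset)
        if acc >= saturation:           # gains plateau at full memorization
            print(f"stopping at epoch {epoch}, train token accuracy {acc:.3f}")
            break
    return model

model = {"epochs_seen": 0}              # toy stand-in for a language model
repeat_until_memorized(model, dataset=["~400 CoT samples"])
```

The key point is that the stopping signal is computed on the training set itself, so no held-out reasoning data is needed to decide when further repetition stops helping.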

Key Takeaways

  1. Training language models on repeated data beats training once on larger datasets for chain-of-thought reasoning tasks.

  2. Olmo3-7B trained for 128 epochs on 400 samples outperformed single-epoch training on 51,200 samples by 12-26 percentage points (see the quick budget check after this list).

  3. High-quality reasoning data is scarce and expensive, making data repetition more practical than scaling for supervised fine-tuning.
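For concreteness, the comparison in takeaway 2 holds the number of optimization updates fixed: 400 samples seen 128 times covers the same number of training examples as 51,200 samples seen once. The check below is a simple illustration of that equivalence; the batch size is a hypothetical value, not one taken from the paper.

```python
# Fixed update budget: unique samples x epochs is equal in both settings.
repeat_setting = 400 * 128       # 400 unique samples, 128 epochs
scale_setting = 51_200 * 1       # 51,200 unique samples, 1 epoch
assert repeat_setting == scale_setting == 51_200

# With a hypothetical batch size of 32, both settings use the same
# number of gradient updates; only the number of unique samples differs.
batch_size = 32
print(repeat_setting // batch_size)   # -> 1600 updates either way
```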

Limitations

  • The study covers only three models and three benchmarks, limiting generalizability across different architectures and evaluation settings.

  • Reasoning data from the Dolci SFT dataset may not represent all types of training data or real-world applications.

Keywords

supervised fine-tuning, chain-of-thought data, reasoning language models, training epochs, token accuracy, memorization, generalization, AIME, GPQA, catastrophic forgetting
