Large Language Models

TTCS: Test-Time Curriculum Synthesis for Self-Evolving

Chengyi Yang, Zhishang Xiang, Yunbo Tang, Zongpei Teng, Chengsong Huang, Fei Long, Yuhan Liu, Jinsong Su
Published: January 30, 2026
Authors: 8
Word count: 8,895
Code: Includes code

TTCS enables LLMs to self-evolve during inference.

Abstract

Test-time training offers a promising way to improve the reasoning ability of large language models (LLMs) by adapting the model using only the test questions. However, existing methods struggle with difficult reasoning problems for two reasons: raw test questions are often too difficult to yield high-quality pseudo-labels, and the limited size of test sets makes continuous online updates prone to instability. To address these limitations, we propose TTCS, a co-evolving test-time training framework. Specifically, TTCS initializes two policies from the same pretrained model: a question synthesizer and a reasoning solver. These policies evolve through iterative optimization: the synthesizer generates progressively challenging question variants conditioned on the test questions, creating a structured curriculum tailored to the solver's current capability, while the solver updates itself using self-consistency rewards computed from multiple sampled responses on both original test questions and synthetic ones. Crucially, the solver's feedback guides the synthesizer to generate questions aligned with the model's current capability, and the generated question variants in turn stabilize the solver's test-time training. Experiments show that TTCS consistently strengthens reasoning ability on challenging mathematical benchmarks and transfers to general-domain tasks across different LLM backbones, highlighting a scalable path towards dynamically constructing test-time curricula for self-evolving LLMs. Our code and implementation details are available at https://github.com/XMUDeepLIT/TTCS.
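The self-consistency reward described above can be sketched as majority voting over the final answers of multiple sampled responses: the majority answer serves as a pseudo-label, and each response is rewarded for agreeing with it. The sketch below illustrates this general technique only; the function name and the binary reward scheme are illustrative assumptions, not the paper's exact formulation.

```python
from collections import Counter

def self_consistency_rewards(answers):
    """Given final answers extracted from k sampled responses to one
    question, take the majority-vote answer as a pseudo-label and give
    each response a binary reward for matching it.

    Note: a generic self-consistency sketch, not TTCS's exact reward.
    """
    pseudo_label, _ = Counter(answers).most_common(1)[0]
    rewards = [1.0 if a == pseudo_label else 0.0 for a in answers]
    return pseudo_label, rewards

# Example with 5 hypothetical sampled answers to one math question:
label, rewards = self_consistency_rewards(["42", "42", "17", "42", "42"])
# label is "42"; the dissenting response gets reward 0.0
```

In a full test-time training loop, these per-response rewards would then drive a policy-gradient update of the solver on both the original and synthetic questions.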

Key Takeaways

  1. TTCS dynamically constructs a curriculum for self-evolving models.

  2. Significant accuracy boost on mathematical benchmarks.

  3. Strong generalization across different datasets.

Limitations

  • Relies on the quality of the generated synthetic questions.

  • High computational requirements for synthetic question generation.

Keywords

test-time training, large language models, pseudo-labels, self-consistency rewards, question synthesizer, reasoning solver, iterative optimization, test-time curricula, mathematical benchmarks, general-domain tasks
