Reinforcement Learning

Less Noise, More Voice: Reinforcement Learning for Reasoning via Instruction Purification

Yiju Guo, Tianyi Hu, Zexu Sun, Yankai Lin
Published: January 29, 2026
Authors: 4
Word count: 6,879
Code: included

LENS boosts LLM reasoning by purifying noisy prompts.

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has advanced LLM reasoning, but it remains constrained by inefficient exploration under limited rollout budgets, leading to low sampling success and unstable training on complex tasks. We find that many exploration failures arise not from problem difficulty, but from a small number of prompt tokens that introduce interference. Building on this insight, we propose the Less Noise Sampling Framework (LENS), which first purifies prompts by identifying and removing interference tokens, then transfers successful rollouts from the purification process to supervise policy optimization on the original noisy prompts, enabling the model to learn to ignore interference in real-world, noisy prompting settings. Experimental results show that LENS significantly outperforms GRPO, delivering higher performance and faster convergence, with a 3.88% average gain and over 1.6× speedup. Our work highlights the critical role of pruning interference tokens in improving rollout efficiency, offering a new perspective for RLVR research.
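The purify-then-transfer loop described in the abstract can be sketched in a few lines. This is a minimal toy illustration, not the authors' implementation: the function names (`purify`, `lens_step`), the assumption that interference tokens are already identified, and the stand-in policy and verifier are all hypothetical.

```python
# Toy sketch of the LENS idea (hypothetical names; not the paper's code):
# 1) purify a noisy prompt by dropping tokens flagged as interference,
# 2) sample rollouts on the purified prompt,
# 3) pair successful rollouts with the ORIGINAL noisy prompt as
#    supervision, so the policy learns to ignore the interference.

def purify(prompt_tokens, interference):
    """Remove tokens judged to be interference (assumed given here)."""
    return [t for t in prompt_tokens if t not in interference]

def sample_rollouts(prompt_tokens, policy, n=4):
    """Stand-in for LLM sampling: the toy 'policy' maps a prompt to answers."""
    return [policy(prompt_tokens) for _ in range(n)]

def lens_step(noisy_prompt, interference, policy, verifier):
    clean_prompt = purify(noisy_prompt, interference)
    rollouts = sample_rollouts(clean_prompt, policy)
    # Keep only rollouts that pass the verifiable-reward check.
    successes = [r for r in rollouts if verifier(r)]
    # Transfer: supervise optimization on the original noisy prompt
    # (here we simply return the (noisy prompt, rollout) training pairs).
    return [(noisy_prompt, r) for r in successes]

if __name__ == "__main__":
    noisy = ["solve", "2+2", "btw", "ignore", "this"]
    interference = {"btw", "ignore", "this"}
    policy = lambda toks: "4" if "2+2" in toks else "?"
    verifier = lambda ans: ans == "4"
    pairs = lens_step(noisy, interference, policy, verifier)
    print(len(pairs))  # → 4
```

In an actual RLVR setup the returned pairs would feed a GRPO-style update; the point of the sketch is only the data flow: rollouts are sampled on the clean prompt but credited against the noisy one.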

Key Takeaways

  1. LENS significantly improves model performance by reducing interference tokens.

  2. LENS achieves faster convergence and higher accuracy with fewer resources.

  3. Purified rollouts enhance policy optimization in RLVR frameworks.

Limitations

  • Experiments were conducted on specific model families and benchmarks.

  • Generalizability to other models and tasks needs further validation.

Keywords

Reinforcement Learning with Verifiable Rewards, exploration, rollout budget, sampling success, policy optimization, interference tokens, prompt tokens, GRPO, LENS
