RLinf-Co: Reinforcement Learning-Based Sim-Real Co-Training for VLA Models

Liangzhi Shi, Shuaihang Chen, Feng Gao, Yinuo Chen, Kang Chen, Tonghe Zhang, Hongzhi Zhang, Weinan Zhang, Chao Yu, Yu Wang
Published: February 13, 2026
Authors: 10
Word Count: 10,508

Co-train VLA models with RL in simulation while anchoring to real-world demonstrations.

Abstract

Simulation offers a scalable and low-cost way to enrich vision-language-action (VLA) training, reducing reliance on expensive real-robot demonstrations. However, most sim-real co-training methods rely on supervised fine-tuning (SFT), which treats simulation as a static source of demonstrations and does not exploit large-scale closed-loop interaction. Consequently, real-world gains and generalization are often limited. In this paper, we propose an RL-based sim-real Co-training (RL-Co) framework that leverages interactive simulation while preserving real-world capabilities. Our method follows a generic two-stage design: we first warm-start the policy with SFT on a mixture of real and simulated demonstrations, then fine-tune it with reinforcement learning in simulation while adding an auxiliary supervised loss on real-world data to anchor the policy and mitigate catastrophic forgetting. We evaluate our framework on four real-world tabletop manipulation tasks using two representative VLA architectures, OpenVLA and π0.5, and observe consistent improvements over real-only fine-tuning and SFT-based co-training, including +24% real-world success on OpenVLA and +20% on π0.5. Beyond higher success rates, RL co-training yields stronger generalization to unseen task variations and substantially improved real-world data efficiency, providing a practical and scalable pathway for leveraging simulation to enhance real-robot deployment.
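
To make the stage-2 objective concrete, here is a minimal sketch (not the authors' released code): an RL loss on simulated rollouts plus an auxiliary supervised term on real demonstrations that anchors the policy. The Policy class, the batch fields, the PPO-style clipped surrogate, and the lambda_real coefficient are all illustrative assumptions, not RLinf-Co's actual API or algorithm choice.

```python
import torch
import torch.nn as nn


class Policy(nn.Module):
    """Toy stand-in for a VLA policy head: maps observations to a Gaussian over actions."""

    def __init__(self, obs_dim=32, act_dim=7):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim)
        )
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def dist(self, obs):
        return torch.distributions.Normal(self.net(obs), self.log_std.exp())


def co_training_loss(policy, sim_batch, real_batch, clip_eps=0.2, lambda_real=0.5):
    # RL term: PPO-style clipped surrogate on closed-loop simulated rollouts.
    dist = policy.dist(sim_batch["obs"])
    logp = dist.log_prob(sim_batch["actions"]).sum(-1)
    ratio = torch.exp(logp - sim_batch["old_logp"])
    adv = sim_batch["advantages"]
    rl_loss = -torch.min(
        ratio * adv, torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    ).mean()
    # Anchor term: supervised (behavior-cloning) loss on real demonstrations,
    # added to mitigate catastrophic forgetting of real-world capabilities.
    real_dist = policy.dist(real_batch["obs"])
    sft_loss = -real_dist.log_prob(real_batch["actions"]).sum(-1).mean()
    return rl_loss + lambda_real * sft_loss


if __name__ == "__main__":
    torch.manual_seed(0)
    policy = Policy()
    sim = {
        "obs": torch.randn(8, 32),
        "actions": torch.randn(8, 7),
        "old_logp": torch.randn(8),
        "advantages": torch.randn(8),
    }
    real = {"obs": torch.randn(8, 32), "actions": torch.randn(8, 7)}
    loss = co_training_loss(policy, sim, real)
    loss.backward()
    print(f"co-training loss: {loss.item():.3f}")
```

Setting lambda_real to zero recovers pure simulation RL; the anchor term is what keeps the policy tied to the real-world demonstration distribution.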

Key Takeaways

  • RLinf-Co combines reinforcement learning with sim-real co-training to improve VLA model performance on specific robotic tasks.

  • Two-stage approach: supervised fine-tuning on mixed real and simulated data, then RL in simulation with real-world regularization to prevent catastrophic forgetting (see the sketch after this list).

  • Simulation generates unlimited interactive experience through trial and error, while real demonstrations anchor the policy to ground truth.
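
A self-contained sketch of that two-stage schedule, under loud assumptions: the linear policy, the fake rollouts, and the reward-weighted regression update are placeholders for a real VLA backbone, a simulator loop, and the paper's RL algorithm, none of which are specified here.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
policy = nn.Linear(32, 7)  # toy stand-in for a VLA action head
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)


def bc_loss(batch):
    """Mean-squared behavior-cloning loss against demonstration actions."""
    return ((policy(batch["obs"]) - batch["actions"]) ** 2).mean()


real = {"obs": torch.randn(64, 32), "actions": torch.randn(64, 7)}
sim_demos = {"obs": torch.randn(64, 32), "actions": torch.randn(64, 7)}

# Stage 1: warm-start with SFT on a mixture of real and simulated demos.
for _ in range(100):
    mixed = {k: torch.cat([real[k], sim_demos[k]]) for k in real}
    loss = bc_loss(mixed)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Stage 2: RL in simulation (here a reward-weighted regression step on fake
# rollouts) plus the auxiliary supervised term on real data as regularizer.
lambda_real = 0.5
for _ in range(100):
    obs = torch.randn(64, 32)                         # fake simulator states
    actions = policy(obs) + 0.1 * torch.randn(64, 7)  # exploratory actions
    reward = -actions.pow(2).sum(-1)                  # fake task reward
    weights = torch.softmax(reward, dim=0).detach()   # favor high-reward actions
    rl_loss = (weights * ((policy(obs) - actions.detach()) ** 2).sum(-1)).sum()
    loss = rl_loss + lambda_real * bc_loss(real)      # real-world anchor
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Note that the stage-2 anchor is the same loss as stage 1 restricted to real data, which is what keeps simulated RL from drifting away from real-world behavior.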

Limitations

  • Pure simulation training causes poor real-world performance due to physics inaccuracies and environmental differences like lighting.

  • Supervised learning alone struggles with covariate shift when robots deviate from demonstrations, compounding errors over time.

Keywords

vision-language-action, sim-real co-training, reinforcement learning, supervised fine-tuning, policy optimization, catastrophic forgetting, real-world data efficiency
