
Towards Bridging the Gap between Large-Scale Pretraining and Efficient Finetuning for Humanoid Control

Weidong Huang, Zhehan Li, Hangxin Liu, Biao Hou, Yao Su, Jingwen Zhang
Published: January 29, 2026
Authors: 6
Word Count: 11,203
Code: Includes code

Revolutionizing humanoid robot training with the LIFT framework.

Abstract

Reinforcement learning (RL) is widely used for humanoid control, with on-policy methods such as Proximal Policy Optimization (PPO) enabling robust training via large-scale parallel simulation and, in some cases, zero-shot deployment to real robots. However, the low sample efficiency of on-policy algorithms limits safe adaptation to new environments. Although off-policy RL and model-based RL have shown improved sample efficiency, a gap remains between large-scale pretraining and efficient finetuning on humanoids. In this paper, we find that off-policy Soft Actor-Critic (SAC), with large-batch updates and a high Update-To-Data (UTD) ratio, reliably supports large-scale pretraining of humanoid locomotion policies, achieving zero-shot deployment on real robots. For adaptation, we demonstrate that these SAC-pretrained policies can be finetuned in new environments and on out-of-distribution tasks using model-based methods. Data collection in the new environment uses a deterministic policy, while stochastic exploration is confined to a physics-informed world model. This separation mitigates the risks of random exploration during adaptation while preserving exploratory coverage for improvement. Overall, the approach couples the wall-clock efficiency of large-scale simulation during pretraining with the sample efficiency of model-based learning during finetuning.
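The pretraining recipe the abstract describes, large-batch SAC updates at a high Update-To-Data ratio, can be sketched as a training loop that performs several gradient updates for every environment transition collected. This is a minimal illustrative sketch, not the paper's code: `env_step` and `sac_update` are placeholder names standing in for parallel simulation and the SAC gradient step.

```python
import random
from collections import deque

def env_step():
    """Placeholder: one transition (s, a, r, s') from parallel simulation."""
    return tuple(random.random() for _ in range(4))

def sac_update(batch):
    """Placeholder for one large-batch SAC gradient step (actor + critics)."""
    return len(batch)  # a real implementation would return losses

def pretrain(num_env_steps=100, utd_ratio=8, batch_size=32, warmup=32):
    """High-UTD loop: `utd_ratio` replay-buffer updates per collected transition."""
    buffer = deque(maxlen=10_000)
    updates = 0
    for _ in range(num_env_steps):
        buffer.append(env_step())          # 1 unit of new data
        if len(buffer) < warmup:
            continue                       # wait until sampling is possible
        for _ in range(utd_ratio):         # many updates per unit of data
            batch = random.sample(buffer, batch_size)
            sac_update(batch)
            updates += 1
    return updates
```

With a UTD ratio of 8, gradient updates outnumber collected transitions 8:1 after warmup, which is the lever the abstract credits for sample-efficient pretraining.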

Key Takeaways

  1. Introduces the LIFT framework for efficient humanoid robot training.

  2. Combines SAC pretraining with a physics-informed world model.

  3. Enhances policy robustness and sample efficiency during finetuning.
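The adaptation scheme behind these takeaways, deterministic data collection in the real environment with stochastic exploration confined to the world model, can be illustrated with a toy sketch. Every function below is an assumption for illustration (scalar dynamics, a linear policy), not the authors' implementation.

```python
import random

def mean_action(state):
    """Deterministic policy head: mean action, no sampling (safe to deploy)."""
    return -0.1 * state

def sampled_action(state, noise_std=0.3):
    """Stochastic policy head: sampled action, used only in imagination."""
    return -0.1 * state + random.gauss(0.0, noise_std)

def real_dynamics(state, action):
    """Stand-in for the real environment's transition function."""
    return state + action

def world_model(state, action):
    """Stand-in for the learned, physics-informed world model."""
    return state + action  # assumed to approximate real_dynamics

def collect_real(state, horizon):
    """Data collection in the new environment: deterministic actions only."""
    traj = []
    for _ in range(horizon):
        action = mean_action(state)
        state = real_dynamics(state, action)
        traj.append((state, action))
    return traj

def imagine(state, horizon):
    """Imagined rollout inside the world model: exploratory, risk-free."""
    traj = []
    for _ in range(horizon):
        action = sampled_action(state)
        state = world_model(state, action)
        traj.append((state, action))
    return traj
```

Because `collect_real` never samples, repeated rollouts from the same state are identical, so no exploration risk is incurred on hardware; `imagine` keeps the exploratory randomness inside the model, where unsafe actions have no physical cost.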

Limitations

  • Requires large-scale parallel simulations for pretraining.

  • Physics-informed world model may need domain-specific tuning.

Keywords

Proximal Policy Optimization, Soft Actor-Critic, on-policy methods, off-policy RL, model-based RL, large-scale parallel simulation, zero-shot deployment, sample efficiency, large-batch update, Update-To-Data ratio, deterministic policy, stochastic exploration, physics-informed world model
