
Towards Bridging the Gap between Large-Scale Pretraining and Efficient Finetuning for Humanoid Control

Weidong Huang, Zhehan Li, Hangxin Liu, Biao Hou, Yao Su, Jingwen Zhang
Published: January 29, 2026
Authors: 6
Word Count: 11,203
Code: Includes code

Revolutionizing humanoid robot training with the LIFT framework.

Abstract

Reinforcement learning (RL) is widely used for humanoid control, with on-policy methods such as Proximal Policy Optimization (PPO) enabling robust training via large-scale parallel simulation and, in some cases, zero-shot deployment to real robots. However, the low sample efficiency of on-policy algorithms limits safe adaptation to new environments. Although off-policy RL and model-based RL have shown improved sample efficiency, a gap remains between large-scale pretraining and efficient finetuning on humanoids. In this paper, we find that off-policy Soft Actor-Critic (SAC), with large-batch updates and a high Update-To-Data (UTD) ratio, reliably supports large-scale pretraining of humanoid locomotion policies, achieving zero-shot deployment on real robots. For adaptation, we demonstrate that these SAC-pretrained policies can be finetuned in new environments and on out-of-distribution tasks using model-based methods. Data collection in the new environment uses a deterministic policy, while stochastic exploration is confined to a physics-informed world model. This separation mitigates the risks of random exploration during adaptation while preserving exploratory coverage for improvement. Overall, the approach couples the wall-clock efficiency of large-scale simulation during pretraining with the sample efficiency of model-based learning during finetuning.
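The pretraining recipe the abstract describes, large-batch SAC updates at a high Update-To-Data ratio, can be sketched as a training loop that performs several gradient updates for every environment transition collected. This is a minimal illustrative sketch, not the paper's code: `env_step` and `sac_update` are placeholder names standing in for parallel simulation and the SAC gradient step.

```python
import random
from collections import deque

def env_step():
    """Placeholder: one transition (s, a, r, s') from parallel simulation."""
    return tuple(random.random() for _ in range(4))

def sac_update(batch):
    """Placeholder for one large-batch SAC gradient step (actor + critics)."""
    return len(batch)  # a real implementation would return losses

def pretrain(num_env_steps=100, utd_ratio=8, batch_size=32, warmup=32):
    """High-UTD loop: `utd_ratio` replay-buffer updates per collected transition."""
    buffer = deque(maxlen=10_000)
    updates = 0
    for _ in range(num_env_steps):
        buffer.append(env_step())          # 1 unit of new data
        if len(buffer) < warmup:
            continue                       # wait until sampling is possible
        for _ in range(utd_ratio):         # many updates per unit of data
            batch = random.sample(buffer, batch_size)
            sac_update(batch)
            updates += 1
    return updates
```

With a UTD ratio of 8, gradient updates outnumber collected transitions 8:1 after warmup, which is the lever the abstract credits for sample-efficient pretraining.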

Key Takeaways

  1. Introduces the LIFT framework for efficient humanoid robot training.

  2. Combines SAC pretraining with a physics-informed world model.

  3. Enhances policy robustness and sample efficiency during finetuning.
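The adaptation scheme behind these takeaways, deterministic data collection in the real environment with stochastic exploration confined to the world model, can be illustrated with a toy sketch. Every function below is an assumption for illustration (scalar dynamics, a linear policy), not the authors' implementation.

```python
import random

def mean_action(state):
    """Deterministic policy head: mean action, no sampling (safe to deploy)."""
    return -0.1 * state

def sampled_action(state, noise_std=0.3):
    """Stochastic policy head: sampled action, used only in imagination."""
    return -0.1 * state + random.gauss(0.0, noise_std)

def real_dynamics(state, action):
    """Stand-in for the real environment's transition function."""
    return state + action

def world_model(state, action):
    """Stand-in for the learned, physics-informed world model."""
    return state + action  # assumed to approximate real_dynamics

def collect_real(state, horizon):
    """Data collection in the new environment: deterministic actions only."""
    traj = []
    for _ in range(horizon):
        action = mean_action(state)
        state = real_dynamics(state, action)
        traj.append((state, action))
    return traj

def imagine(state, horizon):
    """Imagined rollout inside the world model: exploratory, risk-free."""
    traj = []
    for _ in range(horizon):
        action = sampled_action(state)
        state = world_model(state, action)
        traj.append((state, action))
    return traj
```

Because `collect_real` never samples, repeated rollouts from the same state are identical, so no exploration risk is incurred on hardware; `imagine` keeps the exploratory randomness inside the model, where unsafe actions have no physical cost.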

Limitations

  • Requires large-scale parallel simulations for pretraining.

  • Physics-informed world model may need domain-specific tuning.

Keywords

Proximal Policy Optimization, Soft Actor-Critic, on-policy methods, off-policy RL, model-based RL, large-scale parallel simulation, zero-shot deployment, sample efficiency, large-batch update, Update-To-Data ratio, deterministic policy, stochastic exploration, physics-informed world model
