DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos

Shenyuan Gao, William Liang, Kaiyuan Zheng, Ayaan Malik, Seonghyeon Ye, Sihyun Yu, Wei-Cheng Tseng, Yuzhu Dong, Kaichun Mo, Chen-Hsuan Lin, Qianli Ma, Seungjun Nah, Loic Magne, Jiannan Xiang, Yuqi Xie, Ruijie Zheng, Dantong Niu, You Liang Tan, K. R. Zentner, George Kurian, Suneel Indupuru, Pooya Jannaty, Jinwei Gu, Jun Zhang, Jitendra Malik, Pieter Abbeel, Ming-Yu Liu, Yuke Zhu, Joel Jang, Linxi "Jim" Fan
Published: February 6, 2026
Authors: 30
Word Count: 13,837

Pretraining robots using large-scale human videos.

Abstract

Being able to simulate the outcomes of actions in varied environments would revolutionize the development of generalist agents at scale. However, modeling these world dynamics, especially for dexterous robotics tasks, poses significant challenges due to limited data coverage and scarce action labels. Toward this end, we introduce DreamDojo, a foundation world model that learns diverse interactions and dexterous controls from 44k hours of egocentric human videos. Our data mixture is the largest video dataset to date for world model pretraining, spanning a wide range of daily scenarios with diverse objects and skills. To address the scarcity of action labels, we introduce continuous latent actions as unified proxy actions, enhancing interaction knowledge transfer from unlabeled videos. After post-training on small-scale target robot data, DreamDojo demonstrates a strong understanding of physics and precise action controllability. We also devise a distillation pipeline that accelerates DreamDojo to a real-time speed of 10.81 FPS and further improves context consistency. Our work enables several important applications of generative world models, including live teleoperation, policy evaluation, and model-based planning. Systematic evaluation on multiple challenging out-of-distribution (OOD) benchmarks verifies the significance of our method for simulating open-world, contact-rich tasks, paving the way for general-purpose robot world models.
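The latent-action idea in the abstract can be sketched as two coupled functions: an inverse dynamics model that infers a continuous latent action from a pair of consecutive frames (no human action labels required), and a world model that predicts the next frame conditioned on the current frame and that latent action. The sketch below is a minimal, untrained numpy illustration of this interface; the dimensions, variable names, and linear maps are all hypothetical and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes -- the paper does not specify these.
OBS_DIM = 64         # flattened visual feature of one video frame
LATENT_ACT_DIM = 8   # continuous latent action between frames

# Inverse dynamics model: infer a latent action from a frame pair.
W_inv = rng.normal(scale=0.1, size=(2 * OBS_DIM, LATENT_ACT_DIM))

def infer_latent_action(obs_t, obs_next):
    """Map (o_t, o_{t+1}) -> z_t, a proxy action needing no labels."""
    return np.tanh(np.concatenate([obs_t, obs_next]) @ W_inv)

# Forward (world) model: predict the next frame feature from (o_t, z_t).
W_fwd = rng.normal(scale=0.1, size=(OBS_DIM + LATENT_ACT_DIM, OBS_DIM))

def predict_next(obs_t, z_t):
    """Roll the world model one step forward under latent action z_t."""
    return np.concatenate([obs_t, z_t]) @ W_fwd

obs_t = rng.normal(size=OBS_DIM)
obs_next = rng.normal(size=OBS_DIM)
z = infer_latent_action(obs_t, obs_next)   # shape (8,), bounded by tanh
pred = predict_next(obs_t, z)              # shape (64,)
```

Because the latent action is inferred purely from video, both models can be pretrained on unlabeled human footage; post-training on a small robot dataset then grounds the same interface in real action commands.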

Key Takeaways

  1. Utilizes large-scale human videos for robot pretraining.
  2. Introduces continuous latent actions for effective transfer.
  3. Achieves real-time prediction with high visual quality.

Limitations

  • Requires extensive human video data for pretraining.

  • Dependent on the quality and diversity of human videos.

Keywords

world model, egocentric videos, continuous latent actions, action labels, distillation pipeline, real-time speed, teleoperation, policy evaluation, model-based planning, out-of-distribution benchmarks
