MolmoSpaces: A Large-Scale Open Ecosystem for Robot Navigation and Manipulation

YYejin KimWWilbert PumacayOOmar RayyanMMax ArgusWWinson HanEEli VanderBiltJJordi SalvadorAAbhay DeshpandeRRose HendrixSSnehal JauhriSShuo LiuNNur Muhammad Mahi ShafiullahMMaya GuruAAinaz EftekharKKaren FarleyDDonovan ClayJJiafei DuanAArjun GuruPPiper WoltersAAlvaro HerrastiYYing-Chun LeeGGeorgia ChalvatzakiYYuchen CuiAAli FarhadiDDieter FoxRRanjay Krishna

Published: February 11, 2026
Authors: 26

View on arXiv Download PDF

Abstract

Deploying robots at scale demands robustness to the long tail of everyday situations. The countless variations in scene layout, object geometry, and task specifications that characterize real environments are vast and underrepresented in existing robot benchmarks. Measuring this level of generalization requires infrastructure at a scale and diversity that physical evaluation alone cannot provide. We introduce MolmoSpaces, a fully open ecosystem to support large-scale benchmarking of robot policies. MolmoSpaces consists of over 230k diverse indoor environments, ranging from handcrafted household scenes to procedurally generated multiroom houses, populated with 130k richly annotated object assets, including 48k manipulable objects with 42M stable grasps. Crucially, these environments are simulator-agnostic, supporting popular options such as MuJoCo, Isaac, and ManiSkill. The ecosystem supports the full spectrum of embodied tasks: static and mobile manipulation, navigation, and multiroom long-horizon tasks requiring coordinated perception, planning, and interaction across entire indoor environments. We also design MolmoSpaces-Bench, a benchmark suite of 8 tasks in which robots interact with our diverse scenes and richly annotated objects. Our experiments show MolmoSpaces-Bench exhibits strong sim-to-real correlation (R = 0.96, ho = 0.98), confirm newer and stronger zero-shot policies outperform earlier versions in our benchmarks, and identify key sensitivities to prompt phrasing, initial joint positions, and camera occlusion. Through MolmoSpaces and its open-source assets and tooling, we provide a foundation for scalable data generation, policy training, and benchmark creation for robot learning research.

Keywords

robot policiesembodied taskssim-to-real correlationzero-shot policiesprompt phrasinginitial joint positionscamera occlusion

More in Robotics & Embodied AI

View all

RLinf-USER: A Unified and Extensible System for Real-World Online Policy Learning in Embodied AI

Hongzhi Zang, Shu'ang Yu +15

Online policy learning directly in the physical world is a promising yet challenging direction for embodied intelligence. Unlike simulation, real-world systems cannot be arbitrarily accelerated, cheap...

Feb 846

RynnBrain: Open Embodied Foundation Models

Ronghao Dang, Jiayan Guo +24

Despite rapid progress in multimodal foundation models, embodied intelligence community still lacks a unified, physically grounded foundation model that integrates perception, reasoning, and planning ...

Feb 1336

RoboPocket: Improve Robot Policies Instantly with Your Phone

Junjie Fang, Wendi Chen +8

Scaling imitation learning is fundamentally constrained by the efficiency of data collection. While handheld interfaces have emerged as a scalable solution for in-the-wild data acquisition, they predo...

Mar 530

SoMA: A Real-to-Sim Neural Simulator for Robotic Soft-body Manipulation

Mu Huang, Hui Wang +6

Simulating deformable objects under rich interactions remains a fundamental challenge for real-to-sim robot manipulation, with dynamics jointly driven by environmental effects and robot actions. Exist...

Feb 228

Learning Humanoid End-Effector Control for Open-Vocabulary Visual Loco-Manipulation

Runpei Dong, Ziyan Li +2

Visual loco-manipulation of arbitrary objects in the wild with humanoid robots requires accurate end-effector (EE) control and a generalizable understanding of the scene via visual inputs (e.g., RGB-D...

Feb 1826

More Robotics & Embodied AI papers