MetricAnything: Scaling Metric Depth Pretraining with Noisy Heterogeneous Sources

Baorui Ma, Jiahui Yang, Donglin Di, Xuancheng Zhang, Jianxun Cui, Hao Li, Yan Xie, Wei Chen

Published: January 29, 2026
Authors: 8
Word Count: 18,857

Scalable pretraining paradigm achieves robust metric depth estimation.

Abstract

Scaling has powered recent advances in vision foundation models, yet extending this paradigm to metric depth estimation remains challenging due to heterogeneous sensor noise, camera-dependent biases, and metric ambiguity in noisy cross-source 3D data. We introduce Metric Anything, a simple and scalable pretraining framework that learns metric depth from noisy, diverse 3D sources without manually engineered prompts, camera-specific modeling, or task-specific architectures. Central to our approach is the Sparse Metric Prompt, created by randomly masking depth maps, which serves as a universal interface that decouples spatial reasoning from sensor and camera biases. Using about 20M image-depth pairs spanning reconstructed, captured, and rendered 3D data across 10,000 camera models, we demonstrate, for the first time, a clear scaling trend in the metric depth track. The pretrained model excels at prompt-driven tasks such as depth completion, super-resolution, and Radar-camera fusion, while its distilled prompt-free student achieves state-of-the-art results on monocular depth estimation, camera intrinsics recovery, single/multi-view metric 3D reconstruction, and VLA planning. We also show that using the pretrained ViT of Metric Anything as a visual encoder significantly boosts Multimodal Large Language Model capabilities in spatial intelligence. These results show that metric depth estimation can benefit from the same scaling laws that drive modern foundation models, establishing a new path toward scalable and efficient real-world metric perception. We open-source MetricAnything at http://metric-anything.github.io/metric-anything-io/ to support community research.
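
The abstract describes the Sparse Metric Prompt only as the result of randomly masking depth maps. The following is a minimal sketch of what such masking could look like, not the authors' implementation. The function name `sparse_metric_prompt`, the `keep_ratio` parameter, the independent per-pixel sampling, and the zero fill for masked pixels are all assumptions made for illustration.

```python
import numpy as np


def sparse_metric_prompt(depth, keep_ratio=0.01, rng=None):
    """Randomly mask a dense metric depth map (hypothetical helper).

    Keeps roughly `keep_ratio` of the valid pixels. Returns (prompt, mask):
    `prompt` holds the metric depth at kept pixels and 0 elsewhere, and
    `mask` is a boolean array marking the kept pixels.
    """
    if not 0.0 <= keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in [0, 1]")
    rng = np.random.default_rng() if rng is None else rng

    # Assumption: non-finite or non-positive depths are sensor holes.
    valid = np.isfinite(depth) & (depth > 0)

    # Assumption: each pixel is kept independently with probability keep_ratio.
    keep = rng.random(depth.shape) < keep_ratio

    mask = valid & keep
    prompt = np.where(mask, depth, 0.0)  # masked pixels are filled with 0
    return prompt, mask


if __name__ == "__main__":
    dense = np.full((4, 6), 2.5)   # toy depth map, 2.5 m everywhere
    dense[0, 0] = np.nan           # one invalid pixel
    prompt, mask = sparse_metric_prompt(
        dense, keep_ratio=0.5, rng=np.random.default_rng(0)
    )
    print(int(mask.sum()), "of", dense.size, "pixels kept")
```

Returning the mask alongside the prompt lets a downstream model distinguish "no measurement" from a measured depth, which a zero fill alone would not make explicit.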

Key Takeaways

1. Simple pretraining unlocks robust metric depth perception.
2. Large-scale data and decoupling biases yield state-of-the-art results.
3. Versatile model excels across diverse downstream tasks.

Limitations

Requires large, diverse datasets and significant computational resources.
Performance may degrade under extreme conditions or unseen sensors.

Keywords

vision foundation models, metric depth estimation, sparse metric prompt, pretraining framework, depth completion, super-resolution, Radar-camera fusion, monocular depth estimation, camera intrinsics recovery, 3D reconstruction, VLA planning, visual encoder, Multimodal Large Language Model, spatial intelligence
