Multimodal AI

Zooming without Zooming: Region-to-Image Distillation for Fine-Grained Multimodal Perception

Lai Wei, Liangbo He, Jun Lan, Lingzhong Dong, Yutong Cai, Siyuan Li, Huijia Zhu, Weiqiang Wang, Linghe Kong, Yue Wang, Zhuosheng Zhang, Weiran Huang
Published: February 12, 2026
Authors: 12
Word Count: 16,627
Code: Includes code

Train-time zooming distillation enables fine-grained multimodal perception without inference latency.

Abstract

Multimodal Large Language Models (MLLMs) excel at broad visual understanding but still struggle with fine-grained perception, where decisive evidence is small and easily overwhelmed by global context. Recent "Thinking-with-Images" methods alleviate this by iteratively zooming in on regions of interest during inference, but incur high latency due to repeated tool calls and visual re-encoding. To address this, we propose Region-to-Image Distillation, which transforms zooming from an inference-time tool into a training-time primitive, thereby internalizing the benefits of agentic zooming into a single forward pass of an MLLM. In particular, we first zoom in to micro-cropped regions to let strong teacher models generate high-quality VQA data, and then distill this region-grounded supervision back to the full image. After training on such data, the smaller student model improves "single-glance" fine-grained perception without tool use. To rigorously evaluate this capability, we further present ZoomBench, a hybrid-annotated benchmark of 845 VQA samples spanning six fine-grained perceptual dimensions, together with a dual-view protocol that quantifies the global–regional "zooming gap". Experiments show that our models achieve leading performance across multiple fine-grained perception benchmarks, and also improve general multimodal cognition on benchmarks such as visual reasoning and GUI agents. We further discuss when "Thinking-with-Images" is necessary versus when its gains can be distilled into a single forward pass. Our code is available at https://github.com/inclusionAI/Zooming-without-Zooming.
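The dual-view "zooming gap" described in the abstract can be read as the difference between a model's accuracy when shown the ground-truth crop (regional view) and when shown the full image (global view). A minimal sketch of that metric, with all function and variable names hypothetical rather than taken from the released code:

```python
def zooming_gap(global_correct, regional_correct):
    """Accuracy gap between the regional view (crops) and the
    global view (full images) over the same question set.
    Each argument is a list of 0/1 correctness indicators."""
    assert len(global_correct) == len(regional_correct)
    n = len(global_correct)
    acc_global = sum(global_correct) / n
    acc_regional = sum(regional_correct) / n
    return acc_regional - acc_global

# Example: 6/10 correct on full images, 9/10 on crops -> gap ~ 0.3
gap = zooming_gap([1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
                  [1, 1, 1, 1, 1, 1, 1, 1, 1, 0])
```

A large positive gap indicates that the evidence is recoverable from the region but lost at full-image resolution, which is exactly the regime the distillation targets.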

Key Takeaways

  1. Region-to-Image Distillation moves zooming from inference to training, enabling single-pass fine-grained perception without latency overhead.

  2. Teacher models generate high-quality VQA data on micro-crops, where fine details are unambiguous, and this supervision is then distilled back to full images.

  3. The approach retains the accuracy benefits of iterative zooming while maintaining real-time inference speed, since all zooming happens during data synthesis.
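The synthesis pipeline in the takeaways above can be sketched as: crop a micro-region, have teacher models produce a QA pair on the crop, keep only samples where the teachers agree (consensus filtering, noted in the limitations), and pair the surviving question with the full image for training. A hedged sketch assuming hypothetical `crop_fn` and `teacher_qa` callables; none of these names come from the paper's released code:

```python
# Illustrative sketch of Region-to-Image data synthesis.
from dataclasses import dataclass

@dataclass
class VQASample:
    image: str      # path to the FULL image used for student training
    question: str
    answer: str

def synthesize(full_image, region_box, crop_fn, teacher_qa, teachers):
    """1) Zoom: crop the micro-region so fine details are unambiguous.
       2) Generate: ask a teacher model for a QA pair on the crop.
       3) Filter: keep the sample only if all teachers agree on the answer.
       4) Distill back: attach the question to the FULL image."""
    crop = crop_fn(full_image, region_box)
    question, answer = teacher_qa(teachers[0], crop)
    for t in teachers[1:]:
        if teacher_qa(t, crop)[1] != answer:
            return None  # no consensus: discard the ambiguous sample
    return VQASample(image=full_image, question=question, answer=answer)
```

The key design point is step 4: the student never sees the crop, so it must learn to resolve the same fine-grained question from the full image in a single forward pass.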

Limitations

  • Referential ambiguity arises when region-specific questions are applied to full images, requiring careful distillation strategies.

  • The approach depends on high-quality teacher models and consensus filtering, which may limit scalability and data diversity.

Keywords

Multimodal Large Language Models, visual question answering, fine-grained perception, Thinking-with-Images, region-to-image distillation, micro-cropped regions, teacher-student distillation, ZoomBench, visual reasoning, GUI agents
