Multimodal AI

Zooming without Zooming: Region-to-Image Distillation for Fine-Grained Multimodal Perception

Lai Wei, Liangbo He, Jun Lan, Lingzhong Dong, Yutong Cai, Siyuan Li, Huijia Zhu, Weiqiang Wang, Linghe Kong, Yue Wang, Zhuosheng Zhang, Weiran Huang
Published: February 12, 2026
Authors: 12
Word Count: 16,627
Code: Includes code

Train-time zooming distillation enables fine-grained multimodal perception without inference latency.

Abstract

Multimodal Large Language Models (MLLMs) excel at broad visual understanding but still struggle with fine-grained perception, where decisive evidence is small and easily overwhelmed by global context. Recent "Thinking-with-Images" methods alleviate this by iteratively zooming in on regions of interest during inference, but incur high latency due to repeated tool calls and visual re-encoding. To address this, we propose Region-to-Image Distillation, which transforms zooming from an inference-time tool into a training-time primitive, thereby internalizing the benefits of agentic zooming into a single forward pass of an MLLM. In particular, we first zoom in to micro-cropped regions to let strong teacher models generate high-quality VQA data, and then distill this region-grounded supervision back to the full image. After training on such data, the smaller student model improves "single-glance" fine-grained perception without tool use. To rigorously evaluate this capability, we further present ZoomBench, a hybrid-annotated benchmark of 845 VQA samples spanning six fine-grained perceptual dimensions, together with a dual-view protocol that quantifies the global–regional "zooming gap". Experiments show that our models achieve leading performance across multiple fine-grained perception benchmarks, and also improve general multimodal cognition on benchmarks such as visual reasoning and GUI agents. We further discuss when "Thinking-with-Images" is necessary versus when its gains can be distilled into a single forward pass. Our code is available at https://github.com/inclusionAI/Zooming-without-Zooming.
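The dual-view "zooming gap" described in the abstract can be read as the difference between a model's accuracy when shown the ground-truth crop (regional view) and when shown the full image (global view). A minimal sketch of that metric, with all function and variable names hypothetical rather than taken from the released code:

```python
def zooming_gap(global_correct, regional_correct):
    """Accuracy gap between the regional view (crops) and the
    global view (full images) over the same question set.
    Each argument is a list of 0/1 correctness indicators."""
    assert len(global_correct) == len(regional_correct)
    n = len(global_correct)
    acc_global = sum(global_correct) / n
    acc_regional = sum(regional_correct) / n
    return acc_regional - acc_global

# Example: 6/10 correct on full images, 9/10 on crops -> gap ~ 0.3
gap = zooming_gap([1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
                  [1, 1, 1, 1, 1, 1, 1, 1, 1, 0])
```

A large positive gap indicates that the evidence is recoverable from the region but lost at full-image resolution, which is exactly the regime the distillation targets.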

Key Takeaways

  1. Region-to-Image Distillation moves zooming from inference to training, enabling single-pass fine-grained perception without latency overhead.

  2. Teacher models generate high-quality VQA data on micro-crops, where fine details are unambiguous, and this supervision is then distilled back to full images.

  3. The approach retains the accuracy benefits of iterative zooming while maintaining real-time inference speed, since all zooming happens during data synthesis.
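The synthesis pipeline in the takeaways above can be sketched as: crop a micro-region, have teacher models produce a QA pair on the crop, keep only samples where the teachers agree (consensus filtering, noted in the limitations), and pair the surviving question with the full image for training. A hedged sketch assuming hypothetical `crop_fn` and `teacher_qa` callables; none of these names come from the paper's released code:

```python
# Illustrative sketch of Region-to-Image data synthesis.
from dataclasses import dataclass

@dataclass
class VQASample:
    image: str      # path to the FULL image used for student training
    question: str
    answer: str

def synthesize(full_image, region_box, crop_fn, teacher_qa, teachers):
    """1) Zoom: crop the micro-region so fine details are unambiguous.
       2) Generate: ask a teacher model for a QA pair on the crop.
       3) Filter: keep the sample only if all teachers agree on the answer.
       4) Distill back: attach the question to the FULL image."""
    crop = crop_fn(full_image, region_box)
    question, answer = teacher_qa(teachers[0], crop)
    for t in teachers[1:]:
        if teacher_qa(t, crop)[1] != answer:
            return None  # no consensus: discard the ambiguous sample
    return VQASample(image=full_image, question=question, answer=answer)
```

The key design point is step 4: the student never sees the crop, so it must learn to resolve the same fine-grained question from the full image in a single forward pass.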

Limitations

  • Referential ambiguity arises when region-specific questions are applied to full images, requiring careful distillation strategies.

  • The approach depends on high-quality teacher models and consensus filtering, which may limit scalability and data diversity.

Keywords

Multimodal Large Language Models, visual question answering, fine-grained perception, Thinking-with-Images, region-to-image distillation, micro-cropped regions, teacher-student distillation, ZoomBench, visual reasoning, GUI agents
