Latest Multimodal AI Research Papers

Research on AI systems that process multiple types of data, covering topics such as vision-language models and cross-modal understanding.

202 Papers
Showing 20 of 202 papers

Stepping VLMs onto the Court: Benchmarking Spatial Intelligence in Sports

Yuchen Yang, Yuqing Shao, Duxiu Huang +11 more

Sports have long attracted broad attention as they push the limits of human physical and cognitive capabilities. Amid growing interest in spatial intelligence for vision-language models (VLMs), sports provide a natural testbed for understanding high-intensity human motion and dynamic object interact...

spatial intelligence, vision-language models, CourtSI, CourtSI-Bench, CourtSI-Ext, +3 more
Mar 10, 2026 · 21

MM-Zero: Self-Evolving Multi-Model Vision Language Models From Zero Data

Zongxia Li, Hongyang Du, Chengsong Huang +8 more

Self-evolving has emerged as a key paradigm for improving foundational models such as Large Language Models (LLMs) and Vision Language Models (VLMs) with minimal human intervention. While recent approaches have demonstrated that LLM agents can self-evolve from scratch with little to no data, VLMs in...

self-evolving, Large Language Models, Vision Language Models, reinforcement learning, multimodal reasoning, +5 more
Mar 10, 2026 · 36

Reading, Not Thinking: Understanding and Bridging the Modality Gap When Text Becomes Pixels in Multimodal LLMs

Kaiser Sun, Xiaochuang Yuan, Hongjun Liu +4 more

Multimodal large language models (MLLMs) can process text presented as images, yet they often perform worse than when the same content is provided as textual tokens. We systematically diagnose this "modality gap" by evaluating seven MLLMs across seven benchmarks in five input modes, spanning both sy...

multimodal large language models, modality gap, visual text understanding, self-distillation, reasoning traces, +1 more
Mar 10, 2026 · 21
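
A note for readers: the modality gap this entry describes is easy to probe yourself. Render a passage as an image, then ask the same question with the content supplied as text tokens versus as pixels. Below is a minimal sketch of that setup; the PIL rendering is real, but the `mllm.answer(...)` calls are a hypothetical interface, not the paper's API.

```python
from PIL import Image, ImageDraw, ImageFont

def render_text_as_image(text: str, width: int = 768) -> Image.Image:
    """Rasterize a passage so the model must read it through its vision tower."""
    font = ImageFont.load_default()
    lines = [text[i:i + 80] for i in range(0, len(text), 80)]  # crude line wrap
    img = Image.new("RGB", (width, 20 * len(lines) + 20), "white")
    draw = ImageDraw.Draw(img)
    for row, line in enumerate(lines):
        draw.text((10, 10 + 20 * row), line, fill="black", font=font)
    return img

passage = "The Eiffel Tower was completed in 1889."
question = "When was the tower completed?"
image = render_text_as_image(passage)

# Hypothetical interface -- substitute any MLLM wrapper:
# answer_from_text   = mllm.answer(question, context_text=passage)  # text tokens
# answer_from_pixels = mllm.answer(question, context_image=image)   # same content as pixels
# The per-benchmark score difference between the two modes is the modality gap.
```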

InternVL-U: Democratizing Unified Multimodal Models for Understanding, Reasoning, Generation and Editing

Changyao Tian, Danni Yang, Guanzhou Chen +26 more

Unified multimodal models (UMMs) that integrate understanding, reasoning, generation, and editing face inherent trade-offs between maintaining strong semantic comprehension and acquiring powerful generation capabilities. In this report, we present InternVL-U, a lightweight 4B-parameter UMM that demo...

Unified multimodal models, Multimodal Large Language Model, MMDiT-based visual generation head, Chain-of-Thought, visual representations, +5 more
Mar 10, 2026 · 25

Holi-Spatial: Evolving Video Streams into Holistic 3D Spatial Intelligence

Yuanyuan Gao, Hao Li, Yifei Liu +14 more

The pursuit of spatial intelligence fundamentally relies on access to large-scale, fine-grained 3D data. However, existing approaches predominantly construct spatial understanding benchmarks by generating question-answer (QA) pairs from a limited number of manually annotated datasets, rather than sy...

3D Gaussian Splatting, 3DGS, Vision-Language Models, VLMs, spatial reasoning, +8 more
Mar 8, 2026 · 67

Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion

Lijiang Li, Zuwei Long, Yunhang Shen +6 more

While recent multimodal large language models (MLLMs) have made impressive strides, they predominantly employ a conventional autoregressive architecture as their backbone, leaving significant room to explore effective and efficient alternatives in architectural design. Concurrently, recent studies h...

multimodal large language models, autoregressive architecture, discrete diffusion models, mask-based discrete diffusion models, multimodal systems, +3 more
Mar 6, 2026 · 37
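
For context on the mask-based discrete diffusion objective this abstract positions against autoregression: the standard (absorbing-state) formulation masks a random fraction of tokens and trains the model to recover only the masked positions. A minimal sketch, with `model` and `MASK_ID` as placeholders rather than Omni-Diffusion's actual components:

```python
import torch
import torch.nn.functional as F

MASK_ID = 0  # reserved [MASK] token id (assumed)

def masked_diffusion_loss(model, tokens: torch.Tensor) -> torch.Tensor:
    """tokens: (batch, seq_len) integer ids."""
    b, n = tokens.shape
    # Sample a corruption level t ~ U(0, 1) per sequence, mask that fraction.
    t = torch.rand(b, 1, device=tokens.device)
    mask = torch.rand(b, n, device=tokens.device) < t      # True where hidden
    corrupted = torch.where(mask, torch.full_like(tokens, MASK_ID), tokens)
    logits = model(corrupted)                              # (b, n, vocab)
    # Loss only on masked positions, as in absorbing-state masked diffusion.
    return F.cross_entropy(logits[mask], tokens[mask], reduction="mean")
```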

Penguin-VL: Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders

Boqiang Zhang, Lei Ke, Ruihan Yang +5 more

Vision Language Model (VLM) development has largely relied on scaling model size, which hinders deployment on compute-constrained mobile and edge devices such as smartphones and robots. In this work, we explore the performance limits of compact (e.g., 2B and 8B) VLMs. We challenge the prevailing pra...

Vision Language Model, contrastive learning, vision encoder, text-only LLM, multimodal understanding, +5 more
Mar 6, 2026 · 76

Phi-4-reasoning-vision-15B Technical Report

Jyoti Aneja, Michael Harrison, Neel Joshi +3 more

We present Phi-4-reasoning-vision-15B, a compact open-weight multimodal reasoning model, and share the motivations, design choices, experiments, and learnings that informed its development. Our goal is to contribute practical insight to the research community on building smaller, efficient multimoda...

multimodal reasoning model, open-weight model, vision-language tasks, scientific reasoning, mathematical reasoning, +8 more
Mar 4, 2026 · 19

RoboMME: Benchmarking and Understanding Memory for Robotic Generalist Policies

Yinpei Dai, Hongze Fu, Jayjun Lee +6 more

Memory is critical for long-horizon and history-dependent robotic manipulation. Such tasks often involve counting repeated actions or manipulating objects that become temporarily occluded. Recent vision-language-action (VLA) models have begun to incorporate memory mechanisms; however, their evaluati...

vision-language-action models, memory mechanisms, long-horizon tasks, history-dependent scenarios, standardized benchmark, +6 more
Mar 4, 2026 · 16

UniG2U-Bench: Do Unified Models Advance Multimodal Understanding?

Zimo Wen, Boxiu Li, Wanbo Zhang +11 more

Unified multimodal models have recently demonstrated strong generative capabilities, yet whether and when generation improves understanding remains unclear. Existing benchmarks lack a systematic exploration of the specific tasks where generation facilitates understanding. To this end, we introduce U...

Unified multimodal models, Vision-Language Models, generation-to-understanding, G2U evaluation, spatial intelligence, +5 more
Mar 3, 2026 · 74

Proact-VL: A Proactive VideoLLM for Real-Time AI Companions

Weicai Yan, Yuhong Dai, Qi Ran +6 more

Proactive and real-time interactive experiences are essential for human-like AI companions, yet face three key challenges: (1) achieving low-latency inference under continuous streaming inputs, (2) autonomously deciding when to respond, and (3) controlling both quality and quantity of generated cont...

multimodal language models, real-time interactive agents, environment perception, video understanding, response latency, +1 more
Mar 3, 2026 · 25
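
Challenge (2), autonomously deciding when to respond, can be pictured as a streaming loop gated by a lightweight respond-or-stay-silent score. The sketch below is a generic illustration of that pattern, not Proact-VL's architecture; `encode_frame`, `respond_prob`, `generate`, and the threshold are all assumed names.

```python
import collections

def streaming_loop(encode_frame, respond_prob, generate, frames,
                   max_ctx=64, threshold=0.5):
    """Generic proactive-response loop: bounded frame memory plus a
    lightweight scoring head that decides, per step, whether to speak."""
    context = collections.deque(maxlen=max_ctx)      # streaming visual memory
    for frame in frames:
        context.append(encode_frame(frame))          # low-latency per-frame step
        if respond_prob(list(context)) > threshold:  # respond-or-stay-silent gate
            yield generate(list(context))            # emit only when warranted

# Toy usage with stub components:
responses = streaming_loop(
    encode_frame=lambda f: f,
    respond_prob=lambda ctx: 1.0 if len(ctx) % 10 == 0 else 0.0,
    generate=lambda ctx: f"Noticed something at frame {len(ctx)}.",
    frames=range(30),
)
print(list(responses))
```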

Beyond Language Modeling: An Exploration of Multimodal Pretraining

Shengbang Tong, David Fan, John Nguyen +18 more

The visual world offers a critical axis for advancing foundation models beyond language. Despite growing interest in this direction, the design space for native multimodal models remains opaque. We provide empirical clarity through controlled, from-scratch pretraining experiments, isolating the fact...

multimodal models, Transfusion framework, next-token prediction, diffusion, representation Autoencoder, +8 more
Mar 3, 2026 · 52
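
As background on the Transfusion framework named in the tags: the idea is a single transformer trained with a next-token loss on text and a denoising (diffusion) loss on continuous image latents. A simplified sketch of that mixed objective, where the two forward routes, the noise schedule, and the weighting are assumptions rather than the paper's recipe:

```python
import torch
import torch.nn.functional as F

def mixed_pretraining_loss(model, text_ids, image_latents, lambda_img=1.0):
    """model is a placeholder with two routes into a shared transformer:
    text_forward (returns logits) and image_forward (returns noise preds).
    text_ids: (b, n) ints; image_latents: (b, c, d) continuous."""
    # Next-token prediction on the text stream.
    logits = model.text_forward(text_ids[:, :-1])            # (b, n-1, vocab)
    lm_loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), text_ids[:, 1:].reshape(-1)
    )
    # Denoising objective on the image stream: predict the injected noise.
    noise = torch.randn_like(image_latents)
    t = torch.rand(image_latents.size(0), device=image_latents.device)
    noisy = image_latents + t.view(-1, 1, 1) * noise         # toy linear schedule
    diff_loss = F.mse_loss(model.image_forward(noisy, t), noise)
    return lm_loss + lambda_img * diff_loss
```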

Utonia: Toward One Encoder for All Point Clouds

Yujia Zhang, Xiaoyang Wu, Yunhan Yang +6 more

We dream of a future where point clouds from all domains can come together to shape a single model that benefits them all. Toward this goal, we present Utonia, a first step toward training a single self-supervised point transformer encoder across diverse domains, spanning remote sensing, outdoor LiD...

point transformer encoder, self-supervised learning, representation space, cross-domain transfer, embodied reasoning, +4 more
Mar 3, 2026 · 133

WorldStereo: Bridging Camera-Guided Video Generation and Scene Reconstruction via 3D Geometric Memories

Yisu Zhang, Chenjie Cao, Tengfei Wang +4 more

Recent advances in foundational Video Diffusion Models (VDMs) have yielded significant progress. Yet, despite the remarkable visual quality of generated videos, reconstructing consistent 3D scenes from these outputs remains challenging, due to limited camera controllability and inconsistent generate...

Video Diffusion Models, camera-guided video generation, 3D reconstruction, geometric memory modules, global-geometric memory, +7 more
Mar 2, 2026 · 13

MMR-Life: Piecing Together Real-life Scenes for Multimodal Multi-image Reasoning

Jiachun Li, Shaoping Huang, Zhuoran Jin +5 more

Recent progress in the reasoning capabilities of multimodal large language models (MLLMs) has empowered them to address more complex tasks such as scientific analysis and mathematical reasoning. Despite their promise, MLLMs' reasoning abilities across different scenarios in real life remain largely ...

multimodal large language models, multimodal multi-image reasoning, comprehensive benchmark, reasoning capabilities, real-life scenarios, +9 more
Mar 2, 2026 · 39

PhotoBench: Beyond Visual Matching Towards Personalized Intent-Driven Photo Retrieval

Tianyi Xu, Rong Shan, Junjie Wu +11 more

Personal photo albums are not merely collections of static images but living, ecological archives defined by temporal continuity, social entanglement, and rich metadata, which makes personalized photo retrieval non-trivial. However, existing retrieval benchmarks rely heavily on context-isolated ...

personalized photo retrieval, multi-source reasoning, intent-driven queries, visual semantics, spatial-temporal metadata, +5 more
Mar 2, 2026 · 20
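
The retrieval problem this abstract poses goes beyond visual matching by folding spatial-temporal metadata into the score. A toy scorer in that spirit, with made-up field names and weights:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Photo:
    embedding: list[float]        # visual-semantic embedding (illustrative)
    taken_at: datetime
    location: str

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb + 1e-12)

def intent_score(photo, query_emb, want_location=None, after=None, w_meta=0.3):
    """Blend visual similarity with spatial-temporal metadata matches."""
    visual = cosine(photo.embedding, query_emb)
    meta = 0.0
    if want_location and want_location.lower() in photo.location.lower():
        meta += 0.5                               # spatial match
    if after and photo.taken_at >= after:
        meta += 0.5                               # temporal match
    return (1 - w_meta) * visual + w_meta * meta

# "Beach photos from last summer in Lisbon" becomes an embedding plus filters:
photo = Photo([0.1, 0.9], datetime(2025, 7, 4), "Lisbon, Portugal")
print(intent_score(photo, [0.2, 0.8], want_location="lisbon",
                   after=datetime(2025, 6, 1)))
```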

LLaDA-o: An Effective and Length-Adaptive Omni Diffusion Model

Zebin You, Xiaolu Zhang, Jun Zhou +2 more

We present LLaDA-o, an effective and length-adaptive omni diffusion model for multimodal understanding and generation. LLaDA-o is built on a Mixture of Diffusion (MoD) framework that decouples discrete masked diffusion for text understanding and continuous diffusion for visual generation, while coup...

Mixture of Diffusion, omni diffusion model, discrete masked diffusion, continuous diffusion, attention backbone, +4 more
Mar 1, 2026 · 14

AgilePruner: An Empirical Study of Attention and Diversity for Adaptive Visual Token Pruning in Large Vision-Language Models

Changwoo Baek, Jouwon Song, Sohyeon Kim +1 more

Large Vision-Language Models (LVLMs) have adopted visual token pruning strategies to mitigate substantial computational overhead incurred by extensive visual token sequences. While prior works primarily focus on either attention-based or diversity-based pruning methods, in-depth analysis of these ap...

visual token pruning, large vision-language models, effective rank, attention score entropy, feature diversity, +3 more
Mar 1, 2026 · 11
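
The two signals this abstract contrasts can be stated concretely: attention-based pruning keeps visual tokens that receive attention from the text, while diversity can be summarized by the effective rank of the visual features. A rough sketch of both (the scoring details here are illustrative, not the paper's):

```python
import torch

def attention_importance(attn: torch.Tensor) -> torch.Tensor:
    """attn: (heads, num_text, num_visual) cross-attention weights.
    Importance of each visual token = mean attention it receives."""
    return attn.mean(dim=(0, 1))                      # (num_visual,)

def effective_rank(features: torch.Tensor) -> float:
    """features: (num_visual, dim). Exponentiated Shannon entropy of the
    normalized singular-value spectrum: a soft count of feature directions."""
    s = torch.linalg.svdvals(features)
    p = s / s.sum()
    entropy = -(p * torch.log(p + 1e-12)).sum()
    return float(torch.exp(entropy))

def prune(features: torch.Tensor, attn: torch.Tensor, keep_ratio: float = 0.5):
    """Keep the top-k visual tokens by attention importance; effective_rank
    can then gauge how much feature diversity the pruned set retains."""
    k = max(1, int(keep_ratio * features.shape[0]))
    idx = attention_importance(attn).topk(k).indices
    return features[idx]
```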

CMI-RewardBench: Evaluating Music Reward Models with Compositional Multimodal Instruction

Yinghao Ma, Haiwen Xia, Hewei Gao +9 more

While music generation models have evolved to handle complex multimodal inputs mixing text, lyrics, and reference audio, evaluation mechanisms have lagged behind. In this paper, we bridge this critical gap by establishing a comprehensive ecosystem for music reward modeling under Compositional Multim...

music reward modeling, Compositional Multimodal Instruction, preference dataset, pseudo-labeled samples, human-annotated corpus, +4 more
Feb 28, 2026 · 27

MediX-R1: Open Ended Medical Reinforcement Learning

Sahal Shaji Mullappilly, Mohammed Irfan Kurpath, Omair Mohamed +5 more

We introduce MediX-R1, an open-ended Reinforcement Learning (RL) framework for medical multimodal large language models (MLLMs) that enables clinically grounded, free-form answers beyond multiple-choice formats. MediX-R1 fine-tunes a baseline vision-language backbone with Group Based RL and a compos...

Reinforcement Learning, vision-language backbone, Group Based RL, LLM-based accuracy reward, medical embedding-based semantic reward, +5 more
Feb 26, 2026 · 20
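
Group-based RL in this setting usually means sampling several answers per prompt, scoring each with the composite reward, and normalizing rewards within the group to obtain advantages (GRPO-style), with no learned value function. The sketch below shows that shape; the reward mix and weights are stand-ins for the LLM-based accuracy and embedding-based semantic rewards the abstract names.

```python
import torch

def composite_reward(answer: str, reference: str) -> float:
    """Placeholder blend of the signals the abstract names: an LLM-judged
    accuracy score and an embedding-based semantic score (stubbed here)."""
    accuracy = float(reference.lower() in answer.lower())   # crude proxy
    semantic = 0.0                                          # e.g. cosine similarity
    return 0.7 * accuracy + 0.3 * semantic

def group_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """rewards: (group_size,) rewards for one prompt's sampled answers."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

answers = [
    "Pneumonia in the right lower lobe.",
    "No acute findings.",
    "Findings consistent with pneumonia.",
]
rewards = torch.tensor([composite_reward(a, "pneumonia") for a in answers])
advantages = group_advantages(rewards)  # feeds a PPO-style policy update
print(advantages)
```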
Page 1 of 11