Beyond Unimodal Shortcuts: MLLMs as Cross-Modal Reasoners for Grounded Named Entity Recognition

Jinlong Ma, Yu Zhang, Xuefeng Bai, Kehai Chen, Yuwei Wang, Zeming Liu, Jun Yu, Min Zhang
Published: February 4, 2026
Authors: 8
Word Count: 10,683

Enhancing MLLMs for accurate cross-modal entity recognition.

Abstract

Grounded Multimodal Named Entity Recognition (GMNER) aims to extract text-based entities, assign them semantic categories, and ground them to corresponding visual regions. In this work, we explore the potential of Multimodal Large Language Models (MLLMs) to perform GMNER in an end-to-end manner, moving beyond their typical role as auxiliary tools within cascaded pipelines. Crucially, our investigation reveals a fundamental challenge: MLLMs exhibit modality bias, including visual bias and textual bias, which stems from their tendency to take unimodal shortcuts rather than perform rigorous cross-modal verification. To address this, we propose Modality-aware Consistency Reasoning (MCR), which enforces structured cross-modal reasoning through Multi-style Reasoning Schema Injection (MRSI) and Constraint-guided Verifiable Optimization (CVO). MRSI transforms abstract constraints into executable reasoning chains, while CVO empowers the model to dynamically align its reasoning trajectories using Group Relative Policy Optimization (GRPO). Experiments on GMNER and visual grounding tasks demonstrate that MCR effectively mitigates modality bias and achieves superior performance compared to existing baselines.
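
To make the task and the optimization signal concrete, here is a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the reward used in CVO, so `GroundedEntity`, `toy_reward`, and the 0.5 IoU threshold are hypothetical choices for illustration only. The one part taken from GRPO as commonly described is the group-relative advantage, where each sampled response's reward is normalized by the mean and standard deviation of its own group; the use of the population standard deviation and the `eps` term are also assumptions.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), hypothetical format


@dataclass(frozen=True)
class GroundedEntity:
    """One GMNER output: entity span, semantic category, visual region."""
    text: str
    category: str
    box: Optional[Box]  # None when the entity has no visible region


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def toy_reward(pred: GroundedEntity, gold: GroundedEntity,
               iou_threshold: float = 0.5) -> float:
    """Hypothetical rule-based reward: one point per satisfied constraint.

    Assumption for illustration only; the paper's actual reward is not
    described in the abstract.
    """
    if pred.text != gold.text:
        return 0.0  # wrong span: nothing else can be verified
    score = 1.0  # span matches
    if pred.category == gold.category:
        score += 1.0
    if gold.box is None:
        grounded = pred.box is None
    else:
        grounded = pred.box is not None and iou(pred.box, gold.box) >= iou_threshold
    if grounded:
        score += 1.0
    return score


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward against its own sampling group (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

In this sketch, responses that satisfy more constraints than their group's average receive positive advantages and the rest receive negative ones, so no separate value model is needed to rank the samples. A group in which every response earns the same reward yields all-zero advantages and therefore no learning signal.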

Key Takeaways

  • Proposes an end-to-end solution for grounded multimodal NER.

  • Introduces the Modality-aware Consistency Reasoning (MCR) framework.

  • Reduces modality bias and unimodal shortcuts in MLLMs.

Limitations

  • Requires large-scale data for effective training.

  • Computationally intensive due to end-to-end nature.

Keywords

Multimodal Large Language Models, GMNER, modality bias, cross-modal reasoning, Multi-style Reasoning Schema Injection, Constraint-guided Verifiable Optimization, Group Relative Policy Optimization
