Multimodal AI

XR: Cross-Modal Agents for Composed Image Retrieval

Zhongyu Yang, Wei Pang, Yingfang Yuan
Published: January 20, 2026
Authors: 3
Word Count: 10,390
Code: Includes code

XR is a training-free multi-agent framework that treats composed image retrieval as a coordinated, coarse-to-fine reasoning process.

Abstract

Retrieval is being redefined by agentic AI, demanding multimodal reasoning beyond conventional similarity-based paradigms. Composed Image Retrieval (CIR) exemplifies this shift as each query combines a reference image with textual modifications, requiring compositional understanding across modalities. While embedding-based CIR methods have achieved progress, they remain narrow in perspective, capturing limited cross-modal cues and lacking semantic reasoning. To address these limitations, we introduce XR, a training-free multi-agent framework that reframes retrieval as a progressively coordinated reasoning process. It orchestrates three specialized types of agents: imagination agents synthesize target representations through cross-modal generation, similarity agents perform coarse filtering via hybrid matching, and question agents verify factual consistency through targeted reasoning for fine filtering. Through progressive multi-agent coordination, XR iteratively refines retrieval to meet both semantic and visual query constraints, achieving up to a 38% gain over strong training-free and training-based baselines on FashionIQ, CIRR, and CIRCO, while ablations show each agent is essential. Code is available: https://01yzzyu.github.io/xr.github.io/.
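The three-stage flow described in the abstract (imagine a target, filter coarsely by hybrid similarity, then verify with targeted questions) can be illustrated with a toy sketch. This is not the authors' implementation: XR's agents are large models, while everything here is a hypothetical stand-in. The `Candidate` type, the three-dimensional embeddings, the attribute sets, the `alpha` weight, and the function names `imagine_target`, `coarse_filter`, and `fine_filter` are all assumptions made for illustration only.

```python
from __future__ import annotations

from dataclasses import dataclass
from math import sqrt


@dataclass(frozen=True)
class Candidate:
    name: str
    embedding: tuple[float, ...]  # toy visual feature vector (hypothetical)
    attributes: frozenset[str]    # toy caption-like attribute set (hypothetical)


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def jaccard(a, b):
    """Overlap between two attribute sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def imagine_target(ref, add, remove, edit_vec):
    """Imagination stand-in: compose a target from the reference and the edit."""
    emb = tuple(r + e for r, e in zip(ref.embedding, edit_vec))
    attrs = (ref.attributes - frozenset(remove)) | frozenset(add)
    return Candidate("imagined_target", emb, attrs)


def coarse_filter(candidates, target, k, alpha=0.5):
    """Similarity stand-in: hybrid visual + textual score, keep the top k."""
    scored = [
        (alpha * cosine(c.embedding, target.embedding)
         + (1 - alpha) * jaccard(c.attributes, target.attributes), c)
        for c in candidates
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def fine_filter(shortlist, must_have, must_not_have):
    """Question stand-in: each constraint is a yes/no check on a candidate."""
    def checks_passed(c):
        return (sum(a in c.attributes for a in must_have)
                + sum(a not in c.attributes for a in must_not_have))
    # Rank by checks passed; break ties with the coarse hybrid score.
    ranked = sorted(shortlist,
                    key=lambda pair: (checks_passed(pair[1]), pair[0]),
                    reverse=True)
    return [c for _, c in ranked]


def demo():
    # Embedding axes (assumed): redness, blueness, sleeve length.
    reference = Candidate("reference", (1.0, 0.0, 0.0),
                          frozenset({"dress", "red", "short_sleeve"}))
    gallery = [
        Candidate("blue_long_dress", (0.0, 1.0, 1.0),
                  frozenset({"dress", "blue", "long_sleeve"})),
        Candidate("red_long_dress", (1.0, 0.0, 1.0),
                  frozenset({"dress", "red", "long_sleeve"})),
        Candidate("blue_short_dress", (0.0, 1.0, 0.0),
                  frozenset({"dress", "blue", "short_sleeve"})),
        Candidate("green_shirt", (0.5, 0.5, 0.0),
                  frozenset({"shirt", "green", "short_sleeve"})),
    ]
    # Query: the reference image plus "make it blue with long sleeves".
    add, remove = {"blue", "long_sleeve"}, {"red", "short_sleeve"}
    target = imagine_target(reference, add, remove, edit_vec=(-1.0, 1.0, 1.0))
    shortlist = coarse_filter(gallery, target, k=3)
    return [c.name for c in fine_filter(shortlist, add, remove)]
```

Calling `demo()` ranks the blue long-sleeved dress first and drops the green shirt at the coarse stage. The point of the sketch is the division of labor: a cheap hybrid score narrows the gallery, and explicit per-constraint checks then reorder the survivors.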

Key Takeaways

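  1. XR applies a training-free multi-agent framework to Composed Image Retrieval (CIR).

  2. Progressive coarse-to-fine filtering improves fine-grained, edit-specific retrieval accuracy.

  3. XR outperforms strong training-free and training-based baselines on FashionIQ, CIRR, and CIRCO, with gains of up to 38%.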

Limitations

  • Currently tailored to image-text composition only.

  • Relies on large models, potentially introducing biases.

Keywords

compositional image retrieval, multi-agent framework, cross-modal generation, hybrid matching, targeted reasoning
