
OmniGAIA: Towards Native Omni-Modal AI Agents

Xiaoxi Li, Wenxiang Jiao, Jiarui Jin, Shijian Wang, Guanting Dong, Jiajie Jin, Hao Wang, Yinuo Wang, Ji-Rong Wen, Yuan Lu, Zhicheng Dou
Published: February 26, 2026
Authors: 11
Word Count: 12,685
Code: Includes code

OmniGAIA benchmarks true omni-modal AI agents that combine video, audio, and images with tool-integrated reasoning.

Abstract

Human intelligence naturally intertwines omni-modal perception -- spanning vision, audio, and language -- with complex reasoning and tool usage to interact with the world. However, current multi-modal LLMs are primarily confined to bi-modal interactions (e.g., vision-language), lacking the unified cognitive capabilities required for general AI assistants. To bridge this gap, we introduce OmniGAIA, a comprehensive benchmark designed to evaluate omni-modal agents on tasks necessitating deep reasoning and multi-turn tool execution across video, audio, and image modalities. Constructed via a novel omni-modal event graph approach, OmniGAIA synthesizes complex, multi-hop queries derived from real-world data that require cross-modal reasoning and external tool integration. Furthermore, we propose OmniAtlas, a native omni-modal foundation agent that follows a tool-integrated reasoning paradigm with active omni-modal perception. Trained on trajectories synthesized via a hindsight-guided tree exploration strategy and further optimized with OmniDPO for fine-grained error correction, OmniAtlas effectively enhances the tool-use capabilities of existing open-source models. This work marks a step towards next-generation native omni-modal AI assistants for real-world scenarios.
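
To make the tool-integrated reasoning paradigm concrete, here is a minimal sketch of what such an agent loop could look like. Everything in it is an illustrative assumption rather than OmniAtlas's actual implementation: the `OmniAgent` class, the tool registry, and the `model` callable are all hypothetical.

```python
"""Minimal sketch of a tool-integrated reasoning loop with active
omni-modal perception. All names here (OmniAgent, the tool registry,
the `model` callable) are hypothetical, not OmniAtlas's real API."""

from dataclasses import dataclass, field
from typing import Any, Callable

# Hypothetical perception and search tools the agent can invoke on
# demand, instead of passively consuming pre-extracted features.
TOOLS: dict[str, Callable[..., str]] = {
    "extract_video_frames": lambda path, start, end: f"<frames {start}-{end}s of {path}>",
    "transcribe_audio": lambda path: f"<transcript of {path}>",
    "zoom_image_region": lambda path, box: f"<crop {box} of {path}>",
    "web_search": lambda query: f"<results for {query!r}>",
}

@dataclass
class OmniAgent:
    # The model returns either {"action": "final_answer", "text": ...}
    # or {"action": <tool name>, "args": {...}} after a reasoning step.
    model: Callable[[list[dict[str, Any]]], dict[str, Any]]
    max_turns: int = 8
    history: list[dict[str, Any]] = field(default_factory=list)

    def run(self, task: str) -> str:
        self.history.append({"role": "user", "content": task})
        for _ in range(self.max_turns):
            step = self.model(self.history)  # reason, then pick an action
            self.history.append({"role": "assistant", "content": step})
            if step["action"] == "final_answer":
                return step["text"]
            # Active perception: fetch only the evidence the model asked for.
            observation = TOOLS[step["action"]](**step["args"])
            self.history.append({"role": "tool", "content": observation})
        return "<max turns exceeded>"
```

The point of active perception, as the abstract describes it, is that the model chooses which slice of the video, audio, or image to inspect at each step, rather than passively consuming a fixed, pre-encoded representation of every modality up front.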

Key Takeaways

  1. OmniGAIA is a benchmark with 360 tasks requiring AI agents to seamlessly integrate video, audio, and images for multi-hop reasoning (a sketch of the event-graph construction behind these tasks follows this list).

  2. Current multimodal AI systems fail at complex real-world tasks because they process modalities in silos, without tool integration.

  3. OmniAtlas, a native omni-modal agent, improves open-source models through active perception and tool-integrated reasoning training.
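
The summary above does not spell out how the omni-modal event graph yields multi-hop queries, but a minimal sketch of the idea, under assumed node/edge conventions, might look like the following. The `GRAPH` structure, relation labels, and random-walk sampling are hypothetical illustrations, not the paper's pipeline.

```python
"""Hypothetical sketch of multi-hop query synthesis from an omni-modal
event graph. The node/edge schema, relation labels, and random-walk
sampling are assumptions for illustration, not the paper's pipeline."""

import random

# Nodes are events grounded in a single modality; edges connect events
# that share an entity, time span, or causal link across modalities.
GRAPH = {
    "nodes": {
        "e1": {"modality": "video", "desc": "a chef plates a dish at 02:15"},
        "e2": {"modality": "audio", "desc": "the host says the dish's name"},
        "e3": {"modality": "image", "desc": "a menu photo lists the dish's price"},
    },
    "edges": [("e1", "e2", "same_entity"), ("e2", "e3", "same_entity")],
}

def sample_multihop_chain(graph: dict, hops: int) -> list[str]:
    """Random-walk `hops` edges so that answering requires evidence
    from every event (and hence every modality) along the path."""
    adj: dict[str, list[str]] = {}
    for u, v, _rel in graph["edges"]:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    node = random.choice(list(graph["nodes"]))
    chain = [node]
    for _ in range(hops):
        frontier = [n for n in adj.get(node, []) if n not in chain]
        if not frontier:
            break
        node = random.choice(frontier)
        chain.append(node)
    return chain

# A question template then stitches the chain into a single query whose
# answer lives in no single modality, e.g. "What price (image) is listed
# for the dish the host names (audio) while the chef plates it (video)?"
print([GRAPH["nodes"][n]["modality"] for n in sample_multihop_chain(GRAPH, 2)])
```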

Limitations

  • Most existing multimodal benchmarks focus only on perception tasks and lack evaluation of multi-step planning and external tool usage.

  • Current multimodal LLMs are limited to bi-modal interactions, lacking unified cognitive capabilities needed for general AI assistants.

Keywords

multi-modal LLMs, omni-modal perception, cross-modal reasoning, tool-integrated reasoning, hindsight-guided tree exploration, OmniDPO
