
OmniGAIA: Towards Native Omni-Modal AI Agents

Xiaoxi Li, Wenxiang Jiao, Jiarui Jin, Shijian Wang, Guanting Dong, Jiajie Jin, Hao Wang, Yinuo Wang, Ji-Rong Wen, Yuan Lu, Zhicheng Dou
Published: February 26, 2026
Authors: 11
Word Count: 12,685
Code: Includes code

OmniGAIA benchmarks true omni-modal AI agents that combine video, audio, and images with tool-integrated reasoning.

Abstract

Human intelligence naturally intertwines omni-modal perception -- spanning vision, audio, and language -- with complex reasoning and tool usage to interact with the world. However, current multi-modal LLMs are primarily confined to bi-modal interactions (e.g., vision-language), lacking the unified cognitive capabilities required for general AI assistants. To bridge this gap, we introduce OmniGAIA, a comprehensive benchmark designed to evaluate omni-modal agents on tasks necessitating deep reasoning and multi-turn tool execution across video, audio, and image modalities. Constructed via a novel omni-modal event graph approach, OmniGAIA synthesizes complex, multi-hop queries derived from real-world data that require cross-modal reasoning and external tool integration. Furthermore, we propose OmniAtlas, a native omni-modal foundation agent that follows a tool-integrated reasoning paradigm with active omni-modal perception. Trained on trajectories synthesized via a hindsight-guided tree exploration strategy and further optimized with OmniDPO for fine-grained error correction, OmniAtlas effectively enhances the tool-use capabilities of existing open-source models. This work marks a step towards next-generation native omni-modal AI assistants for real-world scenarios.
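
To make the tool-integrated reasoning paradigm concrete, here is a minimal sketch of what such an agent loop could look like. Everything in it is an illustrative assumption rather than OmniAtlas's actual implementation: the `OmniAgent` class, the tool registry, and the `model` callable are all hypothetical.

```python
"""Minimal sketch of a tool-integrated reasoning loop with active
omni-modal perception. All names here (OmniAgent, the tool registry,
the `model` callable) are hypothetical, not OmniAtlas's real API."""

from dataclasses import dataclass, field
from typing import Any, Callable

# Hypothetical perception and search tools the agent can invoke on
# demand, instead of passively consuming pre-extracted features.
TOOLS: dict[str, Callable[..., str]] = {
    "extract_video_frames": lambda path, start, end: f"<frames {start}-{end}s of {path}>",
    "transcribe_audio": lambda path: f"<transcript of {path}>",
    "zoom_image_region": lambda path, box: f"<crop {box} of {path}>",
    "web_search": lambda query: f"<results for {query!r}>",
}

@dataclass
class OmniAgent:
    # The model returns either {"action": "final_answer", "text": ...}
    # or {"action": <tool name>, "args": {...}} after a reasoning step.
    model: Callable[[list[dict[str, Any]]], dict[str, Any]]
    max_turns: int = 8
    history: list[dict[str, Any]] = field(default_factory=list)

    def run(self, task: str) -> str:
        self.history.append({"role": "user", "content": task})
        for _ in range(self.max_turns):
            step = self.model(self.history)  # reason, then pick an action
            self.history.append({"role": "assistant", "content": step})
            if step["action"] == "final_answer":
                return step["text"]
            # Active perception: fetch only the evidence the model asked for.
            observation = TOOLS[step["action"]](**step["args"])
            self.history.append({"role": "tool", "content": observation})
        return "<max turns exceeded>"
```

The point of active perception, as the abstract describes it, is that the model chooses which slice of the video, audio, or image to inspect at each step, rather than passively consuming a fixed, pre-encoded representation of every modality up front.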

Key Takeaways

  1. OmniGAIA is a benchmark with 360 tasks requiring AI agents to seamlessly integrate video, audio, and images for multi-hop reasoning (a sketch of the event-graph construction behind these tasks follows this list).

  2. Current multimodal AI systems fail at complex real-world tasks because they process modalities in silos, without tool integration.

  3. OmniAtlas, a native omni-modal agent, improves open-source models through active perception and tool-integrated reasoning training.
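
The summary above does not spell out how the omni-modal event graph yields multi-hop queries, but a minimal sketch of the idea, under assumed node/edge conventions, might look like the following. The `GRAPH` structure, relation labels, and random-walk sampling are hypothetical illustrations, not the paper's pipeline.

```python
"""Hypothetical sketch of multi-hop query synthesis from an omni-modal
event graph. The node/edge schema, relation labels, and random-walk
sampling are assumptions for illustration, not the paper's pipeline."""

import random

# Nodes are events grounded in a single modality; edges connect events
# that share an entity, time span, or causal link across modalities.
GRAPH = {
    "nodes": {
        "e1": {"modality": "video", "desc": "a chef plates a dish at 02:15"},
        "e2": {"modality": "audio", "desc": "the host says the dish's name"},
        "e3": {"modality": "image", "desc": "a menu photo lists the dish's price"},
    },
    "edges": [("e1", "e2", "same_entity"), ("e2", "e3", "same_entity")],
}

def sample_multihop_chain(graph: dict, hops: int) -> list[str]:
    """Random-walk `hops` edges so that answering requires evidence
    from every event (and hence every modality) along the path."""
    adj: dict[str, list[str]] = {}
    for u, v, _rel in graph["edges"]:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    node = random.choice(list(graph["nodes"]))
    chain = [node]
    for _ in range(hops):
        frontier = [n for n in adj.get(node, []) if n not in chain]
        if not frontier:
            break
        node = random.choice(frontier)
        chain.append(node)
    return chain

# A question template then stitches the chain into a single query whose
# answer lives in no single modality, e.g. "What price (image) is listed
# for the dish the host names (audio) while the chef plates it (video)?"
print([GRAPH["nodes"][n]["modality"] for n in sample_multihop_chain(GRAPH, 2)])
```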

Limitations

  • Most existing multimodal benchmarks focus only on perception tasks and lack evaluation of multi-step planning and external tool usage.

  • Current multimodal LLMs are limited to bi-modal interactions, lacking unified cognitive capabilities needed for general AI assistants.

Keywords

multi-modal LLMs, omni-modal perception, cross-modal reasoning, tool-integrated reasoning, hindsight-guided tree exploration, OmniDPO
