
Learning Situated Awareness in the Real World

Chuhan Li, Ruilin Han, Joy Hsu, Yongyuan Liang, Rajiv Dhawan, Jiajun Wu, Ming-Hsuan Yang, Xin Eric Wang
Published: February 18, 2026
Authors: 8
Word count: 26,571
Code: included

SAW-Bench measures AI's ability to understand space from the observer's perspective in real-world navigation.

Abstract

A core aspect of human perception is situated awareness: the ability to relate ourselves to the surrounding physical environment and to reason over possible actions in context. However, most existing benchmarks for multimodal foundation models (MFMs) emphasize environment-centric spatial relations (relations among objects in a scene), while largely overlooking observer-centric relationships that require reasoning relative to the agent's viewpoint, pose, and motion. To bridge this gap, we introduce SAW-Bench (Situated Awareness in the Real World), a novel benchmark for evaluating egocentric situated awareness using real-world videos. SAW-Bench comprises 786 self-recorded videos captured with Ray-Ban Meta (Gen 2) smart glasses spanning diverse indoor and outdoor environments, together with 2,071 human-annotated question-answer pairs. It probes a model's observer-centric understanding with six different awareness tasks. Our comprehensive evaluation reveals a human-model performance gap of 37.66%, even with the best-performing MFM, Gemini 3 Flash. Beyond this gap, our in-depth analysis uncovers several notable findings; for example, while models can exploit partial geometric cues in egocentric videos, they often fail to infer a coherent camera geometry, leading to systematic spatial reasoning errors. We position SAW-Bench as a benchmark for situated spatial intelligence, moving beyond passive observation toward understanding physically grounded, observer-centric dynamics.
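The 37.66% figure above is a gap between human and model performance on the benchmark's question-answer pairs. As a minimal sketch of how such a gap can be computed, assuming exact-match accuracy over answer strings (the benchmark's actual scoring protocol and data schema are not described here, so all names and values below are illustrative):

```python
def accuracy(predictions, answers):
    """Fraction of predictions that exactly match the gold answers."""
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)

# Toy illustration with made-up observer-centric QA responses.
gold = ["left", "behind", "forward", "right"]
model_preds = ["left", "ahead", "forward", "left"]
human_preds = ["left", "behind", "forward", "right"]

model_acc = accuracy(model_preds, gold)
human_acc = accuracy(human_preds, gold)
gap_pct = (human_acc - model_acc) * 100  # human-model gap in percentage points
```

Here the gap is reported in percentage points of accuracy; on this toy data the model scores 50% against the humans' 100%, for a 50-point gap.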

Key Takeaways

  1. AI systems struggle with observer-centric spatial reasoning despite excelling at object recognition and environment analysis.

  2. SAW-Bench introduces 786 egocentric videos with 2,071 annotations to measure situated awareness across six distinct tasks.

  3. Accurate spatial reasoning from the observer's perspective is critical for robotics, AR, and autonomous navigation systems.

Limitations

  • Existing benchmarks focus on environment-centric relationships rather than observer-centric spatial understanding and positioning.

  • Current multimodal foundation models fail to track observer position and orientation changes during movement through environments.

Keywords

multimodal foundation models, egocentric videos, observer-centric relationships, situated awareness, spatial reasoning, camera geometry, real-world videos, question-answer pairs
