Multimodal AI

Conversational Image Segmentation: Grounding Abstract Concepts with Scalable Supervision

Adarsh Sahoo, Georgia Gkioxari
Published: February 13, 2026
Authors: 2
Word count: 12,722

Teaching AI systems to understand intent and affordances through conversational image segmentation at scale.

Abstract

Conversational image segmentation grounds abstract, intent-driven concepts into pixel-accurate masks. Prior work on referring image grounding focuses on categorical and spatial queries (e.g., "left-most apple") and overlooks functional and physical reasoning (e.g., "where can I safely store the knife?"). We address this gap and introduce Conversational Image Segmentation (CIS) and ConverSeg, a benchmark spanning entities, spatial relations, intent, affordances, functions, safety, and physical reasoning. We also present ConverSeg-Net, which fuses strong segmentation priors with language understanding, and an AI-powered data engine that generates prompt-mask pairs without human supervision. We show that current language-guided segmentation models are inadequate for CIS, while ConverSeg-Net trained on our data engine achieves significant gains on ConverSeg and maintains strong performance on existing language-guided segmentation benchmarks. Project webpage: https://glab-caltech.github.io/converseg/

Key Takeaways

  1. Conversational image segmentation enables AI systems to understand functional properties and user intent, beyond basic object recognition.

  2. A five-stage automated data engine generates pixel-accurate masks paired with natural-language prompts at scale, without human supervision.

  3. Five concept families (entities, spatial relationships, interactions, affordances, and physics) capture the full spectrum of conversational reasoning about images.
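The data engine's output, prompt-mask pairs tagged with a concept family, can be pictured as a simple record type. The sketch below is illustrative only: the class name, fields, and RLE mask encoding are assumptions, not the paper's actual data format.

```python
from dataclasses import dataclass

# The five concept families named in the takeaways above.
CONCEPT_FAMILIES = {"entities", "spatial", "interactions", "affordances", "physics"}


@dataclass
class PromptMaskPair:
    """One training example: a hypothetical sketch of a prompt-mask pair.

    Field names and the run-length-encoded mask are illustrative assumptions,
    not the format used by the ConverSeg data engine.
    """
    image_id: str   # identifier of the source image
    prompt: str     # natural-language query, e.g. an intent or affordance question
    family: str     # one of the five concept families
    mask_rle: str   # binary segmentation mask, run-length encoded (assumed)

    def __post_init__(self) -> None:
        # Reject examples tagged with an unknown concept family.
        if self.family not in CONCEPT_FAMILIES:
            raise ValueError(f"unknown concept family: {self.family!r}")


# An intent-driven example in the spirit of the abstract's knife query.
pair = PromptMaskPair(
    image_id="img_0001",
    prompt="where can I safely store the knife?",
    family="affordances",
    mask_rle="0,512;1,48;0,464",  # placeholder RLE string
)
```

Validating the concept family at construction time keeps a generated dataset consistent even when no human reviews the examples.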

Limitations

  • Traditional vision-language models struggle with complex reasoning about affordances, physics constraints, and functional properties in real-world scenarios.

  • Creating pixel-accurate masks paired with reasoning-heavy natural language prompts is prohibitively expensive with human annotation alone.

Keywords

conversational image segmentation, referring image grounding, language-guided segmentation, segmentation priors, language understanding, AI-powered data engine, prompt-mask pairs
