Proact-VL: A Proactive VideoLLM for Real-Time AI Companions

Weicai Yan, Yuhong Dai, Qi Ran, Haodong Li, Wang Lin, Hao Liao, Xing Xie, Tao Jin, Jianxun Lian
Published: March 3, 2026
Authors: 9
Word count: 16,886
Code: included

Proact-VL enables real-time AI companions that know when and how much to speak.

Abstract

Proactive and real-time interactive experiences are essential for human-like AI companions, yet building them faces three key challenges: (1) achieving low-latency inference under continuous streaming inputs, (2) autonomously deciding when to respond, and (3) controlling both the quality and quantity of generated content to meet real-time constraints. In this work, we instantiate AI companions through two gaming roles, commentator and guide, selected for their suitability for automatic evaluation. We introduce the Live Gaming Benchmark, a large-scale dataset covering three representative scenarios (solo commentary, co-commentary, and user guidance), and present Proact-VL, a general framework that shapes multimodal language models into proactive, real-time interactive agents capable of human-like environment perception and interaction. Extensive experiments show that Proact-VL achieves superior response latency and quality while maintaining strong video understanding capabilities, demonstrating its practicality for real-time interactive applications.

Key Takeaways

  1. Proact-VL balances proactive response timing with low-latency generation for real-time AI companions in gaming.

  2. The Live Gaming Dataset provides 561 hours of professional commentary across twelve games for training and evaluation.

  3. A lightweight FLAG token mechanism autonomously decides when to respond based on visual and contextual cues.
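
The FLAG-token idea from the takeaways can be sketched as a thresholded decision loop over a streaming input: at each step the model scores a special token, and crossing the threshold triggers response generation. This is a minimal illustration, not the paper's implementation; the function names, the threshold value, and the per-frame probabilities below are all hypothetical stand-ins for real model outputs.

```python
# Hedged sketch of a FLAG-token-style proactive decision loop.
# All names and values are illustrative; the paper's actual
# mechanism is not fully specified in this summary.
from dataclasses import dataclass


@dataclass
class FlagDecision:
    should_respond: bool  # whether to start generating a response now
    flag_prob: float      # model's score for the special FLAG token


def decide_to_respond(flag_prob: float, threshold: float = 0.5) -> FlagDecision:
    """At each streaming step, compare the FLAG-token score to a threshold."""
    return FlagDecision(should_respond=flag_prob >= threshold,
                        flag_prob=flag_prob)


def run_stream(flag_probs, threshold: float = 0.5):
    """Iterate over per-frame FLAG probabilities (stand-ins for model
    outputs on streaming video) and collect the steps that trigger speech."""
    triggered = []
    for step, p in enumerate(flag_probs):
        if decide_to_respond(p, threshold).should_respond:
            triggered.append(step)
    return triggered
```

For example, `run_stream([0.1, 0.3, 0.8, 0.2, 0.9])` returns `[2, 4]`: the agent stays silent on low-salience frames and speaks only when the FLAG score spikes, which is the "when to respond" behavior the takeaway describes.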

Limitations

  • Evaluation focuses on gaming scenarios; generalization to other real-time interactive applications remains unclear.

  • Script cuts off mid-explanation of the proactive mechanism; complete technical details are unavailable.

Keywords

multimodal language models, real-time interactive agents, environment perception, video understanding, response latency, proactive AI companions
