Generative AI

RISE-Video: Can Video Generators Decode Implicit World Rules?

Mingxin Liu, Shuran Ma, Shibei Meng, Xiangyu Zhao, Zicheng Zhang, Shaofeng Zhang, Zhihang Zhong, Peixian Chen, Haoyu Cao, Xing Sun, Haodong Duan, Xue Yang
Published: February 5, 2026

Abstract

While generative video models have achieved remarkable visual fidelity, their capacity to internalize and reason over implicit world rules remains a critical yet under-explored frontier. To bridge this gap, we present RISE-Video, a pioneering reasoning-oriented benchmark for Text-Image-to-Video (TI2V) synthesis that shifts the evaluative focus from surface-level aesthetics to deep cognitive reasoning. RISE-Video comprises 467 meticulously human-annotated samples spanning eight rigorous categories, providing a structured testbed for probing model intelligence across diverse dimensions, ranging from commonsense and spatial dynamics to specialized subject domains. Our framework introduces a multi-dimensional evaluation protocol consisting of four metrics: Reasoning Alignment, Temporal Consistency, Physical Rationality, and Visual Quality. To further support scalable evaluation, we propose an automated pipeline leveraging Large Multimodal Models (LMMs) to emulate human-centric assessment. Extensive experiments on 11 state-of-the-art TI2V models reveal pervasive deficiencies in simulating complex scenarios under implicit constraints, offering critical insights for the advancement of future world-simulating generative models.
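The abstract describes an automated pipeline in which LMM judges score each generated video along four metrics. As a minimal sketch of how such per-metric scores might be aggregated: the metric names follow the paper's protocol, but the 1-5 scale, the unweighted mean, and all function names here are assumptions for illustration, not the paper's actual method.

```python
from statistics import mean

# The four RISE-Video metrics; scale and aggregation below are assumed.
METRICS = (
    "reasoning_alignment",
    "temporal_consistency",
    "physical_rationality",
    "visual_quality",
)

def aggregate_scores(per_metric_scores: dict[str, list[float]]) -> dict[str, float]:
    """Average LMM-judge scores per metric, plus an overall unweighted mean."""
    summary = {m: mean(per_metric_scores[m]) for m in METRICS}
    summary["overall"] = mean(summary[m] for m in METRICS)
    return summary

# Toy example: hypothetical judge scores for three videos on a 1-5 scale.
scores = {
    "reasoning_alignment": [3.0, 4.0, 2.0],
    "temporal_consistency": [4.0, 4.0, 5.0],
    "physical_rationality": [2.0, 3.0, 3.0],
    "visual_quality": [5.0, 4.0, 4.0],
}
print(aggregate_scores(scores))
```

In practice the paper's pipeline would produce these scores by prompting an LMM per video and metric; the aggregation step shown here is the simplest possible design choice.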

Keywords

Text-Image-to-Video, multimodal models, reasoning alignment, temporal consistency, physical rationality, visual quality, large multimodal models, automated evaluation
