Generative AI

GEBench: Benchmarking Image Generation Models as GUI Environments

Haodong Li, Jingwei Wu, Quan Sun, Guopeng Li, Juanxi Tian, Huanyu Zhang, Yanlin Lai, Ruichuan An, Hongbo Peng, Yuhong Dai, Chenxi Li, Chunmei Qing, Jia Wang, Ziyang Meng, Zheng Ge, Xiangyu Zhang, Daxin Jiang
Published
February 9, 2026
Authors
17
Word Count
14,113

GEBench benchmarks image models as GUI simulators for autonomous agent training.

Abstract

Recent advancements in image generation models have enabled the prediction of future Graphical User Interface (GUI) states based on user instructions. However, existing benchmarks primarily focus on general domain visual fidelity, leaving the evaluation of state transitions and temporal coherence in GUI-specific contexts underexplored. To address this gap, we introduce GEBench, a comprehensive benchmark for evaluating dynamic interaction and temporal coherence in GUI generation. GEBench comprises 700 carefully curated samples spanning five task categories, covering both single-step interactions and multi-step trajectories across real-world and fictional scenarios, as well as grounding point localization. To support systematic evaluation, we propose GE-Score, a novel five-dimensional metric that assesses Goal Achievement, Interaction Logic, Content Consistency, UI Plausibility, and Visual Quality. Extensive evaluations on current models indicate that while they perform well on single-step transitions, they struggle significantly with maintaining temporal coherence and spatial grounding over longer interaction sequences. Our findings identify icon interpretation, text rendering, and localization precision as critical bottlenecks. This work provides a foundation for systematic assessment and suggests promising directions for future research toward building high-fidelity generative GUI environments. The code is available at: https://github.com/stepfun-ai/GEBench.
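The abstract names the five dimensions of GE-Score but not how they are combined. The sketch below is a minimal, hypothetical container for one GE-Score rating; the field names follow the dimensions listed above, while the numeric scale and the unweighted-mean aggregation are illustrative assumptions rather than the paper's actual protocol.

```python
from dataclasses import dataclass, astuple

@dataclass
class GEScore:
    """Hypothetical holder for the five GE-Score dimensions.

    Field names mirror the dimensions in the abstract; the score
    scale and mean aggregation are assumptions for illustration.
    """
    goal_achievement: float
    interaction_logic: float
    content_consistency: float
    ui_plausibility: float
    visual_quality: float

    def overall(self) -> float:
        # Unweighted mean of the five per-dimension scores (assumed).
        values = astuple(self)
        return sum(values) / len(values)

# Example: rating one generated GUI transition (made-up numbers).
score = GEScore(4.0, 3.5, 3.0, 4.5, 4.0)
print(round(score.overall(), 2))  # 3.8
```

A real evaluation would likely weight dimensions or score them per step along a trajectory; this merely shows the shape of a five-dimensional rating.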

Key Takeaways

  • 1

    Image generation models excel at single-step transitions but fail to maintain consistency across multi-step GUI interactions.

  • 2

    Traditional benchmarks measure photorealism, not the GUI simulation fidelity needed for autonomous agent training.

  • 3

    GEBench provides systematic evaluation across five task categories to test GUI generation reliability.

Limitations

  • Existing video generation benchmarks focus on continuous natural motion, not discrete GUI state changes.

  • Current image generation metrics like FID scores don't measure GUI consistency or text rendering accuracy.

Keywords

GUI generation, temporal coherence, dynamic interaction, visual fidelity, GUI-specific contexts, GEBench, GE-Score, goal achievement, interaction logic, content consistency, UI plausibility, visual quality
