
GameDevBench: Evaluating Agentic Capabilities Through Game Development

Wayne Chi, Yixiong Fang, Arnav Yayavaram, Siddharth Yayavaram, Seth Karten, Qiuhong Anna Wei, Runkun Chen, Alexander Wang, Valerie Chen, Ameet Talwalkar, Chris Donahue
Published: February 11, 2026
Authors: 11
Word count: 8,965
Code: Includes code

GameDevBench evaluates AI agents on complete game development tasks using deterministic verification in Godot.

Abstract

Despite rapid progress on coding agents, their multimodal counterparts have lagged behind. A key challenge is the scarcity of evaluation testbeds that combine the complexity of software development with the need for deep multimodal understanding. Game development provides such a testbed: agents must navigate large, dense codebases while manipulating intrinsically multimodal assets such as shaders, sprites, and animations within a visual game scene. We present GameDevBench, the first benchmark for evaluating agents on game development tasks. GameDevBench consists of 132 tasks derived from web and video tutorials. Tasks require significant multimodal understanding and are complex: the average solution requires over three times as many lines of code and file changes as prior software development benchmarks. Agents still struggle with game development, with the best agent solving only 54.5% of tasks. We find a strong correlation between perceived task difficulty and multimodal complexity, with success rates dropping from 46.9% on gameplay-oriented tasks to 31.6% on 2D graphics tasks. To improve multimodal capability, we introduce two simple image- and video-based feedback mechanisms for agents. Despite their simplicity, these methods consistently improve performance, with the largest change being an increase in Claude Sonnet 4.5's performance from 33.3% to 47.7%. We release GameDevBench publicly to support further research into agentic game development.
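The image-based feedback mechanism described in the abstract can be pictured as a loop in which the agent sees a rendered frame of the game scene after each edit and uses it to guide the next one. The sketch below is purely illustrative: `render_frame`, `run_agent_step`, and `check_task` are hypothetical stand-ins, not GameDevBench or Godot APIs.

```python
# Hypothetical sketch of an image-feedback agent loop. The three helper
# functions are stand-ins: a real system would capture a Godot screenshot,
# call a multimodal LLM with the image attached, and run the benchmark's
# deterministic verifier.

def render_frame(state: int) -> str:
    """Stand-in for capturing a screenshot of the running game scene."""
    return f"frame:{state}"

def run_agent_step(state: int, screenshot: str) -> int:
    """Stand-in for one agent edit informed by the visual feedback."""
    return state + 1  # pretend each step makes one unit of progress

def check_task(state: int, target: int) -> bool:
    """Stand-in for deterministic task verification."""
    return state >= target

def agent_loop(target: int, max_steps: int = 5):
    state = 0
    for _ in range(max_steps):
        shot = render_frame(state)          # visual feedback after last edit
        state = run_agent_step(state, shot)  # next edit conditioned on image
        if check_task(state, target):
            return True, state
    return False, state
```

The key design point is that verification stays deterministic (a programmatic check, not a model judgment), while the image only shapes the agent's next action.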

Key Takeaways

  • 1

    GameDevBench is the first comprehensive benchmark evaluating AI agents on complete multimodal game development tasks using deterministic verification.

  • 2

    Current AI agents solve only about half of game development tasks (54.5% for the best agent), revealing fundamental gaps in agent evaluation and training.

  • 3

    Previous benchmarks focused on unimodal tasks, but real-world development requires understanding visual assets, code, and complex engine systems together.

Limitations

  • Previous approaches tackled narrow subproblems like procedural generation instead of addressing the full game development pipeline comprehensively.

  • Most existing benchmarks focus on unimodal text and code tasks, failing to capture the multimodal complexity of real development work.

Keywords

coding agents, multimodal counterparts, evaluation testbed, software development, multimodal understanding, game development, GameDevBench, web tutorials, video tutorials, multimodal complexity, agentic game development
