AI Agents

BeyondSWE: Can Current Code Agent Survive Beyond Single-Repo Bug Fixing?

Guoxin Chen, Fanzhe Meng, Jiale Zhao, Minghao Li, Daixuan Cheng, Huatong Song, Jie Chen, Yuzhi Lin, Hui Chen, Xin Zhao, Ruihua Song, Chang Liu, Cheng Chen, Kai Jia, Ji-Rong Wen
Published
March 3, 2026
Authors
15
Word Count
8,223
Code
Includes code

Current code agents falter beyond single-repo fixes; on BeyondSWE, even frontier models plateau below 45% success on real-world tasks.

Abstract

Current benchmarks for code agents primarily assess narrow, repository-specific fixes, overlooking critical real-world challenges such as cross-repository reasoning, domain-specialized problem solving, dependency-driven migration, and full-repository generation. To address this gap, we introduce BeyondSWE, a comprehensive benchmark that broadens existing evaluations along two axes - resolution scope and knowledge scope - using 500 real-world instances across four distinct settings. Experimental results reveal a significant capability gap: even frontier models plateau below 45% success, and no single model performs consistently across task types. To systematically investigate the role of external knowledge, we develop SearchSWE, a framework that integrates deep search with coding abilities. Our experiments show that search augmentation yields inconsistent gains and can in some cases degrade performance, highlighting the difficulty of emulating developer-like workflows that interleave search and reasoning during coding tasks. This work offers both a realistic, challenging evaluation benchmark and a flexible framework to advance research toward more capable code agents.
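The abstract describes SearchSWE as a framework that interleaves deep search with coding, in a developer-like loop of retrieving external knowledge and revising a patch. A minimal sketch of such a loop is below; the class, function names, and stub tools are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of a search-augmented coding loop in the spirit of
# SearchSWE. The search backend, patch proposer, and stopping rule are
# all placeholder assumptions for illustration.
from dataclasses import dataclass, field


@dataclass
class AgentState:
    task: str
    notes: list = field(default_factory=list)  # retrieved knowledge snippets
    patch: str = ""                            # current working code change


def search_tool(query: str) -> str:
    # Stub for an external deep-search backend (assumed interface).
    return f"doc snippet for: {query}"


def propose_patch(state: AgentState) -> str:
    # Stub for the coding model: folds retrieved notes into a revised patch.
    return f"patch using {len(state.notes)} retrieved snippets"


def run_agent(task: str, max_steps: int = 3) -> AgentState:
    """Interleave search and coding: retrieve, then revise the patch."""
    state = AgentState(task=task)
    for step in range(max_steps):
        state.notes.append(search_tool(f"{task} (step {step})"))
        state.patch = propose_patch(state)
    return state
```

The paper's finding that search augmentation yields inconsistent gains suggests the hard part lies in deciding when to search and how to fold results into the patch, not in the loop structure itself.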

Key Takeaways

  • 1

    Current code agents plateau below 45% success on real-world tasks beyond single-repository bug fixing, despite high SWE-bench scores.

  • 2

    BeyondSWE benchmark evaluates agents across four dimensions: cross-repo fixes, domain-specific problems, dependency migration, and full system generation.

  • 3

    Search augmentation provides inconsistent performance gains, revealing a critical disconnect between LLM search and coding capabilities.

Limitations

  • Existing SWE-bench benchmarks focus narrowly on isolated function-level fixes within single repositories, missing real-world complexity.

  • Current code agents lack effective integration of search capabilities with coding proficiency for developer-like workflows.

Keywords

code agents, benchmarks, cross-repository reasoning, domain-specialized problem solving, dependency-driven migration, full-repository generation, SearchSWE, deep search, coding abilities, external knowledge, developer-like workflows, reasoning, performance evaluation
