AI Agents

BeyondSWE: Can Current Code Agent Survive Beyond Single-Repo Bug Fixing?

Guoxin Chen, Fanzhe Meng, Jiale Zhao, Minghao Li, Daixuan Cheng, Huatong Song, Jie Chen, Yuzhi Lin, Hui Chen, Xin Zhao, Ruihua Song, Chang Liu, Cheng Chen, Kai Jia, Ji-Rong Wen
Published
March 3, 2026
Authors
15
Word Count
8,223
Code
Includes code

Current code agents falter beyond single-repo fixes; on BeyondSWE, even frontier models plateau below 45% success on real-world tasks.

Abstract

Current benchmarks for code agents primarily assess narrow, repository-specific fixes, overlooking critical real-world challenges such as cross-repository reasoning, domain-specialized problem solving, dependency-driven migration, and full-repository generation. To address this gap, we introduce BeyondSWE, a comprehensive benchmark that broadens existing evaluations along two axes - resolution scope and knowledge scope - using 500 real-world instances across four distinct settings. Experimental results reveal a significant capability gap: even frontier models plateau below 45% success, and no single model performs consistently across task types. To systematically investigate the role of external knowledge, we develop SearchSWE, a framework that integrates deep search with coding abilities. Our experiments show that search augmentation yields inconsistent gains and can in some cases degrade performance, highlighting the difficulty of emulating developer-like workflows that interleave search and reasoning during coding tasks. This work offers both a realistic, challenging evaluation benchmark and a flexible framework to advance research toward more capable code agents.
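The abstract describes SearchSWE as a framework that interleaves deep search with coding, in a developer-like loop of retrieving external knowledge and revising a patch. A minimal sketch of such a loop is below; the class, function names, and stub tools are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of a search-augmented coding loop in the spirit of
# SearchSWE. The search backend, patch proposer, and stopping rule are
# all placeholder assumptions for illustration.
from dataclasses import dataclass, field


@dataclass
class AgentState:
    task: str
    notes: list = field(default_factory=list)  # retrieved knowledge snippets
    patch: str = ""                            # current working code change


def search_tool(query: str) -> str:
    # Stub for an external deep-search backend (assumed interface).
    return f"doc snippet for: {query}"


def propose_patch(state: AgentState) -> str:
    # Stub for the coding model: folds retrieved notes into a revised patch.
    return f"patch using {len(state.notes)} retrieved snippets"


def run_agent(task: str, max_steps: int = 3) -> AgentState:
    """Interleave search and coding: retrieve, then revise the patch."""
    state = AgentState(task=task)
    for step in range(max_steps):
        state.notes.append(search_tool(f"{task} (step {step})"))
        state.patch = propose_patch(state)
    return state
```

The paper's finding that search augmentation yields inconsistent gains suggests the hard part lies in deciding when to search and how to fold results into the patch, not in the loop structure itself.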

Key Takeaways

  • 1

    Current code agents plateau below 45% success on real-world tasks beyond single-repository bug fixing, despite high SWE-bench scores.

  • 2

    BeyondSWE benchmark evaluates agents across four dimensions: cross-repo fixes, domain-specific problems, dependency migration, and full system generation.

  • 3

    Search augmentation provides inconsistent performance gains, revealing a critical disconnect between LLM search and coding capabilities.

Limitations

  • Existing SWE-bench benchmarks focus narrowly on isolated function-level fixes within single repositories, missing real-world complexity.

  • Current code agents lack effective integration of search capabilities with coding proficiency for developer-like workflows.

Keywords

code agents, benchmarks, cross-repository reasoning, domain-specialized problem solving, dependency-driven migration, full-repository generation, SearchSWE, deep search, coding abilities, external knowledge, developer-like workflows, reasoning, performance evaluation
