AI Agents

ResearchGym: Evaluating Language Model Agents on Real-World AI Research

Aniketh Garikaparthi, Manasi Patwardhan, Arman Cohan
Published: February 16, 2026
Authors: 3
Word Count: 23,984
Code: Includes code

ResearchGym benchmarks AI agents on real research tasks, finding frontier models succeed only 6.7% of the time.

Abstract

We introduce ResearchGym, a benchmark and execution environment for evaluating AI agents on end-to-end research. To instantiate it, we repurpose five oral and spotlight papers from ICML, ICLR, and ACL. From each paper's repository, we preserve the datasets, evaluation harness, and baseline implementations but withhold the paper's proposed method. This yields five containerized task environments comprising 39 sub-tasks in total. Within each environment, agents must propose novel hypotheses, run experiments, and attempt to surpass strong human baselines on the paper's metrics. In a controlled evaluation of an agent powered by GPT-5, we observe a sharp capability-reliability gap. The agent surpasses the repository's provided baselines in just 1 of 15 evaluations (6.7%), and then only by 11.5%, and completes only 26.5% of sub-tasks on average. We identify recurring long-horizon failure modes, including impatience, poor time and resource management, overconfidence in weak hypotheses, difficulty coordinating parallel experiments, and hard limits from context length. Yet in a single run, the agent surpasses the solution of an ICML 2025 Spotlight task, indicating that frontier agents can occasionally reach state-of-the-art performance, but do so unreliably. We additionally evaluate proprietary agent scaffolds, including Claude Code (Opus-4.5) and Codex (GPT-5.2), which display a similar gap. ResearchGym provides infrastructure for systematic evaluation and analysis of autonomous agents on closed-loop research.
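The headline numbers (a 6.7% success rate over 15 evaluations, 26.5% average sub-task completion) follow from straightforward aggregation. A minimal sketch of that aggregation, with illustrative function names and made-up per-run scores (not taken from the paper's codebase):

```python
# Hypothetical aggregation of ResearchGym-style results.
# All names and numbers below are illustrative, not from the paper.

def success_rate(baseline_scores, agent_scores):
    """Fraction of evaluations in which the agent beats the human baseline."""
    wins = sum(a > b for a, b in zip(agent_scores, baseline_scores))
    return wins / len(baseline_scores)

def subtask_completion(completed, total):
    """Mean fraction of sub-tasks completed per task environment."""
    return sum(c / t for c, t in zip(completed, total)) / len(total)

# Example: 1 win in 15 runs -> 6.7%; the winning run beats its
# baseline of 0.80 by 11.5% (0.892 / 0.80 = 1.115).
baselines = [0.80] * 15
agent = [0.70] * 14 + [0.892]
print(f"{success_rate(baselines, agent):.1%}")  # prints 6.7%
```

Per-task completion is averaged the same way, so environments with many sub-tasks do not dominate the mean.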

Key Takeaways

  1. ResearchGym evaluates AI agents on complete research cycles, from ideation through iteration, using real papers.

  2. Frontier language models succeed in only 6.7% of runs, revealing unreliability in autonomous research tasks.

  3. The benchmark uses recent 2025 papers with isolated Docker containers and realistic computational constraints.

Limitations

  • Most existing benchmarks require massive compute, rely on gameable LLM judges, or contain solutions already present in models' training data.

  • Agents completed only 26.5% of sub-tasks on average across the evaluated research tasks.

Keywords

ResearchGym, AI agents, end-to-end research, ICML, ICLR, ACL, datasets, evaluation harness, baseline implementations, containerized task environments, sub-tasks, GPT-5, hypothesis generation, experimental execution, autonomous agents, capability-reliability gap, context length, Claude Code, Codex
