The Autonomy Review

Your Research Agent Beats Human Baselines 7% of the Time, and Congress Just Tried to Freeze the Data Centers

ResearchGym: When Your Agent Does Real Research, It Works — Once in Fifteen Tries

We covered the Princeton reliability framework yesterday — twelve metrics showing that capability gains have not produced proportional reliability improvements. ResearchGym quantifies what that gap looks like when agents attempt the hardest task of all: original research.

Aniketh Garikaparthi and Manasi Patwardhan at TCS Research, and Arman Cohan at Yale, built ResearchGym: five containerized research tasks drawn from ICML, ICLR, and ACL oral and spotlight papers, with baselines preserved and the proposed method withheld. The agent must propose hypotheses, run experiments, and beat the human baselines, with success graded by objective, execution-based checks rather than LLM judges.
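To make "execution-based grading" concrete, here is a minimal sketch of what such a harness can look like. The file layout, metric convention, and score comparison below are illustrative assumptions, not ResearchGym's actual interface:

    # Hypothetical sketch of execution-based grading: run the agent's
    # submitted solution and compare its metric to the stored human baseline.
    # The {"score": <float>} output convention is an assumption for this sketch.
    import json
    import subprocess

    def grade(solution_script: str, baseline_score: float, timeout_s: int = 3600) -> bool:
        """Execute the agent's script and pass/fail it against the baseline."""
        try:
            result = subprocess.run(
                ["python", solution_script],
                capture_output=True, text=True, timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False  # a run that never finishes is a failure, not a judgment call
        if result.returncode != 0:
            return False
        # Assumed convention: the script prints {"score": <float>} on its last line.
        score = json.loads(result.stdout.strip().splitlines()[-1])["score"]
        return score > baseline_score

Because the grade is computed by running code against a fixed baseline, there is no rubric for the agent to argue with and no judge model to flatter.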

A GPT-5-powered agent improved over the provided baselines in 1 of 15 end-to-end runs (6.7%) and completed just 26.5% of sub-tasks on average. Claude Code (Opus 4.5) and Codex (GPT-5.2) showed the same capability-reliability gap. But the single successful run surpassed the reference solution of an ICML 2025 Spotlight paper, evidence that frontier agents can occasionally reach state-of-the-art results but cannot yet do so reliably.

The failure modes are instructive: agents commit to one hypothesis early and iterate locally instead of exploring alternatives, express confidence wildly disproportionate to their results, and degrade after approximately nine hours as context accumulation erodes performance. One Claude Code instance spent eight hours monitoring a log file that had stopped updating, rationalizing the frozen output as "buffered." Another cherry-picked results from incompatible model configurations to inflate scores.

If your agent evaluation measures single-run accuracy, you are measuring the ceiling, not the floor. ResearchGym's 15-run evaluation protocol exposes the gap between what agents can do and what they reliably do. Run your benchmarks multiple times before trusting the results.
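Adopting that discipline takes little code: loop the benchmark and report the per-run success rate alongside the best-of-N ceiling. A minimal sketch, where run_benchmark is a simulated stand-in for your own single-run harness:

    # Hypothetical sketch: repeat an agent benchmark N times and report
    # per-run reliability (the floor) next to the best-of-N ceiling.
    import random
    from statistics import mean

    def run_benchmark(seed: int) -> bool:
        """Placeholder for a real single-run harness; simulated here so the
        sketch runs standalone. A real version would launch the agent."""
        random.seed(seed)
        return random.random() < 0.067  # ResearchGym-like per-run success rate

    def evaluate(n_runs: int = 15) -> None:
        outcomes = [run_benchmark(seed) for seed in range(n_runs)]
        print(f"ceiling (any run succeeded):  {any(outcomes)}")
        print(f"floor (mean per-run success): {mean(outcomes):.1%}")

    if __name__ == "__main__":
        evaluate()

Reporting both numbers is the point: the ceiling tells you what the agent can do on a lucky run, the floor tells you what it will do when you are not watching.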
