Your Agent Gets Smarter Without Getting More Reliable, and China's Open-Source AI Strategy Just Hit a Wall
Eighteen Months of Capability Gains, Almost No Reliability Improvement
Stephan Rabanser, Sayash Kapoor, Peter Kirgis, Kangheng Liu, Saiteja Utpala, and Arvind Narayanan at Princeton propose twelve concrete metrics that decompose agent reliability into four dimensions: consistency, robustness, predictability, and safety. Grounded in safety-critical engineering practice from aviation, nuclear power, and automotive systems, this is the most rigorous reliability evaluation of AI agents published to date.
The findings are uncomfortable for anyone deploying agents in production. Evaluating 14 models across two benchmarks, the researchers find that recent capability gains have produced only modest reliability improvements. Claude Opus 4.5 and Gemini 3 Pro score best at 85% overall reliability — but the sub-metrics tell a different story. Gemini 3 Pro scores just 52% on calibration (knowing when its answers are likely accurate) and 25% on avoiding catastrophic mistakes. Claude Opus 4.5 is the most consistent model tested, but still only 73% consistent across runs. The interactive dashboard at hal.cs.princeton.edu/reliability/ makes the full picture navigable.
If your evaluation pipeline measures accuracy without separately measuring reliability, you are missing the failure modes that matter in production. These twelve metrics are a ready-made checklist. Start with consistency and catastrophic error avoidance.
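As a starting point, here is a minimal sketch of what those two checks can look like in practice. The function names, the modal-answer agreement score, and the catastrophic-outcome labels are illustrative assumptions for exact-match tasks, not the paper's own metric definitions:

```python
from collections import Counter

def consistency(run_answers):
    """Average, over tasks, of the fraction of runs agreeing with the
    modal answer. `run_answers` is a list of per-task lists, each holding
    the answer from one of k independent runs of the same agent.
    (Illustrative metric; the paper's exact definition may differ.)"""
    scores = []
    for answers in run_answers:
        modal_count = Counter(answers).most_common(1)[0][1]
        scores.append(modal_count / len(answers))
    return sum(scores) / len(scores)

def catastrophic_error_rate(run_outcomes,
                            catastrophic=frozenset({"harmful", "irreversible"})):
    """Fraction of all outcomes, across tasks and runs, flagged with a
    catastrophic label. The label set here is a placeholder."""
    flat = [o for outcomes in run_outcomes for o in outcomes]
    return sum(o in catastrophic for o in flat) / len(flat)

# Example: 3 tasks, 4 runs each
runs = [
    ["A", "A", "A", "B"],   # 3/4 agree with the modal answer
    ["X", "X", "X", "X"],   # 4/4 agree
    ["1", "2", "1", "3"],   # 2/4 agree
]
print(round(consistency(runs), 3))  # 0.75
```

Even this crude version surfaces the gap the paper documents: an agent can have high average accuracy while its consistency score across repeated runs stays well below 1.0.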