Your Agent Searches Without a Strategy, and the EU Rewrites AI Rules Tomorrow
Your Agent Matches Human Accuracy by Searching Harder, Not Smarter
Łukasz Borchmann (Snowflake), Jordy Van Landeghem, and collaborators from Oxford, Hugging Face, UNC Chapel Hill, and other institutions introduce MADQA, a benchmark of 2,250 human-authored questions grounded in 800 heterogeneous PDF documents. It tests whether AI agents navigate document collections strategically or simply fall back on brute-force retrieval.
The headline finding is uncomfortable for agent builders: the best agents match human searchers in raw accuracy, but they get there in a fundamentally different way. Humans navigate strategically, selecting documents based on relevance cues, skimming efficiently, and building mental models of document structure. Agents reach similar scores through exhaustive retrieval, processing more pages, running more queries, and compensating for weak strategy with raw compute. The result: a gap of nearly 20% between the best agents and oracle performance persists, and it is not a scale problem. More retrieval does not close it.
The finding connects to a pattern we have tracked all month. Multi-agent systems fail not because individual agents are weak, but because the coordination and reasoning architecture around them is insufficient. MADQA shows the same principle at the individual agent level: raw capability without strategic reasoning hits a ceiling.
If your agent pipeline includes document retrieval, benchmark it on strategic efficiency, not just accuracy. MADQA provides the evaluation framework. Agents that get the right answer by processing everything are expensive and fragile at scale.
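What does "benchmark on strategic efficiency" look like in practice? A minimal sketch: log per-question cost telemetry alongside correctness, then report cost per correct answer, which exposes brute-force strategies that raw accuracy hides. MADQA does not prescribe this harness; the names below (TraceStats, efficiency_report, pages_read, queries_issued) are hypothetical illustrations, not part of the benchmark's API.

```python
from dataclasses import dataclass

@dataclass
class TraceStats:
    """Per-question telemetry a retrieval-agent harness could record.
    Assumes your agent loop can count pages it reads and queries it issues."""
    correct: bool
    pages_read: int
    queries_issued: int

def efficiency_report(traces: list[TraceStats]) -> dict[str, float]:
    """Summarize accuracy alongside the cost paid to reach it."""
    n = len(traces)
    accuracy = sum(t.correct for t in traces) / n
    avg_pages = sum(t.pages_read for t in traces) / n
    avg_queries = sum(t.queries_issued for t in traces) / n
    hits = [t for t in traces if t.correct]
    # Cost per correct answer separates strategic navigation from exhaustive scanning.
    pages_per_hit = sum(t.pages_read for t in hits) / len(hits) if hits else float("inf")
    return {
        "accuracy": accuracy,
        "avg_pages_read": avg_pages,
        "avg_queries": avg_queries,
        "pages_per_correct_answer": pages_per_hit,
    }

if __name__ == "__main__":
    # Toy traces: both agents answer 2 of 3 questions correctly, at very different cost.
    strategic = [TraceStats(True, 4, 2), TraceStats(True, 6, 3), TraceStats(False, 5, 2)]
    brute_force = [TraceStats(True, 180, 40), TraceStats(True, 220, 55), TraceStats(False, 160, 38)]
    print("strategic  :", efficiency_report(strategic))
    print("brute force:", efficiency_report(brute_force))
```

Both toy agents score 0.67 accuracy, but the brute-force profile reads roughly 40x more pages per correct answer. That is the gap an accuracy-only leaderboard never shows you, and the one that determines your inference bill.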