Your Agent Benchmarks Still Miss the Point, and the White House Just Drew a National AI Line
AgentProcessBench: step labels for tool-using trajectories
Shengda Fan and colleagues introduce AgentProcessBench, a benchmark built explicitly for diagnosing process quality in tool-using agents: 1,000 trajectories and 8,509 human-labeled step annotations, with a ternary scheme and an “error propagation” rule to reduce labeling ambiguity.[1]
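To make that concrete, here is a rough sketch of what a step-labeled trajectory and a first-error propagation rule can look like. The class names, label strings, and the exact propagation logic are my illustration of the idea, not the paper's actual schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List

# Illustrative ternary step labels; the benchmark's own label set may differ.
class StepLabel(Enum):
    CORRECT = "correct"
    ERROR = "error"
    AMBIGUOUS = "ambiguous"  # third class for hard-to-judge steps (assumption)

@dataclass
class Step:
    tool_call: str
    label: StepLabel

def apply_error_propagation(steps: List[Step]) -> List[Step]:
    """Once a step is labeled as an error, treat every later step as tainted.

    This mirrors the spirit of an "error propagation" rule: annotators only
    need to locate the first error, rather than judge steps that already sit
    on top of a corrupted state. The paper's exact rule may differ.
    """
    propagated = []
    seen_error = False
    for step in steps:
        if seen_error:
            propagated.append(Step(step.tool_call, StepLabel.ERROR))
        else:
            propagated.append(step)
            seen_error = step.label is StepLabel.ERROR
    return propagated
```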
The important point is not the dataset size. It is the framing. In tool-augmented settings, errors are not like algebra mistakes you can backtrack from. Tool calls can change state. Bad actions create side effects. That makes step-level verification a precondition for reliability, not a research nicety.
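One practical consequence: the check has to run before a state-changing call, not after the fact. A minimal sketch, with a hypothetical verifier and tool list standing in for whatever your stack actually uses:

```python
# Hypothetical names throughout: SIDE_EFFECT_TOOLS, verify_step, and
# execute_tool are placeholders, not any particular framework's API.
SIDE_EFFECT_TOOLS = {"send_email", "delete_file", "post_payment"}

def execute_with_verification(tool_name, args, history, verify_step, execute_tool):
    """Gate state-changing tool calls behind a step-level verifier.

    Unlike a reasoning mistake, a bad call here cannot be undone by
    backtracking, so rejection has to happen before execution.
    """
    if tool_name in SIDE_EFFECT_TOOLS:
        verdict = verify_step(history, tool_name, args)  # learned or rule-based critic
        if not verdict.ok:
            raise RuntimeError(f"Step rejected before execution: {verdict.reason}")
    return execute_tool(tool_name, args)
```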
If your eval stack only logs “task success,” you will systematically miss the failure modes that cause the worst incidents. Start instrumenting trajectories and scoring steps, not just outcomes.
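Concretely, a scoring pass can report step-level error rates next to task success, including the "succeeded anyway" runs that an outcome-only eval never surfaces. The field names assume a generic logging schema of your own, not AgentProcessBench's format.

```python
def score_run(trajectories):
    """Report task success rate plus step-level error statistics.

    Each trajectory is assumed to look like:
        {"success": bool, "steps": [{"label": "correct" | "error" | ...}, ...]}
    """
    n_tasks = len(trajectories)
    n_success = sum(t["success"] for t in trajectories)
    n_steps = sum(len(t["steps"]) for t in trajectories)
    n_bad_steps = sum(
        1 for t in trajectories for s in t["steps"] if s["label"] == "error"
    )
    return {
        "task_success_rate": n_success / max(n_tasks, 1),
        "step_error_rate": n_bad_steps / max(n_steps, 1),
        # Tasks that "succeeded" despite an erroneous step: the failure mode
        # an outcome-only metric systematically hides.
        "lucky_successes": sum(
            1 for t in trajectories
            if t["success"] and any(s["label"] == "error" for s in t["steps"])
        ),
    }
```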