The Autonomy Review

Your Agent Pays a Tax Every Time It Plays It Safe, and San Francisco Just Marched on the Industry

Your Agent Pays a Tax Every Time You Make It Safer

"The Autonomy Tax" introduces a finding that should concern every team shipping safety-aligned agents: defense training — the alignment techniques designed to make LLMs refuse harmful requests — systematically degrades agent task performance. The degradation is not a side effect that careful tuning can eliminate. It is structural.

The core problem: current defense evaluation relies on single-turn benchmarks that measure whether a model refuses harmful prompts in isolation. These benchmarks do not capture how safety training interacts with multi-step tool use, environmental interaction, and planning — the capabilities that define agents. When a model trained to be cautious encounters ambiguous tool calls or uncertain environmental states, it over-refuses or hesitates in ways that break task completion. The result is an autonomy tax: a measurable cost to agent capability that scales with the strength of defense training.

We covered the safety-helpfulness Pareto frontier on Friday: Benjamin Plaut's finding that the two properties exist in a linear tradeoff, not a zero-sum conflict. The Autonomy Tax extends this. Even when safety and helpfulness coexist along the frontier, the current shape of defense training imposes costs that standard safety evaluations cannot see. You cannot fix what your benchmarks do not measure. arXiv:2603.19423

If you evaluate your agent's safety on single-turn refusal benchmarks and its capability on task completion benchmarks separately, you are missing the interaction effect. Test safety and capability jointly, in agentic settings, with multi-step tool use.
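
As a rough illustration of what joint evaluation could look like, here is a minimal Python sketch. It is not the paper's methodology. The Episode record, the joint_report and autonomy_tax helpers, and the definition of the tax as the drop in benign task completion are all assumptions made for this example, and the episodes are made-up toy data, not reported results. The point is only that safety and capability numbers come out of the same multi-step runs rather than from two separate benchmarks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Episode:
    """One multi-step agent run. Hypothetical record, not from the paper."""
    harmful: bool    # should the agent have refused this task?
    refused: bool    # did the agent refuse or abandon the task?
    completed: bool  # did the agent finish the task?


def joint_report(episodes):
    """Compute safety and capability metrics from the same set of episodes."""
    benign = [e for e in episodes if not e.harmful]
    harmful = [e for e in episodes if e.harmful]
    if not benign or not harmful:
        raise ValueError("need both benign and harmful episodes")
    return {
        # Capability: share of benign tasks the agent actually finished.
        "benign_completion": sum(e.completed for e in benign) / len(benign),
        # Interaction effect: share of benign tasks the agent refused.
        "over_refusal": sum(e.refused for e in benign) / len(benign),
        # Safety: share of harmful tasks the agent refused.
        "harmful_refusal": sum(e.refused for e in harmful) / len(harmful),
    }


def autonomy_tax(baseline, defended):
    """Drop in benign completion after defense training (assumed definition)."""
    before = joint_report(baseline)["benign_completion"]
    after = joint_report(defended)["benign_completion"]
    return before - after


# Toy data only: four benign and two harmful episodes per model.
baseline = [
    Episode(harmful=False, refused=False, completed=True),
    Episode(harmful=False, refused=False, completed=True),
    Episode(harmful=False, refused=False, completed=True),
    Episode(harmful=False, refused=False, completed=True),
    Episode(harmful=True, refused=True, completed=False),
    Episode(harmful=True, refused=False, completed=True),
]
defended = [
    Episode(harmful=False, refused=False, completed=True),
    Episode(harmful=False, refused=False, completed=True),
    Episode(harmful=False, refused=True, completed=False),
    Episode(harmful=False, refused=True, completed=False),
    Episode(harmful=True, refused=True, completed=False),
    Episode(harmful=True, refused=True, completed=False),
]

print(joint_report(defended))
print("autonomy tax:", autonomy_tax(baseline, defended))
```

In this toy data the defended model refuses every harmful task, so a single-turn refusal benchmark would score it as a clear improvement. The same episodes show it completing only half of the benign tasks. That gap is the kind of interaction effect that separate safety and capability benchmarks would not surface.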