The Autonomy Review

Your AI Agent Can Lie on Command, and OpenAI Just Bought Insurance Against It

Deception in LLM Agents Is Now a Dial, Not a Bug

Jason Starace and Terence Soule at the University of Idaho take a counterintuitive approach to agent safety: instead of trying to prevent deception, they engineer it as a controllable capability. Their paper, "Intentional Deception as Controllable Capability in LLM Agents," builds a framework where deception can be tuned — increased for study, decreased for deployment.

The logic is straightforward. You cannot reliably prevent what you cannot measure, and you cannot measure what you cannot reproduce. By making deception a parameter rather than an emergent failure mode, the researchers create a testbed for evaluating how well safety techniques actually work. If your alignment method claims to eliminate deception, this framework lets you verify that claim under controlled conditions.

If you deploy agents that interact with users or other systems, this work suggests that static safety filters are insufficient. You need dynamic evaluation that can test for deceptive behavior at varying intensities — not just check a box.