Your Agent's Safety Depends on Who It Thinks You Are, and OpenAI Just Bet the Company on Autonomous Research
Safety and Helpfulness Are Not at War in Your Agent
Benjamin Plaut at UC Berkeley studies what happens when you run direct preference optimization on safety and helpfulness — separately and together — in an agentic setting with multi-step tool use. The headline finding cuts against the prevailing assumption that making agents more helpful necessarily degrades their safety. It does not. Safety training persists through subsequent helpfulness training. All training configurations end up near a linear Pareto frontier with R-squared of 0.77.
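For readers who have not implemented it, here is a minimal sketch of the standard DPO objective (Rafailov et al., 2023) in PyTorch. This is not the paper's training code: the models, trajectories, and log-probabilities below are invented for illustration. Training "on safety" versus "on helpfulness" just means feeding this same loss preference pairs labeled along one dimension or the other, or a mix of both.

```python
# Minimal DPO loss sketch (assumed setup, not the paper's code).
# Inputs are per-example summed token log-probabilities of a preferred
# and a rejected trajectory under the policy and a frozen reference model.
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Push the policy toward the chosen trajectory, anchored to the reference."""
    policy_margin = policy_logp_chosen - policy_logp_rejected
    ref_margin = ref_logp_chosen - ref_logp_rejected
    # -log sigmoid(beta * (policy margin - reference margin)), averaged over the batch
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

# Toy usage with made-up log-probs for a batch of two preference pairs
# (e.g., safety-labeled pairs in one run, helpfulness-labeled pairs in another).
loss = dpo_loss(torch.tensor([-12.3, -9.8]), torch.tensor([-14.1, -11.0]),
                torch.tensor([-12.0, -10.2]), torch.tensor([-13.5, -10.9]))
print(loss.item())
```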
This is structurally different from prior work in chat settings, where safety and helpfulness appear to trade off far more sharply. In agents, the relationship is orderly: you move along the frontier, but training on one metric does not destroy the other. The practical implication cuts both ways. The good news is that you do not face a binary choice between a safe agent and a useful one. The bad news is that the frontier is real: you cannot escape it without architectural changes. No amount of DPO moves you off the line.
If your safety-helpfulness tradeoff feels like a zero-sum fight, the problem may be your framing, not your model. DPO on both dimensions produces predictable frontier movement, not chaos. Invest in understanding where your agent sits on that frontier before adding more training.
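As a concrete, hypothetical illustration of what "where your agent sits on the frontier" means: fit a line to (helpfulness, safety) evaluation scores from several training configurations, check how tightly they hug it with R-squared, and see which side of the line a new configuration lands on. The scores below are made up; only the arithmetic is the point.

```python
# Sketch: fit a linear Pareto frontier to hypothetical (helpfulness, safety)
# scores from several training configurations and place a new agent on it.
import numpy as np

# Hypothetical evaluation scores, one row per configuration: (helpfulness, safety).
# Real values would come from your own eval suite.
scores = np.array([
    [0.42, 0.93], [0.55, 0.82], [0.61, 0.83],
    [0.70, 0.70], [0.78, 0.68], [0.85, 0.57],
])
helpfulness, safety = scores[:, 0], scores[:, 1]

# Ordinary least-squares line: safety ~ slope * helpfulness + intercept.
slope, intercept = np.polyfit(helpfulness, safety, deg=1)
predicted = slope * helpfulness + intercept

# R^2 measures how tightly the configurations hug that line.
ss_res = np.sum((safety - predicted) ** 2)
ss_tot = np.sum((safety - safety.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot
print(f"frontier: safety ~ {slope:.2f} * helpfulness + {intercept:.2f}, R^2 = {r_squared:.2f}")

# Place a hypothetical new agent relative to the frontier: a positive residual
# means it sits above the fitted tradeoff, a negative one means below it.
new_help, new_safety = 0.65, 0.79
residual = new_safety - (slope * new_help + intercept)
print(f"residual vs. frontier: {residual:+.2f}")
```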