Your Agent Needs to Learn When to Say No
MOSAIC Teaches Agents When to Refuse — and It Actually Generalizes
Most agent safety work treats harmful actions as something to filter after the fact. MOSAIC, a new framework from Aradhye Agarwal and colleagues at Microsoft Research, takes a different approach: it makes refusal an explicit, first-class step in the agent's plan-check-act loop. At each step, the agent reasons about whether to act or refuse before executing any tool call.
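To make the pattern concrete, here is a minimal sketch of a plan-check-act loop where refusal is an explicit step rather than a post-hoc filter. The helper names (plan_next_action, assess_safety, execute_tool) are hypothetical placeholders for illustration, not part of any released MOSAIC API.

```python
# Sketch of a plan-check-act/refuse loop: before every tool call, the agent
# explicitly reasons about whether to act or refuse at this step.
# All helper names below are illustrative assumptions, not MOSAIC's API.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class StepDecision:
    action: Optional[dict]   # proposed tool call, e.g. {"tool": "search", "args": {...}}
    refuse: bool             # True if the agent declines to act at this step
    rationale: str           # the safety reasoning, kept as part of the trajectory

def run_agent(task: str,
              plan_next_action: Callable[[str, list], dict],
              assess_safety: Callable[[str, dict, list], StepDecision],
              execute_tool: Callable[[dict], str],
              max_steps: int = 10) -> list:
    """Plan, check (act vs. refuse), then act; refusal ends the episode early."""
    trajectory = []
    for _ in range(max_steps):
        proposed = plan_next_action(task, trajectory)         # plan
        decision = assess_safety(task, proposed, trajectory)  # check: act or refuse
        trajectory.append({"proposed": proposed, "decision": decision})
        if decision.refuse:
            # Early refusal is recorded as a first-class step, not filtered afterward.
            break
        observation = execute_tool(decision.action)           # act
        trajectory.append({"observation": observation})
    return trajectory
```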
The key technical contribution is preference-based reinforcement fine-tuning using pairwise trajectory comparisons. Rather than scoring individual actions, the model learns by comparing full trajectories, including cases where refusing early is preferable to aborting late. The results are notably robust: explicit safety reasoning learned under MOSAIC generalizes across model families, scales, and domains, improving out-of-distribution robustness on harmful tasks, prompt injection attacks, and privacy-sensitive tool use while preserving benign-task utility and token efficiency.
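For intuition, the sketch below shows a generic Bradley-Terry-style preference loss computed over whole trajectories rather than single actions. This is not MOSAIC's exact objective, and the names (trajectory_logprob, preferred/rejected pairs, the HF-style model call) are assumptions made for illustration.

```python
# Generic pairwise trajectory preference loss (Bradley-Terry style), assuming a
# Hugging Face-style causal LM whose forward pass returns .logits.
# This illustrates learning from trajectory comparisons; it is not MOSAIC's objective.
import torch
import torch.nn.functional as F

def trajectory_logprob(model, trajectory_tokens: torch.Tensor) -> torch.Tensor:
    """Sum of per-token log-probabilities of a full trajectory under the policy."""
    logits = model(trajectory_tokens[:, :-1]).logits          # [B, T-1, V]
    logprobs = F.log_softmax(logits, dim=-1)
    targets = trajectory_tokens[:, 1:].unsqueeze(-1)           # next-token targets
    return logprobs.gather(-1, targets).squeeze(-1).sum(dim=-1)

def pairwise_preference_loss(model,
                             preferred: torch.Tensor,
                             rejected: torch.Tensor,
                             beta: float = 0.1) -> torch.Tensor:
    """Push the preferred trajectory (e.g. an early, well-reasoned refusal) above
    the rejected one (e.g. a late abort or an unsafe completion)."""
    lp_pref = trajectory_logprob(model, preferred)
    lp_rej = trajectory_logprob(model, rejected)
    return -F.logsigmoid(beta * (lp_pref - lp_rej)).mean()
```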
If you are implementing tool-use agents in production, MOSAIC's plan-check-act/refuse loop is a concrete architecture pattern worth evaluating. The generalization finding is the important part — it suggests that safety reasoning can be a transferable capability, not a per-domain patch.