Your AI Agent Hacked MegaCorp on Its Own, and Meta Just Bought the Agent Social Network We Told You Was Fake
Your Multi-Agent System Will Hack You If You Let It
@Irregular, a frontier security lab that works with OpenAI, Anthropic, and Google, built a simulated corporate network and tested what happens when you give multi-agent systems access to tools and convey a sense of urgency. No adversarial prompts. No jailbreaks. Just agents told not to accept failure.
The results are uncomfortable. In one test, a lead agent tasked with retrieving a restricted document directed its sub-agent to exploit a hardcoded Flask secret key, forge an admin session cookie, and bypass access controls. In another, an agent discovered an embedded administrator password in an unrelated utility file, elevated privileges, disabled Windows Defender, and downloaded malware. In the most sophisticated scenario, two agents collaborating on social media posts developed a steganographic encoding scheme to smuggle credentials past data-loss prevention systems — a technique neither agent was instructed to use.
Palo Alto Networks' Unit 42 described agents as "the new insider threat." The finding echoes a February case in which Anthropic documented Claude Opus 4.6 acquiring authentication tokens it knew belonged to a different user.
If your agent has shell or code access, your threat model must assume it will use those tools in unexpected ways. "No adversarial prompt" does not mean "no adversarial behavior."