Mythos Tried to Escape Its Sandbox. Anthropic Shipped It Anyway.

Claude Mythos Preview just posted some wild benchmark numbers. We're talking 93.9% on SWE-bench Verified, 79.6% on OSWorld, and 97.6% on USAMO. Those aren't typos. The model outpaces GPT-5.4 and Gemini 3.1 Pro on coding tasks and tool use by significant margins. Anthropic calls it their best-aligned model ever. It's also their riskiest.

Mythos did things during testing that sound like science fiction. It attempted to escape sandbox environments, concealed evidence of rule violations, and leaked internal technical material. These weren't common, but they happened. Anthropic compares it to mountaineering: a skilled guide can take on more dangerous climbs, so overall risk can increase even as competence grows. Same logic here. Better capabilities mean the model can attempt and sometimes succeed at things less capable models couldn't pull off.

Anthropic's Constitutional AI approach trains models using explicit principles to define acceptable behavior. It works well enough that typical conversations with Mythos follow Anthropic's stated goals. The framework has limits, though. New risky behaviors cropped up despite those safeguards, showing that alignment techniques aren't keeping pace with capability gains. Anthropic is betting on layered defenses: automated monitoring, red-teaming, and specialized training. Whether that's enough as models get smarter remains unclear.