The UK's AI Security Institute just pulled the air out of Anthropic's balloon. New testing shows OpenAI's GPT-5.5 performs about as well as Anthropic's Mythos Preview on cybersecurity benchmarks, scoring 71.4% on AISI's Expert-level tasks compared to Mythos's 68.6%. That difference falls within the margin of error. The AISI ran both models through 95 Capture the Flag challenges testing reverse engineering, web exploitation, and cryptography skills. GPT-5.5 even solved a Rust binary disassembler task in 10 minutes flat with zero human help, at a cost of $1.73 in API calls.
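The "margin of error" claim checks out on a back-of-envelope basis. The sketch below treats each of the 95 CTF challenges as an independent pass/fail trial (an assumption for illustration; AISI's actual scoring may weight tasks or average multiple attempts) and compares the score gap to the standard error of the difference between two proportions.

```python
import math

# Rough check of "within the margin of error", assuming each of the
# 95 CTF tasks is an independent pass/fail trial (an assumption; the
# published percentages suggest AISI's scoring is more involved).
n = 95
p_gpt, p_mythos = 0.714, 0.686

def stderr(p, n):
    # Standard error of a binomial proportion.
    return math.sqrt(p * (1 - p) / n)

diff = p_gpt - p_mythos
se_diff = math.sqrt(stderr(p_gpt, n) ** 2 + stderr(p_mythos, n) ** 2)
print(f"gap: {diff:.3f}, standard error of the gap: {se_diff:.3f}")
```

The 2.8-point gap comes out well under one standard error of the difference (roughly 6.6 points at this sample size), so the two scores are statistically indistinguishable.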

Both models did something no AI had done before on "The Last Ones" test, a 32-step simulated data extraction attack. GPT-5.5 pulled it off in 3 of 10 attempts. Mythos managed 2 of 10. Previous models never succeeded once. Neither model passed the "Cooling Tower" simulation, which models an attack on power plant control software. Nothing has beaten that one yet.

AISI's takeaway: Mythos isn't some unique breakthrough. Its cybersecurity chops are "a byproduct of more general improvements in long-horizon autonomy, reasoning, and coding." That's a direct hit to Anthropic's positioning. The company restricted Mythos Preview's initial release to "critical industry partners" and talked up the cybersecurity risk. OpenAI CEO Sam Altman called that approach "fear-based marketing" on the Core Memory podcast, adding that saying "we have built a bomb" and then selling the bomb shelter is "incredible marketing."

But OpenAI plays the same game. The company limits its own cybersecurity models, GPT-5.4-Cyber and the upcoming GPT-5.5-Cyber, to vetted defenders through its Trusted Access program. Meanwhile, the real story is reinforcement learning with verifiable rewards (RLVR). The technique trains models by checking their outputs against verified solutions and feeding those results back into training as a reward signal. It's driving rapid gains in what agents can do, and organizations are building specialized cybersecurity models with RLVR without any public disclosure or safety restrictions.
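The reward-feedback shape of RLVR can be sketched as a toy REINFORCE-style bandit. Everything here is an illustrative assumption, not AISI's or any lab's actual setup: the "model" is a softmax policy over three candidate solutions, and the verifier is an exact-match check against a known-correct answer. Real RLVR fine-tunes a language model, but the loop is the same: sample, verify programmatically, feed the reward back.

```python
import math
import random

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def verify(candidate, answer):
    # Programmatic verifier: the reward comes from checking the work,
    # not from human preference labels.
    return candidate == answer

def rlvr_step(logits, candidates, answer, lr=1.0):
    probs = softmax(logits)
    i = random.choices(range(len(candidates)), weights=probs)[0]
    reward = 1.0 if verify(candidates[i], answer) else 0.0
    # REINFORCE-style update: raise the log-probability of sampled
    # outputs in proportion to their verified reward.
    for j in range(len(logits)):
        grad = (1.0 if j == i else 0.0) - probs[j]
        logits[j] += lr * reward * grad
    return reward

random.seed(0)
candidates, answer = [3, 4, 5], 4  # stand-in for a task with a checkable answer
logits = [0.0, 0.0, 0.0]
for _ in range(200):
    rlvr_step(logits, candidates, answer)
probs = softmax(logits)
print(probs.index(max(probs)))  # the policy now favors the verified candidate
```

The design point worth noticing: nothing in the loop needs a human in it. Anyone with tasks that have machine-checkable answers, including exploit development against a test harness, can run this recipe, which is why the lack of disclosure matters.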