Matt Dupree has identified what he calls a fatal flaw in Anthropic's attempt to prove that their Mythos model's SWE-bench improvements aren't just memorization. Anthropic's system card features a graph showing Mythos outperforming Opus 4.6 across various memorization confidence thresholds. Their conclusion: since the gains persist regardless of threshold, memorization can't explain them. Dupree says that logic doesn't hold.
He built a Python simulation to prove it. The simulation models Opus 4.6 at 80% accuracy and a cheating Mythos whose entire 10-point gain comes from memorization. The trick is that the LLM-based memorization detector is imperfect: LLMs rarely give probability estimates below 5% or above 90%, so its scores are noisy and compressed, and some memorized problems slip under every threshold. Dupree's simulation produces graphs nearly identical to Anthropic's even though all of the gains are fake. The larger model simply draws higher memorization probability estimates because it's bigger and trained on more data.
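Here is a minimal sketch of that style of simulation, not Dupree's actual code: the problem count, score distributions, and 40% miss rate are illustrative assumptions, as is the choice to make memorization independent of ability. What it reproduces is the shape of the argument: the accuracy gap survives every threshold even though it is entirely memorization.

```python
# A sketch in the spirit of Dupree's simulation (assumed numbers throughout).
import numpy as np

rng = np.random.default_rng(0)
N = 20_000                                  # benchmark problems (assumed)

# Baseline ("Opus 4.6"): solves 80% of problems on ability alone.
baseline_correct = rng.random(N) < 0.80

# Cheating model ("Mythos"): identical ability, plus a memorized subset it
# always gets right. With memorization independent of ability, covering half
# the benchmark yields the full 10-point gain (0.5 * 20% = 10 points).
memorized = rng.random(N) < 0.50
cheater_correct = baseline_correct | memorized

# Imperfect LLM-based detector: scores are squashed into [0.05, 0.90], and it
# simply misses a chunk of memorized problems, scoring them like clean ones
# (40% miss rate assumed here).
missed = rng.random(N) < 0.40
low = rng.uniform(0.05, 0.30, N)            # scores given to "clean-looking" problems
high = rng.uniform(0.50, 0.90, N)           # scores given to "suspicious-looking" problems
detector_score = np.where(memorized & ~missed, high, low)

# System-card style analysis: at each confidence threshold, drop problems the
# detector flags as likely memorized and compare the two models.
for t in (0.9, 0.7, 0.5, 0.3, 0.1):
    keep = detector_score < t
    gap = cheater_correct[keep].mean() - baseline_correct[keep].mean()
    print(f"threshold {t:.1f}: kept {keep.mean():5.0%} of problems, gap {gap:+.1%}")
```

With these assumed numbers the printout shows a gap of roughly ten points with no filtering and still five to six points at the strictest thresholds, even though the gain is 100% memorization.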
Anthropic acknowledges their detector is imperfect, but they never say how imperfect. Without that measurement, a gap that persists across detection thresholds tells you nothing. As Dupree puts it, citing the detector as evidence should hold zero weight until we know its error rate.
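To see why the error rate is the whole ballgame, here is rough arithmetic reusing the assumptions from the sketch above: the gap that survives filtering scales directly with the detector's miss rate, so the same residual gap is consistent with anything from negligible to severe contamination.

```python
# Hypothetical miss rates f, same assumed setup as the sketch above.
m = 0.5                                # memorized fraction (assumed)
for f in (0.05, 0.20, 0.40):
    kept_memorized = f * m             # memorized problems that survive filtering
    kept_total = kept_memorized + (1 - m)
    # residual gap = memorized share of kept problems * the 20% ability alone misses
    residual_gap = 0.2 * kept_memorized / kept_total
    print(f"miss rate {f:.0%} -> residual gap after filtering {residual_gap:.1%}")
```

The residual gap ranges from about one point to nearly six points depending only on the unmeasured miss rate, which is exactly the quantity the system card never reports.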
This isn't just about Mythos. The whole field is struggling with benchmark contamination. Princeton's SWE-bench creators have documented how heavily leaderboard performance is influenced by data that leaked into training sets. Standard deduplication pipelines using n-gram overlap can't catch semantic memorization. So the ecosystem has moved toward live evaluations like LMSYS's Chatbot Arena and dynamic benchmarks that generate fresh problems on the fly.
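To make the deduplication point concrete, here is an illustrative example (mine, not from the SWE-bench work): a word n-gram overlap check of the kind used in standard dedup pipelines flags a verbatim copy but scores a light paraphrase of the same issue as completely disjoint.

```python
# Illustrative n-gram dedup check (assumed example strings).
def ngram_overlap(a: str, b: str, n: int = 3) -> float:
    """Jaccard overlap of word n-grams between two strings."""
    def ngrams(text: str) -> set[tuple[str, ...]]:
        words = text.lower().split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}
    ga, gb = ngrams(a), ngrams(b)
    return len(ga & gb) / len(ga | gb) if ga | gb else 0.0

original   = "Fix the bug where the parser raises KeyError when the config file is missing the timeout field"
verbatim   = original
paraphrase = "Resolve the crash: a KeyError is thrown by the parser if the timeout entry is absent from the configuration"

print(ngram_overlap(original, verbatim))     # 1.0 -> caught by dedup
print(ngram_overlap(original, paraphrase))   # 0.0 -> slips through
```

A model that memorized the original issue can still solve the paraphrased version, and the n-gram filter reports no overlap at all, which is the gap that live and dynamically generated benchmarks are meant to close.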