Long-context retrieval in Claude Opus 4.7 dropped from 91.9% to 59.2% compared to Opus 4.6, a steep regression: the model is markedly worse at finding information buried deep in long documents. The gains came instead in software engineering and math.
Anthropic also disclosed a training bug in the model card. A preprocessing defect failed to strip internal reasoning tokens before examples were fed to the model, causing accidental chain-of-thought supervision in 7.8% of training episodes. In effect, the model was trained to output its reasoning traces rather than just producing correct final answers. The same bug affected Mythos Preview.
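The model card doesn't describe the pipeline, but the missing step is easy to picture. A minimal sketch, assuming reasoning spans are wrapped in delimiter tags (the tag name and function names here are hypothetical, not from the disclosure):

```python
import re

# Hypothetical delimiters for internal reasoning spans; the actual
# token format used in training is not specified in the model card.
REASONING_RE = re.compile(r"<reasoning>.*?</reasoning>", flags=re.DOTALL)

def strip_reasoning(example: str) -> str:
    """Remove internal reasoning spans so only the final answer is supervised."""
    return REASONING_RE.sub("", example).strip()

raw = "<reasoning>chain of thought here</reasoning>The answer is 42."
print(strip_reasoning(raw))  # -> The answer is 42.
```

If a step like this is skipped for a fraction of examples, the reasoning text stays in the training target and the model learns to emit it, which matches the accidental supervision described above.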
Hacker News users called the transparency unusual and welcome. If you need a model to reason over massive documents, Opus 4.6 may still be the better choice.