Stanford's 2026 AI Index Report dropped this week, and the big story is how fast the landscape has shifted. The U.S.-China gap in AI model performance has basically closed. DeepSeek-R1 briefly matched the top U.S. model in February 2025, and as of March 2026, Anthropic's best model leads by just 2.7%. That's razor thin. The U.S. still produces more top-tier models and higher-impact patents, while China leads in publication volume, citations, and patent output overall. South Korea leads the world in AI patents per capita, according to the report.

Researchers call this the "jagged frontier" of AI. Gemini Deep Think earned a gold medal at the International Mathematical Olympiad. Impressive. But the top model reads analog clocks correctly just 50.1% of the time. AI agents' success rate on OSWorld, a benchmark of real computer tasks, jumped from 12% to roughly 66%, yet they still fail about one in three attempts. We have models that can do PhD-level science but can't reliably tell time. The gap between what AI excels at and what it struggles with keeps getting weirder.

Safety isn't keeping pace with capability. Documented AI incidents rose to 362 in 2025, up from 233 in 2024, according to Stanford HAI. Most frontier model developers report capability benchmark results, but responsible AI benchmark reporting remains spotty. Research also found that improving one dimension, like safety, can degrade another, like accuracy. The trade-offs are real and underreported.

GPT-5 mini uses three times the energy of GPT-4o, contradicting the assumption that smaller models would be more efficient. The culprit is inference-heavy architecture: these mini models compensate for fewer parameters with extended chain-of-thought reasoning, multiple sampling attempts, and more computation per token. Eight medium-length messages consume about 10 Wh, roughly the energy of two phone charges. Model size alone doesn't tell you the environmental cost; inference efficiency matters more than parameter count.
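The energy arithmetic above can be checked with a back-of-envelope calculation. This is a minimal sketch, assuming the report's figure of about 10 Wh per eight medium-length messages, and assuming roughly 5 Wh per phone charge (the value the "two phone charges" comparison implies; actual phone batteries vary):

```python
# Back-of-envelope check of the reported inference energy figures.
# Assumptions (not from any official dataset): 10 Wh per 8-message
# session, ~5 Wh per phone charge as implied by the comparison.
WH_PER_SESSION = 10.0       # eight medium-length messages
MESSAGES_PER_SESSION = 8
WH_PER_PHONE_CHARGE = 5.0   # assumed typical charge, for comparison only

wh_per_message = WH_PER_SESSION / MESSAGES_PER_SESSION
phone_charges = WH_PER_SESSION / WH_PER_PHONE_CHARGE

print(f"{wh_per_message:.2f} Wh per message")  # 1.25 Wh per message
print(f"{phone_charges:.1f} phone charges")    # 2.0 phone charges
```

At roughly 1.25 Wh per message, the per-query cost looks small; it is the aggregate across billions of queries, and the extra reasoning tokens per query, that drives the totals the report flags.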