Simon Willison's 'pelican riding a bicycle' SVG benchmark just produced a result nobody expected. Alibaba's Qwen3.6-35B-A3B, running locally on his MacBook Pro M5 through LM Studio, drew a better pelican than Anthropic's latest, Opus 4.7. Opus couldn't even get the bicycle frame right. Willison tried a backup prompt too, 'flamingo riding a unicycle,' and gave that round to Qwen as well, partly because the model slipped a comment about sunglasses into its SVG code.
Willison is the first to admit his pelican benchmark has always been a joke. A funny one, and one that historically correlated with overall model quality. That correlation is now broken. He doesn't think a 21GB quantized model running on a laptop is more useful than Opus 4.7 for general tasks. The numbers agree. On coding benchmarks, Qwen3.6-35B-A3B solves 11 out of 98 tasks. Opus 4.7 nails 95 out of 98. Not close.
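
To put the gap in percentage terms, here is a minimal sketch that turns the two task counts quoted above into pass rates. The helper name `pass_rate` and the variable names are illustrative only; the sole inputs are the 11, 95, and 98 figures from the text.

```python
def pass_rate(solved: int, total: int) -> float:
    """Return the share of solved tasks as a percentage."""
    return 100.0 * solved / total


TOTAL_TASKS = 98  # task count quoted in the text

qwen_rate = pass_rate(11, TOTAL_TASKS)  # Qwen3.6-35B-A3B: 11 of 98
opus_rate = pass_rate(95, TOTAL_TASKS)  # Opus 4.7: 95 of 98
gap_points = opus_rate - qwen_rate      # difference in percentage points

print(f"Qwen: {qwen_rate:.1f}%  Opus: {opus_rate:.1f}%  gap: {gap_points:.1f} points")
# prints: Qwen: 11.2%  Opus: 96.9%  gap: 85.7 points
```

That works out to roughly 11% against 97%, a spread of about 86 percentage points.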