Jacky Liang, OpenRouter's new dev-rel lead, dropped eleven large language models into a 2D battle royale and made them play 30 games. Grok 4.1 Fast won 13 of them at US$0.97 per win. The next best, Claude Sonnet 4.6, took 5 wins at US$26.78 each, a gap of more than 27x on the metric a routing customer actually pays for.
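
The cost-per-win gap can be checked from the quoted figures alone. The sketch below is illustrative only: the helper names are hypothetical, and the "implied spend" line assumes cost per win was computed as total spend divided by wins, which the write-up does not spell out.

```python
# Figures quoted above: (wins out of 30 games, US$ per win).
GAMES = 30
results = {
    "Grok 4.1 Fast": (13, 0.97),
    "Claude Sonnet 4.6": (5, 26.78),
}


def implied_spend(wins: int, cost_per_win: float) -> float:
    """Implied total spend, ASSUMING cost per win = total spend / wins."""
    return wins * cost_per_win


def cost_gap(table: dict, cheap: str, pricey: str) -> float:
    """Ratio of the pricier model's cost per win to the cheaper model's."""
    return table[pricey][1] / table[cheap][1]


gap = cost_gap(results, "Grok 4.1 Fast", "Claude Sonnet 4.6")
print(f"cost-per-win gap: {gap:.1f}x")

for name, (wins, cost_per_win) in results.items():
    spend = implied_spend(wins, cost_per_win)
    print(f"{name}: win rate {wins / GAMES:.0%}, implied spend US${spend:.2f}")
```

Under that assumption the two models' total spend differs by roughly 10x, a smaller ratio than the per-win gap, because the cheaper model also won more often.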

The detail worth keeping is that the model which won was not the model that fought best. GPT 5.4 racked up 38 kills across the 30 games and still never placed first. Claude, meanwhile, spent its matches broadcasting its own position, asking rivals to team up and trying to make friends, behaviour that loses a deathmatch but is closer to what you want when an agent is wired into a workflow full of other people.

Liang's point is that the leaderboard and the use case pull apart. A model can be cheap and lethal in a closed game and still be the wrong default wherever agents work alongside people. Worth reading before you pick a model off a benchmark table and assume the ranking transfers.