Local AI's real argument was never the benchmark

Alex Ellis has receipts, and they don't say what the headline says.

The founder of OpenFaaS spent close to US$12,000 on a single Nvidia RTX PRO 6000 Blackwell card with 96GB of memory, ran open-weight models on it across his infrastructure business for the better part of a year, and on 17 June published a long, unusually honest account of what he got for the money. His conclusion is in the title: local Qwen isn't a worse Opus, it's a different tool. The interesting part is buried below it. The work that paid for the card had almost nothing to do with how good the model is at coding.

The pitch

The popular version goes like this. A small open-weight model now scores within a rounding error of the frontier, so a six-year-old GPU under your desk can retire your US$200-a-month coding plan. The numbers give it some cover. Alibaba's Qwen3.6-27B, a dense 27-billion-parameter model shipped under an Apache 2.0 licence, posts a SWE-bench Verified score of 77.2 against Claude Opus sitting in the high 80s. That is a gap you can squint past, and plenty of people have, declaring local "only twelve per cent behind SOTA" and posting one-shot demos to prove it.

Ellis is not that person, which is what makes the post worth reading. He sells privacy and sovereignty infrastructure for a living and badly wants local models to win, and he still says flatly that Qwen is nowhere near Opus levels. So why is the card still running?

What the receipts actually show

Two jobs, neither of them coding. The first: a renewal came up, Ellis fed the customer's telemetry database into the local model, and it surfaced that the customer had been under-reporting licences and under-paying by a factor of four to five for over a year. That single recovery paid for the card. The second: customers email in a diagnostic dump from their OpenFaaS install, and his team runs it through the model inside an airgapped, ephemeral VM.

Both jobs share one property. Ellis would not put that data through any cloud plan at any retention setting. As he puts it, even the 30-day retention you can configure on a ChatGPT Pro or Claude Max contract "likely invalidates your contracts with customers." The model didn't win because it was clever. It won because it was the only model allowed in the room.

That reframes the whole debate. The discourse treats local versus cloud as a capability gap that time and better quantisation will close. Ellis's receipts say the deciding variable was never capability. It was control: whose hardware, whose retention policy, whose decision the day the model gets pulled. On capability he is brutal about the limits. He watched Qwen read 27.3K as 273,000, invent a customer's churn risk, and loop on a trivial CLI task for half an hour, burning 600 watts while it reprinted the same five commands. He won't hand it long-horizon unsupervised work, and he says so plainly.

The strongest case against

Here is the argument that undercuts him, fairly stated. For almost everyone, renting the frontier is simply cheaper and better, and the card is a US$12,000 answer to a question most people don't have. The cost calculators back this up: below roughly 50 million tokens a month, hosted APIs win on price, and a budget frontier model like DeepSeek V4 undercuts a self-hosted rig by an order of magnitude before you've paid a cent for electricity, NVLink bridges, or the weekend you spent compiling llama.cpp from source. The looping and the arithmetic failures aren't quirks of a separate category of tool; they're what being worse looks like. And the privacy moat is eroding from the other side, as Bedrock, Azure and on-prem frontier deployments court exactly the regulated customers Ellis is describing. He runs a sovereignty-infrastructure company, the objection finishes, so of course he reaches for the sovereign answer. He is an edge case rationalising a luxury purchase.

Most of that is correct, and most of it is beside the point. The cost camp is right about the median developer and wrong about the constrained one. Ellis isn't claiming the card pencils out for a hobbyist running it at single-digit tokens per second; he's claiming it pencils out for a business holding customer data it is contractually forbidden to hand to a third party. For him that is an obligation written into a contract, not a personal taste for privacy. And the vendor-risk half of his argument stopped being hypothetical recently: Anthropic's Fable 5 model was pulled from foreign users overnight, which is precisely the "what if the frontier labs do X" scenario that a weight file on your own disk immunises you against. The capability ceiling is real, and he bounds it with care. Local is for work that is narrow, supervised and read-heavy: reading and explaining a codebase, chewing through telemetry, running an airgapped diag. The looping is the evidence for his thesis, not against it. A chisel that snaps when you drive nails with it isn't a bad hammer; it's a chisel.

Cost belongs in the same story, even though Ellis is careful not to lean on it. The rented-frontier deal looks cheap partly because it is subsidised, and subsidies move. When GitHub shifted Copilot to token-based billing on 1 June, heavy users watched their agentic bills jump tenfold to fiftyfold overnight for the same work they had been doing the week before. A fixed asset in a spare room carries an ugly upfront price and a boring, knowable running cost. That predictability is its own kind of control.
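That knowable running cost is simple enough to put in a few lines. The sketch below is illustrative arithmetic, not Ellis's own accounting: only the US$12,000 card price and the 600-watt draw come from his post, while the amortisation period, duty cycle, electricity tariff and hosted price per million tokens are placeholder assumptions to swap for your own.

```python
HOURS_PER_MONTH = 730.0  # average hours in a calendar month


def monthly_local_cost(card_usd, amortise_months, avg_watts, usd_per_kwh):
    """Fixed monthly cost of a self-hosted card: amortised hardware plus energy."""
    hardware = card_usd / amortise_months
    energy_kwh = (avg_watts / 1000.0) * HOURS_PER_MONTH
    return hardware + energy_kwh * usd_per_kwh


def break_even_tokens(monthly_cost_usd, api_usd_per_million_tokens):
    """Monthly token volume above which the fixed local cost beats a metered API."""
    return monthly_cost_usd / api_usd_per_million_tokens * 1_000_000


if __name__ == "__main__":
    # Assumptions, not figures from the post: 36-month write-off, the card
    # averaging half of its 600 W peak, US$0.30 per kWh, US$8 per million tokens.
    local = monthly_local_cost(12_000, 36, avg_watts=300, usd_per_kwh=0.30)
    tokens = break_even_tokens(local, api_usd_per_million_tokens=8.0)
    print(f"local: US${local:,.2f}/month, break-even: {tokens:,.0f} tokens/month")
```

The point of the exercise is less the answer than its shape. Every input on the local side is a number you set yourself, whereas the hosted price per token is set by someone else and can change between invoices.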

The part the leaderboard misses

There is a second-order point in the post that almost nobody pulls out. The moment a second person on Ellis's team used the local model, it stopped being a model and became infrastructure. Who is on which llama.cpp instance? How much have they used? Which fine-tune? What did it cost at the wall? What happens when they leave? He ended up writing an access provider for the harness, wiring in metered smart plugs to track power draw, and routing between a stable base model and experimental fine-tunes. "This is where local AI turns into an operations problem," he writes, and lists the parts: identity, access control, quotas, routing, monitoring.
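To make that list concrete, here is a minimal in-memory sketch of the same five concerns. It is not Ellis's access provider, whose design the post does not spell out; the class names, backend URLs and quota scheme are all hypothetical, and a real version would sit in front of actual inference servers and read from a real smart plug.

```python
from dataclasses import dataclass, field


class QuotaExceeded(Exception):
    """Raised when a user has spent their monthly token allowance."""


@dataclass
class User:
    name: str
    monthly_token_quota: int
    experimental: bool = False  # may this user reach experimental fine-tunes?
    tokens_used: int = 0


@dataclass
class Gateway:
    # Hypothetical backends: one stable base model, one experimental fine-tune.
    backends: dict = field(default_factory=lambda: {
        "stable": "http://127.0.0.1:8080",
        "experimental": "http://127.0.0.1:8081",
    })
    users: dict = field(default_factory=dict)

    def add_user(self, user):
        self.users[user.name] = user

    def revoke(self, name):
        # "What happens when they leave": access ends with one call.
        self.users.pop(name, None)

    def route(self, name, want_experimental=False):
        """Identity, access control, quota and routing in a single check."""
        user = self.users.get(name)
        if user is None:
            raise PermissionError(f"unknown user: {name}")
        if user.tokens_used >= user.monthly_token_quota:
            raise QuotaExceeded(name)
        tier = "experimental" if want_experimental and user.experimental else "stable"
        return tier, self.backends[tier]

    def record(self, name, tokens):
        # Monitoring: accumulate per-user usage after each request.
        self.users[name].tokens_used += tokens


def energy_cost_usd(watt_samples, interval_seconds, usd_per_kwh):
    """Cost at the wall from evenly spaced smart-plug readings in watts."""
    joules = sum(watt_samples) * interval_seconds
    return joules / 3_600_000 * usd_per_kwh  # 3.6 MJ per kWh
```

Even at this toy scale the trade is visible: none of this code touches the model, and all of it has to exist before a second person can safely use the card.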

That is the real difference between the two tiers, and it is the thing the frontier labs are actually selling. Pay them and all of it vanishes into the invoice. Run it yourself and you own every piece of it, the autonomy and the pager both. The benchmark argument is a sideshow because the benchmark was never the cost. The cost is the operations discipline you sign up for the day you decide the data can't leave the building.

The bet

So here is the falsifiable version. The local tier does not live or die on whether the next Qwen closes the SWE-bench gap to Opus. It lives or dies on whether a frontier lab can neutralise the control argument: a hosted deployment that credibly promises your data trains nothing, never leaves your jurisdiction, and will not be deprecated out from under you mid-contract. The pieces exist today in fragments: zero-retention windows, private cloud regions, dedicated capacity. None of them yet adds up to a single promise an enterprise lawyer will sign against a no-third-party-processing clause.

The day one does, the local tier collapses back into a hobby and Ellis's US$12,000 card becomes a very loud space heater. Until then the split holds, and the work worth watching isn't the next leaderboard. It's the dull layer underneath: the metering, the routing, the identity glue that turns a GPU in a spare room into something a team can depend on. The model was never the hard part.