Qwen just dropped a model it claims beats the company's previous flagship at coding, at roughly one-fourteenth the size: Qwen3.6-27B weighs in at 55.6GB, against 807GB for Qwen3.5-397B-A17B. The claim is that it surpasses the larger model across all major coding benchmarks. If true, that's a big deal, because it means something competitive with massive models can run on consumer hardware. (Qwen separately touted record-breaking token throughput for its recent Qwen-3.6-Plus model.)

Simon Willison tested the 16.8GB quantized GGUF version from Unsloth's Hugging Face repository using llama-server. Running locally, he reported generation at 25.57 tokens per second; the aggressive quantization is what lets a 27B model fit on a consumer system. He tried creative tasks too, prompting the model for SVG images, including a pelican riding a bicycle and a Virginia opossum on an e-scooter. Willison called the pelican result "outstanding for a 16.8GB local model." That one produced 4,444 tokens in about three minutes; the opossum ran longer, 6,575 tokens over four and a half minutes.
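llama-server exposes an OpenAI-compatible HTTP API, so once the model is loaded you can script prompts against it. Here's a minimal sketch in Python; the port (llama-server's default, 8080), the token limit, and the prompt wording are illustrative assumptions, not Willison's exact setup:

```python
# Query a running llama-server instance over its OpenAI-compatible API.
# Assumes the server is already up on the default port 8080.
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "messages": [
            {"role": "user",
             "content": "Generate an SVG of a pelican riding a bicycle"}
        ],
        "max_tokens": 8192,  # the SVG outputs ran to thousands of tokens
    },
    timeout=600,  # at ~25 tokens/second, local generation can take minutes
)
print(resp.json()["choices"][0]["message"]["content"])
```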

The architecture difference matters here. Qwen3.6-27B is a dense model: it uses all of its parameters on every token. The larger Qwen3.5-397B-A17B is a mixture-of-experts (MoE) design that activates only a fraction of its parameters at a time (about 17B of the 397B, per the "A17B" in its name). MoE buys more capacity per active parameter, but you still have to store and load the whole thing. A dense model matching an MoE at a fraction of the size would change what's practical to run on your own hardware.
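The memory math behind that trade-off is easy to sanity-check. This back-of-envelope sketch takes the parameter counts from the model names and assumes bytes-per-weight figures for each precision (16-bit weights at 2.0 bytes, a roughly 5-bit GGUF quant at about 0.62); it lands within a few percent of the file sizes quoted above:

```python
# Back-of-envelope memory math for dense vs. mixture-of-experts models.
# Parameter counts come from the model names; bytes-per-weight values
# are assumptions, not measured figures.

def weights_gb(params_b: float, bytes_per_weight: float) -> float:
    """Approximate size of the weights alone, in gigabytes."""
    return params_b * 1e9 * bytes_per_weight / 1e9

models = [
    # (name, total params in billions, active params per token in billions)
    ("Qwen3.6-27B (dense)",      27.0,  27.0),
    ("Qwen3.5-397B-A17B (MoE)", 397.0,  17.0),
]

for name, total, active in models:
    print(f"{name}:")
    print(f"  stored at 16-bit: ~{weights_gb(total, 2.0):.0f} GB")
    print(f"  stored at ~5-bit: ~{weights_gb(total, 0.62):.0f} GB")
    print(f"  active per token:  {active:.0f}B of {total:.0f}B params")

# 27B  * 2.0 bytes ~=  54 GB  (article: 55.6GB full-size release)
# 27B  * 0.62      ~=  17 GB  (article: 16.8GB quantized GGUF)
# 397B * 2.0 bytes ~= 794 GB  (article: 807GB) -- all of it must be
# stored, even though only ~17B parameters fire on any given token.
```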

Willison's hands-on covered creative generation, not the coding benchmarks Qwen is touting, and no independent benchmark results have surfaced yet. For anyone wanting to try it now, though, Willison shared his full llama-server configuration on his blog, crediting a recipe from Hacker News user benob.