Moonshot AI just open-sourced Kimi K2.6. Early benchmarks put it toe-to-toe with GPT-5.4 and Claude Opus 4.6 on coding and parallel agent tasks, backed by enterprise evaluations from Vercel, Ollama, and others. Jerilyn Zheng, PM for Vercel AI, reports a more than 50% improvement on their Next.js benchmark versus K2.5, putting it "among the top-performing models on the platform." CodeBuddy's internal eval team measured a 12% bump in code generation accuracy, 18% better long-context stability, and a 96.6% tool invocation success rate.

The demos are where things get wild. K2.6 spent 12 hours autonomously deploying the Qwen3.5-0.8B model locally on a Mac, writing inference code in Zig (a niche systems language few developers use), and pushing throughput from roughly 15 to 193 tokens per second, about 20% faster than LM Studio. In another run, it overhauled exchange-core, an aging open-source financial matching engine. Over 13 hours, it modified more than 4,000 lines of code across 12 optimization strategies and delivered a 185% throughput gain on medium workloads. It even reconfigured the core thread topology after analyzing CPU flame graphs, the kind of work usually reserved for senior systems engineers.
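For scale, the reported numbers are worth a back-of-envelope check. This sketch only restates the article's figures; the interpretation of "185% leap" as a 2.85x multiplier is the conventional reading, not something the article spells out:

```python
# Back-of-envelope check of the throughput figures reported above.
baseline_tps = 15   # tokens/sec before optimization (approximate, per the article)
final_tps = 193     # tokens/sec after K2.6's Zig rewrite

speedup = final_tps / baseline_tps
print(f"Zig project speedup: {speedup:.1f}x")   # ~12.9x

# The exchange-core result is stated as a percentage gain:
# a 185% throughput improvement means roughly 2.85x the original rate.
gain_pct = 185
multiplier = 1 + gain_pct / 100
print(f"exchange-core multiplier: {multiplier:.2f}x")
```

The 15-to-193 jump is the more dramatic number: a nearly 13x speedup, versus under 3x for the matching-engine work.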

Bola Malek, Head of Labs, says K2.6 "excels on coding agent tasks at a level comparable to leading closed source models." Robert Rizk, a cofounder and CEO who tested the model, calls it a new bar for "long-horizon, coding agent workflows." The 96.6% tool invocation success rate matters more than most benchmark scores here: if an agent can't reliably call tools across thousands of operations, it can't do real work. K2.6 made over 4,000 tool calls in the Zig project alone and kept going.
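The compounding effect of per-call reliability is easy to quantify. A rough sketch using the article's figures; the assumption that calls fail independently is mine, and real agents recover by retrying:

```python
# Why per-call tool reliability compounds over long agent runs.
# Assumes each call succeeds independently with probability p_success
# (a simplification; actual agents retry or route around failures).

def expected_failures(p_success: float, n_calls: int) -> float:
    """Expected number of failed calls across a run of n_calls."""
    return n_calls * (1 - p_success)

def prob_flawless_run(p_success: float, n_calls: int) -> float:
    """Probability that every single call in the run succeeds."""
    return p_success ** n_calls

p, n = 0.966, 4000  # figures reported for the Zig project
print(expected_failures(p, n))   # ~136 calls needing retry or recovery
print(prob_flawless_run(p, n))   # effectively zero without any recovery
```

Under these assumptions, even a 96.6% success rate implies roughly 136 failed calls over a 4,000-call session, so what the long runs really demonstrate is recovery: the agent kept working through failures rather than never hitting one.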

The Hacker News crowd is already calling this another "DeepSeek moment," with commenters noting that Chinese AI labs are now neck-and-neck with top US models, trailing the frontier by only days. Whether that framing holds up under broader testing is another question. But the pattern is real: open-source models that sustain multi-hour autonomous coding sessions aren't a novelty anymore, as parallel agent swarms like Imbue's have already shown. They're arriving on schedule.