Xiaomi pushes a 1-trillion-parameter model to 1000 tokens a second on commodity GPUs

Xiaomi's MiMo team says a 1-trillion-parameter model now decodes at more than 1000 tokens per second, peaking near 1200, on a single standard eight-GPU node.

The trick is doing it without exotic hardware. Where Cerebras uses wafer-scale chips and Groq uses on-chip SRAM, MiMo and its systems partner TileRT stayed on commodity GPUs through two moves. The first is FP4 (MXFP4) quantisation applied only to the Mixture-of-Experts weights, which carry most of the parameters and tolerate it best. The second is a block-level speculative decoder called DFlash, which accepts an average of 6.3 of every 8 drafted tokens on coding tasks. The UltraSpeed API costs three times the standard MiMo price for roughly ten times the speed.
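
To put those figures in proportion, here is a rough back-of-envelope sketch in Python. It is not Xiaomi's or TileRT's method, only arithmetic on the numbers quoted above. The function names are made up for illustration. The MXFP4 width (4-bit elements plus one shared 8-bit scale per block of 32) comes from the OCP Microscaling format, and the memory figure is an upper bound that pretends every parameter is an MoE weight, since the actual split is not given.

```python
# Figures quoted in the article.
TOTAL_PARAMS = 1e12          # "1-trillion-parameter model"
DECODE_TOKENS_PER_S = 1000   # reported decode speed
DRAFTED_PER_STEP = 8         # tokens drafted per block
ACCEPTED_PER_STEP = 6.3      # average accepted on coding tasks
PRICE_MULTIPLE = 3           # UltraSpeed price relative to standard
SPEED_MULTIPLE = 10          # UltraSpeed speed relative to standard (approximate)

# MXFP4 per the OCP Microscaling format: 4-bit elements plus one shared
# 8-bit scale for every block of 32 elements.
MXFP4_BITS_PER_WEIGHT = 4 + 8 / 32   # 4.25 effective bits
BF16_BITS_PER_WEIGHT = 16


def weight_gigabytes(params: float, bits_per_weight: float) -> float:
    """Storage for `params` weights at the given width, in GB (10^9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


def acceptance_rate(accepted: float, drafted: float) -> float:
    """Fraction of drafted tokens that the verifier keeps."""
    return accepted / drafted


def verify_steps_per_second(tokens_per_s: float, tokens_per_step: float) -> float:
    """Verification passes per second implied by a decode rate.

    Assumption: each pass emits only the accepted draft tokens. Many
    speculative decoders also emit one extra token from the verifier per
    pass; the article does not say whether 6.3 includes it, so it is
    ignored here.
    """
    return tokens_per_s / tokens_per_step


if __name__ == "__main__":
    # Upper bound: treats all parameters as MXFP4, which overstates the saving
    # because only the MoE weights are quantised.
    print(f"BF16 weights:  {weight_gigabytes(TOTAL_PARAMS, BF16_BITS_PER_WEIGHT):.2f} GB")
    print(f"MXFP4 weights: {weight_gigabytes(TOTAL_PARAMS, MXFP4_BITS_PER_WEIGHT):.2f} GB")
    print(f"Acceptance rate: {acceptance_rate(ACCEPTED_PER_STEP, DRAFTED_PER_STEP):.4f}")
    steps = verify_steps_per_second(DECODE_TOKENS_PER_S, ACCEPTED_PER_STEP)
    print(f"Implied verification passes per second: {steps:.1f}")
    print(f"Price paid per unit of speed vs standard: {PRICE_MULTIPLE / SPEED_MULTIPLE:.2f}x")
```

Under these assumptions the weights shrink to roughly a quarter of their 16-bit size, and the large model only needs to run on the order of 160 verification passes per second to reach 1000 tokens per second. Real numbers will differ with the share of non-MoE weights and the details of DFlash.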

Speed at this scale changes what a model is for. Xiaomi pitches running dozens of reasoning paths in parallel and dropping trillion-parameter models into real-time loops, and the coding-agent case is the obvious one: less waiting on the model, more iterating with it. The FP4-DFlash checkpoint is on Hugging Face, so the claims are checkable.