DeepSeek open-sources its speculative-decoding stack, claims up to 80% faster generation

DeepSeek has released DSpark, a speculative-decoding module that bolts onto its V4 Flash and Pro checkpoints, and open-sourced DeepSpec, an MIT-licensed codebase for training and evaluating the draft models that make the technique work.

DSpark is not a new model: it keeps the same weights and attaches a small "draft" model that proposes tokens, which the big model then verifies in parallel. DeepSeek reports acceptance lengths 16.3% to 30.9% longer than those of rival drafters Eagle3 and DFlash, along with throughput gains of 51% to 400% and lower latency. The buried cost sits in DeepSpec's own README: building the target cache needed to train a draft for a model as small as Qwen3-4B takes roughly 38 TB of storage.
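The draft-then-verify loop described above can be illustrated with a toy sketch of greedy speculative decoding. This is not DSpark's or DeepSpec's code: the "models" are hypothetical stand-in functions (`toy_target`, `toy_draft`) that map a token context to a next token, and the per-position loop in step 2 stands in for the single batched forward pass a real target model would run. The sketch also computes a mean "tokens per target pass" figure, which is one common way to define acceptance length; DeepSeek's exact definition may differ.

```python
from typing import Callable, List, Tuple

# Hypothetical model interface: context tokens -> greedy next token.
Model = Callable[[List[int]], int]


def speculative_step(target: Model, draft: Model, context: List[int], k: int) -> List[int]:
    """One round of greedy speculative decoding; returns the tokens committed."""
    # 1. The cheap draft proposes k tokens autoregressively.
    proposal: List[int] = []
    for _ in range(k):
        proposal.append(draft(context + proposal))

    # 2. The target predicts all k+1 positions. A loop stands in here for the
    #    single parallel forward pass a real implementation would use.
    target_preds = [target(context + proposal[:i]) for i in range(k + 1)]

    # 3. Keep draft tokens only while they match the target's own choice.
    accepted: List[int] = []
    for i in range(k):
        if proposal[i] != target_preds[i]:
            break
        accepted.append(proposal[i])

    # 4. The target always adds one token: the correction at the first
    #    mismatch, or a bonus token if every draft token was accepted.
    accepted.append(target_preds[len(accepted)])
    return accepted


def generate(target: Model, draft: Model, prompt: List[int],
             n_tokens: int, k: int = 4) -> Tuple[List[int], float]:
    """Decode n_tokens speculatively; also return mean tokens per target pass."""
    out = list(prompt)
    steps = 0
    while len(out) - len(prompt) < n_tokens:
        out += speculative_step(target, draft, out, k)
        steps += 1
    generated = out[len(prompt):len(prompt) + n_tokens]
    return generated, (len(out) - len(prompt)) / max(steps, 1)


def greedy(target: Model, prompt: List[int], n_tokens: int) -> List[int]:
    """Plain one-token-per-pass greedy decoding, for comparison."""
    out = list(prompt)
    for _ in range(n_tokens):
        out.append(target(out))
    return out[len(prompt):]


# Hypothetical toy models: the draft copies the target except at every
# fourth context length, where it deliberately guesses wrong.
def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] * 7 + len(ctx)) % 11


def toy_draft(ctx: List[int]) -> int:
    guess = toy_target(ctx)
    return guess if len(ctx) % 4 else (guess + 1) % 11


if __name__ == "__main__":
    spec, tau = generate(toy_target, toy_draft, [1, 2], 20, k=4)
    # Verification makes the output identical to plain greedy decoding.
    assert spec == greedy(toy_target, [1, 2], 20)
    print(f"mean tokens per target pass: {tau:.2f}")
```

The key property is that verification makes the output match what the target alone would produce; the draft only changes how many tokens each expensive target pass commits. When the draft is cheap relative to the target, a longer acceptance length translates fairly directly into higher throughput, which is why acceptance-length gains are the headline metric for drafters.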

Speculative decoding is becoming a commodity layer that anyone can train against Qwen or Gemma, but the storage and compute needed to produce a good draft model keep it well out of hobbyist reach.