Google DeepMind has a new way to train large AI models across distributed data centers. Called Decoupled DiLoCo, the system splits training into separate "islands" of compute that communicate asynchronously. If hardware fails in one island, the rest keep training. Failed components reintegrate automatically when they come back online. In testing with Gemma 4 models, the approach matched conventional training performance while using far less bandwidth and surviving hardware failures that would tank traditional setups.
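
The article doesn't spell out the algorithm's internals, but the published DiLoCo recipe it builds on works roughly like this: each island runs many local optimizer steps on its own shard, then ships a single parameter delta ("pseudo-gradient") for an infrequent outer update. Below is a minimal toy simulation of that pattern, including a dropped-and-rejoined island. Everything here is illustrative: the island count, step counts, and learning rates are made up, plain SGD stands in for the inner AdamW and outer Nesterov updates DiLoCo actually uses, and the loop is synchronous per round where the real system is asynchronous.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear-regression task shared by all islands.
X, true_w = rng.normal(size=(512, 8)), rng.normal(size=8)
y = X @ true_w

def loss_grad(w, idx):
    """Gradient of mean-squared error on a mini-batch of indices."""
    Xb, yb = X[idx], y[idx]
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(idx)

n_islands, inner_steps, outer_rounds = 4, 50, 20
inner_lr, outer_lr = 0.01, 0.7

global_w = np.zeros(8)
shards = np.array_split(rng.permutation(len(X)), n_islands)

for round_ in range(outer_rounds):
    deltas = []
    for island, shard in enumerate(shards):
        # Simulated hardware failure: island 3 drops out of round 5.
        # The other islands keep training; the straggler contributes
        # nothing this round and reintegrates automatically next round
        # by pulling the latest global weights -- no global restart.
        if island == 3 and round_ == 5:
            continue
        w = global_w.copy()              # pull current global params
        for _ in range(inner_steps):     # local compute, zero comms
            batch = rng.choice(shard, size=32)
            w -= inner_lr * loss_grad(w, batch)
        deltas.append(global_w - w)      # pseudo-gradient for this round
    # Outer step: only this delta crosses the slow inter-region link,
    # once per round instead of a gradient exchange every single step.
    global_w -= outer_lr * np.mean(deltas, axis=0)

print("final loss:", np.mean((X @ global_w - y) ** 2))
```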

The dominant approach to AI training relies on massive centralized superclusters. OpenAI and Meta pack tens of thousands of GPUs into single facilities, connected by expensive high-speed interconnects such as NVIDIA's NVLink and InfiniBand. Every chip must stay in near-perfect synchronization, which demands billions of dollars in infrastructure and creates a huge barrier to entry.

Decoupled DiLoCo demonstrates that you don't need that tight coupling. Google DeepMind trained a 12-billion-parameter model across four US regions and matched the ML performance of conventional methods while training 20 times faster and using orders of magnitude less bandwidth.
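
The bandwidth math is intuitive: a synchronous cluster exchanges gradients after every step, while islands that communicate only once per round of local steps amortize that cost across the whole round. A back-of-envelope sketch, where the sync interval and byte counts are illustrative assumptions rather than figures from DeepMind's report:

```python
# Rough bandwidth comparison (illustrative numbers, not DeepMind's).
params = 12e9          # 12B-parameter model, as in the article
bytes_per_param = 2    # assume bf16 gradients/deltas
inner_steps = 100      # assumed local steps between syncs

per_step_sync = params * bytes_per_param      # synchronous: every step
diloco_sync = per_step_sync / inner_steps     # amortized over a round

print(f"synchronous: {per_step_sync / 1e9:.0f} GB exchanged per step")
print(f"DiLoCo-style: {diloco_sync / 1e9:.2f} GB per step (amortized)")
print(f"reduction: {per_step_sync / diloco_sync:.0f}x before compression")
```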

Not everyone is convinced this is a game-changer. Distributing work across distant clusters and combining the results isn't new; similar patterns have existed in non-AI distributed computing for years. What makes this meaningful is applying the pattern to LLM training with the right algorithmic modifications: organizations without billions to spend on single-site supercomputers could still train competitive models.

NVIDIA's competitive moat depends partly on bundling high-performance networking with its silicon. If cross-data-center training in the style of DiLoCo becomes standard, that advantage narrows: the edge shifts from who builds the biggest centralized cluster to who writes the best distributed training software. Good news for everyone except companies selling expensive interconnect hardware.