Google DeepMind just showed you don't need one giant data center to train frontier AI models. Their new system, Decoupled DiLoCo, splits training work across separate "islands" of compute that talk to each other asynchronously. In testing with Gemma 4 models, it matched conventional training performance while running 20x faster and surviving hardware failures that would crash traditional setups.
The architecture challenges how the industry currently builds AI. Training frontier models today requires thousands of identical GPUs packed into one facility, connected by expensive, low-latency links like NVIDIA's NVLink and InfiniBand. xAI's Colossus cluster, with 100,000 GPUs, is the clearest example of this monolithic approach. Decoupled DiLoCo says you can skip all that. The team, led by Arthur Douillard, demonstrated that mixing different chip generations (TPU v6e and TPU v5p) works fine once the synchronization requirements are decoupled.
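To make the decoupling idea concrete, here is a minimal toy sketch of a DiLoCo-style loop, not DeepMind's actual code: each island runs many cheap local SGD steps on its own data shard, and only the resulting weight deltas ("pseudo-gradients") cross the slow network for an infrequent outer update. The model (a single scalar fitting y = 3x), the step counts, and the learning rates are all illustrative assumptions, and plain momentum stands in for whatever outer optimizer the real system uses.

```python
import random

# Toy sketch of a DiLoCo-style inner/outer loop (illustrative, not DeepMind's code).
# Model: one scalar w fitting y = 3x. Each "island" trains locally; only the
# averaged weight delta (pseudo-gradient) is communicated per outer round.

def local_steps(w, shard, lr=0.05, H=20):
    """Run H local SGD steps on one island's shard; no cross-island traffic here."""
    for _ in range(H):
        x, y = random.choice(shard)
        w -= lr * 2 * (w * x - y) * x      # gradient of (w*x - y)^2
    return w

def outer_round(global_w, shards, m, outer_lr=0.7, beta=0.5):
    """Average each island's pseudo-gradient, then apply an outer momentum step."""
    deltas = [global_w - local_steps(global_w, s) for s in shards]
    pseudo_grad = sum(deltas) / len(deltas)  # the only value that crosses islands
    m = beta * m + pseudo_grad               # outer momentum (stand-in choice)
    return global_w - outer_lr * m, m

random.seed(0)
shards = [[(1.0, 3.0), (2.0, 6.0)], [(0.5, 1.5), (1.5, 4.5)]]  # two islands
w, m = 0.0, 0.0
for _ in range(50):
    w, m = outer_round(w, shards, m)
# w converges toward 3.0 despite only one float exchanged per island per round
```

The point of the sketch: communication happens once per outer round instead of once per gradient step, which is why the islands can tolerate slow links between data centers, and why nothing forces them to be identical hardware.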
The resilience angle matters. Large training runs fail constantly. Google used "chaos engineering" to simulate hardware failures during tests, and Decoupled DiLoCo kept training while other approaches cratered. Failed units rejoined the system when they came back online. For anyone building AI infrastructure, this means you can train across distributed data centers on mixed hardware without the whole job dying when something breaks.
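The rejoin behavior described above can be simulated in a few lines. This is a hypothetical chaos-style sketch under the same toy setup as before, not Google's test harness: each round, some islands randomly "crash"; the coordinator averages pseudo-gradients from whichever islands report back, and crashed islands rejoin simply by pulling the latest global weights next round.

```python
import random

# Hypothetical failure-tolerance sketch (not Google's harness): islands crash
# at random each round; the outer update uses only the survivors' reports.
# Toy model again: one scalar w fitting y = 3x.

def local_delta(global_w, x, lr=0.05, steps=20):
    """One island's pseudo-gradient: how far local SGD moved it off the global model."""
    w = global_w
    for _ in range(steps):
        w -= lr * 2 * x * (w * x - 3.0 * x)
    return global_w - w

random.seed(1)
islands = [1.0, 2.0, 0.5]   # each island's (toy) data point
w = 0.0
for _ in range(40):
    alive = [x for x in islands if random.random() > 0.3]  # ~30% crash rate
    if not alive:
        continue            # nobody reported; global model just waits
    deltas = [local_delta(w, x) for x in alive]
    w -= 0.7 * sum(deltas) / len(deltas)
# Training still drives w toward 3.0; rejoining needs no special recovery
# logic because every island restarts each round from the global weights.
```

In a lockstep all-reduce setup, any one of those simulated crashes would stall or kill the whole job; here a crash just means one fewer delta in that round's average.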