Google just split its TPU chip line in two. The eighth generation of its custom AI silicon will come in two flavors: one for training models and one for running them. It's a bet that inference, the work of actually serving AI to users, has become important enough to deserve its own purpose-built hardware. The TPU 8i inference chip packs 384MB of SRAM, triple the previous generation's, a design Google CEO Sundar Pichai said is built to "deliver the massive throughput and low latency needed to concurrently run millions of agents cost-effectively."

This split makes sense now. AI agents with real autonomy burn more compute than simple chatbots. Each interaction needs a fast response, and serving millions of agents at once demands hardware built for that exact job. Orchestrating those agents, keeping every concurrent session fed with tokens, has become a central challenge in this space.
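Some back-of-envelope arithmetic makes the scale concrete. Here's a minimal sketch where every constant is an illustrative assumption, not a disclosed figure:

```python
# Rough sizing of an agent-serving fleet. All constants below are
# illustrative assumptions, not figures from Google.

CONCURRENT_AGENTS = 1_000_000    # agents mid-task at the same moment
TOKENS_PER_AGENT_PER_S = 30      # decode rate that feels responsive
CHIP_THROUGHPUT = 50_000         # assumed tokens/s one inference chip sustains

total_tps = CONCURRENT_AGENTS * TOKENS_PER_AGENT_PER_S
chips = total_tps / CHIP_THROUGHPUT

print(f"Aggregate demand: {total_tps:,.0f} tokens/s")
print(f"Chips required:   {chips:,.0f}")
```

Even with a generous per-chip throughput, aggregate demand lands in the tens of millions of tokens per second, which is why cost per served token, not peak training performance, is the number this chip is chasing.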

The company says the training chip delivers 2.8 times the performance of its seventh-generation Ironwood TPU at the same price, while the inference processor is 80% faster than its predecessor. Google isn't comparing these numbers to Nvidia. That's telling. Nvidia still dominates AI hardware, and every cloud giant building custom chips is also buying Nvidia by the truckload. Anthropic and Citadel Securities are already running on Google's TPUs, along with all 17 U.S. Energy Department national laboratories. DA Davidson analysts pegged the TPU business plus Google DeepMind at roughly $900 billion in value last September.

The inference chip puts Google in direct competition with startups like Groq and Cerebras, which have built their entire businesses around ultra-fast inference. Those companies have real architectural advantages. Groq uses compiler-controlled data flow to avoid memory bottlenecks, while Cerebras puts everything on a single massive wafer. Google's advantage is scale and integration: TPUs are already baked into Google Cloud, and the company has been refining this silicon for over a decade. Other approaches to cost-effective inference include shared GPU pools, which split a single accelerator across many users to lower the cost of entry.
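The common thread in those designs, and in Google's tripled SRAM, is memory bandwidth. At small batch sizes, generating each token means streaming roughly every model weight through the chip once, so decode speed is capped by how fast weights can be read. A rough sketch of that ceiling, using hypothetical chip and model numbers rather than anything from Google, Groq, or Cerebras:

```python
# Roofline-style bound on single-stream decode speed. At small batch
# sizes, every generated token requires streaming (roughly) all model
# weights through the compute units once, so memory bandwidth, not
# FLOPs, sets the ceiling. All numbers are illustrative assumptions.

MODEL_BYTES = 70e9 * 2       # hypothetical 70B-parameter model in bf16
HBM_BW = 3.0e12              # ~3 TB/s, typical of a recent HBM stack
SRAM_BW = 100e12             # on-chip SRAM: orders of magnitude faster

def decode_ceiling(bandwidth: float) -> float:
    """Max tokens/s for one stream when weight reads dominate."""
    return bandwidth / MODEL_BYTES

print(f"HBM-bound:  {decode_ceiling(HBM_BW):7.1f} tokens/s")
print(f"SRAM-bound: {decode_ceiling(SRAM_BW):7.1f} tokens/s")
```

In practice a model that size is sharded across many chips rather than held in one chip's SRAM, but the same arithmetic applies shard by shard, which is why keeping weights in SRAM instead of streaming them from off-chip memory pays off.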

Whether specialized chips or cloud-scale platforms win the inference market is the hardware question that will shape how fast AI agents can actually scale.