SemiAnalysis spent five months benchmarking AMD's MI300X against Nvidia's H100 and H200. The results aren't pretty for AMD. Despite better specs on paper and lower theoretical costs, the MI300X can't compete in real AI training workloads. The reason is simple: AMD's software stack is a mess.
The researchers, led by Dylan Patel, Daniel Nishball, and Reyk Knuhtsen, worked directly with AMD engineers to identify and fix bugs. AMD's team was responsive. They shipped fixes. But the public ROCm software that regular developers use remains riddled with issues. Out-of-the-box training is, in SemiAnalysis's words, "impossible." Compare that to Nvidia's CUDA ecosystem, which keeps getting better with new features and libraries. The gap is widening.
Microsoft Azure has made the MI300X work, but only by building its own custom drivers, runtime libraries, and HPC tools rather than relying on AMD's public software. Azure sees up to 3.5x performance gains on certain workloads. This proves the hardware isn't the problem. The problem is AMD's software development culture and QA processes.
SemiAnalysis recommends that AMD CEO Lisa Su fundamentally change how the company approaches software, not just spend more money on it. The benchmarks are open-source. The bugs are documented. AMD has a path forward, but it requires admitting that throwing hardware engineers at a software problem won't work.