The hardware inside your laptop is absurdly fast. Your software probably isn't. That gap between what silicon can do and what code actually does is the subject of a new article by Caer Sanders on Martin Fowler's site, and it's worth your time if you build systems that need to move data quickly. Sanders walks through "mechanical sympathy," a term borrowed from Formula 1 champion Sir Jackie Stewart and brought to software by Martin Thompson back in 2011. The idea is simple: understand how your hardware actually works, and write code that plays to its strengths.
The principles are concrete and practical. Memory isn't flat. CPUs cache data in layers (registers, L1, L2, L3, then RAM), and predictable sequential access blows random access out of the water. Sanders points out that a sequential scan over a source database in an ETL pipeline will consistently beat querying entries one at a time by key.
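To make the access-pattern difference concrete, here is a minimal Go microbenchmark (my own sketch, not code from the article): both loops compute the same sum over the same data, but the sequential scan walks memory in cache-line order while the shuffled-key version jumps around the way one-row-at-a-time key lookups do.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// sumSequential reads the slice front to back; the hardware prefetcher
// keeps the next cache lines warm.
func sumSequential(data []int64) int64 {
	var total int64
	for _, v := range data {
		total += v
	}
	return total
}

// sumByKey reads the same values in an arbitrary order, defeating the
// prefetcher the way per-key lookups against a database do.
func sumByKey(data []int64, keys []int) int64 {
	var total int64
	for _, k := range keys {
		total += data[k]
	}
	return total
}

func main() {
	data := make([]int64, 1<<22) // 32 MB, larger than most L3 caches
	keys := make([]int, len(data))
	for i := range data {
		data[i] = int64(i)
		keys[i] = i
	}
	rand.Shuffle(len(keys), func(i, j int) { keys[i], keys[j] = keys[j], keys[i] })

	start := time.Now()
	seq := sumSequential(data)
	seqTime := time.Since(start)

	start = time.Now()
	random := sumByKey(data, keys)
	randTime := time.Since(start)

	fmt.Printf("sequential: %v  random: %v  (same sum: %v)\n",
		seqTime, randTime, seq == random)
}
```

On typical hardware the random-order loop is several times slower, even though it does exactly the same arithmetic.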
Cache lines are 64-byte chunks, and when two CPU cores write to different variables sitting in the same chunk, they fight over it through the shared L3 cache. That's false sharing, and it makes latency climb roughly linearly with thread count. Pad the contended variables with empty space so each one sits on its own cache line, and latency stays flat. Then there's the single-writer principle: if a piece of state gets written to, only one thread should do the writing. Mutexes and locks are expensive, and avoiding them entirely beats optimizing them every time.
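A sketch of the padding trick in Go (illustrative, with my own type names; note that Go doesn't strictly guarantee 64-byte alignment of heap objects, so this makes collisions unlikely rather than impossible): two goroutines each increment their own counter, and the only difference between the two runs is memory layout.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// sharedCounters packs both hot fields into the same 64-byte cache line,
// so two cores writing to them ping-pong the line between their caches.
type sharedCounters struct {
	a int64
	b int64
}

// paddedCounter follows its 8-byte value with 56 bytes of padding so the
// next counter in an array starts on a fresh cache line.
type paddedCounter struct {
	value int64
	_     [56]byte
}

// incrementBoth runs two goroutines that each hammer their own counter;
// the counters are logically independent, but their memory layout decides
// whether the cores contend.
func incrementBoth(n int, a, b *int64) {
	var wg sync.WaitGroup
	wg.Add(2)
	go func() {
		defer wg.Done()
		for i := 0; i < n; i++ {
			atomic.AddInt64(a, 1)
		}
	}()
	go func() {
		defer wg.Done()
		for i := 0; i < n; i++ {
			atomic.AddInt64(b, 1)
		}
	}()
	wg.Wait()
}

func main() {
	const n = 5_000_000

	var shared sharedCounters
	start := time.Now()
	incrementBoth(n, &shared.a, &shared.b) // same cache line: false sharing
	fmt.Println("unpadded:", time.Since(start))

	var padded [2]paddedCounter
	start = time.Now()
	incrementBoth(n, &padded[0].value, &padded[1].value) // separate lines
	fmt.Println("padded:  ", time.Since(start))
}
```

The program's results are identical either way; only the timing changes, which is exactly why false sharing is so easy to miss.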
This matters for AI agent developers because inference serving is exactly where these bottlenecks show up. Most AI runtimes handle one inference call at a time. Sanders describes a naive HTTP service that wraps an ONNX embedding model behind a mutex; when requests pile up, they queue behind that lock and hit head-of-line blocking. The mechanically sympathetic approach, assigning all writes to a single thread and batching requests naturally, sidesteps the problem. Sanders used these ideas to build AI inference platforms at Wayfair serving millions of products, and the LMAX Architecture proved years ago that a single Java thread could process millions of events per second when you stop fighting the hardware. If you're building agent infrastructure that needs to serve real traffic, these aren't optional optimizations. They're the basics.
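The single-writer serving pattern can be sketched in a few lines of Go (this is my own illustration of the idea, not Sanders' implementation; `embedBatch` is a dummy stand-in for a real batched ONNX call): callers never take a lock, and one goroutine owns the model, draining whatever has queued up into a single batch.

```go
package main

import "fmt"

// request carries one input and a channel for its result.
type request struct {
	input string
	reply chan []float32
}

// embedBatch is a placeholder for a real batched model call; it returns a
// dummy one-element "embedding" per input.
func embedBatch(inputs []string) [][]float32 {
	out := make([][]float32, len(inputs))
	for i, s := range inputs {
		out[i] = []float32{float32(len(s))}
	}
	return out
}

// serve is the single writer: only this goroutine ever touches the model,
// so no mutex is needed. It drains everything currently queued, then runs
// one batched call, so batch size grows naturally under load.
func serve(requests chan request) {
	for first := range requests {
		batch := []request{first}
	drain:
		for {
			select {
			case r := <-requests:
				batch = append(batch, r)
			default:
				break drain
			}
		}
		inputs := make([]string, len(batch))
		for i, r := range batch {
			inputs[i] = r.input
		}
		for i, emb := range embedBatch(inputs) {
			batch[i].reply <- emb
		}
	}
}

func main() {
	requests := make(chan request, 1024)
	go serve(requests)

	reply := make(chan []float32, 1)
	requests <- request{input: "hello", reply: reply}
	fmt.Println(<-reply) // prints "[5]"
}
```

Under light load each batch holds one request; when requests pile up, the drain loop scoops them all into a single model call instead of making them queue behind a lock.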