ZML just dropped zml-smi, a monitoring tool that works across NVIDIA GPUs, AMD GPUs, Google TPUs, and AWS Trainium chips. If you've ever juggled nvidia-smi, nvtop, and vendor-specific utilities across different machines, you can see the appeal. One binary, every accelerator. The tool pulls utilization, temperature, memory usage, and process info across all platforms.
Where it gets interesting: AMD support. AMD's library expects a hardware ID file at a specific system path, but ZML wanted a fully sandboxed tool that doesn't touch anything outside its own directory. Their solution? Intercept fopen64 calls through a custom shared object and redirect file access to a bundled copy. It works, but Hacker News commenter mrflop called it "a brittle hack masquerading as sandboxing" and questioned why ZML didn't upstream their work to nvtop instead of fragmenting the ecosystem.
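For readers who haven't seen this trick, here's a minimal sketch of the general technique: a shared object that defines its own fopen64, rewrites one path, and hands everything else to the real libc function. This is not ZML's code. It assumes Linux with glibc and an LD_PRELOAD-style load (the source doesn't say how ZML's shared object gets loaded), and both file paths plus the redirect_path helper are hypothetical placeholders.

```c
/* Sketch of a fopen64 interposer. Assumes Linux + glibc.
 * Build: cc -shared -fPIC -o libredirect.so redirect.c -ldl
 * Run:   LD_PRELOAD=./libredirect.so ./some-monitoring-tool
 */
#define _GNU_SOURCE          /* exposes RTLD_NEXT and the fopen64 prototype */
#include <dlfcn.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical paths for illustration; not the ones zml-smi uses. */
#define SYSTEM_IDS_PATH  "/usr/share/example/hardware.ids"
#define BUNDLED_IDS_PATH "./data/hardware.ids"

/* Hypothetical helper: map the one system path we care about to the
 * bundled copy and pass every other path through unchanged. */
static const char *redirect_path(const char *path) {
    if (path != NULL && strcmp(path, SYSTEM_IDS_PATH) == 0)
        return BUNDLED_IDS_PATH;
    return path;
}

typedef FILE *(*fopen64_fn)(const char *, const char *);

/* Same name and signature as libc's fopen64, so the dynamic linker
 * resolves the vendor library's calls to this definition first. */
FILE *fopen64(const char *path, const char *mode) {
    static fopen64_fn real_fopen64 = NULL;

    /* Look up the next fopen64 in link order (normally libc's). */
    if (real_fopen64 == NULL)
        real_fopen64 = (fopen64_fn)dlsym(RTLD_NEXT, "fopen64");
    if (real_fopen64 == NULL)
        return NULL;         /* no underlying implementation found */

    return real_fopen64(redirect_path(path), mode);
}
```

The sketch also shows where the brittleness objection comes from. The hook only fires if the vendor library opens that exact path through that exact symbol. If a future release switches to fopen, open, or openat, or moves the file, the redirect silently stops applying.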
That criticism lands. nvtop already supports TPUs through libtpuinfo, as that library's developer rdyro pointed out. A separate tool is one more thing for DevOps teams to install, update, and track. But ZML's founding philosophy, per CEO Hossein Zare, is hardware agnosticism. Owning their tooling lets them support new hardware like AMD's Ryzen AI Max+ 395 before official ROCm releases catch up. The real question: do you need universal monitoring badly enough to add another tool to your stack?