Imbue just published a case study showing how they run over 100 Claude agents simultaneously to test and document their internal tool, mngr. The setup works like this: start with a tutorial script, convert each command block into a pytest function, then assign a dedicated agent to each test. Each agent runs its test, debugs failures, improves the code, and writes its results to a JSON file. At the end, another agent merges everything into one pull request.
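A minimal sketch of that per-test loop, assuming a local filesystem for the JSON result files. The command names here are stand-ins (the real mngr CLI is internal to Imbue and not public), and `run_command_test`/`merge_results` are hypothetical helpers, not Imbue's actual code:

```python
import json
import subprocess
from pathlib import Path

# Hypothetical command blocks extracted from a tutorial script.
# In Imbue's setup these would be mngr invocations; plain shell
# commands are used here so the sketch runs anywhere.
TUTORIAL_COMMANDS = {
    "test_echo_hello": ["echo", "hello"],
    "test_list_dir": ["ls", "."],
}

def run_command_test(name: str, cmd: list[str], results_dir: Path) -> dict:
    """Run one command block as a test and write the outcome to a JSON
    file, mirroring the per-agent result files in the case study."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    result = {
        "test": name,
        "command": " ".join(cmd),
        "passed": proc.returncode == 0,
        "stdout": proc.stdout.strip(),
    }
    (results_dir / f"{name}.json").write_text(json.dumps(result, indent=2))
    return result

def merge_results(results_dir: Path) -> list[dict]:
    """Collect every per-test JSON file into one report, the way the
    final merge agent would before opening a pull request."""
    return [json.loads(p.read_text()) for p in sorted(results_dir.glob("*.json"))]
```

In the real system, the loop body between running the command and writing the JSON is where an agent would debug failures and improve the code; this sketch only captures the pass/fail plumbing around it.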
What's clever is how they use agent failures as a signal. When a coding agent can't generate a valid example from the documentation, that's a red flag that the interface itself is confusing. They treat it as a usability test: if an AI struggles, a human probably will too. These struggles become a feedback loop for improving the product.
The architecture differs from single-agent systems like Cognition's Devin where the coding harness matters more than the model. Instead of one autonomous agent working through tasks sequentially, Imbue runs a swarm where each test gets its own agent working in parallel, similar to how CI pipelines distribute jobs. They use Modal for remote execution when scaling beyond local development. The trade-off is coordination overhead. Merging changes from 100 agents into one PR isn't trivial, even for an AI.
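The fan-out/fan-in shape of that swarm can be sketched locally with a thread pool standing in for remote workers. In Imbue's setup each worker would be a Claude agent running on Modal; `run_agent` here is a hypothetical placeholder for "run the test, debug, write results":

```python
from concurrent.futures import ThreadPoolExecutor

def run_agent(test_id: int) -> dict:
    # Placeholder for one agent's work: run its assigned test,
    # debug failures, and report a result.
    return {"test_id": test_id, "status": "passed"}

def run_swarm(num_tests: int) -> list[dict]:
    """Fan out one worker per test, then fan results back in."""
    with ThreadPoolExecutor(max_workers=num_tests) as pool:
        results = list(pool.map(run_agent, range(num_tests)))
    # The coordination overhead lives in this single fan-in step:
    # every agent's output must be reconciled into one artifact,
    # analogous to merging 100 agents' changes into one PR.
    return sorted(results, key=lambda r: r["test_id"])
```

The design choice mirrors CI job distribution: workers share nothing while running, so all conflict resolution is deferred to the merge step at the end.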