Abhishek Ray has run Claude Code workshops for over 100 engineers in the last six months. Teams that used to merge 10 PRs weekly are now merging 40 to 50. Same reviewers. Same hours. A Latent Space analysis found teams with high AI adoption merge 98% more pull requests but spend 91% more time in review. The output doubled. The review burden nearly doubled too. An economics paper by Catalini, Hui, and Wu explains why this is hard to fix: the cost to automate is falling fast, but the cost to verify is biologically bounded. You can 10x your output. You can't 10x your reviewers.

AI-generated code is harder to review because it's too clean. When humans write bugs, they leave traces: weird variable names, confused comments, structure that doesn't fit. AI writes idiomatic, well-commented code. The surface is smooth. The bugs are buried. Reviewers have to dig deeper, not shallower. Then there's the confidence gap. A frontend engineer asks Claude to write a database query, gets back something that looks correct, and isn't qualified to know if it is. A METR study from mid-2025 found developers thought AI made them 20% faster. It actually made them 19% slower on measurable tasks. Clean output feels like progress even when it isn't.
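
To make the smooth-surface problem concrete, here's a hypothetical sketch (every name below is invented, not from the article): the function is typed, documented, and idiomatic, and a skim reads it as an expiry check. The comparison is still wrong.

```typescript
// Hypothetical illustration of "too clean to review": looks right, isn't.

interface Session {
  userId: string;
  createdAt: Date;
  ttlHours: number;
}

/**
 * Returns sessions that are still valid.
 * Idiomatic, documented, readable -- and subtly wrong.
 */
function activeSessions(sessions: Session[], now: Date): Session[] {
  return sessions.filter((s) => {
    // BUG: getHours() returns the local hour-of-day (0-23), not elapsed
    // hours. The line *reads* like an age check and survives a casual skim.
    const ageHours = now.getHours() - s.createdAt.getHours();
    return ageHours < s.ttlHours;
  });
}

// Correct version for contrast: compute elapsed time from timestamps.
function activeSessionsFixed(sessions: Session[], now: Date): Session[] {
  return sessions.filter(
    (s) => (now.getTime() - s.createdAt.getTime()) / 3_600_000 < s.ttlHours
  );
}
```

A human who wrote the buggy version would likely leave other tells nearby; here the surrounding polish is exactly what hides it.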

Ray, who builds verification tooling at Opslane, says the fix requires three things: tests as a foundation, human-written acceptance criteria before AI starts, and agents verifying agents. "Build a login page" is a prompt. "Users authenticate with email and password, receive a specific error on wrong credentials, land on /dashboard, session expires after 24 hours" is something a machine can check. One CTO he spoke with has all three layers running. Critical changes get human eyes. Routine work closes automatically.
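
To show what "something a machine can check" looks like in practice, here's a minimal sketch of those acceptance criteria as an end-to-end test suite. It assumes Playwright with a configured baseURL; the selectors, error copy, and the `session` cookie name are invented for the example, not taken from the article.

```typescript
import { test, expect } from '@playwright/test';

// Criterion 1: users authenticate with email and password, land on /dashboard.
test('valid credentials land on /dashboard', async ({ page }) => {
  await page.goto('/login');
  await page.fill('input[name=email]', 'user@example.com');
  await page.fill('input[name=password]', 'correct-horse');
  await page.click('button[type=submit]');
  await expect(page).toHaveURL('/dashboard');
});

// Criterion 2: wrong credentials produce a specific error.
test('wrong credentials show the specific error', async ({ page }) => {
  await page.goto('/login');
  await page.fill('input[name=email]', 'user@example.com');
  await page.fill('input[name=password]', 'wrong');
  await page.click('button[type=submit]');
  await expect(page.getByText('Invalid email or password')).toBeVisible();
});

// Criterion 3: session expires after 24 hours. Checked via the session
// cookie's lifetime rather than waiting a day.
test('session cookie lives ~24 hours', async ({ page, context }) => {
  await page.goto('/login');
  await page.fill('input[name=email]', 'user@example.com');
  await page.fill('input[name=password]', 'correct-horse');
  await page.click('button[type=submit]');
  const session = (await context.cookies()).find((c) => c.name === 'session');
  expect(session).toBeDefined();
  // Cookie expiry is seconds since epoch; assert a roughly 24h lifetime.
  const lifetimeHours = (session!.expires - Date.now() / 1000) / 3600;
  expect(lifetimeHours).toBeGreaterThan(23);
  expect(lifetimeHours).toBeLessThanOrEqual(24);
});
```

Nothing here is clever. That's the point: each sentence of the acceptance criteria maps to one assertion an agent can run on every change.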

Legacy static-analysis tools like SonarQube and Snyk are scrambling to catch the subtle logic bugs AI introduces. Newer platforms like Graphite, Qodo, and Rivet AI are targeting the review bottleneck directly. Companies like Cognition and Opslane are pushing toward autonomous agents that execute and interact with code in real time instead of relying on static analysis.

The common objections don't hold up. AI writes tests too? Same blind spots. Same agent, same context gaps. Just hire more reviewers? Experienced reviewers are scarce, and asking senior engineers to manually review AI-generated boilerplate wastes their time. The real bottleneck is upstream, figuring out what to build? Valid, but it's a separate problem. Even if you're building exactly the right things, you still need to know that what shipped matches what you intended. Teams that skip verification infrastructure are accumulating what an MIT paper calls the "Trojan Horse" externality: deploying unverified systems becomes rational for each team even as systemic risk grows.
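
The first objection is worth one more sketch (all names and the spec below are invented): a test generated from the implementation asserts what the code does, not what the spec says, so it passes while the bug ships.

```typescript
import { strict as assert } from 'node:assert';

// Hypothetical spec: convert dollars to cents using round-half-to-even
// (banker's rounding). The implementation reaches for Math.round, which
// rounds .5 up instead.
function toCents(amount: number): number {
  return Math.round(amount * 100); // spec called for half-to-even
}

// A test derived from the same implementation encodes the same misreading.
// It is green, and the bug ships.
assert.equal(toCents(0.125), 13); // spec expects 12 (half-to-even)
console.log('tests pass');
```

The test and the code share one source of truth: the code. Independent acceptance criteria, written before the agent starts, are what break that loop.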