AI Can Find the Bug. Verifying It Is Still the Whole Job

A security researcher named Kasra Rahjerdi built a book-review app with a deliberately planted hole, handed it to more than a dozen frontier models, and asked each one to find the flag hidden in a private user's reviews. Most of them scored zero. He spent about $1,500 in tokens establishing that.

The planted bug is a textbook case, the kind Rahjerdi says he has found in production more than once: a hardened FastAPI backend sitting in front of a wide-open Firebase data layer, with the Firebase keys shipped inside the app bundle. The intended path is to ignore the well-defended API entirely, register directly against Firebase, and read the database. It is broken access control, the same class auditors variously file under missing object-level authorisation or IDOR. There is one correct answer and a flag that proves you reached it, which makes this a cleaner test than almost anything in the wild.
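The shape of that hole can be shown with a toy model. What follows is a minimal sketch, not Rahjerdi's app: `DataLayer`, `HardenedApi`, the user names and the flag string are all hypothetical stand-ins, and no real FastAPI or Firebase calls are involved. It only illustrates how an ownership check on one path protects nothing if a second path reaches the same data without it.

```python
from dataclasses import dataclass, field


class Forbidden(Exception):
    """Raised when a caller asks for an object they do not own."""


@dataclass
class DataLayer:
    # Hypothetical stand-in for the shared datastore: owner -> list of reviews.
    reviews: dict[str, list[str]] = field(default_factory=dict)

    def read(self, owner: str) -> list[str]:
        # No notion of who is asking: any client holding the keys can read anything.
        return self.reviews.get(owner, [])


@dataclass
class HardenedApi:
    store: DataLayer

    def get_reviews(self, caller: str, owner: str) -> list[str]:
        # Object-level authorisation: a caller may only read their own reviews.
        if caller != owner:
            raise Forbidden(f"{caller} may not read {owner}'s reviews")
        return self.store.read(owner)


store = DataLayer({"alice": ["private review containing FLAG{example}"]})
api = HardenedApi(store)

# Path 1: through the API, the request is refused.
try:
    api.get_reviews(caller="mallory", owner="alice")
    api_blocked = False
except Forbidden:
    api_blocked = True

# Path 2: going around the API reaches the same data with no check at all.
leaked = store.read("alice")
```

In this sketch the API does its job perfectly and the data leaks anyway, which is why attacking the API harder is the wrong move.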

The results were lopsided. GPT-5.5 solved 7 of 10 runs, at roughly $9.46 per success. DeepSeek V4 Pro managed 3 of 10 for pennies. Both Claude models landed 2 of 10, with Opus getting close several times before its own safety guardrails ended the session. Five of the nine models that completed a full ten runs never solved it once. Google's Gemini 3.1 Pro refused on sight, a fact you can read straight off its token count: nine thousand median tokens a run against a hundred thousand or more for the models that actually engaged. Several of the failures were not quiet. Step 3.7 Flash mapped the API neatly and then announced it had found exploits it had not found. Grok flagged a user reading their own reviews as an IDOR. The losing pattern was rarely "gave up" and often "declared victory over nothing."

The pitch

Set that next to the story the security industry has been telling for a year, and the two look like they cannot both be true. The dominant line is that AI has democratised offensive security. HackerOne calls it the era of the "bionic hacker" and reports a 210% rise in valid AI-related vulnerability reports year over year, with bounties paid for them up 339%. At the sharp end, the AI security firm AISLE found all twelve CVEs in a single OpenSSL release with its agents, and was awarded five CVEs in curl over 2025, including three of the six fixed in one release. The pitch is that the machines are now genuinely finding real bugs in the most scrutinised code on the internet, and getting better fast.

Both pictures are accurate. The job is to see why they do not contradict each other.

The read

What Rahjerdi measured is the shape of the capability curve, and the shape is a cliff. A couple of models clear a real task reliably. A long tail of capable-looking models cannot, and the way they fail is the important part: they produce confident, well-formatted, completely wrong findings. That is the same behaviour, viewed from the attacker's chair, that maintainers have been drowning in from the defender's. The curl project shut its bug bounty after seven years, eighty-one genuine findings and more than $90,000 in payouts, because the inbound had become unreadable. Daniel Stenberg's accounting put AI slop at around 20% of submissions in 2025 and genuine vulnerabilities at about 5%. He called it death by a thousand slops. The slop flood and the capability story are not two trends; they are one capability cliff described from opposite sides.

That reframes where the value actually lives. In Rahjerdi's harness there is an oracle: a flag that is either captured or not, so a wrong answer costs him nothing but compute. The open-source maintainer has no flag. Every inbound report is a candidate that some human has to read, reproduce and disprove, and the cost of that work is identical whether the bug is real or hallucinated. Discovery got cheap. Verification did not. The curl bounty did not close for lack of real bugs; it closed under a surplus of candidates that someone had to check by hand. The binding constraint in machine-assisted security is the triage step, and that is the step the slop submitters skip and the defenders cannot.
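The arithmetic behind that is short. As a rough sketch, take the "about 5%" genuine rate from Stenberg's accounting and assume, as the paragraph above does, that every report costs the same to check. The one-hour figure is a hypothetical placeholder, not a number from curl or anyone else.

```python
def reports_per_genuine_finding(genuine_rate: float) -> float:
    """Expected number of reports a triager reads per genuine vulnerability."""
    if not 0 < genuine_rate <= 1:
        raise ValueError("genuine_rate must be in (0, 1]")
    return 1 / genuine_rate


def triage_cost_per_genuine_finding(genuine_rate: float, cost_per_report: float) -> float:
    # Assumes a flat cost per report, whether the bug is real or hallucinated.
    return reports_per_genuine_finding(genuine_rate) * cost_per_report


GENUINE_RATE = 0.05      # "about 5%" of submissions, per Stenberg's accounting
HOURS_PER_REPORT = 1.0   # hypothetical placeholder, not a figure from the source

reports = reports_per_genuine_finding(GENUINE_RATE)
hours = triage_cost_per_genuine_finding(GENUINE_RATE, HOURS_PER_REPORT)
```

At a 5% genuine rate that works out to about twenty reports read for each real finding, and cheaper discovery does nothing to move that ratio.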

This is also why AISLE's results sit comfortably alongside the carnage rather than refuting it. AISLE does not email maintainers a paragraph of suspicion. Its agents are wrapped in a harness that has to demonstrate the vulnerability before anything goes upstream, building its own oracle as it goes. The curl team accepted those findings for the same reason they rejected the slop: the proof came attached. What separates the frontier is the verification scaffold around the model, not a cleverer guess.
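A proof-gated pipeline of that general kind can be sketched in a few lines. This is an illustration under stated assumptions and not AISLE's harness, whose internals are not described here: `Candidate`, `verify`, `filter_for_upstream` and the flag value are hypothetical names invented for the example. The only rule it encodes is that a finding goes upstream if, and only if, running its reproduction actually produces the proof.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Candidate:
    title: str
    # Hypothetical: a callable that attempts the exploit and returns what it captured.
    reproduce: Callable[[], Optional[str]]


def verify(candidate: Candidate, expected_flag: str) -> bool:
    """Accept a candidate only if running its reproduction yields the flag."""
    try:
        return candidate.reproduce() == expected_flag
    except Exception:
        # A reproduction that crashes proves nothing.
        return False


def filter_for_upstream(candidates: list[Candidate], expected_flag: str) -> list[Candidate]:
    # Confident prose is never consulted; only the executed result is.
    return [c for c in candidates if verify(c, expected_flag)]


def _crash() -> Optional[str]:
    raise RuntimeError("exploit script failed")


FLAG = "FLAG{example}"  # hypothetical oracle value
candidates = [
    Candidate("phantom exploit, confidently described", lambda: None),
    Candidate("reproduction that errors out", _crash),
    Candidate("direct datastore read", lambda: "FLAG{example}"),
]

accepted = filter_for_upstream(candidates, FLAG)
```

Only the third candidate survives. The gate is trivial here because the flag already exists; building an equivalent oracle for real software is the expensive part.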

The strongest case against

The optimist's reply is fair and worth stating in full. A cliff is a snapshot, and snapshots move. GPT-5.5 going from nothing to a 70% solve rate is what the front edge of a rising curve looks like, one model pulling away from the pack before the pack catches up. AISLE has shown the ceiling is real on code that thousands of human eyes have already combed. And verification is itself an engineering problem: HackerOne is already shipping AI tooling to triage and filter inbound reports, so if a model can find a bug, another system can be built to confirm it. On this view the slop is a transition cost, the unglamorous middle of an S-curve, and betting on a bottleneck is betting against the one thing that has reliably gotten cheaper.

The part of that I would not bet against is the ceiling. AISLE is real, and "AI cannot find serious bugs" is already false. Where I think the optimist is too quick is the assumption that verification automates on the same curve as discovery. It does not, because verification is adversarial against your own tooling in a way discovery is not. The system you would trust to confirm a candidate is drawn from the same population of models that just generated it, and that population is confidently wrong in both directions at once: Step 3.7 declaring phantom exploits, Grok waving through a non-bug, the models that found the Firebase data layer and then tried to attack it through the very API the intended path bypasses. Inti De Ceukelaire of Intigriti describes the failure mode exactly, an AI acting as an echo chamber that lures a researcher into a spiral of confirmation bias. Filtering that with more of the same class of model is not obviously convergent. AISLE's answer was an execution harness that runs the exploit and watches it work: verification by proof rather than by opinion. That is expensive and bespoke, the opposite of the thing that scales for free.

The second-order tell

Notice which bug Rahjerdi chose. Broken access control and IDOR are not random. HackerOne's own benchmark has improper access control and IDOR up between 18% and 29% year over year, the category where both attackers and defenders are now concentrating, because it resists signature scanners. There is no pattern to grep for. You have to understand what a given user is allowed to do and then prove they can do more. That is business-logic reasoning, the thing LLMs are supposed to be good at, and it is exactly where Wiz's Gal Nagli says fully autonomous agents still struggle, "especially with authentication and scenarios where human context is critical." The bugs that matter most in 2026 are the ones living right at the edge of the cliff, the least comfortable place for an automated verdict to be trusted.

The bet

The metric worth watching is not which model tops the next solve-rate table. It is whether anyone ships verification that scales without a person in the loop. Concretely: if by the end of 2026 a major bug-bounty platform is paying out on agent-submitted reports that no human triaged, machine-found and machine-confirmed end to end, then the bottleneck broke and this read was wrong. Until that happens, every "AI found a vulnerability" headline is carrying a silent second clause. An agent found a candidate, and a human confirmed it was real. The first half got cheap this year. The second half is still the whole job.