Winfunc Research built N-Day-Bench to answer a straightforward question: can today's language models find real security bugs in real code? The benchmark tests frontier LLMs against "N-Day" vulnerabilities: flaws publicly disclosed after each model's training cutoff. A three-agent pipeline runs the show. The Curator pulls ground truth from security advisories. The Finder, the model under test, gets 24 shell steps to explore the codebase and write a structured report; it never sees the actual patch. Then the Judge scores the result. Test cases rotate monthly to prevent models from memorizing answers.
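The three-agent loop is easier to see in code. This is a minimal sketch, not the real harness: every name here (`TestCase`, `run_finder`, `judge`, the exact-match scoring rule) is a hypothetical stand-in, and only the 24-step shell budget and the "Finder never sees the patch" rule come from the description above.

```python
# Illustrative sketch of an N-Day-Bench-style three-agent pipeline.
# All interfaces are hypothetical; only the step budget is from the article.
from dataclasses import dataclass

MAX_SHELL_STEPS = 24  # the Finder's exploration budget per test case


@dataclass
class TestCase:
    repo: str            # codebase snapshot the Finder explores
    ground_truth: dict   # Curator-extracted advisory facts (hidden from the Finder)


def run_finder(model, case: TestCase) -> dict:
    """Give the model up to MAX_SHELL_STEPS shell commands, then collect
    its structured vulnerability report. The patch is never shown."""
    for _ in range(MAX_SHELL_STEPS):
        cmd = model.next_command()
        if cmd is None:                                    # done exploring early
            break
        model.observe(f"(output of `{cmd}` in {case.repo})")
    return model.write_report()


def judge(report: dict, truth: dict) -> float:
    """Score the report against Curator ground truth. In the real benchmark
    the Judge is itself an LLM, so this exact-match rule is a stand-in."""
    claims = report.get("vulnerabilities", [])
    hits = sum(1 for v in claims if v.get("cwe") == truth.get("cwe"))
    return hits / max(len(claims), 1)


class ToyFinder:
    """Trivial stand-in for the model under test."""
    def __init__(self):
        self.steps = 0

    def next_command(self):
        self.steps += 1
        return "grep -rn strcpy src/" if self.steps <= 2 else None

    def observe(self, output):
        pass

    def write_report(self):
        return {"vulnerabilities": [{"cwe": "CWE-787", "file": "src/parse.c"}]}


case = TestCase(repo="example/repo", ground_truth={"cwe": "CWE-787"})
score = judge(run_finder(ToyFinder(), case), case.ground_truth)  # 1.0 here
```

The real Judge replaces the exact-match rule with an LLM grading the report against the advisory, which is precisely where the scoring-noise criticism discussed below comes in.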

The April 2026 numbers are worth looking at. OpenAI's GPT-5.4 leads at 83.93, with Z-AI's GLM-5.1 at 80.13 and Anthropic's Claude Opus 4.6 at 79.95. Moonshot AI's Kimi K2.5 landed at 77.18. Google's Gemini 3.1 Pro Preview trailed the pack at 68.50. The benchmark scanned 1,000 advisories but only accepted 47 as test cases, a fairly strict filter. All interaction traces are public and browsable, so anyone can verify how a model reached its conclusions.

Not everyone is sold on the methodology. Community feedback on the project notes that the Judge is itself a language model, which means scoring noise is a real concern without manual review on top. Several commenters suggested splicing in codebases with no known vulnerabilities to measure false-positive rates, something the current setup doesn't do. If you're tracking AI agent capabilities in security research, those gaps matter. A model that finds bugs but also hallucinates them constantly isn't as useful as raw scores suggest.
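The false-positive check commenters asked for is simple to state: run the Finder on codebases with no known vulnerabilities and count how often it reports one anyway. A minimal sketch of that metric, with an entirely hypothetical report format:

```python
# Hypothetical false-positive-rate metric for clean-codebase control runs.
# The report shape ({"vulnerabilities": [...]}) is assumed, not the real format.
def false_positive_rate(reports_on_clean_repos: list[dict]) -> float:
    """Fraction of clean-codebase runs in which the model still reported
    at least one vulnerability. 0.0 is ideal."""
    if not reports_on_clean_repos:
        return 0.0
    flagged = sum(1 for r in reports_on_clean_repos if r.get("vulnerabilities"))
    return flagged / len(reports_on_clean_repos)


# Three control runs on clean repos; one spurious finding -> rate of 1/3.
reports = [
    {"vulnerabilities": []},
    {"vulnerabilities": [{"cwe": "CWE-79"}]},   # hallucinated finding
    {"vulnerabilities": []},
]
rate = false_positive_rate(reports)
```

Pairing a detection score with a number like this would directly address the "finds bugs but also hallucinates them" concern.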