Anyone using AI coding tools knows the feeling. You ask the model to fix a simple off-by-one error, and it rewrites half the function, adds input validation you didn't request, renames your variables, and throws in a helper function for good measure. A new research article from nrehiew.github.io puts a name to this behavior: over-editing. And it's more than annoying. It makes code review harder and quietly degrades codebase quality over time. Unlike correctness failures, over-editing is invisible to test suites. The tests pass. The diff tells a different story.

AI Doubles Code Output. Your Reviewers Can't Keep Up.

The researchers propose two metrics to measure the behavior: Token-level Levenshtein Distance and Added Cognitive Complexity. Using 400 programmatically corrupted problems from BigCodeBench, they benchmarked major models. Claude Opus 4.6 comes out looking good, with a normalized Levenshtein score of 0.060 and minimal added cognitive complexity. GPT-5.4 lands at the other end, with a 0.395 Levenshtein distance and 2.313 added cognitive complexity in reasoning mode. That's a massive gap: same bug, very different patches. Both models fix bugs, but GPT-5.4 rewrites substantially more code to do it.

Some users report trimming 80% of AI-generated code with minimal functionality loss. The 85-token "caveman" prompt suggests models already understand conciseness; they need permission, not lengthy tutorials, to edit less. Others shared horror stories of database wipes and credential exposure from over-eager agents touching files they shouldn't.

The good news: explicit prompting and targeted training both reduce over-editing. Models can learn to make minimal edits without sacrificing correctness. But until evaluation metrics go beyond Pass@1 and pricing models stop rewarding verbosity, don't expect your AI coding assistant to fix just the bug you asked about.
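To make the first metric concrete, here is a minimal sketch of a token-level normalized Levenshtein distance. This is an illustration only: the tokenizer here is a naive whitespace split and the normalization divides by the longer token sequence, both of which are assumptions; the paper's exact tokenization and normalization may differ.

```python
def token_levenshtein(a_tokens, b_tokens):
    # Classic dynamic-programming edit distance, but over tokens
    # instead of characters. Uses a rolling row to keep memory O(n).
    m, n = len(a_tokens), len(b_tokens)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if a_tokens[i - 1] == b_tokens[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n]

def normalized_edit_distance(before: str, after: str) -> float:
    # Naive whitespace tokenization (an assumption, see lead-in).
    a, b = before.split(), after.split()
    if max(len(a), len(b)) == 0:
        return 0.0
    return token_levenshtein(a, b) / max(len(a), len(b))

# A minimal patch to an off-by-one bug barely moves the score;
# a from-scratch rewrite of the same line moves it a lot.
buggy       = "for i in range(len(xs) + 1): total += xs[i]"
minimal_fix = "for i in range(len(xs)): total += xs[i]"
rewrite     = "total = sum(xs)"

print(normalized_edit_distance(buggy, minimal_fix))  # small: targeted edit
print(normalized_edit_distance(buggy, rewrite))      # large: over-edit
```

Under this scoring, a model that touches only the broken token stays near zero, while a model that rewrites the statement wholesale approaches one, which matches the intuition behind the reported 0.060 vs 0.395 gap.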
GPT-5.4 fixes bugs by rewriting your code. Claude doesn't.
A new research article defines 'over-editing', the tendency of LLMs to rewrite more code than necessary when fixing bugs. The paper introduces Token-level Levenshtein Distance and Added Cognitive Complexity to measure this behavior, and benchmarks major models including GPT-5.4, Claude Opus 4.6, and Gemini 3.1 Pro Preview. Claude Opus 4.6 performs best with minimal edits, while GPT-5.4 shows the highest tendency to over-edit. Explicit prompting and targeted training can both reduce the problem.