Something's been nagging at me since July 2025.
A research organisation called METR published a study that should've been headline news. Randomised controlled trial. Gold-standard methodology. Experienced developers on real codebases using the tools we're all being told will revolutionise software development.
The result? AI tools made developers 19% slower.
Not faster. Slower.
And here's the bit that really gets under my skin: those same developers estimated they were 20% faster. They felt the velocity. They experienced what they believed was productivity. The stopwatch said otherwise.
That 39-percentage-point gap between perception and reality is one of the most significant findings in software engineering research this decade. Yet somehow, GitHub's still pushing "55% faster" stats. Vendors are still claiming productivity revolutions. And most developers I talk to have never heard of this study.
Let's fix that.
The Study That Changes Everything
METR (Model Evaluation and Threat Research) isn't some random blog or vendor-funded marketing department. They're a nonprofit research organisation based in Berkeley, California, founded in 2022 and spun off from the Alignment Research Center in 2023. Their job is evaluating AI systems with rigorous scientific methods (METR About Page, 2025).
Their developer productivity study, published in July 2025, used a randomised controlled trial design. That's the gold standard for establishing causation, not correlation. The same methodology used in medical trials where getting it wrong has life-or-death consequences.
Here's how it worked:
They recruited 16 experienced developers from major open-source projects. These weren't beginners or intermediate coders. These were maintainers of repositories averaging 22,000+ GitHub stars and over 1 million lines of code. Developers who'd contributed to their codebases for multiple years (METR Study, July 2025).
Each developer provided lists of real issues they'd normally tackle. Bug fixes. Feature implementations. Refactoring tasks. Not toy problems or synthetic benchmarks. Real work they'd actually do anyway.
Then came the randomisation. Each of the 246 total issues was randomly assigned to either "AI allowed" or "AI not allowed" conditions. When AI was permitted, developers could use whatever tools they preferred. Most chose Cursor Pro with Claude 3.5 or 3.7 Sonnet. The frontier models. The best available at the time.
Developers recorded their screens (140+ hours of footage) and self-reported completion times. The researchers had everything they needed to calculate actual productivity.
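If the design sounds abstract, the core of it fits in a few lines. Here's a minimal sketch of the assignment step in Python; the issue labels, the 50/50 coin flip, and the fixed seed are illustrative stand-ins, not METR's actual tooling.

```python
import random

# Each real issue is randomly assigned a condition before any work starts.
# 246 placeholder labels stand in for the study's real GitHub issues.
issues = [f"issue-{n}" for n in range(1, 247)]

random.seed(42)  # fixed seed only so this illustration is reproducible
assignments = {issue: random.choice(["ai_allowed", "ai_disallowed"]) for issue in issues}

# Developers then work each issue under its assigned condition and
# self-report completion time; those times are what get compared later.
for condition in ("ai_allowed", "ai_disallowed"):
    print(condition, sum(1 for c in assignments.values() if c == condition))
```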
Before starting, developers predicted AI would reduce completion time by 24%.
After completing the study, they estimated AI had actually saved them 20%.
The measured reality: AI increased completion time by 19%.
That's not a rounding error. That's not within the margin of statistical noise. That's a 39-percentage-point gap between what developers believed was happening and what actually happened.
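If it helps to see the arithmetic spelled out, here it is as a tiny Python snippet; the sign convention (positive means time saved) is mine, not METR's notation.

```python
predicted = +0.24   # before the study: expected to save 24% of completion time
perceived = +0.20   # after the study: believed they had saved 20%
measured  = -0.19   # stopwatch reality: completion time went up by 19%

# Gap between belief and measurement, in percentage points on the "time saved" axis.
gap = perceived - measured          # 0.20 - (-0.19) = 0.39
print(f"{gap * 100:.0f} percentage points")  # prints: 39 percentage points
```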
But Everyone Says AI Makes You Faster
I know what you're thinking. What about all those other studies?
GitHub's Copilot research claimed developers complete tasks 55% faster (arXiv Paper 2302.06590, 2023). JetBrains' 2025 survey found 74% of developers cited "increased productivity" as a top benefit (JetBrains State of Developer Ecosystem, 2025).
Here's the difference: those studies either measured single standardised tasks (GitHub's HTTP server in JavaScript) or relied on self-reported surveys. Developers were asked how they felt about productivity, not measured on objective outcomes.
METR measured both perception and reality. Same developers. Same tasks. Same conditions. The gap between what people believe and what actually happens is the entire point.
The other studies aren't worthless. They tell us something important: developers genuinely believe AI helps. That belief is real, even when the underlying productivity gain isn't.
The Psychology of "Dark Flow"
There's a concept in productivity research that keeps coming up when I read analyses of this study. Researchers are calling it "dark flow" or "junk flow."
You probably know Csikszentmihalyi's original flow research. That optimal state where you're completely absorbed in a task, time disappears, and you're operating at peak performance. Flow is usually associated with creativity, productivity, and satisfaction (Mihaly Csikszentmihalyi, "Flow: The Psychology of Optimal Experience", 1990).
Dark flow is the shadow version. The feeling of productivity without the actual productivity.
Think about what happens when you're using an AI coding assistant. Code appears on your screen rapidly. You're not typing as much. The cognitive load of remembering syntax, looking up API documentation, and structuring boilerplate drops significantly. It feels smooth. It feels fast. It feels like flow.
But here's what's happening beneath that feeling:
The verification bottleneck. METR found developers accepted less than 44% of AI-generated suggestions. That means more than half the time, they reviewed code, tested it, modified it, and ultimately rejected it. All that review time doesn't feel like work. It feels like "just checking." But it adds up.
Context switching costs. Every time you shift from writing code to evaluating AI suggestions, there's a mental gear change. You're not creating anymore; you're critiquing. That switch is cognitively expensive, but it's invisible.
The integration tax. AI suggestions often need adaptation for your specific codebase. Variable naming conventions. Architectural patterns. Edge cases the model didn't anticipate. That adaptation work happens after the "fast" generation, making it feel separate from the AI's contribution.
Debugging transferred errors. When AI-generated code has subtle bugs (and research shows 45% of it contains security flaws), the debugging happens later. You don't associate that pain with the original "fast" generation. It becomes tomorrow's problem.
All these hidden costs get absorbed into the background. What you remember is watching code appear on screen. What the stopwatch measures includes everything else.
Stack Overflow's 2025 Developer Survey marked the first-ever decline in AI tool sentiment: 60% positive, down from 70%+ in previous years. Only 3% of developers "highly trust" AI output, while 46% actively distrust it. The honeymoon period might be ending.
When AI Actually Helps
I don't want to give the impression that AI coding tools are useless. That's not what METR's study shows, and it's not what I believe based on my own experience.
The study was explicit about limitations: "It seems plausible or likely that AI tools are useful in many other contexts different from their setting."
What contexts?
Less experienced developers. If you're learning a new language or framework, AI suggestions act as instant documentation and examples. You're not just getting code; you're getting implicit teaching about idioms and patterns. That's genuinely valuable.
Unfamiliar codebases. When you don't already know the code intimately, AI can help you explore and understand faster. It's like having a colleague who's read all the files you haven't.
Boilerplate and repetitive tasks. Setup files, configuration, test scaffolding, CRUD operations. The stuff where you know exactly what you need but typing it is tedious. AI shines here because the verification cost is low. You can spot-check quickly.
Documentation generation. Explaining what code does in natural language is something LLMs are genuinely good at. Converting code to docs has a different cost-benefit profile than generating code from scratch.
Exploring solution spaces. When you're not sure how to approach a problem, having AI generate multiple potential solutions can accelerate your thinking. Even if you reject all the suggestions, you've mapped the territory faster.
I've written before about a Google engineer who claimed Claude Code built a year's worth of work in an hour. The nuance mattered: she specifically said to test AI "on a domain you are already an expert of." That's different from depending on AI for work in your core expertise area.
The question isn't whether AI coding tools can help. It's whether they help you, on your tasks, in your context.
When AI Might Hurt
METR's study found the slowdown was worse in specific conditions.
Developers with high familiarity were slowed down more. The better you know your codebase, the less AI helps. You already have the patterns internalised. AI suggestions are more likely to be wrong or suboptimal than what you'd write yourself.
Complex, mature codebases increase the slowdown. The repositories in this study averaged 1 million+ lines of code and 10 years of history. AI models can't see the full picture, so their suggestions miss crucial dependencies and conventions.
There's also the security angle. Veracode's 2025 GenAI Code Security Report found 45% of AI-generated code contains security flaws (Veracode, 2025). For Cross-Site Scripting vulnerabilities specifically, AI models failed to generate secure code 86% of the time.
When I talk to senior developers, I hear the same concern repeatedly: AI tools optimise for getting something working fast. They don't optimise for maintainability, security, or long-term architectural coherence. If you're maintaining systems that need to work reliably for years, that's a problem.
How to Use AI Tools Smarter
Here's what I've changed in my own workflow since reading this research.
Task-based toggling. I don't leave AI assistance on for everything anymore. When I'm working on a codebase I know well, doing complex algorithmic work, or writing security-sensitive code, I turn it off. When I'm exploring unfamiliar territory, writing boilerplate, or generating documentation, I turn it on.
Still review the code. I've talked to developers who accept AI suggestions reflexively because it feels like the tool "knows what it's doing." It doesn't. The sub-44% acceptance rate in METR's study means even experienced developers rejected more than half of suggestions. That review step isn't optional.
Measure your actual productivity. Not how you feel. What you ship. I started tracking completion times on similar tasks with and without AI. My results were messier than METR's, but the exercise was enlightening. Your mileage will vary.
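A spreadsheet is enough for this, but here's a minimal sketch of the tracking in Python. The tasks.csv file, its columns, and the yes/no flag are my own convention, not part of any standard tool.

```python
import csv
from collections import defaultdict
from statistics import mean

def summarise(path: str = "tasks.csv") -> None:
    """Compare average completion times for tasks done with and without AI.

    Expects one hand-logged row per finished task with columns:
    task, ai_used (yes/no), minutes (wall-clock time to completion).
    """
    minutes = defaultdict(list)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            key = "with AI" if row["ai_used"].strip().lower() in ("yes", "true", "1") else "without AI"
            minutes[key].append(float(row["minutes"]))

    for key, values in sorted(minutes.items()):
        print(f"{key}: {mean(values):.1f} min average over {len(values)} tasks")

if __name__ == "__main__":
    summarise()
```

Picking genuinely comparable tasks is the hard part; the script only does the boring arithmetic.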
Deliberate skill maintenance. There's a real risk that over-reliance on AI atrophies fundamental coding skills. When the model goes down (and it will), you still need to function. I carve out time for AI-free coding specifically to maintain those muscles.
The Ralph Wiggum technique for autonomous coding loops that I covered recently is a perfect example of where AI shines. You feed an AI assistant the same task repeatedly, letting it iterate on its own output. Great for certain kinds of automated refactoring or exploration. But you wouldn't run that on production code without human review, and you wouldn't use it for everything.
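For the curious, the loop itself is almost embarrassingly simple. Here's a rough sketch, assuming Claude Code's non-interactive `claude -p` mode and a pytest suite as the stop condition; substitute whatever assistant CLI and test runner you actually use.

```python
import subprocess

PROMPT = "Fix the failing tests in this repository. Make the smallest change that works."
MAX_ITERATIONS = 5  # hard cap: never let an autonomous loop run unbounded

for attempt in range(1, MAX_ITERATIONS + 1):
    # Let the assistant take one pass at the working tree...
    subprocess.run(["claude", "-p", PROMPT], check=False)
    # ...then verify against the real test suite, not the assistant's own opinion.
    tests = subprocess.run(["pytest", "-q"], capture_output=True)
    if tests.returncode == 0:
        print(f"Tests green after {attempt} iteration(s)")
        break
else:
    print(f"Still failing after {MAX_ITERATIONS} iterations; hand it back to a human")
```

The two details that matter are the iteration cap and the fact that a human still reviews the diff before anything merges.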
What This Means for Teams
If you're managing developers, this research has implications.
Don't assume universal benefit. The developer who just joined might genuinely be helped by AI on your codebase. The one who wrote half the code over the past five years might be slowed down. Same tools, opposite effects.
Be sceptical of productivity metrics. If you're measuring commits, lines of code, or story points, those metrics can all go up while actual value delivery stays flat. AI makes it trivially easy to generate volume. Volume isn't value.
Code quality matters more than ever. If 29% of new code is AI-generated (per Science journal research), and 45% of AI-generated code has security flaws, your review processes need to catch more. Human oversight isn't diminishing in importance. It's increasing.
The Uncomfortable Truth
Here's what sits with me after months of thinking about this study.
We've been sold a story about AI coding tools that doesn't match the evidence. Not because vendors are lying (though the incentives certainly point that way), but because our own perceptions are unreliable. We feel more productive. The data says otherwise. And when subjective experience contradicts objective measurement, we tend to believe our experience.
The METR study isn't the final word. It's 16 developers, 246 tasks, one particular set of tools at one point in time. The models have improved since early 2025. Workflows are evolving. Future studies might find different results.
But it's the most rigorous study we have. And until someone produces equally rigorous evidence showing AI tools help experienced developers on real codebases, I'm going to remain sceptical of the hype.
The tools aren't bad. The technology is genuinely impressive. What's bad is the narrative that they're a universal productivity multiplier. What's dangerous is making decisions based on how they feel rather than how they perform.
If you're using AI coding tools, keep using them. But measure. Actually measure. Track your completion times. Compare similar tasks. Be honest about what the data shows, even when it contradicts your intuition.
Because here's the thing: in mature codebases with high quality standards and experienced developers, AI might be making people slower while making them feel faster. That gap between perception and reality is dangerous. It leads to bad decisions, wasted resources, and frustrated teams who can't figure out why shipping is taking longer despite all these "productivity" tools.
The old rules about AI coding aren't breaking down. They never existed. The new rules are being written right now, by researchers like METR who bother to actually measure instead of just asking people how they feel.
Maybe future studies will vindicate the productivity promises. Maybe the tools will improve faster than our ability to measure them. But right now, the gap between what we believe and what we can prove is wider than anyone's comfortable admitting.
---
Sources
- METR. "Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity." July 2025. https://metr.org/blog/2025-07-10-early-2025-ai-...
- METR. "About METR." 2025. https://metr.org/about
- Peng et al. "The Impact of AI on Developer Productivity: Evidence from GitHub Copilot." arXiv Paper 2302.06590. February 2023. https://arxiv.org/abs/2302.06590
- JetBrains. "The State of Developer Ecosystem 2025: Artificial Intelligence." October 2025. https://devecosystem-2025.jetbrains.com/artific...
- Veracode. "AI-Generated Code Security Risks: What Developers Must Know." 2025. https://www.veracode.com/blog/ai-generated-code...
- Csikszentmihalyi, Mihaly. "Flow: The Psychology of Optimal Experience." Harper Perennial. 1990. https://www.goodreads.com/book/show/66354.Flow
- InfoWorld. "AI coding tools can slow down seasoned developers by 19%." July 2025. https://www.infoworld.com/article/4020931/ai-co...
- InfoWorld. "85% of developers use AI regularly, JetBrains survey." October 2025. https://www.infoworld.com/article/4077352/85-of...
---
