When companies first rolled out AI coding assistants like GitHub Copilot and Amazon CodeWhisperer, many expected a simple win: faster coding, fewer bugs, more features shipped. But reality didn’t match the hype. Teams saw more pull requests, yes, but also more rework. Developers felt like they were moving faster, but their code reviews took twice as long. The truth is, measuring AI’s impact on developer productivity isn’t about counting lines of code or how often suggestions are accepted. It’s about throughput and quality, and how they balance against each other.
Why Traditional Metrics Fail
Most teams started measuring AI tools the same way they measured humans: lines of code written, pull requests merged, tasks completed. That’s where things went wrong. A developer using an AI assistant might generate 500 lines of boilerplate in minutes. But if those lines need heavy rewriting, introduce subtle bugs, or confuse other team members, the net gain is negative. GitLab’s research in February 2025 found that teams tracking “acceptance rate” (how often developers click ‘accept’ on AI suggestions) were actually optimizing for the wrong thing. One engineering manager reported acceptance rates above 35%, yet feature delivery speed didn’t budge. Why? Because accepted suggestions often needed full rewrites anyway.

Another misleading metric is time saved per task. Developers using AI tools in the METR Institute’s July 2025 study expected to finish coding tasks 24% faster. They ended up taking 19% longer. Why? Because AI-generated code didn’t match their mental model. It looked right, but it didn’t fit the project’s architecture, lacked proper tests, or ignored undocumented conventions. The time saved in writing code was lost in understanding, debugging, and refactoring it.
Throughput: What Actually Moves the Needle
Throughput isn’t about how fast a single developer writes code. It’s about how fast features reach customers. Booking.com, which deployed AI tools to over 3,500 engineers in Q3 2024, saw a 16% increase in throughput, not because developers coded faster, but because they shipped more usable features per week. How? They stopped measuring coding speed and started measuring business outcomes. They tracked the following (a rough sketch of how to compute them follows the list):
- Features delivered per week that customers actually used
- Time from feature request to customer release (customer cycle time)
- Number of pull requests merged that passed all tests and didn’t trigger rollbacks
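To make that concrete, here’s a minimal Python sketch of how those three numbers might be computed, assuming you can export feature and pull-request records with fields like these (the field names are placeholders, not Booking.com’s actual schema):

```python
from datetime import datetime
from statistics import median

# Hypothetical feature and PR records; field names are illustrative, not Booking.com's schema.
features = [
    {"requested": datetime(2025, 3, 3), "released": datetime(2025, 3, 12), "used_by_customers": True},
    {"requested": datetime(2025, 3, 5), "released": datetime(2025, 3, 28), "used_by_customers": False},
]
pull_requests = [
    {"merged": True, "ci_passed": True, "rolled_back": False},
    {"merged": True, "ci_passed": True, "rolled_back": True},
]

# Features delivered that customers actually used.
used_features = sum(1 for f in features if f["released"] and f["used_by_customers"])

# Customer cycle time: request to customer release, in days (median is less noisy than mean).
cycle_days = median((f["released"] - f["requested"]).days for f in features if f["released"])

# Merged PRs that passed all tests and did not trigger a rollback.
clean_merges = sum(1 for pr in pull_requests if pr["merged"] and pr["ci_passed"] and not pr["rolled_back"])

print(f"Features used by customers: {used_features}")
print(f"Median customer cycle time: {cycle_days} days")
print(f"Clean merges (tests passed, no rollback): {clean_merges}")
```

The exact data source doesn’t matter; what matters is that all three numbers describe what reached customers, not how much code was typed.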
AI helped them automate repetitive tasks: setting up API endpoints, writing unit test stubs, generating config files. That freed up engineers to focus on complex logic, user flows, and edge cases. The result? More valuable features shipped, not just more code.
Block, another enterprise with 4,000+ engineers, built an AI agent called “codename goose” that didn’t just write code; it coordinated with product managers and QA teams to ensure AI-generated features met real business requirements. Their throughput gains came from reducing handoffs, not typing speed.
Quality: The Hidden Cost of Speed
Quality isn’t just about fewer bugs. It’s about maintainability, readability, and long-term team velocity. A Reddit thread in August 2025 captured this perfectly: a developer named u/CodeSlinger99 said, “Copilot cut my initial coding time by about 30%, but my PR review time doubled because I kept introducing subtle bugs I wouldn’t have made manually.”

AI doesn’t understand context. It doesn’t know why a certain pattern was used in the codebase five years ago. It doesn’t care about documentation standards or testing coverage unless explicitly told. So when it generates code, it often ignores the unspoken rules that make a system sustainable.
That’s why AWS’s CTS-SW framework includes “tension metrics”: indicators that warn you when acceleration in one area causes slowdowns elsewhere. For example (a simple check along these lines is sketched after the list):
- Is the number of production incidents rising since AI adoption?
- Are senior engineers spending more time reviewing AI-generated code?
- Is the average time to fix a security vulnerability increasing?
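A tension check can be as simple as comparing a window before AI adoption with one after and flagging anything that regressed past a tolerance. The sketch below only illustrates that idea; the metric names, numbers, and 20% threshold are assumptions, not AWS’s CTS-SW definitions:

```python
# Illustrative tension check: compare a window before AI adoption with one after.
# Metric names, numbers, and the 20% tolerance are assumptions, not AWS's CTS-SW definitions.
baseline = {"production_incidents": 4, "avg_review_hours": 3.0, "vuln_fix_days": 5.0}
current = {"production_incidents": 7, "avg_review_hours": 4.5, "vuln_fix_days": 6.5}

def tension_flags(before: dict, after: dict, tolerance: float = 1.20) -> list[str]:
    """Return the metrics that regressed by more than the tolerance since AI adoption."""
    return [m for m, base in before.items() if base > 0 and after[m] / base > tolerance]

print(tension_flags(baseline, current))
# -> ['production_incidents', 'avg_review_hours', 'vuln_fix_days']: acceleration is costing you elsewhere
```

Whatever tooling you use, the point is the comparison against a pre-adoption baseline, not the exact thresholds.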
At Booking.com, 63% of engineers reported concerns about long-term code maintainability, even though 78% liked AI for routine tasks. That’s a red flag. If your team can’t easily modify or extend AI-generated code, you’re building technical debt faster than you’re shipping features.
The Right Way to Measure: A Balanced Framework
The best organizations don’t rely on one metric. They use a mix of direct and indirect measurements. GetDX’s DX Core 4 framework, adopted by leading companies as of December 2024, tracks four key areas (a small scorecard sketch follows the list):
- PR Throughput: How many pull requests are merged per week? (But only those that pass CI/CD and aren’t reverted.)
- Perceived Rate of Delivery: Survey developers: “Do you feel like you’re delivering value faster?”
- Code Quality: Static analysis scores, test coverage, security vulnerability density, and code churn (how often files are rewritten).
- Developer Experience Index: A composite score based on retention, engagement, and satisfaction surveys.
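One lightweight way to report these four areas together is a per-period scorecard. The sketch below is a hypothetical shape for such a snapshot; the field names and scales are mine, not GetDX’s official schema:

```python
from dataclasses import dataclass

@dataclass
class DXCore4Snapshot:
    """One team, one reporting period. Field names and scales are illustrative, not GetDX's schema."""
    prs_merged_clean: int       # merged, passed CI/CD, not reverted
    perceived_delivery: float   # survey answer to "are you delivering value faster?", 1-5
    quality_score: float        # composite: static analysis, coverage, vuln density, churn (0-100)
    dx_index: float             # composite: retention, engagement, satisfaction surveys (0-100)

    def summary(self) -> str:
        return (f"clean PRs: {self.prs_merged_clean} | perceived delivery: {self.perceived_delivery:.1f}/5 | "
                f"quality: {self.quality_score:.0f}/100 | DX index: {self.dx_index:.0f}/100")

print(DXCore4Snapshot(prs_merged_clean=41, perceived_delivery=3.8, quality_score=82, dx_index=74).summary())
```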
Pair this with AI-specific metrics (two of them are sketched after the list):
- Hours saved per developer per week on repetitive tasks
- Percentage of AI-generated code that requires no changes before merge
- Time spent reviewing AI-generated code vs. human-written code
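The first of these usually comes from surveys or telemetry, but the other two fall out of PR data once you tag which changes were AI-generated. A rough sketch, assuming a hypothetical per-PR record with an ai_generated flag:

```python
# Hypothetical per-PR records; the ai_generated flag could come from PR labels or commit trailers.
prs = [
    {"ai_generated": True, "edited_before_merge": False, "review_hours": 1.5},
    {"ai_generated": True, "edited_before_merge": True, "review_hours": 4.0},
    {"ai_generated": False, "edited_before_merge": True, "review_hours": 2.0},
]

def avg(xs):
    return sum(xs) / len(xs)

ai_prs = [p for p in prs if p["ai_generated"]]
human_prs = [p for p in prs if not p["ai_generated"]]

# Percentage of AI-generated code that requires no changes before merge.
pct_unchanged = 100 * sum(not p["edited_before_merge"] for p in ai_prs) / len(ai_prs)

# Time spent reviewing AI-generated code vs. human-written code.
review_ratio = avg([p["review_hours"] for p in ai_prs]) / avg([p["review_hours"] for p in human_prs])

print(f"{pct_unchanged:.0f}% of AI-generated PRs merged unchanged; "
      f"reviews take {review_ratio:.1f}x as long as for human-written PRs")
```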
And crucially, compare teams. GetDX recommends running a controlled experiment: pick two teams working on similar products. Give one team AI tools. Keep the other on traditional tools. Track both groups for 2-3 release cycles. The team using AI might show higher PR volume, but if their bug rate is 40% higher, the trade-off isn’t worth it.
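The arithmetic behind that judgment is simple. Here’s a sketch with made-up numbers showing how a lift in PR volume can still lose to a faster-growing bug rate:

```python
# Made-up numbers for one 2-3 cycle comparison; the logic is the point, not the data.
ai_team = {"prs_merged": 120, "bugs_escaped": 21}
control_team = {"prs_merged": 95, "bugs_escaped": 12}

pr_lift = ai_team["prs_merged"] / control_team["prs_merged"] - 1
bug_rate_growth = (ai_team["bugs_escaped"] / ai_team["prs_merged"]) / (
    control_team["bugs_escaped"] / control_team["prs_merged"]) - 1

print(f"PR volume: +{pr_lift:.0%}, bugs per merged PR: +{bug_rate_growth:.0%}")
if bug_rate_growth > pr_lift:
    print("Quality is degrading faster than throughput is improving; the trade-off isn't worth it.")
```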
Real-World Results: What Works
Booking.com’s 16% throughput gain didn’t come from forcing AI on everyone. They started small. They trained engineers to use AI for scaffolding, not logic. They added automated checks that flagged AI-generated code for extra review. They required two engineers to sign off on any major feature generated with AI.

Block’s “codename goose” didn’t replace developers; it extended them. The AI handled boilerplate and documentation. Humans focused on architecture and edge cases. They saw a 22% drop in time spent on onboarding new engineers because AI-generated code was more consistent and better documented.
Meanwhile, companies that focused only on speed saw the opposite. One fintech firm reported a 30% spike in critical security vulnerabilities after adopting Copilot widely. Their developers accepted AI suggestions without understanding the underlying libraries. The SEC flagged them in May 2025 for failing to meet auditability standards for AI-assisted code.
What You Should Do Now
If you’re considering AI coding assistants, don’t start by buying licenses. Start by measuring (a simple weekly tracking sketch follows these steps).
- Define your goals: Are you trying to ship faster? Reduce burnout? Improve code quality? Your goal determines your metrics.
- Run a pilot: Pick one team. Give them AI tools. Track DX Core 4 metrics for 8 weeks.
- Watch for tension: Is QA overwhelmed? Are reviews taking longer? Is onboarding getting harder?
- Adjust processes: If AI-generated code is causing problems, add mandatory reviews, automated linting checks, or training on how to validate AI output.
- Measure business impact: Did features ship faster? Did customer satisfaction improve? Did support tickets drop?
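Here’s what the weekly tracking for such a pilot might look like in its simplest form; the fields and numbers below are placeholders, not a real team’s data:

```python
# Placeholder weekly log for an 8-week pilot: (week, features_shipped, support_tickets, avg_review_hours_per_pr)
weekly_log = [
    (1, 2, 31, 2.1),
    (2, 2, 33, 3.4),
    # ... fill in the remaining weeks the same way
    (8, 4, 27, 2.6),
]

first, last = weekly_log[0], weekly_log[-1]
print(f"Features shipped/week: {first[1]} -> {last[1]}")
print(f"Support tickets/week:  {first[2]} -> {last[2]}")
print(f"Review hours per PR:   {first[3]} -> {last[3]}")
# Read the trend across the whole pilot: better business outcomes with flat or falling review
# load is a win; more features with ballooning review time is a tension signal.
```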
There’s no magic number for AI ROI. But there is a clear pattern: teams that measure both throughput and quality succeed. Teams that chase speed alone end up slower, not faster.
What’s Next
By Q3 2026, Gartner predicts 85% of enterprises will use “tension metrics” to balance AI acceleration with system stability. The METR Institute’s randomized controlled trials are becoming the gold standard for objective measurement. And companies like GitLab and AWS are pushing the industry toward measuring business outcomes, not engineering activity.

The real win isn’t writing code faster. It’s building software that lasts, scales, and delivers value without burning out your team. AI can help with that. But only if you measure the right things.
Can AI coding assistants really improve developer productivity?
Yes, but only if you measure the right things. AI can speed up repetitive tasks like writing boilerplate, generating tests, or setting up configurations. But studies like METR’s July 2025 trial show that experienced developers often take longer to complete tasks with AI because the generated code doesn’t match their mental model or project standards. The key is balancing speed with quality. Teams that track both throughput and code maintainability see real gains. Teams that only track lines of code or acceptance rates often see no net improvement, or even slowdowns.
What’s the biggest mistake companies make when measuring AI productivity?
Focusing on acceptance rate. Just because a developer clicks “accept” on an AI suggestion doesn’t mean it’s good code. Many accepted suggestions require heavy editing, introduce bugs, or break architectural patterns. GitLab’s research found teams with 35%+ acceptance rates saw no improvement in feature delivery speed. The real metric is: how many AI-generated changes made it to production without rework? That’s what matters.
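If you want to track that, one way to define it is a “no-rework rate”: the share of AI-assisted changes that shipped and needed no follow-up fix within some window. A toy sketch, with hypothetical fields you’d derive from your own PR labels, deploy logs, and linked bug fixes:

```python
# Hypothetical records of AI-assisted changes; a change that never shipped counts against the rate,
# since it needed heavy pre-merge rework.
ai_changes = [
    {"deployed": True, "followup_fix_within_30d": False},
    {"deployed": True, "followup_fix_within_30d": True},
    {"deployed": False, "followup_fix_within_30d": False},
]

shipped_clean = sum(1 for c in ai_changes if c["deployed"] and not c["followup_fix_within_30d"])
no_rework_rate = 100 * shipped_clean / len(ai_changes)
print(f"No-rework rate: {no_rework_rate:.0f}%")  # 33% in this toy example
```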
Should I use AI for everything in my codebase?
No. AI works best for predictable, repetitive tasks: setting up API routes, writing unit test skeletons, generating config files, or translating comments into code. It struggles with complex logic, edge cases, and systems with undocumented conventions. Avoid using AI for core business logic, security-critical components, or anything that requires deep domain knowledge. Use it to reduce grunt work, not to replace judgment.
How long does it take to see ROI from AI coding assistants?
Most teams see a temporary dip in productivity during the first 6-8 weeks as they adapt workflows. Senior engineers spend more time reviewing AI-generated code. Junior engineers might rely on it too heavily. After 2-3 months, teams that adjust their code review processes, add automated checks, and train developers on how to validate AI output start seeing gains. Booking.com reported measurable throughput improvements after 3 months. The key is patience and process change-not just tool adoption.
Is there a risk of technical debt from using AI coding assistants?
Absolutely. AI doesn’t understand your codebase’s history or unwritten rules. It might generate code that’s syntactically correct but violates architectural patterns, lacks proper documentation, or ignores testing standards. Over time, this creates “AI debt”: code that’s hard to maintain, debug, or extend. Companies like Booking.com and Block mitigate this by requiring two engineers to review major AI-generated features and by using static analysis tools to flag AI-generated code for extra scrutiny. Without these safeguards, technical debt grows faster than features.
AI coding assistants aren’t magic. They’re tools, like version control or automated testing. Used poorly, they slow you down. Used wisely, they free your team to focus on what matters: building software that users love.