AI Makes Teams Faster and Less Stable: DORA 2025

Google's 2024 DORA (DevOps Research and Assessment) report, based on data from 39,000 professionals, found a trade-off that many teams did not expect: teams using AI extensively in their software delivery process had higher deployment frequency but also higher change failure rates. They shipped faster and broke more things.

The 2024 DORA report classified teams into four clusters. The top cluster — "elite" performers — maintained both speed and stability. But a new cluster emerged that the researchers had not seen in 10 years of DORA surveys: teams that were fast but unstable. High deployment frequency, high change failure rate, slow recovery time. These were disproportionately teams using AI tools extensively.

I build AI products for clients and my engineers use AI coding tools daily. The DORA finding does not surprise me. I have seen the mechanism in practice.

The Data

DORA 2024 (n=39,000):

Teams using AI extensively in delivery had higher deployment frequency
Same teams had higher change failure rates (percentage of deployments that cause production incidents)
The "fast but unstable" cluster appeared for the first time in DORA's history
AI tool usage was the strongest correlating factor for this new cluster

The four DORA metrics that matter:

1. Deployment frequency (how often you ship)
2. Lead time for changes (commit to production)
3. Change failure rate (what percentage of deploys cause incidents)
4. Time to restore service (how fast you recover from failures)

Elite teams score well on all four. The new "fast but unstable" cluster scores well on #1 and #2 but poorly on #3 and #4. They ship fast. They break things. They take a long time to fix what they broke.
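
To make the four metrics concrete, here is a minimal Python sketch of how they could be computed from a list of deployment records. The Deployment record, its field names, and the sample data are assumptions made for illustration. DORA itself gathers these numbers through surveys, not from a log like this one.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    """One production deployment. All field names are hypothetical."""
    committed_at: datetime                  # first commit of the change
    deployed_at: datetime                   # when the change reached production
    caused_incident: bool = False           # did this deploy trigger an incident?
    restored_at: Optional[datetime] = None  # when service recovered, if it failed


def dora_metrics(deploys: list[Deployment], window_days: int) -> dict:
    """Compute the four metrics over a reporting window of `window_days` days."""
    if not deploys:
        raise ValueError("need at least one deployment")
    failures = [d for d in deploys if d.caused_incident]
    restore_times = [
        d.restored_at - d.deployed_at for d in failures if d.restored_at is not None
    ]
    return {
        "deploys_per_day": len(deploys) / window_days,
        "median_lead_time": median(d.deployed_at - d.committed_at for d in deploys),
        "change_failure_rate": len(failures) / len(deploys),
        "median_time_to_restore": median(restore_times) if restore_times else None,
    }


# Illustrative data only: four deploys in a two-day window, one of which failed.
t0 = datetime(2024, 1, 1)
sample = [
    Deployment(t0, t0 + timedelta(hours=2)),
    Deployment(t0, t0 + timedelta(hours=4), caused_incident=True,
               restored_at=t0 + timedelta(hours=10)),
    Deployment(t0, t0 + timedelta(hours=6)),
    Deployment(t0, t0 + timedelta(hours=8)),
]
metrics = dora_metrics(sample, window_days=2)
```

On this sample the team ships twice a day with a five-hour median lead time, but one deploy in four causes an incident that takes six hours to restore. A dashboard that shows only the first two numbers would hide the last two.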

Why AI Makes Teams Faster AND Less Stable

Volume without understanding

AI coding tools generate code faster than humans type it. A developer using Copilot or Cursor can produce 3-5x more code per hour than one typing from scratch. But producing code is not the same as understanding code. When a developer writes a function themselves, they understand every line. When they accept an AI suggestion, they understand the intent but may miss subtle implementation details.

The METR study we covered in the AI trust paradox piece found that experienced developers using AI tools were 19% slower on complex tasks. The DORA data adds a new dimension: even when AI makes developers faster on simple tasks, the accumulated effect of not fully understanding the generated code shows up later as production failures.

A developer who generates 50 lines of AI-suggested code and reviews them for 2 minutes will catch obvious errors. They are likely to miss subtle race conditions, incorrect error handling in edge cases, security vulnerabilities in authentication flows, and performance issues that only manifest under load. These escape review, pass tests (because the tests were also AI-generated and cover only the happy path), and fail in production.

Test coverage illusion

AI tools are excellent at generating tests quickly. Ask Copilot to "write tests for this function" and it produces a comprehensive-looking test suite in seconds. The problem: AI-generated tests tend to test the implementation, not the behavior. They test what the code does, not what the code should do.

When the code is wrong in a subtle way, the AI-generated test passes because it tests the actual behavior (which is wrong) rather than the expected behavior (which requires domain understanding to specify). The test suite shows 90% coverage. The change failure rate climbs. The team has a false sense of security.

This is not hypothetical. We have seen it on client codebases we inherited. A test suite with 85% coverage that does not catch the edge cases that cause production incidents. The coverage number looks great. The defect rate tells a different story.
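
A small Python sketch of the pattern, using an invented discount function rather than real client code. The function, the assumed business rule (a price can never drop below zero), and both tests are hypothetical. Both tests execute every line of the function, so a coverage tool would report the same number for either one.

```python
def apply_discount(price_cents: int, percent: int) -> int:
    """Hypothetical pricing helper with a subtle bug."""
    # Bug: nothing stops percent from exceeding 100, so the result can go negative.
    return price_cents * (100 - percent) // 100


def implementation_mirroring_test() -> bool:
    """Restates the formula, so it agrees with the code even when the code is wrong."""
    expected = 1000 * (100 - 120) // 100
    return apply_discount(1000, 120) == expected


def behavior_test() -> bool:
    """Encodes the assumed business rule: a discounted price is never below zero."""
    return apply_discount(1000, 120) == 0
```

The first test passes and the second fails, although both reach full line coverage of apply_discount. Only the second one required someone to know what the correct answer should be.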

Deployment velocity without deployment discipline

AI tools reduce the time from "idea" to "code in a PR." That is step 1 of 5 in a deployment pipeline. Steps 2-5 — code review, integration testing, staging verification, and monitored deployment — are not accelerated by AI. They are accelerated by discipline.

When step 1 gets 3-5x faster but steps 2-5 stay the same, the bottleneck shifts. PRs pile up in review. Staging becomes a queue. The team responds by shortening reviews ("it looks fine, the AI wrote it") and skipping staging ("it passed the tests"). Deployment frequency goes up. Change failure rate goes up. DORA captures exactly this pattern.
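
The queueing effect is simple arithmetic. The sketch below assumes a team that used to open 4 PRs a day against a review capacity of 5, then triples its PR output while review capacity stays flat. All numbers are illustrative assumptions, not measurements.

```python
def review_backlog(opened_per_day: float, reviewed_per_day: float, days: int) -> list[float]:
    """Return the number of PRs waiting for review at the end of each day."""
    backlog = 0.0
    history = []
    for _ in range(days):
        # The backlog grows by whatever review capacity cannot absorb; it never goes negative.
        backlog = max(0.0, backlog + opened_per_day - reviewed_per_day)
        history.append(backlog)
    return history


before_ai = review_backlog(opened_per_day=4, reviewed_per_day=5, days=10)
after_ai = review_backlog(opened_per_day=12, reviewed_per_day=5, days=10)
```

With these numbers the queue stays empty before the change and reaches 70 waiting PRs after ten working days. That backlog is the pressure that tempts a team to shorten reviews.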

What Elite Teams Do Differently

The DORA elite cluster uses AI but maintains stability. How?

They review AI code like human code

Every AI-generated suggestion goes through the same code review process as human-written code. The reviewer's job is to verify correctness, not just formatting. This means the reviewer must understand the business logic, the edge cases, and the failure modes. If the reviewer accepts AI code without understanding it, they are not reviewing — they are rubber-stamping.

Our code review process does not differentiate between human-written and AI-suggested code. The reviewer is accountable for what merges. If AI-generated code causes a production incident, the reviewer missed it. This creates the incentive to actually review, not just approve.

They write tests before generating code

Test-driven development (TDD) is the antidote to the AI test coverage illusion. Write the test first, based on the business requirement. Then generate or write the code that makes the test pass. The test encodes the expected behavior, not the implementation. When the AI generates code that passes the human-written test, you have confidence. When the AI generates both the code and the test, you have nothing.
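
A minimal sketch of that order of work in Python, using the standard unittest module. The shipping rule, the threshold, and the function name are hypothetical. The point is that the tests are written from the requirement before any implementation exists.

```python
import unittest


# Step 1: tests written by a human from the (hypothetical) requirement:
# "Orders of 5000 cents or more ship free; everything else pays a flat 599 cents."
class ShippingFeeTest(unittest.TestCase):
    def test_below_threshold_pays_flat_fee(self):
        self.assertEqual(shipping_fee_cents(4999), 599)

    def test_exactly_at_threshold_is_free(self):
        self.assertEqual(shipping_fee_cents(5000), 0)

    def test_negative_total_is_rejected(self):
        with self.assertRaises(ValueError):
            shipping_fee_cents(-1)


# Step 2: the implementation, hand-written or AI-generated, is accepted
# only once the tests above pass.
def shipping_fee_cents(order_total_cents: int) -> int:
    if order_total_cents < 0:
        raise ValueError("order total cannot be negative")
    return 0 if order_total_cents >= 5000 else 599
```

Saved as a module, this runs with python -m unittest. The boundary case at exactly 5000 cents is the kind of detail a requirement states explicitly and a test generated from the code tends to copy without questioning.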

They maintain deployment discipline

Steps 2-5 do not get faster just because step 1 did. Code review takes the time it takes. Staging verification takes the time it takes. Monitored deployment (canary releases, feature flags, gradual rollout) takes the time it takes.

The teams that maintain stability resist the pressure to ship faster by cutting these steps. They use the time savings from AI-assisted coding to write more thorough tests, do more careful reviews, and deploy with more monitoring — not to squeeze more deployments into the same calendar week.

They monitor aggressively

When deployment frequency increases, monitoring must increase proportionally. More deploys means more opportunities for failure. Real-time error tracking (Sentry, Datadog), deployment-correlated metrics (error rate before and after each deploy), and automated rollback triggers are not optional at high deployment frequencies. They are the safety net that keeps fast from becoming unstable.
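
One way to picture a deployment-correlated rollback trigger is the sketch below. It takes plain request and error counts from before and after a deploy; in practice these would come from whatever monitoring tool the team uses. The function names and the three thresholds are assumptions for illustration, not recommended values.

```python
def error_rate(errors: int, requests: int) -> float:
    """Fraction of requests that failed; 0.0 when there was no traffic."""
    return errors / requests if requests else 0.0


def should_roll_back(
    errors_before: int,
    requests_before: int,
    errors_after: int,
    requests_after: int,
    *,
    min_requests: int = 100,   # wait for enough post-deploy traffic to judge
    min_rate: float = 0.01,    # ignore error rates below 1%
    max_ratio: float = 2.0,    # roll back if errors more than double
) -> bool:
    """Decide whether the post-deploy error rate justifies an automatic rollback."""
    if requests_after < min_requests:
        return False  # too little data, keep watching
    before = error_rate(errors_before, requests_before)
    after = error_rate(errors_after, requests_after)
    return after >= min_rate and after > before * max_ratio
```

For example, should_roll_back(10, 10000, 50, 1000) compares a 0.1% baseline against 5% after the deploy and returns True. Requiring both an absolute floor and a relative jump is a design choice that keeps the trigger quiet on services whose error rate is already noisy.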

The EltexSoft Practice

We use AI tools. Our engineers use Copilot, Claude, and Cursor. We accept that these tools make certain tasks faster. We do not accept that faster means we can skip review, testing, or staging.

Our deployment practice on every client engagement:

CI/CD from day one (our co-founder sets this up before the first feature)
Code review required for every merge (no self-merges, no rubber stamps)
Automated test suites that run on every PR
Staging environment that mirrors production
Monitored deployments with rollback capability

Greek House went from releases every few months to same-day deploys. Not because we shipped recklessly fast. Because the CI/CD infrastructure, the test coverage, and the review process were solid enough to support daily deployment with confidence. HeyTutor has maintained this discipline for 9 years across thousands of deployments.

The DORA data confirms what we practice: speed without stability is worse than stability without speed. The goal is both. AI tools help with speed. Engineering discipline provides stability. Skip the discipline and you join the "fast but unstable" cluster. Maintain it and you stay elite.


Last updated November 24, 2024
