MTTD and MTTR in QA: How to Measure and Reduce Detection and Repair Time
MTTD (Mean Time to Detect) measures how long bugs hide before being found. MTTR (Mean Time to Repair) measures how long fixes take. Both are lagging indicators of QA maturity — but both can be systematically reduced through specific practices. This post gives you the definitions, benchmarks, and reduction strategies.
Key Takeaways
- MTTD starts when a bug is introduced, not when the ticket is filed — most teams measure this wrong
- Elite SaaS teams target MTTD under 5 minutes for P1 issues using synthetic monitoring
- MTTR runs from detection to verified fix: diagnosis + fix + deploy + verification. Teams often underestimate diagnosis time
- The most effective MTTD reducer is scheduled automated testing against production
- The most effective MTTR reducer is cutting diagnosis time with runbooks and a searchable history of past incidents
Defining MTTD and MTTR Precisely
These two metrics are frequently misunderstood because teams measure them inconsistently. Getting the definitions right matters because the definition determines what you measure, which determines what you optimize.
Mean Time to Detect (MTTD)
MTTD measures the average time between when a defect is introduced into a system and when it is detected.
The key word is "introduced." A bug is introduced at the moment a code change (or configuration change, or infrastructure change) creates the defective behavior. It is detected when someone — a test, a monitor, a user, a support ticket — identifies that the behavior is wrong.
Many teams measure MTTD from when a support ticket is opened. This is wrong. A bug that was introduced 3 days ago but only reported by a customer today took 3 days to detect, not zero.
Correctly measuring MTTD requires:
- Identifying when the defective change was deployed (git history + deployment logs)
- Identifying when the defect was first detected (monitoring alert, customer report, error log spike)
- Calculating the difference
Mean Time to Repair (MTTR)
MTTR measures the average time from when an incident is detected to when the system is fully restored.
"Fully restored" means the bug is fixed, the fix is deployed, and the system is functioning normally. Not when the fix is written. Not when the PR is merged. When users can use the feature again.
MTTR has four components, all of which start after detection:
- Diagnosis time: time to understand root cause
- Fix time: time to write and review the fix
- Deploy time: time to ship the fix to production
- Verification time: time to confirm the fix works in production
Detection time (incident start to alert firing) is MTTD and is tracked separately.
Teams that optimize only fix time while ignoring diagnosis time rarely improve MTTR meaningfully. In practice, diagnosis is often 50–70% of total MTTR for novel bugs.
Why These Metrics Matter for QA
MTTD and MTTR are the two metrics that most directly express the cost of quality gaps.
A high MTTD means bugs are living in production for a long time, accumulating user impact, corrupting data, and hiding behind workarounds. Every hour of MTTD is an hour of degraded user experience and growing remediation complexity.
A high MTTR means that when things go wrong, your team is slow to respond and slow to fix. This compounds the impact of each incident, and the damage to user trust tends to grow faster than the outage duration: a 4-hour outage can cost well over twice the user churn of a 2-hour outage.
Reducing both metrics is not just an engineering hygiene concern. It's a direct business outcome:
- MTTD under 5 minutes means you know about a bug before most users have had a chance to hit it
- MTTR under 30 minutes means most incidents are resolved before the majority of users are affected
Industry Benchmarks
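Benchmarks vary significantly by industry, system criticality, and company stage. Treat the figures below as rough orientation: the MTTR tiers loosely follow the time-to-restore categories in the DORA State of DevOps reports, while the MTTD tiers come from industry surveys, since DORA's four key metrics do not include detection time.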
MTTD Benchmarks
| Performance Level | MTTD |
|---|---|
| Elite | Under 5 minutes (automated detection) |
| High | 5–60 minutes |
| Medium | 1–24 hours |
| Low | More than 24 hours (often detected by users) |
For context: most early-stage SaaS companies operate in the Medium range. Their bugs are typically detected by users who file support tickets, then triaged by support staff who escalate to engineering. The MTTD is effectively "time until enough users complain."
MTTR Benchmarks
| Performance Level | MTTR |
|---|---|
| Elite | Under 1 hour |
| High | 1–24 hours |
| Medium | 1 day – 1 week |
| Low | More than 1 week |
The DORA 2023 report found that elite performers restore service in under an hour. Hitting that target requires automated deployment pipelines, so that fixes reach production in minutes instead of waiting for a deployment window.
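To place your own numbers against the two tables above, a small lookup is enough. The sketch below is illustrative only: the function name, the bound lists, and the choice to push exact boundary values into the slower tier are assumptions, not part of any published benchmark.

```python
# Upper bounds in minutes for each tier, taken from the two benchmark tables.
MTTD_BOUNDS = [("Elite", 5), ("High", 60), ("Medium", 24 * 60)]
MTTR_BOUNDS = [("Elite", 60), ("High", 24 * 60), ("Medium", 7 * 24 * 60)]

def classify(minutes, bounds):
    """Return the first tier whose upper bound exceeds the value.

    Assumption: a value sitting exactly on a boundary falls into the
    slower tier, because the tables do not say which side it belongs to.
    """
    for tier, upper in bounds:
        if minutes < upper:
            return tier
    return "Low"

print(classify(3, MTTD_BOUNDS))            # Elite
print(classify(52, MTTD_BOUNDS))           # High
print(classify(3 * 24 * 60, MTTR_BOUNDS))  # Medium
```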
How to Measure MTTD in Practice
The challenge with MTTD is that it requires knowing when a bug was introduced — which requires connecting incidents to deployments.
Step 1: Tag incidents to deployments
When an incident is detected, the first question to answer is: "Which deployment introduced this?" Check your deployment log for deployments in the 24–48 hours before the incident.
Most incident tracking tools (PagerDuty, Opsgenie, Linear) support linking incidents to deploys. If yours doesn't, add a "deployment" field to your incident postmortem template.
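If your tooling has no built-in link between incidents and deploys, the lookup in Step 1 can be scripted against an exported deployment log. This is a minimal sketch: the record shape (`id`, `deployed_at`) and the sample data are made up for illustration, and the 48-hour default mirrors the window suggested above.

```python
from datetime import datetime, timedelta

def candidate_deployments(deploys, detected_at, lookback_hours=48):
    """Return deployments inside the lookback window, newest first."""
    window_start = detected_at - timedelta(hours=lookback_hours)
    in_window = [
        d for d in deploys if window_start <= d["deployed_at"] <= detected_at
    ]
    return sorted(in_window, key=lambda d: d["deployed_at"], reverse=True)

# Hypothetical deployment log entries.
deploys = [
    {"id": "deploy-101", "deployed_at": datetime(2024, 3, 1, 9, 0)},
    {"id": "deploy-102", "deployed_at": datetime(2024, 3, 3, 14, 30)},
    {"id": "deploy-103", "deployed_at": datetime(2024, 3, 4, 8, 15)},
]

detected_at = datetime(2024, 3, 4, 10, 0)
suspects = candidate_deployments(deploys, detected_at)
print([d["id"] for d in suspects])  # ['deploy-103', 'deploy-102']
```

The newest deployment is listed first because it is usually the first suspect, but the list is only a starting point for the postmortem, not a verdict.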
Step 2: Identify the introduction timestamp
Given the deployment that introduced the bug, the introduction timestamp is the deployment timestamp. Your MTTD starts there.
If the bug was in a feature that had been deployed for weeks but only triggered under specific conditions, use the timestamp of the first triggering event you can identify in logs.
Step 3: Identify the detection timestamp
This is when the alert fired, the error log spiked, or the support ticket was created — whichever came first.
Step 4: Calculate and track
Time to detect = Detection Timestamp – Introduction Timestamp
Track this for every incident. MTTD is the mean of these values: calculate it as a rolling 90-day average and plot it over time.
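The per-incident calculation and the rolling 90-day average can be sketched in a few lines. The incident records and field names (`introduced_at`, `detected_at`) below are invented for illustration; in practice they would come from your incident tracker.

```python
from datetime import datetime, timedelta
from statistics import mean

# Made-up incidents for illustration.
incidents = [
    {"introduced_at": datetime(2024, 1, 10, 9, 0), "detected_at": datetime(2024, 1, 10, 9, 30)},
    {"introduced_at": datetime(2024, 2, 5, 14, 0), "detected_at": datetime(2024, 2, 5, 16, 0)},
    {"introduced_at": datetime(2024, 3, 20, 8, 0), "detected_at": datetime(2024, 3, 20, 8, 6)},
]

def time_to_detect_minutes(incident):
    """Minutes between introduction and detection for one incident."""
    delta = incident["detected_at"] - incident["introduced_at"]
    return delta.total_seconds() / 60

def rolling_mttd_minutes(incidents, as_of, window_days=90):
    """Mean time to detect over incidents detected in the trailing window."""
    start = as_of - timedelta(days=window_days)
    recent = [i for i in incidents if start <= i["detected_at"] <= as_of]
    if not recent:
        return None  # no incidents in the window, so no average to report
    return mean(time_to_detect_minutes(i) for i in recent)

print(rolling_mttd_minutes(incidents, datetime(2024, 3, 31)))  # 52.0
print(rolling_mttd_minutes(incidents, datetime(2024, 4, 30)))  # 63.0
```

Note how a single slow detection (the 2-hour incident) dominates the mean. Tracking the median alongside it is a reasonable addition if your incident counts are small.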
How to Measure MTTR in Practice
MTTR is more straightforward to measure because both endpoints (detection and resolution) are observable events.
Detection timestamp — When did the alert fire or the incident ticket open?
Resolution timestamp — When did monitoring return to normal and the incident ticket close?
Time to repair = Resolution Timestamp – Detection Timestamp
MTTR is the mean of this value across incidents.
The tricky part is "resolution timestamp." Avoid closing incidents before the fix is verified in production. A common trap: engineers close the incident when the PR is merged, not when the deployment completes and smoke tests pass. This understates MTTR.
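To see how much the "closed at PR merge" trap distorts the number, compute MTTR both ways. The sketch below uses made-up incidents, and the field names `pr_merged_at` and `verified_at` are assumptions about what your tracker records.

```python
from datetime import datetime
from statistics import mean

# Made-up incidents; pr_merged_at and verified_at are assumed field names.
incidents = [
    {
        "detected_at": datetime(2024, 3, 4, 10, 0),
        "pr_merged_at": datetime(2024, 3, 4, 11, 0),
        "verified_at": datetime(2024, 3, 4, 11, 40),
    },
    {
        "detected_at": datetime(2024, 3, 9, 15, 0),
        "pr_merged_at": datetime(2024, 3, 9, 15, 20),
        "verified_at": datetime(2024, 3, 9, 15, 40),
    },
]

def mean_minutes(incidents, end_field):
    """Mean minutes from detection to the chosen end timestamp."""
    return mean(
        (i[end_field] - i["detected_at"]).total_seconds() / 60 for i in incidents
    )

mttr_verified = mean_minutes(incidents, "verified_at")  # fix confirmed in production
mttr_merged = mean_minutes(incidents, "pr_merged_at")   # the understated version
print(mttr_verified, mttr_merged)  # 70.0 40.0
```

In this example, closing at merge would report 40 minutes while users actually waited 70.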
Strategies to Reduce MTTD
Synthetic Monitoring
Synthetic monitoring runs scripted user flows against your production environment on a schedule. A test that runs every 5 minutes and verifies that login, checkout, and core feature flows work correctly can detect most regressions in those flows within 5 minutes of deployment.
This is the single highest-leverage MTTD reducer available. It doesn't require code changes, doesn't interfere with CI, and can be set up in hours.
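As a rough idea of what the simplest possible check looks like, here is a sketch that treats a flow as healthy when its URL answers with HTTP 200. This is a deliberate simplification: real synthetic monitoring drives full user flows, usually through a browser. The URLs, the `FLOWS` mapping, and the stubbed status function are all hypothetical.

```python
import urllib.error
import urllib.request

def http_status(url, timeout=10):
    """Return the HTTP status code for a GET request, or None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with an error status
    except OSError:
        return None  # DNS failure, refused connection, timeout, etc.

def run_checks(flows, fetch_status=http_status):
    """Return the names of flows whose endpoint did not answer with 200."""
    return [name for name, url in flows.items() if fetch_status(url) != 200]

# Hypothetical endpoints standing in for real user flows.
FLOWS = {
    "login": "https://app.example.com/login",
    "checkout": "https://app.example.com/checkout",
}

def fake_status(url):
    """Stub used here so the example runs without network access."""
    return 500 if url.endswith("/checkout") else 200

failed = run_checks(FLOWS, fetch_status=fake_status)
print(failed)  # ['checkout']
```

In a real setup you would drop the stub, run `run_checks(FLOWS)` from a scheduler every 5 minutes, and page someone whenever the returned list is non-empty.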
Alerting on Error Rate Changes
Production error logs contain signals. An error rate that doubles after a deployment is almost always indicative of a regression. Set alerts on:
- Error rate increase > 50% in any 5-minute window
- New error types appearing in logs
- p95 latency exceeding threshold
These alerts often fire before users notice and before support tickets are created.
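The first rule in the list, a relative increase of more than 50% between two 5-minute windows, could be expressed like this. It is a sketch under assumptions: the `(errors, requests)` tuple shape is invented, and alerting on any error when the baseline is zero is a design choice that may be too noisy for low-traffic services.

```python
def error_rate(errors, requests):
    """Errors as a fraction of requests; 0.0 when there was no traffic."""
    return errors / requests if requests else 0.0

def should_alert(baseline, current, max_increase=0.5):
    """Compare two (errors, requests) windows and flag a relative increase.

    Assumption: with a zero baseline, any error in the current window alerts.
    """
    base_rate = error_rate(*baseline)
    current_rate = error_rate(*current)
    if base_rate == 0:
        return current_rate > 0
    return (current_rate - base_rate) / base_rate > max_increase

# Previous 5-minute window vs. the window right after a deployment.
print(should_alert(baseline=(10, 1000), current=(18, 1000)))  # True
print(should_alert(baseline=(10, 1000), current=(12, 1000)))  # False
```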
Canary Deployments
Canary deployments route a small percentage of traffic (1–5%) to the new version before full rollout. Combine with automated analysis that compares error rates and latency between canary and stable. If the canary is worse, roll back automatically.
Canary deployments reduce MTTD to the time it takes to detect the degradation in the canary population — often minutes rather than hours.
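The automated comparison between canary and stable can be as simple as two ratio checks. The thresholds, metric names, and the minimum error-rate floor below are illustrative assumptions, not recommended values. Production canary analysis typically also accounts for sample size and statistical noise.

```python
def canary_verdict(stable, canary, max_error_ratio=1.5,
                   max_latency_ratio=1.2, min_error_rate=0.001):
    """Return "rollback" if the canary is clearly worse than stable.

    min_error_rate is a floor so that a stable error rate of zero
    does not make every single canary error trigger a rollback.
    """
    error_limit = max(stable["error_rate"], min_error_rate) * max_error_ratio
    latency_limit = stable["p95_ms"] * max_latency_ratio
    if canary["error_rate"] > error_limit or canary["p95_ms"] > latency_limit:
        return "rollback"
    return "promote"

stable = {"error_rate": 0.01, "p95_ms": 300}
print(canary_verdict(stable, {"error_rate": 0.03, "p95_ms": 310}))   # rollback
print(canary_verdict(stable, {"error_rate": 0.011, "p95_ms": 320}))  # promote
```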
Test Coverage for High-Risk Paths
Not all code paths are equally risky. User authentication, payment processing, and data export are high-risk paths where a bug causes immediate, severe user impact. Ensure these paths have automated test coverage so bugs are caught in CI before deployment.
Strategies to Reduce MTTR
Reduce Diagnosis Time: Structured Postmortems
The biggest time sink in MTTR is diagnosis — understanding what caused the incident. Teams that invest in structured postmortems build a playbook of past incidents and their causes. When a new incident resembles a past one, diagnosis time drops dramatically.
Maintain a searchable incident database with: symptoms, root cause, resolution steps. When a new incident fires, search the database first.
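The database does not need to be sophisticated to be useful. As a sketch, a plain keyword match over symptoms and root cause already surfaces similar past incidents. The records and the scoring are made up for illustration, and a real setup would more likely use your tracker's built-in search or a full-text index.

```python
# Made-up postmortem records with the three fields named above.
PAST_INCIDENTS = [
    {
        "id": "INC-1",
        "symptoms": "checkout requests time out, 504 errors",
        "root_cause": "database connection pool exhaustion",
        "resolution": "raise pool size, restart workers",
    },
    {
        "id": "INC-2",
        "symptoms": "cache misses spike, slow dashboard",
        "root_cause": "Redis memory exceeded",
        "resolution": "evict keys, raise memory limit",
    },
]

def search_incidents(incidents, query):
    """Rank incidents by how many query terms appear in symptoms or root cause."""
    terms = query.lower().split()
    scored = []
    for incident in incidents:
        text = f'{incident["symptoms"]} {incident["root_cause"]}'.lower()
        score = sum(term in text for term in terms)
        if score:
            scored.append((score, incident))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [incident for _, incident in scored]

matches = search_incidents(PAST_INCIDENTS, "checkout 504")
print([m["id"] for m in matches])  # ['INC-1']
```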
Reduce Fix Time: Runbooks
For categories of incidents you've seen before, write runbooks — step-by-step resolution guides. A runbook for "database connection pool exhaustion" or "Redis memory exceeded" reduces fix time from an hour of investigation to 10 minutes of execution.
Reduce Deploy Time: Automated Deployment Pipelines
If deploying a fix requires a human to manually trigger deployments, approve gates, or coordinate with other teams, your deploy time has a floor of however long those humans take to respond. Automated deployment pipelines with fast CI reduce this floor to minutes.
Elite performers have pipelines where merging a PR to main triggers automatic deployment in under 10 minutes.
Reduce Verification Time: Automated Smoke Tests Post-Deploy
After a fix is deployed, someone needs to verify the fix worked. Manual verification is slow. Automated smoke tests that run immediately after each production deployment and confirm the affected flow is working reduce verification time to seconds.
MTTD vs. MTTR: Which to Optimize First
Both metrics matter, but they have different leverage depending on your current position.
If MTTD > 2 hours: This is the primary problem. Users are encountering bugs for hours before you know about them. Invest in synthetic monitoring and error rate alerting first.
If MTTD < 30 minutes but MTTR > 4 hours: Detection is good; resolution is slow. Invest in runbooks, faster deployment pipelines, and postmortem discipline.
If both are high: Start with MTTD. You can't resolve what you haven't detected.
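Written out as a rule, the guidance above looks like the sketch below. The thresholds come straight from the text. The function name and the fallback message are assumptions, since the rules above do not say what to do when MTTD sits between 30 minutes and 2 hours.

```python
def first_priority(mttd_minutes, mttr_minutes):
    """Apply the rules above in order; MTTD wins whenever it exceeds 2 hours."""
    if mttd_minutes > 120:
        # Also covers "both are high": you can't resolve what you haven't detected.
        return "MTTD: synthetic monitoring and error rate alerting"
    if mttd_minutes < 30 and mttr_minutes > 240:
        return "MTTR: runbooks, faster pipelines, postmortem discipline"
    # Not covered by the rules above; treated here as a judgment call.
    return "No clear bottleneck: review both metrics"

print(first_priority(mttd_minutes=300, mttr_minutes=600))
print(first_priority(mttd_minutes=10, mttr_minutes=480))
```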
Building an MTTD/MTTR Dashboard
Track these metrics in your incident tracking system or a simple spreadsheet. The data you need per incident:
| Field | Description |
|---|---|
| Incident ID | Unique identifier |
| Severity | P0/P1/P2/P3 |
| Introduced At | Deploy timestamp that introduced the bug |
| Detected At | When alert/ticket fired |
| Resolved At | When incident closed and fix verified |
| Time to Detect | Detected At – Introduced At (averaged into MTTD) |
| Time to Repair | Resolved At – Detected At (averaged into MTTR) |
| Detection Method | Synthetic test / error alert / user report / support ticket |
| Root Cause Category | Regression / integration failure / config / infrastructure |
Aggregate monthly. Plot the 90-day rolling average. Present at team retrospectives.
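If the data lives in a spreadsheet export, the monthly aggregation takes a few lines. This sketch uses invented rows with a subset of the fields from the table above, and groups by the month in which each incident was detected, which is one reasonable convention among several.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

# Made-up rows using a subset of the fields from the table above.
rows = [
    {"id": "INC-1", "introduced_at": datetime(2024, 1, 8, 9, 0),
     "detected_at": datetime(2024, 1, 8, 13, 0), "resolved_at": datetime(2024, 1, 8, 19, 0)},
    {"id": "INC-2", "introduced_at": datetime(2024, 1, 20, 10, 0),
     "detected_at": datetime(2024, 1, 20, 10, 4), "resolved_at": datetime(2024, 1, 20, 10, 44)},
    {"id": "INC-3", "introduced_at": datetime(2024, 2, 2, 15, 0),
     "detected_at": datetime(2024, 2, 2, 15, 5), "resolved_at": datetime(2024, 2, 2, 15, 35)},
]

def minutes(start, end):
    return (end - start).total_seconds() / 60

def monthly_summary(rows):
    """Group incidents by detection month and average both metrics."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[row["detected_at"].strftime("%Y-%m")].append(row)
    return {
        month: {
            "incidents": len(items),
            "mttd_min": mean(minutes(r["introduced_at"], r["detected_at"]) for r in items),
            "mttr_min": mean(minutes(r["detected_at"], r["resolved_at"]) for r in items),
        }
        for month, items in sorted(buckets.items())
    }

summary = monthly_summary(rows)
print(summary["2024-01"])  # {'incidents': 2, 'mttd_min': 122.0, 'mttr_min': 200.0}
print(summary["2024-02"])  # {'incidents': 1, 'mttd_min': 5.0, 'mttr_min': 30.0}
```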
How HelpMeTest Helps Reduce MTTD
The fastest path to reducing MTTD for most teams is synthetic monitoring — automated tests running against production on a schedule.
HelpMeTest lets you create scheduled tests in plain English that run against your live application every few minutes. When a test fails, you get an immediate alert. No code required, no CI integration needed, no infrastructure to manage.
For teams moving from "detected by users" to "detected by monitoring," this is the starting point. Once MTTD is under control, MTTR improvements become the next priority.
Summary
MTTD and MTTR are the two metrics that quantify how well your organization detects and responds to production defects. Correctly measuring them requires connecting incidents to the deployments that introduced them and tracking resolution through to verified production fix — not just PR merge.
The most effective MTTD reduction strategy is synthetic monitoring. The most effective MTTR reduction strategy is structured diagnosis through runbooks and postmortem discipline. Both can be improved incrementally, and both improvements pay back in direct user experience and revenue terms.