
What Is MTTR? (And the Other MTT-Whatevers)

May 09, 2026 | by openstatus | [fundamentals]

There are four different "MTT-" acronyms in incident management. They all sound similar, get used interchangeably, and measure completely different things. The result is engineering teams arguing about whose number is better when they aren't even measuring the same phenomenon.

MTTR is the headline one - the average time from incident detection to service restored. It's the standard benchmark for how good a team is at handling outages. Here's what it actually measures, how it relates to the other three MTT-metrics, and how to move the number.


The Four MTT-Metrics

Most incident timelines have four phases. Each has its own metric.

Problem starts ──── Detection ──── Acknowledgment ──── Resolution
       |                |                 |                 |
       └───── MTTD ─────┘                 |                 |
                        └────── MTTA ─────┘                 |
                                          └────── MTTR ─────┘

MTTD - Mean Time To Detect

Time from the problem actually starting to your team noticing it.

Driven by: monitoring quality, alert configuration, observability coverage.

Low MTTD means your monitors caught it before customers did. High MTTD means you found out from Twitter.

MTTA - Mean Time To Acknowledge

Time from the alert firing to a human acknowledging it.

Driven by: on-call rotation quality, escalation policies, alert noise levels.

Low MTTA means your on-call setup works. High MTTA usually means alert fatigue - too many noisy alerts, so real ones get ignored.

MTTR - Mean Time To Recovery (or Resolve)

Time from detection to service being restored.

Driven by: runbook quality, response process, recovery tooling, team experience.

This is the headline metric. When people say "MTTR" they almost always mean this one.

MTTF - Mean Time To Failure

Average time a system runs before failing.

Driven by: overall reliability, change management, system design.

Less commonly tracked. Requires a long incident history to be meaningful. Originally a hardware-reliability metric; it maps less cleanly onto software systems.
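
To make the phase boundaries concrete, here's a minimal TypeScript sketch. The field names (impactStartAt, detectedAt, acknowledgedAt, resolvedAt) are illustrative, not from any particular tool:

// Illustrative incident timeline - all field names are hypothetical.
interface Incident {
  impactStartAt: Date;  // problem actually starts
  detectedAt: Date;     // monitoring catches it
  acknowledgedAt: Date; // a human picks up the alert
  resolvedAt: Date;     // service restored
}

const minutesBetween = (a: Date, b: Date) =>
  (b.getTime() - a.getTime()) / 60_000;

// Per-incident durations. Averaging each across incidents gives
// MTTD, MTTA, and MTTR respectively.
const timeToDetect      = (i: Incident) => minutesBetween(i.impactStartAt, i.detectedAt);
const timeToAcknowledge = (i: Incident) => minutesBetween(i.detectedAt, i.acknowledgedAt);
const timeToRecover     = (i: Incident) => minutesBetween(i.detectedAt, i.resolvedAt);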


How to Calculate MTTR

The basic math:

MTTR = Total downtime across incidents / Number of incidents

Example: Last quarter you had 5 incidents. Their durations were 10, 20, 30, 45, and 120 minutes. Total: 225 minutes. MTTR: 225 / 5 = 45 minutes.

The catch is defining "downtime." Two definitions in common use:

  • Detection to resolution - measures response speed. Excludes the time the system was broken before you knew.
  • Impact start to resolution - measures total customer-visible downtime. Includes MTTD inside it.

Pick one. Be explicit about which. Don't switch between them.
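
In code, the whole calculation is one mean - the only real decision is which duration you feed it. A sketch using the example above (the detection lags are made-up numbers for illustration):

const mean = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / xs.length;

// Detection-to-resolution durations from the example, in minutes.
const durations = [10, 20, 30, 45, 120];
console.log(mean(durations)); // 45

// Impact-start-to-resolution adds each incident's detection lag,
// so MTTD is baked into the MTTR number.
const detectionLags = [2, 5, 1, 15, 30]; // hypothetical per-incident MTTD
const totalDowntime = durations.map((d, i) => d + detectionLags[i]);
console.log(mean(totalDowntime)); // 55.6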


The Average Is Misleading

The mean is a bad summary statistic for incident duration because the distribution is heavy-tailed.

Example:

  • Team A: 10 incidents, all 30 minutes. MTTR = 30 minutes.
  • Team B: 10 incidents, 9 at 5 minutes and 1 at 280 minutes. MTTR = 32.5 minutes.

The averages are nearly identical. Team B's customers are having a much worse experience.

Fix: track percentiles.

Metric   Definition
P50      Median incident duration - "typical" incident
P95      95% of incidents resolved within this time
P99      The worst case you can plan around

A team with MTTR of 30 minutes and P95 of 35 minutes is consistent. A team with MTTR of 30 minutes and P95 of 4 hours is occasionally catastrophic. Same average. Very different reality.
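
A quick sketch of why percentiles expose what the mean hides, using the two teams above (nearest-rank percentile, which is enough for incident counts this small):

const mean = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / xs.length;

// Nearest-rank percentile: the value below which p% of incidents fall.
const percentile = (xs: number[], p: number) => {
  const sorted = [...xs].sort((a, b) => a - b);
  return sorted[Math.ceil((p / 100) * sorted.length) - 1];
};

const teamA = Array(10).fill(30);         // 10 incidents, all 30 minutes
const teamB = [...Array(9).fill(5), 280]; // 9 quick ones, 1 catastrophe

console.log(mean(teamA), percentile(teamA, 95)); // 30 30
console.log(mean(teamB), percentile(teamB, 95)); // 32.5 280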


What's a Good MTTR?

It depends on severity. A 4-hour MTTR is fine for SEV3. It's a crisis for SEV1.

Rough benchmarks for SaaS:

Severity   Excellent   Typical         Needs work
SEV1       < 30m       1-2h            > 4h
SEV2       < 1h        2-4h            > 8h
SEV3       < 4h        Same/next day   > 1 week

The trend matters more than the absolute number. A team whose MTTR is dropping quarter over quarter is getting better. A team with low MTTR that's been flat for a year may just be in a quiet period.
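
To compare against benchmarks like these you need one MTTR per severity, not a blended number. A sketch, assuming each resolved incident carries a severity label and a duration in minutes:

type Severity = "SEV1" | "SEV2" | "SEV3";

interface ResolvedIncident {
  severity: Severity;
  durationMinutes: number; // detection to resolution
}

// One MTTR per severity - a blended average would let slow SEV1s
// hide behind a pile of quick SEV3s.
function mttrBySeverity(incidents: ResolvedIncident[]): Map<Severity, number> {
  const result = new Map<Severity, number>();
  for (const sev of ["SEV1", "SEV2", "SEV3"] as const) {
    const durations = incidents
      .filter((i) => i.severity === sev)
      .map((i) => i.durationMinutes);
    if (durations.length > 0) {
      result.set(sev, durations.reduce((a, b) => a + b, 0) / durations.length);
    }
  }
  return result;
}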


How to Actually Lower MTTR

Three levers, each attacking a different part of the timeline:

1. Faster Detection (Lower MTTD)

Every minute of MTTD reduction is a minute off MTTR (if you measure impact-to-resolution).

2. Faster Response (Lower MTTA + Active Response Time)

  • Clear on-call rotations with proper escalation
  • Reduce alert noise so real alerts don't get ignored
  • Runbooks for common incident types
  • Clear incident management roles (incident commander, comms lead, SMEs)

Most "we took too long to fix it" postmortems are actually "we took too long to figure out what was wrong."

3. Faster Recovery (Lower Active Fix Time)

  • Feature flags to kill bad features without rolling back
  • One-click rollbacks for deploys
  • Automated runbooks for common recovery steps
  • Practiced runbooks (gameday exercises)

The teams that recover fastest aren't necessarily the smartest - they've practiced the recovery steps until they're muscle memory.
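
As an illustration of the first item, a feature-flag kill switch is just a config lookup on the hot path: flip the flag and traffic routes around the broken code with no deploy. A minimal sketch with a hypothetical in-memory flag store (real systems back this with a config service):

// Hypothetical in-memory flag store - stands in for a real config service.
const flags = new Map<string, boolean>([["new-checkout-flow", true]]);
const isEnabled = (flag: string) => flags.get(flag) ?? false;

const newCheckout = (cart: string[]) => `new:${cart.length}`;       // possibly broken path
const legacyCheckout = (cart: string[]) => `legacy:${cart.length}`; // known-good fallback

function checkout(cart: string[]) {
  return isEnabled("new-checkout-flow") ? newCheckout(cart) : legacyCheckout(cart);
}

// During an incident: one line instead of a rollback.
flags.set("new-checkout-flow", false);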


Common Mistakes

Reporting MTTR aggregated across severities. A great MTTR for SEV3 hides a bad MTTR for SEV1. Always break down by severity.

Optimizing MTTR by downgrading severity. When teams are graded on MTTR, the easiest way to "improve" is to call things lower severity. Watch for this. It's measurement gaming, not improvement.

Treating MTTR as the only metric. A team with great MTTR but 10x the incident rate isn't better than a team with slower MTTR but fewer incidents. Track MTTR alongside incident frequency. Together they tell the real story.

Confusing recovery with resolution. Service restored ≠ root cause fixed. Some teams measure MTTR to "service restored" and have a separate metric for "fully resolved" (which includes the follow-up engineering work). Either is fine - just don't mix them.

Ignoring the tail. Reporting only the mean lets one bad outage hide in the data. Always look at P95 and P99.


How MTTR Connects to SLOs

MTTR shows up directly in your SLO budget math.

If your SLO allows 43 minutes of downtime per month (99.9%), and your MTTR is 60 minutes, you can afford less than one incident per month before burning the budget. If your MTTR is 5 minutes, you can afford eight incidents and still stay within budget.
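
The arithmetic behind that, as a sketch:

// Monthly error budget for a 99.9% SLO over a 30-day month.
const sloTarget = 0.999;
const monthMinutes = 30 * 24 * 60;                    // 43,200 minutes
const budgetMinutes = (1 - sloTarget) * monthMinutes; // 43.2 minutes

// How many incidents fit in the budget at a given MTTR?
const affordableIncidents = (mttr: number) => Math.floor(budgetMinutes / mttr);

console.log(affordableIncidents(60)); // 0 - a single incident blows the budget
console.log(affordableIncidents(5));  // 8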

Lower MTTR effectively buys you more deployment risk. Each incident costs less budget, so you can take more chances. This is why mature teams obsess over MTTR - it's the most leveraged metric for reliability improvement.


The Bottom Line

MTTR is the time from detection to recovery, averaged across incidents. It measures how good your team is at responding to and resolving outages.

Track it by severity. Look at percentiles, not just the mean. Optimize the three levers - detection, response, recovery - to bring it down. And don't game it by silently downgrading incidents.

The teams with the best MTTR aren't the ones who never have incidents. They're the ones who've practiced handling them.


OpenStatus combines monitoring (lower MTTD) with status pages and alerting (lower MTTA) in one platform - tightening the whole incident timeline.
