
What Is MTTR? (And the Other MTT-Whatevers)

May 09, 2026 | by openstatus | [fundamentals]

There are four different "MTT-" acronyms in incident management. They all sound similar, get used interchangeably, and measure completely different things. The result is engineering teams arguing about whose number is better when they aren't even measuring the same phenomenon.

MTTR is the headline one - the average time from incident detection to service restored. It's the standard benchmark for how good a team is at handling outages. Here's what it actually measures, how it relates to the other three MTT-metrics, and how to move the number.


The Four MTT-Metrics

Most incident timelines have four phases. Each has its own metric.

Problem starts ──── Detection ──── Acknowledgment ──── Resolution
       |                |                 |                 |
       └───── MTTD ─────┘                 |                 |
                        └────── MTTA ─────┘                 |
                                          └────── MTTR ─────┘

MTTD - Mean Time To Detect

Time from the problem actually starting to your team noticing it.

Driven by: monitoring quality, alert configuration, observability coverage.

Low MTTD means your monitors caught it before customers did. High MTTD means you found out from Twitter.

MTTA - Mean Time To Acknowledge

Time from the alert firing to a human acknowledging it.

Driven by: on-call rotation quality, escalation policies, alert noise levels.

Low MTTA means your on-call setup works. High MTTA usually means alert fatigue - too many noisy alerts, so real ones get ignored.

MTTR - Mean Time To Recovery (or Resolve)

Time from detection to service being restored.

Driven by: runbook quality, response process, recovery tooling, team experience.

This is the headline metric. When people say "MTTR" they almost always mean this one.

MTTF - Mean Time To Failure

Average time a system runs before failing.

Driven by: overall reliability, change management, system design.

Less commonly tracked. Requires a long incident history to be meaningful. Originally a hardware-reliability metric; it maps less cleanly onto software systems.
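
To make the phase boundaries concrete, here's a minimal TypeScript sketch. The field names (impactStartAt, detectedAt, acknowledgedAt, resolvedAt) are illustrative, not from any particular tool:

// Illustrative incident timeline - all field names are hypothetical.
interface Incident {
  impactStartAt: Date;  // problem actually starts
  detectedAt: Date;     // monitoring catches it
  acknowledgedAt: Date; // a human picks up the alert
  resolvedAt: Date;     // service restored
}

const minutesBetween = (a: Date, b: Date) =>
  (b.getTime() - a.getTime()) / 60_000;

// Per-incident durations. Averaging each across incidents gives
// MTTD, MTTA, and MTTR respectively.
const timeToDetect      = (i: Incident) => minutesBetween(i.impactStartAt, i.detectedAt);
const timeToAcknowledge = (i: Incident) => minutesBetween(i.detectedAt, i.acknowledgedAt);
const timeToRecover     = (i: Incident) => minutesBetween(i.detectedAt, i.resolvedAt);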


How to Calculate MTTR

The basic math:

MTTR = Total downtime across incidents / Number of incidents

Example: Last quarter you had 5 incidents. Their durations were 10, 20, 30, 45, and 120 minutes. Total: 225 minutes. MTTR: 225 / 5 = 45 minutes.

The catch is defining "downtime." Two definitions in common use:

  • Detection to resolution - measures response speed. Excludes the time the system was broken before you knew.
  • Impact start to resolution - measures total customer-visible downtime. Includes MTTD inside it.

Pick one. Be explicit about which. Don't switch between them.
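
In code, the whole calculation is one mean - the only real decision is which duration you feed it. A sketch using the example above (the detection lags are made-up numbers for illustration):

const mean = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / xs.length;

// Detection-to-resolution durations from the example, in minutes.
const durations = [10, 20, 30, 45, 120];
console.log(mean(durations)); // 45

// Impact-start-to-resolution adds each incident's detection lag,
// so MTTD is baked into the MTTR number.
const detectionLags = [2, 5, 1, 15, 30]; // hypothetical per-incident MTTD
const totalDowntime = durations.map((d, i) => d + detectionLags[i]);
console.log(mean(totalDowntime)); // 55.6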


The Average Is Misleading

The mean is a bad summary statistic for incident duration because the distribution is heavy-tailed.

Example:

  • Team A: 10 incidents, all 30 minutes. MTTR = 30 minutes.
  • Team B: 10 incidents, 9 at 5 minutes and 1 at 280 minutes. MTTR = 32.5 minutes.

The averages are nearly identical. Team B's customers are having a much worse experience.

Fix: track percentiles.

Metric   Definition
P50      Median incident duration - "typical" incident
P95      95% of incidents resolved within this time
P99      The worst case you can plan around

A team with MTTR of 30 minutes and P95 of 35 minutes is consistent. A team with MTTR of 30 minutes and P95 of 4 hours is occasionally catastrophic. Same average. Very different reality.
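
A quick sketch of why percentiles expose what the mean hides, using the two teams above (nearest-rank percentile, which is enough for incident counts this small):

const mean = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / xs.length;

// Nearest-rank percentile: the value below which p% of incidents fall.
const percentile = (xs: number[], p: number) => {
  const sorted = [...xs].sort((a, b) => a - b);
  return sorted[Math.ceil((p / 100) * sorted.length) - 1];
};

const teamA = Array(10).fill(30);         // 10 incidents, all 30 minutes
const teamB = [...Array(9).fill(5), 280]; // 9 quick ones, 1 catastrophe

console.log(mean(teamA), percentile(teamA, 95)); // 30 30
console.log(mean(teamB), percentile(teamB, 95)); // 32.5 280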


What's a Good MTTR?

It depends on severity. A 4-hour MTTR is fine for SEV3. It's a crisis for SEV1.

Rough benchmarks for SaaS:

Severity   Excellent   Typical         Needs work
SEV1       < 30m       1-2h            > 4h
SEV2       < 1h        2-4h            > 8h
SEV3       < 4h        Same/next day   > 1 week

The trend matters more than the absolute number. A team whose MTTR is dropping quarter over quarter is getting better. A team with low MTTR that's been flat for a year may just be in a quiet period.
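
To compare against benchmarks like these you need one MTTR per severity, not a blended number. A sketch, assuming each resolved incident carries a severity label and a duration in minutes:

type Severity = "SEV1" | "SEV2" | "SEV3";

interface ResolvedIncident {
  severity: Severity;
  durationMinutes: number; // detection to resolution
}

// One MTTR per severity - a blended average would let slow SEV1s
// hide behind a pile of quick SEV3s.
function mttrBySeverity(incidents: ResolvedIncident[]): Map<Severity, number> {
  const result = new Map<Severity, number>();
  for (const sev of ["SEV1", "SEV2", "SEV3"] as const) {
    const durations = incidents
      .filter((i) => i.severity === sev)
      .map((i) => i.durationMinutes);
    if (durations.length > 0) {
      result.set(sev, durations.reduce((a, b) => a + b, 0) / durations.length);
    }
  }
  return result;
}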


How to Actually Lower MTTR

Three levers, each attacking a different part of the timeline:

1. Faster Detection (Lower MTTD)

Every minute of MTTD reduction is a minute off MTTR (if you measure impact-to-resolution).

2. Faster Response (Lower MTTA + Active Response Time)

  • Clear on-call rotations with proper escalation
  • Reduce alert noise so real alerts don't get ignored
  • Runbooks for common incident types
  • Clear incident management roles (incident commander, comms lead, SMEs)

Most "we took too long to fix it" postmortems are actually "we took too long to figure out what was wrong."

3. Faster Recovery (Lower Active Fix Time)

  • Feature flags to kill bad features without rolling back
  • One-click rollbacks for deploys
  • Automated runbooks for common recovery steps
  • Practiced runbooks (gameday exercises)

The teams that recover fastest aren't necessarily the smartest - they've practiced the recovery steps until they're muscle memory.
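
As an illustration of the first item, a feature-flag kill switch is just a config lookup on the hot path: flip the flag and traffic routes around the broken code with no deploy. A minimal sketch with a hypothetical in-memory flag store (real systems back this with a config service):

// Hypothetical in-memory flag store - stands in for a real config service.
const flags = new Map<string, boolean>([["new-checkout-flow", true]]);
const isEnabled = (flag: string) => flags.get(flag) ?? false;

const newCheckout = (cart: string[]) => `new:${cart.length}`;       // possibly broken path
const legacyCheckout = (cart: string[]) => `legacy:${cart.length}`; // known-good fallback

function checkout(cart: string[]) {
  return isEnabled("new-checkout-flow") ? newCheckout(cart) : legacyCheckout(cart);
}

// During an incident: one line instead of a rollback.
flags.set("new-checkout-flow", false);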


Common Mistakes

Reporting MTTR aggregated across severities. A great MTTR for SEV3 hides a bad MTTR for SEV1. Always break down by severity.

Optimizing MTTR by downgrading severity. When teams are graded on MTTR, the easiest way to "improve" is to call things lower severity. Watch for this. It's measurement gaming, not improvement.

Treating MTTR as the only metric. A team with great MTTR but 10x the incident rate isn't better than a team with slower MTTR but fewer incidents. Track MTTR alongside incident frequency. Together they tell the real story.

Confusing recovery with resolution. Service restored ≠ root cause fixed. Some teams measure MTTR to "service restored" and have a separate metric for "fully resolved" (which includes the follow-up engineering work). Either is fine - just don't mix them.

Ignoring the tail. Reporting only the mean lets one bad outage hide in the data. Always look at P95 and P99.


How MTTR Connects to SLOs

MTTR shows up directly in your SLO budget math.

If your SLO allows 43 minutes of downtime per month (99.9%), and your MTTR is 60 minutes, you can afford less than one incident per month before burning the budget. If your MTTR is 5 minutes, you can afford eight incidents and still stay within budget.
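
The arithmetic behind that, as a sketch:

// Monthly error budget for a 99.9% SLO over a 30-day month.
const sloTarget = 0.999;
const monthMinutes = 30 * 24 * 60;                    // 43,200 minutes
const budgetMinutes = (1 - sloTarget) * monthMinutes; // 43.2 minutes

// How many incidents fit in the budget at a given MTTR?
const affordableIncidents = (mttr: number) => Math.floor(budgetMinutes / mttr);

console.log(affordableIncidents(60)); // 0 - a single incident blows the budget
console.log(affordableIncidents(5));  // 8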

Lower MTTR effectively buys you more deployment risk. Each incident costs less budget, so you can take more chances. This is why mature teams obsess over MTTR - it's the most leveraged metric for reliability improvement.


The Bottom Line

MTTR is the time from detection to recovery, averaged across incidents. It measures how good your team is at responding to and resolving outages.

Track it by severity. Look at percentiles, not just the mean. Optimize the three levers - detection, response, recovery - to bring it down. And don't game it by silently downgrading incidents.

The teams with the best MTTR aren't the ones who never have incidents. They're the ones who've practiced handling them.


OpenStatus combines monitoring (lower MTTD) with status pages and alerting (lower MTTA) in one platform - tightening the whole incident timeline.
