Why MTTR is misleading
Core Argument
While MTTR (Mean Time To Remediate/Resolve/Recover/Respond) is a widely used metric, it's often flawed, misleading, and insufficient for understanding the reliability of complex distributed systems, especially when used in isolation or as a primary target. A shift towards customer-centric metrics like SLOs, coupled with Error Budgets and qualitative understanding, provides a more accurate and actionable approach to reliability.
Problem: Why MTTR is "Dead" (or at least highly problematic)
- Ambiguity of "R": The industry lacks a consistent definition for the "R" in MTTR (Respond, Remediate, Resolve, Recover), leading to inconsistent measurement and understanding across and even within organizations.
- Misleading Statistic (The "M" - Mean):
  - Incident resolution times rarely follow a normal distribution; they often have a long tail.
  - Using the "mean" for such distributions is mathematically misleading and can hide critical outliers or skew the perception of typical recovery times.
  - Percentiles (e.g., P50, P90, P99) would be more statistically sound if this type of metric were to be used, but even then, fundamental issues remain.
- Doesn't Correlate with Severity/Impact:
  - A low MTTR doesn't necessarily mean low customer impact (e.g., many quick, minor issues vs. one prolonged critical outage).
  - A long MTTR for a non-customer-facing or low-impact issue might be acceptable.
  - It doesn't reflect the business impact or customer pain caused by an outage.
- Lacks Statistical Significance for Complex Systems:
  - For MTTR to be a statistically significant measure of improvement or degradation, an organization would need an impractically large number of similar incidents, more than even large companies like Google experience for specific failure modes.
- Gaming the Metric: When MTTR becomes a target (e.g., for OKRs, promotions), it incentivizes behaviors that improve the number without necessarily improving reliability (e.g., closing tickets quickly, not classifying minor issues as incidents).
- Incident Uniqueness & Human Factor:
  - Software incidents, especially in complex systems, are often unique due to varying human responses, system states, and cascading effects, making direct comparison difficult. Aggregating them into a single MTTR blurs valuable context.
- Borrowed from Manufacturing: MTTR originated in hardware manufacturing where failure modes and repair processes are more homogenous and predictable. This doesn't translate well to the dynamic and complex nature of software systems.
- Not Actionable Alone: Knowing MTTR went up or down doesn't inherently tell you what to do next. It requires significant context.
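The long-tail point above can be made concrete with a small sketch. The incident durations below are made up for illustration: a single prolonged outage drags the mean far above what a "typical" incident looks like, while the percentiles stay honest.

```python
import statistics

# Hypothetical incident durations in minutes: nine routine incidents
# plus one prolonged outage (the long tail).
durations = [7, 8, 9, 10, 11, 12, 13, 14, 15, 240]

mean = statistics.mean(durations)    # 33.9 -- skewed upward by one outlier
p50 = statistics.median(durations)   # 11.5 -- the "typical" incident
p90 = sorted(durations)[int(0.9 * len(durations)) - 1]  # 15 (nearest-rank P90)

print(f"mean={mean}m  P50={p50}m  P90={p90}m")
```

Here the mean suggests incidents take ~34 minutes to resolve, yet 90% of them finished in 15 minutes or less; neither number, on its own, reveals the 4-hour outage that actually hurt customers.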
Solution: Moving Beyond (or Complementing) MTTR
- Focus on Customer-Centric Reliability (SLOs):
  - Service Level Indicators (SLIs): Define what good performance/reliability means from the customer's perspective (e.g., request latency, error rate for critical user journeys).
  - Service Level Objectives (SLOs): Set targets for these SLIs, acknowledging that 100% is often an unrealistic and expensive goal. This defines "good enough."
  - Error Budgets: The inverse of the SLO target (100% - SLO%). This quantifiable "budget" for unreliability allows for data-driven decisions about risk, feature velocity, and reliability investments.
- Improve Operational Definitions:
  - Ensure clear, agreed-upon definitions for all metrics used, including what constitutes an "incident" of a particular type or severity.
  - Avoid lumping dissimilar incidents (e.g., critical business outage vs. minor network blip) into a single MTTR calculation if using it for learning.
- Utilize Qualitative Data:
  - Talk to Engineers/Responders: Gather their perception of system stability and the effectiveness of incident response. The "Kirkpatrick Model" for training ROI can be adapted.
  - Surveys: Ask engineers if the system feels more or less reliable quarter-over-quarter. This can be a powerful, albeit subjective, indicator.
- Contextualize All Metrics:
  - No single metric tells the whole story. MTTR, if used, should be a complement to SLOs and error budgets, not a replacement or primary driver.
  - Understand the narrative behind the numbers. Why did a particular incident take longer to resolve? What were the human factors?
- Measure What Matters to the Business/Customer:
  - The ultimate goal is to understand and improve the customer experience and minimize business impact. Metrics should reflect this.
  - SLOs directly tie technical performance to user happiness and business outcomes.
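The error-budget arithmetic (100% - SLO%) can be sketched in a few lines. The 99.9% target, 30-day window, and incident downtimes below are assumed values for illustration, not numbers from the talk:

```python
# Assumed: a hypothetical 99.9% availability SLO over a 30-day window.
slo_target = 0.999
window_minutes = 30 * 24 * 60                        # 43,200 minutes

# Error budget = (100% - SLO%) of the window.
budget_minutes = (1 - slo_target) * window_minutes   # ~43.2 minutes

# Budget burn: downtime (minutes) from incidents that violated the SLO.
incident_downtime = [12.0, 8.5, 5.0]                 # hypothetical incidents
consumed = sum(incident_downtime)                    # 25.5 minutes
remaining = budget_minutes - consumed                # ~17.7 minutes left

print(f"budget={budget_minutes:.1f}m consumed={consumed}m remaining={remaining:.1f}m")
```

The remaining budget is what drives the trade-off: while budget is left, the team can "spend" it on feature velocity and risky changes; once it is exhausted, reliability work takes priority.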
Benefits of Shifting Focus from MTTR to SLOs/Error Budgets
- Actionable Insights: Error budgets provide a clear framework for making trade-offs between innovation speed and reliability.
- Customer-Centric View: Aligns engineering efforts with what users actually experience and value.
- Improved Communication & Alignment: Creates a shared understanding of reliability goals across engineering, product, and business stakeholders.
- Data-Driven Decision Making: Replaces gut feelings or purely reactive responses with a quantifiable approach to managing risk.
- Encourages Proactive Reliability Work: Error budget depletion can trigger focused efforts before major outages occur.
- Realistic Expectations: Moves away from the fallacy of 100% uptime/availability.
Key Takeaways for SREs
- Educate Upwards: Help leadership understand the limitations of MTTR and the benefits of SLOs/Error Budgets. Translate technical concepts into business impact.
- Start with Qualitative Measures: If quantitative data is hard to get or initially misleading, begin by talking to the people closest to the systems.
- Define "Good" Clearly: The process of defining SLIs/SLOs is valuable in itself for fostering shared understanding.
- Iterate: Measuring reliability is a journey. Start somewhere, learn, and refine your approach.
- The Human Factor is Key: Incidents are socio-technical events. Don't ignore the human element in response and learning.
- Beware of Perverse Incentives: Ensure metrics (especially if tied to performance) drive the right behaviors.
Potential Challenges in Transitioning
- Cultural Resistance: Moving away from familiar (even if flawed) metrics like MTTR can be difficult.
- Complexity of SLO Definition: Defining meaningful SLIs/SLOs requires effort and cross-functional collaboration.
- Tooling: While improving, tooling for sophisticated SLO tracking and error budget management might require investment or custom development.
Resources
- Nobl9 Webinar: https://www.youtube.com/watch?v=xaGFWEO8f_c