Why MTTR is misleading
Core Argument
While MTTR (Mean Time To Remediate/Resolve/Recover/Respond) is a widely used metric, it's often flawed, misleading, and insufficient for understanding the reliability of complex distributed systems, especially when used in isolation or as a primary target. A shift towards customer-centric metrics like SLOs, coupled with Error Budgets and qualitative understanding, provides a more accurate and actionable approach to reliability.
Problem: Why MTTR is "Dead" (or at least highly problematic)
- Ambiguity of "R": The industry lacks a consistent definition for the "R" in MTTR (Respond, Remediate, Resolve, Recover), leading to inconsistent measurement and understanding across and even within organizations.
- Misleading Statistic (The "M" - Mean):
  - Incident resolution times rarely follow a normal distribution; they often have a long tail.
  - Using the "mean" for such distributions is mathematically misleading and can hide critical outliers or skew the perception of typical recovery times.
  - Percentiles (e.g., P50, P90, P99) would be more statistically sound if this type of metric were to be used, but even then, fundamental issues remain.
- Doesn't Correlate with Severity/Impact:
  - A low MTTR doesn't necessarily mean low customer impact (e.g., many quick, minor issues vs. one prolonged critical outage).
  - A long MTTR for a non-customer-facing or low-impact issue might be acceptable.
  - It doesn't reflect the business impact or customer pain caused by an outage.
- Lacks Statistical Significance for Complex Systems:
  - For MTTR to be a statistically significant measure of improvement or degradation, an organization would need an impractically large number of similar incidents, more than even large companies like Google experience for specific failure modes.
- Gaming the Metric: When MTTR becomes a target (e.g., for OKRs, promotions), it incentivizes behaviors that improve the number without necessarily improving reliability (e.g., closing tickets quickly, not classifying minor issues as incidents).
- Incident Uniqueness & Human Factor:
  - Software incidents, especially in complex systems, are often unique due to varying human responses, system states, and cascading effects, making direct comparison difficult. Aggregating them into a single MTTR blurs valuable context.
- Borrowed from Manufacturing: MTTR originated in hardware manufacturing where failure modes and repair processes are more homogenous and predictable. This doesn't translate well to the dynamic and complex nature of software systems.
- Not Actionable Alone: Knowing MTTR went up or down doesn't inherently tell you what to do next. It requires significant context.
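The long-tail point above can be made concrete with a small sketch. The incident durations below are made up for illustration: a single prolonged outage drags the mean far above what a "typical" incident looks like, while the percentiles stay honest.

```python
import statistics

# Hypothetical incident durations in minutes: nine routine incidents
# plus one prolonged outage (the long tail).
durations = [7, 8, 9, 10, 11, 12, 13, 14, 15, 240]

mean = statistics.mean(durations)    # 33.9 -- skewed upward by one outlier
p50 = statistics.median(durations)   # 11.5 -- the "typical" incident
p90 = sorted(durations)[int(0.9 * len(durations)) - 1]  # 15 (nearest-rank P90)

print(f"mean={mean}m  P50={p50}m  P90={p90}m")
```

Here the mean suggests incidents take ~34 minutes to resolve, yet 90% of them finished in 15 minutes or less; neither number, on its own, reveals the 4-hour outage that actually hurt customers.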
Solution: Moving Beyond (or Complementing) MTTR
- Focus on Customer-Centric Reliability (SLOs):
  - Service Level Indicators (SLIs): Define what good performance/reliability means from the customer's perspective (e.g., request latency, error rate for critical user journeys).
  - Service Level Objectives (SLOs): Set targets for these SLIs, acknowledging that 100% is often an unrealistic and expensive goal. This defines "good enough."
  - Error Budgets: The inverse of the SLO target (100% - SLO%). This quantifiable "budget" for unreliability allows for data-driven decisions about risk, feature velocity, and reliability investments.
- Improve Operational Definitions:
  - Ensure clear, agreed-upon definitions for all metrics used, including what constitutes an "incident" of a particular type or severity.
  - Avoid lumping dissimilar incidents (e.g., critical business outage vs. minor network blip) into a single MTTR calculation if using it for learning.
- Utilize Qualitative Data:
  - Talk to Engineers/Responders: Gather their perception of system stability and the effectiveness of incident response. The "Kirkpatrick Model" for training ROI can be adapted.
  - Surveys: Ask engineers if the system feels more or less reliable quarter-over-quarter. This can be a powerful, albeit subjective, indicator.
- Contextualize All Metrics:
  - No single metric tells the whole story. MTTR, if used, should be a complement to SLOs and error budgets, not a replacement or primary driver.
  - Understand the narrative behind the numbers. Why did a particular incident take longer to resolve? What were the human factors?
- Measure What Matters to the Business/Customer:
  - The ultimate goal is to understand and improve the customer experience and minimize business impact. Metrics should reflect this.
  - SLOs directly tie technical performance to user happiness and business outcomes.
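The error-budget arithmetic (100% - SLO%) can be sketched in a few lines. The 99.9% target, 30-day window, and incident downtimes below are assumed values for illustration, not numbers from the talk:

```python
# Assumed: a hypothetical 99.9% availability SLO over a 30-day window.
slo_target = 0.999
window_minutes = 30 * 24 * 60                        # 43,200 minutes

# Error budget = (100% - SLO%) of the window.
budget_minutes = (1 - slo_target) * window_minutes   # ~43.2 minutes

# Budget burn: downtime (minutes) from incidents that violated the SLO.
incident_downtime = [12.0, 8.5, 5.0]                 # hypothetical incidents
consumed = sum(incident_downtime)                    # 25.5 minutes
remaining = budget_minutes - consumed                # ~17.7 minutes left

print(f"budget={budget_minutes:.1f}m consumed={consumed}m remaining={remaining:.1f}m")
```

The remaining budget is what drives the trade-off: while budget is left, the team can "spend" it on feature velocity and risky changes; once it is exhausted, reliability work takes priority.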
Benefits of Shifting Focus from MTTR to SLOs/Error Budgets
- Actionable Insights: Error budgets provide a clear framework for making trade-offs between innovation speed and reliability.
- Customer-Centric View: Aligns engineering efforts with what users actually experience and value.
- Improved Communication & Alignment: Creates a shared understanding of reliability goals across engineering, product, and business stakeholders.
- Data-Driven Decision Making: Replaces gut feelings or purely reactive responses with a quantifiable approach to managing risk.
- Encourages Proactive Reliability Work: Error budget depletion can trigger focused efforts before major outages occur.
- Realistic Expectations: Moves away from the fallacy of 100% uptime/availability.
Key Takeaways for SREs
- Educate Upwards: Help leadership understand the limitations of MTTR and the benefits of SLOs/Error Budgets. Translate technical concepts into business impact.
- Start with Qualitative Measures: If quantitative data is hard to get or initially misleading, begin by talking to the people closest to the systems.
- Define "Good" Clearly: The process of defining SLIs/SLOs is valuable in itself for fostering shared understanding.
- Iterate: Measuring reliability is a journey. Start somewhere, learn, and refine your approach.
- The Human Factor is Key: Incidents are socio-technical events. Don't ignore the human element in response and learning.
- Beware of Perverse Incentives: Ensure metrics (especially if tied to performance) drive the right behaviors.
Potential Challenges in Transitioning
- Cultural Resistance: Moving away from familiar (even if flawed) metrics like MTTR can be difficult.
- Complexity of SLO Definition: Defining meaningful SLIs/SLOs requires effort and cross-functional collaboration.
- Tooling: While improving, tooling for sophisticated SLO tracking and error budget management might require investment or custom development.
Resources
- Nobl9 Webinar: https://www.youtube.com/watch?v=xaGFWEO8f_c