Why MTTR is misleading

Core Argument

Although MTTR (Mean Time To Remediate, Resolve, Recover, or Respond, depending on who is using it) is a widely used metric, it is often flawed, misleading, and insufficient for understanding the reliability of complex distributed systems, especially when used in isolation or as a primary target. Shifting to customer-centric metrics such as SLOs, combined with error budgets and qualitative understanding, gives a more accurate and actionable picture of reliability.

Problem: Why MTTR is "Dead" (or at least highly problematic)

Solution: Moving Beyond (or Complementing) MTTR

  1. Focus on Customer-Centric Reliability (SLOs):
    • Service Level Indicators (SLIs): Choose measurements that capture performance and reliability from the customer's perspective (e.g., request latency or error rate for critical user journeys).
    • Service Level Objectives (SLOs): Set targets for these SLIs, acknowledging that 100% is often an unrealistic and expensive goal. This defines "good enough."
    • Error Budgets: The complement of the SLO target (100% - SLO%). For example, a 99.9% SLO leaves a 0.1% error budget. This quantifiable "budget" for unreliability allows for data-driven decisions about risk, feature velocity, and reliability investments.
  2. Improve Operational Definitions:
    • Ensure clear, agreed-upon definitions for all metrics used, including what constitutes an "incident" of a particular type or severity.
    • If MTTR is used as a learning tool, avoid lumping dissimilar incidents (e.g., a critical business outage vs. a minor network blip) into a single calculation; the resulting average describes neither kind of incident.
  3. Utilize Qualitative Data:
    • Talk to Engineers/Responders: Gather their perception of system stability and the effectiveness of incident response. The Kirkpatrick Model, originally a framework for evaluating training effectiveness, can be adapted for this.
    • Surveys: Ask engineers if the system feels more or less reliable quarter-over-quarter. This can be a powerful, albeit subjective, indicator.
  4. Contextualize All Metrics:
    • No single metric tells the whole story. MTTR, if used, should be a complement to SLOs and error budgets, not a replacement or primary driver.
    • Understand the narrative behind the numbers. Why did a particular incident take longer to resolve? What were the human factors?
  5. Measure What Matters to the Business/Customer:
    • The ultimate goal is to understand and improve the customer experience and minimize business impact. Metrics should reflect this.
    • SLOs directly tie technical performance to user happiness and business outcomes.
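
The arithmetic behind points 1 and 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (error_budget, mttr_by_severity), the request-count SLI, and the sample incident data are all hypothetical and do not come from any particular monitoring tool.

```python
from statistics import mean


def error_budget(slo_target: float, total_events: int, bad_events: int) -> dict:
    """Return the SLI, the number of bad events the SLO allows,
    and the fraction of the error budget consumed so far.

    Assumes a simple "good events / total events" SLI (hypothetical choice).
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    sli = (total_events - bad_events) / total_events
    budget_fraction = 1.0 - slo_target            # 100% - SLO%
    allowed_bad = budget_fraction * total_events  # bad events the budget permits
    return {
        "sli": sli,
        "allowed_bad_events": allowed_bad,
        "budget_consumed": bad_events / allowed_bad,
    }


def mttr_by_severity(incidents):
    """Average resolution time separately for each severity.

    `incidents` is an iterable of (severity, minutes_to_resolve) pairs.
    """
    groups = {}
    for severity, minutes in incidents:
        groups.setdefault(severity, []).append(minutes)
    return {severity: mean(values) for severity, values in groups.items()}


if __name__ == "__main__":
    # Made-up numbers: a 99.9% SLO over one million requests, 400 of them bad.
    print(error_budget(0.999, 1_000_000, 400))

    # Made-up incidents: one critical outage and three minor blips.
    incidents = [("critical", 240), ("minor", 5), ("minor", 7), ("minor", 6)]
    print("lumped MTTR:", mean(minutes for _, minutes in incidents))
    print("per severity:", mttr_by_severity(incidents))
```

With these sample numbers, the 99.9% target allows about 1,000 bad requests, so 400 failures consume roughly 40% of the budget. The lumped MTTR comes out at 64.5 minutes, a figure that matches neither the 240-minute outage nor the roughly 6-minute blips, which is why splitting by severity is more informative.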

Benefits of Shifting Focus from MTTR to SLOs/Error Budgets

Key Takeaways for SREs

Potential Challenges in Transitioning

Resources