STAMP Theory in SRE

STAMP, or System-Theoretic Accident Model and Processes, is a framework developed by Professor Nancy Leveson at MIT. It represents a paradigm shift in how system safety and reliability are approached, moving beyond traditional component-failure-centric views.

Key Concepts & Principles:

  1. The Shift from Component Failures to Control Problems:

    • Traditional approaches often focus on preventing individual component failures and view accidents as a linear chain of such failures (e.g., "root cause analysis"). This can be visualized as a direct progression from normal operations to loss:
      [Figure: Control flow of a system without hazard states (traditional view)]
    • STAMP, conversely, views accidents as control problems. It emphasizes that failures often arise from inadequately controlled interactions between system components (which can include human operators, software, and hardware), even when these individual components are functioning as designed. The focus is on flaws in the system's design, its control structures, and the assumptions embedded within them.
  2. Core Tools: CAST and STPA:

    • CAST (Causal Analysis based on Systems Theory): This tool is used for post-incident investigations. It helps analyze accidents by looking at the entire socio-technical system, identifying failures in control, and understanding why the existing controls were ineffective, rather than just pinpointing component failures.
    • STPA (System-Theoretic Process Analysis): This is a proactive hazard analysis technique. STPA is used before accidents occur to identify potential system hazards, unsafe control actions that could lead to those hazards, and the system design flaws or contextual factors that could cause those unsafe actions.
  3. The Four Conditions for Control (from Ashby):
    STAMP heavily incorporates W.R. Ashby's principles from cybernetics, which state that for effective control, four conditions must be met by a controller:

    1. Goal Condition: The controller must have a clear goal or setpoint.
    2. Action Condition: The controller must be able to affect the state of the system (via actuators).
    3. Model Condition: The controller must possess (or contain) an accurate model of the system it is controlling, including how its actions affect the system.
    4. Observability Condition: The controller must be able to ascertain the current state of the system (via sensors or feedback).
    Leveson adapted these as a checklist to ensure the necessary elements for effective safety control are in place.
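The checklist use of the four conditions can be sketched in a few lines of Python. This is only an illustrative sketch: the `ControllerReview` structure, its field names, and the "autoscaler" example are assumptions made here, not part of STAMP or of any published tooling.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ControllerReview:
    """Answers recorded while reviewing one controller (hypothetical structure)."""
    name: str
    has_goal: bool      # Goal Condition: a clear goal or setpoint exists
    can_act: bool       # Action Condition: it can affect the system state
    has_model: bool     # Model Condition: it holds an accurate process model
    can_observe: bool   # Observability Condition: it can ascertain the state


def unmet_conditions(review: ControllerReview) -> list[str]:
    """Return the names of the control conditions the controller does not meet."""
    checks = {
        "Goal Condition": review.has_goal,
        "Action Condition": review.can_act,
        "Model Condition": review.has_model,
        "Observability Condition": review.can_observe,
    }
    return [condition for condition, met in checks.items() if not met]


# Hypothetical review of an "autoscaler" controller whose usage signal is stale.
review = ControllerReview(
    name="autoscaler",
    has_goal=True,
    can_act=True,
    has_model=True,
    can_observe=False,
)
print(review.name, "is missing:", unmet_conditions(review))
```

Any condition reported as unmet marks a place where the control loop deserves closer analysis.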
  4. Hazard States and Unsafe Control Actions (UCAs):

    • Hazard State: Defined as "a system state or set of conditions that, together with a particular set of worst-case environmental conditions, will lead to a loss." A crucial insight is that a system can be in a hazard state for a considerable time before an accident occurs, offering a window for detection and intervention. It's a property of the system as a whole. STAMP formalizes this progression:
      [Figure: Process flow from Normal operations through Hazard state to Loss (STAMP view)]
    • Unsafe Control Actions (UCAs): These are control actions (or the absence of required actions) that, under certain conditions and a particular system state, can lead to a hazard. STPA identifies four types of UCAs:
      1. A required control action is not provided.
      2. A control action is provided, and providing it leads to a hazard (for example, an incorrect or inadequate action).
      3. A control action is provided too early, too late, or in the wrong sequence.
      4. A control action is stopped too soon or applied for too long.
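One way to apply the four UCA types during an analysis is to turn each control action into four review questions. The sketch below does that. The `UCAType` enum, the `uca_prompts` helper, and the question wording are hypothetical illustrations, not an official STPA artifact.

```python
from __future__ import annotations

from enum import Enum


class UCAType(Enum):
    """The four UCA types, phrased so they complete a review question."""
    NOT_PROVIDED = "not provided when required"
    PROVIDED_UNSAFE = "provided in an incorrect or inadequate way"
    WRONG_TIMING = "provided at the wrong time or in the wrong sequence"
    WRONG_DURATION = "stopped too soon or applied for too long"


def uca_prompts(controller: str, action: str) -> list[str]:
    """Generate one review question per UCA type for a single control action."""
    return [
        f"Can a hazard result if '{action}' from the {controller} is {uca.value}?"
        for uca in UCAType
    ]


# Example: the "reduce quota" control action issued by a rightsizer.
for prompt in uca_prompts("rightsizer", "reduce quota"):
    print(prompt)
```

Each question that can be answered "yes" in some context becomes a candidate UCA to record along with that context.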
  5. The Google Quota Rightsizer Example (Practical Illustration):

    • Google SRE applied STPA to a "quota rightsizer" system designed to automatically adjust resource quotas. The control loop for this system can be visualized as follows:
      [Figure: Control flow of the Rightsizer quota-management system]
    • In this diagram: (1) is the Rightsizer, (2) is the "Current usage" feedback, (3) is the "Reduce quota" control action, and (4) is the Quota Service.
    • A key UCA identified was: "rightsizer reduces the assigned quota under what the service currently requires."
    • The analysis highlighted flawed feedback as a critical path to this UCA. If the rightsizer received incorrect information about a service's actual usage (violating Ashby's "Model Condition" or "Observability Condition" for the rightsizer's control loop), it could unsafely reduce the quota while functioning exactly as designed.
    • A real 2021 incident involved the rightsizer receiving incorrect feedback, leading it to calculate a dangerously low new quota. A safety delay was in place, but feedback about the pending unsafe change was missing, so the system remained in a hazard state for weeks until the change was applied, causing an outage. STPA helped anticipate such scenarios by focusing on the control loop and its potential failure modes.
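The scenario above can be made concrete with a toy model. This is a sketch under stated assumptions, not Google's implementation: the function names, the 1.2 headroom factor, and the idea of cross-checking a pending reduction against a second, independent usage source are all hypothetical. It shows how a controller working as designed can create a hazard state from bad feedback, and how feedback on the pending change can surface that state during the safety delay.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class PendingReduction:
    """A quota reduction waiting out its safety delay (hypothetical structure)."""
    service: str
    new_quota: float


def propose_reduction(
    service: str,
    current_quota: float,
    reported_usage: float,
    headroom: float = 1.2,  # assumed safety margin, chosen for illustration
) -> Optional[PendingReduction]:
    """Rightsizer step: propose a lower quota based on the usage it was told."""
    target = reported_usage * headroom
    if target < current_quota:
        return PendingReduction(service, target)
    return None


def is_hazardous(pending: PendingReduction, independent_usage: float) -> bool:
    """Feedback on the pending change: would the new quota fall below what the
    service needs, according to an independent usage source?"""
    return pending.new_quota < independent_usage


# The rightsizer is told usage is 10 units, while the service really uses 80.
pending = propose_reduction("storage", current_quota=100.0, reported_usage=10.0)
if pending is not None and is_hazardous(pending, independent_usage=80.0):
    print(f"Hazard state: pending quota {pending.new_quota} is below real usage")
```

The point of the design is that the check runs while the change is still pending, so the hazard state is visible before the reduction is applied rather than after the outage.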
  6. How STAMP Addresses "Never Occur" Losses and Design Correctness:

    • "Never Occur" Losses: For systems where certain losses are unacceptable (e.g., privacy breaches or critical data loss, in effect an "error budget of zero"), traditional reliability approaches based on SLOs and error budgets are insufficient. STAMP's focus on identifying and preventing entry into hazard states, and on ensuring adequate system-level control, is better suited to designing systems in which such losses must never occur.
    • "Is the way we designed it correct?": Traditional SRE practices often ensure a system operates according to its design. STAMP, particularly through STPA, takes a step back and fundamentally questions whether the design itself is correct and safe. It helps to analyze if the control structures, responsibilities, and assumptions in the design are adequate to prevent hazards, rather than just verifying an implementation against potentially flawed or incomplete requirements. It shifts from reactive "break-fix" to proactive system safety engineering.

In essence, STAMP provides SREs with a more powerful, system-theoretic lens to understand and manage the increasing complexity of modern systems, pushing beyond component-level reliability to achieve comprehensive system safety.