Creating Systems That Are Safe
Core Challenges Discussed:
- Defining Observability (vs. Monitoring/APM) & Overcoming Terminology Misuse: The industry often rebrands existing APM/monitoring tools as "observability" without adopting the core principles, leading to confusion.
- Understanding Complex, Distributed Systems: Traditional methods like local debuggers or simply scrolling logs are insufficient for modern, complex, and often "black box" distributed systems. This is compounded when failures arise from interactions between correctly functioning components, not just individual component failures.
- Building Confidence for Safe Deployments (e.g., on Fridays): Fear of breaking production, especially during off-peak hours or before weekends, stems from a lack of confidence in the ability to understand and quickly diagnose issues.
- Moving Beyond Static Dashboards to Actionable Insights: Relying solely on pre-defined dashboards and metrics can miss unknown unknowns; what is needed is the ability to interactively query and understand system behaviour and the underlying control loops.
- Maintaining an Accurate Mental Model of Production: Without active engagement and the right tools, engineers' understanding of how a system actually behaves can drift from reality, leading to flawed control decisions.
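To make the contrast with static dashboards concrete, here is a minimal sketch of interactive querying over "wide" request events. The event fields (service, region, build, status, duration_ms) and the breakdown helper are hypothetical, chosen only for illustration; a real observability tool would run this kind of query against its own event store.

```python
from collections import defaultdict

# Hypothetical wide events: one dict per request, carrying arbitrary fields.
EVENTS = [
    {"service": "checkout", "region": "eu", "build": "v41", "status": 200, "duration_ms": 120},
    {"service": "checkout", "region": "eu", "build": "v42", "status": 500, "duration_ms": 900},
    {"service": "checkout", "region": "us", "build": "v42", "status": 200, "duration_ms": 130},
    {"service": "checkout", "region": "eu", "build": "v42", "status": 500, "duration_ms": 950},
    {"service": "search", "region": "eu", "build": "v17", "status": 200, "duration_ms": 40},
]


def breakdown(events, group_by, where=lambda e: True):
    """Group matching events by any field and report count and error rate."""
    groups = defaultdict(list)
    for event in events:
        if where(event):
            groups[event.get(group_by)].append(event)
    return {
        key: {
            "count": len(rows),
            # Assumption for this sketch: any 5xx status counts as an error.
            "error_rate": sum(r["status"] >= 500 for r in rows) / len(rows),
        }
        for key, rows in groups.items()
    }


# First question: is the checkout error rate tied to a particular build?
by_build = breakdown(EVENTS, "build", where=lambda e: e["service"] == "checkout")

# Follow-up question, not planned in advance: within v42, does region matter?
by_region = breakdown(EVENTS, "region", where=lambda e: e["build"] == "v42")
```

The point of the sketch is that the second question was not known when the first was asked; because the raw events keep every field, any field can become a grouping or filter after the fact.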
Key Solutions & Concepts Presented:
- Observability as Deep Introspection (Control Theory Inspired): Borrowing from control theory, observability is the ability to infer the internal state of a system from its external outputs (telemetry). It's crucial for "black box" distributed systems. Frameworks like STAMP formalize this by defining observability as one of the key conditions for effective system control and safety.
- Observability as a Spectrum (Monitoring is a part): Monitoring (knowing when systems are broken via charts/metrics) is one point on the observability spectrum, not something distinct from it. Further along the spectrum, observability also lets you understand why a system is behaving the way it is.
- Interactive Investigation & Asking the Right Questions: The core of observability is the ability to ask new questions of your system and data in real-time, especially for "unknown unknowns," rather than relying on pre-canned dashboards.
- Proactive Risk Assessment (Pre-mortems & Formal Hazard Analysis): Thinking through potential failures before a launch or change, either to mitigate the risks or to know what to monitor. Methodologies like STPA (System-Theoretic Process Analysis), which is built on the STAMP framework, offer structured approaches to identify unsafe control actions and potential hazard states.
- Data-Driven SLOs for User Happiness: SLOs should be based on user experience and backed by telemetry that allows for investigation when they are breached.
- Platform Engineering & Holistic Developer Productivity: SRE is evolving, with a trend towards platform engineering teams that unite SREs, build systems, UX platform teams, etc., to improve the overall developer and end-user experience.
- AI/Machine Assistance as a Copilot for SREs: AI can assist humans by highlighting interesting data or suggesting questions, but humans are still needed to interpret and understand the "why."
- Continuous Learning & Refining Mental Models: Actively using observability tools to investigate production helps refine engineers' mental models of how systems work and how they can fail.
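As an illustration of the data-driven SLO idea above, the following sketch computes a service level indicator and the remaining error budget from request counts. The 99.9% target, the request numbers, and the error_budget function are assumptions made up for this example, not figures from the discussion.

```python
def error_budget(slo_target, total_requests, bad_requests):
    """Return the SLI, the number of bad requests the SLO allows,
    and the fraction of the error budget still remaining."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    sli = (total_requests - bad_requests) / total_requests
    allowed_bad = (1 - slo_target) * total_requests
    # A 100% target leaves no budget at all, so report zero remaining.
    remaining = 1 - bad_requests / allowed_bad if allowed_bad else 0.0
    return {"sli": sli, "allowed_bad": allowed_bad, "budget_remaining": remaining}


# Hypothetical window: 1,000,000 requests, 400 of which users experienced
# as bad (failed or too slow), measured against a 99.9% target.
report = error_budget(slo_target=0.999, total_requests=1_000_000, bad_requests=400)
```

A negative budget_remaining means the SLO has been breached; that is the moment when the underlying telemetry needs to support investigation, not just alerting.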
Highlights
- Observability is About Understanding, Not Just Data Collection: It's the capability to ask arbitrary questions about your system without knowing ahead of time what you'll need to ask.
- Confidence is Key for Velocity: Robust observability practices and tools build confidence, enabling faster and safer software delivery, including "deploying on Fridays."
- Humans + Machines, Not Humans vs. Machines: AI and machine learning can assist in sifting through data and suggesting areas of interest, but human expertise is vital for interpretation and driving to root causes.
- SRE's Evolving Role: The SRE function is increasingly part of a broader "platform engineering" or "engineering productivity" effort, serving internal developers and ultimately end-users.
- Know Your Baseline to Know Your Deviations: Understanding normal system behaviour (the baseline) is critical for identifying and addressing impactful changes or degradations.
- Pre-emptive Action & Learning Are Better than Reactive Firefighting: Investing in pre-mortems and continuous investigation of production helps prevent incidents and builds system knowledge. System safety can be viewed as a control problem, aiming to prevent systems from entering hazardous states (a key concept in STAMP).
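The "know your baseline" highlight can be sketched as a simple deviation check. This is one possible approach under stated assumptions: the baseline is a short list of hypothetical p95 latency samples, and a value counts as a deviation when it lies more than a chosen number of standard deviations from the baseline mean. Real systems often need something more robust (seasonality, percentiles, and so on).

```python
from statistics import mean, stdev


def deviates(baseline, current, threshold=3.0):
    """Return True if `current` is more than `threshold` sample standard
    deviations away from the mean of the `baseline` samples."""
    mu = mean(baseline)
    sigma = stdev(baseline)  # requires at least two baseline samples
    if sigma == 0:
        # A perfectly flat baseline: any change at all is a deviation.
        return current != mu
    return abs(current - mu) / sigma > threshold


# Hypothetical p95 latency (ms) observed during normal operation.
baseline_p95_ms = [118, 122, 120, 119, 121, 123, 117]

within_normal = deviates(baseline_p95_ms, 124)
after_deploy = deviates(baseline_p95_ms, 180)
```

The threshold of 3.0 is an arbitrary default for the sketch; the check only flags that something changed, and a human still has to investigate why.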
Resources