Creating Systems That Are Safe

Core Challenges Discussed:

Key Solutions & Concepts Presented:

Highlights

  1. Observability is About Understanding, Not Just Data Collection: It's the capability to ask arbitrary questions about your system without knowing ahead of time what you'll need to ask.
  2. Confidence is Key for Velocity: Robust observability practices and tools build confidence, enabling faster and safer software delivery, including "deploying on Fridays."
  3. Humans + Machines, Not Humans vs. Machines: AI and machine learning can help sift through data and suggest areas of interest, but human expertise remains vital for interpreting the results and tracing problems to their root causes.
  4. SRE's Evolving Role: The SRE function is increasingly part of a broader "platform engineering" or "engineering productivity" effort, serving internal developers and ultimately end-users.
  5. Know Your Baseline to Know Your Deviations: Understanding normal system behaviour (the baseline) is critical for identifying and addressing impactful changes or degradations.
  6. Pre-emptive Action & Learning Are Better than Reactive Firefighting: Investing in pre-mortems and continuous investigation of production helps prevent incidents and builds system knowledge. System safety can be viewed as a control problem: the aim is to keep systems from entering hazardous states, a key concept in STAMP (Systems-Theoretic Accident Model and Processes).
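
As a rough illustration of highlight 5, the sketch below summarises "normal" behaviour as a mean and standard deviation and flags observations that fall far outside it. It is a minimal example under stated assumptions, not a method from the talk: the function names, the latency figures, and the three-standard-deviation threshold are all hypothetical, and real systems usually need baselines that account for seasonality and traffic patterns.

```python
from statistics import mean, stdev


def build_baseline(samples):
    """Summarise normal behaviour as (mean, sample standard deviation)."""
    return mean(samples), stdev(samples)


def deviations(baseline, observations, threshold=3.0):
    """Return (index, value) pairs more than `threshold` standard
    deviations away from the baseline mean."""
    mu, sigma = baseline
    if sigma == 0:
        # A perfectly flat baseline: any different value is a deviation.
        return [(i, v) for i, v in enumerate(observations) if v != mu]
    return [
        (i, v)
        for i, v in enumerate(observations)
        if abs(v - mu) / sigma > threshold
    ]


# Hypothetical request latencies in milliseconds (illustrative data only).
normal_latency_ms = [102, 98, 101, 99, 100, 103, 97, 100]
baseline = build_baseline(normal_latency_ms)

recent_latency_ms = [101, 99, 180, 100]
print(deviations(baseline, recent_latency_ms))  # [(2, 180)]
```

The point of the sketch is the order of operations: the baseline is established first from known-good data, and only then are new observations judged against it. Without that first step, there is no principled way to say whether 180 ms is a degradation or business as usual.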

Resources