AI in Production - From Hype to Practical SRE Applications
TL;DR / Abstract
Todd Underwood contrasts the unfulfilled hype of early AIOps with the practical reality of modern LLMs, which now serve as Human Augmentation tools for SREs (e.g., config generation, incident triage). He asserts that for any ML-powered service, the only meaningful reliability metric is Model Quality as an SLO.
Core Problem / Challenge
- The failure of early "AIOps" to live up to its hype, often creating more operational toil instead of reducing it.
- The difficulty of applying traditional SRE principles to ML systems. A model can be "up" according to health checks but still produce harmful, low-quality output, making standard metrics insufficient.
- Navigating the "messy middle" of operational work: transitioning from manual configuration to directing and reviewing AI-driven execution.
- Balancing market pressure for high-velocity model releases with responsibility for the safety and reliability of highly leveraged technologies.
Key Solutions & Concepts Presented
- Human Augmentation: The most practical current use of AI in operations is augmenting SRE capabilities, not full automation:
  - First-draft generation for Terraform configs or Helm charts
  - Incident triage assistance by suggesting relevant metrics/dashboards
  - Conversational documentation interface using LLMs to query runbooks, postmortems, and design docs (a retrieval-and-prompt sketch follows this list)
- Model Quality as an SLO: For ML-driven services, model output quality is the service's reliability. A fraud model that flags every transaction is equivalent to the payment service being down (see the quality-SLI sketch below)
- ML - Citations and Trust: AI-driven operational tools must provide citations and deep links to source documents so human operators can verify the reasoning (see the citation-check sketch below)
- ML - Responsible Scaling Policy: Proactive risk-management framework tying model capabilities to required security/safety controls before release (see the release-gate sketch below)
- Correlation for Fault Isolation: A simple diagnostic principle: a single degraded model suggests a model-specific issue, while simultaneous degradation across multiple models points to a systemic infrastructure failure (see the classification sketch below)
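A minimal sketch of the conversational-documentation idea: retrieve the most relevant runbook or postmortem excerpts and assemble a prompt that forces the model to cite its sources. The keyword-overlap retrieval and the names `Doc`, `retrieve`, and `build_prompt` are illustrative assumptions, and the actual LLM call is deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    title: str
    url: str
    body: str


def retrieve(query: str, docs: list[Doc], k: int = 3) -> list[Doc]:
    """Rank docs by naive keyword overlap with the query (stand-in for real retrieval)."""
    terms = set(query.lower().split())
    ranked = sorted(docs, key=lambda d: -len(terms & set(d.body.lower().split())))
    return ranked[:k]


def build_prompt(query: str, docs: list[Doc]) -> str:
    """Assemble a prompt that constrains the model to answer only from cited excerpts."""
    context = "\n\n".join(f"[{d.title}]({d.url})\n{d.body}" for d in docs)
    return (
        "Answer the on-call question using ONLY the excerpts below. "
        "Cite the source title for every claim.\n\n"
        f"Excerpts:\n{context}\n\nQuestion: {query}\nAnswer:"
    )

# The prompt would then be sent to whatever model API is in use; that call is omitted here.
```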
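A sketch of what treating model quality as the SLO could look like, assuming a pipeline that samples and grades model outputs (by humans, heuristics, or an LLM grader). The names and the 99% target are assumptions for illustration, not anything stated in the talk.

```python
from dataclasses import dataclass


@dataclass
class GradedOutput:
    # One sampled model response, scored by a human, heuristic, or grader model.
    passed_quality_check: bool


def quality_sli(samples: list[GradedOutput]) -> float:
    """Fraction of sampled outputs that met the quality bar (the SLI)."""
    if not samples:
        return 1.0  # no evidence of failure; a real system should alert on missing samples
    return sum(s.passed_quality_check for s in samples) / len(samples)


def slo_breached(samples: list[GradedOutput], target: float = 0.99) -> bool:
    """Treat low model quality exactly like unavailability: page when the SLI drops below target."""
    return quality_sli(samples) < target
```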
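A sketch of the citations-and-trust requirement: an operational answer is rejected unless it carries at least one citation with a deep link the operator can follow. The data shapes are assumptions, not a described implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Citation:
    source_title: str  # e.g. a runbook or postmortem title
    deep_link: str     # URL pointing at the exact section the answer relies on


@dataclass
class AssistantAnswer:
    text: str
    citations: list[Citation] = field(default_factory=list)


def accept_answer(answer: AssistantAnswer) -> AssistantAnswer:
    """Refuse to surface operational advice the operator cannot trace back to a source."""
    if not answer.citations:
        raise ValueError("Answer rejected: no citations to verify against")
    return answer
```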
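A sketch of a responsible-scaling-style release gate: capability tiers map to required controls, and a model ships only when every control for its tier is in place. The tier and control names here are invented for illustration and do not reflect any actual policy.

```python
# Illustrative mapping only: tiers and control names are assumptions.
REQUIRED_CONTROLS = {
    "tier-1": {"baseline-security-review"},
    "tier-2": {"baseline-security-review", "red-team-evaluation"},
    "tier-3": {"baseline-security-review", "red-team-evaluation", "weights-access-hardening"},
}


def release_allowed(capability_tier: str, controls_in_place: set[str]) -> bool:
    """Release gate: ship only if all controls required by the model's tier are in place.

    Unknown tiers raise KeyError, so the gate fails closed rather than open.
    """
    return REQUIRED_CONTROLS[capability_tier] <= controls_in_place
```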
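A sketch of the correlation heuristic for fault isolation: a single degraded model points at that model, while broad simultaneous degradation points at shared infrastructure. The 50% threshold and the return strings are assumptions.

```python
def classify_degradation(degraded_models: set[str], all_models: set[str],
                         systemic_fraction: float = 0.5) -> str:
    """Classify a degradation signal as model-specific or systemic based on its breadth."""
    if not degraded_models:
        return "healthy"
    if len(degraded_models) == 1:
        return f"model-specific: investigate {next(iter(degraded_models))}"
    if len(degraded_models) / len(all_models) >= systemic_fraction:
        return "systemic: investigate shared infrastructure (serving, network, data pipeline)"
    return "mixed: look for a dependency shared by the degraded models"
```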
Memorable Quotes
"End to end model quality is the only SLO that people working on reliability for ML systems can have."
"What is it like to do technical work in a world where the execution becomes less and less important but the architecture, and the purpose, and the design are still important?"
Connecting to My Existing Knowledge
- The concept of Model Quality as an SLO directly extends the user-centric principles from Why MTTR is misleading and connects to the evolution of SRE practices discussed in Google - The Evolution of SRE. It forces a shift from measuring reliability by process health to actual user-perceived service quality.
- The discussion on Human Augmentation parallels the automation maturity levels - we're at "driver assistance" rather than "full automation." This connects to the platform engineering approaches in Github - How we tackle platform problems where humans remain central to complex system management.
- The Responsible Scaling Policy framework can be seen as a form of structured, proactive threat modeling for AI-specific risks, directly extending the reliability principles from Building Reliable Systems.
- The emphasis on Citations and Trust aligns with the observability principles in Honeycomb - How Much Should I Be Spending On Observability - the need for transparent, debuggable systems where operators can trace decisions back to their sources.
- Some of the challenges of managing ML models can be analyzed with System-Theoretic Accident Model and Processes - ML systems create complex control loops where traditional component-failure thinking breaks down. The "model producing harmful output while appearing healthy" problem is exactly the kind of hazard state STAMP was designed to analyze. STPA could proactively identify unsafe control actions in ML feedback loops, while CAST could help analyze incidents where model degradation went undetected.