AI in Production - From Hype to Practical SRE Applications
TL;DR / Abstract
Todd Underwood contrasts the unfulfilled hype of early AIOps with the practical reality of modern LLMs, which now serve as Human Augmentation tools for SREs (e.g., config generation, incident triage). He asserts that for any ML-powered service, the only meaningful reliability metric is Model Quality as an SLO.
Core Problem / Challenge
- The failure of early "AIOps" to live up to its hype, often creating more operational toil instead of reducing it.
- The difficulty of applying traditional SRE principles to ML systems. A model can be "up" according to health checks but still produce harmful, low-quality output, making standard metrics insufficient.
- Navigating the "messy middle" of operational work: transitioning from manual configuration to directing and reviewing AI-driven execution.
- Balancing market pressure for high-velocity model releases with responsibility for the safety and reliability of highly leveraged technologies.
Key Solutions & Concepts Presented
- Human Augmentation: The most practical current use of AI in operations is augmenting SRE capabilities, not full automation:
  - First-draft generation for Terraform configs or Helm charts
  - Incident triage assistance by suggesting relevant metrics/dashboards
  - Conversational documentation interface using LLMs to query runbooks, postmortems, and design docs (a retrieval-and-prompt sketch follows this list)
- Model Quality as an SLO: For ML-driven services, model output quality is the service's reliability. A fraud model that flags every transaction is equivalent to the payment service being down (see the quality-SLI sketch below)
- ML - Citations and Trust: AI-driven operational tools must provide citations and deep links to source documents so human operators can verify the reasoning (see the citation-check sketch below)
- ML - Responsible Scaling Policy: Proactive risk-management framework tying model capabilities to required security/safety controls before release (see the release-gate sketch below)
- Correlation for Fault Isolation: A simple diagnostic principle: a single degraded model suggests a model-specific issue, while simultaneous degradation across multiple models points to a systemic infrastructure failure (see the classification sketch below)
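A minimal sketch of the conversational-documentation idea: retrieve the most relevant runbook or postmortem excerpts and assemble a prompt that forces the model to cite its sources. The keyword-overlap retrieval and the names `Doc`, `retrieve`, and `build_prompt` are illustrative assumptions, and the actual LLM call is deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    title: str
    url: str
    body: str


def retrieve(query: str, docs: list[Doc], k: int = 3) -> list[Doc]:
    """Rank docs by naive keyword overlap with the query (stand-in for real retrieval)."""
    terms = set(query.lower().split())
    ranked = sorted(docs, key=lambda d: -len(terms & set(d.body.lower().split())))
    return ranked[:k]


def build_prompt(query: str, docs: list[Doc]) -> str:
    """Assemble a prompt that constrains the model to answer only from cited excerpts."""
    context = "\n\n".join(f"[{d.title}]({d.url})\n{d.body}" for d in docs)
    return (
        "Answer the on-call question using ONLY the excerpts below. "
        "Cite the source title for every claim.\n\n"
        f"Excerpts:\n{context}\n\nQuestion: {query}\nAnswer:"
    )

# The prompt would then be sent to whatever model API is in use; that call is omitted here.
```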
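A sketch of what treating model quality as the SLO could look like, assuming a pipeline that samples and grades model outputs (by humans, heuristics, or an LLM grader). The names and the 99% target are assumptions for illustration, not anything stated in the talk.

```python
from dataclasses import dataclass


@dataclass
class GradedOutput:
    # One sampled model response, scored by a human, heuristic, or grader model.
    passed_quality_check: bool


def quality_sli(samples: list[GradedOutput]) -> float:
    """Fraction of sampled outputs that met the quality bar (the SLI)."""
    if not samples:
        return 1.0  # no evidence of failure; a real system should alert on missing samples
    return sum(s.passed_quality_check for s in samples) / len(samples)


def slo_breached(samples: list[GradedOutput], target: float = 0.99) -> bool:
    """Treat low model quality exactly like unavailability: page when the SLI drops below target."""
    return quality_sli(samples) < target
```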
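A sketch of the citations-and-trust requirement: an operational answer is rejected unless it carries at least one citation with a deep link the operator can follow. The data shapes are assumptions, not a described implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Citation:
    source_title: str  # e.g. a runbook or postmortem title
    deep_link: str     # URL pointing at the exact section the answer relies on


@dataclass
class AssistantAnswer:
    text: str
    citations: list[Citation] = field(default_factory=list)


def accept_answer(answer: AssistantAnswer) -> AssistantAnswer:
    """Refuse to surface operational advice the operator cannot trace back to a source."""
    if not answer.citations:
        raise ValueError("Answer rejected: no citations to verify against")
    return answer
```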
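A sketch of a responsible-scaling-style release gate: capability tiers map to required controls, and a model ships only when every control for its tier is in place. The tier and control names here are invented for illustration and do not reflect any actual policy.

```python
# Illustrative mapping only: tiers and control names are assumptions.
REQUIRED_CONTROLS = {
    "tier-1": {"baseline-security-review"},
    "tier-2": {"baseline-security-review", "red-team-evaluation"},
    "tier-3": {"baseline-security-review", "red-team-evaluation", "weights-access-hardening"},
}


def release_allowed(capability_tier: str, controls_in_place: set[str]) -> bool:
    """Release gate: ship only if all controls required by the model's tier are in place.

    Unknown tiers raise KeyError, so the gate fails closed rather than open.
    """
    return REQUIRED_CONTROLS[capability_tier] <= controls_in_place
```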
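A sketch of the correlation heuristic for fault isolation: a single degraded model points at that model, while broad simultaneous degradation points at shared infrastructure. The 50% threshold and the return strings are assumptions.

```python
def classify_degradation(degraded_models: set[str], all_models: set[str],
                         systemic_fraction: float = 0.5) -> str:
    """Classify a degradation signal as model-specific or systemic based on its breadth."""
    if not degraded_models:
        return "healthy"
    if len(degraded_models) == 1:
        return f"model-specific: investigate {next(iter(degraded_models))}"
    if len(degraded_models) / len(all_models) >= systemic_fraction:
        return "systemic: investigate shared infrastructure (serving, network, data pipeline)"
    return "mixed: look for a dependency shared by the degraded models"
```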
Memorable Quotes
"End to end model quality is the only SLO that people working on reliability for ML systems can have."
"What is it like to do technical work in a world where the execution becomes less and less important but the architecture, and the purpose, and the design are still important?"
Connecting to My Existing Knowledge
- The concept of Model Quality as an SLO directly extends the user-centric principles from Why MTTR is misleading and connects to the evolution of SRE practices discussed in Google - The Evolution of SRE. It forces a shift from measuring reliability by process health to actual user-perceived service quality.
- The discussion on Human Augmentation parallels the automation maturity levels - we're at "driver assistance" rather than "full automation." This connects to the platform engineering approaches in Github - How we tackle platform problems where humans remain central to complex system management.
- The Responsible Scaling Policy framework can be seen as a form of structured, proactive threat modeling for AI-specific risks, directly extending the reliability principles from Building Reliable Systems.
- The emphasis on Citations and Trust aligns with the observability principles in Honeycomb - How Much Should I Be Spending On Observability - the need for transparent, debuggable systems where operators can trace decisions back to their sources.
- Some of the challenges of managing ML models can be analyzed with System-Theoretic Accident Model and Processes - ML systems create complex control loops where traditional component-failure thinking breaks down. The "model producing harmful output while appearing healthy" problem is exactly the kind of hazard state STAMP was designed to analyze. STPA could proactively identify unsafe control actions in ML feedback loops, while CAST could help analyze incidents where model degradation went undetected.