AI in Production - From Hype to Practical SRE Applications

TL;DR / Abstract

Todd Underwood contrasts the unfulfilled hype of early AIOps with the practical reality of modern LLMs, which now serve as human-augmentation tools for SREs (e.g., config generation, incident triage). He argues that for any ML-powered service, the only meaningful reliability metric is end-to-end model quality, treated as an SLO.
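
To make the "model quality as an SLO" idea concrete, here is a minimal sketch of how such an SLO might be evaluated. Everything in it is an assumption for illustration and not something taken from the talk: the QualitySLO class, the slo_report function, the choice of a per-window quality score (for example, accuracy on a labelled sample), and the threshold values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class QualitySLO:
    """Hypothetical SLO: a share of evaluation windows must meet a quality bar."""
    min_quality: float  # assumed per-window quality floor, e.g. accuracy >= 0.90
    target: float       # fraction of windows that must meet the floor, e.g. 0.75


def slo_report(window_scores: list[float], slo: QualitySLO) -> dict:
    """Summarise SLO compliance over per-window model quality scores."""
    if not window_scores:
        raise ValueError("need at least one evaluation window")
    total = len(window_scores)
    # A window is "good" when its quality score meets the floor.
    good = sum(1 for score in window_scores if score >= slo.min_quality)
    bad = total - good
    compliance = good / total
    # Error budget: how many bad windows the target tolerates over this period.
    allowed_bad = (1 - slo.target) * total
    return {
        "compliance": compliance,
        "slo_met": compliance >= slo.target,
        "error_budget_remaining": allowed_bad - bad,
    }


if __name__ == "__main__":
    slo = QualitySLO(min_quality=0.90, target=0.75)
    print(slo_report([0.92, 0.91, 0.85, 0.93], slo))
```

The design choice here is to treat a drop in model quality the same way a classic SLO treats failed requests: each evaluation window either meets the bar or consumes error budget. How quality is actually measured end to end is service-specific and left open.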


Core Problem / Challenge

Key Solutions & Concepts Presented

Memorable Quotes

"End to end model quality is the only SLO that people working on reliability for ML systems can have."

"What is it like to do technical work in a world where the execution becomes less and less important but the architecture, and the purpose, and the design are still important?"


Connecting to My Existing Knowledge


Actionable Ideas & Open Questions