Principled Performance Analytics - Cohorting for Consistency
Problem
Traditional SLOs, often based on aggregate metrics and fixed thresholds, suffer from:
- Noise: Performance is heavily influenced by shifts in workload mix (e.g., different user behaviors, cache hit rates), obscuring real degradations.
- Difficult Calibration: Setting and maintaining appropriate thresholds for diverse and evolving workloads is a constant challenge.
- Lossy Information: Boolean "good/bad" classifications discard valuable context.
- Reactive & Late Detection: Problems are often identified only after becoming severe enough to breach broad SLOs.
Why do workloads matter?
A service's performance for a specific, well-defined type of workload should remain consistent over time. Significant deviations from this historical consistency are more meaningful indicators of problems than breaches of arbitrary, aggregate thresholds.
Solution
- Partition Workloads into Cohorts:
- Group incoming requests/events into "cohorts" that approximate a shared "intent" or key workload characteristic.
- Examples: Specific API endpoint, customer ID, request size, cache hit/miss status.
- Method: Often achieved by taking the cross-product of a few (3-5) identifying features.
- Build Historical Baselines for Each Cohort:
- For each cohort, analyze its historical performance (e.g., latency, error rate) over a defined period (e.g., the last 30 days).
- Model this performance with a statistical distribution (e.g., normal or log-normal), capturing its typical mean (μ) and standard deviation (σ).
- Score New Events with Z-Scores:
- When a new event arrives, classify it into its appropriate cohort.
- Calculate its Z-score relative to its cohort's historical baseline:
Z = (observed_value - μ_cohort) / σ_cohort
- The Z-score normalizes performance, indicating how many standard deviations the current event is from its cohort's historical norm.
- Detect "Excursions" (Performance Shifts):
- Monitor the fraction of events (globally or per-cohort) within a given time window whose Z-score exceeds a defined threshold (e.g., Z-score > 2, indicating an event is more than two standard deviations worse than its historical norm).
- A sustained, significant increase in this fraction signals a performance regression or "excursion" (see the sketch after this list).
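A minimal end-to-end sketch of the four steps above, in Python. All names here (cohort_key, Baseline, fit_baselines, excursion_fraction), the specific feature choices, and the log-normal latency model are illustrative assumptions, not details from the source:

```python
# Illustrative sketch only; names, features, and the latency model are assumptions.
import math
from collections import defaultdict
from dataclasses import dataclass


def cohort_key(event: dict) -> tuple:
    """Step 1: approximate shared 'intent' via a cross-product of a few features."""
    return (event["endpoint"], event["customer_id"], event["cache_hit"])


@dataclass
class Baseline:
    """Step 2: per-cohort mean and standard deviation of log-latency
    (a log-normal model; a plain normal model works the same way)."""
    mu: float
    sigma: float


def fit_baselines(history: list) -> dict:
    samples = defaultdict(list)
    for event in history:
        samples[cohort_key(event)].append(math.log(event["latency_ms"]))
    baselines = {}
    for key, xs in samples.items():
        mu = sum(xs) / len(xs)
        var = sum((x - mu) ** 2 for x in xs) / max(len(xs) - 1, 1)
        baselines[key] = Baseline(mu, max(math.sqrt(var), 1e-9))
    return baselines


def z_score(event: dict, baselines: dict):
    """Step 3: how many standard deviations this event sits from its
    cohort's historical norm."""
    baseline = baselines.get(cohort_key(event))
    if baseline is None:
        return None  # unseen cohort: no baseline to score against
    return (math.log(event["latency_ms"]) - baseline.mu) / baseline.sigma


def excursion_fraction(window: list, baselines: dict, threshold: float = 2.0) -> float:
    """Step 4: fraction of scored events in a time window beyond the threshold."""
    scores = [z for e in window if (z := z_score(e, baselines)) is not None]
    if not scores:
        return 0.0
    return sum(z > threshold for z in scores) / len(scores)
```

For example, a cohort whose latency drifts from a ~12-22 ms historical range to 40 ms scores well beyond two standard deviations, so the whole window counts as an excursion:

```python
history = [{"endpoint": "/search", "customer_id": "c1", "cache_hit": True,
            "latency_ms": 12.0 + 0.1 * i} for i in range(100)]
baselines = fit_baselines(history)
window = [{"endpoint": "/search", "customer_id": "c1", "cache_hit": True,
           "latency_ms": 40.0}]
print(excursion_fraction(window, baselines))  # 1.0: every event in the window is anomalous
```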
Benefits
- Significant Noise Reduction: Controls for expected variations due to workload mix, making real signals clearer.
- Increased Sensitivity & Proactive Detection: Can identify subtle degradations much earlier (e.g., 18-hour lead time in one Google example) than traditional threshold-based alerts.
- Relatively Calibration-Free: Baselines are dynamically derived from historical data, reducing the need for manual tuning of static thresholds.
- Improved Diagnosis:
- Identifies which specific cohorts/customers are most affected by an excursion.
- Can be applied hierarchically to system metrics to trace the root cause of an issue (e.g., total_time -> exec_time -> io_time).
- Quantitative Impact Assessment: The "area under the curve" of an excursion (severity and duration) provides a metric for its impact; this and per-cohort attribution are sketched after this list.
- Correlation Analysis: Can reveal when the performance of previously independent components (like a service and its dependency) starts to correlate, suggesting one is adversely impacting the other.
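Two of these diagnostic ideas can be sketched on top of the earlier pipeline (again hypothetical, reusing z_score and cohort_key from the sketch above; excursion_area and rank_affected_cohorts are invented names): impact as the area between the observed excursion fraction and the ~2.3% tail mass an ideal normal baseline would produce at Z > 2, and attribution of which cohorts are most affected.

```python
from collections import defaultdict


def excursion_area(fractions: list, window_minutes: float,
                   expected: float = 0.023) -> float:
    """Impact as 'area under the curve': excess excursion fraction (severity)
    integrated over successive windows (duration). 0.023 ~= P(Z > 2) for a
    standard normal distribution."""
    return sum(max(f - expected, 0.0) for f in fractions) * window_minutes


def rank_affected_cohorts(window: list, baselines: dict, threshold: float = 2.0):
    """Attribute an excursion: count anomalous events per cohort so the most
    affected cohorts/customers surface first."""
    counts = defaultdict(int)
    for event in window:
        z = z_score(event, baselines)
        if z is not None and z > threshold:
            counts[cohort_key(event)] += 1
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
```

The same scoring applies hierarchically: keep separate baselines for total_time, exec_time, and io_time per cohort, and the sub-metric whose Z-scores are elevated points at the layer responsible.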