Principled Performance Analytics - Cohorting for Consistency
Problem
Traditional SLOs, often based on aggregate metrics and fixed thresholds, suffer from:
- Noise: Performance is heavily influenced by shifts in workload mix (e.g., different user behaviors, cache hit rates), obscuring real degradations.
- Difficult Calibration: Setting and maintaining appropriate thresholds for diverse and evolving workloads is a constant challenge.
- Lossy Information: Boolean "good/bad" classifications discard valuable context.
- Reactive & Late Detection: Problems are often identified only after becoming severe enough to breach broad SLOs.
Why do workloads matter?
A service's performance for a specific, well-defined type of workload should remain consistent over time. Significant deviations from this historical consistency are more meaningful indicators of problems than breaches of arbitrary, aggregate thresholds.
Solution
- Partition Workloads into Cohorts:
- Group incoming requests/events into "cohorts" that approximate a shared "intent" or key workload characteristic.
- Examples: Specific API endpoint, customer ID, request size, cache hit/miss status.
- Method: Often achieved by taking the cross-product of a few (3-5) identifying features.
- Build Historical Baselines for Each Cohort:
- For each cohort, analyze its historical performance (e.g., latency, error rate) over a defined period (e.g., the last 30 days).
- Model this performance with a statistical distribution (e.g., normal or log-normal), capturing its typical mean (μ) and standard deviation (σ).
- Score New Events with Z-Scores:
- When a new event arrives, classify it into its appropriate cohort.
- Calculate its Z-score relative to its cohort's historical baseline:
Z = (observed_value - μ_cohort) / σ_cohort
- The Z-score normalizes performance, indicating how many standard deviations the current event is from its cohort's historical norm.
- Detect "Excursions" (Performance Shifts):
- Monitor the fraction of events (globally or per-cohort) within a given time window whose Z-score exceeds a defined threshold (e.g., Z-score > 2, indicating an event is more than two standard deviations worse than its historical norm).
- A sustained, significant increase in this fraction signals a performance regression or "excursion" (see the sketch after this list).
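A minimal end-to-end sketch of the four steps above, in Python. All names here (cohort_key, Baseline, fit_baselines, excursion_fraction), the specific feature choices, and the log-normal latency model are illustrative assumptions, not details from the source:

```python
# Illustrative sketch only; names, features, and the latency model are assumptions.
import math
from collections import defaultdict
from dataclasses import dataclass


def cohort_key(event: dict) -> tuple:
    """Step 1: approximate shared 'intent' via a cross-product of a few features."""
    return (event["endpoint"], event["customer_id"], event["cache_hit"])


@dataclass
class Baseline:
    """Step 2: per-cohort mean and standard deviation of log-latency
    (a log-normal model; a plain normal model works the same way)."""
    mu: float
    sigma: float


def fit_baselines(history: list) -> dict:
    samples = defaultdict(list)
    for event in history:
        samples[cohort_key(event)].append(math.log(event["latency_ms"]))
    baselines = {}
    for key, xs in samples.items():
        mu = sum(xs) / len(xs)
        var = sum((x - mu) ** 2 for x in xs) / max(len(xs) - 1, 1)
        baselines[key] = Baseline(mu, max(math.sqrt(var), 1e-9))
    return baselines


def z_score(event: dict, baselines: dict):
    """Step 3: how many standard deviations this event sits from its
    cohort's historical norm."""
    baseline = baselines.get(cohort_key(event))
    if baseline is None:
        return None  # unseen cohort: no baseline to score against
    return (math.log(event["latency_ms"]) - baseline.mu) / baseline.sigma


def excursion_fraction(window: list, baselines: dict, threshold: float = 2.0) -> float:
    """Step 4: fraction of scored events in a time window beyond the threshold."""
    scores = [z for e in window if (z := z_score(e, baselines)) is not None]
    if not scores:
        return 0.0
    return sum(z > threshold for z in scores) / len(scores)
```

For example, a cohort whose latency drifts from a ~12-22 ms historical range to 40 ms scores well beyond two standard deviations, so the whole window counts as an excursion:

```python
history = [{"endpoint": "/search", "customer_id": "c1", "cache_hit": True,
            "latency_ms": 12.0 + 0.1 * i} for i in range(100)]
baselines = fit_baselines(history)
window = [{"endpoint": "/search", "customer_id": "c1", "cache_hit": True,
           "latency_ms": 40.0}]
print(excursion_fraction(window, baselines))  # 1.0: every event in the window is anomalous
```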
Benefits
- Significant Noise Reduction: Controls for expected variations due to workload mix, making real signals clearer.
- Increased Sensitivity & Proactive Detection: Can identify subtle degradations much earlier (e.g., 18-hour lead time in one Google example) than traditional threshold-based alerts.
- Relatively Calibration-Free: Baselines are dynamically derived from historical data, reducing the need for manual tuning of static thresholds.
- Improved Diagnosis:
- Identifies which specific cohorts/customers are most affected by an excursion.
- Can be applied hierarchically to system metrics to trace the root cause of an issue (e.g., total_time -> exec_time -> io_time).
- Quantitative Impact Assessment: The "area under the curve" of an excursion (severity and duration) provides a metric for its impact; this and per-cohort attribution are sketched after this list.
- Correlation Analysis: Can reveal when the performance of previously independent components (like a service and its dependency) starts to correlate, suggesting one is adversely impacting the other.
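Two of these diagnostic ideas can be sketched on top of the earlier pipeline (again hypothetical, reusing z_score and cohort_key from the sketch above; excursion_area and rank_affected_cohorts are invented names): impact as the area between the observed excursion fraction and the ~2.3% tail mass an ideal normal baseline would produce at Z > 2, and attribution of which cohorts are most affected.

```python
from collections import defaultdict


def excursion_area(fractions: list, window_minutes: float,
                   expected: float = 0.023) -> float:
    """Impact as 'area under the curve': excess excursion fraction (severity)
    integrated over successive windows (duration). 0.023 ~= P(Z > 2) for a
    standard normal distribution."""
    return sum(max(f - expected, 0.0) for f in fractions) * window_minutes


def rank_affected_cohorts(window: list, baselines: dict, threshold: float = 2.0):
    """Attribute an excursion: count anomalous events per cohort so the most
    affected cohorts/customers surface first."""
    counts = defaultdict(int)
    for event in window:
        z = z_score(event, baselines)
        if z is not None and z > threshold:
            counts[cohort_key(event)] += 1
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
```

The same scoring applies hierarchically: keep separate baselines for total_time, exec_time, and io_time per cohort, and the sub-metric whose Z-scores are elevated points at the layer responsible.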