Principled Performance Analytics - Cohorting for Consistency

Problem

Traditional SLOs, often based on aggregate metrics and fixed thresholds, suffer from two related weaknesses: the thresholds are arbitrary, and the aggregates blur together workloads with very different performance profiles.

Why do workloads matter?

A service's performance for a specific, well-defined type of workload should remain consistent over time. Significant deviations from this historical consistency are more meaningful indicators of problems than breaches of arbitrary, aggregate thresholds.

Solution

  1. Partition Workloads into Cohorts:
    • Group incoming requests/events into "cohorts" that approximate a shared "intent" or key workload characteristic.
    • Examples: Specific API endpoint, customer ID, request size, cache hit/miss status.
    • Method: Often achieved by taking the cross-product of a few (3-5) identifying features (steps 1 and 2 are sketched in the first code example after this list).
  2. Build Historical Baselines for Each Cohort:
    • For each cohort, analyze its historical performance (e.g., latency, error rate) over a defined period (e.g., the last 30 days).
    • Model this performance with a statistical distribution (e.g., normal or log-normal), capturing its typical mean (μ) and standard deviation (σ). For log-normal metrics such as latency, fit μ and σ on the logarithm of the observed values.
  3. Score New Events with Z-Scores:
    • When a new event arrives, classify it into its appropriate cohort.
    • Calculate its Z-score relative to its cohort's historical baseline: (observed_value - μ_cohort) / σ_cohort.
    • The Z-score normalizes performance, indicating how many standard deviations the current event is from its cohort's historical norm.
  4. Detect "Excursions" (Performance Shifts):
    • Monitor the fraction of events (globally or per-cohort) within a given time window whose Z-score exceeds a defined threshold (e.g., Z-score > 2, indicating an event is more than two standard deviations worse than its historical norm).
    • A sustained, significant increase in this fraction signals a performance regression or "excursion" (the second sketch below shows this check).
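
To make steps 1 and 2 concrete, here is a minimal Python sketch. Everything in it is illustrative rather than from the source: the feature set in cohort_key, the 100 KB size bucket, the 30-sample minimum, and the assumption that latency is log-normal (so μ and σ are fit on log-latency).

```python
import math
import statistics
from collections import defaultdict

def cohort_key(event: dict) -> tuple:
    # Step 1: approximate shared "intent" by crossing a few identifying
    # features. This particular trio (endpoint, size bucket, cache status)
    # is a hypothetical example.
    return (
        event["endpoint"],
        "large" if event["request_bytes"] > 100_000 else "small",
        "hit" if event["cache_hit"] else "miss",
    )

def build_baselines(history: list[dict]) -> dict:
    # Step 2: per-cohort mean and standard deviation over a historical
    # window (e.g., the last 30 days of events). Latency is assumed
    # log-normal, so the parameters are fit on log-latency.
    samples = defaultdict(list)
    for event in history:
        samples[cohort_key(event)].append(math.log(event["latency_ms"]))
    return {
        key: (statistics.mean(vals), statistics.stdev(vals))
        for key, vals in samples.items()
        if len(vals) >= 30  # skip cohorts too sparse to model reliably
    }
```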
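
Continuing the same sketch, steps 3 and 4 score each new event against its cohort's baseline and watch the fraction of high-Z events in a window. The Z > 2 threshold comes from the text above; skipping events from unseen or too-sparse cohorts is an added assumption.

```python
def z_score(event: dict, baselines: dict):
    # Step 3: how many standard deviations this event sits from its
    # cohort's historical norm (on the log-latency scale used above).
    baseline = baselines.get(cohort_key(event))
    if baseline is None:
        return None  # unseen or too-sparse cohort: nothing to compare against
    mu, sigma = baseline
    return (math.log(event["latency_ms"]) - mu) / sigma

def excursion_fraction(window: list[dict], baselines: dict,
                       threshold: float = 2.0) -> float:
    # Step 4: fraction of scoreable events in the window that are more
    # than `threshold` standard deviations worse than their cohort's norm.
    scores = [z for e in window if (z := z_score(e, baselines)) is not None]
    if not scores:
        return 0.0
    return sum(z > threshold for z in scores) / len(scores)
```

Under a normal model, roughly 2% of events exceed Z = 2 by chance, so the useful signal is a sustained climb well above that base rate across consecutive windows, not any single high-Z event.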

Benefits

  • Baselines reflect each workload's own history rather than arbitrary, aggregate thresholds.
  • Z-scores put heterogeneous cohorts on a common scale, so a single excursion signal can cover the whole service.
  • Sustained shifts surface as a rising excursion fraction instead of noisy single-event alerts.
