Profiling Data
Core Challenges Discussed:
- Data Overload: Modern systems generate massive amounts of performance data (profiling, metrics, traces), making it hard to store, query, and interpret.
- Noise vs. Signal: Differentiating meaningful performance changes from natural system variability or workload mix shifts is a constant struggle.
- Limitations of Traditional SLOs: Fixed thresholds and simple error counts often fail to capture nuanced performance issues or provide early warnings.
- Accessibility of Insights: Performance data and its analysis need to be accessible to all developers, not just performance experts, to foster a culture of efficiency.
Key Solutions & Concepts Presented:
Optimizing Profiler Data Storage & Querying (Pat Somaru, Meta)
- Addresses the challenge of voluminous and redundant profiler data by representing call stacks as a Directed Acyclic Graph (DAG); a sketch of this idea appears after this list.
- Focuses on reducing data footprint and enabling efficient, powerful querying for tasks like regression detection.
- Introduces the "time series of graphs" concept for understanding performance evolution.
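A minimal sketch of the prefix-sharing idea, not Meta's actual implementation (all names here are invented): each unique (parent, frame) pair is interned as one node, so stacks that share a prefix share storage. Strictly speaking this builds a prefix tree, the simplest DAG-shaped de-duplication; real systems can also share common sub-stacks.

```python
from collections import defaultdict

class StackDAG:
    """Toy call-stack store: each unique (parent, frame) pair becomes
    one node, so stacks sharing a prefix are stored once."""

    ROOT = 0

    def __init__(self):
        self.nodes = {}                      # (parent_id, frame) -> node_id
        self.leaf_weight = defaultdict(int)  # node_id -> sample count

    def add_stack(self, frames, count=1):
        """Insert one sampled stack, outermost frame first."""
        parent = self.ROOT
        for frame in frames:
            parent = self.nodes.setdefault((parent, frame), len(self.nodes) + 1)
        self.leaf_weight[parent] += count

dag = StackDAG()
dag.add_stack(["main", "serve", "parse"], count=40)
dag.add_stack(["main", "serve", "render"], count=25)
print(len(dag.nodes))  # 4 nodes for 6 raw frames: "main"/"serve" are shared
```

Building one such graph per time window and diffing node weights across windows is one way to read the "time series of graphs" idea for regression detection.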
Principled Performance Analytics - Cohorting for Consistency (Narayan Desai, Google)
- Tackles the noise in traditional SLOs by focusing on workload consistency.
- Proposes partitioning workloads into "cohorts" based on shared intent/characteristics.
- Uses historical baselines and Z-scores for each cohort to detect statistically significant deviations ("excursions") from normal behavior; see the sketch just below.
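A hedged sketch of the cohort baseline/Z-score mechanic described above (cohort keys, data, and the threshold are invented for illustration; this is not Google's actual pipeline):

```python
import statistics

def detect_excursions(history, current, z_threshold=3.0):
    """Flag cohorts whose current metric deviates from their own
    historical baseline by more than z_threshold standard deviations.

    history: {cohort: [past samples]}, current: {cohort: latest value}
    """
    excursions = {}
    for cohort, samples in history.items():
        if len(samples) < 2 or cohort not in current:
            continue
        mean = statistics.fmean(samples)
        stdev = statistics.stdev(samples)
        if stdev == 0:
            continue
        z = (current[cohort] - mean) / stdev
        if abs(z) > z_threshold:
            excursions[cohort] = z
    return excursions

history = {"checkout:EU": [102, 98, 101, 99, 100],
           "checkout:US": [120, 118, 122, 121, 119]}
print(detect_excursions(history, {"checkout:EU": 140, "checkout:US": 121}))
# Only checkout:EU is flagged; checkout:US sits within its own baseline.
```

Because each cohort is compared only to itself, a shift in workload mix (say, more US traffic) does not masquerade as a latency regression.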
Highlights
- Data Representation & Modeling is Foundational:
- The way observability data is structured profoundly impacts its utility.
- Example: Treating profiler call stacks as graphs (Meta) or modeling diverse workloads as distinct cohorts (Google) unlocks deeper insights.
- Prioritize Signal-to-Noise Ratio:
- A primary goal of advanced analytics is to reduce noise and surface true signals.
- Example: Google's cohorting normalizes for workload-mix effects; Meta's DAGs de-duplicate redundant stack data.
- Accessibility for All Engineers:
- Performance insights shouldn't be confined to experts. Tools and systems should aim to make performance implications clear and actionable for every developer.
- Example: IDE feedback based on Meta's profiler data; simplified query interfaces.
- Historical Context is More Powerful Than Static Thresholds:
- Comparing a system's current behavior to its own past behavior, especially within specific contexts (like workload cohorts), often yields more meaningful signals of change than comparison against arbitrary static thresholds; a toy contrast follows below.
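To make the contrast concrete, a toy comparison (all numbers and thresholds are invented):

```python
def static_alert(value_ms, threshold_ms=200.0):
    """Fixed SLO-style check: fires only when an absolute limit is crossed."""
    return value_ms > threshold_ms

def baseline_alert(recent_ms, value_ms, ratio=1.5):
    """Self-relative check: fires when the value exceeds the recent
    median by a given ratio, regardless of absolute level."""
    ordered = sorted(recent_ms)
    median = ordered[len(ordered) // 2]
    return value_ms > ratio * median

recent = [40, 42, 41, 39, 40]       # this service normally runs ~40 ms
print(static_alert(90))             # False: 90 ms looks "fine" to the fixed SLO
print(baseline_alert(recent, 90))   # True: latency more than doubled vs its own past
```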
- Efficiency at Scale Matters:
- Small performance changes or data storage/processing inefficiencies can have a massive cost and operational impact at scale.
- Example: Meta's profiler-data compression; Google's ability to process vast event streams. (A back-of-the-envelope illustration follows.)
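Back-of-the-envelope arithmetic (fleet size, cost, and regression figures are hypothetical) showing why sub-percent changes matter at scale:

```python
# All numbers below are invented purely for illustration.
fleet_cores = 1_000_000        # cores in the fleet
cost_per_core_year = 300.0     # dollars per core-year
regression = 0.005             # a 0.5% CPU regression

wasted_core_years = fleet_cores * regression
print(f"{wasted_core_years:,.0f} core-years ≈ "
      f"${wasted_core_years * cost_per_core_year:,.0f}/year")
# 5,000 core-years ≈ $1,500,000/year from a change too small
# to trip most static alert thresholds.
```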
- Look Beyond the Traditional "Three Pillars":
- While metrics, logs, and traces are essential, deep system understanding often requires richer data types (like detailed profiler data) and more sophisticated analytical approaches that can model complex behaviors.
- Understand Your Workloads Deeply:
- "Workloads matter." The inputs to your system and the different ways it's used are critical context for interpreting performance data. Avoid treating the system as a monolithic black box.