Profiling Data
Core Challenges Discussed:
- Data Overload: Modern systems generate massive amounts of performance data (profiling, metrics, traces), making it hard to store, query, and interpret.
- Noise vs. Signal: Differentiating meaningful performance changes from natural system variability or workload mix shifts is a constant struggle.
- Limitations of Traditional SLOs: Fixed thresholds and simple error counts often fail to capture nuanced performance issues or provide early warnings.
- Accessibility of Insights: Performance data and its analysis need to be accessible to all developers, not just performance experts, to foster a culture of efficiency.
Key Solutions & Concepts Presented:
Optimizing Profiler Data Storage & Querying (Pat Somaru, Meta)
- Addresses the challenge of voluminous and redundant profiler data by representing call stacks as a Directed Acyclic Graph (DAG); a sketch of this idea appears after this list.
- Focuses on reducing data footprint and enabling efficient, powerful querying for tasks like regression detection.
- Introduces the "time series of graphs" concept for understanding performance evolution.
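A minimal sketch of the prefix-sharing idea, not Meta's actual implementation (all names here are invented): each unique (parent, frame) pair is interned as one node, so stacks that share a prefix share storage. Strictly speaking this builds a prefix tree, the simplest DAG-shaped de-duplication; real systems can also share common sub-stacks.

```python
from collections import defaultdict

class StackDAG:
    """Toy call-stack store: each unique (parent, frame) pair becomes
    one node, so stacks sharing a prefix are stored once."""

    ROOT = 0

    def __init__(self):
        self.nodes = {}                      # (parent_id, frame) -> node_id
        self.leaf_weight = defaultdict(int)  # node_id -> sample count

    def add_stack(self, frames, count=1):
        """Insert one sampled stack, outermost frame first."""
        parent = self.ROOT
        for frame in frames:
            parent = self.nodes.setdefault((parent, frame), len(self.nodes) + 1)
        self.leaf_weight[parent] += count

dag = StackDAG()
dag.add_stack(["main", "serve", "parse"], count=40)
dag.add_stack(["main", "serve", "render"], count=25)
print(len(dag.nodes))  # 4 nodes for 6 raw frames: "main"/"serve" are shared
```

Building one such graph per time window and diffing node weights across windows is one way to read the "time series of graphs" idea for regression detection.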
Principled Performance Analytics - Cohorting for Consistency (Narayan Desai, Google)
- Tackles the noise in traditional SLOs by focusing on workload consistency.
- Proposes partitioning workloads into "cohorts" based on shared intent/characteristics.
- Uses historical baselines and Z-scores for each cohort to detect statistically significant deviations ("excursions") from normal behavior; see the sketch just below.
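A hedged sketch of the cohort baseline/Z-score mechanic described above (cohort keys, data, and the threshold are invented for illustration; this is not Google's actual pipeline):

```python
import statistics

def detect_excursions(history, current, z_threshold=3.0):
    """Flag cohorts whose current metric deviates from their own
    historical baseline by more than z_threshold standard deviations.

    history: {cohort: [past samples]}, current: {cohort: latest value}
    """
    excursions = {}
    for cohort, samples in history.items():
        if len(samples) < 2 or cohort not in current:
            continue
        mean = statistics.fmean(samples)
        stdev = statistics.stdev(samples)
        if stdev == 0:
            continue
        z = (current[cohort] - mean) / stdev
        if abs(z) > z_threshold:
            excursions[cohort] = z
    return excursions

history = {"checkout:EU": [102, 98, 101, 99, 100],
           "checkout:US": [120, 118, 122, 121, 119]}
print(detect_excursions(history, {"checkout:EU": 140, "checkout:US": 121}))
# Only checkout:EU is flagged; checkout:US sits within its own baseline.
```

Because each cohort is compared only to itself, a shift in workload mix (say, more US traffic) does not masquerade as a latency regression.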
Highlights
- Data Representation & Modeling is Foundational:
- The way observability data is structured profoundly impacts its utility.
- Example: Treating profiler call stacks as graphs (Meta) or modeling diverse workloads as distinct cohorts (Google) unlocks deeper insights.
- Prioritize Signal-to-Noise Ratio:
- A primary goal of advanced analytics is to reduce noise and surface true signals.
- Example: Google's cohorting normalizes for workload-mix effects; Meta's DAGs de-duplicate redundant stack data.
- Accessibility for All Engineers:
- Performance insights shouldn't be confined to experts. Tools and systems should aim to make performance implications clear and actionable for every developer.
- Example: IDE feedback based on Meta's profiler data; simplified query interfaces.
- Historical Context is More Powerful Than Static Thresholds:
- Comparing a system's current behavior to its own past behavior, especially within specific contexts (like workload cohorts), often yields more meaningful signals of change than comparison against arbitrary static thresholds; a toy contrast follows below.
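To make the contrast concrete, a toy comparison (all numbers and thresholds are invented):

```python
def static_alert(value_ms, threshold_ms=200.0):
    """Fixed SLO-style check: fires only when an absolute limit is crossed."""
    return value_ms > threshold_ms

def baseline_alert(recent_ms, value_ms, ratio=1.5):
    """Self-relative check: fires when the value exceeds the recent
    median by a given ratio, regardless of absolute level."""
    ordered = sorted(recent_ms)
    median = ordered[len(ordered) // 2]
    return value_ms > ratio * median

recent = [40, 42, 41, 39, 40]       # this service normally runs ~40 ms
print(static_alert(90))             # False: 90 ms looks "fine" to the fixed SLO
print(baseline_alert(recent, 90))   # True: latency more than doubled vs its own past
```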
- Efficiency at Scale Matters:
- Small performance changes or data storage/processing inefficiencies can have a massive cost and operational impact at scale.
- Example: Meta's profiler-data compression; Google's ability to process vast event streams. (A back-of-the-envelope illustration follows.)
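Back-of-the-envelope arithmetic (fleet size, cost, and regression figures are hypothetical) showing why sub-percent changes matter at scale:

```python
# All numbers below are invented purely for illustration.
fleet_cores = 1_000_000        # cores in the fleet
cost_per_core_year = 300.0     # dollars per core-year
regression = 0.005             # a 0.5% CPU regression

wasted_core_years = fleet_cores * regression
print(f"{wasted_core_years:,.0f} core-years ≈ "
      f"${wasted_core_years * cost_per_core_year:,.0f}/year")
# 5,000 core-years ≈ $1,500,000/year from a change too small
# to trip most static alert thresholds.
```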
- Look Beyond the Traditional "Three Pillars":
- While metrics, logs, and traces are essential, deep system understanding often requires richer data types (like detailed profiler data) and more sophisticated analytical approaches that can model complex behaviors.
- Understand Your Workloads Deeply:
- "Workloads matter." The inputs to your system and the different ways it's used are critical context for interpreting performance data. Avoid treating the system as a monolithic black box.