Optimizing Profiler Data Storage & Querying

Problem

Profiler data (from tools such as perf or YourKit) is crucial for understanding application behavior, but it is extremely voluminous and highly redundant. As a result, traditional storage and querying are inefficient and costly.

Sto's ("Store Things Optimized") Solution

The DAG

  1. stack_node_data: Represents a unique code location (symbol, filename, line number). This has relatively low cardinality.
    • Fields: id (pk), line (int), file (text), symbol (text)
  2. executable: Represents a unique binary build (name, version). Also low cardinality.
    • Fields: id (pk), name (text), version (int), samples (int)
  3. stack_node: The vertices of the DAG. Each stack_node records:
    • A specific stack_node_data (code location).
    • The specific executable (binary build) in which it occurred.
    • A specific parent_id (another stack_node, or null for roots).
    • An aggregated sample_count (how many times this exact path was observed).
    • An id, typically a hash of (parent_id, exe_id, data_id).
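
The three tables and the hash-derived id can be sketched as follows. This is a minimal illustration, not Sto's actual schema or ingest code: it assumes SQLite, a truncated SHA-256 as the hash, and hypothetical helpers (node_id, data_row, ingest_stack). It also assumes that sample_count is incremented only on the node where a sampled stack ends, and that executable.samples holds the total number of samples ingested for that build; the source does not spell out either detail.

```python
import hashlib
import sqlite3

# Assumed DDL mirroring the fields listed above; constraints are illustrative.
SCHEMA = """
CREATE TABLE stack_node_data (
    id INTEGER PRIMARY KEY,
    line INTEGER, file TEXT, symbol TEXT,
    UNIQUE (line, file, symbol)
);
CREATE TABLE executable (
    id INTEGER PRIMARY KEY,
    name TEXT, version INTEGER, samples INTEGER DEFAULT 0,
    UNIQUE (name, version)
);
CREATE TABLE stack_node (
    id TEXT PRIMARY KEY,
    parent_id TEXT REFERENCES stack_node(id),
    exe_id INTEGER NOT NULL REFERENCES executable(id),
    data_id INTEGER NOT NULL REFERENCES stack_node_data(id),
    sample_count INTEGER NOT NULL DEFAULT 0
);
"""

def node_id(parent_id, exe_id, data_id):
    """Deterministic id from (parent_id, exe_id, data_id); hash choice is an assumption."""
    key = f"{parent_id}|{exe_id}|{data_id}".encode()
    return hashlib.sha256(key).hexdigest()[:16]

def data_row(conn, symbol, file, line):
    """Return the id of the code location, creating the row on first sight."""
    conn.execute(
        "INSERT OR IGNORE INTO stack_node_data (line, file, symbol) VALUES (?, ?, ?)",
        (line, file, symbol),
    )
    return conn.execute(
        "SELECT id FROM stack_node_data WHERE line = ? AND file = ? AND symbol = ?",
        (line, file, symbol),
    ).fetchone()[0]

def ingest_stack(conn, exe_id, frames, count=1):
    """Insert one root-first stack of (symbol, file, line) frames observed `count` times."""
    parent = None
    for symbol, file, line in frames:
        data_id = data_row(conn, symbol, file, line)
        nid = node_id(parent, exe_id, data_id)
        # An identical path hashes to the same id, so a repeated path adds no row.
        conn.execute(
            "INSERT OR IGNORE INTO stack_node (id, parent_id, exe_id, data_id) "
            "VALUES (?, ?, ?, ?)",
            (nid, parent, exe_id, data_id),
        )
        parent = nid
    # Assumption: only the last node on the path accumulates the samples.
    conn.execute(
        "UPDATE stack_node SET sample_count = sample_count + ? WHERE id = ?",
        (count, parent),
    )
    conn.execute(
        "UPDATE executable SET samples = samples + ? WHERE id = ?", (count, exe_id)
    )
    return parent

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
exe = conn.execute(
    "INSERT INTO executable (name, version) VALUES ('demo', 1)"
).lastrowid

main = ("main", "demo.c", 10)  # made-up frames for illustration
ingest_stack(conn, exe, [main, ("doLogging", "demo.c", 42)], count=3)
ingest_stack(conn, exe, [main, ("doWork", "demo.c", 57)], count=2)
ingest_stack(conn, exe, [main, ("doLogging", "demo.c", 42)], count=1)

node_total = conn.execute("SELECT COUNT(*) FROM stack_node").fetchone()[0]
logging_count = conn.execute(
    "SELECT n.sample_count FROM stack_node n "
    "JOIN stack_node_data d ON d.id = n.data_id WHERE d.symbol = 'doLogging'"
).fetchone()[0]
exe_samples = conn.execute(
    "SELECT samples FROM executable WHERE id = ?", (exe,)
).fetchone()[0]
print(node_total, logging_count, exe_samples)  # 3 4 6
```

Six samples across three stacks produce only three stack_node rows, because the shared main frame and the repeated main > doLogging path each map to a single vertex.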

Benefits

Example

If a function doLogging in demo.c becomes 10x more expensive in a new version (version two) compared to an old one (version one), Sto's CLI can ingest profiles for both. A generic SQL query (e.g., findRegressions) can then compare the sample_count totals for the doLogging stack_node_data across the two executable rows (one per build), clearly highlighting the regression.
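
A query in the spirit of findRegressions might look like the sketch below. The source does not show the real query, so the SQL, the parameter names (:old_version, :new_version, :min_ratio), the exe_id and data_id column names on stack_node, and all row values are assumptions chosen to reproduce the 10x doLogging case. It compares raw sample_count totals per code location between two versions of the same executable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE stack_node_data (id INTEGER PRIMARY KEY, line INTEGER, file TEXT, symbol TEXT);
CREATE TABLE executable (id INTEGER PRIMARY KEY, name TEXT, version INTEGER, samples INTEGER);
CREATE TABLE stack_node (id TEXT PRIMARY KEY, parent_id TEXT, exe_id INTEGER,
                         data_id INTEGER, sample_count INTEGER);

-- Made-up profile data: doLogging goes from 10 to 100 samples, doWork stays flat.
INSERT INTO stack_node_data VALUES
    (1, 10, 'demo.c', 'main'), (2, 42, 'demo.c', 'doLogging'), (3, 57, 'demo.c', 'doWork');
INSERT INTO executable VALUES (1, 'demo', 1, 100), (2, 'demo', 2, 190);
INSERT INTO stack_node VALUES
    ('a1', NULL, 1, 1, 0), ('a2', 'a1', 1, 2, 10),  ('a3', 'a1', 1, 3, 90),
    ('b1', NULL, 2, 1, 0), ('b2', 'b1', 2, 2, 100), ('b3', 'b1', 2, 3, 90);
""")

# Hypothetical stand-in for findRegressions: total samples per code location and
# version, then keep locations whose total grew by at least :min_ratio.
FIND_REGRESSIONS = """
WITH per_build AS (
    SELECT n.data_id, e.version, SUM(n.sample_count) AS samples
    FROM stack_node n
    JOIN executable e ON e.id = n.exe_id
    WHERE e.name = :name
    GROUP BY n.data_id, e.version
)
SELECT d.symbol, d.file,
       base.samples AS old_samples,
       cand.samples AS new_samples,
       1.0 * cand.samples / base.samples AS ratio
FROM per_build base
JOIN per_build cand ON cand.data_id = base.data_id
JOIN stack_node_data d ON d.id = base.data_id
WHERE base.version = :old_version
  AND cand.version = :new_version
  AND base.samples > 0
  AND 1.0 * cand.samples / base.samples >= :min_ratio
ORDER BY ratio DESC
"""

rows = conn.execute(
    FIND_REGRESSIONS,
    {"name": "demo", "old_version": 1, "new_version": 2, "min_ratio": 2.0},
).fetchall()
print(rows)  # [('doLogging', 'demo.c', 10, 100, 10.0)]
```

Because the comparison groups by stack_node_data rather than by individual stack_node, it reports the code location regardless of which call paths reach it. Dividing each total by executable.samples would be one way to normalize profiles of different lengths; whether Sto does so is not stated in the source.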

Resources