/Pydantic Logfire

Simpler metric queries in Logfire

4 mins

Simpler metric queries in Logfire

Logfire lets you query all of your telemetry with plain SQL, and that has always included metrics. But until recently, "you can query it with SQL" came with an asterisk for histogram metrics.

In fairness, that asterisk stopped hurting as much once LLMs got good at SQL. Exponential histogram bucket math is exactly the kind of mechanical, well-documented transformation a model will happily write for you, and in practice that is how these queries got written: you ask a coding agent or the AI assistant in Logfire for a p95, and out comes something like this:

WITH indices AS (
    SELECT *, unnest(generate_series(1, array_length(exp_histogram_positive_bucket_counts, 1)::integer))
        AS bucket_index
    FROM metrics
    WHERE metric_name = 'http.server.duration'
),
counts AS (
    SELECT *, exp_histogram_positive_bucket_counts[bucket_index] AS bucket_count
    FROM indices
),
bounds AS (
    SELECT counts.*,
        greatest(histogram_min,
            power(2, (exp_histogram_positive_bucket_counts_offset + bucket_index - 1) / power(2.0, exp_histogram_scale))) AS lower_bound,
        least(histogram_max,
            power(2, (exp_histogram_positive_bucket_counts_offset + bucket_index) / power(2.0, exp_histogram_scale))) AS upper_bound
    FROM counts
    WHERE bucket_count > 0
),
bounds_with_index AS (
    SELECT unnest(generate_series(0, bucket_count - 1)) AS i, *
    FROM bounds
),
bucket_item_approx AS (
    SELECT
        CASE WHEN bucket_count = 1 THEN (lower_bound + upper_bound) / 2
            ELSE lower_bound + i * (upper_bound - lower_bound) / (bucket_count - 1)
        END AS approx_item, *
    FROM bounds_with_index
)
SELECT
    time_bucket('5 minutes', recorded_timestamp) AS x,
    approx_percentile_cont(approx_item, 0.95) AS p95
FROM bucket_item_approx
GROUP BY x
ORDER BY x

That query manually unpacks the bucket counts of an OpenTelemetry exponential histogram, reconstructs each bucket's boundaries from its scale and offset, approximates the individual observations, and only then computes a percentile. It works. We shipped it in our own standard dashboards. And because a model can produce it on demand, you could argue the problem was already solved.

We don't think so. Now you write this:

SELECT
    time_bucket('5 minutes', recorded_timestamp) AS x,
    metric_quantile(0.95, value) AS p95
FROM metrics
WHERE metric_name = 'http.server.duration'
GROUP BY x
ORDER BY x

And it is not just for histograms. Change the metric name to a gauge like process.cpu.utilization and the exact same query works. That is the whole point.

"An LLM can write the long version" is true, but it gets the economics backwards. The short version is better precisely because LLMs are now writing and reading most of these queries:

  • Fewer tokens to generate. Forty lines of bucket math cost real latency and money every time an agent writes them, and every generated line is another chance to slip on a detail (the scale, an off-by-one in the offset) that still returns plausible-looking numbers.
  • Less context to carry. Queries don't just get written once; they get saved into dashboards, alerts, and notebooks that agents later read back. A one-line query costs almost nothing in a context window. A 40-line CTE chain repeated across a dozen dashboard panels crowds out the things you actually want the model to think about.
  • Verifiable at a glance. A human reviewing an agent's work can check metric_quantile(0.95, value) instantly. Nobody is re-deriving exponential bucket boundaries in a code review.

The goal is not to replace the model. It is to give it (and you) a vocabulary where the obvious thing to write is short, cheap, and hard to get wrong.

The underlying problem is that the OpenTelemetry metrics data model stores different instrument types differently. Gauges and counters carry a single scalar reading. Explicit histograms carry bucket counts and boundaries. Exponential histograms carry bucket counts, a scale, and offsets. Our metrics table exposed each of these as separate flat columns (scalar_value, histogram_sum, exp_histogram_positive_bucket_counts, and so on), which meant every query had to know which kind of metric it was dealing with.

The new value column unifies all of them into a single struct, and a family of metric_* functions dispatches on the metric type internally:

Function What it computes
metric_quantile(p, value) An observation quantile (e.g. p95) across the group
metric_rate(value, ts) Per-second rate of change of a counter
metric_increase(value, ts) Total increase of a counter
metric_merge(value) Folds a group of data points into one value struct
metric_avg(value) A data point's average value
metric_sum(value) A data point's total
metric_count(value) The number of observations behind a data point
metric_min(value) / metric_max(value) The smallest / largest observation

The first four are aggregates. The rest operate per row and compose with regular SQL aggregates, so summary statistics look like this for any metric type:

SELECT
    time_bucket('5 minutes', recorded_timestamp) AS x,
    SUM(metric_sum(value)) / NULLIF(SUM(metric_count(value)), 0) AS average,
    MAX(metric_max(value)) AS worst
FROM metrics
WHERE metric_name = 'http.server.duration'
GROUP BY x

Counters get special treatment. OpenTelemetry counters come in two aggregation temporalities, delta and cumulative, and they need different math: deltas just sum, while cumulative series need reset detection. The value struct carries the temporality with the data, so metric_rate and metric_increase do the right thing per row without you writing WHERE aggregation_temporality = ... filters.

Before announcing this, we converted every metrics query we ship to the new functions: the standard dashboards (including the web server metrics dashboard that carried the 40-line monster above), the Hosts and Kubernetes pages, the Metrics Explorer, and the per-service metrics tab in the Live view.

That dogfooding caught real issues. The first implementation of metric_quantile was too slow on wide time ranges, so we made it much faster; the ported web server dashboard now loads consistently faster than the old hand-unpacked version. We also hit a genuine semantic edge: a single query spanning http.server.duration (emitted as an explicit histogram by some SDKs) and http.server.request.duration (an exponential histogram) cannot merge the two bucket layouts into one quantile estimate. In that case metric_quantile fails loudly instead of returning a wrong answer, and you add a metric_type filter or group by metric_type.

One user-visible win fell out for free: the Metrics Explorer now offers p50, p95, and p99 for histogram metrics. Observation percentiles on histograms simply were not possible in the UI before.

The functions are available now in Logfire, anywhere you write SQL: the Explore view, dashboards, alerts, and the query API. The old flat columns still exist and still work, so nothing breaks; the new functions are just the better way.

Pick a histogram metric your services already emit (any http.server.* duration metric is a good start), open Explore, and ask for a percentile in one line. The metrics querying docs have the full function reference and examples.