Matthew works on Fusionfire, our internal database that backs our observability platform Logfire. He likes file formats.
Last month, I spent our quarterly hackathon implementing Hilbert curve sorting in Fusionfire. The goal was to speed up queries by replacing our lexicographic sort order with one that preserves locality across multiple columns at once, improving row group pruning in the process.
Unsurprisingly, our data wasn't curvy enough. Benchmarks showed Hilbert curve sorting regressed query performance and row group pruning.
Why sort order matters
We write Parquet files, and our query engine prunes row groups by checking min/max statistics against incoming filters. Sort order directly controls how tight those statistics are. With a lexicographic sort on columns (A, B, C), A gets near-perfect ranges, B gets decent ranges within each A-group, and every column after that gets progressively worse.
Rows sorted lexicographically on (A, B, C) and split into two row groups:
Row Group 1
| Row | A (env) | B (kind) | C (service) |
|---|---|---|---|
| 1 | production | Error | api |
| 2 | production | Error | auth |
| 3 | production | Info | api |
| 4 | production | Info | billing |
Row Group 2
| Row | A (env) | B (kind) | C (service) |
|---|---|---|---|
| 5 | staging | Error | api |
| 6 | staging | Info | auth |
| 7 | staging | Warn | api |
| 8 | staging | Warn | billing |
| | A (env) | B (kind) | C (service) |
|---|---|---|---|
| RG 1 min/max | production, production | Error, Info | api, billing |
| RG 2 min/max | staging, staging | Error, Warn | api, billing |
A has single-value ranges in both row groups. B spans a few values. C spans the full range in every row group. A filter on A prunes perfectly. A filter on C prunes nothing.
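The pruning check itself is simple. Here's a minimal sketch in Rust; the type and function names are illustrative, not Fusionfire's actual code:

```rust
/// Min/max statistics for one column of one row group.
/// (Illustrative type, not Fusionfire's real zone map representation.)
struct ZoneMap<'a> {
    min: &'a str,
    max: &'a str,
}

/// A row group survives an equality filter only if the value falls
/// inside its min/max range; otherwise it can be skipped entirely.
fn may_contain(zm: &ZoneMap, value: &str) -> bool {
    zm.min <= value && value <= zm.max
}
```

With the stats above, a filter on A for `production` falls outside row group 2's `staging, staging` range and prunes it, while a filter on C for `auth` falls inside the `api, billing` range of both row groups and prunes nothing.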
Our query patterns
Logfire queries vary widely in what they filter on. A user watching the live view might want all error spans for a specific service in the past 10 minutes. A dashboard might track the error count of a single endpoint over 30 days. And the explore view is a full SQL editor where users write whatever they want, filtering on any combination of columns.
Here's a simplified version of the schema we use in production:
- deployment_environment: the environment, such as production or staging
- kind: the log level, such as Info, Warn, or Error
- service_name: the service that emitted the span
- start_timestamp: when the span started
- trace_id: a ULID that identifies the trace
- span_name: the human-readable span message
Most of our queries filter on deployment_environment, kind, or service_name, so we maintain a lexicographic sort on those three columns, ordered by increasing cardinality. We don't include start_timestamp in the sort key because we already partition files by day. Each file covers a single day, so the timestamp range within a file is naturally bounded. The partition scheme gives us coarse-grained time pruning at the file level, and the lexicographic sort handles the fine-grained pruning within each file.
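In code, a lexicographic sort on those three columns is just a tuple comparison. A minimal sketch, with a hypothetical `Row` type standing in for our real schema:

```rust
/// Hypothetical row type mirroring the simplified schema above,
/// not Fusionfire's actual representation.
struct Row {
    env: String,     // deployment_environment
    kind: String,    // log level
    service: String, // service_name
}

/// Lexicographic sort ordered by increasing cardinality:
/// env first, then kind, then service_name.
fn sort_rows(rows: &mut [Row]) {
    rows.sort_by(|a, b| {
        (&a.env, &a.kind, &a.service).cmp(&(&b.env, &b.kind, &b.service))
    });
}
```

Because env is compared first, it stays constant for long stretches of the sorted output, which is exactly why its row group ranges end up so tight.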
This works well for the columns in the sort key, but the benefit is entirely front-loaded. Leading columns get tight ranges, trailing columns get progressively less, and columns outside the sort key are effectively unsorted. For queries that filter on trace_id or span_name, row group statistics are useless and nothing gets pruned. Hilbert curves are one attempt to spread that benefit more evenly across all columns.
Hilbert curves
Lexicographic sorting linearizes multi-dimensional data by prioritizing one column at a time. Two rows that are close in both service_name and start_timestamp might land in completely different row groups because they differ on deployment_environment. The sort preserves locality in the leading columns and destroys it everywhere else.
A Hilbert curve fixes this by mapping all of a row's column values into a single sort key that tries to keep multi-dimensionally nearby points close together in the linear order. It does this better than alternatives like Z-order (Morton) curves, which suffer from discontinuous jumps at certain boundaries.
The idea applied to databases is straightforward. Take each row's column values, treat them as coordinates in an N-dimensional space, compute the row's position along the Hilbert curve, and sort by that position. Instead of one column dominating the layout, all columns contribute to the ordering simultaneously. Row groups end up containing rows that are nearby in every dimension, not just the first.
| env | kind | service | Hilbert index |
|---|---|---|---|
| production | Error | api | 5 |
| production | Info | auth | 7 |
| staging | Error | api | 12 |
| production | Warn | billing | 18 |
| staging | Info | api | 21 |
| staging | Warn | auth | 26 |
Rows are sorted by their Hilbert index. No single column dominates the order.
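To make the mapping concrete, here is the classic iterative algorithm for the 2D case, which converts an (x, y) point to its distance along a Hilbert curve over an n×n grid. Our production code works in more dimensions, so treat this as a sketch of the idea rather than what we actually run:

```rust
/// Rotate/flip a quadrant so the sub-curve has the right orientation.
fn rot(n: u32, x: &mut u32, y: &mut u32, rx: u32, ry: u32) {
    if ry == 0 {
        if rx == 1 {
            *x = n - 1 - *x;
            *y = n - 1 - *y;
        }
        std::mem::swap(x, y);
    }
}

/// Distance of point (x, y) along a Hilbert curve over an n×n grid,
/// where n is a power of two. Standard iterative formulation.
fn xy2d(n: u32, mut x: u32, mut y: u32) -> u32 {
    let mut d = 0;
    let mut s = n / 2;
    while s > 0 {
        let rx = if x & s > 0 { 1 } else { 0 };
        let ry = if y & s > 0 { 1 } else { 0 };
        d += s * s * ((3 * rx) ^ ry);
        rot(n, &mut x, &mut y, rx, ry);
        s /= 2;
    }
    d
}
```

For a 2x2 grid this visits (0,0), (0,1), (1,1), (1,0) in order, the characteristic U shape that the curve recursively repeats at every scale.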
Another great property of the Hilbert curve is its stability across dimensions. When you add a new dimension, values that were close together in fewer dimensions remain close. Pick a color range in the gif below and watch how it stays spatially clustered as the curve's dimensionality changes. For us, this means we can introduce new columns into our sort key without reshuffling existing data in a way that destroys the clustering we already have.
This is not a new idea. Databricks moved from Z-order to Hilbert curve clustering for Delta Lake. Apache Hudi reported up to 11x query speedups with Hilbert ordering on certain workloads. The appeal is clear: if nearby multi-dimensional points land in the same row groups, min/max statistics tighten across all columns at once.
But there is an important counterpoint. Lemire and Kaser showed in their 2011 paper "Reordering Columns for Smaller Indexes" that lexicographic sorting with columns ordered by increasing cardinality is asymptotically optimal for column-oriented storage.
The metric they optimize is column runs, consecutive stretches of the same value within a single column. Longer runs mean tighter min/max ranges per row group and better compression. Lexicographic sorting creates very long runs in the leading columns because it holds them constant while varying the later ones. Hilbert sorting does the opposite, distributing ordering effort evenly across all dimensions and fragmenting the runs in every column. Their experiments found Hilbert curves produced roughly 2x more column runs than lexicographic sorting on real datasets.
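Counting column runs is a one-liner, which makes the metric easy to check on your own data. A minimal sketch:

```rust
/// Count column runs: consecutive stretches of the same value within a
/// single column, the metric from Lemire and Kaser's paper. Fewer runs
/// means longer stretches, tighter per-row-group min/max ranges, and
/// better compression.
fn count_runs<T: PartialEq>(column: &[T]) -> usize {
    if column.is_empty() {
        return 0;
    }
    // One run to start with, plus one more at every adjacent change.
    1 + column.windows(2).filter(|w| w[0] != w[1]).count()
}
```

A leading column under lexicographic sort produces runs as long as its groups (`["prod", "prod", "prod", "staging"]` has 2 runs), while a Hilbert ordering that interleaves values (`["prod", "staging", "prod", "staging"]`) has one run per row.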
The systems where Hilbert sorting shines tend to share a few properties. Columns have comparable cardinalities, there is no natural partition key already handling one dimension, and query patterns filter on many columns with roughly equal frequency. Databricks and Hudi see these conditions because their users run diverse, ad-hoc workloads over wide tables with many medium-cardinality columns.
Our data might not look like that, but the only way to know was to test it.
The experiment
I built a benchmarking binary that reads real production Parquet files and rewrites the same data twice, once with our current lexicographic sort and once with Hilbert curve sort, using identical row group sizes, writer config, and data so the only variable is the sort order.
To compute the Hilbert key, each row's columns are mapped to u32 coordinates. Timestamps get split into high and low 32-bit halves. String columns are hashed. Ordinals are scaled to the u32 range. These coordinates are fed through the Hilbert transform to produce a single sort key, which is used for ordering and then dropped before writing to Parquet.
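The coordinate mapping looks roughly like the following. This is a simplified sketch with hypothetical function names, not the benchmark binary's actual code:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// A 64-bit timestamp becomes two Hilbert dimensions:
/// its high and low 32-bit halves.
fn timestamp_coords(ts_nanos: u64) -> (u32, u32) {
    ((ts_nanos >> 32) as u32, ts_nanos as u32)
}

/// String columns are hashed down to 32 bits.
/// (DefaultHasher stands in for whatever hash the real code uses.)
fn string_coord(s: &str) -> u32 {
    let mut h = DefaultHasher::new();
    s.hash(&mut h);
    h.finish() as u32
}

/// Low-cardinality ordinals (e.g. kind: 0..cardinality) are scaled
/// onto the full u32 range so every dimension spans comparable magnitudes.
fn ordinal_coord(ordinal: u32, cardinality: u32) -> u32 {
    let divisor = cardinality.saturating_sub(1).max(1) as u64;
    ((ordinal as u64 * u32::MAX as u64) / divisor) as u32
}
```

Hashing strings is lossy by design: it destroys the column's native ordering, which is fine for clustering equal values but is one reason high-cardinality string dimensions contribute little to pruning.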
I tested 13 configurations across three strategies:
Pure Hilbert. All columns as Hilbert dimensions with no pre-clustering, the most aggressive approach where every column contributes equally to the sort order.
Hybrid (cluster by env + kind). First group rows by deployment_environment and kind, then apply Hilbert ordering within each group. This preserves perfect pruning on the two lowest-cardinality columns and lets Hilbert handle the rest.
Hybrid3 (cluster by env + kind + service_name). Same idea, but clustering on three columns. Hilbert only handles the remaining dimensions like start_timestamp, span_name, and trace_id.
For each configuration I measured three things:
- Zone map tightness. The min/max range span per row group for each column.
- Row group pruning. How many row groups survive across 28 different filter combinations, from single-column filters to multi-column predicates with narrow and wide time windows.
- End-to-end query latency. 30 SQL queries run through DataFusion with Criterion benchmarks.
Results
Pure Hilbert was worse across the board. Without clustering, timestamp ranges per row group blew up from a median of 50.3% to 65.0% of the global range. The lexicographic sort produces row groups where deployment_environment has a single value 88.9% of the time and kind 67.8% of the time. Pure Hilbert dropped those to 20.0% and 46.7% respectively. Row group pruning regressed on 25 out of 28 filter combinations, with the worst case being ts+env+kind+svc where surviving row groups jumped from 15.6% to 64.4%, a 314% regression.
Hybrid clustering helped, but not enough. Clustering by deployment_environment and kind before applying Hilbert ordering preserved pruning on those two columns. But within each cluster, Hilbert scattered start_timestamp values across row groups and service_name single-value row groups dropped from 17.8% to 2.2%. On a narrow timestamp query filtered by env, kind, and service, the hybrid approach let 30.0% of row groups through compared to 15.6% under lexicographic sort, a 93% regression.
Hybrid3 was the closest. Clustering on all three sort key columns and applying Hilbert only to the remaining dimensions produced results closest to the lexicographic baseline. Zone map statistics for deployment_environment, kind, and service_name were identical to lexicographic sort. But timestamp pruning still regressed by 9-14% across predicate combinations, and the most selective multi-column query (ts+env+kind+svc) went from 15.6% to 20.0% of row groups surviving.
File size was a surprise. Hybrid3 actually produced 20% smaller files (838MB vs 1043MB), likely because clustering on all three low-cardinality columns preserved dictionary encoding while Hilbert ordering within each cluster happened to group similar values. Pure Hilbert and Hybrid showed negligible size changes (1-2%).
The pattern was consistent across all configurations. Every step toward more Hilbert dimensions traded pruning on the columns that matter most for marginal improvements on columns that are either already partitioned or too high-cardinality to benefit.
Row group pruning by predicate
The following chart shows the percentage of row groups that survived (were not pruned) for a selection of representative predicates. Lower is better.
Zone map quality
The percentage of row groups where a column has a single distinct value (min equals max). Higher is better, as it means the column can be pruned perfectly for that row group.
Why it didn't work for us
Our data has extreme cardinality skew. deployment_environment has a handful of values, kind maybe five, and service_name a few dozen, but then there is a cliff where start_timestamp and trace_id have effectively unique values per row. Hilbert curves distribute ordering effort evenly across all dimensions, but our data doesn't need even distribution.
The columns that matter most for pruning are already well-served by lexicographic sorting, and the columns that Hilbert could theoretically help with are either handled by day partitioning or too high-cardinality for any sort order to produce tight ranges within a single row group. Lemire and Kaser's results point to the same conclusion: lexicographic sorting with increasing cardinality concentrates its benefit where it matters most, while Hilbert sorting spreads it thin. When your cardinality distribution is skewed, concentration wins.
Conclusion
The lexicographic sort we already had was close to optimal for our data distribution. Hilbert curves solve a real problem, but it's not ours. The conditions where Hilbert sorting helps (comparable column cardinalities, no natural partition key, diverse multi-column filter patterns) don't describe our workload.
Negative results are still worth sharing. The line between "Hilbert sorting gave us 11x speedups" and "Hilbert sorting made everything worse" comes down to the shape of your data. If your columns have comparable cardinalities and your queries filter on many of them simultaneously, Hilbert curves are worth investigating. If your cardinality distribution is skewed and your partition scheme already handles the high-cardinality dimension, save yourself a hackathon.