
Teaching DataFusion to read struct fields efficiently


Matthew works on Fusionfire, our internal database that backs our observability platform Pydantic Logfire. He likes file formats.

In Logfire, we store logs as JSON columns, which has its own set of problems. JSON shredding in Parquet is unstandardized across implementations, and the overhead of parsing and validating JSON strings on every read adds up.

I've been driving Variant type support in DataFusion for a while now. Variant is an efficient binary encoding for JSON-like data that gives semi-structured values a native representation in columnar storage with a standardized shredding spec. It solves both the interoperability and the performance problem, and we're planning to migrate Logfire to Variant next quarter.

When Variant data is written to Parquet, it gets shredded into typed struct columns, which means every query on a variant field goes through DataFusion's struct handling code. Until recently, DataFusion couldn't efficiently query struct fields at all. Filters fell back to full scans, projections decoded every sibling leaf even when you only needed one, and row group pruning gave up entirely and kept every row group.

This post explains how Parquet stores structs, why querying them was slow, and how struct field pushdown makes them as efficient as top-level columns.


Parquet is a columnar format, meaning each column is stored independently as its own stream of pages. When you read a single column, you only touch the bytes for that column. This is what makes columnar storage fast for analytical queries that only need a few columns out of a wide table.

Struct columns complicate this because a struct has no physical column of its own in Parquet. Instead, its child fields each become separate leaf columns. Consider a table with this Arrow schema:

id: int64
s: struct
    name: string
    value: int32
    payload: binary
status: string

In Parquet, this becomes five leaf columns:

Leaf 0: id        (int64)
Leaf 1: s.name    (string)
Leaf 2: s.value   (int32)
Leaf 3: s.payload (binary)
Leaf 4: status    (string)

The struct s is purely a logical grouping. Physically, s.name, s.value, and s.payload are three independent column chunks, each with their own pages, encodings, and min/max statistics.

For top-level columns, DataFusion already knows how to be efficient. SELECT id FROM t WHERE status = 'active' only reads the id and status columns from Parquet, pushes the filter into the decoder so non-matching rows never get materialized, and uses min/max statistics to skip entire row groups where status can't possibly be 'active'. These three optimizations (projection pruning, filter pushdown, and row group pruning) are what make columnar storage fast.

None of them worked for struct fields. DataFusion treated the struct as an opaque unit, so SELECT s['value'] FROM t WHERE s['value'] > 10 would read every leaf of s, evaluate the filter after a full scan, and skip no row groups. If you only need s.value, a 4-byte integer per row, you're still decoding s.name and s.payload, which could be arbitrarily large strings or blobs. The I/O and decode cost scales with the total size of the struct, not the size of the field you actually need.

Struct field pushdown teaches DataFusion to look through the struct and operate directly on the underlying leaf columns.

DataFusion pushes filter predicates into the Parquet reader so rows are filtered during decoding rather than after. For WHERE id > 100, the reader evaluates the predicate as it decodes and only emits matching rows. For WHERE s['value'] > 10, this wasn't happening because the PushdownChecker saw the column reference s, checked its type, found a Struct, and rejected it. It didn't consider that s['value'] resolves through the struct to a primitive int32 leaf.

Bracket notation, dot notation, and function-form access to struct fields all compile down to get_field expressions in DataFusion. The fix adds struct-aware handling that recognizes when a get_field expression resolves to a primitive leaf and allows it to be pushed down. Deeply nested access like s['outer']['inner'] also works, because the logical simplifier flattens nested get_field calls before they reach the physical plan.

Even with filter pushdown working, the reader was still decoding all three leaf columns of s to evaluate a predicate on value. The reader interpreted "you need column s" as "read all leaves of s," so s['value'] > 10 was decoding s.name and s.payload for every row group even though the predicate never touches them. The same problem existed on the projection side, where SELECT s['value'] would read every sibling leaf.

The fix resolves struct field access expressions down to their specific Parquet leaf column indices and builds a ProjectionMask at the leaf level rather than the root level. A shared ParquetReadPlan abstraction handles this resolution for both the filter path and the projection path. Now SELECT s['value'] FROM t WHERE s['value'] > 10 only touches leaf 2 (s.value) for both filtering and output, skipping the string and binary columns entirely.

Parquet stores min/max statistics for every leaf column in the file metadata, including struct leaves. DataFusion's row group pruning system uses these statistics to skip entire row groups whose value ranges don't intersect the filter. For WHERE id > 1000, if a row group's max id is 500, the entire row group gets skipped without reading any data. But for WHERE s['value'] > 1000, the pruning system only understood top-level column references. When it encountered a get_field expression, it conservatively kept every row group.

The fix introduces PruningColumn, a column reference that carries an optional field path for nested struct access. The pruning expression builder recognizes get_field expressions, extracts the field path, and resolves it to the corresponding leaf column index in the Parquet schema. The existing min/max statistics for that leaf then feed into the same pruning logic that top-level columns use.

We benchmarked three dataset shapes, each with ~262K rows and 8KB string payloads to make the cost of reading unnecessary siblings visible. All benchmarks ran on a GKE c4a-highmem-16 instance (12 vCPU, 65 GiB, Neoverse-V2 aarch64).

The biggest wins come from wide structs where leaf-level projection skips decoding of large sibling columns:

Benchmark Before After Speedup
wide_struct/select_small_field 8.2 ms 419 µs 19.5x
wide_struct/sum_small_field 8.2 ms 638 µs 12.8x
wide_struct/select_one_string_field 8.3 ms 671 µs 12.3x
nested_struct/select_extra_string 3.4 ms 685 µs 4.9x
nested_struct/select_inner_small_field 899 µs 450 µs 2.0x
nested_struct/sum_inner_small_field 1.16 ms 662 µs 1.8x
narrow_struct/select_small_field 654 µs 422 µs 1.5x
narrow_struct/select_id_and_small_field 684 µs 445 µs 1.5x

wide_struct/select_small_field is the most telling benchmark. It selects a single 4-byte integer from a struct with many large string siblings. Before leaf-level projection, DataFusion decoded every sibling column (reading 611 MiB from disk) and the query took 8.2 ms. After, it reads only the integer leaf (4.7 MiB) and the query completes in 419 µs. The speedup scales directly with how much unnecessary data the old code path was reading.

Narrow structs show smaller gains because the sibling columns are cheaper to decode. When the struct only has a few small fields, reading all of them instead of one doesn't cost much. The optimization matters most when struct fields have heterogeneous sizes, which is exactly what happens with shredded Variant data where a struct might contain a small typed value alongside large metadata or fallback columns.

The Parquet ecosystem is adopting Variant as a first-class type for semi-structured data. When Variant data is written to Parquet, it gets shredded into typed struct columns. A variant column containing mostly integers becomes a struct with a value leaf storing the int values and a metadata leaf storing type information.

Every query on a variant field goes through DataFusion's struct field access path. WHERE variant_col['price'] > 100 compiles to a get_field on the shredded struct. Without struct field pushdown, that query would decode every leaf, evaluate the filter after a full scan, and skip no row groups. With it, variant field access gets the same treatment as top-level column access across projections, filters, and row group pruning.

Struct field pushdown landed upstream in DataFusion and ships in Logfire. Once we complete the Variant migration next quarter, every query you run against a JSON attribute will benefit from the optimizations described above, with no change to your queries or instrumentation.

If you want to see how Logfire makes semi-structured observability data fast and queryable, start a free Logfire project or read the docs.