llkv-column-map 0.8.5-alpha

Short answer: **yes—in the right workloads you can beat Parquet/Arrow** (or at least the default “zone-map + scan” path many engines use), and **SUM/COUNT/AVG can be very fast** if you add a proper vectorized scanner and a couple of cheap pre-aggregates.

# When you can beat Parquet/Arrow

Parquet (on-disk) and Arrow (in-memory layout) excel at **throughput scanning** with SIMD + compression. They *don’t* typically maintain a per-page/per-chunk **value-sorted permutation**. With your design:

* **Selective point/range filters** on a column:

  * Use segment **value min/max** to prune → touch few segments.
  * In each touched segment, **binary search** the value-order permutation → jump directly to the matches instead of scanning the whole page.
  * Complexity \~ **O(s·log n + k)** (s = segments kept, n = rows/segment, k = hits).
    In these cases you often read **far fewer bytes** and do less CPU than Parquet/Arrow’s “decompress & scan the page”, so you can win.

* **Append-only, sealed segments**: you pay the `O(n log n)` value sort once per segment at ingest, not at query time.

You won’t win on:

* **Large/full scans** or **very wide projections** with light filtering. Parquet/Arrow with vectorized scan + SIMD + column compression are hard to beat; just scan linearly.

# Aggregations (SUM/COUNT/AVG)

You can make aggregations fast with two layers:

## 1) Segment pre-aggregates (computed at seal time)

Store tiny stats per segment per column:

* `count_non_null`, `sum` (128-bit accumulator for integers), `min`, `max`.
* Optional `sum_squares` if you ever want `STDDEV` cheaply.

Then at query time:

* If a whole segment passes the predicate (e.g., **no filter** or filter fully contains the segment’s value range), you **accumulate the segment stats directly** (no data read).
  → For SUM/COUNT/AVG over large ranges this collapses to **O(#segments)** time.

## 2) Vectorized partial scanning for the rest

For segments partially passing the filter:

* Use your **value-order permutation** to find the qualifying span (binary search lower/upper).
* Pull that contiguous run from the **data blob** and run a **SIMD/branchless** loop:

  * Fixed-width numeric: decode straight from the blob; accumulate in 128-bit (integers) or f64 (float).
  * Variable-width numeric (if you have): use the offsets to slice; decode and accumulate.
* Apply **late materialization**: build a bitmap/selection first, then aggregate only selected vectors.

This mirrors what DuckDB/ClickHouse do, just with your extra seeking superpower.

### Optional turbo for value-range SUM

If range-SUM by value is hot on large segments of numeric columns, consider an **optional** per-segment prefix sum over the *value-sorted* order:

* `prefix[i] = sum(values in ord[0..i])` (u128).
* Range-SUM becomes two array reads + a subtract.
* Cost: \~8·n bytes per segment → \~512 KiB for 65 k rows; make it **opt-in**.

# Why this can beat Parquet/Arrow sometimes

* Parquet typically does: prune row groups/pages via min/max → **decompress & scan page**. No value-order seek.
* You do: prune → **seek exact subrange by binary search** → aggregate only those bytes.
  When selectivity is decent (or you can satisfy big chunks via pre-aggregates), you move less data and do less work.

# Implementation checklist (pragmatic)

* **On seal (per segment & column):**

  * Keep `value_min/value_max` in `IndexSegmentRef`.
  * Build `value_order` (`ord: Vec<u16|u32>`) by comparing slices via `ValueLayout`.
  * Compute `count`, `sum`, `min`, `max` once; persist with the segment metadata.
  * (Optional) `prefix_sum` over value order (opt-in).

* **Scanner API sketch:**

  ```rust
  struct AggResult { count: u64, sum_i128: i128, sum_f64: f64 /* … */ }

  fn agg_range_by_value(
      store: &ColumnStore<…>,
      fid: LogicalFieldId,
      lower: Option<&[u8]>,
      upper: Option<&[u8]>,
      typ: NumericType,             // e.g., I64, F64
  ) -> AggResult {
      // 1) prune segments by value_min/value_max (inclusive)
      // 2) batch-get typed segments + needed data blobs
      // 3) for each segment:
      //    - if fully inside range -> use pre-aggregates
      //    - else:
      //        a) binary search lower/upper in value_order with a slice comparator
      //        b) vector loop over ord[lo..hi): decode & accumulate
      // 4) combine partials; AVG = sum / count
  }
  ```

  Use wide batches, reuse scratch buffers, and prefer **branchless** tight loops for the inner accumulate.

* **Numerics care:**

  * Integers: accumulate in **i128/u128** to avoid overflow, downcast at the end if needed.
  * Floats: accumulate in f64; consider Kahan or pairwise reduction for big runs.

* **SIMD & cache:**

  * Fixed-width: the hot loop is a strided load over the blob; prefetch ahead.
  * Variable-width: follow `value_offsets` to contiguous slices; still fairly cache-friendly when the qualifying run is contiguous in `ord`.

# Bottom line

* **Yes, you can win** over Parquet/Arrow on **selective** filters and many range aggregates because you *seek* instead of *scan*.
* For **full scans** and **very wide projections**, match their **vectorized scanning** to be competitive; you won’t beat highly tuned engines there without similar techniques.
* Adding tiny **segment pre-aggregates** gives you near-O(#segments) time for big aggregates; pairing that with your **value-order seek** makes SUM/COUNT/AVG legitimately fast.