llkv_column_map/
lib.rs

//! Columnar storage engine for LLKV.
//!
//! This crate provides the low-level columnar layer that persists Apache Arrow
//! [`RecordBatch`]es to disk and supports efficient scans, filters, and updates.
//! It serves as the foundation for [`llkv-table`] and higher-level query
//! execution.
//!
//! # Role in the Story
//!
//! The column map is where LLKV’s Arrow-first design meets pager-backed
//! persistence. Every [`sqllogictest`](https://sqlite.org/sqllogictest/doc/trunk/about.wiki)
//! case shipped with SQLite, along with an expanding set of DuckDB suites,
//! ultimately routes through these descriptors and chunk walkers.
//! The storage layer therefore carries the burden of matching SQLite semantics
//! while staying efficient enough for OLAP workloads. Gaps uncovered by the
//! logic tests are treated as defects in this crate, not harness exceptions.
//!
//! The engine is maintained in the open by a single developer. These docs aim
//! to give newcomers the same context captured in the README and DeepWiki pages
//! so the story remains accessible as the project grows.
//!
//! # Architecture
//!
//! The storage engine is organized into several key components:
//!
//! - **[`ColumnStore`]**: Primary interface for storing and retrieving columnar data.
//!   Manages column descriptors, metadata catalogs, and coordinates with the pager
//!   for persistent storage.
//!
//! - **[`LogicalFieldId`](types::LogicalFieldId)**: Namespaced identifier for columns.
//!   Combines a namespace (user data, row ID shadow, MVCC metadata), table ID, and
//!   field ID into a single 64-bit value to prevent collisions.
//!
//! - **[`ScanBuilder`]**: Builder pattern for constructing column scans with various
//!   options (filters, ordering, row ID inclusion).
//!
//! - **Visitor Pattern**: Scans emit data through visitor callbacks rather than
//!   materializing entire columns in memory, enabling streaming and aggregation.
//!
//! # Storage Model
//!
//! Data is stored in columnar chunks:
//! - Each column is identified by a `LogicalFieldId`
//! - Columns are broken into chunks for incremental writes
//! - Each chunk stores Arrow-serialized data plus metadata (row count, min/max values)
//! - Shadow columns track row IDs separately from user data
//! - MVCC columns (`created_by`, `deleted_by`) track transaction visibility
//!
//! # Namespaces
//!
//! Columns are organized into namespaces to prevent ID collisions:
//! - `UserData`: Regular table columns
//! - `RowIdShadow`: Internal row ID tracking for each column
//! - `TxnCreatedBy`: MVCC transaction that created each row
//! - `TxnDeletedBy`: MVCC transaction that deleted each row
//!
//! # Test Coverage
//!
//! - **SQLite suites**: The storage layer powers every SQLite [`sqllogictest`](https://sqlite.org/sqllogictest/doc/trunk/about.wiki)
//!   case that upstream publishes. Passing those suites provides a baseline for
//!   SQLite compatibility, but LLKV still diverges from SQLite behavior in
//!   places and should not be treated as a drop-in replacement yet.
//! - **DuckDB extensions**: DuckDB-focused suites exercise MVCC edge cases and
//!   typed transaction flows. Coverage is early and informs the roadmap rather
//!   than proving full DuckDB parity today. All suites run through the
//!   [`sqllogictest` crate](https://crates.io/crates/sqllogictest).
//!
//! # Thread Safety
//!
//! `ColumnStore` is thread-safe (`Send + Sync`) with internal locking for
//! catalog updates. Read operations can occur concurrently; writes are
//! serialized through the catalog lock.
//!
//! [`RecordBatch`]: arrow::record_batch::RecordBatch
//! [`llkv-table`]: https://docs.rs/llkv-table
//! [`ColumnStore`]: store::ColumnStore
//! [`ScanBuilder`]: scan::ScanBuilder
//!
//! # Macros and Type Dispatch
//!
//! This crate provides macros that select a type-specific code path once per
//! call, so the work inside runs on a concrete Arrow type without per-value
//! dispatch overhead. See [`with_integer_arrow_type!`] for details.

// NOTE: rustfmt currently re-indents portions of macro_rules! blocks in this
// file (observed when running `cargo fmt`). This produces noisy diffs and
// churn because rustfmt will flip formatting between runs. The problematic
// locations in this module are the macro_rules! dispatch macros declared
// below. Until the underlying rustfmt bug is fixed, we intentionally opt out
// of automatic formatting for those specific macros using `#[rustfmt::skip]`,
// while keeping the rest of the module formatted normally.
//
// Reproduction / debugging tips for contributors:
// - Run `rustup run stable rustfmt -- --version` to confirm the rustfmt
//   version, then `cargo fmt` to reproduce the behavior.
// - Narrow the change by running rustfmt on this file only:
//     rustfmt llkv-column-map/src/store/scan/unsorted.rs
// - If you can produce a minimal self-contained example that triggers the
//   re-indent, open an issue with rustfmt (include rustfmt version and the
//   minimal example) and link it here.
//
// NOTE: Once a minimal reproducer for the rustfmt regression exists, link the
// upstream issue here and remove the `#[rustfmt::skip]` attributes so the file
// can return to standard formatting. Progress is tracked at
// https://github.com/rust-lang/rustfmt/issues/6629#issuecomment-3395446770.

/// Dispatches to type-specific code based on an Arrow `DataType`.
///
/// The macro compares the provided `DataType` against each supported numeric
/// type (integers, floats, and dates, despite the `integer` in its name) at
/// run time. For the type that matches, it binds the corresponding Arrow
/// primitive type to the given identifier and evaluates `$body`, so the body is
/// compiled once per concrete type.
///
/// # Parameters
///
/// - `$dtype` - Expression evaluating to an `arrow::datatypes::DataType`, owned
///   or borrowed (anything implementing `Borrow<DataType>`)
/// - `$ty` - Identifier to bind the Arrow primitive type to (e.g., `UInt64Type`)
/// - `$body` - Expression to evaluate with `$ty` bound to the matched type; it
///   must have the same type for every supported Arrow type
/// - `$unsupported` - Fallback expression evaluated if no supported type matches
///
/// # Performance
///
/// The type check happens once per invocation rather than once per value, and
/// the compiler generates specialized code for each type, which keeps `match`
/// statements and virtual dispatch out of hot loops in the body.
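/**
The dispatch technique can be tried in isolation with the standard library
alone. The sketch below is a scaled-down analogue, not this crate's macro:
`Tag` stands in for Arrow's `DataType`, and `for_each_numeric!`,
`with_numeric_type!`, and `value_width` are hypothetical names.

```rust
// Hypothetical stand-in for Arrow's `DataType`.
#[derive(Debug, PartialEq)]
enum Tag {
    U64,
    I32,
    F64,
    Utf8,
}

// Miniature type table: invokes `$callback` once per supported numeric type.
macro_rules! for_each_numeric {
    ($callback:ident) => {
        $callback!(u64, Tag::U64);
        $callback!(i32, Tag::I32);
        $callback!(f64, Tag::F64);
    };
}

// Runtime tag comparison, then a body compiled separately for each type.
macro_rules! with_numeric_type {
    ($tag:expr, |$ty:ident| $body:expr, $unsupported:expr $(,)?) => {{
        let tag_ref: &Tag = &$tag;
        let mut result: Option<_> = None;

        // Inner callback: the type alias name and the body come from the outer
        // macro, while the native type and tag are filled in by the type table.
        macro_rules! __dispatch {
            ($native:ty, $tag_expr:expr) => {
                if tag_ref == &$tag_expr {
                    type $ty = $native;
                    result = Some($body);
                }
            };
        }

        for_each_numeric!(__dispatch);

        result.unwrap_or_else(|| $unsupported)
    }};
}

// Size in bytes of one value of the tagged type, or `None` if unsupported.
fn value_width(tag: Tag) -> Option<usize> {
    with_numeric_type!(tag, |T| Some(std::mem::size_of::<T>()), None)
}

fn main() {
    assert_eq!(value_width(Tag::U64), Some(8));
    assert_eq!(value_width(Tag::I32), Some(4));
    assert_eq!(value_width(Tag::Utf8), None);
}
```

Inside the body, `T` is an ordinary type alias, so generic calls such as
`size_of::<T>()` resolve to a concrete type in each expanded branch.
*/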
#[macro_export]
#[rustfmt::skip]
macro_rules! with_integer_arrow_type {
    ($dtype:expr, |$ty:ident| $body:expr, $unsupported:expr $(,)?) => {{
        use std::borrow::Borrow;

        let dtype_value = $dtype;
        let dtype_ref: &arrow::datatypes::DataType = dtype_value.borrow();
        let mut result: Option<_> = None;

        macro_rules! __llkv_dispatch_integer_arrow_type {
            (
                $base:ident,
                $chunk_fn:ident,
                $chunk_with_rids_fn:ident,
                $run_fn:ident,
                $run_with_rids_fn:ident,
                $array_ty:ty,
                $physical_ty:ty,
                $dtype_expr:expr,
                $native_ty:ty,
                $cast_expr:expr
            ) => {
                if dtype_ref == &$dtype_expr {
                    type $ty = $physical_ty;
                    result = Some($body);
                }
            };
        }

        llkv_for_each_arrow_numeric!(__llkv_dispatch_integer_arrow_type);

        result.unwrap_or_else(|| $unsupported)
    }};
}

/// Invokes a macro for each supported Arrow numeric type.
///
/// This is a helper macro that generates repetitive type-specific code. It calls
/// the provided macro once for each numeric Arrow type (unsigned and signed
/// 8- to 64-bit integers, `Float32`, `Float64`, `Date32`, and `Date64`) with
/// metadata about that type.
///
/// # Macro Arguments Provided to Callback
///
/// For each type, the callback macro receives:
/// 1. Base type name (e.g., `u64`, `i32`, `f64`, `date64`)
/// 2. Chunk visitor method name (e.g., `u64_chunk`)
/// 3. Chunk with row IDs visitor method name (e.g., `u64_chunk_with_rids`)
/// 4. Run visitor method name (e.g., `u64_run`)
/// 5. Run with row IDs visitor method name (e.g., `u64_run_with_rids`)
/// 6. Arrow array type (e.g., `arrow::array::UInt64Array`)
/// 7. Arrow physical type (e.g., `arrow::datatypes::UInt64Type`)
/// 8. Arrow DataType enum variant (e.g., `arrow::datatypes::DataType::UInt64`)
/// 9. Native Rust type (e.g., `u64`; `i64` for `Date64`)
/// 10. Cast closure that converts a native value to `f64` (e.g., `|v: u64| v as f64`)
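/**
A callback only has to name the arguments it cares about. The sketch below is
a scaled-down analogue using four of the ten arguments; `for_each_numeric!`,
`chunk_method_names`, and `widened_sevens` are hypothetical names for this
example, and the shortened argument list is an assumption made to keep it brief.

```rust
// Miniature type table with four of the ten callback arguments:
// base name, chunk visitor method name, native type, cast closure.
macro_rules! for_each_numeric {
    ($callback:ident) => {
        $callback!(u64, u64_chunk, u64, |v: u64| v as f64);
        $callback!(i32, i32_chunk, i32, |v: i32| v as f64);
        $callback!(f64, f64_chunk, f64, |v: f64| v);
    };
}

// Collects the chunk visitor method names; the other arguments are ignored.
fn chunk_method_names() -> Vec<&'static str> {
    let mut names = Vec::new();
    macro_rules! push_name {
        ($base:ident, $chunk_fn:ident, $native:ty, $cast:expr) => {
            names.push(stringify!($chunk_fn));
        };
    }
    for_each_numeric!(push_name);
    names
}

// Applies each type's cast closure to the value 7 of that native type.
fn widened_sevens() -> Vec<f64> {
    let mut out = Vec::new();
    macro_rules! push_widened {
        ($base:ident, $chunk_fn:ident, $native:ty, $cast:expr) => {
            out.push(($cast)(7 as $native));
        };
    }
    for_each_numeric!(push_widened);
    out
}

fn main() {
    assert_eq!(chunk_method_names(), vec!["u64_chunk", "i32_chunk", "f64_chunk"]);
    assert_eq!(widened_sevens(), vec![7.0, 7.0, 7.0]);
}
```
*/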
#[macro_export]
#[rustfmt::skip]
macro_rules! llkv_for_each_arrow_numeric {
    ($macro:ident) => {
        $macro!(
            u64,
            u64_chunk,
            u64_chunk_with_rids,
            u64_run,
            u64_run_with_rids,
            arrow::array::UInt64Array,
            arrow::datatypes::UInt64Type,
            arrow::datatypes::DataType::UInt64,
            u64,
            |v: u64| v as f64
        );
        $macro!(
            u32,
            u32_chunk,
            u32_chunk_with_rids,
            u32_run,
            u32_run_with_rids,
            arrow::array::UInt32Array,
            arrow::datatypes::UInt32Type,
            arrow::datatypes::DataType::UInt32,
            u32,
            |v: u32| v as f64
        );
        $macro!(
            u16,
            u16_chunk,
            u16_chunk_with_rids,
            u16_run,
            u16_run_with_rids,
            arrow::array::UInt16Array,
            arrow::datatypes::UInt16Type,
            arrow::datatypes::DataType::UInt16,
            u16,
            |v: u16| v as f64
        );
        $macro!(
            u8,
            u8_chunk,
            u8_chunk_with_rids,
            u8_run,
            u8_run_with_rids,
            arrow::array::UInt8Array,
            arrow::datatypes::UInt8Type,
            arrow::datatypes::DataType::UInt8,
            u8,
            |v: u8| v as f64
        );
        $macro!(
            i64,
            i64_chunk,
            i64_chunk_with_rids,
            i64_run,
            i64_run_with_rids,
            arrow::array::Int64Array,
            arrow::datatypes::Int64Type,
            arrow::datatypes::DataType::Int64,
            i64,
            |v: i64| v as f64
        );
        $macro!(
            i32,
            i32_chunk,
            i32_chunk_with_rids,
            i32_run,
            i32_run_with_rids,
            arrow::array::Int32Array,
            arrow::datatypes::Int32Type,
            arrow::datatypes::DataType::Int32,
            i32,
            |v: i32| v as f64
        );
        $macro!(
            i16,
            i16_chunk,
            i16_chunk_with_rids,
            i16_run,
            i16_run_with_rids,
            arrow::array::Int16Array,
            arrow::datatypes::Int16Type,
            arrow::datatypes::DataType::Int16,
            i16,
            |v: i16| v as f64
        );
        $macro!(
            i8,
            i8_chunk,
            i8_chunk_with_rids,
            i8_run,
            i8_run_with_rids,
            arrow::array::Int8Array,
            arrow::datatypes::Int8Type,
            arrow::datatypes::DataType::Int8,
            i8,
            |v: i8| v as f64
        );
        $macro!(
            f64,
            f64_chunk,
            f64_chunk_with_rids,
            f64_run,
            f64_run_with_rids,
            arrow::array::Float64Array,
            arrow::datatypes::Float64Type,
            arrow::datatypes::DataType::Float64,
            f64,
            |v: f64| v
        );
        $macro!(
            f32,
            f32_chunk,
            f32_chunk_with_rids,
            f32_run,
            f32_run_with_rids,
            arrow::array::Float32Array,
            arrow::datatypes::Float32Type,
            arrow::datatypes::DataType::Float32,
            f32,
            |v: f32| v as f64
        );
        $macro!(
            date64,
            date64_chunk,
            date64_chunk_with_rids,
            date64_run,
            date64_run_with_rids,
            arrow::array::Date64Array,
            arrow::datatypes::Date64Type,
            arrow::datatypes::DataType::Date64,
            i64,
            |v: i64| v as f64
        );
        $macro!(
            date32,
            date32_chunk,
            date32_chunk_with_rids,
            date32_run,
            date32_run_with_rids,
            arrow::array::Date32Array,
            arrow::datatypes::Date32Type,
            arrow::datatypes::DataType::Date32,
            i32,
            |v: i32| v as f64
        );
    };
}

/// Invokes a macro for the Arrow boolean type.
///
/// The callback receives the same ten arguments as in
/// [`llkv_for_each_arrow_numeric!`]; the cast closure maps `true` to `1.0` and
/// `false` to `0.0`.
#[macro_export]
#[rustfmt::skip]
macro_rules! llkv_for_each_arrow_boolean {
    ($macro:ident) => {
        $macro!(
            bool,
            bool_chunk,
            bool_chunk_with_rids,
            bool_run,
            bool_run_with_rids,
            arrow::array::BooleanArray,
            arrow::datatypes::BooleanType,
            arrow::datatypes::DataType::Boolean,
            bool,
            |v: bool| if v { 1.0 } else { 0.0 }
        );
    };
}

/// Invokes a macro for the Arrow `Utf8` string type.
///
/// The callback receives the same ten arguments as in
/// [`llkv_for_each_arrow_numeric!`]. Strings have no numeric value, so the cast
/// closure is a placeholder that always returns `0.0`.
#[macro_export]
#[rustfmt::skip]
macro_rules! llkv_for_each_arrow_string {
    ($macro:ident) => {
        $macro!(
            utf8,
            utf8_chunk,
            utf8_chunk_with_rids,
            utf8_run,
            utf8_run_with_rids,
            arrow::array::StringArray,
            arrow::datatypes::Utf8Type,
            arrow::datatypes::DataType::Utf8,
            &str,
            |_v: &str| 0.0
        );
    };
}

/// Returns `true` if `dtype` is an Arrow type this crate supports.
///
/// Supported types are `Utf8`, `LargeUtf8`, `Boolean`, and every type listed by
/// [`llkv_for_each_arrow_numeric!`].
pub fn is_supported_arrow_type(dtype: &arrow::datatypes::DataType) -> bool {
    use arrow::datatypes::DataType;

    if matches!(dtype, DataType::Utf8 | DataType::LargeUtf8) {
        return true;
    }

    let mut matched = false;

    // The callback receives all ten arguments but only needs the `DataType`.
    macro_rules! __llkv_match_dtype {
        (
            $base:ident,
            $chunk_fn:ident,
            $chunk_with_rids_fn:ident,
            $run_fn:ident,
            $run_with_rids_fn:ident,
            $array_ty:ty,
            $physical_ty:ty,
            $dtype_expr:expr,
            $native_ty:ty,
            $cast_expr:expr
        ) => {
            if dtype == &$dtype_expr {
                matched = true;
            }
        };
    }

    llkv_for_each_arrow_numeric!(__llkv_match_dtype);
    llkv_for_each_arrow_boolean!(__llkv_match_dtype);

    matched
}

/// Returns every supported Arrow type: the string types first, followed by the
/// numeric types in table order and then `Boolean`.
pub fn supported_arrow_types() -> Vec<arrow::datatypes::DataType> {
    use arrow::datatypes::DataType;

    let mut types = vec![DataType::Utf8, DataType::LargeUtf8];

    // The callback receives all ten arguments but only needs the `DataType`.
    macro_rules! __llkv_push_dtype {
        (
            $base:ident,
            $chunk_fn:ident,
            $chunk_with_rids_fn:ident,
            $run_fn:ident,
            $run_with_rids_fn:ident,
            $array_ty:ty,
            $physical_ty:ty,
            $dtype_expr:expr,
            $native_ty:ty,
            $cast_expr:expr
        ) => {
            types.push($dtype_expr);
        };
    }

    llkv_for_each_arrow_numeric!(__llkv_push_dtype);
    llkv_for_each_arrow_boolean!(__llkv_push_dtype);

    types
}

/// Returns `Ok(())` if `dtype` is supported.
///
/// Otherwise returns [`Error::InvalidArgumentError`] with a message that names
/// the rejected type and lists the supported types in sorted order.
pub fn ensure_supported_arrow_type(dtype: &arrow::datatypes::DataType) -> Result<()> {
    if is_supported_arrow_type(dtype) {
        return Ok(());
    }

    let mut supported = supported_arrow_types()
        .into_iter()
        .map(|dtype| format!("{dtype:?}"))
        .collect::<Vec<_>>();
    supported.sort();
    supported.dedup();

    Err(Error::InvalidArgumentError(format!(
        "unsupported Arrow type {dtype:?}; supported types are {}",
        supported.join(", ")
    )))
}

pub mod gather;
pub mod parallel;
pub mod store;
pub mod types;

pub use llkv_result::{Error, Result};
pub use store::{
    ColumnStore, IndexKind, ROW_ID_COLUMN_NAME,
    scan::{self, ScanBuilder},
};

pub mod debug {
    pub use super::store::debug::*;
}