virtual-frame 0.1.1

Deterministic data pipeline toolkit for LLM training — bitmask-filtered virtual views, NFA regex, Kahan summation, full audit trail. Python bindings included.
# virtual-frame

Foundational infrastructure for deterministic data pipelines. Written in Rust with Python bindings via PyO3.

virtual-frame provides low-level building blocks — columnar storage, bitmask-filtered views, compensated summation, NFA regex, string distance metrics, and a deterministic RNG — that can serve as a substrate for reproducible data processing, including LLM training data preparation. It is not a complete LLM data toolkit; it does not yet provide large-scale sharding, dedup pipelines, tokenizer-aware transforms, dataset versioning, provenance capture, or parallel ingestion. Those are the kinds of things you would build *on top of* this library.

## Features

- **TidyView** — Virtual views over columnar data. Filters flip bits in a packed bitmask (`BitMask`); selects narrow a projection index. The base `Rc<DataFrame>` is shared, not cloned, across chained operations. Note: `materialize()` does allocate when you need a concrete output DataFrame.
- **NFA Regex** — Thompson NFA simulation with an O(n·m) worst-case guarantee (n = input length, m = compiled pattern size). No backtracking, no catastrophic blowup. Supports Perl-style syntax: character classes, quantifiers (greedy + lazy), anchors, word boundaries, alternation.
- **NLP Primitives** — Levenshtein edit distance, Jaccard n-gram similarity, character/word n-gram extraction, whitespace and word-punctuation tokenizers, term frequency, cosine similarity.
- **Kahan Summation** — Compensated floating-point accumulation. Every aggregation (sum, mean, variance, standard deviation) uses Kahan summation to reduce rounding drift. See [Limitations](#limitations) for precision boundaries.
- **SplitMix64 RNG** — Deterministic PRNG. Same seed produces the same sequence. Supports uniform f64, Box-Muller normal, and `fork()` for independent substreams. See [Determinism Design](#determinism-design) for what this does and does not guarantee.
- **CSV Ingestion** — Single-pass type-inferring parser with streaming aggregation (sum, min/max) that operates in O(ncols) memory without materializing the full dataset.
- **Columnar Storage** — Typed column vectors (Int64, Float64, String, Bool) with borrowed `ColumnKeyRef` keys for group-by and join index construction, avoiding per-row string cloning.
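
The no-backtracking guarantee in the NFA bullet comes from simulating a *set* of active states in lockstep over the input: each character advances at most m states, so the whole match is O(n·m). Here is a toy illustration of that technique for a tiny pattern subset (literals, `.`, and `*`) — a sketch of the idea, not virtual-frame's engine:

```python
def nfa_match(pattern: str, text: str) -> bool:
    """Full-match via state-set NFA simulation. Supports literals, '.', and '*'."""
    # Tokenize into (char, is_starred) pairs.
    tokens, i = [], 0
    while i < len(pattern):
        starred = i + 1 < len(pattern) and pattern[i + 1] == "*"
        tokens.append((pattern[i], starred))
        i += 2 if starred else 1

    def closure(states):
        # Epsilon closure: a starred token may be skipped entirely.
        out, frontier = set(states), list(states)
        while frontier:
            s = frontier.pop()
            if s < len(tokens) and tokens[s][1] and s + 1 not in out:
                out.add(s + 1)
                frontier.append(s + 1)
        return out

    states = closure({0})
    for c in text:
        nxt = set()
        for s in states:
            if s < len(tokens):
                ch, starred = tokens[s]
                if ch == c or ch == ".":
                    nxt.add(s if starred else s + 1)  # '*' loops in place
        states = closure(nxt)  # at most m states survive each step
    return len(tokens) in states  # accepting state = end of pattern

print(nfa_match("a*b", "aaab"))  # True
print(nfa_match("a*b", "aac"))   # False
```

However long the input, each character touches at most one state per pattern token — the property that makes pathological patterns like `(a+)+` harmless in an NFA simulation.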

## Install

### Python

```bash
pip install virtual-frame
```

### Rust

```toml
[dependencies]
virtual-frame = "0.1"
```

## Quick Start (Python)

```python
import virtual_frame as vf

# Load data
df = vf.DataFrame({
    "name": ["Alice", "Bob", "Carol", "Dave"],
    "dept": ["eng", "eng", "sales", "sales"],
    "salary": [90000.0, 85000.0, 70000.0, 75000.0],
})

# Create a virtual view and chain operations
tv = vf.TidyView(df)
result = tv.filter_gt_float("salary", 72000.0).select(["name", "salary"])
print(result.materialize())  # 3 rows

# Group and summarise
summary = tv.group_summarise(["dept"], "salary", "mean", "avg_salary")
print(summary.get_column("avg_salary"))  # [87500.0, 72500.0]

# NFA regex
print(vf.regex_find_all(r"\d+", "order-42 item-7 qty-100"))
# [(6, 8), (14, 15), (20, 23)]

# NLP
print(vf.levenshtein("kitten", "sitting"))  # 3
print(vf.char_ngrams("hello", 2))  # {'el': 1, 'he': 1, 'll': 1, 'lo': 1}

# Deterministic RNG
rng = vf.Rng(42)
print(rng.next_f64())  # 0.741565...

# Kahan-compensated sum
print(vf.kahan_sum([0.1] * 10_000_000))  # 1000000.0
```

## Quick Start (Rust)

```rust
use virtual_frame::dataframe::DataFrame;
use virtual_frame::column::Column;
use virtual_frame::tidyview::TidyView;
use virtual_frame::expr::{col, binop, BinOp, DExpr};

let df = DataFrame::from_columns(vec![
    ("x".into(), Column::Int(vec![1, 2, 3, 4, 5])),
    ("y".into(), Column::Float(vec![10.0, 20.0, 30.0, 40.0, 50.0])),
]).unwrap();

let view = TidyView::new(df);
let filtered = view
    .filter(&binop(BinOp::Gt, col("y"), DExpr::LitFloat(25.0)))
    .unwrap();
assert_eq!(filtered.nrows(), 3); // rows where y > 25
```

## Architecture

```
TidyView = Rc<DataFrame> + BitMask + ProjectionMap + Option<Ordering>
```

- **BitMask**: One bit per row, packed into 64-bit words. A million-row filter costs ~122 KB of bitmask memory. Chained filters AND their bitmasks together.
- **ProjectionMap**: Tracks visible column indices. `select()` narrows the map without touching column data.
- **Ordering**: Lazy sort permutation via `arrange()`. Only materialized into a concrete DataFrame when `materialize()` is called.
- **ColumnKeyRef**: Borrowed keys into column data for group-by and join index construction. Single-key operations use `BTreeMap<ColumnKeyRef, usize>` directly, avoiding one `Vec` allocation per row.
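
The BitMask mechanics above are compact enough to sketch in plain Python (an illustration of the packing and AND-chaining scheme, not the library's internals — function names here are invented for the example):

```python
WORD = 64

def make_mask(nrows, predicate, values):
    """Pack one bit per row into 64-bit words: bit i is set iff predicate(values[i])."""
    words = [0] * ((nrows + WORD - 1) // WORD)
    for i, v in enumerate(values):
        if predicate(v):
            words[i // WORD] |= 1 << (i % WORD)
    return words

def and_masks(a, b):
    """Chained filters AND their masks word-by-word."""
    return [x & y for x, y in zip(a, b)]

def set_indices(words):
    """Iterate set bits -> surviving row indices."""
    out = []
    for w, word in enumerate(words):
        while word:
            low = word & -word  # isolate lowest set bit
            out.append(w * WORD + low.bit_length() - 1)
            word ^= low
    return out

vals = [10.0, 20.0, 30.0, 40.0, 50.0]
m1 = make_mask(5, lambda v: v > 15.0, vals)   # first filter
m2 = make_mask(5, lambda v: v < 45.0, vals)   # chained filter
print(set_indices(and_masks(m1, m2)))         # [1, 2, 3]
```

The memory figure follows directly: 1,000,000 rows need 15,625 words × 8 bytes = 125,000 bytes, roughly 122 KiB per filter mask.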

## Determinism Design

This library is designed for reproducible results through several mechanisms:

- **Kahan summation** for all floating-point reductions (sum, mean, variance, standard deviation)
- **BTreeMap/BTreeSet** everywhere — never `HashMap`/`HashSet`, which have non-deterministic iteration order
- **SplitMix64 RNG** with explicit seed threading
- **No reliance on FMA** in reduction paths (FMA can change rounding behavior across platforms)

**What this means in practice:** Given identical inputs and the same seed, all operations in this library should produce identical outputs. The test suite includes determinism checks that verify repeated execution yields the same results.

**What this does not yet prove:** Cross-platform bit-identity has not been validated with a CI matrix across Linux/macOS/Windows/ARM. The determinism properties are enforced by algorithm design (ordered containers, compensated summation, explicit RNG), but independent platform-pair verification is not yet published. If you depend on cross-platform reproducibility in production, you should validate on your target platforms.
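
The seed-threading guarantee is easy to picture with a reference SplitMix64 (constants from the published algorithm; virtual-frame's exact u64-to-f64 mapping is not specified here, so the `next_f64` shown is one common convention, not necessarily the library's):

```python
MASK = (1 << 64) - 1

class SplitMix64:
    """Reference SplitMix64: same seed -> same sequence, no hidden global state."""

    def __init__(self, seed: int):
        self.state = seed & MASK

    def next_u64(self) -> int:
        self.state = (self.state + 0x9E3779B97F4A7C15) & MASK
        z = self.state
        z = ((z ^ (z >> 30)) * 0xBF58476D1CE4E5B9) & MASK
        z = ((z ^ (z >> 27)) * 0x94D049BB133111EB) & MASK
        return z ^ (z >> 31)

    def next_f64(self) -> float:
        # One common mapping: top 53 bits -> uniform double in [0, 1).
        return (self.next_u64() >> 11) * (1.0 / (1 << 53))

a, b = SplitMix64(42), SplitMix64(42)
assert [a.next_u64() for _ in range(5)] == [b.next_u64() for _ in range(5)]
```

Because the entire state is one explicitly-seeded 64-bit word, reproducing a run means recording a single integer — which is also what makes `fork()`-style substreams cheap.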

## Limitations

- **Kahan summation precision boundary**: Single Kahan compensation captures one level of rounding error. For extreme cases (e.g., summing values where the accumulator and individual values differ by more than ~2^52), the compensation term itself can lose precision. The test suite validates Kahan accuracy for practical ranges (10M summands of 0.1). For cases requiring higher precision, consider second-order compensation or arbitrary-precision arithmetic.
- **Single-threaded**: All operations run on a single thread. There is no parallel ingestion or parallel group-by. This is a design choice (determinism is trivial without concurrency), but it means throughput is bounded by single-core speed.
- **No null/missing value support**: Columns are dense typed vectors with no NA/null sentinel. Missing data must be handled before loading.
- **No string interning**: String columns store owned `String` values. For datasets with high-cardinality string columns, memory usage may be higher than interned alternatives.
- **Python GIL bound**: The Python bindings hold the GIL during all operations. Long-running computations will block other Python threads.
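
The single-compensation loop the Kahan bullet above refers to is only a few lines; sketched in plain Python (the textbook algorithm, not virtual-frame's internals):

```python
def kahan_sum(xs):
    """Compensated summation: carry the rounding error of each addition forward."""
    total = 0.0
    c = 0.0  # running compensation for lost low-order bits
    for x in xs:
        y = x - c          # subtract the error owed from the last step
        t = total + y      # big + small: low-order bits of y may be lost...
        c = (t - total) - y  # ...recover exactly what was lost
        total = t
    return total

data = [0.1] * 1_000_000
print(sum(data) == 100000.0)        # False: naive sum drifts by ~1.3e-6
print(kahan_sum(data) == 100000.0)  # True: compensated sum is correctly rounded
```

This also shows the precision boundary the first limitation describes: `c` is itself a single float, so it captures one level of rounding error — adequate here, but not when accumulator and summands differ by more than ~2^52.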

## What This Is and Is Not

**This is** foundational infrastructure: columnar storage, filtered views, joins, grouped aggregation, regex extraction, string distance, tokenization, deterministic RNG, and compensated arithmetic. These are building blocks.

**This is not yet** a complete LLM data preparation toolkit. The harder problems in that space — large-scale deduplication, near-duplicate detection at corpus scale, tokenizer-aware transforms, dataset versioning and provenance, contamination checks, sharded parallel processing with preserved determinism — are not implemented. They could be built on top of this library's primitives, but they are not included today.

## Test Suite

95 tests covering all modules:

| Module | Tests | What's covered |
|---|---|---|
| bitmask | 6 | Word boundaries, AND, set iteration, memory sizing |
| column | 4 | Gather, length, borrowed keys, NaN ordering |
| dataframe | 4 | Construction, duplicates, length mismatch |
| expr | 2 | Row evaluation, columnar fast path |
| kahan | 3 | Compensation accuracy, determinism, count |
| regex_engine | 31 | Literals, classes, quantifiers, anchors, lazy/greedy, split, determinism |
| nlp | 17 | Edit distance, Jaccard, n-grams, tokenization, TF, cosine similarity |
| csv | 11 | Type inference, streaming, delimiters, line endings, max_rows |
| rng | 3 | Determinism, f64 range, fork independence |
| tidyview | 14 | Filter, chain, select, group-by, arrange, join, sample, distinct, snapshot semantics |

Run with: `cargo test`

## License

MIT