virtual-frame 0.1.0

Deterministic data pipeline toolkit for LLM training — bitmask-filtered virtual views, NFA regex, Kahan summation, full audit trail. Python bindings included.
# virtual-frame

Deterministic data pipeline toolkit for LLM training data preparation. Written in Rust with Python bindings via PyO3.

## Features

- **TidyView** — Zero-copy virtual views over columnar data. Filters flip bits in a bitmask; selects narrow a projection map. The original data is never copied or modified. Chain operations freely with no allocation overhead.
- **NFA Regex** — Thompson NFA simulation with O(n*m) worst-case guarantee. No backtracking, no catastrophic blowup. Supports Perl-style syntax: character classes, quantifiers (greedy + lazy), anchors, word boundaries, alternation.
- **NLP Primitives** — Levenshtein distance, Jaccard n-gram similarity, character/word n-gram extraction, whitespace and word-punctuation tokenizers, term frequency, cosine similarity.
- **Kahan Summation** — Compensated floating-point accumulation for deterministic results. Every aggregation (sum, mean, variance, standard deviation) uses Kahan summation.
- **SplitMix64 RNG** — Deterministic random number generator. Same seed = same sequence on every platform. Supports uniform f64, Box-Muller normal, and fork for independent substreams.
- **CSV Ingestion** — Single-pass type-inferring parser with streaming aggregation (sum, min/max) that never materializes the full dataset.
- **Columnar Storage** — Typed column vectors (Int64, Float64, String, Bool) with zero-copy borrowed keys for group-by and join operations.
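
The O(n*m) bound in the NFA Regex bullet comes from advancing a set of live states one input character at a time instead of backtracking. The following is a minimal Python sketch of that idea, not virtual-frame's implementation: `parse`, `closure`, and `full_match` are hypothetical helper names, and the pattern subset (literals, `.`, and postfix `*`, `+`, `?` on a single character) is far smaller than what the library supports.

```python
def parse(pattern):
    """Parse a tiny regex subset into (atom, quantifier) items."""
    items = []
    i = 0
    while i < len(pattern):
        atom = pattern[i]
        has_quant = i + 1 < len(pattern) and pattern[i + 1] in "*+?"
        quant = pattern[i + 1] if has_quant else ""
        i += 1 + len(quant)
        if quant == "+":
            # x+ is desugared to x followed by x*
            items.append((atom, ""))
            items.append((atom, "*"))
        else:
            items.append((atom, quant))
    return items


def closure(items, states):
    """Add every state reachable without consuming input."""
    out = set()
    stack = list(states)
    while stack:
        s = stack.pop()
        if s in out:
            continue
        out.add(s)
        # Optional items ('*' or '?') may be skipped entirely.
        if s < len(items) and items[s][1] in ("*", "?"):
            stack.append(s + 1)
    return out


def full_match(pattern, text):
    """Return True if the whole text matches, using state-set simulation."""
    items = parse(pattern)
    current = closure(items, {0})
    for ch in text:
        nxt = set()
        for s in current:
            if s == len(items):
                continue  # accepting state has no outgoing transitions
            atom, quant = items[s]
            if atom == "." or atom == ch:
                # A starred item loops on itself; anything else advances.
                nxt.add(s if quant == "*" else s + 1)
        current = closure(items, nxt)
        if not current:
            return False
    return len(items) in current


# A classic trap for backtracking engines: 25 optional 'a's, then 25 required.
print(full_match("a?" * 25 + "a" * 25, "a" * 25))  # True
```

The state set never holds more than m + 1 entries for a pattern of m items, so each of the n input characters costs O(m) work no matter how the pattern is nested.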

## Install

### Python

```bash
pip install virtual-frame
```

### Rust

```toml
[dependencies]
virtual-frame = "0.1"
```

## Quick Start (Python)

```python
import virtual_frame as vf

# Load data
df = vf.DataFrame({
    "name": ["Alice", "Bob", "Carol", "Dave"],
    "dept": ["eng", "eng", "sales", "sales"],
    "salary": [90000.0, 85000.0, 70000.0, 75000.0],
})

# Create a virtual view and chain operations
tv = vf.TidyView(df)
result = tv.filter_gt_float("salary", 72000.0).select(["name", "salary"])
print(result.materialize())  # 3 rows; the view itself copies no data

# Group and summarise
summary = tv.group_summarise(["dept"], "salary", "mean", "avg_salary")
print(summary.get_column("avg_salary"))  # [87500.0, 72500.0]

# NFA regex
print(vf.regex_find_all(r"\d+", "order-42 item-7 qty-100"))
# [(6, 8), (14, 15), (20, 23)]

# NLP
print(vf.levenshtein("kitten", "sitting"))  # 3
print(vf.char_ngrams("hello", 2))  # {'el': 1, 'he': 1, 'll': 1, 'lo': 1}

# Deterministic RNG
rng = vf.Rng(42)
print(rng.next_f64())  # 0.741565... (same on every platform)

# Kahan-compensated sum
print(vf.kahan_sum([0.1] * 10_000_000))  # 1000000.0 (a naive loop drifts in the low digits)
```
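
The `Rng(42)` line above can promise the same value on every platform because SplitMix64 uses only wrapping 64-bit integer arithmetic. Below is a minimal pure-Python sketch of the published SplitMix64 step. The class and method names, the top-53-bits float conversion, and the `fork` strategy are illustrative assumptions and may differ from what virtual-frame does internally.

```python
MASK64 = (1 << 64) - 1  # emulate wrapping u64 arithmetic with Python ints


class SplitMix64:
    """Illustrative SplitMix64 generator (not the library's class)."""

    def __init__(self, seed):
        self.state = seed & MASK64

    def next_u64(self):
        # Advance the state by the golden-ratio increment, then mix.
        self.state = (self.state + 0x9E3779B97F4A7C15) & MASK64
        z = self.state
        z = ((z ^ (z >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
        z = ((z ^ (z >> 27)) * 0x94D049BB133111EB) & MASK64
        return z ^ (z >> 31)

    def next_f64(self):
        # Assumption: keep the top 53 bits to get a uniform float in [0, 1).
        return (self.next_u64() >> 11) * (1.0 / (1 << 53))

    def fork(self):
        # Assumption: seed a child stream from the parent's next output.
        return SplitMix64(self.next_u64())


a = SplitMix64(42)
b = SplitMix64(42)
print([a.next_u64() for _ in range(3)] == [b.next_u64() for _ in range(3)])  # True
```

No floating-point operation touches the generator state, so the sequence depends only on the seed.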

## Quick Start (Rust)

```rust
use virtual_frame::dataframe::DataFrame;
use virtual_frame::column::Column;
use virtual_frame::tidyview::TidyView;
use virtual_frame::expr::{col, binop, BinOp, DExpr};

let df = DataFrame::from_columns(vec![
    ("x".into(), Column::Int(vec![1, 2, 3, 4, 5])),
    ("y".into(), Column::Float(vec![10.0, 20.0, 30.0, 40.0, 50.0])),
]).unwrap();

let view = TidyView::new(df);
let filtered = view
    .filter(&binop(BinOp::Gt, col("y"), DExpr::LitFloat(25.0)))
    .unwrap();
assert_eq!(filtered.nrows(), 3); // rows where y > 25
```

## Architecture

```
TidyView = Rc<DataFrame> + BitMask + ProjectionMap + Option<Ordering>
```

- **BitMask**: One bit per row, packed into 64-bit words. A million-row filter costs about 125 KB (122 KiB). Chained filters AND their bitmasks, so still no column data is copied.
- **ProjectionMap**: Tracks visible column indices. Select narrows the map without touching data.
- **Ordering**: Lazy sort permutation. Only materialized when needed.
- **ColumnKeyRef**: Borrowed keys into column data for zero-allocation group-by and join index construction.
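
How the mask and the projection map combine can be sketched in pure Python. This is an illustrative model under stated assumptions, not the crate's Rust code: the `BitMask` class, its method names, and the dict-of-lists table are hypothetical stand-ins for the real types.

```python
WORD_BITS = 64


class BitMask:
    """One bit per row, packed into 64-bit words (illustrative only)."""

    def __init__(self, nrows, words=None):
        self.nrows = nrows
        nwords = (nrows + WORD_BITS - 1) // WORD_BITS
        self.words = words if words is not None else [0] * nwords

    @classmethod
    def from_predicate(cls, values, pred):
        mask = cls(len(values))
        for row, value in enumerate(values):
            if pred(value):
                mask.words[row // WORD_BITS] |= 1 << (row % WORD_BITS)
        return mask

    def __and__(self, other):
        # Chaining two filters is a word-wise AND; column data is untouched.
        return BitMask(self.nrows, [a & b for a, b in zip(self.words, other.words)])

    def rows(self):
        return [
            r for r in range(self.nrows)
            if (self.words[r // WORD_BITS] >> (r % WORD_BITS)) & 1
        ]

    def nbytes(self):
        return len(self.words) * 8


columns = {
    "name": ["Alice", "Bob", "Carol", "Dave"],
    "dept": ["eng", "eng", "sales", "sales"],
    "salary": [90000.0, 85000.0, 70000.0, 75000.0],
}

high = BitMask.from_predicate(columns["salary"], lambda s: s > 72000.0)
sales = BitMask.from_predicate(columns["dept"], lambda d: d == "sales")
both = high & sales              # chained filter
projection = ["name", "salary"]  # select narrows the visible columns

# Only materialization reads the underlying column values.
materialized = {c: [columns[c][r] for r in both.rows()] for c in projection}
print(materialized)                 # {'name': ['Dave'], 'salary': [75000.0]}
print(BitMask(1_000_000).nbytes())  # 125000
```

A million rows need 15,625 words of 8 bytes each, which is where the 125,000-byte figure comes from.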

## Determinism Guarantee

All operations are deterministic: identical inputs produce bit-identical outputs regardless of platform. This is enforced by:

- Kahan compensated summation for all floating-point reductions
- BTreeMap/BTreeSet everywhere (never HashMap, whose iteration order is randomized)
- SplitMix64 RNG with explicit seed threading
- No FMA instructions in reduction paths
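
Two of the points above, compensated summation and ordered-map iteration, can be illustrated with a short Python sketch. It assumes nothing about the crate's internals: `naive_sum`, `kahan_sum`, and `group_means` are hypothetical helpers, and sorted key iteration stands in for a BTreeMap.

```python
import math


def naive_sum(xs):
    total = 0.0
    for x in xs:
        total += x
    return total


def kahan_sum(xs):
    """Kahan compensated summation over xs, in the order given."""
    total = 0.0
    comp = 0.0  # running estimate of the low-order bits lost so far
    for x in xs:
        y = x - comp
        t = total + y
        comp = (t - total) - y
        total = t
    return total


def group_means(keys, values):
    """Per-group mean, emitting groups in sorted key order."""
    groups = {}
    for k, v in zip(keys, values):
        groups.setdefault(k, []).append(v)
    # Sorted iteration gives the same output order for any input order.
    return [(k, kahan_sum(groups[k]) / len(groups[k])) for k in sorted(groups)]


xs = [0.1] * 1_000_000
reference = math.fsum(xs)  # correctly rounded reference from the stdlib
naive_err = abs(naive_sum(xs) - reference)
kahan_err = abs(kahan_sum(xs) - reference)
print(kahan_err < naive_err)  # True

print(group_means(["sales", "eng", "sales", "eng"],
                  [70000.0, 90000.0, 75000.0, 85000.0]))
# [('eng', 87500.0), ('sales', 72500.0)]
```

Compensation keeps the accumulated rounding error small, while the fixed iteration order is what makes repeated runs produce the same bits.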

## License

MIT