# virtual-frame
Deterministic data pipeline toolkit for LLM training data preparation. Written in Rust with Python bindings via PyO3.
## Features
- **TidyView** — Zero-copy virtual views over columnar data. Filters flip bits in a bitmask; selects narrow a projection map. The original data is never copied or modified, so chaining operations costs at most a bitmask or projection-map update, never a copy of the columns.
- **NFA Regex** — Thompson NFA simulation with an O(n*m) worst-case time bound (n = input length, m = pattern size). No backtracking, so no catastrophic blowup. Supports Perl-style syntax: character classes, quantifiers (greedy + lazy), anchors, word boundaries, alternation.
- **NLP Primitives** — Levenshtein distance, Jaccard n-gram similarity, character/word n-gram extraction, whitespace and word-punctuation tokenizers, term frequency, cosine similarity.
- **Kahan Summation** — Compensated floating-point accumulation for deterministic results. Every aggregation (sum, mean, variance, standard deviation) uses Kahan summation.
- **SplitMix64 RNG** — Deterministic random number generator. Same seed = same sequence on every platform. Supports uniform f64, Box-Muller normal, and fork for independent substreams.
- **CSV Ingestion** — Single-pass type-inferring parser with streaming aggregation (sum, min/max) that never materializes the full dataset.
- **Columnar Storage** — Typed column vectors (Int64, Float64, String, Bool) with zero-copy borrowed keys for group-by and join operations.
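To make the "no backtracking" claim in the NFA Regex bullet concrete, here is a minimal state-set simulation in plain Python. It is only a sketch under loud assumptions: the pattern language is a hypothetical subset (literal characters, `.`, and the postfix quantifiers `?` and `*`), and the helper names `parse`, `closure`, and `full_match` are illustrative, not part of virtual-frame. The point is that each input character is processed once against a set of at most m + 1 states, which bounds the work at O(n*m).

```python
def parse(pattern):
    """Split a restricted pattern into (symbol, quantifier) tokens."""
    tokens = []
    i = 0
    while i < len(pattern):
        quant = ""
        if i + 1 < len(pattern) and pattern[i + 1] in ("?", "*"):
            quant = pattern[i + 1]
        tokens.append((pattern[i], quant))
        i += 2 if quant else 1
    return tokens


def closure(states, tokens):
    """Add every state reachable without consuming input.

    State s means "about to match token s"; optional tokens (? or *)
    can be skipped, so s + 1 is reachable for free.
    """
    seen = set(states)
    stack = list(states)
    while stack:
        s = stack.pop()
        if s < len(tokens) and tokens[s][1] in ("?", "*") and s + 1 not in seen:
            seen.add(s + 1)
            stack.append(s + 1)
    return seen


def full_match(pattern, text):
    """Return True if the whole text matches the restricted pattern."""
    tokens = parse(pattern)
    current = closure({0}, tokens)
    for ch in text:
        following = set()
        for s in current:
            if s == len(tokens):
                continue  # accepting state has no outgoing transitions
            symbol, quant = tokens[s]
            if symbol == "." or symbol == ch:
                # A starred token may repeat; anything else advances.
                following.add(s if quant == "*" else s + 1)
        current = closure(following, tokens)
        if not current:
            return False  # every thread died; no need to read further
    return len(tokens) in current


# A classic pathological case for backtracking engines: a?^25 a^25 vs a^25.
pathological = "a?" * 25 + "a" * 25
print(full_match(pathological, "a" * 25))
```

A backtracking matcher can explore an exponential number of paths on the last pattern; the state-set version touches at most 51 states per character.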
## Install
### Python
```bash
pip install virtual-frame
```
### Rust
```toml
[dependencies]
virtual-frame = "0.1"
```
## Quick Start (Python)
```python
import virtual_frame as vf
# Load data
df = vf.DataFrame({
    "name": ["Alice", "Bob", "Carol", "Dave"],
    "dept": ["eng", "eng", "sales", "sales"],
    "salary": [90000.0, 85000.0, 70000.0, 75000.0],
})
# Create a virtual view and chain operations
tv = vf.TidyView(df)
result = tv.filter_gt_float("salary", 72000.0).select(["name", "salary"])
print(result.materialize()) # 3 rows; nothing is copied until materialize()
# Group and summarise
summary = tv.group_summarise(["dept"], "salary", "mean", "avg_salary")
print(summary.get_column("avg_salary")) # [87500.0, 72500.0]
# NFA regex
print(vf.regex_find_all(r"\d+", "order-42 item-7 qty-100"))
# [(6, 8), (14, 15), (20, 23)]
# NLP
print(vf.levenshtein("kitten", "sitting")) # 3
print(vf.char_ngrams("hello", 2)) # {'el': 1, 'he': 1, 'll': 1, 'lo': 1}
# Deterministic RNG
rng = vf.Rng(42)
print(rng.next_f64()) # 0.741565... (same on every platform)
# Kahan-compensated sum
print(vf.kahan_sum([0.1] * 10_000_000)) # 1000000.0 (exact)
```
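The NLP results in the Quick Start can be cross-checked against a plain-Python reference. The sketch below uses the textbook algorithms and is not the library's Rust implementation. The function names mirror the calls above only for readability, the sorted output of `char_ngrams` is inferred from the example comment, and `jaccard_ngrams` is a hypothetical helper that assumes set-based (not multiset) Jaccard similarity.

```python
def levenshtein(a, b):
    """Edit distance via the two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(
                prev[j] + 1,         # deletion
                cur[j - 1] + 1,      # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = cur
    return prev[-1]


def char_ngrams(text, n):
    """Count character n-grams; keys are sorted for a stable order."""
    counts = {}
    for i in range(len(text) - n + 1):
        gram = text[i:i + n]
        counts[gram] = counts.get(gram, 0) + 1
    return dict(sorted(counts.items()))


def jaccard_ngrams(a, b, n):
    """Set-based Jaccard similarity over character n-grams (assumption)."""
    grams_a = set(char_ngrams(a, n))
    grams_b = set(char_ngrams(b, n))
    union = grams_a | grams_b
    if not union:
        return 1.0  # two inputs with no n-grams are treated as identical
    return len(grams_a & grams_b) / len(union)


print(levenshtein("kitten", "sitting"))
print(char_ngrams("hello", 2))
print(jaccard_ngrams("night", "nacht", 2))
```

Sorting the n-gram keys before returning keeps the output order independent of insertion order, in the same spirit as the ordered maps described under Determinism Guarantee.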
## Quick Start (Rust)
```rust
use virtual_frame::column::Column;
use virtual_frame::dataframe::DataFrame;
use virtual_frame::expr::{binop, col, BinOp, DExpr};
use virtual_frame::tidyview::TidyView;

fn main() {
    // Build a two-column frame; from_columns returns a Result.
    let df = DataFrame::from_columns(vec![
        ("x".into(), Column::Int(vec![1, 2, 3, 4, 5])),
        ("y".into(), Column::Float(vec![10.0, 20.0, 30.0, 40.0, 50.0])),
    ])
    .unwrap();

    // Wrap the frame in a view and keep only rows where y > 25.
    let view = TidyView::new(df);
    let filtered = view
        .filter(&binop(BinOp::Gt, col("y"), DExpr::LitFloat(25.0)))
        .unwrap();
    assert_eq!(filtered.nrows(), 3); // y = 30, 40, 50
}
```
## Architecture
```
TidyView = Rc<DataFrame> + BitMask + ProjectionMap + Option<Ordering>
```
- **BitMask**: One bit per row, packed into 64-bit words. A million-row filter costs 125,000 bytes (about 122 KiB). Chained filters AND their bitmasks together, so still no data is copied.
- **ProjectionMap**: Tracks visible column indices. Select narrows the map without touching data.
- **Ordering**: Lazy sort permutation. Only materialized when needed.
- **ColumnKeyRef**: Borrowed keys into column data for zero-allocation group-by and join index construction.
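The BitMask and ProjectionMap ideas above can be sketched in a few lines of plain Python. This is an illustrative model only, not the crate's data layout: the helpers `words_needed`, `build_mask`, `and_masks`, `visible_rows`, and `materialize` are hypothetical names, and Python integers stand in for the 64-bit words.

```python
WORD_BITS = 64


def words_needed(nrows):
    """Number of 64-bit words required to hold one bit per row."""
    return (nrows + WORD_BITS - 1) // WORD_BITS


def build_mask(values, predicate):
    """Set bit `row` when the predicate holds for that row's value."""
    words = [0] * words_needed(len(values))
    for row, value in enumerate(values):
        if predicate(value):
            words[row // WORD_BITS] |= 1 << (row % WORD_BITS)
    return words


def and_masks(a, b):
    """Chaining two filters is a word-wise AND; column data is untouched."""
    return [x & y for x, y in zip(a, b)]


def visible_rows(words, nrows):
    """Row indices whose bit is still set."""
    return [r for r in range(nrows)
            if (words[r // WORD_BITS] >> (r % WORD_BITS)) & 1]


def materialize(columns, mask, projection, nrows):
    """Copy data only here: visible rows of the projected columns."""
    rows = visible_rows(mask, nrows)
    return {columns[c][0]: [columns[c][1][r] for r in rows]
            for c in projection}


# Same data as the Python Quick Start.
columns = [
    ("name", ["Alice", "Bob", "Carol", "Dave"]),
    ("dept", ["eng", "eng", "sales", "sales"]),
    ("salary", [90000.0, 85000.0, 70000.0, 75000.0]),
]
nrows = 4

high_pay = build_mask(columns[2][1], lambda s: s > 72000.0)
in_eng = build_mask(columns[1][1], lambda d: d == "eng")
both = and_masks(high_pay, in_eng)  # chained filter
projection = [0, 2]                 # select name and salary by index

result = materialize(columns, both, projection, nrows)
mask_bytes = words_needed(1_000_000) * 8
print(result)
print(mask_bytes)
```

The last line reproduces the size arithmetic: one million rows need 15,625 words of 8 bytes each.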
## Determinism Guarantee
All operations are deterministic: identical inputs produce bit-identical outputs regardless of platform. This is enforced by:
- Kahan compensated summation for all floating-point reductions
- BTreeMap/BTreeSet everywhere (never HashMap, whose iteration order is randomized)
- SplitMix64 RNG with explicit seed threading
- No FMA instructions in reduction paths
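Two of the ingredients above, Kahan summation and SplitMix64, are small enough to sketch in plain Python. This is a reference model of the published algorithms, not the crate's code. The class and function names are illustrative, and the `next_f64` conversion (top 53 bits scaled into [0, 1)) is an assumption about how a uniform float might be derived, since the README does not specify it.

```python
import math

MASK64 = (1 << 64) - 1


def naive_sum(values):
    """Plain left-to-right accumulation, written as an explicit loop
    because the built-in sum() may itself compensate on newer Pythons."""
    total = 0.0
    for v in values:
        total += v
    return total


def kahan_sum(values):
    """Kahan compensated summation: carry the rounding error forward."""
    total = 0.0
    comp = 0.0  # running estimate of the low-order bits lost so far
    for v in values:
        y = v - comp
        t = total + y
        comp = (t - total) - y  # what was lost when adding y to total
        total = t
    return total


class SplitMix64:
    """SplitMix64 generator using only 64-bit integer arithmetic."""

    def __init__(self, seed):
        self.state = seed & MASK64

    def next_u64(self):
        self.state = (self.state + 0x9E3779B97F4A7C15) & MASK64
        z = self.state
        z = ((z ^ (z >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
        z = ((z ^ (z >> 27)) * 0x94D049BB133111EB) & MASK64
        return z ^ (z >> 31)

    def next_f64(self):
        # Assumed conversion: keep the top 53 bits, scale into [0, 1).
        return (self.next_u64() >> 11) * (1.0 / (1 << 53))


xs = [0.1] * 1000
reference = math.fsum(xs)  # correctly rounded sum, used as ground truth
print(abs(naive_sum(xs) - reference), abs(kahan_sum(xs) - reference))

rng_a = SplitMix64(1234567)
rng_b = SplitMix64(1234567)
print([rng_a.next_u64() for _ in range(3)] == [rng_b.next_u64() for _ in range(3)])
```

Because SplitMix64 touches only integer operations with explicit wraparound, the same seed yields the same sequence wherever 64-bit arithmetic is available. Kahan summation reduces rounding error; bit-identical results across runs still depend on adding the values in a fixed order.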
## License
MIT