virtual-frame
Deterministic data pipeline toolkit for LLM training data preparation. Written in Rust with Python bindings via PyO3.
Features
- TidyView — Zero-copy virtual views over columnar data. Filters flip bits in a bitmask; selects narrow a projection map. The original data is never copied or modified. Chain operations freely with no allocation overhead.
- NFA Regex — Thompson NFA simulation with O(n*m) worst-case guarantee. No backtracking, no catastrophic blowup. Supports Perl-style syntax: character classes, quantifiers (greedy + lazy), anchors, word boundaries, alternation.
- NLP Primitives — Levenshtein distance, Jaccard n-gram similarity, character/word n-gram extraction, whitespace and word-punctuation tokenizers, term frequency, cosine similarity.
- Kahan Summation — Compensated floating-point accumulation for deterministic results. Every aggregation (sum, mean, variance, standard deviation) uses Kahan summation.
- SplitMix64 RNG — Deterministic random number generator. Same seed = same sequence on every platform. Supports uniform f64, Box-Muller normal, and fork for independent substreams.
- CSV Ingestion — Single-pass type-inferring parser with streaming aggregation (sum, min/max) that never materializes the full dataset.
- Columnar Storage — Typed column vectors (Int64, Float64, String, Bool) with zero-copy borrowed keys for group-by and join operations.
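To make the NLP primitives listed above concrete, here is a minimal pure-Python sketch of Levenshtein distance, character n-gram counting, and Jaccard n-gram similarity. It illustrates the techniques only and is not the library's implementation. The function names `levenshtein`, `char_ngrams`, and `jaccard_ngram` are hypothetical and chosen for this example.

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Edit distance using the classic two-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def char_ngrams(text: str, n: int) -> dict:
    """Count character n-grams; keys are sorted so iteration order is stable."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return dict(sorted(counts.items()))


def jaccard_ngram(a: str, b: str, n: int = 2) -> float:
    """Jaccard similarity over the sets of character n-grams."""
    sa, sb = set(char_ngrams(a, n)), set(char_ngrams(b, n))
    if not sa and not sb:
        return 1.0  # convention assumed here: two empty sets count as identical
    return len(sa & sb) / len(sa | sb)
```

Sorting the n-gram keys is a design choice made for this sketch: it gives the same ordered-map behaviour the Determinism Guarantee section attributes to BTreeMap.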
Install
Python
Rust
[]
= "0.1"
Quick Start (Python)
# Load data
=
# Create a virtual view and chain operations
=
=
# 3 rows, no data copied
# Group and summarise
=
# [87500.0, 72500.0]
# NFA regex
# [(6, 8), (14, 15), (20, 23)]
# NLP
# 3
# {'el': 1, 'he': 1, 'll': 1, 'lo': 1}
# Deterministic RNG
=
# 0.741565... (same on every platform)
# Kahan-compensated sum
# 1000000.0 (exact)
Quick Start (Rust)
use DataFrame;
use Column;
use ;
use ;
let df = from_columns.unwrap;
let view = new;
let filtered = view
.filter
.unwrap;
assert_eq!; // rows where y > 25
Architecture
TidyView = Rc<DataFrame> + BitMask + ProjectionMap + Option<Ordering>
- BitMask: One bit per row, packed into 64-bit words. A million-row mask takes 125,000 bytes (about 122 KiB). Chained filters AND their bitmasks together, so no column data is copied.
- ProjectionMap: Tracks visible column indices. Select narrows the map without touching data.
- Ordering: Lazy sort permutation. Only materialized when needed.
- ColumnKeyRef: Borrowed keys into column data for zero-allocation group-by and join index construction.
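The bitmask idea described above can be sketched in a few lines of Python. This is an illustration under assumptions, not the library's Rust type: the class name `BitMask` is borrowed from the list above, but the methods `from_predicate`, `and_`, `rows`, and `packed_bytes` are hypothetical.

```python
class BitMask:
    """One bit per row, packed into 64-bit words (illustrative sketch)."""

    WORD_BITS = 64

    def __init__(self, n_rows: int, words=None):
        self.n_rows = n_rows
        n_words = (n_rows + self.WORD_BITS - 1) // self.WORD_BITS
        # Start with every row hidden unless explicit words are supplied.
        self.words = words if words is not None else [0] * n_words

    @classmethod
    def from_predicate(cls, column, predicate):
        """Set the bit for every row whose value satisfies the predicate."""
        mask = cls(len(column))
        for row, value in enumerate(column):
            if predicate(value):
                mask.words[row // cls.WORD_BITS] |= 1 << (row % cls.WORD_BITS)
        return mask

    def and_(self, other: "BitMask") -> "BitMask":
        """Chain two filters by ANDing their words; the column is not touched."""
        assert self.n_rows == other.n_rows
        return BitMask(self.n_rows, [a & b for a, b in zip(self.words, other.words)])

    def rows(self):
        """Indices of the rows that are still visible."""
        return [r for r in range(self.n_rows)
                if (self.words[r // self.WORD_BITS] >> (r % self.WORD_BITS)) & 1]

    def packed_bytes(self) -> int:
        """Size the mask would occupy as packed 64-bit words."""
        return len(self.words) * 8


y = [10, 20, 30, 40, 50]                         # hypothetical column
above = BitMask.from_predicate(y, lambda v: v > 25)
below = BitMask.from_predicate(y, lambda v: v < 50)
visible = above.and_(below).rows()               # [2, 3]
million = BitMask(1_000_000).packed_bytes()      # 125000 bytes, about 122 KiB
```

Python integers are arbitrary precision, so `packed_bytes` reports the size a packed `u64` array would have, not the memory the Python list actually uses.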
Determinism Guarantee
All operations are deterministic: identical inputs produce bit-identical outputs regardless of platform. This is enforced by:
- Kahan compensated summation for all floating-point reductions
- BTreeMap/BTreeSet everywhere, never HashMap, whose iteration order can differ between runs
- SplitMix64 RNG with explicit seed threading
- No FMA instructions in reduction paths
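Two of the mechanisms above are small enough to sketch directly. The Python below shows textbook Kahan compensated summation and the standard public-domain SplitMix64 step function. It is a sketch of the general techniques, not the library's code: the names `kahan_sum`, `naive_sum`, `SplitMix64.next_u64`, and `next_f64` are hypothetical, and the 53-bit conversion to a float is an assumption about how a uniform f64 might be derived.

```python
def naive_sum(values):
    """Plain left-to-right accumulation, for comparison."""
    total = 0.0
    for x in values:
        total += x
    return total


def kahan_sum(values):
    """Kahan summation: carry the rounding error of each add forward."""
    total = 0.0
    compensation = 0.0  # running estimate of the low-order bits lost so far
    for x in values:
        y = x - compensation
        t = total + y
        compensation = (t - total) - y  # what was lost when forming t
        total = t
    return total


MASK64 = (1 << 64) - 1  # emulate wrapping u64 arithmetic


class SplitMix64:
    """SplitMix64 with an explicit seed; the state is a single 64-bit word."""

    def __init__(self, seed: int):
        self.state = seed & MASK64

    def next_u64(self) -> int:
        self.state = (self.state + 0x9E3779B97F4A7C15) & MASK64
        z = self.state
        z = ((z ^ (z >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
        z = ((z ^ (z >> 27)) * 0x94D049BB133111EB) & MASK64
        return z ^ (z >> 31)

    def next_f64(self) -> float:
        # Assumed conversion: keep the top 53 bits and scale into [0, 1).
        return (self.next_u64() >> 11) * (1.0 / (1 << 53))


# Each 1e-16 is below half an ulp of 1.0, so the naive loop never moves.
values = [1.0] + [1e-16] * 10_000
plain = naive_sum(values)        # stays at 1.0
compensated = kahan_sum(values)  # close to 1.000000000001

# Same seed, same sequence: integer-only state makes this platform independent.
first = [SplitMix64(42).next_u64() for _ in range(2)]
```

`naive_sum` is written as an explicit loop because the built-in `sum` in Python 3.12 and later applies its own compensation to floats, which would hide the difference.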
License
MIT