virtual-frame 0.1.0

Deterministic data pipeline toolkit for LLM training — bitmask-filtered virtual views, NFA regex, Kahan summation, full audit trail. Python bindings included.
Repository: AdamEzzat1/virtual-frame

virtual-frame

Deterministic data pipeline toolkit for LLM training data preparation. Written in Rust with Python bindings via PyO3.

Features

  • TidyView: Zero-copy virtual views over columnar data. Filters flip bits in a bitmask; selects narrow a projection map. The original data is never copied or modified, so chained operations allocate only the mask and the map, not new columns.
  • NFA Regex: Thompson NFA simulation with O(n*m) worst-case matching time for an input of length n and a pattern of size m. No backtracking, no catastrophic blowup. Supports Perl-style syntax: character classes, quantifiers (greedy and lazy), anchors, word boundaries, alternation.
  • NLP Primitives: Levenshtein distance, Jaccard n-gram similarity, character/word n-gram extraction, whitespace and word-punctuation tokenizers, term frequency, cosine similarity.
  • Kahan Summation: Compensated floating-point accumulation for deterministic results. Every aggregation (sum, mean, variance, standard deviation) uses Kahan summation.
  • SplitMix64 RNG: Deterministic random number generator. The same seed yields the same sequence on every platform. Supports uniform f64, Box-Muller normal, and fork for independent substreams.
  • CSV Ingestion: Single-pass type-inferring parser with streaming aggregation (sum, min/max) that never materializes the full dataset.
  • Columnar Storage: Typed column vectors (Int64, Float64, String, Bool) with zero-copy borrowed keys for group-by and join operations.
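
The NLP primitives listed above follow standard textbook definitions. The following is a minimal standalone sketch in plain Python, not the library's implementation. The names `levenshtein` and `char_ngrams` mirror the Quick Start calls; `jaccard_ngram`, the choice of set-based (rather than multiset) Jaccard, and the convention for two empty inputs are assumptions made for illustration.

```python
def levenshtein(a, b):
    """Edit distance via the two-row dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def char_ngrams(text, n):
    """Count character n-grams; keys are sorted so the output order is stable."""
    counts = {}
    for i in range(len(text) - n + 1):
        gram = text[i:i + n]
        counts[gram] = counts.get(gram, 0) + 1
    return dict(sorted(counts.items()))


def jaccard_ngram(a, b, n):
    """Jaccard similarity over the sets of character n-grams (assumed variant)."""
    grams_a = set(char_ngrams(a, n))
    grams_b = set(char_ngrams(b, n))
    if not grams_a and not grams_b:
        return 1.0  # assumed convention: two empty inputs count as identical
    return len(grams_a & grams_b) / len(grams_a | grams_b)


distance = levenshtein("kitten", "sitting")
bigrams = char_ngrams("hello", 2)
similarity = jaccard_ngram("night", "nacht", 2)  # one shared bigram out of seven
print(distance, bigrams, similarity)
```

Sorting the n-gram keys is a small nod to the determinism theme: the output does not depend on insertion order.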

Install

Python

pip install virtual-frame

Rust

[dependencies]
virtual-frame = "0.1"

Quick Start (Python)

import virtual_frame as vf

# Load data
df = vf.DataFrame({
    "name": ["Alice", "Bob", "Carol", "Dave"],
    "dept": ["eng", "eng", "sales", "sales"],
    "salary": [90000.0, 85000.0, 70000.0, 75000.0],
})

# Create a virtual view and chain operations
tv = vf.TidyView(df)
result = tv.filter_gt_float("salary", 72000.0).select(["name", "salary"])
print(result.materialize())  # 3 rows; nothing is copied until materialize()

# Group and summarise
summary = tv.group_summarise(["dept"], "salary", "mean", "avg_salary")
print(summary.get_column("avg_salary"))  # [87500.0, 72500.0]

# NFA regex
print(vf.regex_find_all(r"\d+", "order-42 item-7 qty-100"))
# [(6, 8), (14, 15), (20, 23)]

# NLP
print(vf.levenshtein("kitten", "sitting"))  # 3
print(vf.char_ngrams("hello", 2))  # {'el': 1, 'he': 1, 'll': 1, 'lo': 1}

# Deterministic RNG
rng = vf.Rng(42)
print(rng.next_f64())  # 0.741565... (same on every platform)

# Kahan-compensated sum
print(vf.kahan_sum([0.1] * 10_000_000))  # 1000000.0 (no accumulated rounding drift)
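
To show why a seeded generator can be reproducible across platforms, here is a minimal sketch of the public SplitMix64 algorithm in plain Python. It uses only 64-bit integer arithmetic, so no platform-dependent floating-point behavior enters until the final scaling step. The class name `SplitMix64`, the 53-bit float conversion, and the `fork` strategy (seeding a child from the parent's next output) are assumptions for illustration and may differ from what `vf.Rng` does internally.

```python
MASK64 = (1 << 64) - 1  # emulate wrapping 64-bit arithmetic with Python ints


class SplitMix64:
    """Standalone sketch of SplitMix64; not the virtual-frame implementation."""

    def __init__(self, seed):
        self.state = seed & MASK64

    def next_u64(self):
        # Advance the state by the golden-ratio increment, then mix the bits.
        self.state = (self.state + 0x9E3779B97F4A7C15) & MASK64
        z = self.state
        z = ((z ^ (z >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
        z = ((z ^ (z >> 27)) * 0x94D049BB133111EB) & MASK64
        return z ^ (z >> 31)

    def next_f64(self):
        # Keep the top 53 bits and scale into [0, 1) (assumed conversion).
        return (self.next_u64() >> 11) * (1.0 / (1 << 53))

    def fork(self):
        # Hypothetical fork: seed an independent child from the next output.
        return SplitMix64(self.next_u64())


a = SplitMix64(42)
b = SplitMix64(42)
seq_a = [a.next_f64() for _ in range(5)]
seq_b = [b.next_f64() for _ in range(5)]
print(seq_a == seq_b)  # two generators with the same seed stay in lockstep

child = a.fork()
child_draw = child.next_f64()
```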

Quick Start (Rust)

use virtual_frame::column::Column;
use virtual_frame::dataframe::DataFrame;
use virtual_frame::expr::{binop, col, BinOp, DExpr};
use virtual_frame::tidyview::TidyView;

fn main() {
    // Build a two-column frame: integer x and float y.
    let df = DataFrame::from_columns(vec![
        ("x".into(), Column::Int(vec![1, 2, 3, 4, 5])),
        ("y".into(), Column::Float(vec![10.0, 20.0, 30.0, 40.0, 50.0])),
    ])
    .unwrap();

    // Wrap the frame in a virtual view and keep only rows where y > 25.
    let view = TidyView::new(df);
    let filtered = view
        .filter(&binop(BinOp::Gt, col("y"), DExpr::LitFloat(25.0)))
        .unwrap();
    assert_eq!(filtered.nrows(), 3); // rows where y > 25
}

Architecture

TidyView = Rc<DataFrame> + BitMask + ProjectionMap + Option<Ordering>
  • BitMask: One bit per row, packed into 64-bit words. A million-row filter costs about 122 KiB (125,000 bytes). Chained filters AND their bitmasks, so no column data is copied.
  • ProjectionMap: Tracks visible column indices. Select narrows the map without touching data.
  • Ordering: Lazy sort permutation. Only materialized when needed.
  • ColumnKeyRef: Borrowed keys into column data for zero-allocation group-by and join index construction.
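
The components above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the crate's code: `MaskedView`, `full_mask`, and the method names are hypothetical, Python integers stand in for 64-bit words, and the lazy ordering component is omitted.

```python
WORD_BITS = 64


def full_mask(nrows):
    """All-rows-visible mask: one bit per row, packed into 64-bit words."""
    words = [(1 << WORD_BITS) - 1] * (nrows // WORD_BITS)
    remainder = nrows % WORD_BITS
    if remainder:
        words.append((1 << remainder) - 1)
    return words


class MaskedView:
    """Hypothetical view: shared columns + row bitmask + visible column names."""

    def __init__(self, columns, mask=None, visible=None):
        self.columns = columns  # shared reference, never copied
        self.nrows = len(next(iter(columns.values())))
        self.mask = full_mask(self.nrows) if mask is None else mask
        self.visible = list(columns) if visible is None else visible

    def filter(self, name, predicate):
        # Build a hit mask for this predicate, then AND it with the current mask.
        hits = [0] * len(self.mask)
        for row, value in enumerate(self.columns[name]):
            if predicate(value):
                hits[row // WORD_BITS] |= 1 << (row % WORD_BITS)
        combined = [old & new for old, new in zip(self.mask, hits)]
        return MaskedView(self.columns, combined, self.visible)

    def select(self, names):
        # Narrow the projection; the underlying columns are untouched.
        missing = [n for n in names if n not in self.visible]
        if missing:
            raise KeyError(missing)
        return MaskedView(self.columns, self.mask, list(names))

    def materialize(self):
        # Only here are values copied out of the shared columns.
        rows = [r for r in range(self.nrows)
                if (self.mask[r // WORD_BITS] >> (r % WORD_BITS)) & 1]
        return {n: [self.columns[n][r] for r in rows] for n in self.visible}


columns = {
    "name": ["Alice", "Bob", "Carol", "Dave"],
    "dept": ["eng", "eng", "sales", "sales"],
    "salary": [90000.0, 85000.0, 70000.0, 75000.0],
}
view = MaskedView(columns)
high = view.filter("salary", lambda s: s > 72000.0).select(["name", "salary"])
result = high.materialize()
print(result)
print(len(full_mask(1_000_000)) * 8)  # mask size in bytes for a million rows
```

Each `filter` or `select` returns a new view that shares the same `columns` object, so the cost of a chained step is one mask or one short list of names.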

Determinism Guarantee

All operations are deterministic: identical inputs produce bit-identical outputs regardless of platform. This is enforced by:

  • Kahan compensated summation for all floating-point reductions
  • BTreeMap/BTreeSet everywhere (never HashMap, whose iteration order is randomized)
  • SplitMix64 RNG with explicit seed threading
  • No FMA instructions in reduction paths
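
Two of these measures are easy to illustrate in isolation. The sketch below, in plain Python, shows textbook Kahan summation and a grouped mean that visits keys in sorted order, which is the role an ordered map plays in the Rust code. The function names `kahan_sum` and `grouped_mean` are illustrative assumptions, and this is not the crate's implementation.

```python
import math


def kahan_sum(values):
    """Kahan compensated summation with a fixed left-to-right order."""
    total = 0.0
    compensation = 0.0  # running estimate of the low-order bits lost so far
    for x in values:
        y = x - compensation            # re-inject the previously lost bits
        t = total + y                   # low-order bits of y may be lost here
        compensation = (t - total) - y  # recover what was just lost
        total = t
    return total


def grouped_mean(keys, values):
    """Mean per key, emitted in sorted key order so the output order is fixed."""
    groups = {}
    for key, value in zip(keys, values):
        groups.setdefault(key, []).append(value)
    return [(key, kahan_sum(groups[key]) / len(groups[key]))
            for key in sorted(groups)]


data = [0.1] * 1000
compensated = kahan_sum(data)
reference = math.fsum(data)  # correctly rounded sum from the standard library
print(abs(compensated - reference))

means = grouped_mean(
    ["sales", "eng", "sales", "eng"],
    [70000.0, 90000.0, 75000.0, 85000.0],
)
print(means)
```

Sorting the group keys removes any dependence on hash or insertion order, and summing each group in a fixed order keeps the floating-point result reproducible from run to run.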

License

MIT