rgwml 2.0.0

Typed, local-first tabular data library with columnar in-memory storage.

I made the first version of rgwml far too ambitious. It tried to be a Rust-flavoured Pandas clone, a Python bridge layer, an ML wrapper, a database client, and an AI glue crate at the same time. It got downloads, but the shape was wrong.

This repo is the correction.

rgwml is now a local-first tabular data library built around a typed, columnar in-memory model. No hidden Python. No network glue. No home-directory writes. No install-time side effects. Just tables, local files, and a memory model that has a real chance of staying lean.

What rgwml is now

  • A typed table core with columnar storage
  • A local CSV reader and writer
  • A small query surface for filter, sort_by, join, group_by, and materialize
  • A memory benchmark harness for comparing the new engine against a row-of-strings baseline and pandas

What I removed on purpose

  • OpenAI helpers
  • XGBoost wrappers
  • clustering helpers
  • database clients
  • Google Sheets and public URL loaders
  • shelling out to Python scripts
  • hidden filesystem writes during library init

That old scope was bloated and dishonest for a Rust library. The new build is intentionally narrower.

Current state

What works today:

  • CSV read and write
  • schema inference for bool, i64, f64, and string columns
  • opt-in date and timestamp inference for ISO-like CSV values
  • dictionary encoding for low-cardinality string columns
  • bitmap-backed null tracking
  • real pretty rendering for schema and preview rows
  • filter, sort, join, group-by, and materialize on the new table core
  • benchmark tooling
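
The scalar inference rule can be pictured as trying the narrowest type that every non-empty cell parses as: bool, then i64, then f64, then string. This is an illustrative stand-alone sketch of that idea, not rgwml's actual inference code; the names here are hypothetical.

```rust
// Hypothetical sketch of per-column scalar inference, narrowest type first.
// Empty cells are treated as nulls and skipped.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Scalar {
    Bool,
    I64,
    F64,
    Str,
}

fn infer_column(values: &[&str]) -> Scalar {
    let cells: Vec<&str> = values.iter().copied().filter(|v| !v.is_empty()).collect();
    if cells.iter().all(|v| matches!(*v, "true" | "false")) {
        Scalar::Bool
    } else if cells.iter().all(|v| v.parse::<i64>().is_ok()) {
        Scalar::I64
    } else if cells.iter().all(|v| v.parse::<f64>().is_ok()) {
        Scalar::F64
    } else {
        Scalar::Str
    }
}

fn main() {
    assert_eq!(infer_column(&["true", "false", ""]), Scalar::Bool);
    assert_eq!(infer_column(&["1", "2", ""]), Scalar::I64);
    assert_eq!(infer_column(&["1", "2.5"]), Scalar::F64);
    assert_eq!(infer_column(&["1", "x"]), Scalar::Str);
    println!("inference checks passed");
}
```

The ordering matters: "1" also parses as f64, so i64 must be tried first to keep integers exact.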

What is not done yet:

  • additional local file readers, which will only be added once they are real and worth supporting

Date And Timestamp Policy

If you enable date_inference, rgwml currently recognizes dates in this format:

  • YYYY-MM-DD

If you enable timestamp_inference, rgwml currently recognizes:

  • RFC3339 / ISO 8601 timestamps with a timezone or trailing Z
  • YYYY-MM-DD HH:MM:SS[.fraction]
  • YYYY-MM-DDTHH:MM:SS[.fraction]
  • YYYY-MM-DD HH:MM
  • YYYY-MM-DDTHH:MM

Policy details:

  • date-only values infer Date32, not TimestampMs
  • date-only values are accepted for explicitly typed TimestampMs columns and normalize to midnight UTC
  • anything outside those formats stays string data unless you provide an explicit schema
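
The policy above can be sketched as a small shape classifier. This is an illustrative stand-alone sketch, not rgwml's actual inference code: it checks shapes only (no calendar validation), and all names are hypothetical.

```rust
// Hypothetical sketch of the date/timestamp inference policy above.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Inferred {
    Date32,
    TimestampMs,
    Str,
}

fn all_digits(s: &str) -> bool {
    !s.is_empty() && s.bytes().all(|b| b.is_ascii_digit())
}

// YYYY-MM-DD, shape check only.
fn is_date(s: &str) -> bool {
    let p: Vec<&str> = s.split('-').collect();
    p.len() == 3
        && p[0].len() == 4
        && p[1].len() == 2
        && p[2].len() == 2
        && p.iter().all(|x| all_digits(x))
}

// HH:MM, HH:MM:SS, or HH:MM:SS.fraction
fn is_time(s: &str) -> bool {
    let (hms, frac) = match s.split_once('.') {
        Some((a, b)) => (a, Some(b)),
        None => (s, None),
    };
    let p: Vec<&str> = hms.split(':').collect();
    let shape_ok = (p.len() == 2 || p.len() == 3)
        && p.iter().all(|x| x.len() == 2 && all_digits(x));
    // a fraction is only legal after a full HH:MM:SS
    shape_ok && frac.map_or(true, |f| all_digits(f) && p.len() == 3)
}

fn infer(cell: &str) -> Inferred {
    if is_date(cell) {
        return Inferred::Date32; // date-only infers Date32, not TimestampMs
    }
    // Strip a trailing Z or a +HH:MM / -HH:MM zone offset (shape check only).
    let body = if let Some(b) = cell.strip_suffix('Z') {
        b
    } else if cell.len() > 6
        && matches!(cell.as_bytes()[cell.len() - 6], b'+' | b'-')
        && is_time(&cell[cell.len() - 5..])
    {
        &cell[..cell.len() - 6]
    } else {
        cell
    };
    match body.split_once(|c| c == 'T' || c == ' ') {
        Some((d, t)) if is_date(d) && is_time(t) => Inferred::TimestampMs,
        _ => Inferred::Str,
    }
}

fn main() {
    assert_eq!(infer("2024-01-02"), Inferred::Date32);
    assert_eq!(infer("2024-01-02 03:04:05"), Inferred::TimestampMs);
    assert_eq!(infer("2024-01-02T03:04:05.123Z"), Inferred::TimestampMs);
    assert_eq!(infer("2024-01-02T03:04"), Inferred::TimestampMs);
    assert_eq!(infer("02/01/2024"), Inferred::Str);
    println!("all inference checks passed");
}
```

Anything that fails every shape check falls through to string, which matches the "stays string data unless you provide an explicit schema" rule.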

Data model

The old repo forced everything into String. That was one of the worst design decisions in the whole crate.

The new core stores data by column:

  • fixed-width types use contiguous Vec<T>
  • strings use offsets + bytes
  • repeated strings can use dictionary encoding with integer keys
  • nulls live in bitmaps instead of Vec<Option<T>>
  • row selections can stay as views until you call materialize()

That is the minimum respectable shape if the goal is to beat a Vec<Vec<String>> model and stay competitive with pandas on memory.
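
The dictionary-encoding and bitmap-null ideas can be pictured in a few lines of plain Rust. This is a minimal model of the concept, not rgwml's actual types or API: each distinct string is stored once in a dictionary, rows hold integer keys, and validity lives in a packed u64 bitmap instead of Vec<Option<T>>.

```rust
// Illustrative model of a dictionary-encoded string column with bitmap nulls.
// Names and layout are hypothetical, not rgwml's actual internals.
use std::collections::HashMap;

struct DictColumn {
    dict: Vec<String>, // unique values, stored once, in first-seen order
    keys: Vec<u32>,    // one integer key per row
    valid: Vec<u64>,   // null bitmap: bit i set => row i is non-null
}

impl DictColumn {
    fn from_rows(rows: &[Option<&str>]) -> Self {
        let mut index: HashMap<String, u32> = HashMap::new();
        let mut col = DictColumn {
            dict: Vec::new(),
            keys: Vec::with_capacity(rows.len()),
            valid: vec![0u64; (rows.len() + 63) / 64],
        };
        for (i, row) in rows.iter().enumerate() {
            match row {
                None => col.keys.push(0), // key is ignored for null rows
                Some(s) => {
                    let key = *index.entry(s.to_string()).or_insert_with(|| {
                        col.dict.push(s.to_string());
                        (col.dict.len() - 1) as u32
                    });
                    col.keys.push(key);
                    col.valid[i / 64] |= 1u64 << (i % 64);
                }
            }
        }
        col
    }

    fn get(&self, i: usize) -> Option<&str> {
        if (self.valid[i / 64] >> (i % 64)) & 1 == 1 {
            Some(&self.dict[self.keys[i] as usize])
        } else {
            None
        }
    }
}

fn main() {
    let rows = [Some("smb"), Some("enterprise"), None, Some("smb")];
    let col = DictColumn::from_rows(&rows);
    assert_eq!(col.dict.len(), 2); // two unique strings, stored once
    assert_eq!(col.get(3), Some("smb"));
    assert_eq!(col.get(2), None);
    println!("dict entries: {}", col.dict.len());
}
```

With a low-cardinality column, the per-row cost collapses from a heap-allocated String to 4 bytes of key plus one validity bit, which is where the memory wins against a Vec<Vec<String>> model come from.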

Example: Read, Filter, Sort, Write

use rgwml::{
    read_csv, write_csv, ColumnSelector, CompareOp, CsvReadOptions, CsvWriteOptions, Literal,
    NullOrder, Predicate, SortKey, SortOrder,
};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let table = read_csv("sales.csv", &CsvReadOptions::default())?;

    let filtered = table.filter(&Predicate::Comparison {
        column: ColumnSelector::from("revenue"),
        op: CompareOp::Gt,
        value: Some(Literal::F64(100.0)),
    })?;

    let sorted = filtered
        .sort_by(&[SortKey {
            column: ColumnSelector::from("segment"),
            order: SortOrder::Ascending,
            nulls: NullOrder::Last,
        }])?
        .materialize()?;

    write_csv(&sorted, "sales.filtered.csv", &CsvWriteOptions::default())?;
    Ok(())
}

Example: Group And Aggregate

use std::sync::Arc;

use rgwml::{
    read_csv, AggregateExpr, AggregateOp, ColumnSelector, CsvReadOptions, GroupKey,
};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let table = read_csv("sales.csv", &CsvReadOptions::default())?;

    let grouped = table.group_by(
        &[GroupKey {
            column: ColumnSelector::from("segment"),
        }],
        &[
            AggregateExpr {
                input: None,
                op: AggregateOp::CountRows,
                alias: Arc::from("rows"),
            },
            AggregateExpr {
                input: Some(ColumnSelector::from("revenue")),
                op: AggregateOp::Sum,
                alias: Arc::from("revenue_sum"),
            },
            AggregateExpr {
                input: Some(ColumnSelector::from("revenue")),
                op: AggregateOp::Mean,
                alias: Arc::from("revenue_mean"),
            },
        ],
    )?;

    println!("groups: {}", grouped.nrows());
    Ok(())
}

Memory Benchmarks

There is a dedicated benchmark binary in this repo:

cargo run --bin bench_memory -- mixed 20000
cargo run --bin bench_memory -- low_card_strings 50000

Current local runs:

Case                     v2 peak delta   Row-strings peak delta   pandas peak delta
mixed 20000              5.5MB           11.6MB                   9.0MB
low_card_strings 50000   1.5MB           15.0MB                   9.8MB

The low-cardinality string case is the one I care about most right now. That is where the old String-everywhere model was especially dumb, and where dictionary encoding should earn its keep.

There is also an ops benchmark for the actual table kernels:

cargo run --bin bench_ops -- 200000

That benchmark times:

  • filter plus materialize
  • group-by with count, sum, and mean
  • inner join on typed key columns

If pandas is installed, bench_ops will compare against a small pandas script too.

Build And Verify

cargo test
cargo clippy --all-targets -- -D warnings
cargo run --bin bench_memory -- mixed 20000
cargo run --bin bench_ops -- 200000

Repo Status

This repo is the 2.0.0 break from the old over-scoped crate.

So the honest way to read this project is:

  • 1.x history: over-scoped legacy rgwml
  • 2.0.0: typed local-first rebuild with columnar storage, real joins, and no Python side effects
  • current work: tighten the kernels and keep the benchmark story honest

If you were using the old Python-heavy “do everything” version, this README is no longer describing that crate. That is deliberate.