rgwml 2.0.0

Typed, local-first tabular data library with columnar in-memory storage.

I made the first version of rgwml far too ambitious. It tried to be a Rust-flavoured Pandas clone, a Python bridge layer, an ML wrapper, a database client, and an AI glue crate at the same time. It got downloads, but the shape was wrong.

This repo is the correction.

rgwml is now a local-first tabular data library built around a typed, columnar in-memory model. No hidden Python. No network glue. No home-directory writes. No install-time side effects. Just tables, local files, and a memory model that has a real chance of staying lean.

What rgwml is now

  • A typed table core with columnar storage
  • A local CSV reader and writer
  • A small query surface for filter, sort_by, join, group_by, and materialize
  • A memory benchmark harness for comparing the new engine against a row-of-strings baseline and pandas

What I removed on purpose

  • OpenAI helpers
  • XGBoost wrappers
  • clustering helpers
  • database clients
  • Google Sheets and public URL loaders
  • shelling out to Python scripts
  • hidden filesystem writes during library init

That old scope was bloated and dishonest for a Rust library. The new build is intentionally narrower.

Current state

What works today:

  • CSV read and write
  • schema inference for bool, i64, f64, and string columns
  • opt-in date and timestamp inference for ISO-like CSV values
  • dictionary encoding for low-cardinality string columns
  • bitmap-backed null tracking
  • real pretty rendering for schema and preview rows
  • filter, sort, join, group-by, and materialize on the new table core
  • benchmark tooling
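
The scalar inference rule can be pictured as trying the narrowest type that every non-empty cell parses as: bool, then i64, then f64, then string. This is an illustrative stand-alone sketch of that idea, not rgwml's actual inference code; the names here are hypothetical.

```rust
// Hypothetical sketch of per-column scalar inference, narrowest type first.
// Empty cells are treated as nulls and skipped.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Scalar {
    Bool,
    I64,
    F64,
    Str,
}

fn infer_column(values: &[&str]) -> Scalar {
    let cells: Vec<&str> = values.iter().copied().filter(|v| !v.is_empty()).collect();
    if cells.iter().all(|v| matches!(*v, "true" | "false")) {
        Scalar::Bool
    } else if cells.iter().all(|v| v.parse::<i64>().is_ok()) {
        Scalar::I64
    } else if cells.iter().all(|v| v.parse::<f64>().is_ok()) {
        Scalar::F64
    } else {
        Scalar::Str
    }
}

fn main() {
    assert_eq!(infer_column(&["true", "false", ""]), Scalar::Bool);
    assert_eq!(infer_column(&["1", "2", ""]), Scalar::I64);
    assert_eq!(infer_column(&["1", "2.5"]), Scalar::F64);
    assert_eq!(infer_column(&["1", "x"]), Scalar::Str);
    println!("inference checks passed");
}
```

The ordering matters: "1" also parses as f64, so i64 must be tried first to keep integers exact.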

What is not done yet:

  • additional local file readers, which will only be added once they are real and worth supporting

Date And Timestamp Policy

If you enable date_inference, rgwml currently recognizes dates in this format:

  • YYYY-MM-DD

If you enable timestamp_inference, rgwml currently recognizes:

  • RFC3339 / ISO 8601 timestamps with a timezone or trailing Z
  • YYYY-MM-DD HH:MM:SS[.fraction]
  • YYYY-MM-DDTHH:MM:SS[.fraction]
  • YYYY-MM-DD HH:MM
  • YYYY-MM-DDTHH:MM

Policy details:

  • date-only values infer Date32, not TimestampMs
  • date-only values are accepted for explicitly typed TimestampMs columns and normalize to midnight UTC
  • anything outside those formats stays string data unless you provide an explicit schema
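
The policy above can be sketched as a small shape classifier. This is an illustrative stand-alone sketch, not rgwml's actual inference code: it checks shapes only (no calendar validation), and all names are hypothetical.

```rust
// Hypothetical sketch of the date/timestamp inference policy above.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Inferred {
    Date32,
    TimestampMs,
    Str,
}

fn all_digits(s: &str) -> bool {
    !s.is_empty() && s.bytes().all(|b| b.is_ascii_digit())
}

// YYYY-MM-DD, shape check only.
fn is_date(s: &str) -> bool {
    let p: Vec<&str> = s.split('-').collect();
    p.len() == 3
        && p[0].len() == 4
        && p[1].len() == 2
        && p[2].len() == 2
        && p.iter().all(|x| all_digits(x))
}

// HH:MM, HH:MM:SS, or HH:MM:SS.fraction
fn is_time(s: &str) -> bool {
    let (hms, frac) = match s.split_once('.') {
        Some((a, b)) => (a, Some(b)),
        None => (s, None),
    };
    let p: Vec<&str> = hms.split(':').collect();
    let shape_ok = (p.len() == 2 || p.len() == 3)
        && p.iter().all(|x| x.len() == 2 && all_digits(x));
    // a fraction is only legal after a full HH:MM:SS
    shape_ok && frac.map_or(true, |f| all_digits(f) && p.len() == 3)
}

fn infer(cell: &str) -> Inferred {
    if is_date(cell) {
        return Inferred::Date32; // date-only infers Date32, not TimestampMs
    }
    // Strip a trailing Z or a +HH:MM / -HH:MM zone offset (shape check only).
    let body = if let Some(b) = cell.strip_suffix('Z') {
        b
    } else if cell.len() > 6
        && matches!(cell.as_bytes()[cell.len() - 6], b'+' | b'-')
        && is_time(&cell[cell.len() - 5..])
    {
        &cell[..cell.len() - 6]
    } else {
        cell
    };
    match body.split_once(|c| c == 'T' || c == ' ') {
        Some((d, t)) if is_date(d) && is_time(t) => Inferred::TimestampMs,
        _ => Inferred::Str,
    }
}

fn main() {
    assert_eq!(infer("2024-01-02"), Inferred::Date32);
    assert_eq!(infer("2024-01-02 03:04:05"), Inferred::TimestampMs);
    assert_eq!(infer("2024-01-02T03:04:05.123Z"), Inferred::TimestampMs);
    assert_eq!(infer("2024-01-02T03:04"), Inferred::TimestampMs);
    assert_eq!(infer("02/01/2024"), Inferred::Str);
    println!("all inference checks passed");
}
```

Anything that fails every shape check falls through to string, which matches the "stays string data unless you provide an explicit schema" rule.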

Data model

The old repo forced everything into String. That was one of the worst design decisions in the whole crate.

The new core stores data by column:

  • fixed-width types use contiguous Vec<T>
  • strings use offsets + bytes
  • repeated strings can use dictionary encoding with integer keys
  • nulls live in bitmaps instead of Vec<Option<T>>
  • row selections can stay as views until you call materialize()

That is the minimum respectable shape if the goal is to beat a Vec<Vec<String>> model and stay competitive with pandas on memory.
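
The dictionary-encoding and bitmap-null ideas can be pictured in a few lines of plain Rust. This is a minimal model of the concept, not rgwml's actual types or API: each distinct string is stored once in a dictionary, rows hold integer keys, and validity lives in a packed u64 bitmap instead of Vec<Option<T>>.

```rust
// Illustrative model of a dictionary-encoded string column with bitmap nulls.
// Names and layout are hypothetical, not rgwml's actual internals.
use std::collections::HashMap;

struct DictColumn {
    dict: Vec<String>, // unique values, stored once, in first-seen order
    keys: Vec<u32>,    // one integer key per row
    valid: Vec<u64>,   // null bitmap: bit i set => row i is non-null
}

impl DictColumn {
    fn from_rows(rows: &[Option<&str>]) -> Self {
        let mut index: HashMap<String, u32> = HashMap::new();
        let mut col = DictColumn {
            dict: Vec::new(),
            keys: Vec::with_capacity(rows.len()),
            valid: vec![0u64; (rows.len() + 63) / 64],
        };
        for (i, row) in rows.iter().enumerate() {
            match row {
                None => col.keys.push(0), // key is ignored for null rows
                Some(s) => {
                    let key = *index.entry(s.to_string()).or_insert_with(|| {
                        col.dict.push(s.to_string());
                        (col.dict.len() - 1) as u32
                    });
                    col.keys.push(key);
                    col.valid[i / 64] |= 1u64 << (i % 64);
                }
            }
        }
        col
    }

    fn get(&self, i: usize) -> Option<&str> {
        if (self.valid[i / 64] >> (i % 64)) & 1 == 1 {
            Some(&self.dict[self.keys[i] as usize])
        } else {
            None
        }
    }
}

fn main() {
    let rows = [Some("smb"), Some("enterprise"), None, Some("smb")];
    let col = DictColumn::from_rows(&rows);
    assert_eq!(col.dict.len(), 2); // two unique strings, stored once
    assert_eq!(col.get(3), Some("smb"));
    assert_eq!(col.get(2), None);
    println!("dict entries: {}", col.dict.len());
}
```

With a low-cardinality column, the per-row cost collapses from a heap-allocated String to 4 bytes of key plus one validity bit, which is where the memory wins against a Vec<Vec<String>> model come from.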

Example: Read, Filter, Sort, Write

use rgwml::{
    read_csv, write_csv, ColumnSelector, CompareOp, CsvReadOptions, CsvWriteOptions, Literal,
    NullOrder, Predicate, SortKey, SortOrder,
};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let table = read_csv("sales.csv", &CsvReadOptions::default())?;

    let filtered = table.filter(&Predicate::Comparison {
        column: ColumnSelector::from("revenue"),
        op: CompareOp::Gt,
        value: Some(Literal::F64(100.0)),
    })?;

    let sorted = filtered
        .sort_by(&[SortKey {
            column: ColumnSelector::from("segment"),
            order: SortOrder::Ascending,
            nulls: NullOrder::Last,
        }])?
        .materialize()?;

    write_csv(&sorted, "sales.filtered.csv", &CsvWriteOptions::default())?;
    Ok(())
}

Example: Group And Aggregate

use std::sync::Arc;

use rgwml::{
    read_csv, AggregateExpr, AggregateOp, ColumnSelector, CsvReadOptions, GroupKey,
};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let table = read_csv("sales.csv", &CsvReadOptions::default())?;

    let grouped = table.group_by(
        &[GroupKey {
            column: ColumnSelector::from("segment"),
        }],
        &[
            AggregateExpr {
                input: None,
                op: AggregateOp::CountRows,
                alias: Arc::from("rows"),
            },
            AggregateExpr {
                input: Some(ColumnSelector::from("revenue")),
                op: AggregateOp::Sum,
                alias: Arc::from("revenue_sum"),
            },
            AggregateExpr {
                input: Some(ColumnSelector::from("revenue")),
                op: AggregateOp::Mean,
                alias: Arc::from("revenue_mean"),
            },
        ],
    )?;

    println!("groups: {}", grouped.nrows());
    Ok(())
}

Memory Benchmarks

There is a dedicated benchmark binary in this repo:

cargo run --bin bench_memory -- mixed 20000
cargo run --bin bench_memory -- low_card_strings 50000

Current local runs:

Case                     v2 peak delta   Row-strings peak delta   pandas peak delta
mixed 20000              5.5MB           11.6MB                   9.0MB
low_card_strings 50000   1.5MB           15.0MB                   9.8MB

The low-cardinality string case is the one I care about most right now. That is where the old String-everywhere model was especially dumb, and where dictionary encoding should earn its keep.

There is also an ops benchmark for the actual table kernels:

cargo run --bin bench_ops -- 200000

That benchmark times:

  • filter plus materialize
  • group-by with count, sum, and mean
  • inner join on typed key columns

If pandas is installed, bench_ops will compare against a small pandas script too.

Build And Verify

cargo test
cargo clippy --all-targets -- -D warnings
cargo run --bin bench_memory -- mixed 20000
cargo run --bin bench_ops -- 200000

Repo Status

This repo is the 2.0.0 break from the old over-scoped crate.

So the honest way to read this project is:

  • 1.x history: over-scoped legacy rgwml
  • 2.0.0: typed local-first rebuild with columnar storage, real joins, and no Python side effects
  • current work: tighten the kernels and keep the benchmark story honest

If you were using the old Python-heavy “do everything” version, this README is no longer describing that crate. That is deliberate.