rgwml
I made the first version of rgwml far too ambitious. It tried to be a Rust-flavoured Pandas clone, a Python bridge layer, an ML wrapper, a database client, and an AI glue crate at the same time. It got downloads, but the shape was wrong.
This repo is the correction.
rgwml is now a local-first tabular data library built around a typed, columnar in-memory model. No hidden Python. No network glue. No home-directory writes. No install-time side effects. Just tables, local files, and a memory model that has a real chance of staying lean.
What rgwml is now
- A typed table core with columnar storage
- A local CSV reader and writer
- A small query surface for `filter`, `sort_by`, `join`, `group_by`, and `materialize`
- A memory benchmark harness for comparing the new engine against a row-of-strings baseline and pandas
What I removed on purpose
- OpenAI helpers
- XGBoost wrappers
- clustering helpers
- database clients
- Google Sheets and public URL loaders
- shelling out to Python scripts
- hidden filesystem writes during library init
That old scope was bloated and dishonest for a Rust library. The new build is intentionally narrower.
Current state
What works today:
- CSV read and write
- schema inference for `bool`, `i64`, `f64`, and string columns
- opt-in date and timestamp inference for ISO-like CSV values
- dictionary encoding for low-cardinality string columns
- bitmap-backed null tracking
- real pretty rendering for schema and preview rows
- filter, sort, join, group-by, and materialize on the new table core
- benchmark tooling
What is not done yet:
- more local file readers once they are real and worth supporting
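The schema inference listed above can be pictured as "pick the narrowest type that every non-empty cell parses as". The sketch below is an illustration under that assumption only: `InferredType` and `infer_column_type` are hypothetical names, not rgwml's API, and the real rules for nulls and boolean spellings may differ.

```rust
/// Column types this sketch can infer (illustrative names, not rgwml's API).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum InferredType {
    Bool,
    I64,
    F64,
    Str,
}

/// Infer the narrowest type that every non-empty cell in a column parses as.
/// Assumption: empty cells count as nulls and do not influence the result.
pub fn infer_column_type(cells: &[&str]) -> InferredType {
    let mut all_bool = true;
    let mut all_i64 = true;
    let mut all_f64 = true;
    let mut seen_value = false;

    for cell in cells.iter().map(|c| c.trim()).filter(|c| !c.is_empty()) {
        seen_value = true;
        all_bool &= matches!(cell, "true" | "false");
        all_i64 &= cell.parse::<i64>().is_ok();
        all_f64 &= cell.parse::<f64>().is_ok();
    }

    if !seen_value {
        InferredType::Str // nothing to infer from, so fall back to strings
    } else if all_bool {
        InferredType::Bool
    } else if all_i64 {
        InferredType::I64
    } else if all_f64 {
        InferredType::F64
    } else {
        InferredType::Str
    }
}
```

With this sketch, a column of `"1"` and `"2.5"` widens to `F64`, and a single non-numeric cell demotes the whole column to `Str`. One caveat of leaning on the standard parser: Rust's `f64` parsing also accepts spellings such as `inf` and `NaN`, so a stricter reader would screen those out first.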
Date And Timestamp Policy
If you enable date_inference, rgwml currently recognizes dates in this format:
YYYY-MM-DD
If you enable timestamp_inference, rgwml currently recognizes:
- RFC3339 / ISO 8601 timestamps with a timezone or trailing `Z`
- `YYYY-MM-DD HH:MM:SS[.fraction]`
- `YYYY-MM-DDTHH:MM:SS[.fraction]`
- `YYYY-MM-DD HH:MM`
- `YYYY-MM-DDTHH:MM`
Policy details:
- date-only values infer `Date32`, not `TimestampMs`
- date-only values are accepted for explicitly typed `TimestampMs` columns and normalize to midnight UTC
- anything outside those formats stays string data unless you provide an explicit schema
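To make the date-only rules concrete, here is a minimal sketch of how a strict `YYYY-MM-DD` value could map to a `Date32`-style day count and, for an explicitly typed timestamp column, to midnight UTC in milliseconds. It assumes `Date32` means days since 1970-01-01 and `TimestampMs` means milliseconds since the same epoch. The function names are hypothetical and this is not the crate's parser.

```rust
/// Days since 1970-01-01 for a proleptic Gregorian date
/// (the widely used "days from civil" arithmetic).
fn days_from_civil(year: i64, month: i64, day: i64) -> i64 {
    let y = if month <= 2 { year - 1 } else { year };
    let era = y.div_euclid(400);
    let yoe = y.rem_euclid(400); // year of era, 0..=399
    let mp = (month + 9) % 12; // month index with March = 0
    let doy = (153 * mp + 2) / 5 + day - 1; // day of year, 0..=365
    let doe = yoe * 365 + yoe / 4 - yoe / 100 + doy; // day of era
    era * 146_097 + doe - 719_468
}

fn days_in_month(year: i64, month: i64) -> i64 {
    match month {
        1 | 3 | 5 | 7 | 8 | 10 | 12 => 31,
        4 | 6 | 9 | 11 => 30,
        2 if (year % 4 == 0 && year % 100 != 0) || year % 400 == 0 => 29,
        2 => 28,
        _ => 0,
    }
}

/// Parse a strict `YYYY-MM-DD` string into days since the Unix epoch.
/// Anything else returns `None`, so the caller can keep it as string data.
pub fn parse_date32(text: &str) -> Option<i32> {
    let bytes = text.as_bytes();
    if bytes.len() != 10 || bytes[4] != b'-' || bytes[7] != b'-' {
        return None;
    }
    // Every position other than the two dashes must be an ASCII digit.
    let digits_ok = bytes
        .iter()
        .enumerate()
        .all(|(i, b)| i == 4 || i == 7 || b.is_ascii_digit());
    if !digits_ok {
        return None;
    }
    let year: i64 = text[0..4].parse().ok()?;
    let month: i64 = text[5..7].parse().ok()?;
    let day: i64 = text[8..10].parse().ok()?;
    if !(1..=12).contains(&month) || day < 1 || day > days_in_month(year, month) {
        return None;
    }
    i32::try_from(days_from_civil(year, month, day)).ok()
}

const MS_PER_DAY: i64 = 86_400_000;

/// Date-only value landing in a timestamp column: normalize to midnight UTC.
pub fn date_to_timestamp_ms(text: &str) -> Option<i64> {
    parse_date32(text).map(|days| i64::from(days) * MS_PER_DAY)
}
```

In this sketch `parse_date32("1970-01-01")` yields `Some(0)`, while an impossible date such as `2023-02-29` or a slash-separated value yields `None` and would stay a string under the policy above.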
Data model
The old repo forced everything into String. That was one of the worst design decisions in the whole crate.
The new core stores data by column:
- fixed-width types use contiguous `Vec<T>`
- strings use `offsets + bytes`
- repeated strings can use dictionary encoding with integer keys
- nulls live in bitmaps instead of `Vec<Option<T>>`
- row selections can stay as views until you call `materialize()`
That is the minimum respectable shape if the goal is to beat a Vec<Vec<String>> model and stay competitive with pandas on memory.
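As a rough picture of that layout, the sketch below builds a string column from one byte buffer plus offsets, tracks nulls in a bitmap, and treats a selection as a list of row indices until it is materialized. All type and function names here are illustrative assumptions; the crate's real types, offset width, and bit conventions may differ.

```rust
/// Validity bitmap: one bit per row, 1 = present, 0 = null.
#[derive(Default)]
pub struct Bitmap {
    bits: Vec<u8>,
    len: usize,
}

impl Bitmap {
    pub fn push(&mut self, valid: bool) {
        if self.len % 8 == 0 {
            self.bits.push(0); // start a new byte every 8 rows
        }
        if valid {
            self.bits[self.len / 8] |= 1u8 << (self.len % 8);
        }
        self.len += 1;
    }

    pub fn is_valid(&self, row: usize) -> bool {
        row < self.len && (self.bits[row / 8] & (1u8 << (row % 8))) != 0
    }
}

/// Variable-length strings stored as one byte buffer plus offsets.
/// Row `i` occupies `bytes[offsets[i]..offsets[i + 1]]`.
pub struct StringColumn {
    offsets: Vec<u32>,
    bytes: Vec<u8>,
    validity: Bitmap,
}

impl StringColumn {
    pub fn new() -> Self {
        Self { offsets: vec![0], bytes: Vec::new(), validity: Bitmap::default() }
    }

    pub fn push(&mut self, value: Option<&str>) {
        if let Some(text) = value {
            self.bytes.extend_from_slice(text.as_bytes());
        }
        // Sketch only: a real column would guard against exceeding u32::MAX bytes.
        self.offsets.push(self.bytes.len() as u32);
        self.validity.push(value.is_some());
    }

    pub fn len(&self) -> usize {
        self.offsets.len() - 1
    }

    pub fn get(&self, row: usize) -> Option<&str> {
        if !self.validity.is_valid(row) {
            return None;
        }
        let (start, end) = (self.offsets[row] as usize, self.offsets[row + 1] as usize);
        std::str::from_utf8(&self.bytes[start..end]).ok()
    }
}

/// A selection is just row indices: a view, with no string data copied.
pub fn select(col: &StringColumn, keep: impl Fn(Option<&str>) -> bool) -> Vec<usize> {
    (0..col.len()).filter(|&row| keep(col.get(row))).collect()
}

/// Copy the selected rows into a new, compact column.
pub fn materialize(col: &StringColumn, rows: &[usize]) -> StringColumn {
    let mut out = StringColumn::new();
    for &row in rows {
        out.push(col.get(row));
    }
    out
}
```

The point of the shape is that per-row overhead is one offset plus one bit, instead of a separate heap allocation per `String`, and that a filter only produces indices until someone actually asks for a copy.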
Example: Read, Filter, Sort, Write
use ;
Example: Group And Aggregate
use Arc;
use ;
Memory Benchmarks
There is a dedicated benchmark binary in this repo:
Current local runs:
| Case | v2 peak delta | Row strings peak delta | Pandas peak delta |
|---|---|---|---|
| mixed 20000 | 5.5MB | 11.6MB | 9.0MB |
| low_card_strings 50000 | 1.5MB | 15.0MB | 9.8MB |
The low-cardinality string case is the one I care about most right now. That is where the old String-everywhere model was especially dumb, and where dictionary encoding should earn its keep.
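For readers who have not met dictionary encoding before, here is a minimal sketch of the idea: store each distinct string once and keep one small integer key per row. `DictColumn`, `encode`, and `payload_bytes` are hypothetical names for illustration, the key width is an assumption, and the byte count is a back-of-envelope estimate, not the measurement used in the table above.

```rust
use std::collections::HashMap;

/// Dictionary-encoded string column: distinct values stored once,
/// plus one integer key per row (illustrative layout, not rgwml's).
pub struct DictColumn {
    values: Vec<String>,
    keys: Vec<u32>,
}

/// Build a dictionary column from raw rows, assigning keys in first-seen order.
pub fn encode(rows: &[&str]) -> DictColumn {
    let mut index: HashMap<&str, u32> = HashMap::new();
    let mut values: Vec<String> = Vec::new();
    let mut keys = Vec::with_capacity(rows.len());
    for &row in rows {
        let key = *index.entry(row).or_insert_with(|| {
            values.push(row.to_string());
            (values.len() - 1) as u32
        });
        keys.push(key);
    }
    DictColumn { values, keys }
}

impl DictColumn {
    /// Look up the string for a row through its key.
    pub fn get(&self, row: usize) -> &str {
        &self.values[self.keys[row] as usize]
    }

    pub fn distinct(&self) -> usize {
        self.values.len()
    }

    /// Rough payload size: dictionary bytes plus 4 bytes per row key.
    /// Ignores allocator and container overhead.
    pub fn payload_bytes(&self) -> usize {
        let dict: usize = self.values.iter().map(|v| v.len()).sum();
        dict + self.keys.len() * std::mem::size_of::<u32>()
    }
}
```

The trade-off is visible in `payload_bytes`: the dictionary costs a fixed 4 bytes per row in this sketch, so it pays off only when values repeat often and are longer than the key. For a handful of short, mostly unique strings it can be larger than the raw bytes.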
There is also an ops benchmark for the actual table kernels:
That benchmark times:
- filter plus materialize
- group-by with count, sum, and mean
- inner join on typed key columns
If pandas is installed, bench_ops will compare against a small pandas script too.
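To give a sense of what the group-by measurement covers, the sketch below groups an `i64` key column against an `f64` value column, accumulates count and sum per key, derives the mean, and wraps the call in a simple wall-clock timer. It is a standard-library-only illustration under assumed names (`Agg`, `group_by_agg`, `time_it`); it is not the crate's kernel and not the code inside `bench_ops`.

```rust
use std::collections::BTreeMap;
use std::time::{Duration, Instant};

/// Running aggregate for one group.
#[derive(Debug, Default, Clone, PartialEq)]
pub struct Agg {
    pub count: u64,
    pub sum: f64,
}

impl Agg {
    /// Mean is derived from count and sum; empty groups have no mean.
    pub fn mean(&self) -> Option<f64> {
        if self.count == 0 {
            None
        } else {
            Some(self.sum / self.count as f64)
        }
    }
}

/// Group typed key and value columns in one pass.
/// A BTreeMap keeps the output ordered by key for readable results.
pub fn group_by_agg(keys: &[i64], values: &[f64]) -> BTreeMap<i64, Agg> {
    assert_eq!(keys.len(), values.len(), "columns must have equal length");
    let mut groups: BTreeMap<i64, Agg> = BTreeMap::new();
    for (&key, &value) in keys.iter().zip(values) {
        let agg = groups.entry(key).or_default();
        agg.count += 1;
        agg.sum += value;
    }
    groups
}

/// Run a closure once and report how long it took.
pub fn time_it<T>(work: impl FnOnce() -> T) -> (T, Duration) {
    let start = Instant::now();
    let out = work();
    (out, start.elapsed())
}
```

A call such as `time_it(|| group_by_agg(&keys, &values))` returns both the grouped result and the elapsed time. A single run like this is noisy, so a serious comparison would repeat the measurement and report a median or minimum.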
Build And Verify
Repo Status
This repo is the 2.0.0 break from the old over-scoped crate.
So the honest way to read this project is:
- `1.x` history: over-scoped legacy `rgwml`
- `2.0.0`: typed local-first rebuild with columnar storage, real joins, and no Python side effects
- current work: tighten the kernels and keep the benchmark story honest
If you were using the old Python-heavy “do everything” version, this README is no longer describing that crate. That is deliberate.