DataTable Rust Library
Overview
The DataTable crate provides a flexible framework for reading, processing, and managing survival or biomedical data tables that contain both numeric and categorical (factor) variables.
It is particularly suited for bioinformatics, epidemiology, and statistical modeling pipelines where tabular datasets must be cleaned, imputed, split, and exported with precise factor-level tracking.
The library handles automatic detection and management of categorical variables through factor definitions, supports one-hot encoding, and includes utilities for data filtering, imputation, and persistence to CSV and JSON formats.
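To make the factor idea concrete, the following is a minimal sketch of how a categorical column could be mapped to numeric codes and one-hot columns. It uses only the standard library. The FactorSketch type and its methods are hypothetical names invented for illustration and are not the crate's actual Factor API; treating empty strings as missing is likewise an assumption.

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for a factor definition: the sorted list of levels
/// seen in one categorical column. NOT the crate's actual `Factor` type.
#[derive(Debug, Clone, PartialEq)]
struct FactorSketch {
    name: String,
    levels: Vec<String>,
}

impl FactorSketch {
    /// Infer the levels from raw string values.
    /// Assumption: an empty string marks a missing value.
    fn infer(name: &str, values: &[&str]) -> Self {
        let levels: BTreeSet<&str> =
            values.iter().copied().filter(|v| !v.is_empty()).collect();
        FactorSketch {
            name: name.to_string(),
            levels: levels.into_iter().map(String::from).collect(),
        }
    }

    /// Numeric encoding: index of the level, or NaN for missing/unknown values.
    fn encode(&self, value: &str) -> f64 {
        match self.levels.iter().position(|l| l.as_str() == value) {
            Some(i) => i as f64,
            None => f64::NAN,
        }
    }

    /// One-hot encoding: one 0/1 entry per level, all zeros for missing/unknown.
    fn one_hot(&self, value: &str) -> Vec<u8> {
        self.levels
            .iter()
            .map(|l| u8::from(l.as_str() == value))
            .collect()
    }

    /// Column names for the expanded one-hot columns, e.g. "sex_female".
    fn one_hot_names(&self) -> Vec<String> {
        self.levels
            .iter()
            .map(|l| format!("{}_{}", self.name, l))
            .collect()
    }
}

fn main() {
    let raw = ["male", "female", "", "female"];
    let sex = FactorSketch::infer("sex", &raw);
    println!("{:?}", sex.one_hot_names());
    for v in raw {
        println!("{:?} -> code {} / one-hot {:?}", v, sex.encode(v), sex.one_hot(v));
    }
}
```

Sorting the levels makes the encoding deterministic across runs, which is the property that saving factor definitions to JSON is meant to preserve.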
Features
- CSV Input Parsing
  - Reads numeric and categorical columns from delimited files.
  - Automatically infers factor definitions from provided JSON or generates them if missing.
  - One-hot expansion for categorical variables with flexible level mapping.
- Factor Management
  - Factor objects track categorical levels, numeric encodings, and one-hot states.
  - Load and save factors to JSON for reproducibility.
- Data Cleaning
  - Remove rows or columns with missing values (NaN).
  - Impute missing values via K-Nearest Neighbours (impute_knn).
  - Filter low-variance columns or features with excessive missingness.
- Data Export
  - Write numeric and factor-expanded data to CSV.
  - Save and reload factor definitions to/from JSON.
- Utilities
  - Split datasets into training and test sets with random shuffling.
  - Access individual columns as Vec<f64>, Vec<u8>, or categorical strings.
  - Print concise dataset summaries for debugging and inspection.
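As an illustration of the K-Nearest Neighbours imputation mentioned under Data Cleaning, here is a small standard-library sketch of one common variant: each missing cell is filled with the mean of that column over the k most similar rows. The function names impute_knn_sketch and partial_distance are hypothetical, and the distance measure (mean squared difference over columns observed in both rows) is an assumption; the crate's own impute_knn may work differently.

```rust
/// Mean squared difference over the columns observed (non-NaN) in both rows.
/// Returns None if the two rows share no observed column.
fn partial_distance(a: &[f64], b: &[f64]) -> Option<f64> {
    let mut sum = 0.0;
    let mut n = 0usize;
    for (x, y) in a.iter().zip(b) {
        if !x.is_nan() && !y.is_nan() {
            sum += (x - y).powi(2);
            n += 1;
        }
    }
    if n == 0 { None } else { Some(sum / n as f64) }
}

/// Fill each NaN with the mean of that column over the k nearest rows that
/// have the column observed. Cells without any usable neighbour stay NaN.
/// Assumption: all rows have the same length (rectangular table).
fn impute_knn_sketch(rows: &[Vec<f64>], k: usize) -> Vec<Vec<f64>> {
    let mut out = rows.to_vec();
    for (i, row) in rows.iter().enumerate() {
        for j in 0..row.len() {
            if !row[j].is_nan() {
                continue;
            }
            // Candidates: every other row with column j observed, paired
            // with its distance to the current row.
            let mut cand: Vec<(f64, f64)> = rows
                .iter()
                .enumerate()
                .filter(|(o, other)| *o != i && !other[j].is_nan())
                .filter_map(|(_, other)| {
                    partial_distance(row, other).map(|d| (d, other[j]))
                })
                .collect();
            cand.sort_by(|a, b| a.0.total_cmp(&b.0));
            let nearest = &cand[..k.min(cand.len())];
            if !nearest.is_empty() {
                let total: f64 = nearest.iter().map(|c| c.1).sum();
                out[i][j] = total / nearest.len() as f64;
            }
        }
    }
    out
}

fn main() {
    let rows = vec![
        vec![1.0, 2.0],
        vec![1.1, f64::NAN],
        vec![5.0, 10.0],
        vec![0.9, 4.0],
    ];
    let filled = impute_knn_sketch(&rows, 2);
    println!("{:?}", filled);
}
```

Distances are always computed against the original table rather than the partially imputed one, so the result does not depend on the order in which rows are visited.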
Example Usage
use std::collections::HashSet;
use DataTable;
Example Factor Definition (JSON)
Create Windows and Linux binaries on Linux:
On Ubuntu, first add the two Rust targets:
rustup target add x86_64-unknown-linux-musl
rustup target add x86_64-pc-windows-gnu
Then install the required linkers with:
sudo apt update
sudo apt install musl-tools mingw-w64
You can then compile for both systems using:
cargo build --release --target x86_64-unknown-linux-musl
cargo build --release --target x86_64-pc-windows-gnu
If you need to explain this crate quickly in a chat session, say:
“This Rust library (DataTable) helps parse and preprocess survival or biomedical datasets. It automatically handles categorical factors (with optional one-hot encoding), performs data cleaning and imputation, and supports consistent factor definitions via JSON. It’s ideal for preparing structured data for downstream modeling or statistical analysis.”
License
MIT License © 2025