data_table 0.22.0

A lightweight Rust data table structure supporting numeric columns and categorical factors.

DataTable Rust Library

Overview

The DataTable crate provides a flexible framework for reading, processing, and managing survival or biomedical data tables that contain both numeric and categorical (factor) variables.
It is particularly suited for bioinformatics, epidemiology, and statistical modeling pipelines where tabular datasets must be cleaned, imputed, split, and exported with precise factor-level tracking.

The library handles automatic detection and management of categorical variables through factor definitions, supports one-hot encoding, and includes utilities for data filtering, imputation, and persistence to CSV and JSON formats.
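
To make the one-hot idea concrete, here is a minimal, dependency-free sketch of how levels could be inferred from a column and expanded into indicator columns. The function names `infer_levels` and `one_hot` are hypothetical and are not part of the crate's API; sorting the levels alphabetically and encoding unknown values as an all-zero row are assumptions made for this illustration only.

```rust
use std::collections::BTreeSet;

/// Collect the distinct values of a categorical column, sorted alphabetically.
/// (Hypothetical helper; the crate may order levels differently.)
fn infer_levels(values: &[&str]) -> Vec<String> {
    let set: BTreeSet<&str> = values.iter().copied().collect();
    set.into_iter().map(String::from).collect()
}

/// Expand a categorical column into one 0/1 indicator per level.
/// A value that matches no level yields an all-zero row (assumption).
fn one_hot(values: &[&str], levels: &[String]) -> Vec<Vec<f64>> {
    values
        .iter()
        .map(|v| {
            levels
                .iter()
                .map(|l| if l.as_str() == *v { 1.0 } else { 0.0 })
                .collect()
        })
        .collect()
}
```

Calling `one_hot(&column, &infer_levels(&column))` returns one row per input value and one column per level, so a three-level factor such as "treatment" becomes three numeric columns.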


Features

  • CSV Input Parsing

    • Reads numeric and categorical columns from delimited files.
    • Loads factor definitions from a provided JSON file, or generates them automatically if the file is missing.
    • One-hot expansion for categorical variables with flexible level mapping.
  • Factor Management

    • Factor objects track categorical levels, numeric encodings, and one-hot states.
    • Load and save factors to JSON for reproducibility.
  • Data Cleaning

    • Remove rows or columns with missing values (NaN).
    • Impute missing values via K-Nearest Neighbours (impute_knn).
    • Filter low-variance columns or features with excessive missingness.
  • Data Export

    • Write numeric and factor-expanded data to CSV.
    • Save and reload factor definitions to/from JSON.
  • Utilities

    • Split datasets into training and test sets with random shuffling.
    • Access individual columns as Vec<f64>, Vec<u8>, or categorical strings.
    • Print concise dataset summaries for debugging and inspection.
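
The K-Nearest Neighbours imputation listed above can be illustrated with a small standalone sketch. This is not the crate's `impute_knn` implementation. It assumes a rectangular row-major table where NaN marks a missing cell, measures distance as the mean squared difference over columns observed in both rows, and fills each gap with the mean of the `k` nearest donors. The names `distance` and `impute_knn_sketch` are hypothetical.

```rust
/// Mean squared difference over the columns observed in both rows.
/// Returns None when the two rows share no observed column.
fn distance(a: &[f64], b: &[f64]) -> Option<f64> {
    let (mut sum, mut n) = (0.0, 0usize);
    for (x, y) in a.iter().zip(b) {
        if !x.is_nan() && !y.is_nan() {
            sum += (x - y).powi(2);
            n += 1;
        }
    }
    if n == 0 { None } else { Some(sum / n as f64) }
}

/// Fill each NaN with the mean of that column over the k nearest rows
/// that have the column observed. Cells without any donor stay NaN.
fn impute_knn_sketch(rows: &[Vec<f64>], k: usize) -> Vec<Vec<f64>> {
    let mut out = rows.to_vec();
    for (i, row) in rows.iter().enumerate() {
        for j in 0..row.len() {
            if !row[j].is_nan() {
                continue;
            }
            // Donors: every other row with column j observed, paired with its distance.
            let mut donors: Vec<(f64, f64)> = rows
                .iter()
                .enumerate()
                .filter(|(d, other)| *d != i && !other[j].is_nan())
                .filter_map(|(_, other)| distance(row, other).map(|dist| (dist, other[j])))
                .collect();
            donors.sort_by(|a, b| a.0.total_cmp(&b.0));
            donors.truncate(k);
            if !donors.is_empty() {
                let sum: f64 = donors.iter().map(|d| d.1).sum();
                out[i][j] = sum / donors.len() as f64;
            }
        }
    }
    out
}
```

Distances are always computed against the original table rather than the partially filled one, so the result does not depend on the order in which cells are visited.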

Example Usage

use std::collections::HashSet;
use data_table::DataTable;

fn main() -> anyhow::Result<()> {
    let file_path = "data/example.csv";
    let factors_path = "data/factors.json";
    let categorical: HashSet<String> = ["status".into(), "sex".into()].into();

    // Load data, automatically building factors if needed
    let data = DataTable::from_file(file_path, b',', categorical, factors_path)?;

    // Clean and impute data
    let usable_features = data.filter_features_by_na(0.1);
    let mut filtered_data = data.clone();
    filtered_data.filter_all_na_rows(&usable_features);
    filtered_data.impute_knn(3, 5, true);

    // Split for training/testing
    let (train, test) = filtered_data.train_test_split(0.7);

    // Save cleaned data
    train.to_file("data/train.csv", b',')?;
    test.to_file("data/test.csv", b',')?;
    Ok(())
}
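
The split step in the example above can be pictured as a shuffle of row indices followed by a cut. The sketch below is a standalone illustration under stated assumptions, not the crate's `train_test_split`: it treats the argument as the training fraction, takes an explicit seed, and uses a tiny xorshift generator only to avoid external dependencies. `XorShift64` and `split_indices` are hypothetical names.

```rust
/// Tiny xorshift64 generator, used only to keep this sketch dependency-free.
struct XorShift64(u64);

impl XorShift64 {
    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Shuffle the row indices (Fisher-Yates) and cut them at `train_fraction`.
/// Returns (train indices, test indices).
fn split_indices(n_rows: usize, train_fraction: f64, seed: u64) -> (Vec<usize>, Vec<usize>) {
    let mut rng = XorShift64(seed.max(1)); // xorshift state must be non-zero
    let mut idx: Vec<usize> = (0..n_rows).collect();
    for i in (1..n_rows).rev() {
        // The modulo introduces a slight bias, which is acceptable for a sketch.
        let j = (rng.next_u64() % (i as u64 + 1)) as usize;
        idx.swap(i, j);
    }
    let n_train = ((n_rows as f64) * train_fraction).round() as usize;
    let test = idx.split_off(n_train.min(n_rows));
    (idx, test)
}
```

With 10 rows and a fraction of 0.7, this yields 7 training indices and 3 test indices, and the same seed always reproduces the same partition.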

Example Factor Definition (JSON)

[
  {
    "column": "status",
    "levels": ["alive", "dead"],
    "numeric": [0.0, 1.0],
    "matching": null,
    "one_hot": false
  },
  {
    "column": "treatment",
    "levels": ["control", "drugA", "drugB"],
    "numeric": [0.0, 1.0, 2.0],
    "matching": null,
    "one_hot": true
  }
]
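
One way to read the JSON above is that `levels[i]` and `numeric[i]` form a pair. The sketch below mirrors a single entry in memory and shows encoding and decoding under that assumption. `FactorSketch` is a hypothetical type, not the crate's Factor struct, and the "matching" field is left out because its meaning is not described here.

```rust
/// Hypothetical in-memory mirror of one entry in the JSON example.
#[derive(Debug, Clone, PartialEq)]
struct FactorSketch {
    column: String,
    levels: Vec<String>,
    numeric: Vec<f64>,
    one_hot: bool,
}

impl FactorSketch {
    /// Numeric code for a level, pairing `levels[i]` with `numeric[i]`.
    fn encode(&self, level: &str) -> Option<f64> {
        let pos = self.levels.iter().position(|l| l == level)?;
        self.numeric.get(pos).copied()
    }

    /// Level label for a numeric code (exact floating-point match).
    fn decode(&self, code: f64) -> Option<&str> {
        let pos = self.numeric.iter().position(|n| *n == code)?;
        self.levels.get(pos).map(String::as_str)
    }
}

/// The "status" factor from the JSON example, built by hand.
fn status_factor() -> FactorSketch {
    FactorSketch {
        column: "status".to_string(),
        levels: vec!["alive".to_string(), "dead".to_string()],
        numeric: vec![0.0, 1.0],
        one_hot: false,
    }
}
```

Keeping the level-to-code mapping in a saved definition is what allows the same encoding to be reused across training and test files.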

Building Windows and Linux Binaries on Linux

On Ubuntu, first add the two Rust targets:

rustup target add x86_64-unknown-linux-musl
rustup target add x86_64-pc-windows-gnu

Then install the linkers these targets need:

sudo apt update
sudo apt install musl-tools mingw-w64

After that, you can compile for both systems with:

cargo build --release --target x86_64-unknown-linux-musl
cargo build --release --target x86_64-pc-windows-gnu

If you need to explain this crate quickly in a chat session, say:

“This Rust library (DataTable) helps parse and preprocess survival or biomedical datasets. It automatically handles categorical factors (with optional one-hot encoding), performs data cleaning and imputation, and supports consistent factor definitions via JSON. It’s ideal for preparing structured data for downstream modeling or statistical analysis.”


License

MIT License © 2025