rust-data-processing 0.1.1

Schema-first ingestion (CSV, JSON, Parquet, Excel) into an in-memory DataSet, plus Polars-backed pipelines, SQL, profiling, validation, and map/reduce-style processing.
docs.rs failed to build rust-data-processing-0.1.1
Please check the build logs for more information.
See Builds for ideas on how to fix a failed build, or Metadata for how to configure docs.rs builds.
If you believe this is docs.rs' fault, open an issue.

rust-data-processing

Rust crate and Python package for ingesting common file formats (CSV / JSON / Parquet / Excel (feature-gated)) into an in-memory DataSet, with schema validation + inference helpers, in-memory + parallel processing primitives, and a Polars-backed DataFrame-centric pipeline API. Phase 1 also targets profiling, validation, and outlier-detection APIs and report formats, while keeping a small engine-agnostic public surface (SQL support is Polars-backed). Use whichever binding fits your stack—the capabilities are the same under the hood.

Documentation (read the APIs online)

Link
Combined Rust + Python (main branch, HTML) GitHub Pages — rust-data-processingenable Pages → GitHub Actions in repo Settings if the site is not live yet; see Planning/DOCUMENTATION.md.
Rust crate on crates.io docs.rs — rust-data-processing (populates after the first successful publish)
Markdown API guides API.md (Rust); Python: python-wrapper/API.md
Rust examples (this repo) docs/rust/README.mdCargo.toml, ingestion, DataFrame/SQL, cookbook, execution, benchmarks
Python examples (this repo) docs/python/README.md — same topics via rust_data_processing

Quick start (Python)

import rust_data_processing as rdp

schema = [
    {"name": "id", "data_type": "int64"},
    {"name": "name", "data_type": "utf8"},
]
ds = rdp.ingest_from_path("tests/fixtures/people.csv", schema, {"format": "csv"})
print("rows", ds.row_count())

report = rdp.profile_dataset(ds, {"head_rows": 50, "quantiles": [0.5]})
print("profile rows sampled", report["row_count"])

validation = rdp.validate_dataset(
    ds,
    {"checks": [{"kind": "not_null", "column": "id", "severity": "error"}]},
)
print("checks", validation["summary"]["total_checks"])

From a checkout (Rust + uv required for the editable build):

cd python-wrapper
uv sync --group dev
uv run maturin develop --release

From PyPI (after you publish a release — see Planning/RELEASE_CHECKLIST.md):

pip install rust-data-processing

Use the same import rust_data_processing as rdp pattern; point ingest_from_path at your own CSV, JSON, or Parquet files and schema.

Rust: docs/rust/README.md has copy-paste examples for Cargo.toml, ingestion, Polars-backed pipelines, SQL, transforms, profiling, validation, execution, and benchmarks. Python (expanded): docs/python/README.md. The conceptual Rust API overview is in API.md.

Generate the same HTML as CI locally: ./scripts/build_docs.ps1 (Rust only) or ./scripts/build_docs.ps1 -All (Rust + Python → _site/python/). Maintainer notes: Planning/DOCUMENTATION.md.

Reporting bugs

  • Open a GitHub Issue and use Bug Report or Feature Request so we get version, OS, and repro steps.

  • Security: do not file publicly — read SECURITY.md.

  • How we triage and prioritize: Planning/ISSUE_TRIAGE.md.

  • Status: library APIs are in src/lib.rs; the binary (src/main.rs) is currently just a placeholder.

  • Developer guide: see README_DEV.md (module map, workflows, conventions)

  • Benchmark snapshot (pipeline bench): on this repo (Windows, Criterion), filter → map → reduce(sum):

    • In-memory processing:
      • 100k rows: ~10.42 ms median ((\approx 9.6) million rows/sec)
      • 1M rows: ~113.5 ms median ((\approx 8.8) million rows/sec)
    • Polars-backed pipeline::DataFrame (lazy):
      • 100k rows: ~7.80 ms median ((\approx 12.8) million rows/sec)
      • 1M rows: ~74.10 ms median ((\approx 13.5) million rows/sec)
    • Reproduce: cargo bench --bench pipelines -- --warm-up-time 2 --measurement-time 6 --sample-size 50

Phase 1 scope (roadmap)

Canonical checklist lives in Planning/PHASE1_PLAN.md; this section is the README-friendly summary.

  • Polars-first delegation for ingestion + DataFrame-centric pipelines
  • Polars-backed SQL support (default-on)
  • Engine-agnostic configuration (IngestionOptionsBuilder) + unified error model (IngestionError)
  • Benchmarks + parity checks (ingestion + pipelines + end-to-end)
  • Cookbook examples (Polars-first docs + SQL examples)
  • “Pit of success” defaults (sane knobs; avoid promising engine-specific tuning we can’t support)
  • Feature flags + dependency surface minimization
  • Transformation wrappers + end-user transformation schema/spec (“to/from”) on top of Polars + existing in-memory layers
  • Feature-gated direct DB ingestion via ConnectorX (DB → Arrow → DataSet) + compatibility research notes
  • CDC feasibility spike + interface boundary (Phase 2 candidate)
  • Profiling APIs: metrics set + sampling/large-data modes
  • Validation APIs: DSL + built-in checks + severity handling + reporting
  • Outlier detection: primitives + explainable outputs

Platform support

  • Supported OSes: Windows, Linux, and macOS.

  • Works out of the box: the library is written in portable Rust (no OS-specific runtime assumptions).

  • Build prerequisites:

    • macOS: install Xcode Command Line Tools (xcode-select --install) for the system linker/C toolchain.
    • Linux: install a basic build toolchain (e.g. GCC/Clang via your distro’s build-essential equivalent).
    • Windows: see Development on Windows (toolchain + linker).

    Parquet support pulls in native compression dependencies (e.g. zstd-sys); Cargo will build them automatically once a C toolchain is available.

  • Benchmarks:

    • cargo bench --bench pipelines is cross-platform.
    • benchmarks.ps1 is a Windows/PowerShell convenience wrapper; on Linux/macOS you can run it via pwsh or just run cargo bench directly.
    • scripts/run_benchmarks.ps1 runs all Criterion benchmarks (pipelines + ingestion + map/reduce + profiling + validation + outliers).

Python bindings

Bindings live under python-wrapper/ (PyO3 + maturin + uv). User-facing docs: python-wrapper/README.md, python-wrapper/API.md, python-wrapper/README_DEV.md. The native module calls this crate; Polars stays on the Rust side.

Rust examples (ingestion, DataFrame/SQL, transforms, profiling, execution, benchmarks): docs/rust/README.md.

What data can be consumed? (Epic 1 / Stories 1.1–1.2)

File formats (auto-detected by extension)

  • CSV: .csv (must include headers)
  • JSON: .json (array-of-objects) and .ndjson (newline-delimited objects)
    • Nested fields are supported via dot paths in schema field names (e.g. user.name)
  • Parquet: .parquet, .pq
  • Excel/workbooks: .xlsx, .xls, .xlsm, .xlsb, .ods (requires feature excel)

Supported value types

You define a Schema using these logical types:

  • DataType::Int64
  • DataType::Float64
  • DataType::Bool
  • DataType::Utf8

Ingestion yields a DataSet whose cells are Value::{Int64, Float64, Bool, Utf8, Null}.

  • Null handling:
    • CSV: empty/whitespace-only cells become Value::Null
    • JSON: explicit null becomes Value::Null
    • Excel: empty cells become Value::Null
    • Parquet: nulls become Value::Null

Processing pipelines (Epic 1 / Story 1.2)

Once you have a DataSet (typically from ingestion::ingest_from_path), you can apply in-memory transformations using rust_data_processing::processing:

  • filter(&DataSet, predicate) -> DataSet
  • map(&DataSet, mapper) -> DataSet
  • reduce(&DataSet, column, ReduceOp) -> Option<Value> — includes mean, variance, std dev (VarianceKind::{Population, Sample}), sum of squares, L2 norm, count distinct (non-null), plus count / sum / min / max
  • feature_wise_mean_std(&DataSet, &[&str], VarianceKind) — one pass over rows for mean + std on several numeric columns (FeatureMeanStd)
  • arg_max_row / arg_min_row — first row index where a column is max/min (ties: smallest index)
  • top_k_by_frequency — top‑(k) (value, count) pairs for label-style columns

Polars-backed equivalents for whole-frame scalars: pipeline::DataFrame::reduce, feature_wise_mean_std. Semantics: Planning/REDUCE_AGG_SEMANTICS.md.

Full Rust examples (filter/map/reduce, aggregates, parallel ExecutionEngine, Criterion benchmarks, ingestion observers): docs/rust/README.md § Processing pipelines.

Supported formats

  • CSV: headers required; schema fields must exist; columns may be reordered.
  • JSON: supports JSON array of objects or NDJSON; nested fields via dot paths (e.g. user.name).
  • Parquet: validates required columns; uses the Parquet record API for reading.
  • Excel: behind the Cargo feature excel.

Features

  • excel: enable Excel ingestion (adds calamine)
  • excel_test_writer: enables Excel integration tests that generate an .xlsx at runtime
  • sql: enable Polars-backed SQL support (adds polars-sql). Enabled by default.
  • db_connectorx: enable direct DB ingestion via ConnectorX (DB → Arrow → DataSet) including Postgres/MySQL/MS SQL/Oracle sources
  • arrow: enable Arrow interop helpers (adds arrow)
  • serde_arrow: enable Serde-based Arrow interop helpers (adds serde + serde_arrow)
  • Note: ConnectorX’s Postgres support uses OpenSSL; on Windows you may need additional build prerequisites (e.g. Perl for vendored OpenSSL or a system OpenSSL install).

Disable default features (including SQL) if you want a smaller dependency surface:

[dependencies]
rust-data-processing = { path = ".", default-features = false }

“Pit of success” defaults (Phase 1)

  • Ingestion defaults: format is auto-detected by extension; observers are off by default; failures are surfaced as IngestionError (no automatic retries).
  • Execution defaults: ExecutionOptions::default() uses available parallelism, a moderate chunk_size, and throttles in-flight chunks.
  • Polars pipelines + SQL: we intentionally avoid exposing engine-specific tuning knobs in the public API. If you need low-level tuning, use Polars’ own configuration (e.g. environment variables) and treat behavior as Polars-owned.

Run tests

./scripts/run_unit_tests.ps1

Generate API docs (Rustdoc + optional Python pdoc)

Rust uses Rustdoc; Python bindings use pdoc (same as CI). See Documentation for published links.

./scripts/build_docs.ps1              # Rust only → target/doc/
./scripts/build_docs.ps1 -All         # Rust + Python → target/doc/ and _site/python/

Deep tests (large/realistic fixtures)

Deep tests are not run as part of ./scripts/run_unit_tests.ps1. They are feature-gated behind deep_tests.

./scripts/run_deep_tests.ps1

Development on Windows (toolchain + linker)

Rust installs its tools into:

  • %USERPROFILE%\.cargo\bin (example: C:\Users\Vihan\.cargo\bin)

That directory must be on your PATH so rustc, cargo, and rustup can be found.

If you see error: linker 'link.exe' not found, install Build Tools for Visual Studio 2026 and select:

  • Desktop development with C++
  • MSVC v144 - VS 2026 C++ x64/x86 build tools
  • Windows 10/11 SDK

Then open the project from Developer PowerShell for VS 2026 (or restart your terminal) and rerun:

cargo test

Verify toolchain

where.exe rustc
rustc --version
cargo --version
rustup --version

Fix PATH for the current PowerShell session (no restart)

$env:Path = [Environment]::GetEnvironmentVariable('Path','Machine') + ';' + `
            [Environment]::GetEnvironmentVariable('Path','User')

Ensure %USERPROFILE%\.cargo\bin is on your User PATH (permanent)

$cargoBin = Join-Path $env:USERPROFILE '.cargo\bin'
$userPath = [Environment]::GetEnvironmentVariable('Path','User')
if ([string]::IsNullOrWhiteSpace($userPath)) { $userPath = '' }

$parts = $userPath -split ';' | Where-Object { $_ -and $_.Trim() -ne '' }
if ($parts -notcontains $cargoBin) {
  [Environment]::SetEnvironmentVariable('Path', (@($parts + $cargoBin) -join ';'), 'User')
}

After changing the User PATH, restart your terminal (or log out/in) so new shells inherit it.

License

Dual-licensed under your choice of Apache License 2.0 or MIT. See LICENSE-APACHE and LICENSE-MIT.

SPDX-License-Identifier: MIT OR Apache-2.0

Publishing to crates.io

Maintainers: see Planning/RELEASE_CHECKLIST.md and How_TO_deploy.md. After the first successful cargo publish, API docs appear on docs.rs for the published version.