dataprof is a Rust library and CLI for profiling tabular data. It computes column-level statistics, detects data types and patterns, and evaluates data quality against the ISO 8000 and ISO/IEC 25012 standards -- all with bounded memory usage that lets you profile datasets far larger than your available RAM.
## Highlights
- Rust core -- fast columnar and streaming engines
- ISO 8000/25012 quality assessment -- five dimensions: Completeness, Consistency, Uniqueness, Accuracy, Timeliness
- Multi-format -- CSV (auto-delimiter detection), JSON, JSONL, Parquet, databases, DataFrames, Arrow
- True streaming -- bounded-memory profiling with online algorithms (Incremental engine)
- Three interfaces -- CLI binary, Rust library, Python package
- Async-ready -- `async`/`await` API for embedding in web services and stream pipelines
## Quick Start
### CLI
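A minimal invocation might look like the following; the exact argument syntax is an assumption, so check `dataprof --help` (the CLI is built with clap) for the real subcommands and flags:

```shell
# Assumed invocation: profile a CSV file and print the quality report
dataprof data.csv
```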
### Rust
```rust
use dataprof::Profiler;

// Analyze a file; the report/column field names below are illustrative --
// see the crate docs for the exact shape of the report types.
let report = Profiler::new().analyze_file("data.csv")?;
println!("{} rows profiled", report.row_count);
println!("quality score: {:.1}", report.quality_score);
for col in &report.column_profiles {
    println!("{}: {:?}", col.name, col.data_type);
}
```
### Python
```python
import dataprof  # package name assumed to match the project

report = dataprof.profile("data.csv")  # profile() per the Python API guide
```
## Installation
### CLI binary
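Assuming the crate name on crates.io matches the project name, the binary installs with:

```shell
cargo install dataprof   # builds with the default `cli` feature
```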
### Rust library
```toml
[dependencies]
dataprof = "0.6"  # core library (no CLI deps)

# or, with the async streaming engine enabled:
dataprof = { version = "0.6", features = ["async-streaming"] }
```
### Python package

```shell
pip install dataprof   # PyPI package name assumed to match the project
```
## Feature Flags
| Feature | Description |
|---|---|
| `cli` (default) | CLI binary with clap, colored output, progress bars |
| `minimal` | CSV-only, no CLI -- fastest compile |
| `async-streaming` | Async profiling engine with tokio |
| `parquet-async` | Profile Parquet files over HTTP |
| `database` | Database profiling (connection handling, retry, SSL) |
| `postgres` | PostgreSQL connector (includes `database`) |
| `mysql` | MySQL/MariaDB connector (includes `database`) |
| `sqlite` | SQLite connector (includes `database`) |
| `all-db` | All three database connectors |
| `datafusion` | DataFusion SQL engine integration |
| `python` | Python bindings via PyO3 |
| `python-async` | Async Python API (includes `python` + `async-streaming`) |
| `full-cli` | CLI + Parquet + all databases |
| `production` | PostgreSQL + MySQL (common deployment) |
## Supported Formats
| Format | Engine | Notes |
|---|---|---|
| CSV | Incremental, Columnar | Auto-detects `,`, `;`, `\|`, `\t` delimiters |
| JSON | Incremental | Array-of-objects |
| JSONL / NDJSON | Incremental | One object per line |
| Parquet | Columnar | Reads metadata for schema/count without scanning rows |
| Database query | Async | PostgreSQL, MySQL, SQLite via connection string |
| pandas / polars DataFrame | Columnar | Python API only |
| Arrow RecordBatch | Columnar | Via PyCapsule (zero-copy) or Rust API |
| Async byte stream | Incremental | Any AsyncRead source (HTTP, WebSocket, etc.) |
## Quality Metrics
dataprof evaluates data quality against the five dimensions defined in ISO 8000-8 and ISO/IEC 25012:
| Dimension | What it measures |
|---|---|
| Completeness | Missing values ratio, complete records ratio, fully-null columns |
| Consistency | Data type consistency, format violations, encoding issues |
| Uniqueness | Duplicate rows, key uniqueness, high-cardinality warnings |
| Accuracy | Outlier ratio, range violations, negative values in positive-only columns |
| Timeliness | Future dates, stale data ratio, temporal ordering violations |
An overall quality score (0 -- 100) is computed as a weighted average of dimension scores.
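As an illustration of that aggregation (the weights below are made up for the example, not dataprof's actual values, and `overall_score` is not part of its API), a weighted mean over the five dimension scores looks like:

```rust
// Sketch: combine five dimension scores (each 0-100) into an overall
// quality score via a weighted average. Weights here are illustrative only.
fn overall_score(scores: &[(f64, f64)]) -> f64 {
    // scores: (dimension_score, weight) pairs
    let total_weight: f64 = scores.iter().map(|(_, w)| w).sum();
    scores.iter().map(|(s, w)| s * w).sum::<f64>() / total_weight
}

fn main() {
    // (score, weight): Completeness, Consistency, Uniqueness, Accuracy, Timeliness
    let dims = [(98.0, 0.3), (95.0, 0.2), (100.0, 0.2), (90.0, 0.2), (85.0, 0.1)];
    println!("{:.1}", overall_score(&dims)); // 94.9
}
```

Normalizing by the weight sum keeps the result on the same 0--100 scale even if the weights do not sum to 1.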
## Documentation
- CLI Usage Guide -- every subcommand and flag
- Python API Guide -- `profile()`, report types, async, databases
- Getting Started -- tutorial from zero to profiling
- Examples Cookbook -- copy-pasteable recipes (CLI, Python, Rust)
- Database Connectors -- PostgreSQL, MySQL, SQLite setup
- Contributing
- Changelog
## License
Dual-licensed under either the MIT License or the Apache License, Version 2.0, at your option.