dataprof 0.6.1

High-performance data profiler with ISO 8000/25012 quality metrics for CSV, JSON/JSONL, and Parquet files


dataprof is a Rust library and CLI for profiling tabular data. It computes column-level statistics, detects data types and patterns, and evaluates data quality against the ISO 8000/25012 standard -- all with bounded memory usage that lets you profile datasets far larger than your available RAM.

Highlights

  • Rust core -- fast columnar and streaming engines
  • ISO 8000/25012 quality assessment -- five dimensions: Completeness, Consistency, Uniqueness, Accuracy, Timeliness
  • Multi-format -- CSV (auto-delimiter detection), JSON, JSONL, Parquet, databases, DataFrames, Arrow
  • True streaming -- bounded-memory profiling with online algorithms (Incremental engine)
  • Three interfaces -- CLI binary, Rust library, Python package
  • Async-ready -- async/await API for embedding in web services and stream pipelines
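
The streaming engine keeps memory bounded by computing statistics in a single pass instead of holding rows in memory. dataprof's internals aren't shown here, but Welford's online algorithm is a representative sketch of how a running mean and variance can be maintained this way (illustrative only, not dataprof's actual code):

```python
class OnlineStats:
    """Single-pass mean/variance via Welford's algorithm.
    Illustrative sketch -- not dataprof's actual implementation."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        # Population variance; 0.0 until at least two values are seen.
        return self.m2 / self.n if self.n > 1 else 0.0

stats = OnlineStats()
for x in [2.0, 4.0, 6.0, 8.0]:
    stats.update(x)
print(stats.mean, stats.variance)  # 5.0 5.0
```

Each `update` is O(1) in time and memory, which is what lets a streaming profiler handle datasets larger than RAM.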

Quick Start

CLI

cargo install dataprof

dataprof analyze data.csv --detailed
dataprof schema data.csv
dataprof count data.parquet

Rust

use dataprof::Profiler;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let report = Profiler::new().analyze_file("data.csv")?;
    println!("Rows: {}", report.execution.rows_processed);
    println!("Quality: {:.1}%", report.quality_score().unwrap_or(0.0));

    for col in &report.column_profiles {
        println!("  {} ({:?}): {} nulls", col.name, col.data_type, col.null_count);
    }
    Ok(())
}

Python

import dataprof

report = dataprof.profile("data.csv")
print(f"{report.rows} rows, {report.columns} columns")
print(f"Quality score: {report.quality_score}")

for col in report.column_profiles:
    print(f"  {col.name} ({col.data_type}): {col.null_percentage:.1f}% null")

Installation

CLI binary

cargo install dataprof                        # default (CLI only)
cargo install dataprof --features full-cli    # CLI + all formats + databases

Rust library

[dependencies]
dataprof = "0.6"                  # core library (no CLI deps)

# or, with async streaming enabled:
dataprof = { version = "0.6", features = ["async-streaming"] }

Python package

uv pip install dataprof
# or
pip install dataprof

Feature Flags

Feature           Description
cli (default)     CLI binary with clap, colored output, progress bars
minimal           CSV-only, no CLI -- fastest compile
async-streaming   Async profiling engine with tokio
parquet-async     Profile Parquet files over HTTP
database          Database profiling (connection handling, retry, SSL)
postgres          PostgreSQL connector (includes database)
mysql             MySQL/MariaDB connector (includes database)
sqlite            SQLite connector (includes database)
all-db            All three database connectors
datafusion        DataFusion SQL engine integration
python            Python bindings via PyO3
python-async      Async Python API (includes python + async-streaming)
full-cli          CLI + Parquet + all databases
production        PostgreSQL + MySQL (common deployment)
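
As an example, a service that profiles Parquet files over HTTP and PostgreSQL tables asynchronously might combine flags like this (feature names taken from the table above; whether this exact combination fits your deployment is for you to verify):

```toml
[dependencies]
dataprof = { version = "0.6", features = ["parquet-async", "postgres", "async-streaming"] }
```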

Supported Formats

Format                     Engine                 Notes
CSV                        Incremental, Columnar  Auto-detects , ; | \t delimiters
JSON                       Incremental            Array-of-objects
JSONL / NDJSON             Incremental            One object per line
Parquet                    Columnar               Reads metadata for schema/count without scanning rows
Database query             Async                  PostgreSQL, MySQL, SQLite via connection string
pandas / polars DataFrame  Columnar               Python API only
Arrow RecordBatch          Columnar               Via PyCapsule (zero-copy) or Rust API
Async byte stream          Incremental            Any AsyncRead source (HTTP, WebSocket, etc.)

Quality Metrics

dataprof evaluates data quality against the five dimensions defined in ISO 8000-8 and ISO/IEC 25012:

Dimension     What it measures
Completeness  Missing values ratio, complete records ratio, fully-null columns
Consistency   Data type consistency, format violations, encoding issues
Uniqueness    Duplicate rows, key uniqueness, high-cardinality warnings
Accuracy      Outlier ratio, range violations, negative values in positive-only columns
Timeliness    Future dates, stale data ratio, temporal ordering violations

An overall quality score (0 -- 100) is computed as a weighted average of dimension scores.
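
As a sketch of that computation, the overall score is a weighted mean of the five dimension scores. The weights below are made-up placeholders for illustration, not dataprof's actual defaults:

```python
# Hypothetical weights -- NOT dataprof's actual defaults.
WEIGHTS = {
    "completeness": 0.30,
    "consistency": 0.25,
    "uniqueness": 0.15,
    "accuracy": 0.20,
    "timeliness": 0.10,
}

def overall_score(dimension_scores: dict) -> float:
    """Weighted average of per-dimension scores, each on a 0-100 scale."""
    total_weight = sum(WEIGHTS[d] for d in dimension_scores)
    weighted = sum(WEIGHTS[d] * s for d, s in dimension_scores.items())
    return weighted / total_weight

scores = {"completeness": 98.0, "consistency": 90.0, "uniqueness": 100.0,
          "accuracy": 85.0, "timeliness": 70.0}
print(f"{overall_score(scores):.1f}")  # 90.9
```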

License

Dual-licensed under either the MIT License or the Apache License, Version 2.0, at your option.