dataprof is a Rust library and CLI for profiling tabular data. It computes column-level statistics, detects data types and patterns, and evaluates data quality against the ISO 8000 and ISO/IEC 25012 standards -- all with bounded memory usage that lets you profile datasets far larger than your available RAM.
## Highlights
- Rust core -- fast columnar and streaming engines
- ISO 8000/25012 quality assessment -- five dimensions: Completeness, Consistency, Uniqueness, Accuracy, Timeliness
- Multi-format -- CSV (auto-delimiter detection), JSON, JSONL, Parquet, databases, DataFrames, Arrow
- True streaming -- bounded-memory profiling with online algorithms (Incremental engine)
- Three interfaces -- CLI binary, Rust library, Python package
- Async-ready -- `async`/`await` API for embedding in web services and stream pipelines
## Quick Start
### CLI
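A minimal invocation might look like the following; the exact argument syntax is an assumption, so check `dataprof --help` (the CLI is built with clap) for the real subcommands and flags:

```shell
# Assumed invocation: profile a CSV file and print the quality report
dataprof data.csv
```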
### Rust
```rust
use dataprof::Profiler;

// Analyze a file; the report/column field names below are illustrative --
// see the crate docs for the exact shape of the report types.
let report = Profiler::new().analyze_file("data.csv")?;
println!("{} rows profiled", report.row_count);
println!("quality score: {:.1}", report.quality_score);
for col in &report.column_profiles {
    println!("{}: {:?}", col.name, col.data_type);
}
```
### Python
```python
import dataprof  # package name assumed to match the project

report = dataprof.profile("data.csv")  # profile() per the Python API guide
```
## Installation
### CLI binary
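Assuming the crate name on crates.io matches the project name, the binary installs with:

```shell
cargo install dataprof   # builds with the default `cli` feature
```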
### Rust library
```toml
[dependencies]
dataprof = "0.6"  # core library (no CLI deps)

# or, with the async streaming engine enabled:
dataprof = { version = "0.6", features = ["async-streaming"] }
```
### Python package

```shell
pip install dataprof   # PyPI package name assumed to match the project
```
## Feature Flags
| Feature | Description |
|---|---|
| `cli` (default) | CLI binary with clap, colored output, progress bars |
| `minimal` | CSV-only, no CLI -- fastest compile |
| `async-streaming` | Async profiling engine with tokio |
| `parquet-async` | Profile Parquet files over HTTP |
| `database` | Database profiling (connection handling, retry, SSL) |
| `postgres` | PostgreSQL connector (includes `database`) |
| `mysql` | MySQL/MariaDB connector (includes `database`) |
| `sqlite` | SQLite connector (includes `database`) |
| `all-db` | All three database connectors |
| `datafusion` | DataFusion SQL engine integration |
| `python` | Python bindings via PyO3 |
| `python-async` | Async Python API (includes `python` + `async-streaming`) |
| `full-cli` | CLI + Parquet + all databases |
| `production` | PostgreSQL + MySQL (common deployment) |
## Supported Formats
| Format | Engine | Notes |
|---|---|---|
| CSV | Incremental, Columnar | Auto-detects `,`, `;`, `\|`, `\t` delimiters |
| JSON | Incremental | Array-of-objects |
| JSONL / NDJSON | Incremental | One object per line |
| Parquet | Columnar | Reads metadata for schema/count without scanning rows |
| Database query | Async | PostgreSQL, MySQL, SQLite via connection string |
| pandas / polars DataFrame | Columnar | Python API only |
| Arrow RecordBatch | Columnar | Via PyCapsule (zero-copy) or Rust API |
| Async byte stream | Incremental | Any AsyncRead source (HTTP, WebSocket, etc.) |
## Quality Metrics
dataprof evaluates data quality against the five dimensions defined in ISO 8000-8 and ISO/IEC 25012:
| Dimension | What it measures |
|---|---|
| Completeness | Missing values ratio, complete records ratio, fully-null columns |
| Consistency | Data type consistency, format violations, encoding issues |
| Uniqueness | Duplicate rows, key uniqueness, high-cardinality warnings |
| Accuracy | Outlier ratio, range violations, negative values in positive-only columns |
| Timeliness | Future dates, stale data ratio, temporal ordering violations |
An overall quality score (0 -- 100) is computed as a weighted average of dimension scores.
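As an illustration of that aggregation (the weights below are made up for the example, not dataprof's actual values, and `overall_score` is not part of its API), a weighted mean over the five dimension scores looks like:

```rust
// Sketch: combine five dimension scores (each 0-100) into an overall
// quality score via a weighted average. Weights here are illustrative only.
fn overall_score(scores: &[(f64, f64)]) -> f64 {
    // scores: (dimension_score, weight) pairs
    let total_weight: f64 = scores.iter().map(|(_, w)| w).sum();
    scores.iter().map(|(s, w)| s * w).sum::<f64>() / total_weight
}

fn main() {
    // (score, weight): Completeness, Consistency, Uniqueness, Accuracy, Timeliness
    let dims = [(98.0, 0.3), (95.0, 0.2), (100.0, 0.2), (90.0, 0.2), (85.0, 0.1)];
    println!("{:.1}", overall_score(&dims)); // 94.9
}
```

Normalizing by the weight sum keeps the result on the same 0--100 scale even if the weights do not sum to 1.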
## Documentation
- CLI Usage Guide -- every subcommand and flag
- Python API Guide -- `profile()`, report types, async, databases
- Getting Started -- tutorial from zero to profiling
- Examples Cookbook -- copy-pasteable recipes (CLI, Python, Rust)
- Database Connectors -- PostgreSQL, MySQL, SQLite setup
- Contributing
- Changelog
## License
Dual-licensed under either the MIT License or the Apache License, Version 2.0, at your option.