dataprof
A fast, reliable data quality assessment tool built in Rust. Analyze datasets with 20x better memory efficiency than pandas, stream files of any size, and run comprehensive ISO 8000/25012-compliant quality checks across five dimensions: Completeness, Consistency, Uniqueness, Accuracy, and Timeliness. Full Python bindings and production database connectivity are included.
Perfect for data scientists, engineers, analysts, and anyone working with data who needs quick, reliable quality insights.
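The five quality dimensions have simple intuitions. A minimal sketch of two of them, completeness and uniqueness, using illustrative helper functions that are not part of dataprof's API:

```python
def completeness(column):
    """Completeness: fraction of values that are present (not None/empty)."""
    if not column:
        return 0.0
    present = sum(1 for v in column if v not in (None, ""))
    return present / len(column)

def uniqueness(column):
    """Uniqueness: fraction of present values that are distinct."""
    present = [v for v in column if v not in (None, "")]
    if not present:
        return 0.0
    return len(set(present)) / len(present)

emails = ["a@x.io", "b@x.io", "a@x.io", None, ""]
print(completeness(emails))  # 3 of 5 values present -> 0.6
print(uniqueness(emails))    # 2 distinct among the 3 present values
```

dataprof computes these (and Consistency, Accuracy, Timeliness) per column and aggregates them into the overall quality score.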
Privacy & Transparency
DataProf processes all data locally on your machine. Zero telemetry, zero external data transmission.
Read exactly what DataProf analyzes →
- 100% local processing - your data never leaves your machine
- No telemetry or tracking
- Open source & fully auditable
- Read-only database access (when using DB features)
Complete transparency: Every metric, calculation, and data point is documented with source code references for independent verification.
CI/CD Integration
Automate data quality checks in your workflows with our GitHub Action:
```yaml
- name: DataProf Quality Check
  uses: AndreaBozzo/dataprof-actions@v1
  with:
    file: 'data/dataset.csv'
    quality-threshold: 80
    fail-on-issues: true
    # Batch mode (NEW)
    recursive: true
    output-html: 'quality-report.html'
```
- Zero setup - works out of the box
- ISO 8000/25012 compliant - industry-standard quality metrics
- Batch processing - analyze entire directories recursively
- Flexible - customizable thresholds and output formats
- Fast - typically completes in under 2 minutes
Perfect for ensuring data quality in pipelines, validating data integrity, or generating automated quality reports.
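The quality-threshold / fail-on-issues gate amounts to comparing dataprof's overall score against the configured threshold and failing the job when it falls short. A minimal sketch (the function and variable names are illustrative, not the action's implementation):

```python
def quality_gate(overall_score, threshold=80.0, fail_on_issues=True):
    """Return a CI exit code: 0 when the gate passes, 1 when it fails."""
    if overall_score >= threshold:
        print(f"Quality gate passed: {overall_score:.1f} >= {threshold}")
        return 0
    print(f"Score {overall_score:.1f} below threshold {threshold}")
    return 1 if fail_on_issues else 0

# In a pipeline you would exit with this code, e.g.:
#   raise SystemExit(quality_gate(score_from_dataprof))
quality_gate(91.5)  # passes -> 0
quality_gate(72.0)  # fails  -> 1
```

A nonzero exit code is what makes the CI step (and hence the workflow) fail.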
Quick Start
CLI (Recommended - Full Features)
Installation: Download pre-built binaries from Releases or build from source with `cargo install dataprof`.
Note: After building with `cargo build --release`, the binary is located at `target/release/dataprof-cli.exe` (Windows) or `target/release/dataprof` (Linux/Mac). Run it from the project root as `target/release/dataprof-cli.exe <command>` or add it to your PATH.
Basic Analysis
```bash
# NOTE: subcommand/flag names below are illustrative; check dataprof help
# Comprehensive quality analysis
dataprof analyze data.csv

# Analyze Parquet files (requires --features parquet)
dataprof analyze data.parquet

# Windows example (from project root after cargo build --release)
target/release/dataprof-cli.exe analyze data.csv
```
HTML Reports
```bash
# (flag names illustrative; see dataprof report --help)
# Generate HTML report with visualizations
dataprof report data.csv --output report.html

# Custom template
dataprof report data.csv --output report.html --template my_template.html
```
Batch Processing
```bash
# (flag names illustrative; see dataprof batch --help)
# Process entire directory with parallel execution
dataprof batch data/ --parallel

# Generate HTML batch dashboard
dataprof batch data/ --html dashboard.html

# JSON export for CI/CD automation
dataprof batch data/ --format json --output results.json

# JSON output to stdout
dataprof batch data/ --format json

# With custom filter and progress
dataprof batch data/ --filter "*.csv" --progress
```

Database Analysis
```bash
# (flag names illustrative; see dataprof database --help)
# PostgreSQL table profiling
dataprof database --connection "postgresql://user:pass@localhost/db" --table users

# Custom SQL query
dataprof database --connection "postgresql://user:pass@localhost/db" --query "SELECT * FROM users LIMIT 1000"
```
Benchmarking
```bash
# (flag names illustrative; see dataprof benchmark --help)
# Benchmark different engines on your data
dataprof benchmark data.csv

# Show engine information
dataprof benchmark --info
```
Advanced Options
```bash
# (flag names illustrative; see dataprof analyze --help)
# Streaming for large files
dataprof analyze big.csv --streaming

# JSON output for programmatic use
dataprof analyze data.csv --format json

# Custom ISO threshold profile
dataprof analyze data.csv --threshold-profile strict
```
Quick Reference: All commands follow the pattern `dataprof <command> [args]`. Use `dataprof help` or `dataprof <command> --help` for detailed options.
Python Bindings
```python
import dataprof

# Comprehensive quality analysis (ISO 8000/25012 compliant)
report = dataprof.analyze_csv_with_quality("data.csv")

# Access individual quality dimensions
# (attribute names illustrative; see the Python API reference)
print(report.quality_score)

# Batch processing
# (function name illustrative; see the Python API reference)
results = dataprof.batch_analyze_directory("data/")
```
Note: Database profiling is available via CLI only. Python users can export SQL results to CSV and use `analyze_csv_with_quality()`.
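The export-then-analyze workaround in the note above can be sketched with the standard library. The database, table, and file names here are placeholders, and the final dataprof call is commented out because it assumes the dataprof package is installed:

```python
import csv
import sqlite3

# Placeholder in-memory database; swap in your real connection and SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, email TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(1, "a@x.io"), (2, "b@x.io"), (3, None)])

# Dump the query result to CSV, header row included.
cur = conn.execute("SELECT id, email FROM users")
with open("users_export.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow([col[0] for col in cur.description])
    writer.writerows(cur.fetchall())

# Then profile the export with the Python bindings:
# import dataprof
# report = dataprof.analyze_csv_with_quality("users_export.csv")
```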
Full Python API Documentation →
Rust Library
```rust
// NOTE: constructor and file names below are illustrative; see the crate docs
use dataprof::*;

// High-performance Arrow processing for large files (>100MB)
// Requires compilation with: cargo build --features arrow
let profiler = DataProfiler::columnar();
let report = profiler.analyze_csv_file("large_data.csv")?;

// Standard adaptive profiling (recommended for most use cases)
let profiler = DataProfiler::auto();
let report = profiler.analyze_file("data.csv")?;
```
Development
Want to contribute or build from source? Here's what you need:
Prerequisites
- Rust (latest stable via rustup)
- Docker (for database testing)
Quick Setup
Feature Flags
dataprof uses optional features to keep compile times fast and binaries lean:
```bash
# Minimal build (CSV/JSON only, ~60s compile)
cargo build --release

# With Apache Arrow (columnar processing, ~90s compile)
cargo build --release --features arrow

# With Parquet support (requires arrow, ~95s compile)
cargo build --release --features parquet

# With database connectors
# (connector feature names: see the [features] table in Cargo.toml)
cargo build --release --features database

# All features (full functionality, ~130s compile)
cargo build --release --all-features
```
When to use Arrow?
- ✅ Files > 100MB with many columns (>20)
- ✅ Columnar data with uniform types
- ✅ Need maximum throughput (up to 13x faster)
- ❌ Small files (<10MB) - standard engine is faster
- ❌ Mixed/messy data - streaming engine handles better
When to use Parquet?
- ✅ Analytics workloads with columnar data
- ✅ Data lake architectures
- ✅ Integration with Spark, Pandas, PyArrow
- ✅ Efficient storage and compression
- ✅ Type-safe schema preservation
Common Development Tasks
Documentation
Privacy & Transparency
- What DataProf Does - Complete transparency guide with source code verification
User Guides
- Python API Reference - Full Python API documentation
- Python Integrations - Pandas, scikit-learn, Jupyter, Airflow, dbt
- Database Connectors - Production database connectivity
- Apache Arrow Integration - Columnar processing guide
- CLI Usage Guide - Complete CLI reference
Developer Guides
- Development Guide - Complete setup and contribution guide
- Performance Guide - Optimization and benchmarking
- Performance Benchmarks - Benchmark results and methodology
License
Licensed under the MIT License. See LICENSE for details.