dataprof
A fast, reliable data quality assessment tool built in Rust. Analyze datasets with 20x better memory efficiency than pandas, stream files of any size, and run comprehensive ISO 8000/25012-compliant quality checks across five dimensions: Completeness, Consistency, Uniqueness, Accuracy, and Timeliness. Full Python bindings and production database connectivity are included.
Perfect for data scientists, engineers, analysts, and anyone working with data who needs quick, reliable quality insights.
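The five quality dimensions have simple intuitions. A minimal sketch of two of them, completeness and uniqueness, using illustrative helper functions that are not part of dataprof's API:

```python
def completeness(column):
    """Completeness: fraction of values that are present (not None/empty)."""
    if not column:
        return 0.0
    present = sum(1 for v in column if v not in (None, ""))
    return present / len(column)

def uniqueness(column):
    """Uniqueness: fraction of present values that are distinct."""
    present = [v for v in column if v not in (None, "")]
    if not present:
        return 0.0
    return len(set(present)) / len(present)

emails = ["a@x.io", "b@x.io", "a@x.io", None, ""]
print(completeness(emails))  # 3 of 5 values present -> 0.6
print(uniqueness(emails))    # 2 distinct among the 3 present values
```

dataprof computes these (and Consistency, Accuracy, Timeliness) per column and aggregates them into the overall quality score.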
Privacy & Transparency
DataProf processes all data locally on your machine. Zero telemetry, zero external data transmission.
Read exactly what DataProf analyzes →
- 100% local processing - your data never leaves your machine
- No telemetry or tracking
- Open source & fully auditable
- Read-only database access (when using DB features)
Complete transparency: Every metric, calculation, and data point is documented with source code references for independent verification.
CI/CD Integration
Automate data quality checks in your workflows with our GitHub Action:
```yaml
- name: DataProf Quality Check
  uses: AndreaBozzo/dataprof-actions@v1
  with:
    file: 'data/dataset.csv'
    quality-threshold: 80
    fail-on-issues: true
    # Batch mode (NEW)
    recursive: true
    output-html: 'quality-report.html'
```
- Zero setup - works out of the box
- ISO 8000/25012 compliant - industry-standard quality metrics
- Batch processing - analyze entire directories recursively
- Flexible - customizable thresholds and output formats
- Fast - typically completes in under 2 minutes
Perfect for ensuring data quality in pipelines, validating data integrity, or generating automated quality reports.
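The quality-threshold / fail-on-issues gate amounts to comparing dataprof's overall score against the configured threshold and failing the job when it falls short. A minimal sketch (the function and variable names are illustrative, not the action's implementation):

```python
def quality_gate(overall_score, threshold=80.0, fail_on_issues=True):
    """Return a CI exit code: 0 when the gate passes, 1 when it fails."""
    if overall_score >= threshold:
        print(f"Quality gate passed: {overall_score:.1f} >= {threshold}")
        return 0
    print(f"Score {overall_score:.1f} below threshold {threshold}")
    return 1 if fail_on_issues else 0

# In a pipeline you would exit with this code, e.g.:
#   raise SystemExit(quality_gate(score_from_dataprof))
quality_gate(91.5)  # passes -> 0
quality_gate(72.0)  # fails  -> 1
```

A nonzero exit code is what makes the CI step (and hence the workflow) fail.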
Quick Start
CLI (Recommended - Full Features)
Installation: Download pre-built binaries from Releases or build from source with `cargo install dataprof`.
Note: After building with `cargo build --release`, the binary is located at `target/release/dataprof-cli.exe` (Windows) or `target/release/dataprof` (Linux/Mac). Run it from the project root as `target/release/dataprof-cli.exe <command>` or add it to your PATH.
Basic Analysis
```bash
# NOTE: subcommand/flag names below are illustrative; check dataprof help
# Comprehensive quality analysis
dataprof analyze data.csv

# Analyze Parquet files (requires --features parquet)
dataprof analyze data.parquet

# Windows example (from project root after cargo build --release)
target/release/dataprof-cli.exe analyze data.csv
```
HTML Reports
```bash
# (flag names illustrative; see dataprof report --help)
# Generate HTML report with visualizations
dataprof report data.csv --output report.html

# Custom template
dataprof report data.csv --output report.html --template my_template.html
```
Batch Processing
```bash
# (flag names illustrative; see dataprof batch --help)
# Process entire directory with parallel execution
dataprof batch data/ --parallel

# Generate HTML batch dashboard
dataprof batch data/ --html dashboard.html

# JSON export for CI/CD automation
dataprof batch data/ --format json --output results.json

# JSON output to stdout
dataprof batch data/ --format json

# With custom filter and progress
dataprof batch data/ --filter "*.csv" --progress
```

Database Analysis
```bash
# (flag names illustrative; see dataprof database --help)
# PostgreSQL table profiling
dataprof database --connection "postgresql://user:pass@localhost/db" --table users

# Custom SQL query
dataprof database --connection "postgresql://user:pass@localhost/db" --query "SELECT * FROM users LIMIT 1000"
```
Benchmarking
```bash
# (flag names illustrative; see dataprof benchmark --help)
# Benchmark different engines on your data
dataprof benchmark data.csv

# Show engine information
dataprof benchmark --info
```
Advanced Options
```bash
# (flag names illustrative; see dataprof analyze --help)
# Streaming for large files
dataprof analyze big.csv --streaming

# JSON output for programmatic use
dataprof analyze data.csv --format json

# Custom ISO threshold profile
dataprof analyze data.csv --threshold-profile strict
```
Quick Reference: All commands follow the pattern `dataprof <command> [args]`. Use `dataprof help` or `dataprof <command> --help` for detailed options.
Python Bindings
```python
import dataprof

# Comprehensive quality analysis (ISO 8000/25012 compliant)
report = dataprof.analyze_csv_with_quality("data.csv")

# Access individual quality dimensions
# (attribute names illustrative; see the Python API reference)
print(report.quality_score)

# Batch processing
# (function name illustrative; see the Python API reference)
results = dataprof.batch_analyze_directory("data/")
```
Note: Database profiling is available via CLI only. Python users can export SQL results to CSV and use `analyze_csv_with_quality()`.
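The export-then-analyze workaround in the note above can be sketched with the standard library. The database, table, and file names here are placeholders, and the final dataprof call is commented out because it assumes the dataprof package is installed:

```python
import csv
import sqlite3

# Placeholder in-memory database; swap in your real connection and SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, email TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(1, "a@x.io"), (2, "b@x.io"), (3, None)])

# Dump the query result to CSV, header row included.
cur = conn.execute("SELECT id, email FROM users")
with open("users_export.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow([col[0] for col in cur.description])
    writer.writerows(cur.fetchall())

# Then profile the export with the Python bindings:
# import dataprof
# report = dataprof.analyze_csv_with_quality("users_export.csv")
```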
Full Python API Documentation →
Rust Library
```rust
// NOTE: constructor and file names below are illustrative; see the crate docs
use dataprof::*;

// High-performance Arrow processing for large files (>100MB)
// Requires compilation with: cargo build --features arrow
let profiler = DataProfiler::columnar();
let report = profiler.analyze_csv_file("large_data.csv")?;

// Standard adaptive profiling (recommended for most use cases)
let profiler = DataProfiler::auto();
let report = profiler.analyze_file("data.csv")?;
```
Development
Want to contribute or build from source? Here's what you need:
Prerequisites
- Rust (latest stable via rustup)
- Docker (for database testing)
Quick Setup
Feature Flags
dataprof uses optional features to keep compile times fast and binaries lean:
```bash
# Minimal build (CSV/JSON only, ~60s compile)
cargo build --release

# With Apache Arrow (columnar processing, ~90s compile)
cargo build --release --features arrow

# With Parquet support (requires arrow, ~95s compile)
cargo build --release --features parquet

# With database connectors
# (connector feature names: see the [features] table in Cargo.toml)
cargo build --release --features database

# All features (full functionality, ~130s compile)
cargo build --release --all-features
```
When to use Arrow?
- ✅ Files > 100MB with many columns (>20)
- ✅ Columnar data with uniform types
- ✅ Need maximum throughput (up to 13x faster)
- ❌ Small files (<10MB) - standard engine is faster
- ❌ Mixed/messy data - streaming engine handles better
When to use Parquet?
- ✅ Analytics workloads with columnar data
- ✅ Data lake architectures
- ✅ Integration with Spark, Pandas, PyArrow
- ✅ Efficient storage and compression
- ✅ Type-safe schema preservation
Common Development Tasks
Documentation
Privacy & Transparency
- What DataProf Does - Complete transparency guide with source code verification
User Guides
- Python API Reference - Full Python API documentation
- Python Integrations - Pandas, scikit-learn, Jupyter, Airflow, dbt
- Database Connectors - Production database connectivity
- Apache Arrow Integration - Columnar processing guide
- CLI Usage Guide - Complete CLI reference
Developer Guides
- Development Guide - Complete setup and contribution guide
- Performance Guide - Optimization and benchmarking
- Performance Benchmarks - Benchmark results and methodology
License
Licensed under the MIT License. See LICENSE for details.