dataprof
A fast, reliable data quality assessment tool built in Rust. Analyze datasets with 20x better memory efficiency than pandas, stream files of unlimited size, and run comprehensive ISO 8000/25012-compliant quality checks across five dimensions: Completeness, Consistency, Uniqueness, Accuracy, and Timeliness. Full Python bindings and production database connectivity are included.
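Two of the dimensions above, Completeness and Uniqueness, can be sketched in a few lines of plain Python. This is an illustration of what the metrics measure, not dataprof's implementation:

```python
def completeness(values):
    """Fraction of non-missing values (1.0 = no gaps) - the Completeness dimension."""
    non_null = [v for v in values if v not in (None, "")]
    return len(non_null) / len(values) if values else 1.0

def uniqueness(values):
    """Fraction of distinct values among non-missing entries - the Uniqueness dimension."""
    non_null = [v for v in values if v not in (None, "")]
    return len(set(non_null)) / len(non_null) if non_null else 1.0

col = ["a", "b", "b", None, "c"]
print(completeness(col))  # 0.8 -> 4 of 5 values are present
print(uniqueness(col))    # 0.75 -> 3 distinct values among the 4 present
```

Column-level metrics like these feed the overall quality score that thresholds (for example in CI checks) are compared against.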
Automatic Pattern Detection - Identifies 16+ common data patterns including emails, phone numbers, IP addresses, coordinates, IBAN, file paths, and more.
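Pattern detection of this kind boils down to matching a column's values against a library of regexes and reporting a pattern once enough values match. A hypothetical two-pattern sketch (the patterns and threshold are illustrative, not dataprof's actual rules):

```python
import re

# Two illustrative patterns; dataprof ships many more (emails, phones, IPs, IBAN, ...).
PATTERNS = {
    "email": re.compile(r"^[\w.+-]+@[\w-]+\.[\w.]+$"),
    "ipv4": re.compile(r"^(\d{1,3}\.){3}\d{1,3}$"),
}

def detect_pattern(values, threshold=0.9):
    """Return the first pattern matched by at least `threshold` of the values."""
    for name, rx in PATTERNS.items():
        if values and sum(1 for v in values if rx.match(v)) / len(values) >= threshold:
            return name
    return None

print(detect_pattern(["a@b.com", "x@y.org"]))       # email
print(detect_pattern(["10.0.0.1", "192.168.1.1"]))  # ipv4
print(detect_pattern(["hello", "world"]))           # None
```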
Perfect for data scientists, engineers, analysts, and anyone working with data who needs quick, reliable quality insights.
Privacy & Transparency
DataProf processes all data locally on your machine. Zero telemetry, zero external data transmission.
Read exactly what DataProf analyzes →
- 100% local processing - your data never leaves your machine
- No telemetry or tracking
- Open source & fully auditable
- Read-only database access (when using DB features)
Complete transparency: Every metric, calculation, and data point is documented with source code references for independent verification.
CI/CD Integration
Automate data quality checks in your workflows with our GitHub Action:
- name: DataProf Quality Check
  uses: AndreaBozzo/dataprof-actions@v1
  with:
    file: 'data/dataset.csv'
    quality-threshold: 80
    fail-on-issues: true
    # Batch mode (NEW)
    recursive: true
    output-html: 'quality-report.html'
- Zero setup - works out of the box
- ISO 8000/25012 compliant - industry-standard quality metrics
- Batch processing - analyze entire directories recursively
- Flexible - customizable thresholds and output formats
- Fast - typically completes in under 2 minutes
Perfect for ensuring data quality in pipelines, validating data integrity, or generating automated quality reports.
Quick Start
Installation
# Install from crates.io (recommended)
cargo install dataprof

# Or build from source
git clone https://github.com/AndreaBozzo/dataprof
cd dataprof
cargo install --path .
That's it! Now you can use dataprof-cli from anywhere.
Basic Usage
# NOTE: subcommands and flags below are indicative; run `dataprof-cli --help`
# for the exact syntax.

# Analyze a CSV file
dataprof-cli analyze data.csv

# Get detailed analysis
dataprof-cli analyze data.csv --detailed

# Generate HTML report
dataprof-cli report data.csv --html report.html

# Analyze Parquet files (requires --features parquet)
dataprof-cli analyze data.parquet
More Features
# NOTE: commands below are indicative; run `dataprof-cli <command> --help` for details.

# Batch process entire directory
dataprof-cli batch data/ --recursive

# Database profiling
dataprof-cli database "postgresql://localhost/mydb" --table users

# Benchmark engines
dataprof-cli benchmark data.csv

# Streaming mode for large files
dataprof-cli analyze large.csv --streaming

# JSON output for automation
dataprof-cli analyze data.csv --format json

Need help? Run dataprof-cli --help or dataprof-cli <command> --help for detailed options.
Python Bindings
import dataprof

# NOTE: function and attribute names below are indicative; see the full
# Python API documentation for exact signatures.

# Comprehensive quality analysis (ISO 8000/25012 compliant)
report = dataprof.analyze_csv_with_quality("data.csv")

# Access individual quality dimensions
completeness = report.quality.completeness

# Batch processing
results = dataprof.batch_analyze_directory("data/")

# Async database profiling (requires python-async feature)
profile = await dataprof.profile_database_async("postgresql://localhost/mydb", "SELECT * FROM users")
Note: Async database profiling requires building with
--features python-async,database,postgres (or mysql/sqlite). See Async Support below.
Full Python API Documentation →
Rust Library
// NOTE: constructor and method names below are reconstructed from the docs;
// check the crate API reference for exact signatures.
use dataprof::*;

// High-performance Arrow processing for large files (>100MB)
// Requires compilation with: cargo build --features arrow
let profiler = DataProfiler::columnar();
let report = profiler.analyze_csv_file(&path)?;

// Standard adaptive profiling (recommended for most use cases)
let profiler = DataProfiler::auto();
let report = profiler.analyze_file(&path)?;
Development
Want to contribute or build from source? Here's what you need:
Prerequisites
- Rust (latest stable via rustup)
- Docker (for database testing)
Quick Setup
Feature Flags
dataprof uses optional features to keep compile times fast and binaries lean:
# Minimal build (CSV/JSON only, ~60s compile)
cargo build --release

# With Apache Arrow (columnar processing, ~90s compile)
cargo build --release --features arrow

# With Parquet support (requires arrow, ~95s compile)
cargo build --release --features parquet

# With database connectors
cargo build --release --features database,postgres,mysql,sqlite

# With Python async support (for async database profiling)
cargo build --release --features python-async,database,postgres

# All features (full functionality, ~130s compile)
cargo build --release --all-features
When to use Arrow?
- ✅ Files > 100MB with many columns (>20)
- ✅ Columnar data with uniform types
- ✅ Need maximum throughput (up to 13x faster)
- ❌ Small files (<10MB) - standard engine is faster
- ❌ Mixed/messy data - streaming engine handles better
When to use Parquet?
- ✅ Analytics workloads with columnar data
- ✅ Data lake architectures
- ✅ Integration with Spark, Pandas, PyArrow
- ✅ Efficient storage and compression
- ✅ Type-safe schema preservation
Async Support
DataProf supports asynchronous operations for non-blocking database profiling, both in Rust and Python.
Rust Async (Database Features)
Database connectors are fully async and run on the tokio runtime:

// NOTE: module paths and names below are a sketch; see the Database
// Connectors guide for the exact API.
use dataprof::database::{DatabaseConfig, profile_database};

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let config = DatabaseConfig::new("postgresql://user:pass@localhost/mydb");
    let report = profile_database(&config, "SELECT * FROM users").await?;
    println!("profiled {} columns", report.column_profiles.len());
    Ok(())
}
Available async features:
- ✅ Non-blocking database queries
- ✅ Concurrent query execution
- ✅ Streaming for large result sets
- ✅ Connection pooling with SQLx
- ✅ Retry logic with exponential backoff
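The retry behaviour in the last bullet can be sketched with plain asyncio. The attempt count, delays, and fake query below are illustrative, not dataprof's actual policy:

```python
import asyncio

async def with_retry(op, attempts=4, base_delay=0.01):
    """Run `op`, retrying transient failures with exponentially growing delays."""
    for attempt in range(attempts):
        try:
            return await op()
        except ConnectionError:
            if attempt == attempts - 1:
                raise
            # Delay doubles on each retry: base, 2*base, 4*base, ...
            await asyncio.sleep(base_delay * (2 ** attempt))

calls = 0

async def flaky_query():
    """Stand-in for a database call that fails twice, then succeeds."""
    global calls
    calls += 1
    if calls < 3:
        raise ConnectionError("transient failure")
    return "42 rows"

print(asyncio.run(with_retry(flaky_query)))  # 42 rows (after two retries)
```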
Python Async (python-async Feature)
Enable async Python bindings for database profiling:
# Build with async support (maturin shown; use your usual bindings build tool)
maturin develop --features python-async,database,postgres

# NOTE: function names below are indicative; see the Python API reference
# for exact signatures.
conn = "postgresql://user:pass@localhost/mydb"

# Test connection
ok = await dataprof.test_connection_async(conn)

# Get table schema
schema = await dataprof.get_table_schema_async(conn, "users")

# Count rows
row_count = await dataprof.count_rows_async(conn, "users")

# Profile database query
profile = await dataprof.profile_query_async(conn, "SELECT * FROM users")
Benefits:
- ✅ Non-blocking I/O for better performance
- ✅ Concurrent database profiling
- ✅ Integration with async Python frameworks (FastAPI, aiohttp, etc.)
- ✅ Efficient resource usage
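The concurrency benefit is easy to see with a plain asyncio sketch; `profile_table` here is a stand-in for an async dataprof call, not the real API:

```python
import asyncio
import time

async def profile_table(name, delay):
    """Stand-in for an async profiling call; the sleep simulates database I/O."""
    await asyncio.sleep(delay)
    return f"{name}: ok"

async def main():
    start = time.monotonic()
    # gather() runs the three profiling tasks concurrently, so the
    # waits overlap instead of summing.
    results = await asyncio.gather(
        profile_table("users", 0.05),
        profile_table("orders", 0.05),
        profile_table("events", 0.05),
    )
    return results, time.monotonic() - start

results, elapsed = asyncio.run(main())
print(results)  # ['users: ok', 'orders: ok', 'events: ok']
```

Three sequential calls would take roughly 0.15s here; run concurrently they finish in about 0.05s.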
See also: examples/async_database_example.py for complete examples.
Common Development Tasks
Documentation
Privacy & Transparency
- What DataProf Does - Complete transparency guide with source code verification
User Guides
- Python API Reference - Full Python API documentation
- Python Integrations - Pandas, scikit-learn, Jupyter, Airflow, dbt
- Database Connectors - Production database connectivity
- Apache Arrow Integration - Columnar processing guide
- CLI Usage Guide - Complete CLI reference
Developer Guides
- Development Guide - Complete setup and contribution guide
- Performance Guide - Optimization and benchmarking
- Performance Benchmarks - Benchmark results and methodology
License
Licensed under the MIT License. See LICENSE for details.