dataprof 0.4.82

High-performance data profiler with ISO 8000/25012 quality metrics for CSV, JSON/JSONL, and Parquet files

A fast, reliable data quality assessment tool built in Rust. Analyze datasets with 20x better memory efficiency than pandas, unlimited file streaming, and comprehensive ISO 8000/25012 compliant quality checks across 5 dimensions: Completeness, Consistency, Uniqueness, Accuracy, and Timeliness. Full Python bindings and production database connectivity included.
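
To make the dimensions concrete, here is a hand-rolled sketch of how two of them (completeness and uniqueness) can be computed over a handful of records. This is an illustration of the concepts only; dataprof's own formulas may differ in detail.

```python
# Illustrative computation of two of the five quality dimensions
# (completeness and uniqueness); not dataprof's actual implementation.
rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": None},            # incomplete record
    {"id": 2, "email": "b@example.com"}, # duplicate key
]

# Completeness: share of records with no missing values.
complete = sum(all(v is not None for v in r.values()) for r in rows)
completeness = 100 * complete / len(rows)

# Uniqueness: share of distinct values in the key column.
ids = [r["id"] for r in rows]
uniqueness = 100 * len(set(ids)) / len(ids)

print(f"Completeness: {completeness:.1f}%")  # 66.7%
print(f"Uniqueness: {uniqueness:.1f}%")      # 66.7%
```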

Automatic Pattern Detection - Identifies 16+ common data patterns including emails, phone numbers, IP addresses, coordinates, IBAN, file paths, and more.
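
As a rough illustration of what pattern detection means here, the sketch below matches a column's values against a few simplified regexes. These patterns are deliberately naive stand-ins, not dataprof's actual detection rules.

```python
import re

# Simplified regexes for a few of the pattern types dataprof detects.
# Real-world validation (especially for IBAN and IPv4) is stricter.
PATTERNS = {
    "email": re.compile(r"^[\w.+-]+@[\w-]+\.[\w.-]+$"),
    "ipv4": re.compile(r"^(\d{1,3}\.){3}\d{1,3}$"),
    "iban": re.compile(r"^[A-Z]{2}\d{2}[A-Z0-9]{11,30}$"),
}

def detect_pattern(values):
    """Return the first pattern that matches every non-empty value, if any."""
    non_empty = [v for v in values if v]
    for name, rx in PATTERNS.items():
        if non_empty and all(rx.match(v) for v in non_empty):
            return name
    return None

print(detect_pattern(["alice@example.com", "bob@test.org"]))  # email
print(detect_pattern(["192.168.0.1", "10.0.0.255"]))          # ipv4
```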

Perfect for data scientists, engineers, analysts, and anyone working with data who needs quick, reliable quality insights.

Privacy & Transparency

DataProf processes all data locally on your machine. Zero telemetry, zero external data transmission.

Read exactly what DataProf analyzes →

  • 100% local processing - your data never leaves your machine
  • No telemetry or tracking
  • Open source & fully auditable
  • Read-only database access (when using DB features)

Complete transparency: Every metric, calculation, and data point is documented with source code references for independent verification.

CI/CD Integration

Automate data quality checks in your workflows with our GitHub Action:

- name: DataProf Quality Check
  uses: AndreaBozzo/dataprof-actions@v1
  with:
    file: 'data/dataset.csv'
    quality-threshold: 80
    fail-on-issues: true
    # Batch mode (NEW)
    recursive: true
    output-html: 'quality-report.html'

Get the Action →

  • Zero setup - works out of the box
  • ISO 8000/25012 compliant - industry-standard quality metrics
  • Batch processing - analyze entire directories recursively
  • Flexible - customizable thresholds and output formats
  • Fast - typically completes in under 2 minutes

Perfect for ensuring data quality in pipelines, validating data integrity, or generating automated quality reports.

Quick Start

Installation

# Install from crates.io (recommended)
cargo install dataprof

# Or build from source
git clone https://github.com/AndreaBozzo/dataprof
cd dataprof
cargo install --path .

That's it! Now you can use dataprof-cli from anywhere.

Basic Usage

# Analyze a CSV file
dataprof-cli analyze data.csv

# Get detailed analysis
dataprof-cli analyze data.csv --detailed

# Generate HTML report
dataprof-cli report data.csv -o report.html

# Analyze Parquet files (requires --features parquet)
dataprof-cli analyze data.parquet

More Features

# Batch process entire directory
dataprof-cli batch /data/folder --recursive --parallel

# Database profiling
dataprof-cli database postgres://user:pass@host/db --table users

# Benchmark engines
dataprof-cli benchmark data.csv

# Streaming mode for large files
dataprof-cli analyze large_file.csv --streaming

# JSON output for automation
dataprof-cli analyze data.csv --format json

DataProf Batch Report

Need help? Run dataprof-cli --help or dataprof-cli <command> --help for detailed options.
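
For automation, the JSON output can feed a simple quality gate. The sketch below assumes a top-level `quality_score` key, which is a guess at the schema rather than a documented contract; inspect a real report from your installed version before relying on it.

```python
import json

# Hypothetical quality gate over dataprof's JSON output. The
# "quality_score" key is an assumption: check the JSON your
# installed version actually emits before using this in CI.
THRESHOLD = 80.0

def gate(report_json: str, threshold: float = THRESHOLD) -> bool:
    """Return True if the report's score meets the threshold."""
    report = json.loads(report_json)
    return report.get("quality_score", 0.0) >= threshold

# Stubbed reports instead of live dataprof runs:
print(gate('{"quality_score": 92.5}'))  # True
print(gate('{"quality_score": 41.0}'))  # False
```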

Python Bindings

pip install dataprof

import dataprof

# Comprehensive quality analysis (ISO 8000/25012 compliant)
report = dataprof.analyze_csv_with_quality("data.csv")
print(f"Quality score: {report.quality_score():.1f}%")

# Access individual quality dimensions
metrics = report.data_quality_metrics
print(f"Completeness: {metrics.complete_records_ratio:.1f}%")
print(f"Consistency: {metrics.data_type_consistency:.1f}%")
print(f"Uniqueness: {metrics.key_uniqueness:.1f}%")

# Batch processing
result = dataprof.batch_analyze_directory("/data", recursive=True)
print(f"Processed {result.processed_files} files at {result.files_per_second:.1f} files/sec")

# Async database profiling (requires python-async feature)
import asyncio

async def profile_db():
    result = await dataprof.profile_database_async(
        "postgresql://user:pass@localhost/db",
        "SELECT * FROM users LIMIT 1000",
        batch_size=1000,
        calculate_quality=True
    )
    print(f"Quality score: {result['quality'].overall_score:.1%}")

asyncio.run(profile_db())

Note: Async database profiling requires building with --features python-async,database,postgres (or mysql/sqlite). See Async Support below.

Full Python API Documentation →

Rust Library

cargo add dataprof

use dataprof::*;

// High-performance Arrow processing for large files (>100MB)
// Requires compilation with: cargo build --features arrow
#[cfg(feature = "arrow")]
let profiler = DataProfiler::columnar();
#[cfg(feature = "arrow")]
let report = profiler.analyze_csv_file("large_dataset.csv")?;

// Standard adaptive profiling (recommended for most use cases)
let profiler = DataProfiler::auto();
let report = profiler.analyze_file("dataset.csv")?;

Development

Want to contribute or build from source? Here's what you need:

Prerequisites

  • Rust (latest stable via rustup)
  • Docker (for database testing)

Quick Setup

git clone https://github.com/AndreaBozzo/dataprof.git
cd dataprof
cargo build --release  # Build the project
docker-compose -f .devcontainer/docker-compose.yml up -d  # Start test databases

Feature Flags

dataprof uses optional features to keep compile times fast and binaries lean:

# Minimal build (CSV/JSON only, ~60s compile)
cargo build --release

# With Apache Arrow (columnar processing, ~90s compile)
cargo build --release --features arrow

# With Parquet support (requires arrow, ~95s compile)
cargo build --release --features parquet

# With database connectors
cargo build --release --features postgres,mysql,sqlite

# With Python async support (for async database profiling)
maturin develop --features python-async,database,postgres

# All features (full functionality, ~130s compile)
cargo build --release --all-features

When to use Arrow?

  • ✅ Files > 100MB with many columns (>20)
  • ✅ Columnar data with uniform types
  • ✅ Need maximum throughput (up to 13x faster)
  • ❌ Small files (<10MB) - standard engine is faster
  • ❌ Mixed/messy data - the streaming engine handles it better
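
The guidance above can be summarized as a small heuristic. The thresholds below mirror the bullet points; dataprof's auto mode may weigh more signals than these two.

```python
# Rough sketch of the engine choice implied by the guidance above.
# Illustrative only: dataprof's auto-selection logic is more nuanced.
def pick_engine(file_size_mb: float, n_columns: int) -> str:
    if file_size_mb > 100 and n_columns > 20:
        return "arrow"       # large, wide, columnar: batch processing wins
    if file_size_mb < 10:
        return "standard"    # small files: lower overhead wins
    return "streaming"       # large or messy data: bounded memory

print(pick_engine(500, 40))  # arrow
print(pick_engine(5, 8))     # standard
print(pick_engine(250, 10))  # streaming
```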

When to use Parquet?

  • ✅ Analytics workloads with columnar data
  • ✅ Data lake architectures
  • ✅ Integration with Spark, Pandas, PyArrow
  • ✅ Efficient storage and compression
  • ✅ Type-safe schema preservation

Async Support

DataProf supports asynchronous operations for non-blocking database profiling, both in Rust and Python.

Rust Async (Database Features)

Database connectors are fully async and use tokio runtime:

use dataprof::database::{DatabaseConfig, profile_database};

#[tokio::main]
async fn main() -> Result<()> {
    let config = DatabaseConfig {
        connection_string: "postgresql://localhost/mydb".to_string(),
        batch_size: 10000,
        ..Default::default()
    };

    let report = profile_database(config, "SELECT * FROM users").await?;
    println!("Profiled {} rows", report.total_rows);
    Ok(())
}

Available async features:

  • ✅ Non-blocking database queries
  • ✅ Concurrent query execution
  • ✅ Streaming for large result sets
  • ✅ Connection pooling with SQLx
  • ✅ Retry logic with exponential backoff
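
Retry with exponential backoff is a generic strategy; a minimal Python sketch of the idea follows (dataprof's Rust implementation differs in detail, and the delay parameters here are arbitrary).

```python
import random
import time

# Generic exponential-backoff retry, illustrating the strategy named
# above; not dataprof's actual Rust implementation.
def retry(op, attempts=5, base_delay=0.05, max_delay=1.0):
    for attempt in range(attempts):
        try:
            return op()
        except ConnectionError:
            if attempt == attempts - 1:
                raise
            # Delay doubles each attempt; jitter spreads out retries.
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(delay * random.uniform(0.5, 1.0))

calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "ok"

print(retry(flaky))  # ok (after two transient failures)
```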

Python Async (python-async Feature)

Enable async Python bindings for database profiling:

# Build with async support
maturin develop --features python-async,database,postgres

import asyncio
import dataprof

async def main():
    # Test connection
    connected = await dataprof.test_connection_async(
        "postgresql://user:pass@localhost/db"
    )

    # Get table schema
    columns = await dataprof.get_table_schema_async(
        "postgresql://user:pass@localhost/db",
        "users"
    )

    # Count rows
    count = await dataprof.count_table_rows_async(
        "postgresql://user:pass@localhost/db",
        "users"
    )

    # Profile database query
    result = await dataprof.profile_database_async(
        "postgresql://user:pass@localhost/db",
        "SELECT * FROM users LIMIT 1000",
        batch_size=1000,
        calculate_quality=True
    )

    print(f"Quality score: {result['quality'].overall_score:.1%}")

asyncio.run(main())

Benefits:

  • ✅ Non-blocking I/O for better performance
  • ✅ Concurrent database profiling
  • ✅ Integration with async Python frameworks (FastAPI, aiohttp, etc.)
  • ✅ Efficient resource usage

See also: examples/async_database_example.py for complete examples.

Common Development Tasks

cargo test          # Run all tests
cargo bench         # Performance benchmarks
cargo fmt           # Format code
cargo clippy        # Code quality checks

Documentation

Privacy & Transparency

User Guides

Developer Guides

License

Licensed under the MIT License. See LICENSE for details.