dataprof 0.4.82

High-performance data profiler with ISO 8000/25012 quality metrics for CSV, JSON/JSONL, and Parquet files

A fast, reliable data quality assessment tool built in Rust. Analyze datasets with 20x better memory efficiency than pandas, unlimited file streaming, and comprehensive ISO 8000/25012 compliant quality checks across 5 dimensions: Completeness, Consistency, Uniqueness, Accuracy, and Timeliness. Full Python bindings and production database connectivity included.
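
To make the dimensions concrete, here is a hand-rolled sketch of how two of them (completeness and uniqueness) can be computed over a handful of records. This is an illustration of the concepts only; dataprof's own formulas may differ in detail.

```python
# Illustrative computation of two of the five quality dimensions
# (completeness and uniqueness); not dataprof's actual implementation.
rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": None},            # incomplete record
    {"id": 2, "email": "b@example.com"}, # duplicate key
]

# Completeness: share of records with no missing values.
complete = sum(all(v is not None for v in r.values()) for r in rows)
completeness = 100 * complete / len(rows)

# Uniqueness: share of distinct values in the key column.
ids = [r["id"] for r in rows]
uniqueness = 100 * len(set(ids)) / len(ids)

print(f"Completeness: {completeness:.1f}%")  # 66.7%
print(f"Uniqueness: {uniqueness:.1f}%")      # 66.7%
```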

Automatic Pattern Detection - Identifies 16+ common data patterns including emails, phone numbers, IP addresses, coordinates, IBAN, file paths, and more.
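
As a rough illustration of what pattern detection means here, the sketch below matches a column's values against a few simplified regexes. These patterns are deliberately naive stand-ins, not dataprof's actual detection rules.

```python
import re

# Simplified regexes for a few of the pattern types dataprof detects.
# Real-world validation (especially for IBAN and IPv4) is stricter.
PATTERNS = {
    "email": re.compile(r"^[\w.+-]+@[\w-]+\.[\w.-]+$"),
    "ipv4": re.compile(r"^(\d{1,3}\.){3}\d{1,3}$"),
    "iban": re.compile(r"^[A-Z]{2}\d{2}[A-Z0-9]{11,30}$"),
}

def detect_pattern(values):
    """Return the first pattern that matches every non-empty value, if any."""
    non_empty = [v for v in values if v]
    for name, rx in PATTERNS.items():
        if non_empty and all(rx.match(v) for v in non_empty):
            return name
    return None

print(detect_pattern(["alice@example.com", "bob@test.org"]))  # email
print(detect_pattern(["192.168.0.1", "10.0.0.255"]))          # ipv4
```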

Perfect for data scientists, engineers, analysts, and anyone working with data who needs quick, reliable quality insights.

Privacy & Transparency

DataProf processes all data locally on your machine. Zero telemetry, zero external data transmission.

Read exactly what DataProf analyzes →

  • 100% local processing - your data never leaves your machine
  • No telemetry or tracking
  • Open source & fully auditable
  • Read-only database access (when using DB features)

Complete transparency: Every metric, calculation, and data point is documented with source code references for independent verification.

CI/CD Integration

Automate data quality checks in your workflows with our GitHub Action:

- name: DataProf Quality Check
  uses: AndreaBozzo/dataprof-actions@v1
  with:
    file: 'data/dataset.csv'
    quality-threshold: 80
    fail-on-issues: true
    # Batch mode (NEW)
    recursive: true
    output-html: 'quality-report.html'

Get the Action →

  • Zero setup - works out of the box
  • ISO 8000/25012 compliant - industry-standard quality metrics
  • Batch processing - analyze entire directories recursively
  • Flexible - customizable thresholds and output formats
  • Fast - typically completes in under 2 minutes

Perfect for ensuring data quality in pipelines, validating data integrity, or generating automated quality reports.

Quick Start

Installation

# Install from crates.io (recommended)
cargo install dataprof

# Or build from source
git clone https://github.com/AndreaBozzo/dataprof
cd dataprof
cargo install --path .

That's it! Now you can use dataprof-cli from anywhere.

Basic Usage

# Analyze a CSV file
dataprof-cli analyze data.csv

# Get detailed analysis
dataprof-cli analyze data.csv --detailed

# Generate HTML report
dataprof-cli report data.csv -o report.html

# Analyze Parquet files (requires --features parquet)
dataprof-cli analyze data.parquet

More Features

# Batch process entire directory
dataprof-cli batch /data/folder --recursive --parallel

# Database profiling
dataprof-cli database postgres://user:pass@host/db --table users

# Benchmark engines
dataprof-cli benchmark data.csv

# Streaming mode for large files
dataprof-cli analyze large_file.csv --streaming

# JSON output for automation
dataprof-cli analyze data.csv --format json

DataProf Batch Report

Need help? Run dataprof-cli --help or dataprof-cli <command> --help for detailed options.
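
For automation, the JSON output can feed a simple quality gate. The sketch below assumes a top-level `quality_score` key, which is a guess at the schema rather than a documented contract; inspect a real report from your installed version before relying on it.

```python
import json

# Hypothetical quality gate over dataprof's JSON output. The
# "quality_score" key is an assumption: check the JSON your
# installed version actually emits before using this in CI.
THRESHOLD = 80.0

def gate(report_json: str, threshold: float = THRESHOLD) -> bool:
    """Return True if the report's score meets the threshold."""
    report = json.loads(report_json)
    return report.get("quality_score", 0.0) >= threshold

# Stubbed reports instead of live dataprof runs:
print(gate('{"quality_score": 92.5}'))  # True
print(gate('{"quality_score": 41.0}'))  # False
```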

Python Bindings

pip install dataprof

import dataprof

# Comprehensive quality analysis (ISO 8000/25012 compliant)
report = dataprof.analyze_csv_with_quality("data.csv")
print(f"Quality score: {report.quality_score():.1f}%")

# Access individual quality dimensions
metrics = report.data_quality_metrics
print(f"Completeness: {metrics.complete_records_ratio:.1f}%")
print(f"Consistency: {metrics.data_type_consistency:.1f}%")
print(f"Uniqueness: {metrics.key_uniqueness:.1f}%")

# Batch processing
result = dataprof.batch_analyze_directory("/data", recursive=True)
print(f"Processed {result.processed_files} files at {result.files_per_second:.1f} files/sec")

# Async database profiling (requires python-async feature)
import asyncio

async def profile_db():
    result = await dataprof.profile_database_async(
        "postgresql://user:pass@localhost/db",
        "SELECT * FROM users LIMIT 1000",
        batch_size=1000,
        calculate_quality=True
    )
    print(f"Quality score: {result['quality'].overall_score:.1%}")

asyncio.run(profile_db())

Note: Async database profiling requires building with --features python-async,database,postgres (or mysql/sqlite). See Async Support below.

Full Python API Documentation →

Rust Library

cargo add dataprof

use dataprof::*;

// High-performance Arrow processing for large files (>100MB)
// Requires compilation with: cargo build --features arrow
#[cfg(feature = "arrow")]
let profiler = DataProfiler::columnar();
#[cfg(feature = "arrow")]
let report = profiler.analyze_csv_file("large_dataset.csv")?;

// Standard adaptive profiling (recommended for most use cases)
let profiler = DataProfiler::auto();
let report = profiler.analyze_file("dataset.csv")?;

Development

Want to contribute or build from source? Here's what you need:

Prerequisites

  • Rust (latest stable via rustup)
  • Docker (for database testing)

Quick Setup

git clone https://github.com/AndreaBozzo/dataprof.git
cd dataprof
cargo build --release  # Build the project
docker-compose -f .devcontainer/docker-compose.yml up -d  # Start test databases

Feature Flags

dataprof uses optional features to keep compile times fast and binaries lean:

# Minimal build (CSV/JSON only, ~60s compile)
cargo build --release

# With Apache Arrow (columnar processing, ~90s compile)
cargo build --release --features arrow

# With Parquet support (requires arrow, ~95s compile)
cargo build --release --features parquet

# With database connectors
cargo build --release --features postgres,mysql,sqlite

# With Python async support (for async database profiling)
maturin develop --features python-async,database,postgres

# All features (full functionality, ~130s compile)
cargo build --release --all-features

When to use Arrow?

  • ✅ Files > 100MB with many columns (>20)
  • ✅ Columnar data with uniform types
  • ✅ Need maximum throughput (up to 13x faster)
  • ❌ Small files (<10MB) - standard engine is faster
  • ❌ Mixed/messy data - the streaming engine handles it better
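
The guidance above can be summarized as a small heuristic. The thresholds below mirror the bullet points; dataprof's auto mode may weigh more signals than these two.

```python
# Rough sketch of the engine choice implied by the guidance above.
# Illustrative only: dataprof's auto-selection logic is more nuanced.
def pick_engine(file_size_mb: float, n_columns: int) -> str:
    if file_size_mb > 100 and n_columns > 20:
        return "arrow"       # large, wide, columnar: batch processing wins
    if file_size_mb < 10:
        return "standard"    # small files: lower overhead wins
    return "streaming"       # large or messy data: bounded memory

print(pick_engine(500, 40))  # arrow
print(pick_engine(5, 8))     # standard
print(pick_engine(250, 10))  # streaming
```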

When to use Parquet?

  • ✅ Analytics workloads with columnar data
  • ✅ Data lake architectures
  • ✅ Integration with Spark, Pandas, PyArrow
  • ✅ Efficient storage and compression
  • ✅ Type-safe schema preservation

Async Support

DataProf supports asynchronous operations for non-blocking database profiling, both in Rust and Python.

Rust Async (Database Features)

Database connectors are fully async and use tokio runtime:

use dataprof::database::{DatabaseConfig, profile_database};

#[tokio::main]
async fn main() -> Result<()> {
    let config = DatabaseConfig {
        connection_string: "postgresql://localhost/mydb".to_string(),
        batch_size: 10000,
        ..Default::default()
    };

    let report = profile_database(config, "SELECT * FROM users").await?;
    println!("Profiled {} rows", report.total_rows);
    Ok(())
}

Available async features:

  • ✅ Non-blocking database queries
  • ✅ Concurrent query execution
  • ✅ Streaming for large result sets
  • ✅ Connection pooling with SQLx
  • ✅ Retry logic with exponential backoff
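
Retry with exponential backoff is a generic strategy; a minimal Python sketch of the idea follows (dataprof's Rust implementation differs in detail, and the delay parameters here are arbitrary).

```python
import random
import time

# Generic exponential-backoff retry, illustrating the strategy named
# above; not dataprof's actual Rust implementation.
def retry(op, attempts=5, base_delay=0.05, max_delay=1.0):
    for attempt in range(attempts):
        try:
            return op()
        except ConnectionError:
            if attempt == attempts - 1:
                raise
            # Delay doubles each attempt; jitter spreads out retries.
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(delay * random.uniform(0.5, 1.0))

calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "ok"

print(retry(flaky))  # ok (after two transient failures)
```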

Python Async (python-async Feature)

Enable async Python bindings for database profiling:

# Build with async support
maturin develop --features python-async,database,postgres

import asyncio
import dataprof

async def main():
    # Test connection
    connected = await dataprof.test_connection_async(
        "postgresql://user:pass@localhost/db"
    )

    # Get table schema
    columns = await dataprof.get_table_schema_async(
        "postgresql://user:pass@localhost/db",
        "users"
    )

    # Count rows
    count = await dataprof.count_table_rows_async(
        "postgresql://user:pass@localhost/db",
        "users"
    )

    # Profile database query
    result = await dataprof.profile_database_async(
        "postgresql://user:pass@localhost/db",
        "SELECT * FROM users LIMIT 1000",
        batch_size=1000,
        calculate_quality=True
    )

    print(f"Quality score: {result['quality'].overall_score:.1%}")

asyncio.run(main())

Benefits:

  • ✅ Non-blocking I/O for better performance
  • ✅ Concurrent database profiling
  • ✅ Integration with async Python frameworks (FastAPI, aiohttp, etc.)
  • ✅ Efficient resource usage

See also: examples/async_database_example.py for complete examples.

Common Development Tasks

cargo test          # Run all tests
cargo bench         # Performance benchmarks
cargo fmt           # Format code
cargo clippy        # Code quality checks

Documentation

Privacy & Transparency

User Guides

Developer Guides

License

Licensed under the MIT License. See LICENSE for details.