dataprof 0.4.84

High-performance data profiler with ISO 8000/25012 quality metrics for CSV, JSON/JSONL, and Parquet files
Documentation

20x faster than pandas with unlimited streaming for large files. ISO 8000/25012 compliant quality metrics, automatic pattern detection (emails, IPs, IBANs, etc.), and comprehensive statistics (mean, median, skewness, kurtosis). Available as CLI, Rust library, or Python package.
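To make these metrics concrete, here is an illustrative sketch, using only the Python standard library, of what a completeness score, an email-pattern scan, and the basic statistics mean. This is a toy model of the concepts, not dataprof's implementation:

```python
import re
import statistics

# Toy column: one missing entry, mostly email-like strings.
column = ["a@example.com", "b@example.com", None, "not-an-email"]

# Completeness (ISO 8000-style): share of non-null values.
non_null = [v for v in column if v is not None]
completeness = len(non_null) / len(column)  # 0.75

# Pattern detection: fraction of non-null values matching a simple email regex.
EMAIL = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
email_ratio = sum(bool(EMAIL.match(v)) for v in non_null) / len(non_null)

# Basic statistics on a numeric column.
nums = [1.0, 2.0, 2.0, 9.0]
mean, median = statistics.mean(nums), statistics.median(nums)

print(completeness, round(email_ratio, 2), mean, median)  # 0.75 0.67 3.5 2.0
```

dataprof computes these (plus skewness, kurtosis, and many more pattern types) natively in Rust, which is where the speed advantage comes from.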

🔒 Privacy First: 100% local processing, no telemetry, read-only DB access. See what dataprof analyzes →

Quick Start

CLI Installation

# Install from crates.io
cargo install dataprof

# Or use Python
pip install dataprof

CLI Usage

# Analyze a file
dataprof-cli analyze data.csv

# Generate HTML report
dataprof-cli report data.csv -o report.html

# Batch process directories
dataprof-cli batch /data/folder --recursive --parallel

# Database profiling
dataprof-cli database postgres://user:pass@host/db --table users

More options: dataprof-cli --help | Full CLI Guide

Python API

import dataprof

# Quality analysis (ISO 8000/25012 compliant)
report = dataprof.analyze_csv_with_quality("data.csv")
print(f"Quality score: {report.quality_score():.1f}%")

# Batch processing
result = dataprof.batch_analyze_directory("/data", recursive=True)

# Async database profiling
async def profile_db():
    result = await dataprof.profile_database_async(
        "postgresql://user:pass@localhost/db",
        "SELECT * FROM users",
        batch_size=1000,
        calculate_quality=True
    )
    return result
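Coroutines like `profile_db` above are driven with `asyncio.run()`. A minimal, runnable sketch using a stand-in coroutine, so no database connection is required:

```python
import asyncio

async def profile_stub():
    # Stand-in for the profile_db coroutine above; a real run would await
    # dataprof.profile_database_async(...) against a reachable database.
    await asyncio.sleep(0)
    return {"rows_profiled": 0}

# asyncio.run() drives the coroutine to completion and returns its result.
result = asyncio.run(profile_stub())
print(result)  # {'rows_profiled': 0}
```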

Python Documentation | Integrations (Pandas, scikit-learn, Jupyter, Airflow, dbt)

Rust Library

use dataprof::*;

// Adaptive profiling (recommended)
let profiler = DataProfiler::auto();
let report = profiler.analyze_file("dataset.csv")?;

// Arrow for large files (>100MB, requires --features arrow)
let profiler = DataProfiler::columnar();
let report = profiler.analyze_csv_file("large_dataset.csv")?;

Development

# Setup
git clone https://github.com/AndreaBozzo/dataprof.git
cd dataprof
cargo build --release

# Test databases (optional)
docker-compose -f .devcontainer/docker-compose.yml up -d

# Common tasks
cargo test          # Run tests
cargo bench         # Benchmarks
cargo clippy        # Linting

Development Guide | Performance Guide

Feature Flags

# Minimal (CSV/JSON only)
cargo build --release

# With Apache Arrow (large files >100MB)
cargo build --release --features arrow

# With Parquet support
cargo build --release --features parquet

# With databases
cargo build --release --features postgres,mysql,sqlite

# Python async support
maturin develop --features python-async,database,postgres

# All features
cargo build --release --all-features

When to use Arrow: large files (>100MB), many columns (>20), uniform types.
When to use Parquet: analytics, data lakes, Spark/Pandas integration.
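The Arrow recommendation is about memory layout: columnar storage keeps each column contiguous, so per-column statistics (the bulk of profiling work) scan a single array instead of hopping across row records. A toy illustration of the two layouts, unrelated to dataprof's actual Arrow internals:

```python
# Row-oriented: one dict per record; a column statistic touches every row object.
rows = [{"a": 1, "b": 10}, {"a": 2, "b": 20}, {"a": 3, "b": 30}]
mean_a_rowwise = sum(r["a"] for r in rows) / len(rows)

# Column-oriented (the Arrow layout): each column is one contiguous sequence,
# so a per-column statistic is one linear scan.
columns = {"a": [1, 2, 3], "b": [10, 20, 30]}
mean_a_columnar = sum(columns["a"]) / len(columns["a"])

print(mean_a_rowwise, mean_a_columnar)  # 2.0 2.0
```

The wider the file, the more the row-oriented version pays for loading data it does not need, which is why the column-count threshold matters alongside file size.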

Documentation

User Guides: CLI Reference | Python API | Python Integrations | Database Connectors | Apache Arrow

Developer: Development Guide | Performance Guide | Benchmarks

Privacy: What DataProf Does - Complete transparency with source verification

🤝 Contributing

We welcome contributions from everyone, whether you want to:

  • Fix a bug 🐛
  • Add a feature
  • Improve documentation 📚
  • Report an issue 📝

Quick Start for Contributors

  1. Fork & clone:

    git clone https://github.com/YOUR-USERNAME/dataprof.git
    cd dataprof
    
  2. Build & test:

    cargo build
    cargo test
    
  3. Create a feature branch:

    git checkout -b feature/your-feature-name
    
  4. Before submitting PR:

    cargo fmt --all
    cargo clippy --all --all-targets
    cargo test --all
    
  5. Submit a Pull Request with a clear description

📖 Full Contributing Guide →

Please read CONTRIBUTING.md for contribution guidelines and our Code of Conduct.

License

MIT License - See LICENSE for details.