dataprof

DISCLAIMER FOR HUMAN READERS

dataprof works, but it is still in early-stage development, so you may run into bugs, minor or major, while exploring your data quality.

Please report bugs by opening an issue. For security issues, email the maintainer instead.

Thanks for your time here!

dataprof is a high-performance data quality and ML readiness assessment library built in Rust. It delivers 20x better memory efficiency than pandas (see Performance Benchmarks under Documentation), streams files of any size, runs 30+ automated quality checks, and provides a comprehensive ML readiness assessment. New in v0.4.6: it generates ready-to-use Python code snippets for each ML recommendation. Full Python bindings and production database connectivity are included.

Quick Start

Python

pip install dataprof

import dataprof

# ML readiness assessment with actionable code snippets
ml_score = dataprof.ml_readiness_score("data.csv")
print(f"ML Readiness: {ml_score.readiness_level} ({ml_score.overall_score:.1f}%)")

# NEW: Get ready-to-use preprocessing code
for rec in ml_score.recommendations:
    if rec.code_snippet:
        print(f"📦 {rec.framework} code for {rec.category}")
        print(rec.code_snippet)

# Quality analysis with detailed reporting
report = dataprof.analyze_csv_with_quality("data.csv")
print(f"Quality score: {report.quality_score():.1f}%")

# Production database profiling
profiles = dataprof.analyze_database("postgresql://user:pass@host/db", "users")
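
The recommendation loop above prints each snippet as it goes. If you would rather collect the snippets into one reviewable script, a minimal sketch could look like the following. It relies only on the `code_snippet`, `framework`, and `category` attributes shown above; the helper name `collect_snippets` and the `SimpleNamespace` stand-ins are hypothetical and not part of dataprof.

```python
from types import SimpleNamespace


def collect_snippets(recommendations):
    """Join the non-empty code snippets into one commented script.

    Hypothetical helper, not part of dataprof: it only assumes each item
    exposes `code_snippet`, `framework`, and `category` attributes.
    """
    sections = []
    for rec in recommendations:
        if not rec.code_snippet:
            continue  # skip recommendations that carry no code
        header = f"# {rec.framework}: {rec.category}"
        sections.append(f"{header}\n{rec.code_snippet.rstrip()}\n")
    return "\n".join(sections)


# Stand-in objects so the sketch runs without dataprof installed;
# with the library you would pass ml_score.recommendations instead.
demo = [
    SimpleNamespace(framework="pandas", category="missing_values",
                    code_snippet="df = df.dropna()\n"),
    SimpleNamespace(framework="sklearn", category="scaling",
                    code_snippet=None),
]
script = collect_snippets(demo)
print(script)
```

Keeping the snippets in a single string makes it easy to review them before writing the result to a file or running any of it.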

Rust

cargo add dataprof --features arrow

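use dataprof::*;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // High-performance Arrow processing
    let profiler = DataProfiler::columnar();
    // `?` needs a function that returns Result; this assumes the crate's
    // error type converts into Box<dyn Error>
    let report = profiler.analyze_csv_file("large_dataset.csv")?;
    Ok(())
}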

CLI with Code Generation

# Generate ML readiness report with actionable code snippets
dataprof data.csv --ml-score --ml-code

# Generate complete Python preprocessing script
dataprof data.csv --ml-score --output-script preprocess.py
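
To call the CLI from a Python pipeline, the commands above can be assembled as an argument list. This is a sketch only: `build_command` is a hypothetical helper, it uses just the `--ml-score`, `--ml-code`, and `--output-script` flags shown above, and it assumes the `dataprof` binary is on your `PATH` when you run the result.

```python
def build_command(csv_path, ml_code=False, output_script=None):
    """Assemble a dataprof CLI invocation as an argument list.

    Hypothetical helper: uses only the flags shown in this README.
    """
    cmd = ["dataprof", csv_path, "--ml-score"]
    if ml_code:
        cmd.append("--ml-code")  # print code snippets with the report
    if output_script is not None:
        cmd += ["--output-script", output_script]  # write a full script
    return cmd


cmd = build_command("data.csv", output_script="preprocess.py")
print(" ".join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` executes it without going through a shell, which avoids quoting problems with file paths that contain spaces.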

Development

Prerequisites

Rust (latest stable via rustup)
Docker (for database testing)

Setup

git clone https://github.com/AndreaBozzo/dataprof.git
cd dataprof
cargo build --release  # Build project
docker-compose -f .devcontainer/docker-compose.yml up -d  # Start databases

Common Tasks

cargo test          # Run all tests
cargo bench         # Performance benchmarks
cargo fmt           # Format code
cargo clippy        # Code quality checks

Documentation

Development Guide - Complete setup and contribution guide
Python API Reference - Full Python API documentation
ML Features - Machine learning readiness assessment
Python Integrations - Pandas, scikit-learn, Jupyter, Airflow, dbt
Database Connectors - Production database connectivity
Performance Guide - Optimization and benchmarking
Apache Arrow Integration - Columnar processing guide
CLI Usage Guide - Complete CLI reference
Performance Benchmarks - Benchmark results and methodology

License

Licensed under GPL-3.0. See LICENSE for details.

dataprof 0.4.6