dataprof
DISCLAIMER FOR HUMAN READERS
dataprof works, but it is still in early-stage development, so you may encounter bugs, minor or major, during your data-quality exploration journey.
Please report bugs by opening an issue. For security issues, email the maintainer instead.
Thanks for your time here!
A fast, reliable data quality and ML readiness assessment tool built in Rust. Analyze datasets with 20x better memory efficiency than pandas, unlimited file streaming, and 30+ automated quality checks. NEW in v0.4.61: Generate ready-to-use Python code snippets for each ML recommendation. Full Python bindings and production database connectivity included.
Perfect for data scientists, ML engineers, and anyone working with data who needs quick, reliable quality insights.
Try Online
No installation required! Test dataprof instantly with our web interface:
- Drag & drop your CSV (up to 50MB)
- Get ML readiness score in ~10 seconds
- Powered by dataprof v0.4.61 core engine
- Embeddable badges for your README
CI/CD Integration
Automate data quality checks in your workflows with our GitHub Action:
- name: DataProf ML Readiness Check
  uses: AndreaBozzo/dataprof-actions@v1
  with:
    file: 'data/dataset.csv'
    ml-threshold: 80
    fail-on-issues: true
- Zero setup - works out of the box
- Smart analysis - ML readiness scoring with actionable insights
- Flexible - customizable thresholds and output formats
- Fast - typically completes in under 2 minutes
Perfect for validating datasets before training, ensuring data quality in pipelines, or generating automated quality reports.
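To illustrate how a threshold gate like the one configured above can work, here is a minimal Python sketch. It is not the action's actual implementation: the function name `readiness_gate`, its parameters, and the way `fail-on-issues` combines with `ml-threshold` are assumptions made for illustration only.

```python
def readiness_gate(score: float, threshold: float = 80.0,
                   issues: int = 0, fail_on_issues: bool = True) -> int:
    """Return a process exit code: 0 lets the CI step pass, 1 fails it.

    Hypothetical helper: the parameters mirror the `ml-threshold` and
    `fail-on-issues` inputs, but the real action may apply different rules.
    """
    # Fail when the readiness score is below the configured threshold.
    if score < threshold:
        return 1
    # Assumed behavior: optionally fail when any quality issue was reported.
    if fail_on_issues and issues > 0:
        return 1
    return 0


# Made-up example values; a real pipeline would read them from a report.
exit_code = readiness_gate(score=84.5, threshold=80, issues=0)
print(exit_code)  # prints 0
```

In a CI job, a non-zero exit code from a step like this marks the job as failed, which blocks the rest of the workflow until the dataset is fixed.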
Quick Start
Python
# ML readiness assessment with actionable code snippets
=
# NEW: Get ready-to-use preprocessing code
# Quality analysis with detailed reporting
=
# Production database profiling
=
Rust
use *;
// High-performance Arrow processing
let profiler = columnar;
let report = profiler.analyze_csv_file?;
CLI Usage
# Generate ML readiness report with actionable code snippets
# Generate complete Python preprocessing script
# Quick analysis with streaming for large files
Note: On Windows, the binary is named dataprof-cli.exe. Use cargo build --release to build from source.
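For readers unfamiliar with streaming analysis, the idea is to process a file row by row so that memory use does not grow with file size. The following is a rough sketch of that idea using only the Python standard library. It does not use or reproduce dataprof's engine, and the `count_missing` helper and its set of missing-value markers are hypothetical.

```python
import csv
import io
from collections import Counter


def count_missing(rows, missing=("", "NA", "null")):
    """Stream dict rows and count missing cells per column.

    Only one row is held in memory at a time, so the input can be
    larger than available RAM. The `missing` markers are an assumption.
    """
    total = 0
    missing_counts = Counter()
    for row in rows:
        total += 1
        for column, value in row.items():
            # csv.DictReader yields None for cells absent from short rows.
            if value is None or (isinstance(value, str) and value.strip() in missing):
                missing_counts[column] += 1
    return total, dict(missing_counts)


# Small in-memory sample standing in for a large CSV file.
sample = io.StringIO("id,age,city\n1,34,Rome\n2,,Milan\n3,NA,\n")
total, missing = count_missing(csv.DictReader(sample))
print(total, missing)
```

For a file on disk, the same function can be fed with `csv.DictReader(open(path, newline=""))`, ideally inside a `with` block so the file is closed afterwards.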
Development
Want to contribute or build from source? Here's what you need:
Prerequisites
- Rust (latest stable via rustup)
- Docker (for database testing)
Quick Setup
Common Development Tasks
Documentation
- Development Guide - Complete setup and contribution guide
- Python API Reference - Full Python API documentation
- ML Features - Machine learning readiness assessment
- Python Integrations - Pandas, scikit-learn, Jupyter, Airflow, dbt
- Database Connectors - Production database connectivity
- Performance Guide - Optimization and benchmarking
- Apache Arrow Integration - Columnar processing guide
- CLI Usage Guide - Complete CLI reference
- Performance Benchmarks - Benchmark results and methodology
License
Licensed under the MIT License. See LICENSE for details.