dataprof
DISCLAIMER FOR HUMAN READERS
dataprof works, but it is still in early-stage development, so you may encounter bugs, minor or major, while exploring your data quality.
Please report them by opening an issue. For security issues, email the maintainer instead.
Thanks for your time here!
dataprof is a high-performance data quality and ML readiness assessment library built in Rust. It delivers 20x better memory efficiency than pandas, streams files of any size, and runs 30+ automated quality checks alongside a comprehensive ML readiness assessment. New in v0.4.6: it generates ready-to-use Python code snippets for each ML recommendation. Full Python bindings and production database connectivity are included.
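To make the streaming idea concrete, here is a minimal sketch of a single-pass quality check that keeps only per-column counters in memory, so memory use does not grow with file size. It uses only the Python standard library and is not dataprof's API: the function `stream_null_rates`, its `null_tokens` parameter, and the sample data are hypothetical names chosen for illustration.

```python
import csv
import io
from collections import defaultdict


def stream_null_rates(lines, null_tokens=("", "NA", "null")):
    """Illustrative only (not dataprof's API): per-column null rate in one pass."""
    reader = csv.DictReader(lines)
    nulls = defaultdict(int)
    rows = 0
    for row in reader:  # rows are read one at a time, never all at once
        rows += 1
        for column, value in row.items():
            if column is None:
                continue  # skip extra fields that have no header
            if value is None or value.strip() in null_tokens:
                nulls[column] += 1
    columns = reader.fieldnames or []
    return {c: (nulls[c] / rows if rows else 0.0) for c in columns}


# Hypothetical sample data; any iterable of lines (such as an open file) works.
sample = io.StringIO("id,age,city\n1,34,Rome\n2,,Milan\n3,NA,\n4,51,Turin\n")
rates = stream_null_rates(sample)
print(rates)  # {'id': 0.0, 'age': 0.5, 'city': 0.25}
```

Passing an open file handle instead of the `io.StringIO` sample would apply the same check to a file on disk without loading it fully.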
Quick Start
Python
# ML readiness assessment with actionable code snippets
=
# NEW: Get ready-to-use preprocessing code
# Quality analysis with detailed reporting
=
# Production database profiling
=
Rust
use *;
// High-performance Arrow processing
let profiler = columnar;
let report = profiler.analyze_csv_file?;
CLI with Code Generation
# Generate ML readiness report with actionable code snippets
# Generate complete Python preprocessing script
Development
Prerequisites
- Rust (latest stable via rustup)
- Docker (for database testing)
Setup
Common Tasks
Documentation
- Development Guide - Complete setup and contribution guide
- Python API Reference - Full Python API documentation
- ML Features - Machine learning readiness assessment
- Python Integrations - Pandas, scikit-learn, Jupyter, Airflow, dbt
- Database Connectors - Production database connectivity
- Performance Guide - Optimization and benchmarking
- Apache Arrow Integration - Columnar processing guide
- CLI Usage Guide - Complete CLI reference
- Performance Benchmarks - Benchmark results and methodology
License
Licensed under GPL-3.0. See LICENSE for details.