DataProfiler 📊
DISCLAIMER FOR HUMAN READERS
dataprof works, but it is still in early-stage development, so you may run into bugs, minor or major, while exploring your data quality.
Please report bugs by opening an issue. For security issues, email the maintainer instead.
Thanks for your time here!
High-performance data quality and ML readiness assessment library
DataProfiler v0.4.4 offers up to 20x better memory efficiency than pandas, streaming for files larger than RAM, 30+ automated quality checks, and, new in this release, ML readiness assessment. It is built in Rust with full Python bindings and production-ready database connectivity.


✨ Key Features (v0.4.4)
- 🤖 ML Readiness Assessment: Automated feature analysis, blocking-issue detection, and preprocessing recommendations
- ⚡ High Performance: Up to 20x more memory efficient than pandas, with Apache Arrow integration
- 🌊 Scalable: Stream processing for files larger than RAM (tested up to 100GB)
- 🔍 Smart Quality Detection: 30+ automated checks for nulls, duplicates, outliers, format issues
- 🗃️ Production Database Support: PostgreSQL, MySQL, SQLite, DuckDB with SSL/TLS and retry logic
- 🐍 Complete Python Integration: Native bindings with pandas, scikit-learn, Jupyter support
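To make the streaming and quality-check ideas above concrete, here is a minimal sketch in plain Python. It is not DataProfiler's API: the function name `stream_quality_checks` and the `NULL_TOKENS` set are assumptions made for illustration. It only shows how null and duplicate counts can be gathered one row at a time without loading the whole file.

```python
import csv
import hashlib
import io

# Assumed spellings of "missing" values; adjust for your data.
NULL_TOKENS = {"", "na", "n/a", "null", "none", "nan"}


def stream_quality_checks(lines):
    """Scan CSV rows one at a time, counting nulls per column and duplicate rows.

    `lines` is any iterable of text lines (an open file, a StringIO, ...).
    Only one row is held in memory at a time, plus one hash per distinct row,
    so memory still grows with the number of unique rows.
    """
    reader = csv.reader(lines)
    header = next(reader)
    null_counts = {name: 0 for name in header}
    seen = set()
    duplicates = 0
    total = 0
    for row in reader:
        total += 1
        for name, value in zip(header, row):
            if value.strip().lower() in NULL_TOKENS:
                null_counts[name] += 1
        # Hash the row so duplicate detection stores 32 bytes instead of the row.
        digest = hashlib.sha256("\x1f".join(row).encode("utf-8")).digest()
        if digest in seen:
            duplicates += 1
        else:
            seen.add(digest)
    return {"rows": total, "null_counts": null_counts, "duplicate_rows": duplicates}


sample = io.StringIO("id,name\n1,alice\n2,\n1,alice\n3,NA\n")
result = stream_quality_checks(sample)
```

For a real file, pass `open(path, newline="")` instead of the `StringIO` object.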
🚀 Quick Start
Python
# NEW v0.4.4: ML readiness assessment
=
# Quality analysis with detailed reporting
=
# Production database profiling with SSL
=
Rust
use *;
// High-performance Arrow processing
let profiler = columnar;
let report = profiler.analyze_csv_file?;
CLI
# Basic profiling
# Database profiling
# Large files with progress
📊 Performance
| Tool | Time (100MB CSV) | Memory | Quality Checks | Larger-than-RAM Files |
|---|---|---|---|---|
| DataProfiler (Arrow) | 0.5s | 30MB | ✅ 30+ checks | ✅ |
| DataProfiler (Standard) | 2.1s | 45MB | ✅ 30+ checks | ✅ |
| pandas.describe() | 8.4s | 380MB | ❌ Basic stats | ❌ |
| Great Expectations | 12.1s | 290MB | ✅ Rule-based | ❌ |
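The ratios implied by the table can be checked with a few lines of arithmetic. The sketch below copies the time and memory figures from the table and expresses each tool relative to `pandas.describe()`. The dictionary layout and the function name `relative_to_baseline` are illustrative only.

```python
# (seconds, megabytes) for the 100MB CSV benchmark, copied from the table.
BENCHMARK = {
    "DataProfiler (Arrow)": (0.5, 30),
    "DataProfiler (Standard)": (2.1, 45),
    "pandas.describe()": (8.4, 380),
    "Great Expectations": (12.1, 290),
}


def relative_to_baseline(results, baseline="pandas.describe()"):
    """Return {tool: (speedup, memory_ratio)} relative to the baseline tool.

    A value above 1 means the tool is faster / uses less memory than the baseline.
    """
    base_time, base_mem = results[baseline]
    return {
        tool: (base_time / seconds, base_mem / megabytes)
        for tool, (seconds, megabytes) in results.items()
    }


ratios = relative_to_baseline(BENCHMARK)
for tool, (speedup, memory_ratio) in ratios.items():
    print(f"{tool}: {speedup:.1f}x faster, {memory_ratio:.1f}x less memory")
```

On these figures, the Arrow engine comes out about 16.8x faster and about 12.7x lighter than `pandas.describe()` for the 100MB file.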
💡 Real-World Examples
NEW v0.4.4: ML Pipeline Integration
# Step 1: ML readiness assessment guides preprocessing
=
=
# Step 2: Auto-categorize features for scikit-learn pipeline
=
=
# Step 3: Build preprocessing pipeline based on DataProf recommendations
=
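As a rough illustration of the feature-categorization step, the following sketch sorts columns into numeric, categorical, and text groups using only the standard library. It does not use DataProfiler's API: `categorize_columns`, `max_categories`, and the three group names are hypothetical, and the heuristics are deliberately simple.

```python
def _is_number(value):
    """Return True if the string parses as a float."""
    try:
        float(value)
    except (TypeError, ValueError):
        return False
    return True


def categorize_columns(rows, max_categories=10):
    """Group column names by a guessed feature type.

    rows: iterable of dicts mapping column name to a string value.
    A column is numeric if every non-empty value parses as a float,
    categorical if it has at most `max_categories` distinct values,
    and text otherwise.
    """
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)

    kinds = {"numeric": [], "categorical": [], "text": []}
    for name, values in columns.items():
        present = [v for v in values if v not in (None, "")]
        if present and all(_is_number(v) for v in present):
            kinds["numeric"].append(name)
        elif len(set(present)) <= max_categories:
            kinds["categorical"].append(name)
        else:
            kinds["text"].append(name)
    return kinds


rows = [
    {"age": "34", "city": "Rome", "comment": "late delivery"},
    {"age": "", "city": "Oslo", "comment": "ok"},
    {"age": "51.5", "city": "Rome", "comment": "damaged box"},
]
kinds = categorize_columns(rows, max_categories=2)
```

The resulting `numeric` and `categorical` lists are the kind of input that scikit-learn's `ColumnTransformer` expects when assigning a scaler to one group and an encoder to the other.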
Production Quality Gate
=
=
return ,
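A quality gate of this kind usually reduces to comparing a few measured rates against thresholds and returning a pass/fail flag with the reasons. The sketch below assumes the null and duplicate rates have already been computed by some profiler. The function name `quality_gate` and the default thresholds are hypothetical, not values taken from DataProfiler.

```python
def quality_gate(null_rate, duplicate_rate,
                 max_null_rate=0.05, max_duplicate_rate=0.01):
    """Return (passed, failures) for a dataset's measured quality rates.

    Rates are fractions between 0 and 1. The thresholds are illustrative
    defaults and should be tuned per pipeline.
    """
    failures = []
    if null_rate > max_null_rate:
        failures.append(
            f"null rate {null_rate:.1%} exceeds limit {max_null_rate:.1%}"
        )
    if duplicate_rate > max_duplicate_rate:
        failures.append(
            f"duplicate rate {duplicate_rate:.1%} exceeds limit {max_duplicate_rate:.1%}"
        )
    return not failures, failures


passed, failures = quality_gate(null_rate=0.10, duplicate_rate=0.02)
if not passed:
    for reason in failures:
        print("BLOCKED:", reason)
```

Returning the reasons alongside the flag lets a CI job or scheduler log why a load was rejected rather than only that it was.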
Database Monitoring with ML Assessment
# Monitor daily data loads with ML readiness
|
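For readers who want to see what a per-load check might look like at the database level, here is a small sketch using Python's built-in `sqlite3` module rather than DataProfiler's connectors. The table `daily_loads`, its columns, and the helper `null_rates` are invented for the example.

```python
import sqlite3


def null_rates(conn, table):
    """Return {column: fraction of NULL values} for one table.

    `table` is interpolated into the SQL, so it must come from trusted code,
    never from user input.
    """
    # A zero-row SELECT is enough to read the column names.
    cursor = conn.execute(f'SELECT * FROM "{table}" LIMIT 0')
    columns = [description[0] for description in cursor.description]
    total = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]

    rates = {}
    for column in columns:
        nulls = conn.execute(
            f'SELECT COUNT(*) FROM "{table}" WHERE "{column}" IS NULL'
        ).fetchone()[0]
        rates[column] = nulls / total if total else 0.0
    return rates


# Hypothetical daily load held in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_loads (id INTEGER, amount REAL, region TEXT)")
conn.executemany(
    "INSERT INTO daily_loads VALUES (?, ?, ?)",
    [(1, 9.5, "EU"), (2, None, "US"), (3, 4.0, None), (4, None, "EU")],
)
rates = null_rates(conn, "daily_loads")
```

Running such a check after each load and comparing the rates against the previous day's values is one simple way to notice a degrading feed.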
📖 Documentation
| Guide | Description |
|---|---|
| Python API Reference | Complete function and class reference |
| ML Features Guide | NEW: ML readiness assessment and preprocessing recommendations |
| Python Integrations | Pandas, scikit-learn, Jupyter, Airflow workflows |
| Database Connectors | Production PostgreSQL, MySQL, SQLite, DuckDB with SSL/TLS |
| CLI Usage Guide | Comprehensive CLI with progress indicators and validation |
Resources: CHANGELOG • CONTRIBUTING • LICENSE
🛠️ Development
# Quick setup
# Build and test
🤝 Contributing
We welcome contributions! Please see CONTRIBUTING.md for guidelines.
📄 License
Licensed under GPL-3.0 • Commercial use allowed with source disclosure