<div align="center">
<img src="assets/images/logo.png" alt="dataprof logo" width="400" height="auto" />
<h1>dataprof</h1>
<p>
<strong>The High-Performance Profiler for Large Datasets</strong>
</p>
[](https://github.com/AndreaBozzo/dataprof/actions)
[](https://codecov.io/gh/AndreaBozzo/dataprof)
[](https://crates.io/crates/dataprof)
[](LICENSE)
[](https://pepy.tech/projects/dataprof)
</div>
<br />
> **Profile 50GB datasets in seconds on your laptop.**
DataProf is built for Data Scientists and Engineers who need to understand their data *fast*. No more `MemoryError` when trying to profile a CSV larger than your RAM.
**Pandas-Profiling vs DataProf on a 10GB CSV:**
| **Memory Usage** | 12GB+ (Crashes) | **< 100MB (Streaming)** |
| **Speed** | 15+ minutes | **45 seconds** |
| **Implementation** | Python (Slow) | **Rust (Blazing Fast)** |
## Quick Start
### Installation
The easiest way to get started is via pip:
```bash
pip install dataprof
```
### Python Usage
Forget complex configurations. Just point to your file:
```python
import dataprof
# Analyze a huge file without crashing memory
# Generates a report.html with quality metrics and distributions
dataprof.profile("huge_dataset.csv").save("report.html")
```
### CLI & Rust Usage (Advanced)
If you prefer the command line or are a Rust developer:
```bash
# Install via cargo
cargo install dataprof
# Generate report from CLI
dataprof-cli report huge_data.csv -o report.html
```
## 💡 Key Features
- **No Size Limits**: Profiles files larger than RAM using streaming and memory mapping.
- **Blazing Fast**: Written in Rust with SIMD acceleration.
- **Privacy Guaranteed**: Data never leaves your machine.
- **Format Support**: CSV, Parquet, JSON/L, and Databases (Postgres, MySQL, etc.).
- **Smart Detection**: Automatically identifies Emails, IPs, IBANs, Credit Cards, and more.
## 📊 Beautiful Reports
<div align="center">
<p><strong>Interactive Demo</strong><br/>
<em>Animated walkthrough of dataprof features and dashboards</em></p>
<img src="assets/animations/dataprof_demo_minimal.gif" alt="DataProf Demo" width="100%" />
<br/><br/>
<p><strong>Single File Analysis</strong><br/>
<em>Interactive dashboards with quality scoring and distributions</em></p>
<img src="assets/images/dataprofhtml2026.png" alt="Single Report Dashboard" width="100%" />
<br/><br/>
<p><strong>Batch Processing Dashboard</strong><br/>
<em>Aggregate metrics from hundreds of files in one view</em></p>
<img src="assets/images/dataprofbatch2026.png" alt="Batch Dashboard" width="100%" />
</div>
## Documentation
- **[Python API Reference](docs/python/README.md)**
- **[CLI Guide](docs/guides/CLI_USAGE_GUIDE.md)**
### Advanced Examples
**Batch Processing (Python)**
```python
# Process a whole directory of files in parallel
result = dataprof.batch_analyze_directory("/data_folder", recursive=True)
print(f"Processed {result.processed_files} files at {result.files_per_second:.1f} files/sec")
```
**Database Integration (Python)**
```python
# Profile a SQL query directly
await dataprof.analyze_database_async(
"postgresql://user:pass@localhost/db",
"SELECT * FROM sales_data_2024"
)
```
**Rust Library Usage**
```rust
use dataprof::*;
let profiler = DataProfiler::auto();
let report = profiler.analyze_file("dataset.csv")?;
println!("Quality Score: {}", report.quality_score());
```
## Development
```bash
# Setup
git clone https://github.com/AndreaBozzo/dataprof.git
cd dataprof
cargo build --release
# Test databases (optional)
docker-compose -f .devcontainer/compose.yml up -d
# Common tasks
cargo test # Run tests
cargo bench # Benchmarks
cargo clippy # Linting
```
### Feature Flags
```bash
# Minimal (CSV/JSON only)
cargo build --release
# With Apache Arrow (large files >100MB)
cargo build --release --features arrow
# With Parquet support
cargo build --release --features parquet
# With databases
cargo build --release --features postgres,mysql,sqlite
# Python async support
maturin develop --features python-async,database,postgres
# All features
cargo build --release --all-features
```
**When to use Arrow:** Large files (>100MB), many columns (>20), uniform types
**When to use Parquet:** Analytics, data lakes, Spark/Pandas integration
## Documentation
**User Guides:**
[CLI Reference](docs/guides/CLI_USAGE_GUIDE.md)
[Database Connectors](docs/guides/database-connectors.md)
## 🤝 Contributing
We welcome contributions from everyone! Whether you want to:
- **Fix a bug** 🐛
- **Add a feature** ✨
- **Improve documentation** 📚
- **Report an issue** 📝
### Quick Start for Contributors
1. **Fork & clone:**
```bash
git clone https://github.com/YOUR-USERNAME/dataprof.git
cd dataprof
```
2. **Build & test:**
```bash
cargo build
cargo test
```
3. **Create a feature branch:**
```bash
git checkout -b feature/your-feature-name
```
4. **Before submitting PR:**
```bash
cargo fmt --all
cargo clippy --all --all-targets
cargo test --all
```
5. **Submit a Pull Request** with clear description
All contributions are welcome. Please read [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines and our [Code of Conduct](CODE_OF_CONDUCT.md).
## License
Dual-licensed under either:
- [MIT License](LICENSE)
- [Apache License, Version 2.0](LICENSE-APACHE)
You may use this project under the terms of either license.