# DataProfiler 📊

[![CI](https://github.com/AndreaBozzo/dataprof/workflows/CI/badge.svg)](https://github.com/AndreaBozzo/dataprof/actions)
[![License](https://img.shields.io/github/license/AndreaBozzo/dataprof)](LICENSE)
[![Rust](https://img.shields.io/badge/rust-1.70%2B-orange.svg)](https://www.rust-lang.org)
[![Crates.io](https://img.shields.io/crates/v/dataprof.svg)](https://crates.io/crates/dataprof)
[![PyPI](https://img.shields.io/pypi/v/dataprof.svg)](https://pypi.org/project/dataprof/)

**High-performance data quality library for production pipelines**

๐Ÿ—๏ธ **Library-first design** for easy integration โ€ข โšก **10x faster** than pandas โ€ข ๐ŸŒŠ **Handles datasets larger than RAM** โ€ข ๐Ÿ” **Robust quality checking** for dirty data

![DataProfiler HTML Report](assets/animations/HTML.gif)

## 🚀 Quick Start

### As a Rust Library

```bash
cargo add dataprof
```

```rust
use dataprof::*;

// Simple analysis
let profiles = analyze_csv("data.csv")?;

// Quality checking with streaming for large files
let report = analyze_csv_with_quality("large_dataset.csv")?;
if report.quality_score()? < 80.0 {
    println!("⚠️ Data quality issues detected!");
    for issue in report.issues {
        println!("- {}: {}", issue.severity, issue.message);
    }
}

// Advanced configuration
let profiler = DataProfiler::builder()
    .streaming(true)
    .quality_config(QualityConfig::strict())
    .sampling_strategy(SamplingStrategy::reservoir(10000))
    .build()?;

let report = profiler.analyze_file("dirty_data.csv")?;
```

### Integration Examples

<details>
<summary><b>🔧 Airflow Integration</b></summary>

```python
# Quality gate in an Airflow DAG
from airflow.operators.python import PythonOperator
from airflow.exceptions import AirflowException
from dataprof import quick_quality_check

def data_quality_check(**context):
    file_path = context['task_instance'].xcom_pull(task_ids='extract_data')
    quality_score = quick_quality_check(file_path)

    if quality_score < 80.0:
        raise AirflowException(f"Data quality too low: {quality_score}")

    return quality_score

quality_task = PythonOperator(
    task_id='check_data_quality',
    python_callable=data_quality_check,
    dag=dag
)
```
</details>

<details>
<summary><b>📊 dbt Integration</b></summary>

```rust
// Generate dbt tests from profiling results
use dataprof::integrations::dbt;

let report = analyze_csv_with_quality("models/customers.csv")?;
dbt::generate_tests(&report, "tests/customers.yml")?;

// Creates tests like:
// - dbt_utils.not_null_proportion(columns=['email'], at_least=0.95)
// - dbt_utils.accepted_range(column_name='age', min_value=0, max_value=120)
```
</details>

<details>
<summary><b>🐍 Python Bindings</b></summary>

```bash
pip install dataprof
```

```python
import dataprof

# Simple usage
profiles = dataprof.analyze_csv("data.csv")
quality_report = dataprof.analyze_with_quality("data.csv")

# Pandas integration
import pandas as pd
df = pd.read_csv("large_file.csv")
# DataProfiler handles large datasets that would crash pandas
profiles = dataprof.analyze_dataframe(df)
```
</details>

### CLI Usage

```bash
# Install binary from GitHub releases
curl -L https://github.com/AndreaBozzo/dataprof/releases/latest/download/dataprof-linux.tar.gz | tar xz

# Basic analysis
./dataprof data.csv --quality

# Streaming for large files
./dataprof huge_dataset.csv --streaming --progress

# Generate HTML report
./dataprof data.csv --quality --html report.html
```

## 🎯 Real-World Use Cases

### Production Data Pipeline Quality Gates
```rust
// Block pipeline on poor data quality
let quality_score = quick_quality_check("incoming/batch_2024_01_15.csv")?;
if quality_score < 85.0 {
    return Err("Data quality below production threshold".into());
}
```

### ML Model Input Validation
```rust
// Detect data drift in production
let baseline = analyze_csv("training_data.csv")?;
let current = analyze_csv("production_input.csv")?;
let drift_detected = detect_distribution_drift(&baseline, &current)?;
```

### ETL Process Monitoring
```rust
// Continuous monitoring of data warehouse loads
for entry in glob("warehouse/daily/*.csv")? {
    let report = analyze_csv_with_quality(&entry?)?;
    send_quality_metrics(&report, "datadog://metrics")?;
}
```

## ⚡ Performance vs Alternatives

| Tool | 100MB CSV | Memory Usage | Handles >RAM |
|------|-----------|--------------|--------------|
| **DataProfiler** | **2.1s** | **45MB** | **✅ Yes** |
| pandas.describe() | 8.4s | 380MB | ❌ No |
| Great Expectations | 12.1s | 290MB | ❌ No |
| deequ (Spark) | 15.3s | 1.2GB | ✅ Yes |

*Benchmarks on E5-2670v3, 16GB RAM, SSD*

## 📊 Example Output

### Quality Issues Detection

```
โš ๏ธ  QUALITY ISSUES FOUND: (15)

1. ๐Ÿ”ด CRITICAL [email]: 2 null values (20.0%)
2. ๐Ÿ”ด CRITICAL [order_date]: Mixed date formats
   - YYYY-MM-DD: 5 rows
   - DD/MM/YYYY: 2 rows
   - DD-MM-YYYY: 1 rows
3. ๐ŸŸก WARNING [phone]: Invalid format patterns detected
4. ๐ŸŸก WARNING [amount]: Outlier values (999999.99 vs mean 156.78)

๐Ÿ“Š Summary: 2 critical, 13 warnings
Quality Score: 73.2/100 - BELOW THRESHOLD
```

### Standard Analysis

```
📊 DataProfiler - Standard Analysis

📁 sales_data_problematic.csv | 0.0 MB | 9 columns

⚠️  QUALITY ISSUES FOUND: (15)

1. 🔴 CRITICAL [email]: 2 null values (20.0%)
2. 🔴 CRITICAL [order_date]: Mixed date formats
     - DD/MM/YYYY: 2 rows
     - YYYY-MM-DD: 5 rows
     - YYYY/MM/DD: 1 rows
     - DD-MM-YYYY: 1 rows
3. 🟡 WARNING [phone]: 1 null values (10.0%)
4. 🟡 WARNING [amount]: 1 duplicate values

📊 Summary: 2 critical, 13 warnings
```

## ๐Ÿ—๏ธ Architecture & Features

### Why DataProfiler?

**Built for Production Data Pipelines:**
- ⚡ **10x faster** than pandas on large datasets
- 🌊 **Stream processing** - analyze 100GB+ files without loading them into memory
- 🛡️ **Robust parsing** - handles malformed CSV, mixed data types, encoding issues
- 🔍 **Smart quality detection** - catches issues pandas misses
- 🏗️ **Library-first** - easy integration into existing workflows

### Core Capabilities

| Feature | DataProfiler | pandas | Great Expectations |
|---------|-------------|--------|-------------------|
| **Large File Support** | ✅ Streaming | ❌ Memory bound | ❌ Memory bound |
| **Quality Detection** | ✅ Built-in | ⚠️ Manual | ✅ Rules-based |
| **Performance** | ✅ SIMD accelerated | ⚠️ Single-threaded | ❌ Spark overhead |
| **Integration** | ✅ Library API | ✅ Native Python | ⚠️ Configuration heavy |
| **Dirty Data** | ✅ Robust parsing | ❌ Fails on errors | ⚠️ Schema required |

### Technical Features

- **🚀 SIMD Acceleration**: Vectorized operations for 10x numeric performance
- **🌊 True Streaming**: Process files larger than available RAM
- **🧠 Smart Algorithms**: Vitter's reservoir sampling, statistical profiling
- **🛡️ Robust Parsing**: Handles malformed CSV, mixed encodings, variable columns
- **⚠️ Quality Detection**: Null patterns, duplicates, outliers, format inconsistencies
- **📊 Multiple Formats**: CSV, JSON, JSONL with unified API
- **🔧 Configurable**: Sampling strategies, quality thresholds, output formats
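The reservoir sampling mentioned above can be sketched in a few lines. This is an illustrative stand-in, not dataprof's API: the `sample_k` helper and the inline LCG are invented for the sketch. Algorithm R keeps the first `k` items, then replaces a random slot with probability `k / (i + 1)`, yielding a uniform sample from a stream of unknown length in constant memory.

```rust
/// Illustrative Algorithm R (Vitter) reservoir sample.
/// `sample_k` and the inline LCG are sketch-only, not dataprof's API.
fn sample_k<T>(items: impl Iterator<Item = T>, k: usize) -> Vec<T> {
    let mut reservoir: Vec<T> = Vec::with_capacity(k);
    // Tiny deterministic LCG so the sketch needs no external crates.
    let mut state: u64 = 0x9E37_79B9_7F4A_7C15;
    let mut next_rand = move |bound: u64| {
        state = state
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        (state >> 33) % bound
    };
    for (i, item) in items.enumerate() {
        if i < k {
            reservoir.push(item); // fill phase: keep the first k items
        } else {
            // Replacement phase: slot j is uniform in 0..=i,
            // so `item` enters the reservoir with probability k / (i + 1).
            let j = next_rand(i as u64 + 1) as usize;
            if j < k {
                reservoir[j] = item;
            }
        }
    }
    reservoir
}

fn main() {
    let sample = sample_k(0..1_000_000u32, 10_000);
    assert_eq!(sample.len(), 10_000);
    println!("kept {} of 1,000,000 rows", sample.len());
}
```

The point of the algorithm is that memory stays proportional to `k` no matter how many rows stream past, which is what makes sampling viable on files larger than RAM.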

## 📋 All Options

```bash
Fast CSV data profiler with quality checking - v0.3.0 Streaming Edition

Usage: dataprof [OPTIONS] <FILE>

Arguments:
  <FILE>  CSV file to analyze

Options:
  -q, --quality                  Enable quality checking (shows data issues)
      --html <HTML>              Generate HTML report (requires --quality)
      --streaming                Use streaming engine for large files (v0.3.0)
      --progress                 Show progress during processing (requires --streaming)
      --chunk-size <CHUNK_SIZE>  Override chunk size for streaming (default: adaptive)
      --sample <SAMPLE>          Enable sampling for very large datasets
  -h, --help                     Print help
```

## ๐Ÿ› ๏ธ As a Library

Add to your `Cargo.toml`:

```toml
[dependencies]
dataprof = "0.3"
```

```rust
use dataprof::analyze_csv;

let profiles = analyze_csv("data.csv")?;
for profile in profiles {
    println!("{}: {:?} ({:.1}% nulls)",
             profile.name,
             profile.data_type,
             profile.null_count as f32 / profile.total_count as f32 * 100.0);
}
```

## 🎯 Supported Formats

- **CSV**: Comma-separated values with auto-delimiter detection
- **JSON**: JSON arrays with object records
- **JSONL**: Line-delimited JSON (one object per line)
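For reference, a JSONL input is simply one self-contained object per line (the records and field names below are made up for illustration); the JSON variant would wrap the same objects in a single top-level array:

```jsonl
{"id": 1, "email": "a@example.com", "amount": 156.78}
{"id": 2, "email": null, "amount": 999999.99}
```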

## ⚡ Performance

- **Small files** (<10MB): Analysis in milliseconds
- **Large files** (100MB+): Smart sampling maintains accuracy
- **SIMD optimized**: 10x faster numeric computations on modern CPUs
- **Memory bounded**: Process files larger than available RAM
- **Example**: 115MB file analyzed in 2.9s with 99.6% accuracy
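The memory-bounded claim boils down to single-pass streaming with running aggregates. A minimal sketch of the idea, assuming a naive comma split (the `count_nulls_per_column` helper is hypothetical, and real CSV handling needs a proper parser for quoted fields):

```rust
use std::fs::File;
use std::io::{BufRead, BufReader};

/// Sketch of bounded-memory profiling: read one line at a time and keep
/// only running per-column aggregates (here, empty-field counts).
/// Hypothetical helper; it naively splits on commas and ignores quoting.
fn count_nulls_per_column(path: &str) -> std::io::Result<Vec<usize>> {
    let reader = BufReader::new(File::open(path)?);
    let mut null_counts: Vec<usize> = Vec::new();
    for line in reader.lines().skip(1) {
        // skip(1) drops the header row
        for (i, field) in line?.split(',').enumerate() {
            if null_counts.len() <= i {
                null_counts.push(0); // tolerate rows with extra columns
            }
            if field.trim().is_empty() {
                null_counts[i] += 1; // empty field counted as a null
            }
        }
    }
    Ok(null_counts)
}

fn main() -> std::io::Result<()> {
    std::fs::write("example.csv", "name,email\nalice,a@x.com\nbob,\n")?;
    let nulls = count_nulls_per_column("example.csv")?;
    println!("nulls per column: {:?}", nulls); // prints: nulls per column: [0, 1]
    Ok(())
}
```

Because only the aggregate vector lives in memory, peak usage is independent of file size, which is the property the bullet list above describes.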

## 🧪 Development

Requirements: Rust 1.70+

### Quick Setup

```bash
# Automated setup (installs pre-commit hooks, tools)
bash scripts/setup-dev.sh        # Linux/macOS
# or
pwsh scripts/setup-dev.ps1        # Windows

# Manual setup
cargo build --release             # Build optimized
cargo test                        # Run all tests
cargo fmt                         # Format code
cargo clippy                      # Lint code
```

### Development Tools

#### Using just (Recommended)

```bash
cargo install just                # Install task runner
just                              # Show all tasks
just dev                          # Quick development cycle
just check                        # Full quality checks
just test-lib                     # Fast library tests
just example data.csv             # Run example analysis
```

#### Using pre-commit (Quality Gates)

```bash
pip install pre-commit            # Install pre-commit
pre-commit install                # Install hooks
pre-commit run --all-files        # Run all checks
```

#### Manual Commands

```bash
cargo build --release             # Build optimized
cargo test --lib                  # Fast library tests
cargo test --test integration_tests # Integration tests
cargo test --test v03_comprehensive # Comprehensive tests
cargo fmt --all                   # Format code
cargo clippy --all-targets --all-features -- -D warnings # Lint
```

### Quality Assurance

The project uses automated quality checks:

- **Pre-commit hooks**: Format, lint, test on every commit
- **Continuous Integration**: 61/61 tests passing (100% success rate)
- **Code coverage**: All major functions tested
- **Performance benchmarks**: Verified 10x SIMD improvements

## ๐Ÿค Contributing

See [CONTRIBUTING.md](CONTRIBUTING.md) for development guidelines.

## 📄 License

This project is licensed under the GNU General Public License v3.0 - see the [LICENSE](LICENSE) file for details.