
DataProfiler 📊


High-performance data quality library for production pipelines

๐Ÿ—๏ธ Library-first design for easy integration โ€ข โšก 10x faster than pandas โ€ข ๐ŸŒŠ Handles datasets larger than RAM โ€ข ๐Ÿ” Robust quality checking for dirty data โ€ข ๐Ÿ—ƒ๏ธ Direct database connectivity

๐Ÿ“ฆ Available for both Rust and Python โ€ข ๐Ÿ pip install dataprof โ€ข ๐Ÿฆ€ cargo add dataprof

๐Ÿ—ƒ๏ธ NEW: Database Connectors - Profile data directly from PostgreSQL, MySQL, SQLite, and DuckDB without exports!

DataProfiler HTML Report

🚀 Quick Start

🐍 Python Users

pip install dataprof

import dataprof

# Analyze CSV files with ease
profiles = dataprof.analyze_csv_file("data.csv")
for profile in profiles:
    print(f"{profile.name}: {profile.data_type} (null: {profile.null_percentage:.1f}%)")

# Quality checking with detailed reports
report = dataprof.analyze_csv_with_quality("dataset.csv")
print(f"Quality score: {report.quality_score():.1f}%")

👉 Complete Python Guide →

🗃️ Database Profiling (NEW!)

# Install with database support
pip install dataprof[database]
# or
cargo install dataprof --features database

# Profile PostgreSQL table directly
dataprof users --database "postgresql://user:pass@localhost:5432/mydb" --quality

# Analyze with custom query
dataprof . --database "mysql://root:pass@localhost:3306/shop" \
  --query "SELECT * FROM orders WHERE date > '2024-01-01'" \
  --quality --html report.html

# DuckDB analytics
dataprof sales --database "./analytics.duckdb" --quality --batch-size 50000

👉 Complete Database Guide →

🦀 Rust Library

cargo add dataprof

use dataprof::*;

// Simple analysis
let profiles = analyze_csv("data.csv")?;

// Quality checking with streaming for large files
let report = analyze_csv_with_quality("large_dataset.csv")?;
if report.quality_score()? < 80.0 {
    println!("⚠️ Data quality issues detected!");
    for issue in report.issues {
        println!("- {}: {}", issue.severity, issue.message);
    }
}

// Advanced configuration
let profiler = DataProfiler::builder()
    .streaming(true)
    .quality_config(QualityConfig::strict())
    .sampling_strategy(SamplingStrategy::reservoir(10000))
    .build()?;

let report = profiler.analyze_file("dirty_data.csv")?;

Integration Examples

# Quality gate in Airflow DAG
from airflow.exceptions import AirflowException
from airflow.operators.python import PythonOperator
from dataprof import quick_quality_check

def data_quality_check(**context):
    file_path = context['task_instance'].xcom_pull(task_ids='extract_data')
    quality_score = quick_quality_check(file_path)

    if quality_score < 80.0:
        raise AirflowException(f"Data quality too low: {quality_score}")

    return quality_score

quality_task = PythonOperator(
    task_id='check_data_quality',
    python_callable=data_quality_check,
    dag=dag
)

// Generate dbt tests from profiling results
use dataprof::integrations::dbt;

let report = analyze_csv_with_quality("models/customers.csv")?;
dbt::generate_tests(&report, "tests/customers.yml")?;

// Creates tests like:
// - dbt_utils.not_null_proportion(columns=['email'], at_least=0.95)
// - dbt_utils.accepted_range(column_name='age', min_value=0, max_value=120)

pip install dataprof

import dataprof

# Simple usage
profiles = dataprof.analyze_csv("data.csv")
quality_report = dataprof.analyze_with_quality("data.csv")

# Pandas integration
import pandas as pd
df = pd.read_csv("large_file.csv")
# DataProfiler handles larger datasets that crash pandas
profiles = dataprof.analyze_dataframe(df)

CLI Usage

# Install binary from GitHub releases
curl -L https://github.com/AndreaBozzo/dataprof/releases/latest/download/dataprof-linux.tar.gz | tar xz

# Basic analysis
./dataprof data.csv --quality

# Streaming for large files
./dataprof huge_dataset.csv --streaming --progress

# Generate HTML report
./dataprof data.csv --quality --html report.html

🎯 Real-World Use Cases

Production Data Pipeline Quality Gates

// Block pipeline on poor data quality
let quality_score = quick_quality_check("incoming/batch_2024_01_15.csv")?;
if quality_score < 85.0 {
    return Err("Data quality below production threshold".into());
}

ML Model Input Validation

// Detect data drift in production
let baseline = analyze_csv("training_data.csv")?;
let current = analyze_csv("production_input.csv")?;
let drift_detected = detect_distribution_drift(&baseline, &current)?;

ETL Process Monitoring

// Continuous monitoring of data warehouse loads
for file in glob("warehouse/daily/*.csv")? {
    let report = analyze_csv_with_quality(&file)?;
    send_quality_metrics(&report, "datadog://metrics")?;
}

⚡ Performance vs Alternatives

Tool                  100MB CSV   Memory Usage   Handles >RAM
DataProfiler          2.1s        45MB           ✅ Yes
pandas.describe()     8.4s        380MB          ❌ No
Great Expectations    12.1s       290MB          ❌ No
deequ (Spark)         15.3s       1.2GB          ✅ Yes

Benchmarks on E5-2670v3, 16GB RAM, SSD

📊 Example Output

Quality Issues Detection

⚠️  QUALITY ISSUES FOUND: (15)

1. 🔴 CRITICAL [email]: 2 null values (20.0%)
2. 🔴 CRITICAL [order_date]: Mixed date formats
   - YYYY-MM-DD: 5 rows
   - DD/MM/YYYY: 2 rows
   - DD-MM-YYYY: 1 rows
3. 🟡 WARNING [phone]: Invalid format patterns detected
4. 🟡 WARNING [amount]: Outlier values (999999.99 vs mean 156.78)

📊 Summary: 2 critical, 13 warnings
Quality Score: 73.2/100 - BELOW THRESHOLD

Standard Analysis Output

📊 DataProfiler - Standard Analysis

📁 sales_data_problematic.csv | 0.0 MB | 9 columns

⚠️  QUALITY ISSUES FOUND: (15)

1. 🔴 CRITICAL [email]: 2 null values (20.0%)
2. 🔴 CRITICAL [order_date]: Mixed date formats
     - DD/MM/YYYY: 2 rows
     - YYYY-MM-DD: 5 rows
     - YYYY/MM/DD: 1 rows
     - DD-MM-YYYY: 1 rows
3. 🟡 WARNING [phone]: 1 null values (10.0%)
4. 🟡 WARNING [amount]: 1 duplicate values

📊 Summary: 2 critical, 13 warnings
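Mixed-format findings like the [order_date] issue above can be approximated by tallying which known pattern each value matches. This is a simplified Python illustration; the pattern set and detector here are hypothetical, not DataProfiler's own logic:

```python
import re
from collections import Counter

# Hypothetical pattern set for illustration; the real tool recognizes more formats.
DATE_PATTERNS = {
    "YYYY-MM-DD": re.compile(r"^\d{4}-\d{2}-\d{2}$"),
    "DD/MM/YYYY": re.compile(r"^\d{2}/\d{2}/\d{4}$"),
    "DD-MM-YYYY": re.compile(r"^\d{2}-\d{2}-\d{4}$"),
}

def date_format_counts(values):
    """Count how many values match each date pattern; more than one
    non-zero bucket signals mixed formats in the column."""
    counts = Counter()
    for v in values:
        for name, pattern in DATE_PATTERNS.items():
            if pattern.match(v):
                counts[name] += 1
                break
    return counts

counts = date_format_counts(
    ["2024-01-05", "2024-02-10", "15/03/2024", "16-03-2024"]
)
print(dict(counts))  # {'YYYY-MM-DD': 2, 'DD/MM/YYYY': 1, 'DD-MM-YYYY': 1}
```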

๐Ÿ—๏ธ Architecture & Features

Why DataProfiler?

Built for Production Data Pipelines:

  • โšก 10x faster than pandas on large datasets
  • ๐ŸŒŠ Stream processing - analyze 100GB+ files without loading into memory
  • ๐Ÿ›ก๏ธ Robust parsing - handles malformed CSV, mixed data types, encoding issues
  • ๐Ÿ” Smart quality detection - catches issues pandas misses
  • ๐Ÿ—๏ธ Library-first - easy integration into existing workflows

Core Capabilities

Feature              DataProfiler          pandas                Great Expectations
Large File Support   ✅ Streaming          ❌ Memory bound       ❌ Memory bound
Quality Detection    ✅ Built-in           ⚠️ Manual             ✅ Rules-based
Performance          ✅ SIMD accelerated   ⚠️ Single-threaded    ❌ Spark overhead
Integration          ✅ Library API        ✅ Native Python      ⚠️ Configuration heavy
Dirty Data           ✅ Robust parsing     ❌ Fails on errors    ⚠️ Schema required
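"Robust parsing" in the table means tolerating ragged rows instead of aborting on the first bad line. A toy Python illustration; pad-or-truncate is just one possible recovery policy, not necessarily DataProfiler's:

```python
import csv
import io

def robust_rows(text, expected_cols):
    """Normalize ragged CSV rows: pad short rows with empty strings
    and truncate long ones, counting how many needed repair."""
    repaired = 0
    rows = []
    for row in csv.reader(io.StringIO(text)):
        if len(row) != expected_cols:
            repaired += 1
            row = (row + [""] * expected_cols)[:expected_cols]
        rows.append(row)
    return rows, repaired

text = "a,b,c\n1,2,3\n4,5\n6,7,8,9\n"
rows, repaired = robust_rows(text, expected_cols=3)
print(repaired)  # 2
```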

Technical Features

  • 🚀 SIMD Acceleration: Vectorized operations for 10x numeric performance
  • 🌊 True Streaming: Process files larger than available RAM
  • 🧠 Smart Algorithms: Vitter's reservoir sampling, statistical profiling
  • 🛡️ Robust Parsing: Handles malformed CSV, mixed encodings, variable columns
  • ⚠️ Quality Detection: Null patterns, duplicates, outliers, format inconsistencies
  • 📊 Multiple Formats: CSV, JSON, JSONL with unified API
  • 🔧 Configurable: Sampling strategies, quality thresholds, output formats
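On the sampling bullet: reservoir sampling keeps a uniform random sample of k items from a stream of unknown length in O(k) memory. Below is the classic Algorithm R in Python (Vitter's Algorithm Z is an optimized variant that skips ahead); this is a minimal sketch, not DataProfiler's implementation:

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Algorithm R: each item in the stream ends up in the
    k-item reservoir with equal probability k/n."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)      # fill the reservoir first
        else:
            j = rng.randrange(i + 1)    # uniform index in [0, i]
            if j < k:
                reservoir[j] = item     # replace with probability k/(i+1)
    return reservoir

sample = reservoir_sample(range(1_000_000), k=100, rng=random.Random(42))
print(len(sample))  # 100
```

This is why sampled profiling can bound memory regardless of input size while staying statistically representative.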

📋 All Options

Fast CSV data profiler with quality checking - v0.3.5 Database Connectors & Memory Safety Edition

Usage: dataprof [OPTIONS] <FILE>

Arguments:
  <FILE>  CSV file to analyze

Options:
  -q, --quality                  Enable quality checking (shows data issues)
      --html <HTML>              Generate HTML report (requires --quality)
      --streaming                Use streaming engine for large files (v0.3.5)
      --progress                 Show progress during processing (requires --streaming)
      --chunk-size <CHUNK_SIZE>  Override chunk size for streaming (default: adaptive)
      --sample <SAMPLE>          Enable sampling for very large datasets
  -h, --help                     Print help

๐Ÿ› ๏ธ As a Library

Add to your Cargo.toml:

[dependencies]
dataprof = { git = "https://github.com/AndreaBozzo/dataprof.git" }

use dataprof::analyze_csv;

let profiles = analyze_csv("data.csv")?;
for profile in profiles {
    println!("{}: {:?} ({}% nulls)",
             profile.name,
             profile.data_type,
             profile.null_count as f32 / profile.total_count as f32 * 100.0);
}

🎯 Supported Formats

  • CSV: Comma-separated values with auto-delimiter detection
  • JSON: JSON arrays with object records
  • JSONL: Line-delimited JSON (one object per line)
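A unified API over these three formats can be pictured as a single record iterator. This Python sketch is illustrative only and does not mirror the crate's internals:

```python
import csv
import io
import json

def iter_records(text, fmt):
    """Yield each record as a dict, regardless of input format."""
    if fmt == "csv":
        yield from csv.DictReader(io.StringIO(text))
    elif fmt == "json":
        yield from json.loads(text)        # JSON array of objects
    elif fmt == "jsonl":
        for line in text.splitlines():
            if line.strip():
                yield json.loads(line)     # one object per line
    else:
        raise ValueError(f"unsupported format: {fmt}")

csv_rows = list(iter_records("a,b\n1,2\n", "csv"))
jsonl_rows = list(iter_records('{"a": 1}\n{"a": 2}\n', "jsonl"))
print(csv_rows, jsonl_rows)
```

Profiling code downstream then only ever sees dict-shaped records, which is what makes one analysis path serve all three formats.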

⚡ Performance

  • Small files (<10MB): Analysis in milliseconds
  • Large files (100MB+): Smart sampling maintains accuracy
  • SIMD optimized: 10x faster numeric computations on modern CPUs
  • Memory bounded: Process files larger than available RAM
  • Example: 115MB file analyzed in 2.9s with 99.6% accuracy

🧪 Development

Requirements: Rust 1.70+

Quick Setup

# Automated setup (installs pre-commit hooks, tools)
bash scripts/setup-dev.sh        # Linux/macOS
# or
pwsh scripts/setup-dev.ps1        # Windows

# Manual setup
cargo build --release             # Build optimized
cargo test                        # Run all tests
cargo fmt                         # Format code
cargo clippy                      # Lint code

Development Tools

Using just (Recommended)

cargo install just                # Install task runner
just                              # Show all tasks
just dev                          # Quick development cycle
just check                        # Full quality checks
just test-lib                     # Fast library tests
just example data.csv             # Run example analysis

Using pre-commit (Quality Gates)

pip install pre-commit            # Install pre-commit
pre-commit install                # Install hooks
pre-commit run --all-files        # Run all checks

Manual Commands

cargo build --release             # Build optimized
cargo test --lib                  # Fast library tests
cargo test --test integration_tests # Integration tests
cargo test --test v03_comprehensive # Comprehensive tests
cargo fmt --all                   # Format code
cargo clippy --all-targets --all-features -- -D warnings # Lint

Quality Assurance

The project uses automated quality checks:

  • Pre-commit hooks: Format, lint, test on every commit
  • Continuous Integration: 61/61 tests passing (100% success rate)
  • Code coverage: All major functions tested
  • Performance benchmarks: Verified 10x SIMD improvements

๐Ÿค Contributing

See CONTRIBUTING.md for development guidelines.

📄 License

This project is licensed under the GNU General Public License v3.0 - see the LICENSE file for details.