High-performance data profiling with ISO 8000/25012 quality metrics.
The dataprof crate is the user-facing facade for profiling CSV, JSON,
JSONL, Parquet, database, DataFrame, and Arrow sources. Implementation
details live in the workspace crates under crates/; this package keeps the
public API compact and oriented around Profiler.
§Quick Start
use dataprof::Profiler;
let report = Profiler::new().analyze_file("data.csv")?;
println!("Rows: {}", report.execution.rows_processed);
println!("Quality: {:?}", report.quality_score());
Structs§
- AccuracyMetrics - Accuracy metrics (ISO 25012).
- AsyncSourceInfo - Metadata about an async data source for report construction and progress tracking.
- AsyncStreamingProfiler - Async streaming profiler that accepts AsyncDataSource instead of file paths.
- BooleanStats - Statistics for boolean columns.
- BytesSource - An in-memory byte buffer that implements AsyncDataSource.
- ColumnProfile - Profiling statistics for a single column.
- ColumnSchema - A single column’s name and inferred data type.
- CompletenessMetrics - Completeness metrics (ISO 8000-8).
- ConsistencyMetrics - Consistency metrics (ISO 8000-61).
- CsvDiagnostics
- CsvParserConfig - Configuration for CSV parsing and analysis.
- DatabaseConfig - Database configuration for connection strings and settings
- DatabaseCredentials - Database credentials management with environment variable support
- DataprofConfig - Main configuration structure for dataprof.
- DataprofConfigBuilder - Builder for constructing DataprofConfig with a fluent API.
- DateTimeStats - Statistics for date/datetime columns.
- ExecutionMetadata - Metadata about the profiling execution.
- FrequencyItem - A value and its frequency count within a column.
- HttpParquetReader - An asynchronous reader that fetches byte ranges from an HTTP server using HTTP Range requests. Designed specifically for remote Parquet parsing.
- InputValidator - Enhanced input validation with helpful error messages and suggestions
- JsonParserConfig - Configuration for JSON/JSONL parsing and scanning.
- MetricsCalculator - Engine for calculating comprehensive data quality metrics. Supports configurable ISO 8000/25012 thresholds.
- MySqlConnector - MySQL/MariaDB connector
- NumericStats - Statistics for numeric (integer or float) columns.
- ParquetConfig - Configuration options for Parquet analysis.
- ParquetMetadata - Metadata specific to Parquet files
- Pattern - A detected value pattern within a column (e.g. email, phone, UUID).
- PostgresConnector - PostgreSQL connector with connection pooling support
- ProfileReport - Complete profiling report for a data source.
- Profiler - Unified profiler with builder pattern
- ProfilerConfig - Plain-data configuration for a profiler
- QualityAssessment - Wraps quality metrics with confidence information.
- QualityMetrics - Comprehensive data quality metrics following industry standards.
- Quartiles - Quartile statistics for numeric distributions.
- ReqwestSource - An HTTP response body that implements AsyncDataSource.
- RetryConfig - Retry configuration for database operations
- RowCountEstimate - Result of a quick row count operation.
- SamplingConfig - Configuration for database table sampling
- SchemaResult - Result of fast schema inference: column names paired with inferred data types.
- SemanticHints - User-supplied semantic hints that affect profiling and quality metrics.
- SqliteConnector - SQLite embedded database connector
- SslConfig - SSL/TLS configuration for database connections
- StopEvaluator - Runtime evaluator that checks a StopCondition against accumulated counters.
- TextStats - Statistics for text/string columns.
- TimelinessMetrics - Timeliness metrics (ISO 8000-8).
- UniquenessMetrics - Uniqueness metrics (ISO 8000-110).
- ValidationError
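The Quartiles entry above describes quartile statistics for numeric distributions. As a rough illustration of what such a summary can contain, here is a minimal standalone sketch that computes Q1, the median, Q3, and the interquartile range using linear interpolation between closest ranks. The `QuartileSummary` type and both functions are hypothetical names introduced for this example; the interpolation method is an assumption and may differ from what dataprof actually computes.

```rust
/// Hypothetical quartile summary (not dataprof's `Quartiles` struct).
#[derive(Debug, PartialEq)]
struct QuartileSummary {
    q1: f64,
    median: f64,
    q3: f64,
    iqr: f64,
}

/// Percentile of an already sorted, non-empty slice, using linear
/// interpolation between the two closest ranks. `p` must be in [0, 1].
fn percentile(sorted: &[f64], p: f64) -> f64 {
    let rank = p * (sorted.len() - 1) as f64;
    let lo = rank.floor() as usize;
    let hi = rank.ceil() as usize;
    let frac = rank - lo as f64;
    sorted[lo] + (sorted[hi] - sorted[lo]) * frac
}

/// Computes quartiles over the finite values in `values`.
/// Returns `None` when no finite value is present.
fn quartiles(values: &[f64]) -> Option<QuartileSummary> {
    // Drop NaN and infinities so sorting and interpolation stay well defined.
    let mut sorted: Vec<f64> = values.iter().copied().filter(|v| v.is_finite()).collect();
    if sorted.is_empty() {
        return None;
    }
    sorted.sort_by(|a, b| a.total_cmp(b));

    let q1 = percentile(&sorted, 0.25);
    let median = percentile(&sorted, 0.50);
    let q3 = percentile(&sorted, 0.75);
    Some(QuartileSummary { q1, median, q3, iqr: q3 - q1 })
}
```

Calling `quartiles(&[5.0, 1.0, 4.0, 2.0, 3.0])` sorts the input internally, so callers do not need to pre-sort. Other quartile conventions (for example, nearest-rank) give slightly different values on small samples.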
Enums§
- ChunkSize
- ColumnStats - Type-specific statistics for a column, determined by the inferred data type.
- CountMethod - Method used to obtain the row count.
- DataFrameLibrary - Source library for in-memory DataFrame profiling
- DataProfilerError - Enhanced error types with more descriptive messages for DataProfiler
- DataSource - Source-agnostic data source metadata.
- DataType - Inferred column data type.
- DbSamplingStrategy - Available sampling strategies for large databases
- EngineType - Which engine to use for profiling
- FileFormat - Supported file formats for data profiling
- JsonFormat - JSON/JSONL format hint.
- MetricConfidence - Confidence level for quality metrics.
- MetricPack - High-level categories of analysis that can be selectively enabled.
- OutputFormat - Output format for serialized reports.
- PatternCategory - Semantic category for a detected pattern.
- ProgressEvent - Structured progress events emitted by profiling engines.
- ProgressSink - How progress events are delivered to the consumer.
- QualityDimension - ISO 25012 quality dimensions that can be selectively requested.
- QueryEngine - Supported query engines for SQL-based profiling
- SamplingStrategy
- StopCondition - A composable condition that can trigger early termination of profiling.
- TruncationReason - Reason why profiling was truncated before exhausting the source.
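StopCondition is described above as a composable condition, and StopEvaluator as a runtime evaluator that checks it against accumulated counters. The sketch below shows one way that pairing could work in principle. Every name in it (`StopWhen`, `Counters`, `Evaluator`, `record_row`) is hypothetical and chosen for illustration; the variants and methods are assumptions, not dataprof's actual API.

```rust
/// Counters accumulated while input is consumed (hypothetical).
#[derive(Debug, Default, Clone, Copy)]
struct Counters {
    rows: u64,
    bytes: u64,
}

/// Hypothetical composable stop condition; NOT dataprof's `StopCondition`.
#[derive(Debug, Clone)]
enum StopWhen {
    /// Stop once at least this many rows were seen.
    MaxRows(u64),
    /// Stop once at least this many bytes were seen.
    MaxBytes(u64),
    /// Stop when any nested condition is met.
    Any(Vec<StopWhen>),
    /// Stop when all nested conditions are met (an empty list counts as met).
    All(Vec<StopWhen>),
}

impl StopWhen {
    /// Recursively checks the condition against the current counters.
    fn is_met(&self, c: &Counters) -> bool {
        match self {
            StopWhen::MaxRows(n) => c.rows >= *n,
            StopWhen::MaxBytes(n) => c.bytes >= *n,
            StopWhen::Any(conds) => conds.iter().any(|s| s.is_met(c)),
            StopWhen::All(conds) => conds.iter().all(|s| s.is_met(c)),
        }
    }
}

/// Hypothetical evaluator that owns the condition and the running counters.
struct Evaluator {
    condition: StopWhen,
    counters: Counters,
}

impl Evaluator {
    fn new(condition: StopWhen) -> Self {
        Evaluator { condition, counters: Counters::default() }
    }

    /// Records one row of `bytes` bytes and returns true when the
    /// caller should stop reading further input.
    fn record_row(&mut self, bytes: u64) -> bool {
        self.counters.rows += 1;
        self.counters.bytes += bytes;
        self.condition.is_met(&self.counters)
    }
}
```

Keeping the condition as plain data and the counters in a separate evaluator means the same condition value can be reused across runs, while each run gets fresh counters.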
Traits§
- AsyncDataSource - A source of raw bytes that can be consumed asynchronously.
- DatabaseConnector - Trait that all database connectors must implement
Functions§
- analyze_column - Analyze a column with full profiling (includes pattern detection and unique counts)
- analyze_column_fast - Analyze a column in fast mode (skips expensive operations)
- analyze_csv_file - Analyze a CSV file, returning a full ProfileReport.
- analyze_csv_from_reader - Analyze CSV data from any Read source using streaming statistics.
- analyze_database - High-level function to analyze a database table or query.
- analyze_json_file - Analyze a JSON or JSONL file, returning a full ProfileReport.
- analyze_json_from_reader - Analyze JSON/JSONL data from a buffered reader using streaming statistics.
- analyze_parquet_async_http - Analyzes a remote Parquet file by fetching it via HTTP Range requests.
- analyze_parquet_with_config
- analyze_parquet_with_quality
- analyze_parquet_with_quality_dims
- calculate_datetime_stats
- calculate_numeric_stats
- calculate_text_stats
- create_connector - Factory function to create appropriate database connector
- detect_patterns - Detect common data patterns in a column.
- infer_schema - Infer the schema (column names + data types) of a file.
- infer_schema_async
- infer_schema_stream - Infer schema from any async byte stream.
- infer_type
- is_parquet_file - Check if a file is a valid Parquet file by examining its magic number.
- quick_quality_check - One-liner API for quick profiling with intelligent engine selection
- quick_quality_check_source - One-liner API for quick profiling from a DataSource
- quick_row_count - Quick row count (exact or estimated) for a file.
- quick_row_count_async
- quick_row_count_stream - Quick row count from any async byte stream.
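The is_parquet_file entry above mentions checking a file's magic number. For readers unfamiliar with the idea: a Parquet file begins and ends with the 4-byte sequence `PAR1`. The following standalone sketch shows how such a check can be done with the standard library alone. The function names `has_parquet_magic` and `looks_like_parquet` are hypothetical, and this is not dataprof's implementation, which may validate more (or less) than the two magic markers.

```rust
use std::fs::File;
use std::io::{self, Read, Seek, SeekFrom};
use std::path::Path;

/// Parquet files start and end with this 4-byte magic.
const PARQUET_MAGIC: &[u8; 4] = b"PAR1";

/// Lower bound on file size: leading magic (4) + footer length (4) + trailing magic (4).
const MIN_LEN: u64 = 12;

/// In-memory check: true if the buffer carries the magic at both ends.
fn has_parquet_magic(bytes: &[u8]) -> bool {
    bytes.len() as u64 >= MIN_LEN
        && bytes.starts_with(PARQUET_MAGIC)
        && bytes.ends_with(PARQUET_MAGIC)
}

/// File-based check (hypothetical helper): reads only the first and
/// last 4 bytes, so it stays cheap even for very large files.
fn looks_like_parquet(path: &Path) -> io::Result<bool> {
    let mut file = File::open(path)?;
    if file.metadata()?.len() < MIN_LEN {
        return Ok(false);
    }

    let mut head = [0u8; 4];
    let mut tail = [0u8; 4];
    file.read_exact(&mut head)?;
    file.seek(SeekFrom::End(-4))?;
    file.read_exact(&mut tail)?;

    Ok(&head == PARQUET_MAGIC && &tail == PARQUET_MAGIC)
}
```

A magic-number match only shows that the file is framed like Parquet; it does not prove the footer metadata is parseable, so a full reader can still reject a file that passes this check.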