Crate dataprof

High-performance data profiling with ISO 8000/25012 quality metrics.

The dataprof crate is the user-facing facade for profiling CSV, JSON, JSONL, Parquet, database, DataFrame, and Arrow sources. Implementation details live in the workspace crates under crates/; this package keeps the public API compact and oriented around Profiler.

§Quick Start

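use dataprof::Profiler;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Profile the file; `?` propagates any error to the caller.
    let report = Profiler::new().analyze_file("data.csv")?;
    println!("Rows: {}", report.execution.rows_processed);
    println!("Quality: {:?}", report.quality_score());
    Ok(())
}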

Structs§

AccuracyMetrics
Accuracy metrics (ISO 25012).
AsyncSourceInfo
Metadata about an async data source for report construction and progress tracking.
AsyncStreamingProfiler
Async streaming profiler that accepts AsyncDataSource instead of file paths.
BooleanStats
Statistics for boolean columns.
BytesSource
An in-memory byte buffer that implements AsyncDataSource.
ColumnProfile
Profiling statistics for a single column.
ColumnSchema
A single column’s name and inferred data type.
CompletenessMetrics
Completeness metrics (ISO 8000-8).
ConsistencyMetrics
Consistency metrics (ISO 8000-61).
CsvDiagnostics
CsvParserConfig
Configuration for CSV parsing and analysis.
DatabaseConfig
Database configuration for connection strings and settings
DatabaseCredentials
Database credentials management with environment variable support
DataprofConfig
Main configuration structure for dataprof.
DataprofConfigBuilder
Builder for constructing DataprofConfig with a fluent API.
DateTimeStats
Statistics for date/datetime columns.
ExecutionMetadata
Metadata about the profiling execution.
FrequencyItem
A value and its frequency count within a column.
HttpParquetReader
An asynchronous reader that fetches byte ranges from an HTTP server using HTTP Range requests. Designed specifically for remote Parquet parsing.
InputValidator
Enhanced input validation with helpful error messages and suggestions
JsonParserConfig
Configuration for JSON/JSONL parsing and scanning.
MetricsCalculator
Engine for calculating comprehensive data quality metrics. Supports configurable ISO 8000/25012 thresholds.
MySqlConnector
MySQL/MariaDB connector
NumericStats
Statistics for numeric (integer or float) columns.
ParquetConfig
Configuration options for Parquet analysis.
ParquetMetadata
Metadata specific to Parquet files
Pattern
A detected value pattern within a column (e.g. email, phone, UUID).
PostgresConnector
PostgreSQL connector with connection pooling support
ProfileReport
Complete profiling report for a data source.
Profiler
Unified profiler with builder pattern
ProfilerConfig
Plain-data configuration for a profiler
QualityAssessment
Wraps quality metrics with confidence information.
QualityMetrics
Comprehensive data quality metrics following industry standards.
Quartiles
Quartile statistics for numeric distributions.
ReqwestSource
An HTTP response body that implements AsyncDataSource.
RetryConfig
Retry configuration for database operations
RowCountEstimate
Result of a quick row count operation.
SamplingConfig
Configuration for database table sampling
SchemaResult
Result of fast schema inference — column names paired with inferred data types.
SemanticHints
User-supplied semantic hints that affect profiling and quality metrics.
SqliteConnector
SQLite embedded database connector
SslConfig
SSL/TLS configuration for database connections
StopEvaluator
Runtime evaluator that checks a StopCondition against accumulated counters.
TextStats
Statistics for text/string columns.
TimelinessMetrics
Timeliness metrics (ISO 8000-8).
UniquenessMetrics
Uniqueness metrics (ISO 8000-110).
ValidationError
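
To make the Quartiles entry above more concrete, here is a minimal, standard-library-only sketch of one common way to compute quartiles (linear interpolation between the two nearest sorted values). The names QuartileSketch, quantile, and quartiles are hypothetical and are not part of dataprof; the crate may use a different estimator or a streaming approximation.

```rust
/// Hypothetical stand-in for a quartile summary; NOT the crate's `Quartiles` type.
#[derive(Debug, Clone, Copy, PartialEq)]
struct QuartileSketch {
    q1: f64,
    median: f64,
    q3: f64,
}

/// Quantile of an already sorted, non-empty slice using linear interpolation
/// between the two nearest ranks.
fn quantile(sorted: &[f64], p: f64) -> f64 {
    let pos = p * (sorted.len() - 1) as f64;
    let lo = pos.floor() as usize;
    let hi = pos.ceil() as usize;
    let frac = pos - lo as f64;
    sorted[lo] + frac * (sorted[hi] - sorted[lo])
}

/// Returns `None` when no usable (non-NaN) values are present.
fn quartiles(values: &[f64]) -> Option<QuartileSketch> {
    // Drop NaN values so the sort order is meaningful.
    let mut sorted: Vec<f64> = values.iter().copied().filter(|v| !v.is_nan()).collect();
    if sorted.is_empty() {
        return None;
    }
    sorted.sort_by(|a, b| a.total_cmp(b));
    Some(QuartileSketch {
        q1: quantile(&sorted, 0.25),
        median: quantile(&sorted, 0.5),
        q3: quantile(&sorted, 0.75),
    })
}
```

Other quartile conventions (for example, taking the median of each half) give slightly different values on small inputs, so results from this sketch should not be expected to match the crate's output exactly.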

Enums§

ChunkSize
ColumnStats
Type-specific statistics for a column, determined by the inferred data type.
CountMethod
Method used to obtain the row count.
DataFrameLibrary
Source library for in-memory DataFrame profiling
DataProfilerError
Enhanced error types with more descriptive messages for DataProfiler
DataSource
Source-agnostic data source metadata.
DataType
Inferred column data type.
DbSamplingStrategy
Available sampling strategies for large databases
EngineType
Which engine to use for profiling
FileFormat
Supported file formats for data profiling
JsonFormat
JSON/JSONL format hint.
MetricConfidence
Confidence level for quality metrics.
MetricPack
High-level categories of analysis that can be selectively enabled.
OutputFormat
Output format for serialized reports.
PatternCategory
Semantic category for a detected pattern.
ProgressEvent
Structured progress events emitted by profiling engines.
ProgressSink
How progress events are delivered to the consumer.
QualityDimension
ISO 25012 quality dimensions that can be selectively requested.
QueryEngine
Supported query engines for SQL-based profiling
SamplingStrategy
StopCondition
A composable condition that can trigger early termination of profiling.
TruncationReason
Reason why profiling was truncated before exhausting the source.
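
The StopCondition and StopEvaluator entries describe a composable condition checked against accumulated counters. The sketch below illustrates that general idea only: the Stop enum, its variants, and the Counters struct are hypothetical and do not reflect the crate's actual definitions.

```rust
/// Counters accumulated while profiling (hypothetical shape).
#[derive(Debug, Clone, Copy, Default)]
struct Counters {
    rows: u64,
    bytes: u64,
}

/// Hypothetical composable stop condition; NOT the crate's `StopCondition`.
#[derive(Debug, Clone)]
enum Stop {
    /// Never stop early.
    Never,
    /// Stop once at least this many rows have been seen.
    MaxRows(u64),
    /// Stop once at least this many bytes have been read.
    MaxBytes(u64),
    /// Stop when any nested condition is met.
    Any(Vec<Stop>),
    /// Stop only when every nested condition is met.
    All(Vec<Stop>),
}

impl Stop {
    /// Evaluates the condition tree against the current counters.
    fn is_met(&self, c: &Counters) -> bool {
        match self {
            Stop::Never => false,
            Stop::MaxRows(limit) => c.rows >= *limit,
            Stop::MaxBytes(limit) => c.bytes >= *limit,
            Stop::Any(conds) => conds.iter().any(|s| s.is_met(c)),
            // An empty `All` is treated as "not met" so it cannot stop a run by accident.
            Stop::All(conds) => !conds.is_empty() && conds.iter().all(|s| s.is_met(c)),
        }
    }
}
```

In this sketch a profiling loop would update Counters after each chunk and call is_met to decide whether to stop. Nesting Any and All is what makes the condition composable.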

Traits§

AsyncDataSource
A source of raw bytes that can be consumed asynchronously.
DatabaseConnector
Trait that all database connectors must implement

Functions§

analyze_column
Analyze a column with full profiling (includes pattern detection and unique counts)
analyze_column_fast
Analyze a column in fast mode (skips expensive operations)
analyze_csv_file
Analyze a CSV file, returning a full ProfileReport.
analyze_csv_from_reader
Analyze CSV data from any Read source using streaming statistics.
analyze_database
High-level function to analyze a database table or query.
analyze_json_file
Analyze a JSON or JSONL file, returning a full ProfileReport.
analyze_json_from_reader
Analyze JSON/JSONL data from a buffered reader using streaming statistics.
analyze_parquet_async_http
Analyzes a remote Parquet file by fetching it via HTTP Range requests.
analyze_parquet_with_config
analyze_parquet_with_quality
analyze_parquet_with_quality_dims
calculate_datetime_stats
calculate_numeric_stats
calculate_text_stats
create_connector
Factory function to create appropriate database connector
detect_patterns
Detect common data patterns in a column.
infer_schema
Infer the schema (column names + data types) of a file.
infer_schema_async
infer_schema_stream
Infer schema from any async byte stream.
infer_type
is_parquet_file
Check if a file is a valid Parquet file by examining its magic number.
quick_quality_check
One-liner API for quick profiling with intelligent engine selection
quick_quality_check_source
One-liner API for quick profiling from a DataSource
quick_row_count
Quick row count (exact or estimated) for a file.
quick_row_count_async
quick_row_count_stream
Quick row count from any async byte stream.
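
The is_parquet_file entry mentions a magic-number check. As a rough illustration of that technique, the standard-library-only sketch below tests for the 4-byte ASCII marker "PAR1" that the Parquet format places at both the start and the end of a file. The function names has_parquet_magic and looks_like_parquet are hypothetical; the crate's own implementation and signature may differ.

```rust
use std::fs::File;
use std::io::{self, Read, Seek, SeekFrom};
use std::path::Path;

/// Parquet files begin and end with the 4-byte ASCII marker "PAR1".
const PARQUET_MAGIC: &[u8; 4] = b"PAR1";

/// Returns true when `bytes` carries the magic at both ends.
/// 12 bytes = leading magic + 4-byte footer length + trailing magic,
/// the smallest size at which both markers can be present.
fn has_parquet_magic(bytes: &[u8]) -> bool {
    bytes.len() >= 12
        && bytes.starts_with(PARQUET_MAGIC)
        && bytes.ends_with(PARQUET_MAGIC)
}

/// Hypothetical file-based variant: reads only the first and last four bytes,
/// so it stays cheap even for very large files.
fn looks_like_parquet(path: &Path) -> io::Result<bool> {
    let mut file = File::open(path)?;
    if file.metadata()?.len() < 12 {
        return Ok(false);
    }
    let mut head = [0u8; 4];
    file.read_exact(&mut head)?;
    file.seek(SeekFrom::End(-4))?;
    let mut tail = [0u8; 4];
    file.read_exact(&mut tail)?;
    Ok(&head == PARQUET_MAGIC && &tail == PARQUET_MAGIC)
}
```

A passing check only shows that the markers are present. It does not validate the footer metadata, so a file can pass this test and still fail to parse.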