High-performance data profiling with ISO 8000/25012 quality metrics.
dataprof analyzes tabular data (CSV, JSON, JSONL, Parquet, databases, DataFrames, Arrow) and produces column-level statistics, pattern detection, and a quality assessment scored against the ISO 8000 and ISO/IEC 25012 data quality standards.
§Quick Start
```rust
use dataprof::Profiler;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let report = Profiler::new().analyze_file("data.csv")?;
    println!("Rows: {}", report.execution.rows_processed);
    println!("Quality: {:?}", report.quality_score());
    Ok(())
}
```

§Engines
- `EngineType::Auto` — intelligent selection based on file size and format (default)
- `EngineType::Incremental` — true streaming with bounded memory
- `EngineType::Columnar` — Arrow-based batch processing
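An engine can be pinned explicitly instead of relying on `Auto`. The sketch below is a hypothetical wiring, not the crate's confirmed API: it assumes `ProfilerConfig` carries an engine field and that `Profiler` can be built from a config (the constructor name `with_config` is an assumption — consult the `api` module docs for the exact shape):

```rust
use dataprof::{EngineType, Profiler, ProfilerConfig};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Assumed field name: force the bounded-memory streaming engine
    // instead of letting EngineType::Auto choose one.
    let config = ProfilerConfig {
        engine: EngineType::Incremental,
        ..Default::default()
    };

    // Assumed constructor; the real entry point may differ.
    let report = Profiler::with_config(config).analyze_file("large.csv")?;
    println!("Rows: {}", report.execution.rows_processed);
    Ok(())
}
```

Pinning `Incremental` is the natural choice when input files exceed available memory, since it streams rows rather than materializing the whole table.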
§Feature Flags
| Feature | Description |
|---|---|
| `cli` (default) | CLI binary |
| `async-streaming` | Async profiling engine |
| `database` | Database connectivity |
| `postgres`, `mysql`, `sqlite` | Database connectors |
| `parquet-async` | Profile Parquet over HTTP |
| `python` | Python bindings via PyO3 |
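Features are enabled from `Cargo.toml` in the usual Cargo way. For example, to add database connectivity with the Postgres connector on top of the defaults (feature names taken from the table above; the version is a placeholder, not a recommendation):

```toml
[dependencies]
dataprof = { version = "*", features = ["database", "postgres"] }
```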
Re-exports§
```rust
pub use api::{EngineType, Profiler, ProfilerConfig, quick_quality_check, quick_quality_check_source};
pub use api::partial::{
    infer_schema, infer_schema_async, infer_schema_stream,
    quick_row_count, quick_row_count_async, quick_row_count_stream,
};
pub use types::{
    AccuracyMetrics, ColumnProfile, ColumnSchema, ColumnStats, CompletenessMetrics,
    ConsistencyMetrics, CountMethod, DataFrameLibrary, DataSource, DataType,
    ExecutionMetadata, FileFormat, MetricConfidence, OutputFormat, Pattern,
    ProfileReport, QualityAssessment, QualityDimension, QualityMetrics, QueryEngine,
    RowCountEstimate, SchemaResult, TimelinessMetrics, TruncationReason, UniquenessMetrics,
};
pub use core::config::{DataprofConfig, DataprofConfigBuilder};
pub use core::errors::DataProfilerError;
pub use core::progress::{ProgressEvent, ProgressSink};
pub use core::sampling::{ChunkSize, SamplingStrategy};
pub use core::stop_condition::{StopCondition, StopEvaluator};
pub use core::validation::{InputValidator, ValidationError};
pub use engines::DataFusionLoader;
pub use engines::streaming::{AsyncDataSource, AsyncSourceInfo, AsyncStreamingProfiler, BytesSource, ReqwestSource};
pub use parsers::CsvDiagnostics;
pub use parsers::csv::{CsvParserConfig, analyze_csv_file, analyze_csv_from_reader};
pub use parsers::json::{JsonFormat, JsonParserConfig, analyze_json_file, analyze_json_from_reader};
pub use parsers::parquet::{ParquetConfig, analyze_parquet_with_config, analyze_parquet_with_quality, is_parquet_file};
pub use analysis::{MetricsCalculator, analyze_column_fast, detect_patterns, infer_type};
pub use stats::{calculate_numeric_stats, calculate_text_stats};
pub use database::{
    DatabaseConfig, DatabaseConnector, DatabaseCredentials, MySqlConnector,
    PostgresConnector, RetryConfig, SamplingConfig, SamplingStrategy as DbSamplingStrategy,
    SqliteConnector, SslConfig, analyze_database, create_connector,
};
```
Modules§
- acceleration
- analysis
- api
- core
- database - Database connectivity module for DataProfiler
- engines
- output
- parsers
- python
- serde_helpers - Custom serde serialization helpers for formatting numeric values with appropriate precision
- stats
- types
Macros§
- process_rows_to_columns - Macro to process rows into a column-oriented HashMap. Used for single-query (non-streaming) profiling.
- streaming_profile_loop - Macro to generate the streaming batch loop for profiling queries. Handles the common pattern while allowing database-specific pool types. Includes inline row processing to avoid complex generic trait bounds.
Functions§
- check_memory_leaks - Global memory leak detection utility
- get_memory_usage_stats - Get global memory usage statistics