Crate dataprof

High-performance data profiling with ISO 8000/25012 quality metrics.

dataprof analyzes tabular data (CSV, JSON, JSONL, Parquet, databases, DataFrames, Arrow) and produces column-level statistics, pattern detection, and a quality assessment scored against the ISO 8000/25012 standard.

§Quick Start

```rust
use dataprof::Profiler;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let report = Profiler::new().analyze_file("data.csv")?;
    println!("Rows: {}", report.execution.rows_processed);
    println!("Quality: {:?}", report.quality_score());
    Ok(())
}
```

§Engines

Several profiling engines are re-exported from this crate, including the async streaming engine (`AsyncStreamingProfiler`, behind the `async-streaming` feature) and the DataFusion-backed loader (`DataFusionLoader`); engine selection is exposed through `EngineType`.

§Feature Flags

| Feature | Description |
|---|---|
| `cli` (default) | CLI binary |
| `async-streaming` | Async profiling engine |
| `database` | Database connectivity |
| `postgres`, `mysql`, `sqlite` | Database connectors |
| `parquet-async` | Profile Parquet over HTTP |
| `python` | Python bindings via PyO3 |
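Optional features are enabled through the dependency declaration in `Cargo.toml`. A minimal sketch (the version requirement is illustrative, not taken from this page):

```toml
[dependencies]
# "*" is a placeholder; pin a real version in practice.
dataprof = { version = "*", features = ["async-streaming", "database", "postgres"] }
```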

Re-exports§

pub use api::EngineType;
pub use api::Profiler;
pub use api::ProfilerConfig;
pub use api::quick_quality_check;
pub use api::quick_quality_check_source;
pub use api::partial::infer_schema;
pub use api::partial::quick_row_count;
pub use types::ColumnSchema;
pub use types::CountMethod;
pub use types::RowCountEstimate;
pub use types::SchemaResult;
pub use api::partial::infer_schema_async;
pub use api::partial::infer_schema_stream;
pub use api::partial::quick_row_count_async;
pub use api::partial::quick_row_count_stream;
pub use core::errors::DataProfilerError;
pub use core::sampling::ChunkSize;
pub use core::sampling::SamplingStrategy;
pub use parsers::CsvDiagnostics;
pub use core::config::DataprofConfig;
pub use core::config::DataprofConfigBuilder;
pub use core::stop_condition::StopCondition;
pub use core::stop_condition::StopEvaluator;
pub use core::validation::InputValidator;
pub use core::validation::ValidationError;
pub use core::progress::ProgressEvent;
pub use core::progress::ProgressSink;
pub use engines::streaming::AsyncDataSource;
pub use engines::streaming::AsyncSourceInfo;
pub use engines::streaming::AsyncStreamingProfiler;
pub use engines::streaming::BytesSource;
pub use engines::streaming::ReqwestSource;
pub use engines::DataFusionLoader;
pub use types::AccuracyMetrics;
pub use types::ColumnProfile;
pub use types::ColumnStats;
pub use types::CompletenessMetrics;
pub use types::ConsistencyMetrics;
pub use types::DataFrameLibrary;
pub use types::DataSource;
pub use types::DataType;
pub use types::ExecutionMetadata;
pub use types::FileFormat;
pub use types::MetricConfidence;
pub use types::OutputFormat;
pub use types::Pattern;
pub use types::ProfileReport;
pub use types::QualityAssessment;
pub use types::QualityDimension;
pub use types::QualityMetrics;
pub use types::QueryEngine;
pub use types::TimelinessMetrics;
pub use types::TruncationReason;
pub use types::UniquenessMetrics;
pub use parsers::csv::CsvParserConfig;
pub use parsers::csv::analyze_csv_file;
pub use parsers::csv::analyze_csv_from_reader;
pub use parsers::json::JsonFormat;
pub use parsers::json::JsonParserConfig;
pub use parsers::json::analyze_json_file;
pub use parsers::json::analyze_json_from_reader;
pub use parsers::parquet::ParquetConfig;
pub use parsers::parquet::analyze_parquet_with_config;
pub use parsers::parquet::analyze_parquet_with_quality;
pub use parsers::parquet::is_parquet_file;
pub use analysis::MetricsCalculator;
pub use analysis::analyze_column_fast;
pub use analysis::detect_patterns;
pub use analysis::infer_type;
pub use stats::calculate_numeric_stats;
pub use stats::calculate_text_stats;
pub use database::DatabaseConfig;
pub use database::DatabaseConnector;
pub use database::DatabaseCredentials;
pub use database::MySqlConnector;
pub use database::PostgresConnector;
pub use database::RetryConfig;
pub use database::SamplingConfig;
pub use database::SamplingStrategy as DbSamplingStrategy;
pub use database::SqliteConnector;
pub use database::SslConfig;
pub use database::analyze_database;
pub use database::create_connector;
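The re-exported `quick_quality_check`, `infer_schema`, and `quick_row_count` helpers suggest a lighter-weight path when a full profile is unnecessary. A hypothetical sketch, assuming each helper takes a file path and returns a `Result` (the exact signatures are not shown on this page, so treat them as assumptions):

```rust
use dataprof::{infer_schema, quick_quality_check, quick_row_count};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Estimate the row count without a full scan (hypothetical signature).
    let estimate = quick_row_count("data.csv")?;
    println!("rows: {:?}", estimate);

    // Infer the column schema only (hypothetical signature).
    let schema = infer_schema("data.csv")?;
    println!("schema: {:?}", schema);

    // One-shot quality score against ISO 8000/25012 (hypothetical signature).
    let score = quick_quality_check("data.csv")?;
    println!("quality: {:?}", score);
    Ok(())
}
```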

Modules§

acceleration
analysis
api
core
database
Database connectivity module for DataProfiler
engines
output
parsers
python
serde_helpers
Custom serde serialization helpers for formatting numeric values with appropriate precision
stats
types

Macros§

process_rows_to_columns
Macro to process rows into column-oriented HashMap. Used for single-query (non-streaming) profiling.
streaming_profile_loop
Macro to generate the streaming batch loop for profiling queries. Handles the common pattern while allowing database-specific pool types. Includes inline row processing to avoid complex generic trait bounds.

Functions§

check_memory_leaks
Global memory leak detection utility
get_memory_usage_stats
Get global memory usage statistics