Expand description
§Term - Data Validation for Rust
Term is a powerful data validation library inspired by AWS Deequ, providing comprehensive data quality checks without requiring Apache Spark. It leverages DataFusion for efficient query execution and includes built-in observability through OpenTelemetry.
§Overview
Term enables you to define and run data quality validations on your datasets, helping you ensure data correctness, completeness, and consistency. Whether you’re validating data in ETL pipelines, ensuring data quality in analytics workflows, or monitoring data drift in production, Term provides the tools you need.
§Quick Start
use term_guard::prelude::*;
use term_guard::core::{ValidationSuite, Check, Level, ConstraintStatus, builder_extensions::CompletenessOptions};
use term_guard::constraints::Assertion;
use datafusion::prelude::*;
// Create a validation suite
let suite = ValidationSuite::builder("user_data_validation")
    .check(
        Check::builder("critical_checks")
            .level(Level::Error)
            .completeness("user_id", CompletenessOptions::full().into_constraint_options()) // No nulls allowed
            .validates_uniqueness(vec!["user_id"], 1.0) // Must be unique
            .completeness("email", CompletenessOptions::threshold(0.95).into_constraint_options()) // 95% non-null
            .build()
    )
    .check(
        Check::builder("data_quality")
            .level(Level::Warning)
            .validates_regex("email", r"^[^@]+@[^@]+$", 0.98)
            .statistic("age", term_guard::constraints::StatisticType::Min, Assertion::GreaterThanOrEqual(0.0))
            .statistic("age", term_guard::constraints::StatisticType::Max, Assertion::LessThanOrEqual(120.0))
            .build()
    )
    .build();
// Create a DataFusion context with your data
let ctx = SessionContext::new();
// ... register your data tables ...
// Run validation
let results = suite.run(&ctx).await?;
// Check results
match &results {
    term_guard::core::ValidationResult::Success { report, .. } => {
        println!("Validation succeeded!");
        println!("Total checks: {}", report.metrics.total_checks);
    }
    term_guard::core::ValidationResult::Failure { report } => {
        println!("Validation failed!");
        for issue in &report.issues {
            println!("{}: {}", issue.check_name, issue.message);
        }
    }
}
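Because suite.run(&ctx) is awaited and the example propagates errors with ?, the Quick Start needs an async entry point. A minimal sketch, assuming the tokio runtime (which DataFusion is commonly driven by; any compatible async runtime works):
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // ... build the suite, register tables on the SessionContext, and run it as shown above ...
    Ok(())
}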
§Key Features
§Comprehensive Validation Constraints
- Completeness: Check for null values and missing data
- Uniqueness: Ensure values are unique (single or multi-column)
- Patterns: Validate data against regex patterns
- Statistics: Min, max, mean, sum, standard deviation checks
- Data Types: Ensure consistent data types
- Custom SQL: Define complex validation logic with SQL expressions (a hypothetical sketch follows this list)
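Completeness, uniqueness, pattern, and statistic constraints all appear in the Quick Start above. The snippet below is only a hypothetical sketch of how a SQL-expression constraint might read; the satisfies method name is borrowed from Deequ and is an assumption, not an API documented on this page:
use term_guard::core::Check;
use term_guard::constraints::Assertion;
// HYPOTHETICAL: `satisfies` is an assumed, Deequ-style method name and is not
// confirmed by this crate's documentation.
let check = Check::builder("custom_sql_check")
    // Assert that at least 99% of rows satisfy an arbitrary SQL predicate.
    .satisfies("age BETWEEN 0 AND 120", "age_in_range", Assertion::GreaterThanOrEqual(0.99))
    .build();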
§Performance Optimization
Term includes a query optimizer that dramatically improves performance:
use term_guard::core::ValidationSuite;
let suite = ValidationSuite::builder("optimized_validation")
    .with_optimizer(true) // Enable query optimization
    // .check(/* your checks */)
    .build();
The optimizer combines multiple constraints into single queries when possible, reducing table scans and improving performance by up to 15x for suites with many constraints.
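The payoff is largest when many constraints target the same table, since the optimizer can fold them into a shared scan. A sketch using only builder methods already shown on this page (the table and column names are illustrative):
use term_guard::core::{ValidationSuite, Check, Level, builder_extensions::CompletenessOptions};
use term_guard::constraints::{Assertion, StatisticType};
// Several constraints over one table: with the optimizer enabled they can be
// evaluated with far fewer scans than one query per constraint.
let suite = ValidationSuite::builder("orders_validation")
    .with_optimizer(true)
    .check(
        Check::builder("orders_checks")
            .level(Level::Error)
            .completeness("order_id", CompletenessOptions::full().into_constraint_options())
            .validates_uniqueness(vec!["order_id"], 1.0)
            .completeness("customer_id", CompletenessOptions::threshold(0.99).into_constraint_options())
            .statistic("amount", StatisticType::Min, Assertion::GreaterThanOrEqual(0.0))
            .build()
    )
    .build();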
§Multiple Data Sources
Term supports various data sources through the sources module; a minimal DataFusion registration sketch follows the list:
- CSV files
- Parquet files
- JSON files
- PostgreSQL databases
- Cloud storage (S3, Azure Blob, Google Cloud Storage)
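Term ships its own connectors in the sources module (not shown here). As a hedged alternative, tables can also be registered directly on the DataFusion SessionContext that the suite runs against; the table names and file paths below are illustrative:
use datafusion::prelude::*;
async fn register_tables() -> datafusion::error::Result<SessionContext> {
    let ctx = SessionContext::new();
    // Register a CSV file as a queryable table.
    ctx.register_csv("users", "data/users.csv", CsvReadOptions::new()).await?;
    // Register a Parquet file alongside it.
    ctx.register_parquet("orders", "data/orders.parquet", ParquetReadOptions::default()).await?;
    Ok(ctx)
}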
§Observability
Built-in OpenTelemetry integration provides:
- Distributed tracing for validation runs
- Metrics for constraint evaluation performance
- Structured logging with the tracing crate (a minimal subscriber setup is sketched after the snippet below)
use term_guard::telemetry::TermTelemetry;
use opentelemetry::trace::Tracer;
// User configures their own tracer
let tracer = opentelemetry_jaeger::new_agent_pipeline()
    .with_service_name("data-validation")
    .install_simple()?;
let telemetry = TermTelemetry::new(tracer);
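For the structured-logging side, any tracing subscriber will capture Term's log events. A minimal sketch, assuming the tracing-subscriber crate (an assumption, not a requirement stated here):
// Install a global `tracing` subscriber so Term's structured log events are emitted.
fn init_logging() {
    tracing_subscriber::fmt()
        .with_target(true) // include the emitting module path in each record
        .init();
}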
§Architecture
Term is built on a modular architecture:
- analyzers: Advanced data analysis framework including:
  - Type Inference Engine: Automatic data type detection with confidence scores
  - Column Profiler: Three-pass algorithm for comprehensive column analysis
  - Basic & Advanced Analyzers: Metrics computation (mean, entropy, correlation, etc.)
- core: Core types like Check, ValidationSuite, and ConstraintResult
- constraints: All validation constraint implementations
- sources: Data source connectors and loaders
- optimizer: Query optimization engine
- telemetry: OpenTelemetry integration
- formatters: Result formatting utilities
§Examples
See the examples directory for complete examples:
- basic_validation.rs: Simple validation example
- tpc_h_validation.rs: TPC-H benchmark data validation
- cloud_storage_example.rs: Validating data in cloud storage
- deequ_migration.rs: Migrating from Deequ to Term
§Migration from Deequ
Term provides similar APIs to Deequ, making migration straightforward:
use term_guard::core::{Check, builder_extensions::CompletenessOptions};
use term_guard::constraints::Assertion;
// Deequ-style checks in Term
let check = Check::builder("data_quality")
    .has_size(Assertion::GreaterThan(1000.0))
    .completeness("id", CompletenessOptions::full().into_constraint_options())
    .completeness("name", CompletenessOptions::threshold(0.98).into_constraint_options())
    .validates_uniqueness(vec!["id"], 1.0)
    .build();
Modules§
- analyzers
- Core analyzer framework for computing metrics from data.
- constraints
- Built-in constraint implementations for data validation.
- core
- Core validation types for the Term data quality library.
- error
- Error types for the Term data validation library.
- formatters
- Result formatting and reporting for Term validation results.
- logging
- Logging utilities and configuration for Term.
- optimizer
- Query optimization for validation constraints.
- prelude
- Prelude for commonly used types and traits in term-guard.
- repository
- Metrics repository framework for persisting and querying analyzer results.
- security
- Security utilities for Term data validation library.
- sources
- Data source connectors for Term validation library.
- telemetry
- OpenTelemetry integration for Term validation library.
Macros§
- impl_unified_constraint - Helper macro for implementing common constraint patterns.
- log_constraint - Macro for conditional constraint logging.
- log_data_op - Macro for conditional data operation logging.
- perf_debug - Macro for performance-sensitive debug logging.