Data quality assessment for ML pipelines
Detects data quality issues including missing values, outliers, duplicates, and schema problems.
§100-Point Quality Scoring System (GH-6)
Based on the Toyota Way principles of Jidoka (built-in quality) and the Doctest Corpus QA Checklist for Publication.
§Severity Weights
- Critical (2.0x): Blocks publication - data integrity failures
- High (1.5x): Major issues requiring immediate attention
- Medium (1.0x): Standard issues to address before publication
- Low (0.5x): Minor issues, informational
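The page lists the severity weights but not the formula that combines them, so the following is only a minimal sketch of one plausible scheme: a weighted pass rate scaled to 100 points. `SeverityLevel` and `weighted_score` are hypothetical stand-ins for illustration and are not the crate's actual `Severity` enum or scoring code.

```rust
/// Local stand-in for the four severity levels listed above
/// (hypothetical; not the crate's own `Severity` type).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum SeverityLevel {
    Critical,
    High,
    Medium,
    Low,
}

impl SeverityLevel {
    /// Weight multiplier taken from the list above.
    pub fn weight(self) -> f64 {
        match self {
            SeverityLevel::Critical => 2.0,
            SeverityLevel::High => 1.5,
            SeverityLevel::Medium => 1.0,
            SeverityLevel::Low => 0.5,
        }
    }
}

/// Assumed formula: 100 * (weight of passed checks) / (weight of all checks).
/// Each entry is (severity of the check, whether it passed).
pub fn weighted_score(checks: &[(SeverityLevel, bool)]) -> f64 {
    let total: f64 = checks.iter().map(|(s, _)| s.weight()).sum();
    if total == 0.0 {
        // No checks were run, so nothing counted against the dataset.
        return 100.0;
    }
    let earned: f64 = checks
        .iter()
        .filter(|(_, passed)| *passed)
        .map(|(s, _)| s.weight())
        .sum();
    100.0 * earned / total
}
```

Under this assumed formula, a failed Critical check costs four times as much as a failed Low check, which matches the intent of the weights even if the real computation differs.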
§Letter Grades
- A (95-100): Publish immediately
- B (85-94): Publish with documented caveats
- C (70-84): Remediation required before publication
- D (50-69): Major rework needed
- F (<50): Do not publish
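The grade bands above can be expressed as a simple threshold lookup. This is a sketch only: `Grade` and `grade_for` are hypothetical names, not the crate's `LetterGrade` API, and treating fractional scores such as 94.5 as falling in the lower band is an assumption.

```rust
/// Local stand-in for the letter grades listed above
/// (hypothetical; not the crate's own `LetterGrade` type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Grade {
    A,
    B,
    C,
    D,
    F,
}

/// Maps a 0-100 score to a grade using the lower bound of each band.
/// Anything below 50 (including NaN) falls through to `F`.
pub fn grade_for(score: f64) -> Grade {
    match score {
        s if s >= 95.0 => Grade::A,
        s if s >= 85.0 => Grade::B,
        s if s >= 70.0 => Grade::C,
        s if s >= 50.0 => Grade::D,
        _ => Grade::F,
    }
}
```

Checking the bands from the top down means each arm needs only its lower bound, so adjacent bands cannot overlap or leave gaps.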
§Example
use alimentar::quality::{QualityChecker, QualityScore};

let checker = QualityChecker::new()
    .max_null_ratio(0.1)
    .max_duplicate_ratio(0.05);
let report = checker.check(&dataset)?;
let score = QualityScore::from_report(&report);
println!("Grade: {} ({})", score.grade, score.score);
§References
- [1] Batini & Scannapieco (2016). Data and Information Quality.
- [6] Hynes et al. (2017). The Data Linter. NIPS Workshop on ML Systems.
Re-exports§
pub use decontaminate::check_contamination;
pub use decontaminate::ngram_overlap;
pub use decontaminate::ContaminationResult;
pub use decontaminate::DecontaminationReport;
Modules§
- decontaminate - N-gram decontamination for benchmark safety.
Structs§
- ChecklistItem - A scored quality check item from the 100-point checklist
- ColumnQuality - Quality statistics for a single column
- NumericStats - Basic statistics for numeric columns
- QualityChecker - Data quality checker
- QualityProfile - Quality profile for customizing scoring rules per data type
- QualityReport - Overall data quality report
- QualityScore - Complete quality score with breakdown
- QualityThresholds - Configuration thresholds for quality checking
- SeverityStats - Statistics for a severity level
- TextColumnStats - Statistics for a text (string) column; useful for ML classification audits
Enums§
- LetterGrade - Letter grades for dataset quality
- QualityIssue - Types of data quality issues
- Severity - Severity levels for quality issues per QA checklist