Module quality


Data quality assessment for ML pipelines

Detects data quality issues including missing values, outliers, duplicates, and schema problems.

§100-Point Quality Scoring System (GH-6)

Based on the Toyota Way principle of Jidoka (built-in quality) and the Doctest Corpus QA Checklist for Publication.

§Severity Weights

  • Critical (2.0x): Blocks publication; data integrity failures
  • High (1.5x): Major issues requiring immediate attention
  • Medium (1.0x): Standard issues to address before publication
  • Low (0.5x): Minor issues, informational
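
The multipliers above could feed into a 100-point score in several ways; the module's actual formula is not stated here. The following is a minimal sketch under one assumption: each check contributes its severity weight, and the score is the weighted share of passed checks. `SeverityLevel` and `weighted_score` are hypothetical names for illustration, not the crate's `Severity` or `QualityScore` API.

```rust
/// Hypothetical stand-in for the module's severity levels (illustration only).
#[derive(Debug, Clone, Copy, PartialEq)]
enum SeverityLevel {
    Critical,
    High,
    Medium,
    Low,
}

impl SeverityLevel {
    /// Multiplier taken from the Severity Weights list.
    fn weight(self) -> f64 {
        match self {
            SeverityLevel::Critical => 2.0,
            SeverityLevel::High => 1.5,
            SeverityLevel::Medium => 1.0,
            SeverityLevel::Low => 0.5,
        }
    }
}

/// Assumed aggregation: weighted share of passed checks, scaled to 0..=100.
/// Each entry is (severity, passed). Returns `None` for an empty checklist.
fn weighted_score(checks: &[(SeverityLevel, bool)]) -> Option<f64> {
    let total: f64 = checks.iter().map(|(s, _)| s.weight()).sum();
    if total == 0.0 {
        return None;
    }
    let earned: f64 = checks
        .iter()
        .filter(|(_, passed)| *passed)
        .map(|(s, _)| s.weight())
        .sum();
    Some(100.0 * earned / total)
}
```

Under this scheme a failed Critical check costs four times as much as a failed Low check, which is the practical effect of the weights.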

§Letter Grades

  • A (95-100): Publish immediately
  • B (85-94): Publish with documented caveats
  • C (70-84): Remediation required before publication
  • D (50-69): Major rework needed
  • F (<50): Do not publish
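
The grade bands above amount to a threshold lookup. A minimal sketch follows, assuming each band's lower bound is inclusive, so that a fractional score such as 94.5 falls into B. `Grade` and `grade_for` are hypothetical names, not the crate's `LetterGrade` API.

```rust
/// Hypothetical grade enum mirroring the Letter Grades list (illustration only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Grade {
    A,
    B,
    C,
    D,
    F,
}

/// Map a 0..=100 score to a grade, treating each lower bound as inclusive.
/// Anything below 50, including NaN, maps to F.
fn grade_for(score: f64) -> Grade {
    match score {
        s if s >= 95.0 => Grade::A,
        s if s >= 85.0 => Grade::B,
        s if s >= 70.0 => Grade::C,
        s if s >= 50.0 => Grade::D,
        _ => Grade::F,
    }
}
```

For example, `grade_for(84.0)` yields `Grade::C`, which corresponds to "Remediation required before publication".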

§Example

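use alimentar::quality::{QualityChecker, QualityScore};

// Thresholds: null ratio up to 0.1, duplicate ratio up to 0.05.
let checker = QualityChecker::new()
    .max_null_ratio(0.1)
    .max_duplicate_ratio(0.05);

// `dataset` is assumed to have been loaded earlier; `?` needs an
// enclosing function that returns a `Result`.
let report = checker.check(&dataset)?;
// Derive the score and letter grade from the report.
let score = QualityScore::from_report(&report);
println!("Grade: {} ({})", score.grade, score.score);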

§References

  • [1] Batini & Scannapieco (2016). Data and Information Quality.
  • [6] Hynes et al. (2017). The Data Linter. NIPS Workshop on ML Systems.

Re-exports§

pub use decontaminate::check_contamination;
pub use decontaminate::ngram_overlap;
pub use decontaminate::ContaminationResult;
pub use decontaminate::DecontaminationReport;

Modules§

decontaminate
N-gram decontamination for benchmark safety.
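
As a rough illustration of the idea behind n-gram decontamination, the sketch below measures what fraction of a sample's word n-grams also occur in a benchmark text. It is an assumption-laden example: `word_ngrams` and `overlap_ratio` are hypothetical helpers, and the tokenization (lowercased, whitespace-split words) is chosen for brevity. It does not reflect the signature or behavior of the crate's `ngram_overlap` or `check_contamination`.

```rust
use std::collections::HashSet;

/// Collect the distinct word n-grams of `text` (lowercased, whitespace-split).
fn word_ngrams(text: &str, n: usize) -> HashSet<Vec<String>> {
    let words: Vec<String> = text
        .split_whitespace()
        .map(|w| w.to_lowercase())
        .collect();
    // `windows(0)` would panic, and short texts have no n-grams.
    if n == 0 || words.len() < n {
        return HashSet::new();
    }
    words.windows(n).map(|w| w.to_vec()).collect()
}

/// Fraction of the sample's distinct n-grams that also appear in `reference`.
/// Returns 0.0 when the sample has no n-grams of the requested size.
fn overlap_ratio(sample: &str, reference: &str, n: usize) -> f64 {
    let sample_grams = word_ngrams(sample, n);
    if sample_grams.is_empty() {
        return 0.0;
    }
    let reference_grams = word_ngrams(reference, n);
    let shared = sample_grams.intersection(&reference_grams).count();
    shared as f64 / sample_grams.len() as f64
}
```

A high ratio suggests the sample may have leaked from the benchmark; where to set the cutoff is a policy choice this sketch leaves open.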

Structs§

ChecklistItem
A scored quality check item from the 100-point checklist
ColumnQuality
Quality statistics for a single column
NumericStats
Basic statistics for numeric columns
QualityChecker
Data quality checker
QualityProfile
Quality profile for customizing scoring rules per data type.
QualityReport
Overall data quality report
QualityScore
Complete quality score with breakdown
QualityThresholds
Configuration thresholds for quality checking
SeverityStats
Statistics for a severity level
TextColumnStats
Statistics for a text (string) column — useful for ML classification audits.

Enums§

LetterGrade
Letter grades for dataset quality
QualityIssue
Types of data quality issues
Severity
Severity levels for quality issues per QA checklist