Module data_quality

Data quality variations for realistic synthetic data.

This module provides tools to introduce realistic data quality issues:

  • Missing values (configurable by field)
  • Format variations (dates, amounts, identifiers)
  • Duplicates (exact and near-duplicates)
  • Typos (substitution, transposition, insertion, deletion)
  • Encoding issues (character corruption)
  • Quality issue labels for ML training

These variations make synthetic data more realistic for testing data cleaning, ETL pipelines, and data quality tools.
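The four typo kinds listed above can be illustrated with a minimal, standard-library-only sketch. The `TypoKind` enum and `apply_typo` function are hypothetical names chosen for this example and are not the module's `TypoType` or `TypoGenerator` API. The position and replacement character are passed in explicitly so the result is deterministic; a real generator would presumably choose them at random.

```rust
/// Hypothetical enum for this sketch: the four typo kinds named above.
#[derive(Debug, Clone, Copy)]
enum TypoKind {
    Substitution(char),
    Transposition,
    Insertion(char),
    Deletion,
}

/// Applies one typo at character position `pos`.
/// Returns `None` when `pos` is out of range for the requested kind.
fn apply_typo(input: &str, pos: usize, kind: TypoKind) -> Option<String> {
    let mut chars: Vec<char> = input.chars().collect();
    match kind {
        // Replace the character at `pos`.
        TypoKind::Substitution(c) => {
            *chars.get_mut(pos)? = c;
        }
        // Swap the character at `pos` with its right-hand neighbour.
        TypoKind::Transposition => {
            if pos + 1 >= chars.len() {
                return None;
            }
            chars.swap(pos, pos + 1);
        }
        // Insert a new character before `pos` (appending is allowed).
        TypoKind::Insertion(c) => {
            if pos > chars.len() {
                return None;
            }
            chars.insert(pos, c);
        }
        // Remove the character at `pos`.
        TypoKind::Deletion => {
            if pos >= chars.len() {
                return None;
            }
            chars.remove(pos);
        }
    }
    Some(chars.into_iter().collect())
}

fn main() {
    let word = "invoice";
    let kinds = [
        TypoKind::Substitution('b'),
        TypoKind::Transposition,
        TypoKind::Insertion('x'),
        TypoKind::Deletion,
    ];
    for kind in kinds {
        println!("{:?} -> {:?}", kind, apply_typo(word, 2, kind));
    }
}
```

Working on a `Vec<char>` rather than on bytes keeps the sketch safe for non-ASCII input, at the cost of one allocation per call.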

Structs§

DataQualityConfig
Configuration for the data quality injector.
DataQualityConfigBuilder
Builder for DataQualityConfig.
DataQualityInjector
Main data quality injector.
DataQualityStats
Combined statistics for all data quality issues.
DuplicateConfig
Configuration for duplicate generation.
DuplicateDetector
Detects potential duplicates in a dataset.
DuplicateGenerator
Generates exact and near-duplicate records.
DuplicateRecord
A duplicate record with metadata.
DuplicateStats
Statistics for duplicate generation.
FormatVariationConfig
Configuration for format variations.
FormatVariationInjector
Format variation injector.
FormatVariationStats
Statistics for format variations.
Homophones
Common homophones (words that sound alike).
KeyboardLayout
QWERTY keyboard layout for nearby key substitution.
MissingCondition
Condition for MAR (missing at random) values.
MissingPattern
Pattern for MNAR (missing not at random) values.
MissingValueConfig
Configuration for missing values by field.
MissingValueInjector
Missing value injector.
MissingValueStats
Statistics about missing values.
OCRConfusions
Visually similar characters that OCR often confuses.
QualityIssue
A data quality issue record.
QualityIssueLabel
A label describing a data quality issue for ML training.
QualityLabelSummary
Summary statistics for quality labels.
QualityLabels
Collection of quality issue labels with aggregation methods.
TypoConfig
Configuration for typo generation.
TypoGenerator
Typo generator.
TypoStats
Statistics for typo generation.
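The struct list distinguishes MAR (missing at random) from MNAR (missing not at random) values. The difference can be sketched as follows: under MAR, whether a value is missing depends on another observed field; under MNAR, it depends on the missing value itself. The `Row` type and both functions are hypothetical and do not reflect the fields of `MissingCondition` or `MissingPattern`. The rules here always fire, which is a simplification; a real injector would likely apply them with some probability.

```rust
/// Hypothetical record type used only for this illustration.
#[derive(Debug, Clone, PartialEq)]
struct Row {
    region: &'static str,
    amount: Option<f64>,
}

/// MAR: `amount` goes missing based on another observed field (`region`).
fn drop_mar(rows: &mut [Row], region: &str) {
    for row in rows.iter_mut().filter(|r| r.region == region) {
        row.amount = None;
    }
}

/// MNAR: `amount` goes missing based on its own (soon unobserved) value.
fn drop_mnar(rows: &mut [Row], threshold: f64) {
    for row in rows.iter_mut() {
        if row.amount.map_or(false, |a| a > threshold) {
            row.amount = None;
        }
    }
}

/// Small fixed dataset so the example is deterministic.
fn sample() -> Vec<Row> {
    vec![
        Row { region: "EU", amount: Some(120.0) },
        Row { region: "US", amount: Some(9500.0) },
        Row { region: "EU", amount: Some(40.0) },
        Row { region: "US", amount: Some(75.0) },
    ]
}

fn main() {
    let mut mar = sample();
    drop_mar(&mut mar, "EU");
    println!("MAR:  {:?}", mar);

    let mut mnar = sample();
    drop_mnar(&mut mnar, 1000.0);
    println!("MNAR: {:?}", mnar);
}
```

The distinction matters for testing imputation: the MAR gaps can in principle be explained by `region`, while the MNAR gaps remove exactly the large amounts and bias any statistic computed from what remains.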

Enums§

AmountFormat
Amount format variations.
ConditionType
Type of condition for missing values.
DateFormat
Date format variations.
DuplicateType
Type of duplicate.
EncodingIssue
Encoding issue types.
IdentifierFormat
Identifier format variations.
LabeledIssueType
Type of data quality issue.
MissingValue
Represents a missing value placeholder.
MissingValueStrategy
Strategy for missing value injection.
PatternType
Type of pattern for MNAR (missing not at random) values.
QualityIssueSubtype
Subtype providing more detail about the issue.
QualityIssueType
Type of quality issue.
TextFormat
Text format variations.
TypoType
Type of typo/error.
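As an illustration of what a format variation means in practice, the sketch below renders one calendar date in several common layouts. The `DateStyle` enum and its variants are assumptions made for this example; the actual variants of `DateFormat` are not listed on this page.

```rust
/// Hypothetical set of date layouts; not the module's `DateFormat`.
#[derive(Debug, Clone, Copy)]
enum DateStyle {
    /// 2024-03-07
    Iso,
    /// 03/07/2024 (month first)
    UsSlash,
    /// 07.03.2024 (day first)
    EuDot,
    /// 20240307
    Compact,
}

/// Renders the same date in the requested layout.
/// No calendar validation is performed in this sketch.
fn format_date(year: i32, month: u32, day: u32, style: DateStyle) -> String {
    match style {
        DateStyle::Iso => format!("{:04}-{:02}-{:02}", year, month, day),
        DateStyle::UsSlash => format!("{:02}/{:02}/{:04}", month, day, year),
        DateStyle::EuDot => format!("{:02}.{:02}.{:04}", day, month, year),
        DateStyle::Compact => format!("{:04}{:02}{:02}", year, month, day),
    }
}

fn main() {
    let styles = [
        DateStyle::Iso,
        DateStyle::UsSlash,
        DateStyle::EuDot,
        DateStyle::Compact,
    ];
    for style in styles {
        println!("{:?}: {}", style, format_date(2024, 3, 7, style));
    }
}
```

Mixing layouts such as month-first and day-first within one column is a typical source of ambiguity that a data cleaning pipeline has to resolve.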

Traits§

Duplicatable
Trait for records that can be duplicated.
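The methods of `Duplicatable` are not shown on this page. As a rough idea of how a trait can separate exact duplicates from near-duplicates, the sketch below uses a hypothetical `MakeDuplicate` trait; its name, its methods, and the `Customer` type are all assumptions for illustration.

```rust
/// Hypothetical trait for this sketch; not the module's `Duplicatable`.
trait MakeDuplicate: Clone {
    /// Returns a copy that differs slightly from `self`.
    fn near_duplicate(&self) -> Self;

    /// Returns an identical copy.
    fn exact_duplicate(&self) -> Self {
        self.clone()
    }
}

/// Hypothetical record type.
#[derive(Debug, Clone, PartialEq)]
struct Customer {
    id: u32,
    name: String,
}

impl MakeDuplicate for Customer {
    fn near_duplicate(&self) -> Self {
        // Assumed variation: a new id, upper-cased name, and a stray leading space.
        Customer {
            id: self.id + 1000,
            name: format!(" {}", self.name.to_uppercase()),
        }
    }
}

/// Normalises a name so that the near-duplicate above compares equal.
fn normalise(name: &str) -> String {
    name.trim().to_lowercase()
}

fn main() {
    let original = Customer { id: 7, name: "Acme Ltd".to_string() };
    let exact = original.exact_duplicate();
    let near = original.near_duplicate();

    println!("exact == original: {}", exact == original);
    println!("near  == original: {}", near == original);
    println!(
        "names match after normalising: {}",
        normalise(&near.name) == normalise(&original.name)
    );
}
```

An exact duplicate is caught by plain equality, whereas a near-duplicate only surfaces after normalisation or fuzzy matching, which is the case a duplicate detector has to handle.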

Functions§

introduce_encoding_issue
Introduces encoding issues.
random_missing_representation
Selects a random missing value representation.
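The signatures of these two functions are not shown on this page. The sketch below illustrates the two ideas with hypothetical stand-ins: `mojibake` simulates one common kind of character corruption (UTF-8 bytes misread as Latin-1), and `missing_representation` picks one of several placeholder strings. The list of placeholders is an assumption, and a caller-supplied `seed` replaces a random number generator so the example stays deterministic.

```rust
/// Simulates mojibake: each UTF-8 byte is reinterpreted as one
/// Latin-1 (ISO-8859-1) character. ASCII text is left unchanged.
fn mojibake(text: &str) -> String {
    // `u8 as char` maps a byte to the code point U+0000..=U+00FF,
    // which is exactly how Latin-1 decodes a byte.
    text.bytes().map(|b| b as char).collect()
}

/// Assumed placeholder strings for a missing value.
const MISSING_REPRESENTATIONS: [&str; 5] = ["", "NULL", "N/A", "null", "-"];

/// Picks a placeholder from the list above based on `seed`.
fn missing_representation(seed: usize) -> &'static str {
    MISSING_REPRESENTATIONS[seed % MISSING_REPRESENTATIONS.len()]
}

fn main() {
    println!("{}", mojibake("café"));
    println!("{:?}", missing_representation(7));
}
```

For the input `café`, the two UTF-8 bytes of `é` (0xC3, 0xA9) become the two characters `Ã` and `©`, giving `cafÃ©`.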