Data quality variations for realistic synthetic data.
This module provides tools to introduce realistic data quality issues:
- Missing values (configurable by field)
- Format variations (dates, amounts, identifiers)
- Duplicates (exact and near-duplicates)
- Typos (substitution, transposition, insertion, deletion)
- Encoding issues (character corruption)
- Labels describing each injected issue, for ML training
These variations make synthetic data more realistic for testing data cleaning, ETL pipelines, and data quality tools.
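The four typo kinds listed above can be sketched as follows. This is a minimal illustration, not the module's API: `TypoKind` and `apply_typo` are hypothetical names, and the position and replacement character are passed in explicitly so the example stays deterministic (a real generator would presumably choose them at random).

```rust
/// Hypothetical illustration of the four typo kinds; not the module's API.
#[derive(Debug, Clone, Copy)]
pub enum TypoKind {
    /// Replace the character at `pos` with the given character.
    Substitution(char),
    /// Swap the character at `pos` with the one after it.
    Transposition,
    /// Insert the given character before index `pos` (`pos == len` appends).
    Insertion(char),
    /// Remove the character at `pos`.
    Deletion,
}

/// Applies one typo at character index `pos`.
/// Returns the input unchanged when `pos` is out of range for the chosen kind.
pub fn apply_typo(input: &str, pos: usize, kind: TypoKind) -> String {
    // Work on chars rather than bytes so multi-byte characters stay intact.
    let mut chars: Vec<char> = input.chars().collect();
    match kind {
        TypoKind::Substitution(ch) => {
            if pos < chars.len() {
                chars[pos] = ch;
            }
        }
        TypoKind::Transposition => {
            if pos + 1 < chars.len() {
                chars.swap(pos, pos + 1);
            }
        }
        TypoKind::Insertion(ch) => {
            if pos <= chars.len() {
                chars.insert(pos, ch);
            }
        }
        TypoKind::Deletion => {
            if pos < chars.len() {
                chars.remove(pos);
            }
        }
    }
    chars.into_iter().collect()
}
```

For example, `apply_typo("invoice", 2, TypoKind::Transposition)` swaps the third and fourth characters and yields `"inovice"`.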
Structs
- DataQualityConfig - Configuration for the data quality injector.
- DataQualityConfigBuilder - Builder for DataQualityConfig.
- DataQualityInjector - Main data quality injector.
- DataQualityStats - Combined statistics for all data quality issues.
- DuplicateConfig - Configuration for duplicate generation.
- DuplicateDetector - Detects potential duplicates in a dataset.
- DuplicateGenerator - Duplicate generator.
- DuplicateRecord - A duplicate record with metadata.
- DuplicateStats - Statistics for duplicate generation.
- FormatVariationConfig - Configuration for format variations.
- FormatVariationInjector - Format variation injector.
- FormatVariationStats - Statistics for format variations.
- Homophones - Common homophones (words that sound alike).
- KeyboardLayout - QWERTY keyboard layout for nearby key substitution.
- MissingCondition - Condition for MAR missing values.
- MissingPattern - Pattern for MNAR missing values.
- MissingValueConfig - Configuration for missing values by field.
- MissingValueInjector - Missing value injector.
- MissingValueStats - Statistics about missing values.
- OCRConfusions - OCR-similar characters (often confused in OCR).
- QualityIssue - A data quality issue record.
- QualityIssueLabel - A label describing a data quality issue for ML training.
- QualityLabelSummary - Summary statistics for quality labels.
- QualityLabels - Collection of quality issue labels with aggregation methods.
- TypoConfig - Configuration for typo generation.
- TypoGenerator - Typo generator.
- TypoStats - Statistics for typo generation.
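The MAR and MNAR entries above refer to two standard missingness mechanisms: under MAR (missing at random) the chance that a value is missing depends on other observed fields, while under MNAR (missing not at random) it depends on the missing value itself. The sketch below illustrates the difference with deterministic rules. `Payment` and both functions are hypothetical and are not taken from this module; a real injector would most likely apply such rules with a probability rather than always.

```rust
/// Hypothetical record type for illustration; not part of the module's API.
#[derive(Debug, Clone, PartialEq)]
pub struct Payment {
    pub amount: f64,
    pub memo: Option<String>,
}

/// MAR-style rule (assumed): blank `memo` when the observed `amount`
/// is below `threshold`. Missingness depends on a different field.
/// Returns true when a value was actually removed.
pub fn blank_memo_when_small(rec: &mut Payment, threshold: f64) -> bool {
    if rec.amount < threshold && rec.memo.is_some() {
        rec.memo = None;
        true
    } else {
        false
    }
}

/// MNAR-style rule (assumed): blank `memo` when the memo text itself
/// contains `marker`. Missingness depends on the value being removed.
/// Returns true when a value was actually removed.
pub fn blank_memo_when_contains(rec: &mut Payment, marker: &str) -> bool {
    let hit = rec.memo.as_deref().map_or(false, |m| m.contains(marker));
    if hit {
        rec.memo = None;
    }
    hit
}
```

The practical difference for downstream tests: a MAR gap can in principle be explained from the remaining columns, whereas an MNAR gap cannot, because the evidence was removed together with the value.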
Enums
- AmountFormat - Amount format variations.
- ConditionType - Type of condition for missing values.
- DateFormat - Date format variations.
- DuplicateType - Type of duplicate.
- EncodingIssue - Encoding issue types.
- IdentifierFormat - Identifier format variations.
- LabeledIssueType - Type of data quality issue.
- MissingValue - Represents a missing value placeholder.
- MissingValueStrategy - Strategy for missing value injection.
- PatternType - Type of pattern for MNAR.
- QualityIssueSubtype - Subtype providing more detail about the issue.
- QualityIssueType - Type of quality issue.
- TextFormat - Text format variations.
- TypoType - Type of typo/error.
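As an illustration of what a format variation means in practice, the sketch below renders one calendar date in several common layouts. The variants of the real `DateFormat` enum are not shown in this listing, so `DateStyle` and `render_date` are hypothetical stand-ins and the chosen layouts are assumptions.

```rust
/// Hypothetical date styles for illustration; the module's `DateFormat`
/// variants may differ.
#[derive(Debug, Clone, Copy)]
pub enum DateStyle {
    /// Year-month-day with dashes, e.g. 2024-03-07.
    Iso,
    /// Month/day/year with slashes, e.g. 03/07/2024.
    UsSlash,
    /// Day.month.year with dots, e.g. 07.03.2024.
    EuDot,
    /// Year, month, day with no separators, e.g. 20240307.
    Compact,
}

/// Renders the same date in the requested style.
/// No calendar validation is performed; the caller supplies a valid date.
pub fn render_date(year: u32, month: u32, day: u32, style: DateStyle) -> String {
    match style {
        DateStyle::Iso => format!("{:04}-{:02}-{:02}", year, month, day),
        DateStyle::UsSlash => format!("{:02}/{:02}/{:04}", month, day, year),
        DateStyle::EuDot => format!("{:02}.{:02}.{:04}", day, month, year),
        DateStyle::Compact => format!("{:04}{:02}{:02}", year, month, day),
    }
}
```

Mixing such styles within one column is what makes the data useful for exercising parsers: `03/07/2024` and `07.03.2024` describe the same day here, but a cleaner that guesses the wrong convention will read them as different dates.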
Traits
- Duplicatable - Trait for records that can be duplicated.
Functions
- introduce_encoding_issue - Introduces encoding issues.
- random_missing_representation - Selects a random missing value representation.
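One familiar kind of character corruption is mojibake, where UTF-8 bytes are decoded as if they were Latin-1. The sketch below reproduces that effect using only the standard library. It is not the signature or behavior of `introduce_encoding_issue`, which this listing does not show; `utf8_read_as_latin1` is a hypothetical helper.

```rust
/// Hypothetical helper; not the module's `introduce_encoding_issue`.
/// Re-reads the UTF-8 bytes of `input` as if each byte were a single
/// Latin-1 character. ASCII text is unchanged; every non-ASCII character
/// turns into two or more unrelated characters.
pub fn utf8_read_as_latin1(input: &str) -> String {
    // `u8 as char` maps a byte to the code point with the same value,
    // which is exactly how Latin-1 (ISO-8859-1) decoding works.
    input.bytes().map(|b| b as char).collect()
}
```

With this helper, `"café"` becomes `"cafÃ©"`, because `é` is encoded in UTF-8 as the two bytes `0xC3 0xA9`, which Latin-1 reads as `Ã` and `©`.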