# DataDoctor Core 🩺
DataDoctor Core is the intelligent engine behind the DataDoctor toolkit. It is a high-performance Rust library designed to validate, diagnose, and automatically fix common data quality issues in JSON and CSV datasets.
Unlike simple validators that just say "Error at line 5", DataDoctor Core attempts to understand the error and repair it using a combination of heuristic parsing, token stream analysis, and rule-based correction.
## 🧠 How It Works
DataDoctor Core operates using two main strategies:
### 1. The JSON Repair Engine
For JSON data, we don't just use a standard parser (which fails immediately on errors). Instead, we implement a custom, fault-tolerant token stream analyzer that can:
- Lookahead/Lookbehind: To detect trailing commas or missing commas.
- Context Awareness: To know if a quote is missing from a key or a value.
- Structural Repair: To balance unclosed braces `{}` and brackets `[]`.
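The lookahead strategy above can be sketched in a few lines. This is an illustrative standalone example, not DataDoctor's actual implementation: it scans the raw character stream, tracks whether it is inside a string, and uses a one-token lookahead to drop trailing commas before a closer.

```rust
// Illustrative sketch of lookahead-based repair (not the crate's real code):
// remove trailing commas by peeking past whitespace for a `}` or `]`.
fn strip_trailing_commas(input: &str) -> String {
    let chars: Vec<char> = input.chars().collect();
    let mut out = String::with_capacity(input.len());
    let mut in_string = false;
    let mut i = 0;
    while i < chars.len() {
        let c = chars[i];
        if c == '"' && (i == 0 || chars[i - 1] != '\\') {
            in_string = !in_string; // toggle on unescaped quotes
        }
        if c == ',' && !in_string {
            // Lookahead: skip whitespace; a closer means the comma is trailing.
            let next = chars[i + 1..].iter().find(|ch| !ch.is_whitespace());
            if matches!(next, Some('}') | Some(']')) {
                i += 1; // drop the trailing comma
                continue;
            }
        }
        out.push(c);
        i += 1;
    }
    out
}

fn main() {
    assert_eq!(
        strip_trailing_commas(r#"{"a": 1, "b": [2, 3,],}"#),
        r#"{"a": 1, "b": [2, 3]}"#
    );
    println!("ok");
}
```

The string-tracking guard is what makes the scan fault-tolerant: a comma inside a quoted value is never touched, only structural commas are candidates for repair.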
### 2. The CSV Normalizer
For CSV files, the engine handles the complexities of delimiters and column alignment:
- Delimiter Detection: Statistical analysis of the first few lines to guess whether the delimiter is `,`, `;`, `\t`, or `|`.
- Column Padding: Auto-fills missing fields with empty values to preserve row structure.
- Type Coercion: Smartly converts `"Yes"`/`"No"` to `true`/`false`, validates emails, and normalizes headers.
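The delimiter-detection idea can be sketched as follows. This is an illustrative example, not the crate's actual code: count each candidate delimiter across the first few lines and pick the one that appears consistently on every line.

```rust
// Illustrative sketch (not DataDoctor's real implementation): a real
// delimiter occurs on every sampled line, so score each candidate by its
// minimum per-line count and pick the highest.
fn detect_delimiter(sample: &str) -> u8 {
    let candidates = [b',', b';', b'\t', b'|'];
    let lines: Vec<&str> = sample.lines().take(5).collect();
    let mut best = (b',', 0usize); // default to comma
    for &cand in &candidates {
        let counts: Vec<usize> = lines
            .iter()
            .map(|l| l.bytes().filter(|&b| b == cand).count())
            .collect();
        let min = counts.iter().copied().min().unwrap_or(0);
        if min > best.1 {
            best = (cand, min);
        }
    }
    best.0
}

fn main() {
    assert_eq!(detect_delimiter("a;b;c\n1;2;3\n"), b';');
    println!("ok");
}
```

Using the minimum per-line count (rather than the total) guards against a character that is frequent in one line but absent elsewhere, such as commas inside a free-text field.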
## ✨ Features
- Robust Validation: Detailed error reporting with row/column locations and specific error codes.
- Auto-Fixing:
  - JSON: Trailing commas, missing quotes, single quotes -> double quotes, unclosed brackets.
  - CSV: Padding missing columns, trimming extra columns, boolean normalization, whitespace trimming.
- Schema Validation: Define optional schemas to enforce data types (Integer, Email, URL, etc.) and required fields.
- Streaming Architecture: Designed to handle large files efficiently using `Read` streams.
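The streaming approach described above can be illustrated with a small standalone function (not the crate's API): by accepting any `Read` source and iterating line by line through a `BufReader`, a check never needs the whole file in memory.

```rust
// Illustrative sketch of the streaming pattern: generic over `Read`, so the
// same check works on files, sockets, or in-memory buffers.
use std::io::{BufRead, BufReader, Cursor, Read};

fn count_short_rows<R: Read>(source: R, expected_cols: usize) -> usize {
    BufReader::new(source)
        .lines()
        .filter_map(Result::ok)
        .filter(|line| line.split(',').count() < expected_cols)
        .count()
}

fn main() {
    // A `Cursor` stands in for a file here; `File::open(..)` works the same way.
    let csv = Cursor::new("a,b,c\n1,2\n3,4,5\n");
    assert_eq!(count_short_rows(csv, 3), 1); // row "1,2" is short
    println!("ok");
}
```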
## 📦 Installation

Add this to your `Cargo.toml` (crate name assumed from the project title; check the published crate):

```toml
[dependencies]
datadoctor-core = "1.0"
```
## 📖 Usage Guide
### 1. Basic JSON Validation & Fixing

Use `JsonValidator` to fix broken JSON strings. A minimal sketch (the constructor and `validate` call are assumptions; check the crate docs for exact signatures):

```rust
use datadoctor_core::{JsonValidator, ValidationOptions};

let options = ValidationOptions { auto_fix: true, ..Default::default() };
let report = JsonValidator::new(options).validate(r#"{"name": "Ada",}"#);
```
### 2. Streaming CSV Validation

Validate large CSV files efficiently using `validate_csv_stream`. A sketch (the exact signature is an assumption):

```rust
use datadoctor_core::{validate_csv_stream, ValidationOptions};
use std::fs::File;
use std::io::BufReader;

let reader = BufReader::new(File::open("data.csv")?);
let report = validate_csv_stream(reader, &ValidationOptions::default())?;
```
### 3. Enforcing a Schema

You can define a `Schema` to ensure data meets specific requirements. A sketch (the `Schema` constructor shown is an assumption; see the crate docs):

```rust
use datadoctor_core::{Schema, ValidationOptions};

let schema = Schema::new(); // assumed constructor
let options = ValidationOptions { schema: Some(schema), ..Default::default() };
```
## ⚙️ Configuration (`ValidationOptions`)

| Option | Type | Default | Description |
|---|---|---|---|
| `auto_fix` | `bool` | `false` | If `true`, the engine attempts to repair detected issues. |
| `max_errors` | `usize` | `0` (unlimited) | Stop processing after finding N errors (useful for large files). |
| `csv_delimiter` | `u8` | `b','` | The delimiter character for CSV parsing. |
| `schema` | `Option<Schema>` | `None` | Optional data schema for stricter validation. |
## 🤝 Contributing
Contributions are welcome! Please check out the main repository for guidelines.
## 📄 License
MIT License.