# DataDoctor Core 🩺
[crates.io](https://crates.io/crates/data-doctor-core)
[License: MIT](https://opensource.org/licenses/MIT)
[docs.rs](https://docs.rs/data-doctor-core)
**DataDoctor Core** is the intelligent engine behind the DataDoctor toolkit. It is a high-performance Rust library designed to validate, diagnose, and *automatically fix* common data quality issues in JSON and CSV datasets.
Unlike simple validators that just say "Error at line 5", DataDoctor Core attempts to **understand** the error and **repair** it using a combination of heuristic parsing, token stream analysis, and rule-based correction.
---
## 🧠 How It Works
DataDoctor Core operates using two main strategies:
### 1. The JSON Repair Engine
For JSON data, a standard parser is not enough, because it fails immediately on the first error. Instead, the engine implements a custom, fault-tolerant token stream analyzer that relies on:
- **Lookahead/Lookbehind**: Detects trailing commas and missing commas.
- **Context Awareness**: Determines whether a quote is missing from a key or from a value.
- **Structural Repair**: Balances unclosed braces `{}` and brackets `[]`.
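
To make the lookbehind and structural repair ideas concrete, here is a minimal, self-contained sketch. It is not the crate's implementation: `repair_structure` and `drop_trailing_comma` are hypothetical names, and the sketch only covers trailing commas and unclosed braces or brackets. Quote repair and mismatched closers are deliberately left out.

```rust
/// Hypothetical helper, not part of the data-doctor-core API.
/// Drops trailing commas and appends any closers that are still missing.
fn repair_structure(input: &str) -> String {
    let mut out = String::with_capacity(input.len() + 8);
    let mut stack: Vec<char> = Vec::new(); // closers we still expect to see
    let mut in_string = false;
    let mut escaped = false;

    for c in input.chars() {
        if in_string {
            // Inside a string literal nothing is structural; copy verbatim.
            out.push(c);
            if escaped {
                escaped = false;
            } else if c == '\\' {
                escaped = true;
            } else if c == '"' {
                in_string = false;
            }
            continue;
        }
        match c {
            '"' => {
                in_string = true;
                out.push(c);
            }
            '{' => {
                stack.push('}');
                out.push(c);
            }
            '[' => {
                stack.push(']');
                out.push(c);
            }
            '}' | ']' => {
                // Lookbehind: a comma right before a closer is a trailing comma.
                drop_trailing_comma(&mut out);
                if stack.last() == Some(&c) {
                    stack.pop();
                }
                out.push(c);
            }
            _ => out.push(c),
        }
    }

    // Structural repair: close an unterminated string, then every open scope.
    if in_string {
        out.push('"');
    }
    drop_trailing_comma(&mut out);
    while let Some(closer) = stack.pop() {
        out.push(closer);
    }
    out
}

/// Removes a trailing comma, plus any whitespace after it, from `out`.
fn drop_trailing_comma(out: &mut String) {
    let trimmed_len = out.trim_end().len();
    if out[..trimmed_len].ends_with(',') {
        out.truncate(trimmed_len - 1);
    }
}
```

With this sketch, `repair_structure("[1, 2, ")` returns `[1, 2]`, and commas or braces that appear inside string values are left untouched because the scanner tracks whether it is inside a string.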
### 2. The CSV Normalizer
For CSV files, the engine handles the complexities of delimiters and column alignment:
- **Delimiter Detection**: Statistical analysis of the first few lines to guess whether the delimiter is `,`, `;`, `\t`, or `|`.
- **Column Padding**: Auto-fills missing fields with empty values to preserve row structure.
- **Type Coercion**: Converts "Yes"/"No" to `true`/`false`, validates emails, and normalizes headers.
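
The three steps above can be sketched in a few lines of standard-library Rust. This is an illustration under stated assumptions, not the crate's code: `detect_delimiter`, `normalize_row`, and `coerce_bool` are hypothetical names, the five-line sample size is arbitrary, and the delimiter heuristic ignores quoted fields.

```rust
/// Candidate delimiters, taken from the list above.
const CANDIDATES: [char; 4] = [',', ';', '\t', '|'];

/// Hypothetical helper: picks the candidate that appears the same non-zero
/// number of times on every sampled line. The higher count wins; on a tie the
/// earlier candidate is kept. Falls back to ',' when nothing is consistent.
/// Simplification: delimiters inside quoted fields are counted too.
fn detect_delimiter(sample: &str) -> char {
    let lines: Vec<&str> = sample
        .lines()
        .filter(|l| !l.trim().is_empty())
        .take(5) // assumed sample size
        .collect();

    let mut best = (',', 0usize);
    for &cand in CANDIDATES.iter() {
        let counts: Vec<usize> = lines.iter().map(|l| l.matches(cand).count()).collect();
        let first = counts.first().copied().unwrap_or(0);
        let consistent = first > 0 && counts.iter().all(|&c| c == first);
        if consistent && first > best.1 {
            best = (cand, first);
        }
    }
    best.0
}

/// Hypothetical helper: pads short rows with empty fields and trims long rows
/// so that every row has exactly `width` fields.
fn normalize_row(mut fields: Vec<String>, width: usize) -> Vec<String> {
    fields.resize(width, String::new());
    fields
}

/// Hypothetical helper: maps yes/no (and true/false) to a boolean,
/// ignoring case and surrounding whitespace. Other values are left alone.
fn coerce_bool(field: &str) -> Option<bool> {
    match field.trim().to_ascii_lowercase().as_str() {
        "yes" | "true" => Some(true),
        "no" | "false" => Some(false),
        _ => None,
    }
}
```

For the sample `"a;b;c\n1;2;3\n4;5;6"`, `detect_delimiter` returns `';'`, because `;` occurs exactly twice on every line while the other candidates never occur.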
---
## ✨ Features
- **Robust Validation**: Detailed error reporting with row/column locations and specific error codes.
- **Auto-Fixing**:
  - **JSON**: Trailing commas, missing quotes, single quotes -> double quotes, unclosed brackets.
  - **CSV**: Padding missing columns, trimming extra columns, boolean normalization, whitespace trimming.
- **Schema Validation**: Define optional schemas to enforce data types (Integer, Email, URL, etc.) and required fields.
- **Streaming Architecture**: Designed to handle large files efficiently using `Read` streams.
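
As a rough illustration of the streaming idea, the sketch below scans any `Read` source one line at a time, so memory use does not grow with file size. It is an assumption-laden stand-in rather than the crate's code: `scan_stream` and `StreamStats` are hypothetical, the field names only mirror the statistics shown in the usage guide, and a record counts as invalid simply when its field count differs from the header's.

```rust
use std::io::{BufRead, BufReader, Read};

/// Hypothetical summary type; not the crate's real result struct.
#[derive(Debug, Default, PartialEq)]
struct StreamStats {
    total_records: usize,
    invalid_records: usize,
}

/// Hypothetical helper: counts records whose field count differs from the
/// header row. `max_errors == 0` means "no limit"; otherwise scanning stops
/// once that many invalid records have been seen.
fn scan_stream<R: Read>(
    source: R,
    delimiter: char,
    max_errors: usize,
) -> std::io::Result<StreamStats> {
    let mut lines = BufReader::new(source).lines();
    let mut stats = StreamStats::default();

    // The first line is treated as the header and fixes the expected width.
    let expected = match lines.next() {
        Some(header) => header?.split(delimiter).count(),
        None => return Ok(stats),
    };

    // Only one line is held in memory at a time.
    for line in lines {
        let line = line?;
        stats.total_records += 1;
        if line.split(delimiter).count() != expected {
            stats.invalid_records += 1;
            if max_errors != 0 && stats.invalid_records >= max_errors {
                break;
            }
        }
    }
    Ok(stats)
}
```

Because the function is generic over `Read`, the same code works for a `File`, a network stream, or an in-memory byte slice such as `"a,b\n1,2\n3\n".as_bytes()`.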
---
## 📦 Installation
Add this to your `Cargo.toml`:
```toml
[dependencies]
data-doctor-core = "1.0"
```
---
## 📖 Usage Guide
### 1. Basic JSON Validation & Fixing
Use `JsonValidator` to fix broken JSON strings.
```rust
use data_doctor_core::json::JsonValidator;
use data_doctor_core::ValidationOptions;

fn main() {
    // Broken JSON: Trailing comma, single quotes, unquoted key
    let bad_json = r#"{ name: 'John Doe', age: 30, }"#;

    let mut options = ValidationOptions::default();
    options.auto_fix = true; // Enable the repair engine

    let validator = JsonValidator::new();
    let (fixed_json, result) = validator.validate_and_fix(bad_json, &options);

    if result.success {
        println!("Fixed: {}", fixed_json);
        // Output: { "name": "John Doe", "age": 30 }
    }
}
```
### 2. Streaming CSV Validation
Validate large CSV files efficiently using `validate_csv_stream`.
```rust
use data_doctor_core::{validate_csv_stream, ValidationOptions};
use std::fs::File;
use std::io::BufReader;

fn main() -> std::io::Result<()> {
    let file = File::open("data.csv")?;
    let reader = BufReader::new(file);

    let options = ValidationOptions {
        csv_delimiter: b',',
        max_errors: 100, // Stop after 100 errors
        auto_fix: false, // Just validate, don't fix
        ..Default::default()
    };

    let result = validate_csv_stream(reader, &options);
    println!("processed {} records", result.stats.total_records);
    println!("found {} invalid records", result.stats.invalid_records);

    // Inspect specific issues
    for issue in result.issues {
        println!("[{}] Row {}: {}", issue.severity, issue.row.unwrap_or(0), issue.message);
    }

    Ok(())
}
```
### 3. Enforcing a Schema
You can define a `Schema` to ensure data meets specific requirements.
```rust
use data_doctor_core::schema::{Schema, FieldSchema, DataType, Constraint};
use data_doctor_core::ValidationOptions;

fn main() {
    let mut schema = Schema::new("user_profile");

    // Define fields
    schema.add_field(
        FieldSchema::new("email", DataType::Email)
            .add_constraint(Constraint::Required),
    );
    schema.add_field(FieldSchema::new("age", DataType::Integer));

    let mut options = ValidationOptions::default();
    options.schema = Some(schema);

    // Now validate your data against this schema...
}
```
---
## ⚙️ Configuration (`ValidationOptions`)
| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `auto_fix` | `bool` | `false` | If `true`, the engine attempts to repair detected issues. |
| `max_errors` | `usize` | `0` (unlimited) | Stop processing after finding N errors (useful for large files). |
| `csv_delimiter` | `u8` | `b','` | The delimiter character for CSV parsing. |
| `schema` | `Option<Schema>` | `None` | Optional data schema for stricter validation. |
---
## 🤝 Contributing
Contributions are welcome! Please check out the [main repository](https://github.com/jeevanms003/data-doctor) for guidelines.
## 📄 License
MIT License.