# DataDoctor Core 🩺


[![Crates.io](https://img.shields.io/crates/v/data-doctor-core.svg)](https://crates.io/crates/data-doctor-core)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Documentation](https://docs.rs/data-doctor-core/badge.svg)](https://docs.rs/data-doctor-core)

**DataDoctor Core** is the intelligent engine behind the DataDoctor toolkit. It is a high-performance Rust library designed to validate, diagnose, and *automatically fix* common data quality issues in JSON and CSV datasets.

Unlike simple validators that just say "Error at line 5", DataDoctor Core attempts to **understand** the error and **repair** it using a combination of heuristic parsing, token stream analysis, and rule-based correction.

---

## 🧠 How It Works


DataDoctor Core operates using two main strategies:

### 1. The JSON Repair Engine

For JSON data, we don't just use a standard parser (which fails immediately on errors). Instead, we implement a custom, fault-tolerant token stream analyzer that can:
- **Lookahead/Lookbehind**: To detect trailing commas or missing commas.
- **Context Awareness**: To know if a quote is missing from a key or a value.
- **Structural Repair**: To balance unclosed braces `{}` and brackets `[]`.
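The ideas above can be sketched in plain Rust. The `repair` function below is an illustrative toy, not the crate's actual implementation: the real engine works on a token stream and is string-aware, whereas this character-level pass ignores quoted strings. It shows the two simplest moves: lookahead to drop a trailing comma, and scope counting to close unbalanced braces and brackets.

```rust
// Toy sketch of the repair strategies (NOT the crate's real engine):
// it scans characters, so commas or braces inside string literals
// would confuse it. Illustration only.
fn repair(input: &str) -> String {
    let mut out = String::new();
    let mut open_braces = 0i32;
    let mut open_brackets = 0i32;
    let chars: Vec<char> = input.chars().collect();

    for (i, &c) in chars.iter().enumerate() {
        match c {
            '{' => open_braces += 1,
            '}' => open_braces -= 1,
            '[' => open_brackets += 1,
            ']' => open_brackets -= 1,
            ',' => {
                // Lookahead: a comma whose next non-space char closes a
                // scope is a trailing comma, so drop it.
                let next = chars[i + 1..].iter().copied().find(|c| !c.is_whitespace());
                if matches!(next, Some('}') | Some(']')) {
                    continue;
                }
            }
            _ => {}
        }
        out.push(c);
    }

    // Structural repair: close any scopes still open at end of input.
    for _ in 0..open_brackets.max(0) {
        out.push(']');
    }
    for _ in 0..open_braces.max(0) {
        out.push('}');
    }
    out
}

fn main() {
    // Trailing comma inside the array is dropped.
    assert_eq!(
        repair(r#"{"a": [1, 2,], "b": 3}"#),
        r#"{"a": [1, 2], "b": 3}"#
    );
    // Truncated input gets its bracket and brace closed.
    assert_eq!(repair(r#"{"a": [1, 2"#), r#"{"a": [1, 2]}"#);
    println!("repaired ok");
}
```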

### 2. The CSV Normalizer

For CSV files, the engine handles the complexities of delimiters and column alignment:
- **Delimiter Detection**: Statistical analysis of the first few lines determines whether the delimiter is `,`, `;`, `\t`, or `|`.
- **Column Padding**: Auto-fills missing fields with empty values to preserve row structure.
- **Type Coercion**: Smartly converts "Yes"/"No" to `true`/`false`, validates emails, and normalizes headers.
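The delimiter-detection step can be sketched in a few lines of plain Rust. This `detect_delimiter` helper is illustrative only and not the crate's code; it captures the statistical idea: a good delimiter appears the same nonzero number of times on every sampled line.

```rust
// Illustrative sketch (not the crate's implementation): pick the candidate
// delimiter whose per-line count is nonzero and identical across the first
// few lines of the sample.
fn detect_delimiter(sample: &str) -> u8 {
    let candidates = [b',', b';', b'\t', b'|'];
    let lines: Vec<&str> = sample.lines().take(10).collect();
    let mut best = (b',', 0usize); // default to comma

    for &cand in &candidates {
        let counts: Vec<usize> = lines
            .iter()
            .map(|l| l.bytes().filter(|&b| b == cand).count())
            .collect();
        // Consistent means: every line has the same count, and it is > 0.
        let consistent = counts
            .first()
            .map_or(false, |&first| first > 0 && counts.iter().all(|&c| c == first));
        if consistent && counts[0] > best.1 {
            best = (cand, counts[0]);
        }
    }
    best.0
}

fn main() {
    let sample = "id;name;email\n1;Ada;ada@example.com\n2;Alan;alan@example.com";
    assert_eq!(detect_delimiter(sample), b';');
    println!("detected ok");
}
```

A real detector would also ignore delimiters inside quoted fields; the crate's engine handles that case, while this sketch deliberately does not.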

---

## ✨ Features


- **Robust Validation**: Detailed error reporting with row/column locations and specific error codes.
- **Auto-Fixing**:
    - **JSON**: Trailing commas, missing quotes, single quotes -> double quotes, unclosed brackets.
    - **CSV**: Padding missing columns, trimming extra columns, boolean normalization, whitespace trimming.
- **Schema Validation**: Define optional schemas to enforce data types (Integer, Email, URL, etc.) and required fields.
- **Streaming Architecture**: Designed to handle large files efficiently using `Read` streams.

---

## 📦 Installation


Add this to your `Cargo.toml`:

```toml
[dependencies]
data-doctor-core = "1.0"
```

---

## 📖 Usage Guide


### 1. Basic JSON Validation & Fixing


Use `JsonValidator` to fix broken JSON strings.

```rust
use data_doctor_core::json::JsonValidator;
use data_doctor_core::ValidationOptions;

fn main() {
    // Broken JSON: Trailing comma, single quotes, unquoted key
    let bad_json = r#"{ name: 'John Doe', age: 30, }"#;

    let mut options = ValidationOptions::default();
    options.auto_fix = true; // Enable the repair engine

    let validator = JsonValidator::new();
    let (fixed_json, result) = validator.validate_and_fix(bad_json, &options);

    if result.success {
        println!("Fixed: {}", fixed_json);
        // Output: { "name": "John Doe", "age": 30 }
    }
}
```

### 2. Streaming CSV Validation


Validate large CSV files efficiently using `validate_csv_stream`.

```rust
use data_doctor_core::{validate_csv_stream, ValidationOptions};
use std::fs::File;
use std::io::BufReader;

fn main() -> std::io::Result<()> {
    let file = File::open("data.csv")?;
    let reader = BufReader::new(file);

    let options = ValidationOptions {
        csv_delimiter: b',',
        max_errors: 100, // Stop after 100 errors
        auto_fix: false, // Just validate, don't fix
        ..Default::default()
    };

    let result = validate_csv_stream(reader, &options);

    println!("Processed {} records", result.stats.total_records);
    println!("Found {} invalid records", result.stats.invalid_records);
    
    // Inspect specific issues
    for issue in result.issues {
        println!("[{}] Row {}: {}", issue.severity, issue.row.unwrap_or(0), issue.message);
    }

    Ok(())
}
```

### 3. Enforcing a Schema


You can define a `Schema` to ensure data meets specific requirements.

```rust
use data_doctor_core::schema::{Schema, FieldSchema, DataType, Constraint};
use data_doctor_core::ValidationOptions;

fn main() {
    let mut schema = Schema::new("user_profile");
    
    // Define fields
    schema.add_field(FieldSchema::new("email", DataType::Email)
        .add_constraint(Constraint::Required));
        
    schema.add_field(FieldSchema::new("age", DataType::Integer));

    let mut options = ValidationOptions::default();
    options.schema = Some(schema);

    // Now validate your data against this schema...
}
```

---

## ⚙️ Configuration (`ValidationOptions`)


| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `auto_fix` | `bool` | `false` | If `true`, the engine attempts to repair detected issues. |
| `max_errors` | `usize` | `0` (unlimited) | Stop processing after finding N errors (useful for large files). |
| `csv_delimiter` | `u8` | `b','` | The delimiter character for CSV parsing. |
| `schema` | `Option<Schema>` | `None` | Optional data schema for stricter validation. |

---

## 🤝 Contributing


Contributions are welcome! Please check out the [main repository](https://github.com/jeevanms003/data-doctor) for guidelines.

## 📄 License


MIT License.