# Performance and Known Limitations

This document describes cases where csv-nose may not correctly detect CSV dialects, helping you understand when to use manual overrides.

## Accuracy Summary

Tested against standard CSV benchmark datasets:

| Dataset | Success Rate | Notes |
|---------|--------------|-------|
| POLLOCK | 95.95% | General CSV files |
| W3C-CSVW | 94.12% | W3C CSV on the Web test suite |
| CSV Wrangling | 91.06% | Real-world messy CSVs |
| CSV Wrangling CODEC | 90.85% | Filtered subset |
| CSV Wrangling MESSY | 89.68% | Non-normal structures |

## Known Limitations

### Uncommon Delimiters

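csv-nose is biased toward common delimiters (`,`, `;`, `\t`) to improve accuracy on real-world data, so files that use rare delimiters may be misdetected. The penalties listed below are score multipliers: 1.00 leaves a candidate's score unchanged, and lower values penalize it more heavily.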

**Space-delimited files** (0.75 penalty):
- Spaces appear frequently in text content, making them difficult to distinguish as delimiters
- Examples: `diamonds.csv`, `dict.csv`, `methane_molecular_structure_xyz_20140911.csv`

**Hash-delimited files** (0.60 penalty):
- Hash (`#`) is commonly used as a comment marker
- Examples: `councils.csv`, `flat_file_database.csv`, `uniq_nl_data.csv`

**Other rare delimiters**:
- Ampersand (`&`): 0.60 penalty
- Forward slash (`/`): 0.65 penalty
- Section sign (`§`): 0.70 penalty
- Caret (`^`) and tilde (`~`): 0.80 penalty
- Colon (`:`): 0.90 penalty (often appears in timestamps)

**Workaround**: Use an explicit delimiter hint:
```rust
use csv_nose::Sniffer;

let metadata = Sniffer::new()
    .delimiter(b' ')  // Force space delimiter
    .sniff_path("space-delimited.csv")?;
```

### Quote Character Detection

**Single-quote vs double-quote**:
- Single quotes require 2x the density threshold to be detected (to avoid false positives from apostrophes in text like "John's")
- When double quotes are present, single-quote dialects receive a 0.95 penalty
- Examples: `Auto_Tone_sub315_day1.csv`, `currencies.csv`, `isco.csv`

**`Quote::None` when quotes exist**:
- When double quotes have ≥0.5% density, `Quote::None` receives a 0.90 penalty
- This helps prefer quoted parsing when evidence exists

**Workaround**: Use an explicit quote hint:
```rust
use csv_nose::{Sniffer, Quote};

let metadata = Sniffer::new()
    .quote(Quote::Some(b'\''))  // Force single quote
    .sniff_path("single-quoted.csv")?;
```

### Small Files

Detection is less reliable for files with few rows:

| Rows | Penalty |
|------|---------|
| < 3 | 0.70 |
| 3-4 | 0.85 |
| ≥ 5 | None |

**Workaround**: Increase the sample size or provide hints:
```rust
use csv_nose::{Sniffer, SampleSize};

let metadata = Sniffer::new()
    .sample_size(SampleSize::All)  // Read entire file
    .sniff_path("small.csv")?;
```
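
The row-count table above can be read as a simple lookup. The following is a minimal sketch of that reading, not csv-nose's actual code: the function name `row_count_penalty` is hypothetical, and returning 1.0 for "no penalty" assumes the penalties are applied as score multipliers.

```rust
/// Hypothetical helper: maps the number of rows in the sample to the
/// penalty multiplier from the "Small Files" table.
/// Assumption: "None" in the table corresponds to a multiplier of 1.0.
pub fn row_count_penalty(rows: usize) -> f64 {
    match rows {
        0..=2 => 0.70, // fewer than 3 rows
        3..=4 => 0.85, // 3 or 4 rows
        _ => 1.0,      // 5 or more rows: no penalty
    }
}
```

Under this reading, a two-row file would need a noticeably stronger raw score than a five-row file to reach the same final score.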

### Multi-table and Embedded Content

Files containing multiple tables or embedded non-CSV content may confuse detection:
- `file_multitable_less.csv`
- `file_multitable_more.csv`
- `file_multitable_same.csv`

These files have an ambiguous structure in which multiple dialects produce similar uniformity scores.

### Extreme Field Counts

**Single field** (0.50 penalty):
- A single field per row usually indicates the wrong delimiter was selected

**Very high field counts**:
- 50-100 fields: 0.80 penalty
- \>100 fields: 0.50 penalty
- May indicate splitting on a character that appears frequently in content
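
As an illustration, the field-count rules above could be expressed as one function. This is a sketch under stated assumptions rather than the library's implementation: `field_count_penalty` is a hypothetical name, and field counts not mentioned above (2 through 49) are assumed to carry no penalty, written as a multiplier of 1.0.

```rust
/// Hypothetical helper: maps the number of fields per row to the
/// penalty multiplier described in "Extreme Field Counts".
pub fn field_count_penalty(fields: usize) -> f64 {
    match fields {
        1 => 0.50,            // single field: likely the wrong delimiter
        50..=100 => 0.80,     // very high field count
        n if n > 100 => 0.50, // more than 100 fields
        _ => 1.0,             // assumption: everything else is unpenalized
    }
}
```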

## Scoring Algorithm Reference

### Delimiter Penalties

| Delimiter | Penalty | Priority (tiebreaker) |
|-----------|---------|----------------------|
| `,` | 1.00 | 10 |
| `;` | 1.00 | 9 |
| `\t` | 1.00 | 8 |
| `\|` | 0.98 | 7 |
| `:` | 0.90 | 4 |
| `^` `~` | 0.80 | 3 |
| ` ` (space) | 0.75 | 2 |
| `§` | 0.70 | 2 |
| `/` | 0.65 | 2 |
| `#` `&` | 0.60 | 1 |

When two candidates' scores are within 10% of each other, delimiter priority is used as a tiebreaker.
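
To make the interaction between penalties and priorities concrete, here is a minimal sketch of how the table and the tiebreak rule could combine. It is not csv-nose's internal code. The `Candidate` type, both function names, and the idea of a pre-penalty `raw_score` are hypothetical; "within 10%" is assumed to mean within 10% of the higher of the two scores; delimiters missing from the table are simply skipped; and `§` is left out to keep the sketch byte-based.

```rust
/// Hypothetical candidate: a delimiter byte and its score before penalties.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Candidate {
    pub delimiter: u8,
    pub raw_score: f64,
}

/// Penalty multiplier and tiebreak priority taken from the table above.
/// Returns `None` for delimiters the table does not list.
pub fn penalty_and_priority(delimiter: u8) -> Option<(f64, u8)> {
    match delimiter {
        b',' => Some((1.00, 10)),
        b';' => Some((1.00, 9)),
        b'\t' => Some((1.00, 8)),
        b'|' => Some((0.98, 7)),
        b':' => Some((0.90, 4)),
        b'^' | b'~' => Some((0.80, 3)),
        b' ' => Some((0.75, 2)),
        b'/' => Some((0.65, 2)),
        b'#' | b'&' => Some((0.60, 1)),
        _ => None,
    }
}

/// Picks a winner: the higher penalized score wins, unless the two scores
/// are within 10% of each other, in which case the higher priority wins.
pub fn pick_best(candidates: &[Candidate]) -> Option<Candidate> {
    // (candidate, penalized score, priority) of the current leader.
    let mut best: Option<(Candidate, f64, u8)> = None;
    for &c in candidates {
        let (penalty, priority) = match penalty_and_priority(c.delimiter) {
            Some(entry) => entry,
            None => continue, // not covered by the table
        };
        let score = c.raw_score * penalty;
        let replace = match best {
            None => true,
            Some((_, best_score, best_priority)) => {
                let top = score.max(best_score);
                // Assumption: "within 10%" is relative to the higher score.
                let close = (score - best_score).abs() <= 0.10 * top;
                if close { priority > best_priority } else { score > best_score }
            }
        };
        if replace {
            best = Some((c, score, priority));
        }
    }
    best.map(|(c, _, _)| c)
}
```

With these assumptions, a `#` candidate with a raw score of 1.0 (0.60 after its penalty) loses to a `,` candidate scoring 0.58, because the two are within 10% and the comma has the higher priority. A space candidate with a raw score of 1.0 (0.75 after its penalty) still beats a comma at 0.50, since the gap is too large for the tiebreaker to apply.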

### Quote Evidence Scoring

| Condition | Multiplier |
|-----------|------------|
| Double quotes with ≥0.5% density | 1.03 boost |
| Single quotes dominating (2x threshold), no double quotes | 1.05 boost |
| Single quote dialect when double quotes present | 0.95 penalty |
| `Quote::None` when double quotes have ≥0.5% density | 0.90 penalty |
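
The quote rules in the table can be sketched as a single multiplier function. This is an illustrative reading, not the library's implementation, and it rests on several assumptions: density is taken to be the share of sample bytes equal to the quote character, "2x threshold" is taken to mean 2 × 0.5%, and "double quotes present" is taken to mean at least one double quote in the sample. `QuoteChoice`, `density`, and `quote_multiplier` are hypothetical names; csv-nose's own `Quote` type is deliberately not used here.

```rust
/// Hypothetical stand-in for the quote setting of a candidate dialect.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum QuoteChoice {
    None,
    Double,
    Single,
}

/// The 0.5% density threshold from the table above.
pub const DENSITY_THRESHOLD: f64 = 0.005;

/// Assumption: density = occurrences of `ch` divided by sample length in bytes.
pub fn density(sample: &[u8], ch: u8) -> f64 {
    if sample.is_empty() {
        return 0.0;
    }
    let hits = sample.iter().filter(|&&b| b == ch).count();
    hits as f64 / sample.len() as f64
}

/// Multiplier applied to a candidate dialect, following the table rows.
pub fn quote_multiplier(sample: &[u8], choice: QuoteChoice) -> f64 {
    let double = density(sample, b'"');
    let single = density(sample, b'\'');
    let double_evidence = double >= DENSITY_THRESHOLD;
    // Single quotes must reach twice the threshold, with no double quotes at all.
    let single_dominates = double == 0.0 && single >= 2.0 * DENSITY_THRESHOLD;
    match choice {
        QuoteChoice::Double if double_evidence => 1.03,
        QuoteChoice::Single if single_dominates => 1.05,
        QuoteChoice::Single if double > 0.0 => 0.95,
        QuoteChoice::None if double_evidence => 0.90,
        _ => 1.0, // no rule applies: score unchanged
    }
}
```

Under these assumptions, a sample with no quote characters at all leaves every candidate's score unchanged, so the quote rules only matter once there is evidence one way or the other.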

## Workarounds Summary

```rust
use csv_nose::{Sniffer, Quote, SampleSize};

// Force specific delimiter
let metadata = Sniffer::new()
    .delimiter(b'#')
    .sniff_path("hash-delimited.csv")?;

// Force specific quote character
let metadata = Sniffer::new()
    .quote(Quote::Some(b'\''))
    .sniff_path("single-quoted.csv")?;

// Force no quoting
let metadata = Sniffer::new()
    .quote(Quote::None)
    .sniff_path("unquoted.csv")?;

// Read entire file instead of sampling
let metadata = Sniffer::new()
    .sample_size(SampleSize::All)
    .sniff_path("small.csv")?;

// Combine hints
let metadata = Sniffer::new()
    .delimiter(b' ')
    .quote(Quote::None)
    .sample_size(SampleSize::Records(1000))
    .sniff_path("tricky.csv")?;
```

## When to Use Alternative Approaches

Consider using explicit dialect specification (bypassing sniffing entirely) when:

1. **You know the dialect** - If your data source has a documented format
2. **Consistent pipeline** - Processing files from the same source repeatedly
3. **Rare delimiters** - Space, hash, or other uncommon separators
4. **Performance critical** - Sniffing adds overhead; known formats can skip detection

For these cases, use a CSV parser directly with explicit configuration rather than sniffing.