csv-nose
A Rust port of the Table Uniformity Method for CSV dialect detection.
Background
This crate implements the algorithm from "Detecting CSV File Dialects by Table Uniformity Measurement and Data Type Inference"[^1] by W. García. This implementation of the Table Uniformity Method achieves 99.55%[^2] accuracy on the W3C-CSVW test suite by:
- Testing multiple potential dialects (delimiter × quote × line terminator combinations)
- Scoring each dialect based on table uniformity (consistent field counts)
- Scoring based on type detection (consistent data types within columns)
- Selecting the dialect with the highest combined gamma score
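The four steps above can be sketched in a few lines of standard-library Rust. This is a simplified illustration, not the crate's implementation or the gamma formula from the paper: the helper names (`split_fields`, `uniformity`, `type_score`, `detect`), the candidate lists, and the scoring rule (modal field-count share multiplied by the share of cells that look like a number or plain text) are all assumptions made for the example.

```rust
use std::collections::HashMap;

/// Split one line on `delim`, treating text between `quote` chars as literal.
fn split_fields(line: &str, delim: char, quote: char) -> Vec<String> {
    let mut fields = Vec::new();
    let mut current = String::new();
    let mut in_quotes = false;
    for ch in line.chars() {
        if ch == quote {
            in_quotes = !in_quotes;
        } else if ch == delim && !in_quotes {
            fields.push(std::mem::take(&mut current));
        } else {
            current.push(ch);
        }
    }
    fields.push(current);
    fields
}

/// Share of rows whose field count equals the most common field count.
fn uniformity(rows: &[Vec<String>]) -> f64 {
    if rows.is_empty() {
        return 0.0;
    }
    let mut counts: HashMap<usize, usize> = HashMap::new();
    for row in rows {
        *counts.entry(row.len()).or_insert(0) += 1;
    }
    let modal = counts.values().copied().max().unwrap_or(0);
    modal as f64 / rows.len() as f64
}

/// Toy type check (an assumption): a number, or only letters, digits, spaces.
fn is_typed(cell: &str) -> bool {
    cell.parse::<f64>().is_ok() || cell.chars().all(|c| c.is_alphanumeric() || c == ' ')
}

/// Share of cells that pass the toy type check.
fn type_score(rows: &[Vec<String>]) -> f64 {
    let total: usize = rows.iter().map(|r| r.len()).sum();
    if total == 0 {
        return 0.0;
    }
    let typed = rows.iter().flatten().filter(|c| is_typed(c.as_str())).count();
    typed as f64 / total as f64
}

/// Try every delimiter × quote pair and keep the highest combined score.
fn detect(sample: &str) -> (char, char, f64) {
    let mut best = (',', '"', f64::NEG_INFINITY);
    for &delim in &[',', ';', '\t', '|'] {
        for &quote in &['"', '\''] {
            let rows: Vec<Vec<String>> = sample
                .lines()
                .map(|line| split_fields(line, delim, quote))
                .collect();
            let score = uniformity(&rows) * type_score(&rows);
            if score > best.2 {
                best = (delim, quote, score);
            }
        }
    }
    best
}

fn main() {
    let sample = "name;age;city\n\"Smith; John\";30;Paris\nBob;25;Rome\n";
    let (delim, quote, score) = detect(sample);
    println!("delimiter={:?} quote={:?} score={:.3}", delim, quote, score);
}
```

On this sample, only the semicolon with a double quote yields three fields on every row. Multiplying by the type score is what keeps a delimiter that never occurs (one field per row, trivially uniform) from winning.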
[^1]: García W. Detecting CSV file dialects by table uniformity measurement and data type inference. Data Science. 2024;7(2):55-72. doi:10.3233/DS-240062
Installation
As a library
[dependencies]
csv-nose = "0.6"
As a CLI tool
With HTTP support (for remote URLs)
Library Usage
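use csv_nose::{SampleSize, Sniffer};

// The record count and file path are placeholders.
let mut sniffer = Sniffer::new();
sniffer.sample_size(SampleSize::Records(100));
let metadata = sniffer.sniff_path("data.csv").unwrap();

// Field names assume the qsv-sniffer-style Metadata layout.
println!("Delimiter: {}", metadata.dialect.delimiter as char);
println!("Has header: {}", metadata.dialect.header.has_header_row);
println!("Fields: {:?}", metadata.fields);
println!("Types: {:?}", metadata.types);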
CLI Usage
Remote URL Support
When built with the http feature, csv-nose can sniff remote CSV files directly from URLs:
# Build with HTTP support
# Sniff remote CSV
# Limit bytes fetched (useful for large remote files)
The HTTP feature uses Range requests when supported by the server to minimize data transfer. If the server doesn't support Range requests, it falls back to downloading and truncating at the sample size limit.
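The two strategies described above can be illustrated without any HTTP client. This is a minimal sketch under stated assumptions, not the crate's code: `range_header` and `read_sample` are hypothetical helpers, and an in-memory byte slice stands in for a response body.

```rust
use std::io::{self, Read};

/// Value of an HTTP `Range` header asking for the first `max_bytes` bytes.
/// Byte ranges are inclusive, so the last position is `max_bytes - 1`.
fn range_header(max_bytes: u64) -> Option<String> {
    if max_bytes == 0 {
        None
    } else {
        Some(format!("bytes=0-{}", max_bytes - 1))
    }
}

/// Fallback when the server ignores `Range`: read the body, but stop
/// after `max_bytes` so the remainder is never buffered.
fn read_sample<R: Read>(body: R, max_bytes: u64) -> io::Result<Vec<u8>> {
    let mut sample = Vec::new();
    body.take(max_bytes).read_to_end(&mut sample)?;
    Ok(sample)
}

fn main() -> io::Result<()> {
    // A byte slice plays the role of the response body here.
    let body: &[u8] = b"a,b,c\n1,2,3\n4,5,6\n";
    if let Some(header) = range_header(8) {
        println!("Range: {}", header);
    }
    let sample = read_sample(body, 8)?;
    println!("{:?}", String::from_utf8_lossy(&sample));
    Ok(())
}
```

Because `Read::take` caps the reader itself, the same function works for a file, a socket, or a streamed response body.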
API Compatibility
This library is designed as a drop-in replacement for qsv-sniffer, the sniffer used by qsv. The public API mirrors qsv-sniffer for easy migration:
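use csv_nose::{DatePreference, Quote, SampleSize, Sniffer};

// Argument values are illustrative; type names follow qsv-sniffer.
let mut sniffer = Sniffer::new();
sniffer
    .sample_size(SampleSize::Records(50))
    .date_preference(DatePreference::MdyFormat)
    .delimiter(b',')
    .quote(Quote::Some(b'"'));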
Benchmarks
csv-nose is benchmarked against the same test datasets used by CSVsniffer, enabling direct accuracy comparison with other CSV dialect detection tools.
Success Ratio
The table below shows the dialect detection success ratio. Accuracy is measured using only files that do not produce errors during dialect inference.
| Data set | csv-nose | CSVsniffer MADSE | CSVsniffer | CleverCSV | csv.Sniffer | DuckDB sniff_csv |
|---|---|---|---|---|---|---|
| POLLOCK | 96.62% | 95.27% | 96.55% | 95.17% | 96.35% | 84.14% |
| W3C-CSVW[^2] | 99.55% | 94.52% | 95.39% | 61.11% | 97.69% | 99.08% |
| CSV Wrangling | 87.15% | 90.50% | 89.94% | 87.99% | 84.26% | 91.62% |
| CSV Wrangling CODEC | 86.62% | 90.14% | 90.14% | 89.44% | 84.18% | 92.25% |
| CSV Wrangling MESSY | 84.92% | 89.60% | 89.60% | 89.60% | 83.06% | 91.94% |
[^2]: csv-nose is optimized for the W3C CSV on the Web test suite, on which it reaches 99.55% accuracy.
Failure Ratio
The table below shows the failure ratio for each tool: the share of files that raised an error during dialect detection.
Note: "Errors" are files that caused crashes or exceptions during processing (e.g., encoding issues, malformed data). They are distinct from incorrect detections, where a file was processed successfully but the wrong dialect was detected. A 0% error rate means all files were processed without crashes, even if some detections were incorrect.
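The relationship between the two ratios can be made concrete with a small sketch. The `Outcome` enum, the function names, and the counts below are hypothetical and chosen only for illustration. The sketch assumes the success ratio excludes errored files from its denominator, as stated in the Success Ratio section, while the failure ratio is taken over all files.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum Outcome {
    Correct, // processed, correct dialect detected
    Wrong,   // processed, wrong dialect detected
    Error,   // crashed or raised an exception
}

/// Number of outcomes equal to `kind`.
fn count(outcomes: &[Outcome], kind: Outcome) -> usize {
    outcomes.iter().filter(|&&o| o == kind).count()
}

/// Correct detections divided by files that did not error.
fn success_ratio(outcomes: &[Outcome]) -> Option<f64> {
    let processed = outcomes.len() - count(outcomes, Outcome::Error);
    if processed == 0 {
        return None;
    }
    Some(count(outcomes, Outcome::Correct) as f64 / processed as f64)
}

/// Errored files divided by all files.
fn error_ratio(outcomes: &[Outcome]) -> Option<f64> {
    if outcomes.is_empty() {
        return None;
    }
    Some(count(outcomes, Outcome::Error) as f64 / outcomes.len() as f64)
}

fn main() {
    // Hypothetical run over 148 files: 140 correct, 5 wrong, 3 errors.
    let mut outcomes = vec![Outcome::Correct; 140];
    outcomes.extend(vec![Outcome::Wrong; 5]);
    outcomes.extend(vec![Outcome::Error; 3]);
    println!("success: {:.2}%", 100.0 * success_ratio(&outcomes).unwrap());
    println!("errors:  {:.2}%", 100.0 * error_ratio(&outcomes).unwrap());
}
```

With these counts the success ratio is 140/145 and the error ratio is 3/148, so a tool can post a high success ratio while still failing to process some files.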
| Data set | csv-nose | CSVsniffer MADSE | CSVsniffer | CleverCSV | csv.Sniffer | DuckDB sniff_csv |
|---|---|---|---|---|---|---|
| POLLOCK [148 files] | 0.00% | 0.00% | 2.03% | 2.03% | 7.43% | 2.03% |
| W3C-CSVW [221 files] | 0.00% | 0.91% | 1.81% | 2.26% | 41.18% | 1.81% |
| CSV Wrangling [179 files] | 0.00% | 0.00% | 0.56% | 0.56% | 39.66% | 0.00% |
| CSV Wrangling CODEC [142 files] | 0.00% | 0.00% | 0.00% | 0.00% | 38.03% | 0.00% |
| CSV Wrangling MESSY [126 files] | 0.00% | 0.79% | 0.79% | 0.79% | 42.06% | 0.79% |
F1 Score
The F1 score is the harmonic mean of precision and recall, providing a balanced measure of dialect detection accuracy.
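As a quick reference, the harmonic mean can be computed as below. This is a generic sketch of the formula F1 = 2PR / (P + R); how precision and recall are derived from the benchmark runs is not covered here, and the function name `f1_score` is an assumption.

```rust
/// Harmonic mean of precision and recall; defined as 0 when both are 0.
fn f1_score(precision: f64, recall: f64) -> f64 {
    let sum = precision + recall;
    if sum == 0.0 {
        0.0
    } else {
        2.0 * precision * recall / sum
    }
}

fn main() {
    // Illustrative inputs only, not benchmark values.
    println!("{:.3}", f1_score(1.0, 0.5));
}
```

The harmonic mean sits closer to the smaller of the two inputs, so a tool cannot reach a high F1 score with high precision alone.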
| Data set | csv-nose | CSVsniffer MADSE | CSVsniffer | CleverCSV | csv.Sniffer | DuckDB sniff_csv |
|---|---|---|---|---|---|---|
| POLLOCK | 0.966 | 0.976 | 0.972 | 0.965 | 0.943 | 0.904 |
| W3C-CSVW | 0.995 | 0.967 | 0.967 | 0.748 | 0.730 | 0.986 |
| CSV Wrangling | 0.872 | 0.950 | 0.945 | 0.935 | 0.724 | 0.956 |
| CSV Wrangling CODEC | 0.866 | 0.948 | 0.948 | 0.944 | 0.728 | 0.959 |
| CSV Wrangling MESSY | 0.849 | 0.943 | 0.943 | 0.943 | 0.705 | 0.956 |
Component Accuracy
csv-nose's delimiter and quote detection accuracy on each dataset:
| Data set | Delimiter Accuracy | Quote Accuracy |
|---|---|---|
| POLLOCK | 96.62% | 100.00% |
| W3C-CSVW | 99.55% | 100.00% |
| CSV Wrangling | 89.94% | 96.65% |
| CSV Wrangling CODEC | 89.44% | 96.48% |
| CSV Wrangling MESSY | 88.10% | 96.03% |
NOTE: See PERFORMANCE.md for details on accuracy breakdowns and known limitations.
Benchmark Setup
The benchmark test files are not included in this repository. To run benchmarks, first clone CSVsniffer and copy the test files:
# Clone CSVsniffer (if not already available)
# Copy test files to csv-nose
Running Benchmarks
Once the test files are in place:
# Run benchmark on POLLOCK dataset
# Run benchmark on W3C-CSVW dataset
# Run benchmark on CSV Wrangling dataset (all 179 files)
# Run benchmark on CSV Wrangling filtered CODEC (142 files)
# Run benchmark on CSV Wrangling MESSY (126 non-normal files)
# Run integration tests with detailed output
License
MIT OR Apache-2.0
Naming
The name "csv-nose" is a play on words, combining "CSV" (Comma-Separated Values) with "nose," suggesting the tool's ability to "sniff out" the correct CSV dialect. "Nose" also sounds like "knows," implying expertise in CSV dialect detection.
AI Contributions
Claude Code using Opus 4.5 was used to assist in code generation and documentation. All AI-generated content has been reviewed and edited by human contributors to ensure accuracy and quality.