csv-nose: CSV dialect sniffer using the Table Uniformity Method
A drop-in replacement for qsv-sniffer with improved dialect detection accuracy using the Table Uniformity Method from the CSVsniffer paper.
§Quick Start
use csv_nose::{Sniffer, SampleSize};
// Create a sniffer with default settings
let mut sniffer = Sniffer::new();
// Optionally configure sampling
sniffer.sample_size(SampleSize::Records(100));
// Sniff a file
let metadata = sniffer.sniff_path("data.csv").unwrap();
println!("Delimiter: {}", metadata.dialect.delimiter as char);
println!("Has header: {}", metadata.dialect.header.has_header_row);
println!("Fields: {:?}", metadata.fields);
println!("Types: {:?}", metadata.types);
§API Compatibility
This crate provides API compatibility with qsv-sniffer, making it easy to switch between implementations:
use csv_nose::{Sniffer, Metadata, Dialect, Header, Quote, Type, SampleSize, DatePreference};
let mut sniffer = Sniffer::new();
sniffer
.sample_size(SampleSize::Records(50))
.date_preference(DatePreference::MdyFormat)
.delimiter(b',')
.quote(Quote::Some(b'"'));
§The Table Uniformity Method
This library implements the Table Uniformity Method from: “Wrangling Messy CSV Files by Detecting Row and Type Patterns” by van den Burg, Nazábal, and Sutton (2019).
The algorithm achieves ~93% accuracy on real-world messy CSV files by:
- Testing multiple potential dialects (delimiter, quote, line terminator combinations)
- Scoring each dialect based on table uniformity (consistent field counts)
- Scoring based on type detection (consistent data types within columns)
- Selecting the dialect with the highest combined score
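The steps above can be illustrated with a small standard-library sketch. It is not csv-nose's implementation and does not reproduce the paper's scoring formulas: every name in it (`split_fields`, `classify`, `score_delimiter`, `guess_delimiter`) is hypothetical, it only varies the delimiter (quote and line terminator are fixed), and the rule that a candidate producing a single column scores zero is an assumption added to keep the toy example well behaved.

```rust
use std::collections::HashMap;

/// Coarse cell types used only by this sketch (not the crate's `Type` enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum CellType {
    Empty,
    Integer,
    Float,
    Text,
}

/// Split one line on `delim`, ignoring delimiters inside double quotes.
/// Simplification: quote characters are dropped and escapes are not handled.
fn split_fields(line: &str, delim: char) -> Vec<String> {
    let mut fields = Vec::new();
    let mut current = String::new();
    let mut in_quotes = false;
    for ch in line.chars() {
        if ch == '"' {
            in_quotes = !in_quotes;
        } else if ch == delim && !in_quotes {
            fields.push(std::mem::take(&mut current));
        } else {
            current.push(ch);
        }
    }
    fields.push(current);
    fields
}

/// Crude type detection based on what `str::parse` accepts.
fn classify(cell: &str) -> CellType {
    let t = cell.trim();
    if t.is_empty() {
        CellType::Empty
    } else if t.parse::<i64>().is_ok() {
        CellType::Integer
    } else if t.parse::<f64>().is_ok() {
        CellType::Float
    } else {
        CellType::Text
    }
}

/// Average, over columns, of the share of cells holding the column's most common type.
fn type_score(rows: &[Vec<String>]) -> f64 {
    let width = rows.iter().map(|r| r.len()).max().unwrap_or(0);
    if width == 0 {
        return 0.0;
    }
    let mut total = 0.0;
    for col in 0..width {
        let mut counts: HashMap<CellType, usize> = HashMap::new();
        let mut seen = 0usize;
        for cell in rows.iter().filter_map(|row| row.get(col)) {
            *counts.entry(classify(cell)).or_insert(0) += 1;
            seen += 1;
        }
        let modal = counts.values().copied().max().unwrap_or(0);
        // `seen` is at least 1 because `width` is the longest row's length.
        total += modal as f64 / seen as f64;
    }
    total / width as f64
}

/// Combined score: (share of rows with the most common field count) * type score.
/// Assumption of this sketch: a candidate that yields one column scores 0.
fn score_delimiter(sample: &str, delim: char) -> f64 {
    let rows: Vec<Vec<String>> = sample
        .lines()
        .filter(|l| !l.is_empty())
        .map(|l| split_fields(l, delim))
        .collect();
    if rows.is_empty() {
        return 0.0;
    }
    let mut counts: HashMap<usize, usize> = HashMap::new();
    for row in &rows {
        *counts.entry(row.len()).or_insert(0) += 1;
    }
    // Most common field count; ties are broken in favour of the wider table.
    let (count, width) = counts.iter().map(|(&w, &c)| (c, w)).max().unwrap_or((0, 0));
    if width < 2 {
        return 0.0;
    }
    let uniformity = count as f64 / rows.len() as f64;
    uniformity * type_score(&rows)
}

/// Return the candidate with the highest positive score, if any.
fn guess_delimiter(sample: &str, candidates: &[char]) -> Option<char> {
    let mut best: Option<(char, f64)> = None;
    for &d in candidates {
        let s = score_delimiter(sample, d);
        if s > 0.0 && best.map_or(true, |(_, b)| s > b) {
            best = Some((d, s));
        }
    }
    best.map(|(d, _)| d)
}
```

Calling `guess_delimiter("name;age;score\nalice;30;1.5\nbob;25;2.0\ncarol, jr;41;3.25\n", &[',', ';', '\t', '|'])` returns `Some(';')`: the semicolon gives three fields on every row, while the comma splits only the last row and so falls back to a single-column table. Multiplying the two scores means a candidate must do well on both field-count consistency and type consistency to win; a weighted sum would be another reasonable choice.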
Re-exports§
pub use metadata::Dialect;
pub use metadata::Header;
pub use metadata::Metadata;
pub use metadata::Quote;
Modules§
Structs§
- EncodingInfo
- Information about the detected encoding.
- Sniffer
- CSV dialect sniffer using the Table Uniformity Method.
Enums§
- DatePreference
- Date format preference for ambiguous date parsing.
- SampleSize
- Sample size configuration for sniffing.
- SnifferError
- Error type for CSV sniffing operations.
- Type
- Data type detected for a CSV field.
Functions§
- detect_encoding
- Detect the encoding of the data.
- is_utf8
- Check if the given bytes are valid UTF-8.
Type Aliases§
- Result
- Result type alias for sniffing operations.