Crate csv_nose

csv-nose: CSV dialect sniffer using the Table Uniformity Method

A drop-in replacement for qsv-sniffer with improved dialect detection accuracy using the Table Uniformity Method from the CSVsniffer paper.

Quick Start

use csv_nose::{Sniffer, SampleSize};

// Create a sniffer with default settings
let mut sniffer = Sniffer::new();

// Optionally configure sampling
sniffer.sample_size(SampleSize::Records(100));

// Sniff a file
let metadata = sniffer.sniff_path("data.csv").unwrap();

println!("Delimiter: {}", metadata.dialect.delimiter as char);
println!("Has header: {}", metadata.dialect.header.has_header_row);
println!("Fields: {:?}", metadata.fields);
println!("Types: {:?}", metadata.types);

API Compatibility

This crate provides API compatibility with qsv-sniffer, making it easy to switch between implementations:

use csv_nose::{Sniffer, Metadata, Dialect, Header, Quote, Type, SampleSize, DatePreference};

let mut sniffer = Sniffer::new();
sniffer
    .sample_size(SampleSize::Records(50))
    .date_preference(DatePreference::MdyFormat)
    .delimiter(b',')
    .quote(Quote::Some(b'"'));

The Table Uniformity Method

This library implements the Table Uniformity Method from: “Wrangling Messy CSV Files by Detecting Row and Type Patterns” by van den Burg, Nazábal, and Sutton (2019).

The algorithm achieves ~93% accuracy on real-world messy CSV files by:

  1. Testing multiple potential dialects (delimiter, quote, line terminator combinations)
  2. Scoring each dialect based on table uniformity (consistent field counts)
  3. Scoring based on type detection (consistent data types within columns)
  4. Selecting the dialect with the highest combined score
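
The four steps above can be illustrated with a deliberately simplified sketch. It is not the crate's implementation and does not reproduce the paper's scoring formulas: it varies only the delimiter, ignores quoting and line terminators, and combines the two scores by plain multiplication. All names here (`CellType`, `classify`, `split_rows`, `uniformity_score`, `type_score`, `best_delimiter`) are hypothetical and exist only for this example.

```rust
use std::collections::HashMap;

/// Coarse cell type, used only for this illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum CellType {
    Empty,
    Integer,
    Float,
    Text,
}

/// Classify a single cell by trying progressively looser parses.
fn classify(cell: &str) -> CellType {
    let cell = cell.trim();
    if cell.is_empty() {
        CellType::Empty
    } else if cell.parse::<i64>().is_ok() {
        CellType::Integer
    } else if cell.parse::<f64>().is_ok() {
        CellType::Float
    } else {
        CellType::Text
    }
}

/// Step 1 (simplified): split every non-blank line on one candidate
/// delimiter. Quote handling is intentionally left out.
fn split_rows(sample: &str, delimiter: char) -> Vec<Vec<&str>> {
    sample
        .lines()
        .filter(|line| !line.trim().is_empty())
        .map(|line| line.split(delimiter).collect())
        .collect()
}

/// Step 2 (simplified): fraction of rows whose field count equals the most
/// common field count. A table that never splits into columns scores 0.
fn uniformity_score(rows: &[Vec<&str>]) -> f64 {
    if rows.is_empty() {
        return 0.0;
    }
    let mut counts: HashMap<usize, usize> = HashMap::new();
    for row in rows {
        *counts.entry(row.len()).or_insert(0) += 1;
    }
    // Break ties on the field count so the result is deterministic.
    let (&modal_len, &modal_rows) = counts
        .iter()
        .max_by_key(|&(&len, &n)| (n, len))
        .expect("rows is non-empty");
    if modal_len < 2 {
        return 0.0;
    }
    modal_rows as f64 / rows.len() as f64
}

/// Step 3 (simplified): for each column, the share of cells that have the
/// column's dominant type, averaged over all columns.
fn type_score(rows: &[Vec<&str>]) -> f64 {
    let width = rows.iter().map(|row| row.len()).max().unwrap_or(0);
    if width == 0 {
        return 0.0;
    }
    let mut total = 0.0;
    for col in 0..width {
        let mut counts: HashMap<CellType, usize> = HashMap::new();
        let mut cells = 0usize;
        for row in rows {
            if let Some(cell) = row.get(col) {
                *counts.entry(classify(cell)).or_insert(0) += 1;
                cells += 1;
            }
        }
        // `cells` is at least 1 because `width` is the longest row length.
        let dominant = counts.values().copied().max().unwrap_or(0);
        total += dominant as f64 / cells as f64;
    }
    total / width as f64
}

/// Step 4 (simplified): score each candidate and keep the highest non-zero
/// combined score.
fn best_delimiter(sample: &str, candidates: &[char]) -> Option<(char, f64)> {
    candidates
        .iter()
        .map(|&delimiter| {
            let rows = split_rows(sample, delimiter);
            (delimiter, uniformity_score(&rows) * type_score(&rows))
        })
        .filter(|&(_, score)| score > 0.0)
        .max_by(|a, b| a.1.total_cmp(&b.1))
}
```

Calling `best_delimiter(sample, &[',', ';', '\t', '|'])` on a semicolon-separated sample whose text cells contain stray commas should still select `;`, because the comma candidate produces inconsistent field counts. A real sniffer also has to handle quoted fields, headers, and richer types such as dates, which this sketch leaves out.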

Re-exports

pub use metadata::Dialect;
pub use metadata::Header;
pub use metadata::Metadata;
pub use metadata::Quote;

Modules

metadata

Structs

EncodingInfo
Information about the detected encoding.
Sniffer
CSV dialect sniffer using the Table Uniformity Method.

Enums

DatePreference
Date format preference for ambiguous date parsing.
SampleSize
Sample size configuration for sniffing.
SnifferError
Error type for CSV sniffing operations.
Type
Data type detected for a CSV field.

Functions

detect_encoding
Detect the encoding of the data.
is_utf8
Check if the given bytes are valid UTF-8.

Type Aliases

Result
Result type alias for sniffing operations.