# 📄 csv_ingest
Rust library for parsing CSV files from local files or any async source (`AsyncRead`). It focuses on high throughput, low memory use, and correctness by default.
## ✨ Features
- Automatic decompression (gzip, zstd) via content‑encoding, content‑type, or file extension
- Optional transcoding to UTF‑8 using `encoding_rs`
- Streaming CSV parsing using `csv_async` (no full‑file buffering)
- Header validation to ensure required columns exist
- Optional fast local mode (mmap + parallel memchr) for uncompressed UTF‑8 CSVs
## 🚀 Quickstart
```bash
cargo add csv_ingest
```
If you need to parse from a remote source, construct an `AsyncRead` in your app (e.g., a `reqwest` byte stream) and pass it to `build_csv_reader` / `process_csv_stream`.
```rust
// pseudo: exact signatures may differ; see the crate docs
// Wrap any AsyncRead (file, decompressed stream, HTTP body) in a CSV reader.
let reader = build_csv_reader(source);
// Stream & validate; returns headers + row_count in a CsvIngestSummary.
let summary = process_csv_stream(reader, &required_columns).await?;
```
Minimal example (local file):
```rust
use csv_ingest::{process_csv_stream, reader_from_path};
use std::path::Path;

// Sketch; exact signatures may differ slightly from the crate docs.
async fn ingest() -> anyhow::Result<()> {
    let reader = reader_from_path(Path::new("data.csv")).await?;
    let summary = process_csv_stream(reader, &["id"]).await?;
    println!("{} rows, headers: {:?}", summary.row_count, summary.headers);
    Ok(())
}
```
## 🧑‍💻 Usage
### 📦 What this library returns (data shape)
- `CsvIngestSummary`: returned by `process_csv_stream(...)`
  - `row_count: usize`
  - `headers: Vec<String>` (exact header strings from the first row)
- Streaming rows (when you iterate): `csv_async::ByteRecord`
  - Access by index: `record.get(idx) -> Option<&[u8]>`
  - Decode only if needed: `std::str::from_utf8(bytes)` or parse to numbers as required
  - You typically resolve header indices once, then read those fields per row (see the sketch below)
- Remote vs local: identical shapes; only the reader source differs
- Fast‑local (feature `fast_local`): internal path optimized for local uncompressed CSVs
  - Library returns the same `CsvIngestSummary` (and the bench can print an optional CRC for verification)
  - Assumptions are listed below; use the streaming API when those don't hold
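To make the per‑row access pattern concrete, here is a hedged sketch that uses `csv_async` directly: resolve the indices of the columns you need from the header record once, then read only those fields from each `ByteRecord`. The column names (`id`, `amount`), the `tokio_util` compat adapter, and the `anyhow` error handling are assumptions for the example, not part of this crate's API.

```rust
use csv_async::AsyncReaderBuilder;
use futures::StreamExt;
use tokio::fs::File;
use tokio_util::compat::TokioAsyncReadCompatExt; // adapts tokio AsyncRead -> futures AsyncRead

async fn scan(path: &str) -> anyhow::Result<(u64, i64)> {
    let file = File::open(path).await?.compat();
    let mut rdr = AsyncReaderBuilder::new().create_reader(file);

    // Resolve required column indices once from the header record.
    let headers = rdr.byte_headers().await?.clone();
    let find = |name: &str| {
        headers
            .iter()
            .position(|h| h == name.as_bytes())
            .ok_or_else(|| anyhow::anyhow!("missing column: {name}"))
    };
    let (id_idx, amount_idx) = (find("id")?, find("amount")?);

    // Read only the needed fields from each ByteRecord, decoding lazily.
    let (mut rows, mut total) = (0u64, 0i64);
    let mut records = rdr.byte_records();
    while let Some(record) = records.next().await {
        let record = record?;
        let _id = record.get(id_idx); // Option<&[u8]>
        if let Some(bytes) = record.get(amount_idx) {
            total += std::str::from_utf8(bytes)?.trim().parse::<i64>()?;
        }
        rows += 1;
    }
    Ok((rows, total))
}
```

Decoding with `std::str::from_utf8` happens only for the fields you actually use, which is what keeps the streaming path cheap.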
### 🌊 Streaming (recommended default)
Works for local files, gzip/zstd, and remote streams (HTTP via `reqwest`, etc.). You provide an `AsyncRead` and process `ByteRecord`s, decoding only when needed.
```rust
use csv_ingest::{process_csv_stream, reader_from_path};
use std::path::Path;

// Sketch: gzip/zstd files are detected and decompressed automatically.
// Column names are examples; exact signatures may differ from the crate docs.
#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let reader = reader_from_path(Path::new("data.csv.zst")).await?;
    let summary = process_csv_stream(reader, &["id", "amount"]).await?;
    println!("rows: {}", summary.row_count);
    Ok(())
}
```
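For a remote source, one possible wiring (a sketch, not an API this crate ships) is to adapt a `reqwest` byte stream into an `AsyncRead` with `tokio_util::io::StreamReader` and hand it to the same entry points; the URL handling, error mapping, and the `"id"` required column below are illustrative assumptions.

```rust
use futures::TryStreamExt;
use tokio_util::io::StreamReader;

// Sketch only: adapt an HTTP body into an AsyncRead and stream it through csv_ingest.
// Requires reqwest with the "stream" feature and tokio_util with the "io" feature.
async fn ingest_remote(url: &str) -> anyhow::Result<()> {
    let resp = reqwest::get(url).await?.error_for_status()?;
    // StreamReader wants an error type convertible into std::io::Error.
    let stream = resp
        .bytes_stream()
        .map_err(|e| std::io::Error::new(std::io::ErrorKind::Other, e));
    let reader = StreamReader::new(stream);

    // Same entry points as the local case; only the reader source differs.
    let csv = csv_ingest::build_csv_reader(reader);
    let summary = csv_ingest::process_csv_stream(csv, &["id"]).await?;
    println!("rows: {}", summary.row_count);
    Ok(())
}
```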
### ⚡️ Fast local mode (optional)
For local, uncompressed, UTF‑8 CSVs you control, enable the `fast_local` feature and use `--fast-local` in the bench. This path memory-maps the file, splits it by newline per core, and scans with `memchr`, extracting only the required fields.
Assumptions:
- No embedded newlines inside fields (simple quoting only)
- Single‑byte delimiter (default `,`)
- Header is first line
Use `--verify --limit` to validate on a sample when benchmarking.
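To illustrate the scanning idea only (this is not the crate's internal code), the sketch below memory-maps a file with the `memmap2` crate and walks commas with `memchr`, stopping once the last required column has been seen; per-core chunking, quote handling, CRLF trimming, and the `last_needed` parameter are simplifying assumptions.

```rust
use memchr::memchr;
use memmap2::Mmap;
use std::fs::File;

// Illustrative only: extract fields 0..=last_needed from each line of an
// uncompressed, UTF-8, simply-quoted CSV. Real code would also split the
// mmap into per-core chunks on newline boundaries.
fn scan_required(path: &str, last_needed: usize) -> std::io::Result<u64> {
    let mmap = unsafe { Mmap::map(&File::open(path)?)? };
    let mut rows = 0u64;
    for line in mmap.split(|&b| b == b'\n').skip(1) {
        // skip(1): header line
        if line.is_empty() {
            continue;
        }
        let (mut start, mut field) = (0usize, 0usize);
        // Walk commas with memchr; stop after the last required column.
        while field <= last_needed {
            let end = memchr(b',', &line[start..]).map_or(line.len(), |i| start + i);
            let _value: &[u8] = &line[start..end]; // hand this to the row handler
            field += 1;
            if end == line.len() {
                break;
            }
            start = end + 1;
        }
        rows += 1;
    }
    Ok(rows)
}
```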
## 🛠️ CLI (dev helpers)
This repo ships two binaries to generate synthetic CSV data and measure throughput.
```bash
# Build release binaries (enable fast_local for the optional mmap path)
# Generate 100M rows and compress
|
|
# Run the bench (gzip / zstd / verify subset)
# Fast local path (uncompressed UTF‑8 CSVs)
```
Flags:
- `--required <col>`: specify one or more required headers (repeatable)
- `--verify`: strict checks + CRC32 across fields (catches subtle differences)
- `--limit <N>`: limit processed rows (useful with `--verify`)
- `--fast-local` (requires `--features fast_local`): mmap + parallel scanning for local, uncompressed UTF‑8 CSVs
## 📈 Generating large datasets
```bash
# 1 billion rows (uncompressed)
# gzip
|
# zstd (often faster to read back)
|
# sanity checks
```
## 🧪 Notes on performance
- Gzip is typically the bottleneck; prefer zstd or uncompressed for peak throughput
- Put required columns early; the fast‑local path short‑circuits after the last required column
- Build with native CPU flags and release optimizations (already configured)
## 📄 License
MIT