Kira Bio Tools CD-HIT-compatible FASTQ reader
Streaming FASTQ reader with CD-HIT–compatible input handling (plain and .gz), a safe, idiomatic Rust API, and optional async support.
- Single-line FASTQ by default (sequence and quality occupy exactly one line each) to match common CD-HIT expectations.
- Multi-line FASTQ supported via an option for broader compatibility.
- Gzip auto-detection by extension or magic bytes.
- Error policy: skip malformed records (CD-HIT-like, default) or fail fast.
- Streaming: process records one-by-one without loading the whole file.
- Optional mmap for faster plain-file reads.
- Async API behind the async feature (Tokio + async-compression).
- MSRV pinned to Rust 1.85; edition = 2024.
Table of contents
- Status
- Installation
- Features
- Design goals
- Quick start (sync)
- Quick start (async)
- Single-line vs multi-line FASTQ
- Error policy
- Resynchronization behavior
- API overview
- Performance notes
- Testing & benches
- Versioning & MSRV
- License
Status
- Production-ready streaming reader for FASTQ (plain and gzip).
- Cross-platform CI for Linux, macOS, Windows.
- Intended for use in pipelines where CD-HIT input behavior must be mirrored.
Installation
[dependencies]
kira_cdh_compat_fastq_reader = "*"
Optional features:
[dependencies]
kira_cdh_compat_fastq_reader = { version = "0.1", features = ["async", "mmap", "zlib"] }
- gzip: enabled by default (gzip via flate2 with the miniz_oxide backend).
- zlib: switch flate2 to the system zlib backend (closer to CD-HIT's zlib path).
- mmap: enable memmap2 for plain files (reduces syscalls).
- async: enable the async API (Tokio + async-compression).
MSRV: 1.85.0 or newer (pinned).
Features
- CD-HIT–compatible defaults: single-line mode and a resilient “skip-bad-and-continue” policy.
- Auto gzip detection: by .gz extension or magic bytes (1F 8B).
- Streaming iterator: reads record-by-record; constant memory overhead regardless of file size.
- Clear error reporting: format errors include line/byte context.
- Minimal dependencies: core functionality keeps dependency surface small; performance extras are opt-in.
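The gzip auto-detection described above can be sketched with the standard library alone. This is a minimal illustration, not the crate's implementation: the helper names `has_gz_extension` and `starts_with_gzip_magic` are hypothetical, and only the magic bytes 1F 8B come from the text above.

```rust
use std::io::{self, BufRead};
use std::path::Path;

/// Gzip streams begin with the two magic bytes 0x1F 0x8B.
const GZIP_MAGIC: [u8; 2] = [0x1F, 0x8B];

/// Hypothetical helper: true if the path ends in `.gz` (case-insensitive).
fn has_gz_extension(path: &Path) -> bool {
    path.extension()
        .and_then(|ext| ext.to_str())
        .map(|ext| ext.eq_ignore_ascii_case("gz"))
        .unwrap_or(false)
}

/// Hypothetical helper: peek at the buffered bytes without consuming them,
/// so the same reader can afterwards be handed to a gzip or plain parser.
/// Note: `fill_buf` may return fewer than two bytes from an unusual source;
/// a robust version would handle that case explicitly.
fn starts_with_gzip_magic<R: BufRead>(reader: &mut R) -> io::Result<bool> {
    let buf = reader.fill_buf()?;
    Ok(buf.starts_with(&GZIP_MAGIC))
}
```

Checking the magic bytes in addition to the extension covers gzip data arriving on stdin or under a misleading file name.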
Design goals
- Safety first: no unsafe parsing; owned record buffers; clear error types.
- Predictable behavior: strong defaults that mirror CD-HIT expectations.
- Composability: easy to integrate in larger pipelines (sync or async).
- KISS/DRY: keep the public API small and focused.
Quick start (sync)
use ;
From stdin:
use ;
use ;
Quick start (async)
Enable the async feature:
kira_cdh_compat_fastq_reader = { version = "0.1", features = ["async"] }
use ;
async
You can also wrap any AsyncBufRead via:
// AsyncFastqReader::from_async_bufread(reader, opts)
Single-line vs multi-line FASTQ
- Single-line (default): after the @ header, sequence is exactly one line, + is one line, quality is exactly one line. This matches typical Illumina output and how CD-HIT often sees inputs.
- Multi-line: sequence and/or quality may span multiple lines. Enable by setting the line_mode field of ReaderOptions to LineMode::Multi.
Note: Single-line mode is both stricter and faster. If your datasets are multi-line, switch to LineMode::Multi.
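To make the difference between the two modes concrete, here is a standalone sketch of a record reader with a `multi_line` switch. It does not use the crate's types: `Record`, `read_trimmed_line`, and `read_record` are hypothetical names, and the multi-line rule used here (collect sequence lines until the + line, then collect quality lines until they cover the sequence length) is one common approach, assumed for illustration.

```rust
use std::io::{self, BufRead};

/// Hypothetical record type for this sketch (not the crate's `FastqRecord`).
#[derive(Debug, PartialEq)]
struct Record {
    header: String,
    seq: Vec<u8>,
    qual: Vec<u8>,
}

/// Read one line without its trailing `\n` or `\r\n`; `None` at EOF.
fn read_trimmed_line<R: BufRead>(reader: &mut R) -> io::Result<Option<String>> {
    let mut line = String::new();
    if reader.read_line(&mut line)? == 0 {
        return Ok(None);
    }
    let trimmed_len = line.trim_end_matches(|c: char| c == '\n' || c == '\r').len();
    line.truncate(trimmed_len);
    Ok(Some(line))
}

fn invalid(msg: &str) -> io::Error {
    io::Error::new(io::ErrorKind::InvalidData, msg.to_string())
}

/// Read one record, or `Ok(None)` at a clean EOF between records.
fn read_record<R: BufRead>(reader: &mut R, multi_line: bool) -> io::Result<Option<Record>> {
    let header = match read_trimmed_line(reader)? {
        None => return Ok(None),
        Some(h) => h,
    };
    if !header.starts_with('@') {
        return Err(invalid("missing '@' header"));
    }

    // Sequence: exactly one line in single-line mode, any number otherwise.
    let mut seq = Vec::new();
    let mut seq_lines = 0;
    loop {
        let line = read_trimmed_line(reader)?
            .ok_or_else(|| invalid("unexpected EOF before '+' line"))?;
        if line.starts_with('+') {
            break;
        }
        if !multi_line && seq_lines == 1 {
            return Err(invalid("missing '+' line"));
        }
        seq.extend_from_slice(line.as_bytes());
        seq_lines += 1;
    }
    if seq.is_empty() {
        return Err(invalid("empty sequence"));
    }

    // Quality: one line, or (multi-line) as many lines as needed to cover `seq`.
    let mut qual = Vec::new();
    loop {
        let line = read_trimmed_line(reader)?
            .ok_or_else(|| invalid("unexpected EOF in quality"))?;
        qual.extend_from_slice(line.as_bytes());
        if !multi_line || qual.len() >= seq.len() {
            break;
        }
    }
    if qual.len() != seq.len() {
        return Err(invalid("sequence/quality length mismatch"));
    }
    Ok(Some(Record { header: header[1..].to_string(), seq, qual }))
}
```

In this sketch the single-line path reads a fixed four lines per record, which is why it can be both stricter and cheaper than the multi-line path.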
Error policy
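With ErrorPolicy::Skip (the default), malformed records are skipped; with ErrorPolicy::Return, the reader fails fast on the first one.
Typical format errors include:
- Missing header @ (or encountering FASTA > in FASTQ-only mode).
- Missing + line.
- Unexpected EOF inside a record.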
- Length mismatch between sequence and quality.
- Empty sequence.
All errors carry an I/O context (byte offset and line number).
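As an illustration of errors that carry position context, the sketch below validates the four lines of a single-line record and reports the line number and byte offset on failure. The names `Kind`, `FormatIssue`, and `check_record` are hypothetical and are not the crate's `FastqError` / `FormatError`; unexpected EOF is left out because the function receives all four lines up front.

```rust
use std::fmt;

/// Hypothetical error kinds mirroring the list above.
#[derive(Debug, PartialEq)]
enum Kind {
    MissingHeader,
    MissingPlus,
    LengthMismatch { seq: usize, qual: usize },
    EmptySequence,
}

/// Hypothetical error value: what went wrong, plus where the record started.
#[derive(Debug, PartialEq)]
struct FormatIssue {
    kind: Kind,
    line: u64,        // 1-based line number of the record's first line
    byte_offset: u64, // offset of the record's first byte
}

impl fmt::Display for FormatIssue {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "line {}, byte {}: ", self.line, self.byte_offset)?;
        match &self.kind {
            Kind::MissingHeader => write!(f, "expected '@' header"),
            Kind::MissingPlus => write!(f, "expected '+' separator"),
            Kind::LengthMismatch { seq, qual } => {
                write!(f, "sequence length {} != quality length {}", seq, qual)
            }
            Kind::EmptySequence => write!(f, "empty sequence"),
        }
    }
}

impl std::error::Error for FormatIssue {}

/// Validate the four lines of a single-line record (header, seq, '+', qual).
fn check_record(lines: [&str; 4], line: u64, byte_offset: u64) -> Result<(), FormatIssue> {
    let [header, seq, plus, qual] = lines;
    let issue = |kind| FormatIssue { kind, line, byte_offset };
    if !header.starts_with('@') {
        return Err(issue(Kind::MissingHeader));
    }
    if seq.is_empty() {
        return Err(issue(Kind::EmptySequence));
    }
    if !plus.starts_with('+') {
        return Err(issue(Kind::MissingPlus));
    }
    if seq.len() != qual.len() {
        return Err(issue(Kind::LengthMismatch { seq: seq.len(), qual: qual.len() }));
    }
    Ok(())
}
```

Keeping the kind and the position in separate fields lets callers match on the kind while still printing a message that points at the offending record.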
Resynchronization behavior
With ErrorPolicy::Skip, the parser attempts to resynchronize at the next line starting with @. This mirrors the robust “keep going” behavior often expected in CD-HIT pipelines when inputs contain occasional malformed records.
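The resynchronization rule can be sketched as follows. This is a simplified, in-memory illustration (the real reader streams), and `parse_with_skip` is a hypothetical name. One caveat worth keeping in mind: `@` is also a valid quality character in Phred+33 encoding, so a quality line that happens to start with `@` can be mistaken for a header during resync.

```rust
/// Parse single-line FASTQ text, skipping malformed records.
/// Returns the well-formed records as (id, seq, qual) and the number of
/// resynchronization events. Holds all lines in memory for brevity only.
fn parse_with_skip(input: &str) -> (Vec<(String, String, String)>, usize) {
    let lines: Vec<&str> = input.lines().collect();
    let mut records = Vec::new();
    let mut skipped = 0;
    let mut i = 0;
    while i < lines.len() {
        let well_formed = i + 3 < lines.len()
            && lines[i].starts_with('@')
            && lines[i + 2].starts_with('+')
            && !lines[i + 1].is_empty()
            && lines[i + 1].len() == lines[i + 3].len();
        if well_formed {
            records.push((
                lines[i][1..].to_string(),
                lines[i + 1].to_string(),
                lines[i + 3].to_string(),
            ));
            i += 4;
            continue;
        }
        // Malformed: count it, then resynchronize at the next line
        // that starts with '@' (always advancing by at least one line).
        skipped += 1;
        i += 1;
        while i < lines.len() && !lines[i].starts_with('@') {
            i += 1;
        }
    }
    (records, skipped)
}
```

For example, an input where the second record lacks its + line yields the first and third records and one resync event, rather than aborting the whole run.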
API overview
Types
- FastqReader: synchronous streaming reader (plain or .gz).
- AsyncFastqReader: asynchronous streaming reader (feature async).
- FastqRecord: { id, desc: Option<String>, seq: Vec<u8>, qual: Vec<u8> }.
- ReaderOptions: { error_policy, fastq_only, line_mode }.
- ErrorPolicy: Skip or Return.
- LineMode: Single or Multi.
- FastqError / FormatError: detailed error types with context.
Construction
// sync
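let mut r = FastqReader::from_path(path, opts)?;
// or
let mut r = FastqReader::from_bufread(reader, opts);
// async
let mut ar = AsyncFastqReader::from_path(path, opts).await?;
// or
let mut ar = AsyncFastqReader::from_async_bufread(reader, opts);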
Iteration
// sync
for item in &mut r { /* ... */ }
// async
while let Some(item) = ar.next_record().await { /* ... */ }
Performance notes
- Plain FASTQ + mmap (--features mmap): can reduce syscalls and improve throughput on fast storage (commonly +5–30% vs buffered reads).
- Gzip:
  - The default flate2 backend (miniz_oxide) provides solid performance.
  - --features zlib switches to system zlib for closer parity with CD-HIT's zlib path.
- I/O-bound workloads benefit most from larger buffers and sequential access patterns; in CPU-bound cases (e.g., heavy downstream processing), the downstream work usually dwarfs parse costs.
Use cargo bench to evaluate on your hardware and datasets.
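As a rough way to experiment with buffer sizes outside the crate, the sketch below counts single-line records using only the standard library. `count_records` is a hypothetical helper: it takes any `Read` source and a caller-chosen `BufReader` capacity, and reuses one line buffer so memory stays constant regardless of input size.

```rust
use std::io::{self, BufRead, BufReader, Read};

/// Count single-line FASTQ records (four lines each) from any reader.
/// `capacity` sets the BufReader's internal buffer size in bytes.
fn count_records<R: Read>(inner: R, capacity: usize) -> io::Result<u64> {
    let mut reader = BufReader::with_capacity(capacity, inner);
    let mut line = Vec::with_capacity(512); // reused for every line
    let mut lines = 0u64;
    loop {
        line.clear();
        // read_until appends up to and including the delimiter; 0 means EOF.
        if reader.read_until(b'\n', &mut line)? == 0 {
            break;
        }
        lines += 1;
    }
    Ok(lines / 4)
}
```

Passing a `std::fs::File` with capacities such as 64 KiB and 1 MiB, and timing each run, gives a quick baseline for how much buffer size matters on a given disk.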
Testing & benches
# default features (gzip enabled)
cargo test
# all features
cargo test --all-features
# benches
cargo bench
Tests cover:
- Basic parsing (single-line).
- Gzip files.
- Skip vs Return policies.
- Async parsing (behind the async feature).
Versioning & MSRV
- MSRV: >=1.85.
- SemVer: public API follows semantic versioning. Breaking changes trigger a major version bump.
License
Licensed under GPLv2, like CD-HIT.