kira_cdh_compat_clstr
CD-HIT–compatible .clstr utilities (writer, reader, and a semantic diff CLI) in Rust (edition 2024).
- Writer: emits cluster blocks (>Cluster N) with member lines; the first member is marked with *.
- Reader: parses cluster files and extracts member IDs using the conventional >{id}... pattern.
- Diff CLI: compares two .clstr files semantically (as sets of sets), ignoring ordering differences.
This crate is intentionally small, deterministic, and production-friendly.
Installation
Use it inside a Cargo workspace:
[dependencies]
kira_cdh_compat_clstr = "*"
Build the CLI (clstr-diff) too:
Format compatibility
The writer/reader adhere to the widely used subset of CD-HIT .clstr:
- Cluster header: >Cluster {number}
- Member lines:
  - With length prefix and unit (optional): {ordinal}\t{length}{unit}, >{id}... {*}
    Examples of the prefix: 150nt, or 300aa,
  - Without length prefix: {ordinal}\t>{id}... {*}
- The first member is the representative and ends with *.
- ID extraction rule (reader): take the substring after the first > up to the first occurrence of "...". If "..." is not present, the rest of the line after > is used. Surrounding whitespace is trimmed and a trailing comma is dropped.
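The ID extraction rule can be sketched as a small standalone function. This is an illustration of the rule as stated, not the crate's actual code: the name extract_id is hypothetical, and returning None for an empty ID is an assumption.

```rust
/// Hypothetical helper (not part of the crate's API): extract a member ID
/// from one `.clstr` member line, following the rule described above.
fn extract_id(line: &str) -> Option<&str> {
    // Everything after the first '>' is the candidate ID.
    let start = line.find('>')? + 1;
    let rest = &line[start..];
    // Stop at the first "..." if present; otherwise keep the rest of the line.
    let raw = match rest.find("...") {
        Some(end) => &rest[..end],
        None => rest,
    };
    // Trim surrounding whitespace, then drop one trailing comma.
    let trimmed = raw.trim();
    let id = trimmed.strip_suffix(',').unwrap_or(trimmed).trim_end();
    // Assumption: an empty ID is treated as "no ID on this line".
    if id.is_empty() { None } else { Some(id) }
}
```

A caller would need to skip >Cluster header lines before applying this, since a header also contains a > character.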
Library API
use ;
// --- Writing ---
let headers = vec!;
let lengths = vec!;
let clusters = vec!; // indices into `headers`
let mut w = create?;
w.write_cluster?;
w.write_cluster?;
w.finish?;
// --- Reading ---
let parsed = read_clusters?;
assert_eq!;
assert_eq!;
assert_eq!;
# Ok::
Types
/// Length unit annotation for writer; use `None` to omit lengths.
/// Create a writer and emit clusters.
/// Parse clusters as `Vec<Vec<String>>` of member IDs.
;
/// Parse from any `Read`.
;
Example output
>Cluster 0
0 150nt, >seqA... *
1 140nt, >seqB...
>Cluster 1
0 130nt, >seqC... *
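For readers who want to see how lines like these are assembled, here is a minimal sketch of the formatting rules from the Format compatibility section. The Unit enum, format_member, and format_cluster are hypothetical stand-ins written for this example; they mirror the ClstrUnit variants named in this README but are not the crate's writer.

```rust
/// Stand-in for the crate's length-unit annotation (assumed shape).
#[derive(Clone, Copy)]
enum Unit {
    Nt,
    Aa,
    None,
}

/// Format one member line: `{ordinal}\t[{length}{unit}, ]>{id}...[ *]`.
fn format_member(ordinal: usize, id: &str, length: usize, unit: Unit) -> String {
    let mut line = format!("{ordinal}\t");
    match unit {
        Unit::Nt => line.push_str(&format!("{length}nt, ")),
        Unit::Aa => line.push_str(&format!("{length}aa, ")),
        Unit::None => {} // omit the length prefix entirely
    }
    line.push_str(&format!(">{id}..."));
    // The first member of a cluster is the representative.
    if ordinal == 0 {
        line.push_str(" *");
    }
    line
}

/// Format a whole cluster block: header line followed by member lines.
fn format_cluster(number: usize, members: &[(&str, usize)], unit: Unit) -> String {
    let mut out = format!(">Cluster {number}\n");
    for (ordinal, (id, length)) in members.iter().enumerate() {
        out.push_str(&format_member(ordinal, id, *length, unit));
        out.push('\n');
    }
    out
}
```

Calling format_cluster(1, &[("seqC", 130)], Unit::Nt) yields the second block of the example above, with a tab between the ordinal and the length.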
CLI: clstr-diff
Compare two .clstr files semantically (as partitions), ignoring the order of clusters and the order of members inside each cluster.
# Build
# Usage
Exit codes
- 0: partitions are semantically equal
- 1: differences detected (reported to stderr)
- 2: I/O or parse error
Notes
- The diff prints a limited sample of differing clusters for brevity.
- This is ideal for validating alternative implementations against a CD-HIT reference.
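The semantic comparison can be illustrated with ordered sets from the standard library. This is a sketch of the idea only: canonicalize, diff_exit_code, and the sample limit of 5 are assumptions made for the example, and exit code 2 (I/O or parse error) is outside its scope because it takes already-parsed clusters.

```rust
use std::collections::BTreeSet;

/// A partition is a set of clusters; each cluster is a set of member IDs.
type Partition = BTreeSet<BTreeSet<String>>;

/// Drop all ordering information by collecting into ordered sets.
fn canonicalize(clusters: &[Vec<String>]) -> Partition {
    clusters
        .iter()
        .map(|c| c.iter().cloned().collect::<BTreeSet<String>>())
        .collect()
}

/// Return 0 when both inputs describe the same partition, 1 otherwise.
/// A limited sample of differing clusters is printed to stderr.
fn diff_exit_code(a: &[Vec<String>], b: &[Vec<String>]) -> i32 {
    let (pa, pb) = (canonicalize(a), canonicalize(b));
    if pa == pb {
        return 0;
    }
    // Clusters present in exactly one of the two partitions.
    for cluster in pa.symmetric_difference(&pb).take(5) {
        eprintln!("differs: {:?}", cluster);
    }
    1
}
```

Because BTreeSet iterates in sorted order, the sample printed for a given pair of inputs is the same on every run.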
Integration tips
- Keep a mapping from your internal sequence indices to stable headers (IDs) to generate reproducible .clstr.
- For amino-acid data, pass ClstrUnit::Aa; for nucleotides, ClstrUnit::Nt; or ClstrUnit::None to omit lengths.
- If you need strict parsing, validate your inputs before calling read_clusters. The provided reader is intentionally tolerant of minor formatting differences (CD-HIT behaviour).
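One way to do the strict validation mentioned in the last tip is a pre-pass over the file contents. The sketch below is an assumption about what "strict" might mean (headers with numeric IDs, consecutive ordinals starting at 0, a tab after the ordinal, and a * only on the first member). validate_strict is a hypothetical helper, not something the crate provides.

```rust
/// Hypothetical strict pre-check for `.clstr` text; returns the first
/// problem found as a human-readable message.
fn validate_strict(text: &str) -> Result<(), String> {
    // Ordinal expected on the next member line; None until a header is seen.
    let mut expected: Option<usize> = None;
    for (idx, line) in text.lines().enumerate() {
        let n = idx + 1;
        if let Some(num) = line.strip_prefix(">Cluster ") {
            num.trim()
                .parse::<usize>()
                .map_err(|_| format!("line {n}: bad cluster number"))?;
            expected = Some(0);
            continue;
        }
        if line.trim().is_empty() {
            continue; // blank lines are ignored
        }
        let ordinal =
            expected.ok_or_else(|| format!("line {n}: member before any header"))?;
        let (ord_text, rest) = line
            .split_once('\t')
            .ok_or_else(|| format!("line {n}: missing tab after ordinal"))?;
        if ord_text.parse::<usize>().ok() != Some(ordinal) {
            return Err(format!("line {n}: expected ordinal {ordinal}"));
        }
        if !rest.contains('>') || !rest.contains("...") {
            return Err(format!("line {n}: missing >{{id}}... pattern"));
        }
        // Only the first member (the representative) may end with '*'.
        let is_rep = rest.trim_end().ends_with('*');
        if is_rep != (ordinal == 0) {
            return Err(format!("line {n}: representative marker mismatch"));
        }
        expected = Some(ordinal + 1);
    }
    Ok(())
}
```

Files that pass a check like this can then be handed to the tolerant reader without relying on its leniency.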
Performance
- Writer uses BufWriter and performs O(n) emission over cluster members.
- Reader is streaming and allocation-light; it parses line by line and extracts IDs without regexes.
- The diff CLI canonicalizes clusters to sets of sets using ordered containers to ensure deterministic results.
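To make the streaming, regex-free approach concrete, here is a minimal line-by-line reader over any BufRead. It is a sketch written for this README, not the crate's reader: read_clusters_sketch is a hypothetical name, and silently skipping lines that appear before the first header is an assumption about tolerant behaviour.

```rust
use std::io::{self, BufRead};

/// Hypothetical streaming parser: returns member IDs grouped by cluster.
fn read_clusters_sketch<R: BufRead>(reader: R) -> io::Result<Vec<Vec<String>>> {
    let mut clusters: Vec<Vec<String>> = Vec::new();
    for line in reader.lines() {
        let line = line?;
        // A header line starts a new, initially empty cluster.
        if line.starts_with(">Cluster") {
            clusters.push(Vec::new());
            continue;
        }
        // Member line: the ID is the text after the first '>' up to "...".
        let Some(pos) = line.find('>') else { continue };
        let rest = &line[pos + 1..];
        let raw = rest.find("...").map_or(rest, |end| &rest[..end]);
        let trimmed = raw.trim();
        let id = trimmed.strip_suffix(',').unwrap_or(trimmed).trim_end();
        // Assumption: members seen before any header, or empty IDs, are skipped.
        match clusters.last_mut() {
            Some(current) if !id.is_empty() => current.push(id.to_string()),
            _ => {}
        }
    }
    Ok(clusters)
}
```

Only one line is held in memory at a time, and the only allocations are the line buffer and the ID strings that are kept. Since &[u8] implements BufRead, an in-memory string can be parsed with read_clusters_sketch(text.as_bytes()).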
Testing
- Unit tests cover round-trip write/read and ID extraction semantics.
- For end-to-end validation, pair this crate with your clustering engine and run clstr-diff against a known-good .clstr produced by CD-HIT.
License
GPLv2.