kira_cdh_compat_cluster
Greedy clustering for CD-HIT–like pipelines in Rust (edition 2024).
- Input: an LSH index and fixed-length MinHash/KMV signatures (one per sequence), plus optional sequence lengths.
- Similarity: MinHash-based Jaccard estimate (fraction of equal positions in two signatures).
- Coverage gating: optional CD-HIT–style
-aS/-aLchecks derived from Jaccard and sequence lengths (no alignment needed). - Deterministic: stable order; optional length-first processing to mimic CD-HIT.
- Parallel: optional multi-threaded candidate checks via the
parallelfeature.
This crate is designed to “glue” a candidate retrieval stage (e.g., LSH) with a lightweight, deterministic greedy clustering engine that emits cluster partitions compatible with CD-HIT expectations.
Installation
Add to your Cargo workspace (recommended):
[]
= "*"
# If you need multi-threaded candidate checks:
[]
= []
# enable at build time: --features parallel
The engine expects LSH and MinHash/KMV signatures from relative crates, e.g.:
kira_cdh_compat_fastq_reader— zero-copy FASTQ/FASTA reading (sync/async), with gzip support.kira_cdh_compat_kmer_indexer— k-mer indexer.kira_cdh_compat_lsh— KMV sketches and LSH index (insert/build/query).
Quickstart
use ;
use ;
use FastqReader;
// 1) Build MinHash/KMV signatures and lengths
let mut signatures: = Vecnew; // all same length (e.g., 128)
let mut lengths: = Vecnew;
let k: usize = 11; // k-mer size used for hashing (tune per dataset)
let mut r = from_path?;
for item in &mut r
// 2) Build an LSH index for candidate retrieval
let params = new?; // bands, rows — tune for your target similarity
let mut index = with_params;
for in signatures.iter.enumerate
index.build;
// 3) Run greedy clustering
let opts = ClusterOptions ;
let clusterer = GreedyClusterer ;
let result = clusterer.run;
println!;
// result.clusters is Vec<Vec<usize>> (sequence indices per cluster)
To emit CD-HIT-compatible
.clstr, pair this crate with a writer (e.g., a smallkira_cdh_compat_clstrutility crate) and print:>Cluster 0 0\t150nt, >seqA... * 1\t140nt, >seqB... ...
Coverage gating (CD-HIT-style -aS/-aL)
When sequences differ in length, a high MinHash Jaccard can still arise from partial overlap.
CD-HIT mitigates this with short/long coverage gates.
We provide a fast approximation that requires only Jaccard, lengths, and k.
-
Let
L_sandL_lbe short/long sequence lengths;kis the k-mer size. -
Window counts:
A = max(L_s - k + 1, 0) B = max(L_l - k + 1, 0) -
With
Ĵthe MinHash Jaccard estimate (equal positions / signature length), approximate the number of shared windows:I ≈ Ĵ * (A + B) / (1 + Ĵ) -
Coverage for short/long:
cov_short = I / A cov_long = I / B -
Accept a candidate if:
Ĵ ≥ similarity_threshold && (cov_short ≥ min_coverage_short if set) && (cov_long ≥ min_coverage_long if set)
Enabling coverage gates:
let opts = ClusterOptions ;
Practical profiles (nucleotide)
| Task | k | similarity_threshold |
min_coverage_short |
min_coverage_long |
Notes |
|---|---|---|---|---|---|
| High-identity clustering | 9–11 | 0.95–0.99 | 0.90–0.98 | 0.0–0.5 | Stable contigs/transcripts |
| 97% OTU/ASV-like grouping | 9–12 | 0.90–0.97 | 0.85–0.95 | 0.0–0.5 | 16S/SSU; set -aS slightly below Ĵ |
| Aggressive deduplication | 11–13 | 0.98–0.995 | 0.95–0.99 | 0.0–0.5 | Long reads or assembled contigs |
Hints:
- Larger
ktightens windows—great for high identity, too strict for noisy data. - Use length-first ordering (
sort_by_length=true) to reduce false merges. - Tune LSH
(bands, rows)to bend the collision curve near your target Jaccard.
API overview
/// MinHash Jaccard estimate (equal positions / signature length).
;
- The engine assumes all signatures share the same length.
LshIndexshould return candidate sequence IDs (not internal bucket IDs).- Deterministic behavior: the processing order is fixed; with
sort_by_length=true, longer sequences seed first.
Performance tips
- Enable
--features parallelto parallelize similarity checks within candidate sets. - Batch insert/build in LSH; use “insert_batch” APIs when available to reduce allocation churn.
- Keep signature lengths moderate (e.g., 128–256 u64s); increase only if collision rates are too high.
- Use a fast, stable k-mer hash (SIMD-friendly, no allocations).
Examples
examples/greedy_basic.rs— minimal run with synthetic signatures.examples/greedy_coverage.rs— same with coverage gating enabled.
Run:
License
GPLv2