Crate chordclust


Chordclust implements similarity clustering using rust-bio.

§Algorithm

The algorithm is a greedy search, similar to the one described at https://www.drive5.com/usearch/manual/uclust_algo.html. For now, it uses similarity instead of identity.

Sort the sequences by length, longest first.
For each sequence, compare it with the database of centroids:

If similarity with the best match > T (the similarity threshold): add the sequence to the cluster of the best match.
Else: form a new cluster.
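
The steps above can be sketched in plain Rust. This is a minimal illustration and not the crate's implementation: `kmer_similarity` (a Jaccard index over k-mer sets) and `greedy_cluster` are hypothetical names, and the similarity score that chordclust actually computes may differ.

```rust
use std::collections::HashSet;

/// Collect the distinct k-mers of a sequence. `k` must be greater than 0.
fn kmers(s: &str, k: usize) -> HashSet<&[u8]> {
    s.as_bytes().windows(k).collect()
}

/// Hypothetical similarity measure (an assumption, not the crate's score):
/// Jaccard index over the k-mer sets of the two sequences.
fn kmer_similarity(a: &str, b: &str, k: usize) -> f64 {
    let (ka, kb) = (kmers(a, k), kmers(b, k));
    let union = ka.union(&kb).count();
    if union == 0 {
        return 0.0;
    }
    ka.intersection(&kb).count() as f64 / union as f64
}

/// Greedy clustering. Returns clusters as lists of indices into `seqs`;
/// the first index of each cluster is its centroid.
fn greedy_cluster(seqs: &[String], threshold: f64, k: usize) -> Vec<Vec<usize>> {
    // Visit sequences from longest to shortest (stable for equal lengths).
    let mut order: Vec<usize> = (0..seqs.len()).collect();
    order.sort_by(|&a, &b| seqs[b].len().cmp(&seqs[a].len()));

    let mut clusters: Vec<Vec<usize>> = Vec::new();
    for i in order {
        // Find the centroid most similar to sequence i.
        let mut best: Option<(usize, f64)> = None;
        for (c, members) in clusters.iter().enumerate() {
            let s = kmer_similarity(&seqs[i], &seqs[members[0]], k);
            if best.map_or(true, |(_, b)| s > b) {
                best = Some((c, s));
            }
        }
        match best {
            // Similar enough: join the best-matching cluster.
            Some((c, s)) if s > threshold => clusters[c].push(i),
            // Otherwise the sequence becomes the centroid of a new cluster.
            _ => clusters.push(vec![i]),
        }
    }
    clusters
}
```

Because sequences are visited longest first, each centroid is the longest member of its cluster. This sketch compares every sequence against every centroid; a k-mer index over the centroids would let the search skip most of those comparisons.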

Functions§

cluster_similarity: Cluster a buffer by similarity. This is meant for use in examples, but it is not very useful.
cluster_slice: Cluster a slice of Strings by similarity. Each element of a cluster has a similarity > similarity_threshold with the cluster's centroid. k is the size of the k-mers used to perform the search.
read_fasta_sorted: Read the sequences inside a buffer in FASTA format and store them in a Vec<String> sorted by length. Because the algorithm is greedy, the first sequence that is seen will form a cluster.
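
As an illustration of reading FASTA records and ordering them longest first, here is a minimal sketch that uses only the standard library rather than rust-bio. The function name `read_fasta_by_length` and its signature are assumptions made for this example; they are not the signature of `read_fasta_sorted`.

```rust
use std::io::BufRead;

/// Hypothetical helper (not the crate's API): read FASTA records from
/// `reader` and return the sequences sorted by length, longest first.
fn read_fasta_by_length<R: BufRead>(reader: R) -> std::io::Result<Vec<String>> {
    let mut seqs: Vec<String> = Vec::new();
    let mut current = String::new();
    let mut in_record = false;

    for line in reader.lines() {
        let line = line?;
        let line = line.trim();
        if line.starts_with('>') {
            // A header line starts a new record; flush the previous one.
            if !current.is_empty() {
                seqs.push(std::mem::take(&mut current));
            }
            in_record = true;
        } else if in_record {
            // Sequence data may span several lines; join them.
            current.push_str(line);
        }
    }
    // Flush the last record.
    if !current.is_empty() {
        seqs.push(current);
    }

    // Longest first, so the longest sequence becomes the first centroid.
    seqs.sort_by(|a, b| b.len().cmp(&a.len()));
    Ok(seqs)
}
```

Any `BufRead` source works, for example `std::io::BufReader::new(File::open(path)?)` or a byte slice such as `&b">id\nACGT\n"[..]`. Text before the first header line is ignored, and headers themselves are discarded.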
