cvmcount 0.1.0

use the CVM algorithm to quickly estimate the number of distinct elements in a stream
Documentation
cvmcount-0.1.0 has been yanked.

Rust implementation of the CVM counting algorithm

This library implements

Chakraborty, S., Vinodchandran, N. V., & Meel, K. S. (2022). Distinct Elements in Streams: An Algorithm for the (Text) Book. 6 pages, 727571 bytes. https://doi.org/10.4230/LIPIcs.ESA.2022.34

The blog post describing the algorithm is here: https://www.quantamagazine.org/computer-scientists-invent-an-efficient-new-way-to-count-20240516/

CLI Example

cvmcount file.txt 0.8 0.1 2900

The --help option is available.

Note

If you're thinking about using this library, you presumably know that it only provides an estimate (within the specified bounds), similar to something like HyperLogLog. You are trading accuracy for speed!

Implementation Details

This library strips punctuation from input tokens using a regex. I assume there is a small performance penalty, but it seems like a small price to pay for increased practicality.