cvmcount 0.1.1

Use the CVM algorithm to quickly estimate the number of distinct elements in a stream.
cvmcount-0.1.1 has been yanked.

Rust implementation of the CVM counting algorithm

This library implements the algorithm described in

Chakraborty, S., Vinodchandran, N. V., & Meel, K. S. (2022). Distinct Elements in Streams: An Algorithm for the (Text) Book. 6 pages. https://doi.org/10.4230/LIPIcs.ESA.2022.34

The accompanying article in Quanta is here: https://www.quantamagazine.org/computer-scientists-invent-an-efficient-new-way-to-count-20240516/
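For readers who want a feel for the algorithm before reading the paper, here is a minimal sketch of Algorithm 1 as the paper describes it. It is not this crate's source code: the names `cvm_estimate` and `XorShift64` are hypothetical, and the small xorshift generator is only there so the sketch needs no external crates (a real implementation would more likely use the `rand` crate).

```rust
use std::collections::HashSet;

/// Tiny xorshift64* generator, used only to keep this sketch dependency-free.
/// (Assumption: not what the crate uses.)
struct XorShift64(u64);

impl XorShift64 {
    fn new(seed: u64) -> Self {
        // The xorshift state must never be zero.
        XorShift64(if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed })
    }

    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x >> 12;
        x ^= x << 25;
        x ^= x >> 27;
        self.0 = x;
        x.wrapping_mul(0x2545_F491_4F6C_DD1D)
    }

    /// Uniform float in [0, 1).
    fn next_f64(&mut self) -> f64 {
        (self.next_u64() >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// Estimate the number of distinct items in `stream` using a buffer that
/// holds fewer than `thresh` items at the end of each step. Returns `None`
/// in the (very unlikely) failure case the paper writes as "output ⊥".
fn cvm_estimate<'a, I>(stream: I, thresh: usize, seed: u64) -> Option<f64>
where
    I: IntoIterator<Item = &'a str>,
{
    assert!(thresh >= 1, "buffer size must be at least 1");
    let mut rng = XorShift64::new(seed);
    let mut p = 1.0_f64; // current sampling probability
    let mut buf: HashSet<&str> = HashSet::new();

    for item in stream {
        // Forget any earlier decision about this item, then re-sample it.
        buf.remove(item);
        if rng.next_f64() < p {
            buf.insert(item);
        }
        if buf.len() == thresh {
            // Buffer full: keep each element with probability 1/2 and
            // halve the sampling probability for everything that follows.
            buf.retain(|_| rng.next_u64() >> 63 == 0);
            p /= 2.0;
            if buf.len() == thresh {
                return None; // nothing was evicted
            }
        }
    }
    // Each surviving element stands in for 1/p distinct elements.
    Some(buf.len() as f64 / p)
}
```

When the buffer is larger than the number of distinct elements, `p` never drops below 1 and the result is exact; for example, the stream `a b a c b` with `thresh = 10` yields `Some(3.0)`. The estimate only becomes approximate once the buffer fills and elements start being evicted.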

CLI Example

cargo install cvmcount
cvmcount file.txt 0.8 0.1 2900

Run cvmcount --help to see the available options.

Note

If you're thinking about using this library, you presumably know that it only provides an estimate of the distinct-element count (within the bounds you specify), much as HyperLogLog does. You are trading accuracy for speed!
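To make "within the specified bounds" concrete: in the paper, an accuracy parameter epsilon and a failure probability delta determine the buffer size, and the estimate then lies within a factor of (1 ± epsilon) of the true count with probability at least 1 - delta. The sketch below computes that buffer size using the formula from Algorithm 1 of the paper. The function name `cvm_threshold` is hypothetical, and whether this crate uses exactly this formula is an assumption.

```rust
/// Buffer size from Algorithm 1 of the paper:
/// ceil((12 / epsilon^2) * log2(8 * m / delta)), where m is the stream length.
/// (Hypothetical helper; not taken from the crate's API.)
fn cvm_threshold(epsilon: f64, delta: f64, stream_len: usize) -> usize {
    assert!(epsilon > 0.0, "epsilon must be positive");
    assert!(delta > 0.0 && delta < 1.0, "delta must be in (0, 1)");
    assert!(stream_len > 0, "stream length must be positive");

    let bits = (8.0 * stream_len as f64 / delta).log2();
    ((12.0 / (epsilon * epsilon)) * bits).ceil() as usize
}
```

With epsilon = 0.8, delta = 0.1, and a stream length of 2900, the formula gives a buffer of 335 elements. Those numbers are borrowed from the CLI example purely as sample inputs, since this README does not say what each positional argument means. A smaller epsilon grows the buffer quadratically, which is where the trade between accuracy and resources shows up.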

Implementation Details

This library strips punctuation from input tokens using a regex. I assume this carries a small performance penalty, but that seems a fair price to pay for increased practicality.
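The point of this normalisation is that tokens such as `word,` and `word` should count as the same element. As a rough illustration only, the sketch below does something similar with the standard library instead of a regex. The function names are hypothetical, and which characters the crate's regex actually removes is not stated here, so dropping ASCII punctuation is an assumption.

```rust
/// Remove ASCII punctuation from a single token.
/// (Assumption: the crate's regex may target a different character set.)
fn strip_punctuation(token: &str) -> String {
    token.chars().filter(|c| !c.is_ascii_punctuation()).collect()
}

/// Split a line on whitespace, clean each token, and drop any token
/// that consisted only of punctuation.
fn clean_tokens(line: &str) -> Vec<String> {
    line.split_whitespace()
        .map(strip_punctuation)
        .filter(|t| !t.is_empty())
        .collect()
}
```

Note that this sketch does not change letter case, so `Hello` and `hello` would still count as two distinct elements.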