heavykeeper-rs
Top-K Heavykeeper algorithm for Top-K elephant flows
This is based on the paper HeavyKeeper: An Accurate Algorithm for Finding Top-k Elephant Flows by Junzhi Gong, Tong Yang, Haowei Zhang, and Hao Li, Peking University; Steve Uhlig, Queen Mary, University of London; Shigang Chen, University of Florida; Lorna Uden, Staffordshire University; Xiaoming Li, Peking University
Example
See examples/basic.rs for a complete example, or examples/ip_files.rs for an example of counting top-k IP flows in a file.
Basic usage:
use TopK;
// create a new TopK with k=10, width=1000, depth=4, decay=0.9
let mut topk: = new;
// add some items
topk.add;
topk.add;
// check the counts
for node in topk.list
Variants
The crate ships three top-K sketches that share the same public API
(new / with_seed / with_hasher / builder / add / count /
query / list / merge):
| Sketch | Layout | Insert throughput on Zipf(s=1.2), 1M | Recall @ φ=0.0005 |
|---|---|---|---|
TopK |
depth independent rows × width buckets |
21.0 Melem / s | 0.942 |
BucketedTopK |
one bucket of depth cells per key |
29.0 Melem / s | 0.985 |
CuckooTopK |
per-bucket lobby + depth heavy slots, 2-bucket cuckoo |
29.8 Melem / s | 1.000 |
Numbers are from cargo bench --bench topk_vs_bucketed at K=100, width=4096, depth=4, decay=0.9 on u64 keys. Recall is from
tests/accuracy_compare.rs (paper-style heavy-hitter test, φ = 0.0005,
1 M Zipf samples).
TopK is the canonical implementation from the paper, with its
accuracy bounds. BucketedTopK and CuckooTopK are derived variants —
they don't carry the paper's row-independence accuracy bounds, but the
empirical accuracy on Zipf streams is competitive and often better.
Pick by workload:
TopK— when you want the published algorithm and its bounds.BucketedTopK— best general-purpose insert throughput; closest toTopK's cost model with a single bucket per key.CuckooTopK— best accuracy and throughput on heavy-hitter-skewed traffic (the elephant-flow use case). Each bucket has a single lobby cell with probabilistic decay plusdepthnon-decaying heavy slots; promoted items live in one of two cuckoo candidate buckets and are re-homed on collision via a kick chain (bound configurable viaCuckooBuilder::max_kicks, default 8).
All three support seedable construction, custom hashers, and merge
between compatible instances. Errors are returned via
BuilderError/MergeError enums; the infallible constructors
(new, with_seed, with_hasher) trust the caller. Bucket indexing
uses an AND-mask fast-path when width is a power of two; pick
power-of-two widths in production for the best per-add cost.
Real-world packet trace
examples/ip_files.rs runs all three sketches over a CAIDA-style trace
(27.5 M packets, 1.03 M distinct 13-byte flow keys = src IP : src port →
dst IP : dst port + protocol). Same K=1000, decay=0.95, equal cell
budgets across variants:
| Sketch | Width × depth | Throughput | hit_ratio | ARE on reported | ARE on true top-K |
|---|---|---|---|---|---|
TopK |
16384 × 2 | 14.1 Mpps | 0.9270 | 0.0050 | 0.0745 |
BucketedTopK |
8192 × 4 | 18.1 Mpps | 0.9860 | 0.0035 | 0.0129 |
CuckooTopK |
8192 × (1 + 4 heavy) | 17.0 Mpps | 0.9990 | 0.0012 | 0.0012 |
Other HeavyKeeper Implementations
| Name | Language | Github Repo |
|---|---|---|
| SegmentIO | Go | https://github.com/segmentio/topk |
| Aegis | Go | https://github.com/go-kratos/aegis/blob/main/topk/heavykeeper.go |
| Tomasz Kolaj | Go | https://github.com/migotom/heavykeeper |
| HeavyKeeper Paper | C++ | https://github.com/papergitkeeper/heavy-keeper-project |
| Jigsaw-Sketch | C++ | https://github.com/duyang92/jigsaw-sketch-paper/tree/main/CPU/HeavyKeeper |
| Redis Bloom Heavykeeper | C | https://github.com/RedisBloom/RedisBloom/blob/master/src/topk.c |
| Count-Min-Sketch | Rust | https://github.com/alecmocatta/streaming_algorithms |
| Sliding Window HeavyKeeper | Go | https://github.com/keilerkonzept/topk |
Running
Word Count Example
A word count program that demonstrates the HeavyKeeper algorithm can be found at examples/word_count.rs.
Usage
Running the basic example
Running the IPv4 example
Run the benchmarks
Benchmark the sample word count app
Test Data
For information about test data format and how to obtain or generate test data, please see data/README.md.
License
This project is dual licensed under the Apache/MIT license.