<div align="center">
<img src="logo.png" alt="nano-hyperloglog" width="200"/>
# nano-hyperloglog
[Crates.io](https://crates.io/crates/nano-hyperloglog)
[Documentation](https://docs.rs/nano-hyperloglog)
[License](https://github.com/aovestdipaperino/nano-hyperloglog#license)
A high-performance HyperLogLog implementation in Rust for cardinality estimation with pluggable storage backends.
</div>
---
## What is HyperLogLog?
HyperLogLog is a probabilistic data structure that estimates the cardinality (number of unique elements) of a dataset. It's incredibly space-efficient: count billions of unique items using just kilobytes of memory, with typical accuracy within 2% of the true count.
**Use cases:**
- Unique visitor counting for web analytics
- Distinct user tracking across distributed systems
- Database query optimization (cardinality estimation)
- Network traffic analysis
- A/B testing metrics
## Features
- 🚀 **Fixed memory usage** - Count billions of items with ~16KB (configurable)
- 🎯 **High accuracy** - Typically within 0.8-2% of true count
- 🔀 **Mergeable** - Combine counts from multiple sources effortlessly
- 💾 **Pluggable storage** - File-based or Elasticsearch backends
- 🌐 **HTTP server** - Optional Redis-compatible REST API (PFADD/PFCOUNT/PFMERGE)
- ⚡ **Zero-copy operations** - Efficient serialization/deserialization
- 🦀 **Type-safe** - Leverage Rust's type system for compile-time guarantees
- ✅ **Well-tested** - Comprehensive test suite with edge cases
## Quick Start
Add to your `Cargo.toml`:
```toml
[dependencies]
nano-hyperloglog = "0.1"
```
### Basic Usage
```rust
use hyperloglog::HyperLogLog;
fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Create with precision 14 (16KB memory, ~0.8% error)
    let mut hll = HyperLogLog::new(14)?;

    // Add elements
    for i in 0..100_000 {
        hll.add(&i);
    }

    // Get estimated count
    let count = hll.count();
    println!("Estimated: {} (actual: 100,000)", count);
    // Output: Estimated: 99723 (actual: 100,000) - 0.28% error!

    Ok(())
}
```
### Merging Counts
Perfect for distributed systems:
```rust
use hyperloglog::HyperLogLog;
fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Three servers tracking visitors
    let mut server1 = HyperLogLog::new(14)?;
    let mut server2 = HyperLogLog::new(14)?;
    let mut server3 = HyperLogLog::new(14)?;

    // Each sees different (potentially overlapping) users
    for i in 0..5000 { server1.add(&format!("user_{}", i)); }
    for i in 2500..7500 { server2.add(&format!("user_{}", i)); }
    for i in 5000..10000 { server3.add(&format!("user_{}", i)); }

    // Merge to get total unique users
    server1.merge(&server2)?;
    server1.merge(&server3)?;

    println!("Total unique users: {}", server1.count());
    // Output: Total unique users: ~10,000

    Ok(())
}
```
### Persistent Storage
```rust
use hyperloglog::{HyperLogLog, Storage, storage::FileStorage};
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let storage = FileStorage::new("./data").await?;

    // Create and populate
    let mut hll = HyperLogLog::new(14)?;
    hll.add_str("user123");
    hll.add_str("user456");

    // Persist to disk
    storage.store("daily_visitors", &hll).await?;

    // Load later
    let loaded = storage.load("daily_visitors").await?;
    println!("Count: {}", loaded.count());

    Ok(())
}
```
## Precision and Memory Tradeoffs
| Precision | Memory | Typical Error | Use Case |
|-----------|--------|---------------|----------|
| 10 | 1 KB | ±3.25% | Quick estimates, tight memory |
| 12 | 4 KB | ±1.625% | Good balance |
| 14 | 16 KB | ±0.813% | **Default** - recommended |
| 16 | 64 KB | ±0.406% | High accuracy needed |
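The memory and error columns follow directly from the register count `m = 2^precision`: one byte per register, and the standard HyperLogLog error bound of about `1.04 / sqrt(m)`. A small illustrative sketch that reproduces the table:

```rust
// Reproduce the tradeoff table: one byte per register (matching the
// 2^precision-byte figure under "Performance") and the standard
// 1.04 / sqrt(m) relative error bound.
fn main() {
    for precision in [10u32, 12, 14, 16] {
        let m = 1u64 << precision;                 // number of registers
        let memory_kib = m / 1024;                 // ~1 byte per register
        let std_error = 1.04 / (m as f64).sqrt();  // relative standard error
        println!(
            "precision {:2}: {:3} KB, ±{:.3}%",
            precision, memory_kib, std_error * 100.0
        );
    }
}
```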
## Feature Flags
Control what gets compiled:
```toml
[dependencies]
nano-hyperloglog = { version = "0.1", features = ["elasticsearch-storage", "server"] }
```
Available features:
- `file-storage` (default) - File-based persistence
- `elasticsearch-storage` - Elasticsearch backend
- `server` - HTTP server with Redis-compatible API
- `full` - Everything
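If you only need in-memory counting, the default backend can presumably be disabled as well; a sketch assuming `file-storage` is the only default feature, as listed above:

```toml
[dependencies]
# In-memory counting only: opt out of the default file-storage backend
nano-hyperloglog = { version = "0.1", default-features = false }

# Or enable exactly the backends you need
# nano-hyperloglog = { version = "0.1", default-features = false, features = ["elasticsearch-storage"] }
```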
## HTTP Server
Run a Redis-compatible HTTP service:
```bash
cargo run --example server --features server
```
### API Endpoints
```bash
# Add elements (PFADD)
curl -X POST http://localhost:3000/pfadd/daily_visitors \
-H "Content-Type: application/json" \
-d '{"elements": ["user123", "user456", "user789"]}'
# Get count (PFCOUNT)
curl http://localhost:3000/pfcount/daily_visitors
# {"count": 3}
# Merge multiple HLLs (PFMERGE)
curl -X POST http://localhost:3000/pfmerge/all_visitors \
-H "Content-Type: application/json" \
-d '{"source_keys": ["page_home", "page_about"]}'
# Check existence
curl http://localhost:3000/exists/daily_visitors
# true
# List all keys
curl http://localhost:3000/keys
# ["daily_visitors", "all_visitors"]
```
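The same endpoints can be called from code. A minimal client sketch, assuming the `reqwest` crate (with its `blocking` and `json` features) and `serde_json` as dependencies in the caller's project - neither ships with nano-hyperloglog:

```rust
use serde_json::json;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let base = "http://localhost:3000";
    let client = reqwest::blocking::Client::new();

    // PFADD: register a few visitors
    client
        .post(format!("{}/pfadd/daily_visitors", base))
        .json(&json!({ "elements": ["user123", "user456", "user789"] }))
        .send()?
        .error_for_status()?;

    // PFCOUNT: read back the estimated cardinality
    let body: serde_json::Value = client
        .get(format!("{}/pfcount/daily_visitors", base))
        .send()?
        .json()?;
    println!("daily_visitors ~= {}", body["count"]);

    Ok(())
}
```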
### Configuration
Via environment variables:
```bash
# Storage backend
STORAGE_BACKEND=file # or "elasticsearch"
FILE_STORAGE_PATH=./data # for file backend
ELASTICSEARCH_URL=http://localhost:9200
ELASTICSEARCH_INDEX=hyperloglog
# Server
BIND_ADDRESS=0.0.0.0:3000
cargo run --example server --features server
```
## Examples
Run examples to see it in action:
```bash
# Basic counting
cargo run --example basic_usage
# Merging from multiple sources
cargo run --example merging
# File storage
cargo run --example file_storage --features file-storage
# Compare precision values
cargo run --example precision_comparison
# HTTP server
cargo run --example server --features server
```
## How It Works
HyperLogLog uses a clever trick based on probability theory:
1. **Hash each element** to get a uniform bit pattern
2. **Count leading zeros** in the hash - long runs of leading zeros are unlikely, so observing one is evidence of many distinct items
3. **Partition** into multiple "registers" to reduce variance
4. **Estimate cardinality** from the average leading zero count
For technical details, see [the algorithm explanation](https://en.wikipedia.org/wiki/HyperLogLog).
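To make steps 1-3 concrete, here is a minimal, illustrative register update (a sketch of the standard algorithm, not necessarily this crate's internals), assuming a 64-bit hash and one byte per register:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Update one of the 2^p registers for `item`: the top p bits of the hash
/// select the register, and the register keeps the largest "rank" (position
/// of the first 1-bit in the remaining bits) observed so far.
fn update<T: Hash>(registers: &mut [u8], p: u32, item: &T) {
    let mut hasher = DefaultHasher::new();
    item.hash(&mut hasher);
    let hash = hasher.finish();

    let index = (hash >> (64 - p)) as usize;              // register index from top p bits
    let rest = hash << p;                                  // remaining 64 - p bits
    let rank = (rest.leading_zeros() + 1).min(64 - p + 1); // first 1-bit position, capped
    registers[index] = registers[index].max(rank as u8);
}
```

Step 4 then takes a bias-corrected harmonic mean of the registers, roughly `alpha_m * m^2 / sum_j 2^(-register[j])`.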
## Performance
Benchmarks on M1 MacBook Pro:
| Operation | Latency | Throughput |
|-----------|---------|------------|
| Add element | ~15 ns | 66M ops/sec |
| Count (precision 14) | ~8 μs | 125K ops/sec |
| Merge (16KB each) | ~12 μs | 83K ops/sec |
| Serialize to JSON | ~50 μs | 20K ops/sec |
Memory usage is fixed: `2^precision` bytes (16KB for precision 14).
## Why Rust?
- **Zero-cost abstractions** - Fast as C, safe as high-level languages
- **Fearless concurrency** - Share HyperLogLogs across threads safely (see the sketch after this list)
- **No GC pauses** - Predictable latency for real-time systems
- **Small binaries** - Server example compiles to ~8MB
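As one illustration of the concurrency point, a sketch that builds per-thread sketches and merges them at the end, using only the `new`/`add`/`merge`/`count` calls shown earlier (and assuming `HyperLogLog` is `Send`):

```rust
use std::thread;
use hyperloglog::HyperLogLog;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Each thread fills its own HyperLogLog - no locks on the hot path.
    let handles: Vec<_> = (0..4)
        .map(|t| {
            thread::spawn(move || {
                let mut hll = HyperLogLog::new(14).expect("valid precision");
                for i in 0..25_000u32 {
                    hll.add(&format!("user_{}_{}", t, i));
                }
                hll
            })
        })
        .collect();

    // Merge the per-thread sketches into one total.
    let mut total = HyperLogLog::new(14)?;
    for handle in handles {
        total.merge(&handle.join().expect("worker thread panicked"))?;
    }
    println!("Unique items across threads: ~{}", total.count());

    Ok(())
}
```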
## Comparison with Redis
| Feature | nano-hyperloglog | Redis |
|---------|------------------|-------|
| Configurable precision | ✅ 4-16 bits | ❌ Fixed at 14 bits |
| Persistent storage | ✅ File/ES | ✅ RDB/AOF |
| HTTP API | ✅ Optional | ❌ |
| Type-safe API | ✅ | ❌ |
| Startup time | < 10 ms | ~100 ms |
| Memory footprint | ~8 MB binary | ~10 MB process |
**When to use each:**
- **Redis**: Already using Redis, need sub-millisecond latency, want battle-tested stability
- **nano-hyperloglog**: Want dedicated service, need custom storage, building in Rust, serverless deployments
## Contributing
Contributions welcome! Please:
1. Add tests for new features
2. Run `cargo fmt` and `cargo clippy`
3. Update documentation
## License
Licensed under either of:
- Apache License, Version 2.0 ([LICENSE-APACHE](LICENSE-APACHE) or http://www.apache.org/licenses/LICENSE-2.0)
- MIT license ([LICENSE-MIT](LICENSE-MIT) or http://opensource.org/licenses/MIT)
at your option.
## References
- [Original HyperLogLog paper](http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf) by Flajolet et al. (2007)
- [HyperLogLog++ improvements](https://research.google/pubs/pub40671/) by Google (2013)
- [Redis HyperLogLog implementation](https://redis.io/commands/pfcount/)
## Acknowledgments
Inspired by Redis's PFCOUNT and the excellent work of Philippe Flajolet on probabilistic counting algorithms.