# Lightbench Design Document
## Overview
Lightbench is a transport-agnostic benchmarking library providing foundational
components for building high-performance latency and throughput benchmarks. It ships
with three ready-made benchmark patterns (request, producer/consumer, async task) and
a composable set of metrics, rate control, and output primitives.
## Architecture
```
┌─────────────────────────────────────────────────────────────────┐
│ Application │
│ (HTTP benchmark, queue benchmark, custom benchmark) │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Lightbench │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Metrics │ │ Rate │ │ Patterns │ │ Output │ │
│ │ (Stats, │ │ (Token │ │ (Request, │ │ (CSV, │ │
│ │ HDR, │ │ Bucket, │ │ Prod/Cons, │ │ Stdout) │ │
│ │ Sequence, │ │ Shared) │ │ AsyncTask) │ │ │ │
│ │ Errors) │ │ │ │ │ │ │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ │
│ ┌─────────────┐ ┌─────────────┐ │
│ │ time_sync │ │ logging │ │
│ │ (fast ns) │ │ (tracing) │ │
│ └─────────────┘ └─────────────┘ │
└─────────────────────────────────────────────────────────────────┘
```
## Design Principles
### 1. Pattern-Based
The framework ships three benchmark patterns that encode common measurement workflows.
Users supply closures for the actual work; the framework handles rate control, worker
management, progress display, and CSV export.
### 2. Lock-Free Hot Paths
The `Stats` collector uses atomic counters for sent/received counts.
First-activity timestamps use `OnceLock<Instant>` — a bare atomic load after the first call.
Histogram contention is minimized by batching: each worker accumulates latencies in a
local `Vec<u64>` (capacity 64) and calls `record_received_batch()` when full, taking
one write lock per 64 records instead of one per request.
Single-record `record_received()` uses `write().await` for correctness.
### 3. Shared or Per-Worker Rate Control
`RateController` gives each worker an independent token bucket.
`SharedRateController` (lock-free atomic CAS) pools tokens across all workers so the
total rate matches the target exactly regardless of worker count.
## Component Details
### Patterns (`patterns/`)
Three high-level benchmark patterns, each a builder that accepts user-supplied closures.
**Benchmark** (`patterns/request.rs`) — request/response pattern
- `work(|| Box::pin(async { WorkResult::success(latency_ns) }))` closure per operation
- Workers share a `SharedRateController` (total mode) or each get a `RateController` (per-worker mode)
- Rate modes: `.rate(total)` (shared pool) or `.rate_per_worker(n)` (independent)
- Use `<=0` rate for unlimited throughput
- Returns `BenchmarkResults` with `print_summary()`, `throughput()`, `p99_latency_ms()`, etc.
**ProducerConsumerBenchmark** (`patterns/producer_consumer.rs`) — queue/pipeline pattern
- Producers: rate-controlled closures returning `Ok(())` on success or `Err(reason)` on error
- Consumers: free-running closures returning `Some(latency_ns)` when an item was consumed or `None` when the queue is empty (worker yields briefly)
- Returns `ProducerConsumerResults`
**AsyncTaskBenchmark** (`patterns/async_task.rs`) — submit-and-poll pattern
- Submit workers: rate-controlled closures returning `Some(task_id: u64)` on success or `None` on failure
- Poll workers: closures returning `PollResult::Completed { latency_ns }`, `PollResult::Pending`, or `PollResult::Error(reason)`
- Returns `AsyncTaskResults`
All three patterns share common builder methods: `.workers(n)`, `.duration_secs(n)`, `.csv(path)`, `.progress(bool)`.
### Metrics (`metrics/`)
**Stats**: Central statistics collector
- Atomic counters: `sent_count`, `received_count`, `error_count`
- HDR histogram for latency (1ns – 60s range, 3-digit precision)
- Connection tracking: `connections`, `active_connections`, `connection_attempts`, `connection_failures`
- Fault injection counters: `crashes_injected`, `reconnects`, `reconnect_failures`
- Quality metrics: `duplicate_count`, `gap_count`, `head_loss` (set by clients via `SequenceTracker`)
- Batch recording: `record_received_batch(&[u64])` takes a single histogram lock for multiple records
- Interval delta computation for throughput (tracks `last_sent_count`, `last_received_count`)
- First-activity timestamps via `OnceLock<Instant>` for accurate throughput (zero-cost after first call)
**StatsSnapshot**: Immutable point-in-time capture
- All counter values
- 5 latency percentiles (p25, p50, p75, p95, p99) plus min, max, mean, stddev
- Throughput: `total_throughput()` from first-activity time; `interval_throughput()` from last snapshot
- CSV serialization: 19-column format — core counters, throughputs, 10 latency fields, 3 quality metrics
**SequenceTracker**: Per-consumer duplicate/gap detection
- `HashSet`-based seen tracking
- `record(seq)` → `bool` (true = new, false = duplicate)
- `record_batch(&[u64])` → `usize` (count of new sequences)
- `duplicate_count()`: received more than once
- `gap_count()`: missing sequences within min..=max received range
- `head_loss()`: `min_seq` value (sequences lost before first received, assuming seq starts at 0)
- No shared state—each consumer owns one
**ErrorCounter** (`metrics/errors.rs`): Thread-safe error bucketing
- Groups errors by string reason in an `Arc<Mutex<HashMap>>`
- `record(reason)` increments the count for that bucket
- `take()` drains the map for final reporting
- `print_summary(errors)` pretty-prints a sorted breakdown
### Rate Control (`rate.rs`)
**RateController**: Per-worker token bucket
- Tick rate: min(rate, 100) ticks/sec (≥10ms intervals) to amortize timer overhead
- Q24.8 fixed-point for fractional tokens
- Burst allowance: up to 10 ticks worth
- `MissedTickBehavior::Burst` for catch-up after scheduling delay
- Not `Clone` or `Sync`—one per worker task
**SharedRateController**: Lock-free shared token bucket
- Atomic CAS token management; safe to `Arc`-clone across worker tasks
- Refills tokens based on elapsed nanoseconds on each `acquire()` call
- 100ms burst allowance (`rate * 0.1` tokens max)
- `acquire()` yields to the Tokio runtime while spinning for tokens
- Use `rate <= 0` for unlimited mode (`acquire()` returns immediately)
### Time Sync (`time_sync.rs`)
Fast nanosecond timestamps:
- Caches `(Instant, unix_ns)` base on first call via `OnceLock`
- Subsequent calls: `base_unix_ns + instant.elapsed()`
- Avoids repeated `SystemTime::now()` syscalls
- `now_unix_ns_estimate() -> u64`: current UNIX time in nanoseconds
- `latency_ns(start_ns: u64) -> u64`: elapsed since a prior `now_unix_ns_estimate()` call
### Logging (`logging.rs`)
Thin wrapper around `tracing-subscriber`:
- `logging::init(level)` — initialize with a string filter (e.g., `"info"`, `"debug"`)
- `logging::init_default()` — initialize at `info` level
- Returns `FrameworkError::Logging` on double-init (safe to `.ok()` in examples)
### Output (`output.rs`)
**OutputWriter**: Async metric output
- `Csv(BufWriter<File>)`: creates parent directories, writes CSV header, flushes after each row for tail-readable output
- `Stdout`: `println`-based
Methods: `new_csv(path)`, `new_stdout()`, `write_snapshot(&StatsSnapshot)`, `flush()`
### Error (`error.rs`)
**FrameworkError**: thin `thiserror` enum
- `FrameworkError::Logging(String)`: tracing-subscriber init failure
- `FrameworkError::Io(std::io::Error)`: CSV/file I/O failure
## Extension Points
### Using the Patterns
The simplest extension is supplying a work implementation to the request pattern.
Shared resources (URLs, config) go in the struct; per-worker resources that must
not be shared (HTTP clients, connections) go in `State` and are created in `init()`:
```rust
use lightbench::{Benchmark, BenchmarkWork, WorkResult, now_unix_ns_estimate};
#[derive(Clone)]
struct MyWork { url: String }
struct MyState { client: reqwest::Client }
impl BenchmarkWork for MyWork {
type State = MyState;
async fn init(&self) -> MyState {
MyState { client: reqwest::Client::new() } // one client per worker
}
async fn work(&self, state: &mut MyState) -> WorkResult {
let start = now_unix_ns_estimate();
// ... use state.client ...
WorkResult::success(now_unix_ns_estimate() - start)
}
}
Benchmark::new()
.rate(1000.0)
.workers(4)
.duration_secs(10)
.work(MyWork { url: "http://localhost/".into() })
.run()
.await
.print_summary();
```
### Custom Output Formats
Wrap `StatsSnapshot` directly:
```rust
use lightbench::StatsSnapshot;
fn to_json(snapshot: &StatsSnapshot) -> String {
serde_json::json!({
"timestamp": snapshot.timestamp,
"sent": snapshot.sent_count,
"received": snapshot.received_count,
"throughput": snapshot.total_throughput(),
"latency": {
"p50_ms": snapshot.latency_ns_p50 as f64 / 1_000_000.0,
"p99_ms": snapshot.latency_ns_p99 as f64 / 1_000_000.0,
}
}).to_string()
}
```
### Custom Rate Patterns
Use `RateController` directly for variable-rate workloads:
```rust
use lightbench::RateController;
use std::time::Duration;
async fn variable_rate_benchmark() {
let rates = vec![100.0, 500.0, 1000.0, 500.0, 100.0];
for rate in rates {
let mut controller = RateController::new(rate);
let end = tokio::time::Instant::now() + Duration::from_secs(10);
while tokio::time::Instant::now() < end {
controller.wait_for_next().await;
// send message...
}
}
}
```
## Performance Considerations
### Hot Path Optimization
1. **Sender**: `record_sent()` is atomic increment + `OnceLock::get_or_init` (bare atomic load after first call)
2. **Receiver (batch)**: workers buffer latencies locally in a `Vec<u64>` (cap 64), flushed via `record_received_batch()` — one histogram write lock per 64 records
3. **Receiver (single)**: `record_received()` uses `write().await` (always records, never drops)
4. **Rate (shared)**: `SharedRateController::acquire()` uses atomic CAS — no mutex on the hot path
5. **Per-worker state**: Use `BenchmarkWork::State` for resources that must not be shared (e.g. HTTP clients), created once per worker via `init()`
### Memory Usage
- `Stats`: ~1KB base + HDR histogram (~200KB for default precision)
- `SequenceTracker`: O(n) where n = unique sequences seen
- `ErrorCounter`: O(k) where k = unique error reasons
### Threading Model
- `Stats` is `Send + Sync`—share via `Arc<Stats>`
- `SharedRateController` is `Send + Sync`—share via `Arc<SharedRateController>`
- `RateController` is `Send` but not `Sync`—one per worker task
- `SequenceTracker` is single-threaded (no shared state; one per consumer)
## Future Considerations
1. **Prometheus Export**: Add optional prometheus metrics endpoint
2. **Percentile Customization**: Allow configuring which percentiles to track
3. **Windowed Stats**: Rolling window statistics (last N seconds)
4. **Binary Protocol**: Compact binary snapshot format for high-frequency export