base2histogram 0.2.2

A Rust histogram library using base-2 logarithmic bucketing for fast percentile estimation
# base2histogram Algorithm

A detailed description of the algorithm used in [base2histogram](https://github.com/drmingdrmer/base2histogram).

## Problem

Track request latency distributions in real time with:
- **O(1) recording** — no sorting, no rebalancing, nothing that stalls a hot path
- **Tiny memory** — the system may run hundreds of histograms concurrently
- **Accurate percentile queries** — P50, P95, P99

## Overview

base2histogram is a fixed-size logarithmic histogram.
It maps each `u64` value to one of 252 buckets (at the default WIDTH=3) using pure integer arithmetic,
then estimates percentiles via trapezoidal density interpolation.

The design has three key ideas:

1. **Float-like encoding** for O(1) bucket indexing with no floating-point math
2. **Trapezoidal interpolation** for accurate percentile estimation with no extra storage
3. **CDF-based re-binning** for count-preserving rescale between different bucket resolutions

## Part 1: Recording — Float-Like Bucket Encoding

### Why Log-Scale Buckets

Latency follows a [log-normal distribution](https://en.wikipedia.org/wiki/Log-normal_distribution)
— a peak at lower values with a [long tail](https://en.wikipedia.org/wiki/Long_tail) to the right.
On a [log scale](https://en.wikipedia.org/wiki/Logarithmic_scale), this becomes a bell curve.

Equal-width buckets waste resolution: tiny buckets where data is sparse, huge gaps where data is dense.
Log-scale buckets — each roughly twice as wide as the previous — match the shape of the data:

```
[0,1), [1,2), [2,4), [4,8), [8,16), ...
```

Mapping a value to its bucket takes a single [leading zero count](https://en.wikipedia.org/wiki/Find_first_set#CLZ) instruction. 65 buckets cover the entire u64 range. But the resolution is too coarse — the last bucket spans half of all possible values.
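As a sketch of this coarse scheme (not the crate's actual code), the mapping really is a single `leading_zeros` call:

```rust
// Coarse log2 bucketing: one bucket per power of two.
// Bucket 0 holds only the value 0; bucket b (b >= 1) holds [2^(b-1), 2^b).
fn coarse_bucket(v: u64) -> u32 {
    if v == 0 { 0 } else { 64 - v.leading_zeros() }
}

fn main() {
    assert_eq!(coarse_bucket(0), 0);            // [0, 1)
    assert_eq!(coarse_bucket(1), 1);            // [1, 2)
    assert_eq!(coarse_bucket(7), 3);            // [4, 8)
    assert_eq!(coarse_bucket(u64::MAX), 64);    // last of the 65 buckets
    println!("ok");
}
```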

### The Float-Like Encoding

The solution: treat each bucket's lower bound as a tiny
[floating-point number](https://en.wikipedia.org/wiki/Floating-point_arithmetic).
The [MSB](https://en.wikipedia.org/wiki/Bit_numbering#Most_significant_bit) position gives the exponent
(which group of buckets), and the next `WIDTH-1` bits give the offset within that group.

With WIDTH=3 (default), a bucket boundary looks like this in binary:

```
00..00 1 xx 00..00
       |
       MSB
```

The leading `1` picks the group. The two bits that follow pick the bucket within the group.
It is effectively a 3-bit float: the MSB position supplies the exponent, and the 2 bits after it supply the fraction.

### Bucket Layout

With WIDTH=3, each group contains `2^(WIDTH-1) = 4` buckets.
The first two groups (values 0–7) are exact 1:1 mappings.
From group 2 onward, bucket widths double with each group:

```
WIDTH = 3:

range     bucket index        bucket size
[0, 1)     0  0b0 ..... 000    1
[1, 2)     1  0b0 ..... 001    1
[2, 3)     2  0b0 ..... 010    1
[3, 4)     3  0b0 ..... 011    1

[4, 5)     4  0b0 ..... 100    1
[5, 6)     5  0b0 ..... 101    1
[6, 7)     6  0b0 ..... 110    1
[7, 8)     7  0b0 ..... 111    1

[8, 10)    8  0b0 .... 1000    2
[10, 12)   9  0b0 .... 1010    2
[12, 14)  10  0b0 .... 1100    2
[14, 16)  11  0b0 .... 1110    2

[16, 20)  12  0b0 ... 10000    4
[20, 24)  13  0b0 ... 10100    4
[24, 28)  14  0b0 ... 11000    4
[28, 32)  15  0b0 ... 11100    4

[32, 40)  16  0b0 .. 100000    8
[40, 48)  17  0b0 .. 101000    8
[48, 56)  18  0b0 .. 110000    8
[56, 64)  19  0b0 .. 111000    8
```

### Bucket Index Calculation

For a value `v`:

1. If `v < GROUP_SIZE` (4 for WIDTH=3): `bucket_index = v` (direct mapping)
2. Otherwise, extract the top WIDTH bits:

```
bits_upto_msb = 64 - leading_zeros(v)
group_index   = bits_upto_msb - WIDTH
offset        = (v >> group_index) & MASK
bucket_index  = GROUP_SIZE + group_index * GROUP_SIZE + offset
```

Walk-through with value = 42:

```
value = 42 (binary: 0b101010)
  MSB position (0-indexed): 5
  group_index: 5 - (WIDTH - 1) = 3
  2 bits after MSB: 01 (from 1[01]010)
  offset in group: 1
  Bucket index: 4 + (3 × 4) + 1 = 17
```

The entire operation is `leading_zeros` + two shifts + a mask + two adds.
No floating-point, no division, no `log()`. Recording is O(1).
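The pseudocode above can be sketched in a few lines of Rust. This is a minimal illustration for WIDTH=3, not the crate's actual implementation (which adds the lookup-table fast paths):

```rust
const WIDTH: u32 = 3;
const GROUP_SIZE: u64 = 1 << (WIDTH - 1); // 4 buckets per group
const MASK: u64 = GROUP_SIZE - 1;

// Map a value to its bucket index: leading_zeros + shifts + mask + adds.
fn bucket_index(v: u64) -> u64 {
    if v < GROUP_SIZE {
        return v; // direct 1:1 mapping for the smallest values
    }
    let bits_upto_msb = 64 - v.leading_zeros() as u64;
    let group_index = bits_upto_msb - WIDTH as u64;
    let offset = (v >> group_index) & MASK;
    GROUP_SIZE + group_index * GROUP_SIZE + offset
}

fn main() {
    assert_eq!(bucket_index(42), 17); // matches the walk-through above
    assert_eq!(bucket_index(0), 0);   // direct mapping
    assert_eq!(bucket_index(7), 7);   // last width-1 bucket [7, 8)
    assert_eq!(bucket_index(8), 8);   // first width-2 bucket [8, 10)
    println!("ok");
}
```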

### Performance Optimizations

- **Small-value cache**: a precomputed lookup table for values 0–4095, returning the bucket index in a single memory access
- **Precomputed boundaries**: `bucket_min_values[252]` stores the left boundary of every bucket, computed once at static initialization

### Tuning WIDTH

WIDTH sets how many sub-buckets each group gets (`2^(WIDTH-1)`).
Groups top out at 64 (covering the full u64 range).

| WIDTH | Buckets per group | Total buckets | Memory/slot |
|-------|-------------------|---------------|-------------|
| 1     | 1                 | 65            | 520 B       |
| 2     | 2                 | 128           | 1.0 KB      |
| 3     | 4                 | 252           | 2.0 KB      |
| 4     | 8                 | 496           | 3.9 KB      |
| 5     | 16                | 976           | 7.6 KB      |
| 6     | 32                | 1,920         | 15.0 KB     |

Total bucket count: `2^(WIDTH-1) × (66 - WIDTH)`.
Memory per slot: `total_buckets × 8` bytes (u64 counters).
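The table rows follow directly from these two formulas; a quick sanity check:

```rust
// Total bucket count as a function of WIDTH: 2^(WIDTH-1) × (66 - WIDTH).
fn total_buckets(width: u64) -> u64 {
    (1 << (width - 1)) * (66 - width)
}

// Counter memory per slot: one u64 per bucket.
fn slot_bytes(width: u64) -> u64 {
    total_buckets(width) * 8
}

fn main() {
    assert_eq!(total_buckets(1), 65);
    assert_eq!(total_buckets(3), 252);
    assert_eq!(total_buckets(6), 1920);
    assert_eq!(slot_bytes(3), 2016); // ~2.0 KB
    println!("ok");
}
```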

## Part 2: Percentile Estimation — Trapezoidal Interpolation

### Locating the Target Bucket

For a percentile `p` (e.g., 0.99 for P99):

1. Compute total sample count
2. Compute target rank: `rank = ceil(total × p)`
3. Scan buckets from index 0, accumulating counts
4. When cumulative count ≥ rank, the current bucket contains the target

A naive scan is O(bucket_count) — 252 iterations at WIDTH=3. (The `region_sums` tiering described under Data Structures cuts this to ~32.)
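Steps 1–4 as a plain linear scan — a sketch over a bare `&[u64]` slice, not the crate's API:

```rust
// Locate the bucket containing percentile p via a cumulative scan.
// Returns (bucket_index, rank_within_bucket); ranks are 1-based.
fn bucket_at_percentile(buckets: &[u64], p: f64) -> Option<(usize, u64)> {
    let total: u64 = buckets.iter().sum();
    if total == 0 {
        return None;
    }
    let rank = (((total as f64) * p).ceil() as u64).max(1);
    let mut cumulative = 0u64;
    for (i, &c) in buckets.iter().enumerate() {
        if cumulative + c >= rank {
            return Some((i, rank - cumulative)); // target is in bucket i
        }
        cumulative += c;
    }
    None
}

fn main() {
    // 90 samples in bucket 0, 10 in bucket 1
    let buckets = [90u64, 10];
    assert_eq!(bucket_at_percentile(&buckets, 0.50), Some((0, 50)));
    assert_eq!(bucket_at_percentile(&buckets, 0.95), Some((1, 5)));
    println!("ok");
}
```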

### The Problem: Bucket Resolution

A bucket spans a range, not a point. The target value is somewhere inside the bucket, but where? Three approaches, from rough to precise:

**Midpoint**: return `(left + right) / 2`. A blind guess that ignores sample distribution within the bucket.

**Uniform interpolation**: assume samples are evenly spread, then interpolate linearly: `estimate = left + width × rank_in_bucket / count`. Better than midpoint — at least it uses where in the bucket the target rank falls.

**Trapezoidal interpolation** (our approach): model a density gradient using neighbor buckets. Two orders of magnitude more accurate, with zero additional storage.

Measured error on a log-normal distribution (API latency scenario), WIDTH=3, 1,000,000 samples:

| Method | P50 | P95 | P99 |
|--------|-----|-----|-----|
| midpoint | 5.018% | 7.732% | 4.861% |
| trapezoidal | 0.000% | 0.080% | 0.086% |

### How Trapezoidal Interpolation Works

Uniform interpolation treats the density inside a bucket as flat.
In reality, it is sloped — denser on the side closer to the peak of the distribution.

If we know which way the density tilts, we can swap the rectangle for a trapezoid and land much closer to the true value.

Each bucket stores only a count. The slope information comes from the **neighbors**.

#### Setup

Consider three adjacent buckets:

```
  bucket:  [i-1]      [i]        [i+1]
  range:   [x0,x1)    [x1,x2)    [x2,x3)
  width:     w0          w1          w2
  count:     c0          c1          c2
```

Compute average density in each:

```
d0 = c0 / w0       (left neighbor density)
d1 = c1 / w1       (target bucket density)
d2 = c2 / w2       (right neighbor density)
```

Midpoints of the left and right buckets:

```
m0 = (x0 + x1) / 2
m2 = (x2 + x3) / 2
```

#### Density Slope

Assume density varies linearly from `m0` to `m2`. The slope:

```
k = (d2 - d0) / (m2 - m0)
```

For the first bucket, the slope is computed from (self, right neighbor); for the last bucket, from (left neighbor, self).

#### Density Slope Clamping

A steep slope from distant neighbors can produce negative density at a bucket edge, which is physically meaningless. The clamped slope ensures non-negative density across the entire bucket width:

```
d(x) = a + k·x,  x ∈ [0, w]
```

where `a = d1 - k·w/2` is the density at the left edge.

The constraints are:
- Left edge:  `d1 - k·w/2 ≥ 0`  ⟹  `k ≤ 2·d1/w`
- Right edge: `d1 + k·w/2 ≥ 0`  ⟹  `k ≥ -2·d1/w`

If the raw slope violates either bound, it is clamped to the tightest limit.

#### Solving for the Percentile Position

Let `x ∈ [0, w]` be the offset from the bucket's left edge (where `w` is the bucket width).

The density inside the bucket is a linear function:

```
d(x) = a + k·x
```

where `a = d1 - k·w/2` is the density at the left edge, and `k` is the clamped density slope. This density is anchored so that the midpoint density equals the bucket's average density `d1` (the midpoint of a linear function always equals its average).

The CDF within the bucket:

```
C(x) = a·x + k·x²/2
```

The total area checks out: `C(w) = (d1 - k·w/2)·w + k·w²/2 = d1·w = count`, confirming the density integrates to the bucket's sample count.

To find the target position, solve `C(x) = rank` where `rank` is the 1-based rank within the bucket:

```
k/2 · x² + a · x - rank = 0
```

Solving via the quadratic formula:

```
x = (-a + √(a² + 2·k·rank)) / k
```

The final estimate:

```
value = left + floor(x),  x clamped to [0, w]
```

#### Fallback Cases

- **Single-count or width-1 bucket**: return the midpoint
- **Near-zero slope** (`|k·w| < |d1| × 10⁻⁹`): effectively uniform → `x = rank / d1`
- **Negative discriminant**: fall back to `x = rank / d1`
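Putting the slope, clamping, quadratic solve, and fallbacks together — a sketch in plain `f64` math, with neighbor midpoints and densities passed in directly (the crate derives these from bucket geometry; this is not its actual implementation):

```rust
// Trapezoidal interpolation inside one bucket, following the derivation
// above. Target bucket is [x1, x2) with count c1; (m0, d0) and (m2, d2)
// are the neighbor midpoints and densities; rank is 1-based in-bucket.
fn interpolate(
    x1: f64, x2: f64, c1: f64,
    m0: f64, d0: f64,
    m2: f64, d2: f64,
    rank: f64,
) -> f64 {
    let w = x2 - x1;
    let d1 = c1 / w; // average density of the target bucket
    // raw slope from neighbor densities, clamped so density stays >= 0
    let limit = 2.0 * d1 / w;
    let k = ((d2 - d0) / (m2 - m0)).clamp(-limit, limit);
    let a = d1 - k * w / 2.0; // density at the left edge
    // solve C(x) = k/2·x² + a·x = rank for the offset x
    let x = if (k * w).abs() < d1.abs() * 1e-9 {
        rank / d1 // near-flat density: uniform interpolation
    } else {
        let disc = a * a + 2.0 * k * rank;
        if disc < 0.0 { rank / d1 } else { (-a + disc.sqrt()) / k }
    };
    x1 + x.clamp(0.0, w)
}

fn main() {
    // flat neighbors (k = 0) degrade to uniform: rank 5 of 10 in [8, 10)
    assert_eq!(interpolate(8.0, 10.0, 10.0, 7.0, 5.0, 11.0, 5.0, 5.0), 9.0);
    // rising density: rank == count must land on the right edge
    assert_eq!(interpolate(0.0, 4.0, 8.0, -2.0, 1.0, 6.0, 3.0, 8.0), 4.0);
    println!("ok");
}
```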

## Error Bound

### Theoretical Bound

For a group with bit-length `b` (values in `[2^(b-1), 2^b)`):

- Range spans `2^(b-1)` values
- Divided into `2^(WIDTH-1)` sub-buckets
- Each bucket width = `2^(b-1) / 2^(WIDTH-1)` = `2^(b - WIDTH)`

Worst-case relative error at the smallest value in a group:

```
relative_error = bucket_width / min_value
               = 2^(b - WIDTH) / 2^(b - 1)
               = 1 / 2^(WIDTH - 1)
```

The `b` cancels out — the ratio is constant across all groups. This is the fundamental property of logarithmic bucketing: as values double, bucket width doubles too, so relative error stays constant.

If the percentile returns the **midpoint** of the bucket, the max error halves to `1 / 2^WIDTH`.

| WIDTH | Max error (full bucket) | Max error (midpoint) |
|-------|------------------------|---------------------|
| 3     | 1/4 = 25%              | 1/8 = 12.5%         |
| 4     | 1/8 = 12.5%            | 1/16 = 6.25%        |
| 5     | 1/16 = 6.25%           | 1/32 = 3.125%       |
| 6     | 1/32 = 3.125%          | 1/64 = 1.5625%      |
| 8     | 1/128 = 0.781%         | 1/256 = 0.391%      |

### Practical Accuracy

Trapezoidal interpolation reduces practical error well below the theoretical bound for smooth distributions.

Measured with 1,000,000 samples per distribution, WIDTH=3 (252 buckets, ~2.1 KB per slot):

```
                   P50      P95      P99
─────────────────────────────────────────
LN-API  (σ=0.5)  0.000%   0.080%   0.086%    API / microservice latency
LN-DB   (σ=1.0)  0.034%   0.039%   0.187%    database query latency
Expon             0.000%   0.000%   0.824%    network / IO waits
Bimodal           0.394%   0.012%   0.543%    cache hit/miss (90/10)
Pareto  (α=1.5)  0.633%   0.000%   0.231%    heavy-tailed request sizes
Uniform           0.012%   1.035%   3.706%    synthetic benchmark
Sequent           0.000%   1.011%   3.696%    adversarial worst case
```

For the latency distributions that matter most (LN-API, LN-DB), WIDTH=3 delivers sub-0.2% error on ~2 KB of memory.

## Data Structures

### `Histogram<T>`

The main struct. `T` is optional user metadata per slot (defaults to `()`).

```rust
pub struct Histogram<T = ()> {
    log_scale: &'static LogScale,
    slots: SlotQueue<T>,
    aggregate: Slot<T>,
}
```

- `log_scale`: shared reference to precomputed lookup tables
- `slots`: historical time-window slots (up to `slot_limit - 1`)
- `aggregate`: running sum of all slot buckets (`.buckets`), cached total (`.total`), and current period metadata (`.data`)

### `LogScale`

Precomputed lookup tables for O(1) bucket operations.

```rust
pub struct LogScale {
    config: LogScaleConfig,
    bucket_min_values: Vec<u64>,
    small_value_buckets: Vec<u8>,
}
```

- `config`: bucket math parameters (width, group_size, mask, total bucket count)
- `bucket_min_values[i]`: left boundary of bucket `i`
- `small_value_buckets[v]`: cached bucket index for values 0–4095

Width is a runtime parameter (1..=16). Shared static instances are initialized once via `LazyLock` and accessed through `LogScale::get(width)`.

### `LogScaleConfig`

Configuration for logarithmic bucket boundaries.

```rust
pub struct LogScaleConfig {
    width: usize,
    group_size: usize,
    mask: u64,
    buckets: usize,
    small_value_cache_size: usize,
}
```

- `width`: bit-width parameter (1..=16). Each bucket group uses `width` bits: 1 MSB + (width-1) offset bits
- `group_size`: buckets per group = `2^(width-1)`
- `mask`: bitmask for offset extraction = `group_size - 1`
- `buckets`: total bucket count = `group_size × (66 - width)`
- `small_value_cache_size`: number of cached small-value lookups (4096)

### `Slot<T>`

A bucket array with cached aggregates, used both as the aggregate
accumulator and as individual time-window snapshots.

```rust
pub struct Slot<T> {
    buckets: Vec<u64>,
    region_sums: Vec<u64>,
    total: u64,
    data: Option<T>,
}
```

- `buckets`: per-bucket sample counts
- `region_sums`: cached sum for each group of 16 buckets, enabling tiered rank lookup
- `total`: cached total sample count across all buckets
- `data`: optional user metadata (e.g., timestamp, label)

Key methods: `record_n()`, `subtract()`, `clear()`, `bucket_at_rank()`.
`bucket_at_rank()` uses a two-phase scan — first over `region_sums`
(~16 entries), then within the matching region (~16 buckets) — reducing
percentile lookup from O(bucket_count) to O(bucket_count / 16).
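A sketch of that two-phase scan over plain slices (hypothetical free function; the crate's method lives on `Slot<T>`):

```rust
const REGION: usize = 16; // buckets per region-sum entry

// Two-phase rank lookup: scan region sums first, then only the buckets
// inside the matching region. Ranks are 1-based.
fn bucket_at_rank(buckets: &[u64], region_sums: &[u64], rank: u64) -> Option<usize> {
    let mut cumulative = 0u64;
    for (r, &sum) in region_sums.iter().enumerate() {
        if cumulative + sum >= rank {
            // rank falls inside region r: scan its ~16 buckets
            let start = r * REGION;
            let end = (start + REGION).min(buckets.len());
            for (i, &c) in buckets[start..end].iter().enumerate() {
                cumulative += c;
                if cumulative >= rank {
                    return Some(start + i);
                }
            }
            return None; // region_sums inconsistent with buckets
        }
        cumulative += sum;
    }
    None
}

fn main() {
    let mut buckets = vec![0u64; 32];
    buckets[5] = 10;
    buckets[20] = 10;
    let region_sums = [10u64, 10];
    assert_eq!(bucket_at_rank(&buckets, &region_sums, 1), Some(5));
    assert_eq!(bucket_at_rank(&buckets, &region_sums, 15), Some(20));
    println!("ok");
}
```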

### `Interpolator`

Stateless interpolation engine operating on bucket counts and geometry.

```rust
pub struct Interpolator<'a> {
    log_scale: &'a LogScale,
    buckets: &'a [u64],
}
```

Independent of any particular histogram instance — constructed from a `&LogScale` and a `&[u64]` bucket slice.

### `CumulativeCount`

Incremental cursor for computing cumulative counts at monotonically increasing positions.

```rust
pub struct CumulativeCount<'a> {
    interpolator: Interpolator<'a>,
    bucket_index: usize,
    accumulated: u64,
}
```

- `bucket_index`: current scan position, advances forward only
- `accumulated`: sum of whole-bucket counts before `bucket_index`

Each `count_below(position)` call resumes from where the previous call left off, giving O(1) amortized cost when queries are monotonically increasing. Used by the [rescale algorithm](#part-3-rescale--cdf-based-re-binning) to efficiently redistribute counts across bucket boundaries.

### `BucketSpan` / `BucketRef`

Lazy references to bucket geometry. `BucketSpan` provides `left()`, `right()`, `width()`, `midpoint()` via the LogScale lookup table. `BucketRef` adds the `count` field.

## Sliding Window

The histogram supports multiple slots for time-windowed aggregation.
With `slot_limit = N`, only `N - 1` slots are physically stored.
The current (newest) period is never materialized — its counts are derived on the fly as `aggregate - Σ stored`:

```
stored slots:  [a, b, c]              (N - 1 historical periods)
aggregate:     agg = a + b + c + d    (running total across all N periods)
implicit:      d = agg - a - b - c    (current period, not stored)
```

`record()` only touches the `aggregate`; individual slot writes are avoided entirely.
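The derivation of the implicit current period, reduced to plain vectors (a sketch, not the crate's API):

```rust
// Derive the never-stored current period: aggregate minus all stored slots.
fn current_period(aggregate: &[u64], stored: &[Vec<u64>]) -> Vec<u64> {
    let mut current = aggregate.to_vec();
    for slot in stored {
        for (c, s) in current.iter_mut().zip(slot) {
            *c -= s; // aggregate always >= sum of stored slots
        }
    }
    current
}

fn main() {
    let a = vec![1u64, 0, 2];
    let b = vec![0u64, 3, 1];
    let agg = vec![2u64, 3, 4]; // a + b + implicit current period d
    assert_eq!(current_period(&agg, &[a, b]), vec![1, 0, 1]); // d
    println!("ok");
}
```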

### Slot Rotation (`advance()`)

`advance(data)` materializes the implicit current period into a stored
historical slot, then starts a new empty current period. Returns the
evicted slot's metadata (`Option<T>`) if a slot was evicted, or `None`
during warmup:

**At capacity** (queue full):
```
1. evict oldest:  agg ← agg - a          (subtract oldest from aggregate)
2. materialize:   a   ← agg - b - c      (reuse evicted Vec for current period)
3. push:          slots: [b, c, a']       (a' now holds d's counts)
```

**During warmup** (queue not full):
```
1. allocate new Vec
2. materialize:   new ← agg - Σ stored
3. push:          slots: [...existing, new]
```

The evicted slot's `Vec<u64>` allocation is reused, avoiding repeated allocation.

This gives efficient sliding-window metrics without requiring the caller to track individual values or full per-window histograms.

## Part 3: Rescale — CDF-Based Re-Binning

Rescaling converts a histogram from one WIDTH to another, preserving all slots and the total sample count.

### Problem

Different WIDTHs produce different bucket boundaries.
A source bucket `[s_left, s_right)` may straddle multiple destination buckets.
Simply moving whole counts between buckets would lose resolution — the information about where samples fall *within* a bucket matters.

### Solution: CDF Transfer

For each destination bucket `[d_left, d_right)`, estimate how many source samples fall in that range using the trapezoidal CDF:

```
dst_count[i] = CDF(d_right) - CDF(d_left)
```

where `CDF(x)` is the estimated count of samples below `x`, computed via trapezoidal interpolation over the source buckets.

### Algorithm

```
1. Create a CumulativeCount cursor over the source buckets
2. prev_cdf ← 0
3. For each destination bucket i:
     cdf_right ← cursor.count_below(dst_bucket[i].right())
     dst_count[i] ← round(cdf_right) - prev_cdf
     prev_cdf ← round(cdf_right)
```

The `CumulativeCount` cursor maintains scan state, so each `count_below()` call resumes from the previous position — O(src_buckets) total work across all destination buckets, not O(src_buckets) per destination bucket.

The `round()` preserves integer totals: since each destination count is derived from the difference of rounded CDF values, the sum of all destination counts equals the original total.
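The transfer loop, sketched with a caller-supplied CDF closure standing in for the trapezoidal estimator (a hypothetical helper, not the crate's code):

```rust
// Re-bin source counts into destination buckets via CDF differences.
// `cdf(x)` estimates the number of samples below x and must be monotone.
// Rounding the CDF values (not the differences) makes the destination
// counts telescope to the exact original total.
fn rebin(cdf: impl Fn(u64) -> f64, dst_rights: &[u64]) -> Vec<u64> {
    let mut prev = 0u64;
    let mut out = Vec::with_capacity(dst_rights.len());
    for &right in dst_rights {
        let c = cdf(right).round() as u64;
        out.push(c - prev);
        prev = c;
    }
    out
}

fn main() {
    // 12 samples uniform on [0, 6): CDF(x) = 2x, capped at 12
    let cdf = |x: u64| (2 * x).min(12) as f64;
    let dst = rebin(cdf, &[1, 2, 4, 8]);
    assert_eq!(dst, vec![2, 2, 4, 4]);
    assert_eq!(dst.iter().sum::<u64>(), 12); // total preserved exactly
    println!("ok");
}
```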

### Rescale Scope

`rescale()` re-bins every stored component independently:
- The aggregate slot (running sum of all periods)
- Each historical slot in the queue

Each rebinned slot recomputes its `total` and `region_sums` from the new
buckets via `Slot::from_buckets()`. Slot metadata (`T`) is cloned.

## Complexity

| Operation | Time | Space |
|-----------|------|-------|
| Record a value | O(1) | — |
| Percentile query | O(bucket_count / 16) | — |
| Advance slot | O(bucket_count) | — |
| Rescale | O(src_buckets + dst_buckets) per slot | O(dst_buckets) |
| Memory per slot | — | `bucket_count × 8 + region_count × 8 + 8` bytes |
| Total (WIDTH=3, 1 slot) | — | ~2.1 KB |

Percentile queries use the tiered region scan in `bucket_at_rank()`:
~16 region sums + ~16 buckets = ~32 iterations instead of 252.
Measured at ~10–15 ns per query on 1M log-normal samples.