
ZeroPool

A high-performance buffer pool for Rust with constant-time allocations


Why ZeroPool?

Loading GPT-2 checkpoints took 200ms. Profiling showed 70% was buffer allocation, not I/O. With ZeroPool: 53ms (3.8x faster).

ZeroPool provides constant ~15ns allocation time regardless of buffer size, thread-safe operations, and near-linear multi-threaded scaling.

Quick Start

use std::fs::File;
use std::io::Read;
use zeropool::BufferPool;

let pool = BufferPool::new();

// Get a buffer
let mut buffer = pool.get(1024 * 1024); // 1MB

// Use it (the path is a placeholder)
let mut file = File::open("data.bin")?;
file.read(&mut buffer)?;

// Return it to the pool for reuse
pool.put(buffer);

Performance Highlights

Metric                        Result
Allocation latency            14.6ns (constant, any size)
vs bytes (1MB)                2.6x faster
Multi-threaded (8 threads)    32 TiB/s
Multi-threaded speedup        2.1x faster than previous version
Real workload speedup         3.8x (GPT-2 loading)

Key Features

Constant-time allocation - ~15ns whether you need 1KB or 1MB

Thread-safe - Only 1ns slower than single-threaded pools, but actually concurrent

Auto-configured - Adapts to your CPU topology (4-core → 4 shards, 20-core → 32 shards)

Simple API - Just get() and put()

Architecture

Thread 1     Thread 2     Thread N
┌─────────┐  ┌─────────┐  ┌─────────┐
│ TLS (4) │  │ TLS (4) │  │ TLS (4) │  ← 75% hit rate
└────┬────┘  └────┬────┘  └────┬────┘
     └──────────┬─────────────┘
                ↓
       ┌────────────────┐
       │ Sharded Pool   │              ← Power-of-2 shards
       │ [0][1]...[N]   │              ← Minimal contention
       └────────────────┘

Fast path: thread-local cache (lock-free, ~15ns).
Slow path: thread-affinity shard selection (better cache locality).
Optimization: power-of-2 shard counts allow a bitwise AND instead of a modulo.
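
Conceptually, get() tries the thread-local cache first and only then touches a shard. A minimal sketch of that two-level lookup, with illustrative names (TLS_CACHE and the single-Mutex shard here are assumptions for the sketch, not ZeroPool's actual internals):

use std::cell::RefCell;
use std::sync::Mutex;

thread_local! {
    // Fast path: a small per-thread stack of recycled buffers (no locks, no atomics).
    static TLS_CACHE: RefCell<Vec<Vec<u8>>> = RefCell::new(Vec::new());
}

fn get(shard: &Mutex<Vec<Vec<u8>>>, size: usize) -> Vec<u8> {
    // 1. Lock-free fast path: take the top cached buffer if it is large enough.
    let hit = TLS_CACHE.with(|cache| {
        let mut cache = cache.borrow_mut();
        if cache.last().map_or(false, |b| b.capacity() >= size) {
            cache.pop()
        } else {
            None
        }
    });
    if let Some(buf) = hit {
        return buf;
    }
    // 2. Slow path: lock this thread's shard and take the first fitting buffer;
    //    on a full miss, allocate fresh.
    let mut pooled = shard.lock().unwrap();
    match pooled.iter().position(|b| b.capacity() >= size) {
        Some(idx) => pooled.swap_remove(idx),
        None => Vec::with_capacity(size),
    }
}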

Configuration

use zeropool::BufferPool;

let pool = BufferPool::builder()
    .tls_cache_size(8)               // Buffers per thread
    .min_buffer_size(512 * 1024)     // Keep buffers ≥ 512KB
    .max_buffers_per_shard(32)       // Max pooled buffers
    .num_shards(16)                  // Override auto-detection
    .build();

Defaults (auto-configured based on CPU count):

  • Shards: 4-128 (power-of-2, ~1 shard per 2 cores)
  • TLS cache: 2-8 buffers per thread
  • Min buffer size: 1MB
  • Max per shard: 16-64 buffers
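
One way such defaults could be derived from the CPU count (an assumption that reproduces the examples in Key Features, not ZeroPool's verbatim logic):

fn default_num_shards() -> usize {
    // Round the logical CPU count up to a power of two and clamp to the
    // documented 4..=128 range (4 CPUs -> 4 shards, 20 CPUs -> 32 shards).
    let cpus = std::thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1);
    cpus.next_power_of_two().clamp(4, 128)
}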

Memory Pinning

Lock buffer memory in RAM to prevent swapping:

use zeropool::BufferPool;

let pool = BufferPool::builder()
    .pinned_memory(true)
    .build();

Useful for high-performance computing, security-sensitive data, or real-time systems. May require elevated privileges on some systems. Falls back gracefully if pinning fails.
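
On Unix-like systems this kind of pinning is typically built on mlock(2). A hedged sketch of the pin-then-fall-back pattern using the libc crate (not ZeroPool's actual code):

fn try_pin(buf: &[u8]) -> bool {
    // SAFETY: the pointer and length come from a live slice.
    unsafe { libc::mlock(buf.as_ptr().cast(), buf.len()) == 0 }
}

let buffer = vec![0u8; 1024 * 1024];
if !try_pin(&buffer) {
    // e.g. RLIMIT_MEMLOCK exceeded or missing privileges: continue unpinned.
}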

Benchmarks

Single-Threaded (allocation + deallocation)

Size   ZeroPool   No Pool   Lifeguard   Sharded-Slab   Bytes
1KB    14.9ns     7.7ns     7.0ns       50.0ns         8.4ns
64KB   14.6ns     46.3ns    6.9ns       88.3ns         49.1ns
1MB    14.6ns     37.7ns    6.9ns       80.2ns         39.4ns

Constant latency across all sizes. 2.6x faster than bytes for large buffers (1MB).

Multi-Threaded (throughput)

Threads   ZeroPool     No Pool     Sharded-Slab   Speedup vs Previous
2         14.2 TiB/s   2.7 TiB/s   1.3 TiB/s      1.38x
4         25.0 TiB/s   5.1 TiB/s   2.6 TiB/s      1.34x
8         32.0 TiB/s   7.7 TiB/s   3.9 TiB/s      1.14x

Near-linear scaling with thread-local shard affinity. 8.2x faster than sharded-slab at 8 threads.

Real-World Pattern

Buffer reuse (1MB buffer, 1000 get/put cycles):

  • ZeroPool: 6.1µs total (6.1ns/op)
  • No pool: 600ns/op
  • Lifeguard: 172ns/op

~98x faster than no pooling for realistic buffer reuse patterns.
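
This pattern is just a tight get/put loop over one size; a sketch (assuming, per the Safety Guarantees section, that get() returns a buffer of the requested length):

use zeropool::BufferPool;

let pool = BufferPool::new();
for _ in 0..1000 {
    let mut buf = pool.get(1024 * 1024); // TLS cache hit after the first iteration
    buf[0] = 1; // touch the buffer so the loop does real work
    pool.put(buf);
}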

Run yourself:

cargo bench

Test System

  • CPU: Intel i9-10900K @ 3.7GHz (10 cores, 20 threads, 5.3GHz turbo)
  • RAM: 32GB DDR4
  • OS: Linux 6.14.0

How It Works

Thread-local caching (75% hit rate)

  • Lock-free access to recently used buffers
  • No atomic operations on fast path
  • Zero cache-line bouncing

Thread-local shard affinity

  • Each thread consistently uses the same shard (cache locality)
  • shard = hash(thread_id) & (num_shards - 1) (no modulo; sketched below)
  • Minimal lock contention + better CPU cache utilization
  • Auto-scales with CPU count
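
A sketch of that index computation (DefaultHasher stands in for whatever hash ZeroPool actually uses):

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

fn shard_index(num_shards: usize) -> usize {
    debug_assert!(num_shards.is_power_of_two());
    let mut hasher = DefaultHasher::new();
    std::thread::current().id().hash(&mut hasher);
    // Power-of-2 shard count: the mask replaces a slower modulo.
    (hasher.finish() as usize) & (num_shards - 1)
}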

First-fit allocation

  • O(1) instead of O(n) best-fit
  • Perfect for predictable I/O buffer sizes

Zero-fill skip

  • unsafe set_len() guarded by an explicit capacity check (sketched below)
  • I/O immediately overwrites buffers
  • ~40% of the speedup vs Vec::with_capacity
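
In context, the fresh-allocation path looks roughly like this (an illustrative sketch, not the crate's actual code):

fn fresh_buffer(size: usize) -> Vec<u8> {
    let mut buf = Vec::with_capacity(size);
    // Capacity is exactly `size`, and callers overwrite the contents
    // before reading them, so the zero-fill can be skipped.
    unsafe { buf.set_len(size); }
    buf
}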

Thread Safety

BufferPool is Clone and thread-safe:

let pool = BufferPool::new();

let handles: Vec<_> = (0..4)
    .map(|_| {
        let pool = pool.clone();
        std::thread::spawn(move || {
            let buf = pool.get(1024);
            // Each thread gets its own TLS cache
            pool.put(buf);
        })
    })
    .collect();

for handle in handles {
    handle.join().unwrap();
}

Safety Guarantees

Uses unsafe internally to skip zero-fills:

// SAFE: capacity checked before set_len()
if buffer.capacity() >= size {
    unsafe { buffer.set_len(size); }
}

Why this is safe:

  • Capacity verified before length modification
  • Buffers immediately overwritten by I/O operations
  • No reads before writes in typical I/O patterns

All unsafe code is documented and audited.
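
The practical rule for callers: treat a freshly returned buffer as uninitialized and write it before reading. For example (the file path is a placeholder; read_exact fills the buffer before its contents are ever read):

use std::fs::File;
use std::io::Read;
use zeropool::BufferPool;

let pool = BufferPool::new();
let mut buf = pool.get(4096);
// Contents are uninitialized until overwritten: fill before any read.
File::open("data.bin")?.read_exact(&mut buf)?;
pool.put(buf);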

Use Cases

High-performance I/O - io_uring, async file loading, network buffers

LLM inference - Fast checkpoint loading (real-world: 200ms → 53ms for GPT-2)

Data processing - Predictable buffer sizes, high allocation rate

Multi-threaded servers - Concurrent buffer allocation without contention

License

Dual licensed under Apache-2.0 or MIT.

Contributing

PRs welcome! Please include benchmarks for performance changes.