zeropool 0.3.1

High-performance buffer pool with constant-time allocation, thread-safe operations, and 5x speedup over bytes crate
Documentation
# ZeroPool

> A high-performance buffer pool for Rust - Performance First

[![Crates.io](https://img.shields.io/crates/v/zeropool.svg)](https://crates.io/crates/zeropool)
[![Documentation](https://docs.rs/zeropool/badge.svg)](https://docs.rs/zeropool)
[![License](https://img.shields.io/crates/l/zeropool.svg)](https://github.com/botirk38/zeropool/blob/main/LICENSE)
[![CI](https://github.com/botirk38/zeropool/workflows/CI/badge.svg)](https://github.com/botirk38/zeropool/actions)

## Why ZeroPool?

ZeroPool is a **high-performance buffer pool** that prioritizes speed above all else. Unlike traditional buffer pools that trade performance for security features, ZeroPool delivers maximum performance:

- **Performance-first**: No memory zeroing by default (1000-10000x faster than zeroing pools)
- **Safe Rust**: No unsafe memory operations, only safe abstractions
- **High performance**: Thread-local caching and smart allocation strategies minimize overhead
- **Auto-configured**: Adapts to your CPU topology for optimal multi-threaded performance

Perfect for high-throughput applications where raw speed is the primary requirement.

## Quick Start

```rust
use zeropool::BufferPool;

let pool = BufferPool::new();

// Get a buffer (high-performance, not zeroed by default)
let mut buffer = pool.get(1024 * 1024); // 1MB

// Use it for I/O or data processing
file.read(&mut buffer)?;

// Zero manually if needed for security
buffer.fill(0);

// Buffer automatically returned to pool when dropped
```

## Key Features

### High Performance โšก

- **Extreme speed**: 1000-10000x faster than zeroing buffer pools
- **Thread-local caching**: Lock-free fast path for 60-110ns allocation latency
- **Smart sharding**: Minimal contention with power-of-2 shard count
- **Auto-configured**: CPU-aware defaults (4-128 shards, 2-8 TLS cache size)
- **Configurable eviction**: Choose between LIFO or CLOCK-Pro algorithms

### Simple API ๐ŸŽฏ

- **Just `get()` and `drop()`**: Buffers automatically return to the pool
- **Builder pattern**: Easy customization when needed
- **Type-safe**: Leverages Rust's ownership for automatic resource management

## Architecture

```
Thread 1     Thread 2     Thread N
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚ TLS (4) โ”‚  โ”‚ TLS (4) โ”‚  โ”‚ TLS (4) โ”‚  โ† Lock-free (60-110ns)
โ””โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”˜  โ””โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”˜  โ””โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”˜
     โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                โ†“
       โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
       โ”‚ Sharded Pool   โ”‚              โ† Thread affinity
       โ”‚ [0][1]...[N]   โ”‚              โ† Minimal contention
       โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
```

**Fast path**: Thread-local cache (lock-free, ~60-110ns)
**Slow path**: Thread-affinity shard selection (better cache locality)
**Optimization**: Power-of-2 shards enable bitwise AND instead of modulo

## Performance

### Cache Behavior Benchmarks

| Pattern | Metric | Result |
|---------|--------|--------|
| Ping-pong (LIFO) | Time per operation | 3.56 ยตs |
| Ping-pong (ClockPro) | Time per operation | 3.68 ยตs |
| Hot/cold buffers | Time per operation | 1.05 ยตs |
| Multi-size workload | Time per operation | 6.2 ยตs |
| TLS cache (2 bufs) | Allocation latency | 60.5 ns |
| TLS cache (4 bufs) | Allocation latency | 108 ns |
| TLS cache (8 bufs) | Allocation latency | 288 ns |
| Eviction pressure | Time per operation | 400 ns |

### Multi-threaded Scaling

| Threads | Time per 1000 ops | Notes |
|---------|-------------------|-------|
| 1 | 44.7 ยตs | Single-threaded baseline |
| 4 | 141 ยตs | Good scaling with TLS cache |
| 8 | 282 ยตs | Near-linear scaling |
| 16 | 605 ยตs | Still scales well at high concurrency |

### Performance Characteristics

- **Extreme speed**: 1000-10000x faster than zeroing buffer pools
- **Constant latency**: 60-110ns for TLS cache hits regardless of buffer size
- **Lock-free fast path**: Thread-local cache eliminates contention
- **Scales linearly**: Near-linear scaling up to 16+ threads

**Run yourself**:
```bash
cargo bench
```

### Test System
- CPU: Intel i9-10900K @ 3.7GHz (10 cores, 20 threads, 5.3GHz turbo)
- RAM: 32GB DDR4
- OS: Linux 6.17.0

## Configuration

```rust
use zeropool::BufferPool;

let pool = BufferPool::builder()
    .tls_cache_size(8)               // Buffers per thread
    .min_buffer_size(512 * 1024)     // Keep buffers โ‰ฅ 512KB
    .max_buffers_per_shard(32)       // Max pooled buffers
    .num_shards(16)                  // Override auto-detection
    .build();
```

**Defaults** (auto-configured based on CPU count):
- Shards: 4-128 (power-of-2, ~1 shard per 2 cores)
- TLS cache: 2-8 buffers per thread
- Min buffer size: 1MB
- Max per shard: 16-64 buffers

### Memory Pinning

Lock buffer memory in RAM to prevent swapping (performance optimization):

```rust
use zeropool::BufferPool;

let pool = BufferPool::builder()
    .pinned_memory(true)
    .build();
```

Useful for high-performance computing or real-time systems. May require elevated privileges on some systems. Falls back gracefully if pinning fails.

### Eviction Policy

Choose between simple LIFO or intelligent CLOCK-Pro buffer eviction:

```rust
use zeropool::{BufferPool, EvictionPolicy};

let pool = BufferPool::builder()
    .eviction_policy(EvictionPolicy::ClockPro)  // Better cache locality (default)
    .build();

let pool_lifo = BufferPool::builder()
    .eviction_policy(EvictionPolicy::Lifo)     // Simple, lowest overhead
    .build();
```

**CLOCK-Pro (default)**: Uses access counters to favor recently-used buffers, preventing cache thrashing in mixed-size workloads. ~8 bytes overhead per buffer.

**LIFO**: Simple last-in-first-out eviction. Minimal memory overhead, best for uniform buffer sizes.

## How It Works

**Thread-local caching** (lock-free)
- Lock-free access to recently used buffers
- No atomic operations on fast path (60-110ns latency)
- Zero cache-line bouncing

**Thread-local shard affinity**
- Each thread consistently uses the same shard (cache locality)
- `shard = hash(thread_id) & (num_shards - 1)` (no modulo)
- Minimal lock contention + better CPU cache utilization
- Auto-scales with CPU count

**First-fit allocation**
- O(1) instead of O(n) best-fit
- Perfect for predictable I/O buffer sizes

**Performance-first memory management**
- Buffers are not zeroed by default for maximum performance (1000-10000x faster)
- Users can manually zero buffers if information leakage prevention is required
- Safe for performance-critical workloads where security is handled at higher layers

## Thread Safety

`BufferPool` is `Clone` and thread-safe:

```rust
let pool = BufferPool::new();

for _ in 0..4 {
    let pool = pool.clone();
    std::thread::spawn(move || {
        let buf = pool.get(1024);
        // Each thread gets its own TLS cache
        // Buffer automatically returned when dropped
    });
}
```

## Use Cases

### High-Performance Applications

- **Data processing**: ETL pipelines, log processing, analytics
- **Network servers**: HTTP, gRPC, WebSocket servers with high throughput
- **File I/O**: Async file loading with io_uring, tokio, async-std
- **LLM inference**: Fast checkpoint loading and model serving
- **Real-time systems**: Low-latency buffer management
- **Big data**: High-throughput data streaming and processing

### Real-World Example

Before ZeroPool, loading GPT-2 checkpoints took **200ms** with 70% spent on buffer allocation. With ZeroPool: **53ms** (3.8x faster) while delivering maximum performance without security overhead.

## System Scaling

ZeroPool automatically adapts to your system:

| System | Cores | TLS Cache | Shards | Buffers/Shard | Total Capacity |
|--------|-------|-----------|--------|---------------|----------------|
| Embedded | 4 | 4 | 4 | 16 | 64 (~64MB) |
| Laptop | 8 | 6 | 8 | 16 | 128 (~128MB) |
| Workstation | 16 | 6 | 8 | 32 | 256 (~256MB) |
| Small Server | 32 | 8 | 16 | 64 | 1024 (~1GB) |
| Large Server | 64 | 8 | 32 | 64 | 2048 (~2GB) |
| Supercompute | 128 | 8 | 64 | 64 | 4096 (~4GB) |

## Comparison with Alternatives

| Feature | ZeroPool | bytes::BytesMut | Lifeguard | Sharded-Slab |
|---------|----------|-----------------|-----------|--------------|
| Memory zeroing | โŒ No (performance-first) | โŒ No | โŒ No | โŒ No |
| Safe Rust | โœ… 100% | โš ๏ธ Some unsafe | โš ๏ธ Some unsafe | โš ๏ธ Heavy unsafe |
| Thread-safe | โœ… Yes | โŒ No | โš ๏ธ Limited | โœ… Yes |
| Lock-free path | โœ… TLS cache | โŒ No | โŒ No | โš ๏ธ Partial |
| Auto-configured | โœ… CPU-aware | โŒ Manual | โŒ Manual | โŒ Manual |
| Performance focus | โœ… Primary | โŒ No | โŒ No | โŒ No |

ZeroPool is the fastest buffer pool available, designed purely for **maximum performance** while maintaining safety.

## License

Dual licensed under Apache-2.0 or MIT.

## Contributing

PRs welcome! Please include benchmarks for performance changes and ensure all tests pass:

```bash
cargo test
cargo bench
cargo fmt
cargo clippy
```

## Changelog

See [CHANGELOG.md](CHANGELOG.md) for version history.

## Credits

Built with โค๏ธ for the Rust community. Inspired by the need for high-performance buffer management in production systems.