# NUMAlloc
**A blazing-fast, NUMA-aware memory allocator written in pure Rust.**
NUMAlloc is a drop-in replacement for the global allocator, purpose-built for Non-Uniform Memory Access (NUMA) machines. It pins threads and memory to NUMA nodes, routes freed objects back to their origin node, and shares huge pages incrementally -- delivering fewer remote memory accesses, fewer TLB misses, and lower latency than general-purpose allocators.
> **Note:** This project has not been tested in production. Any feedback is welcome — please use with caution.
## Features
- **Zero-cost NUMA awareness** - O(1) origin-node lookup via pointer arithmetic, no syscalls on the hot path
- **Origin-aware deallocation** - freed objects return to their origin node's freelist, eliminating remote reuse
- **Lock-free concurrency** - Treiber stacks with ABA-safe generation tags for inter-thread communication
- **Incremental huge page sharing** - threads on the same node share 2 MB transparent huge pages in 32 KB–256 KB increments
- **Memory safe** - built with Rust's ownership model; every `unsafe` block is documented with safety invariants
- **Minimal dependencies** - only `libc` for POSIX syscalls, no proc macros, no heavy frameworks
- **Drop-in `GlobalAlloc`** - a two-line switch for your global allocator
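
The ABA-safe tagged stacks behind the lock-free bullet can be sketched in miniature. This is illustrative only (`TaggedStack`, `Node`, and `pack` are made-up names, not the crate's real types): a generation counter packed into the head word makes a stale compare-and-swap fail even when a node index is popped and pushed back.

```rust
use std::sync::atomic::{AtomicU32, AtomicU64, Ordering};

const NIL: u32 = u32::MAX;

struct Node {
    value: u64,
    next: AtomicU32, // index of the next node, NIL at the bottom
}

/// Lock-free stack over a fixed arena of nodes. The head packs
/// (generation << 32) | index, so a CAS that observes a recycled
/// index still fails: the generation no longer matches (no ABA).
struct TaggedStack {
    nodes: Vec<Node>,
    head: AtomicU64,
}

fn pack(generation: u32, index: u32) -> u64 {
    ((generation as u64) << 32) | index as u64
}

impl TaggedStack {
    fn new(values: &[u64]) -> Self {
        let nodes = values
            .iter()
            .map(|&v| Node { value: v, next: AtomicU32::new(NIL) })
            .collect();
        TaggedStack { nodes, head: AtomicU64::new(pack(0, NIL)) }
    }

    fn push(&self, idx: u32) {
        loop {
            let head = self.head.load(Ordering::Acquire);
            let (generation, top) = ((head >> 32) as u32, head as u32);
            self.nodes[idx as usize].next.store(top, Ordering::Relaxed);
            // Bump the generation on every successful CAS.
            if self.head
                .compare_exchange(head, pack(generation.wrapping_add(1), idx),
                                  Ordering::Release, Ordering::Relaxed)
                .is_ok()
            {
                return;
            }
        }
    }

    fn pop(&self) -> Option<u64> {
        loop {
            let head = self.head.load(Ordering::Acquire);
            let (generation, top) = ((head >> 32) as u32, head as u32);
            if top == NIL {
                return None;
            }
            let next = self.nodes[top as usize].next.load(Ordering::Relaxed);
            if self.head
                .compare_exchange(head, pack(generation.wrapping_add(1), next),
                                  Ordering::Release, Ordering::Relaxed)
                .is_ok()
            {
                return Some(self.nodes[top as usize].value);
            }
        }
    }
}

fn main() {
    let stack = TaggedStack::new(&[10, 20, 30]);
    stack.push(0);
    stack.push(1);
    stack.push(2);
    assert_eq!(stack.pop(), Some(30)); // LIFO order
    assert_eq!(stack.pop(), Some(20));
    assert_eq!(stack.pop(), Some(10));
    assert_eq!(stack.pop(), None);
}
```

A production version would CAS on real pointers (hence 128-bit or index-based tagging); the index-into-arena form shown here is one common way to keep the tag and pointer in a single 64-bit word.
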
## Quick Start
Add to your `Cargo.toml`:
```toml
[dependencies]
numalloc = "0.1"
```
Set as global allocator:
```rust
use numalloc::NumaAlloc;
#[global_allocator]
static ALLOC: NumaAlloc = NumaAlloc::new();
fn main() {
    // All allocations now go through NUMAlloc.
    let v: Vec<u64> = vec![1, 2, 3];
    println!("{v:?}");
}
```
## Architecture
NUMAlloc uses a five-layer allocation hierarchy, from fastest to slowest:
```mermaid
graph LR
A["Thread Cache<br/>(lock-free)"] --> B["Node Heap<br/>(lock-free CAS)"]
B --> C["Region<br/>(atomic bump)"]
C --> D["Large Cache<br/>(per-thread)"]
D --> E["OS mmap"]
```
### Allocation path (small objects, <= 256 KB)
```mermaid
graph TD
A["malloc(size)"] --> B["Per-Thread Freelist"]
B -- hit --> C["return pointer<br/>(zero synchronization)"]
B -- miss --> D["Per-Node Treiber Stack"]
D -- hit --> E["batch-pop up to 64 objects,<br/>return one"]
D -- miss --> F["Region Bump Allocator"]
F --> G["carve bag (32 KB–256 KB),<br/>fill freelist, return one"]
```
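
The cascade above can be modeled in a few lines of ordinary Rust. This is a sketch, not the crate's implementation: a `Mutex`-guarded `Vec` stands in for the lock-free per-node Treiber stack, a guarded counter for the atomic bump allocator, and blocks are plain integers rather than pointers. All names (`NodeHeap`, `ThreadCache`) are illustrative.

```rust
use std::sync::Mutex;

const BATCH: usize = 64; // objects moved per refill

struct NodeHeap {
    stack: Mutex<Vec<usize>>, // stand-in for the lock-free Treiber stack
    bump: Mutex<usize>,       // stand-in for the atomic bump cursor
}

struct ThreadCache {
    freelist: Vec<usize>, // single-owner: no synchronization needed
}

impl ThreadCache {
    fn alloc(&mut self, heap: &NodeHeap) -> usize {
        // 1. Per-thread freelist: the zero-synchronization hot path.
        if let Some(block) = self.freelist.pop() {
            return block;
        }
        // 2. Per-node shared stack: batch-pop up to BATCH blocks at once.
        {
            let mut shared = heap.stack.lock().unwrap();
            let take = shared.len().min(BATCH);
            let start = shared.len() - take;
            self.freelist.extend(shared.drain(start..));
        }
        if let Some(block) = self.freelist.pop() {
            return block;
        }
        // 3. Region bump allocator: carve a fresh bag of blocks,
        //    keep all but one in the thread-local freelist.
        let mut cursor = heap.bump.lock().unwrap();
        let bag = *cursor;
        *cursor += BATCH;
        self.freelist.extend(bag + 1..bag + BATCH);
        bag
    }
}

fn main() {
    let heap = NodeHeap {
        stack: Mutex::new(vec![100, 101]),
        bump: Mutex::new(0),
    };
    let mut cache = ThreadCache { freelist: Vec::new() };
    assert_eq!(cache.alloc(&heap), 101); // batch-popped from the node stack
    assert_eq!(cache.alloc(&heap), 100); // served by the thread freelist
    assert_eq!(cache.alloc(&heap), 0);   // both empty: bump-carve a bag
    assert_eq!(cache.alloc(&heap), 63);  // rest of the bag is cached locally
}
```

The point of the structure is visible even in this toy: only level 1 runs per allocation in steady state, and levels 2 and 3 amortize their cost by moving `BATCH` objects at a time.
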
### Allocation path (large objects, > 256 KB)
```mermaid
graph TD
A["malloc(size)"] --> B["Per-Thread Large Cache"]
B -- hit --> C["reuse cached mmap region<br/>(~7.5 ns)"]
B -- miss --> D["OS mmap"]
D --> E["fresh mapping,<br/>mbind to thread's node"]
```
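
A sketch of the large-object path, assuming a simple size-keyed cache (`LargeCache` is an illustrative name, not the crate's API). A `Vec<u8>` stands in for an mmap'd region and dropping it for `munmap`; the real miss path would also `mbind` the fresh mapping to the current thread's NUMA node.

```rust
use std::collections::HashMap;

struct LargeCache {
    cached: HashMap<usize, Vec<Vec<u8>>>, // size -> reusable regions
    per_size_cap: usize,                  // eviction threshold
}

impl LargeCache {
    fn alloc(&mut self, size: usize) -> Vec<u8> {
        // Hit: hand back a cached mapping of the same size, no syscall.
        if let Some(regions) = self.cached.get_mut(&size) {
            if let Some(region) = regions.pop() {
                return region;
            }
        }
        // Miss: "mmap" a fresh region (modeled by a zeroed Vec).
        vec![0u8; size]
    }

    fn free(&mut self, region: Vec<u8>) {
        let regions = self.cached.entry(region.len()).or_default();
        if regions.len() < self.per_size_cap {
            regions.push(region); // keep for reuse
        }
        // else: drop the region, i.e. hand it back to the OS
    }
}

fn main() {
    let mut cache = LargeCache { cached: HashMap::new(), per_size_cap: 4 };
    let big = cache.alloc(512 * 1024);   // miss: fresh "mapping"
    let addr = big.as_ptr();
    cache.free(big);                     // cached, not unmapped
    let again = cache.alloc(512 * 1024); // hit: the same region comes back
    assert_eq!(again.as_ptr(), addr);
}
```

Caching whole mappings is what turns a ~microsecond `mmap`/`munmap` round trip into the nanosecond-scale reuse the diagram claims.
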
### Deallocation path
```mermaid
graph TD
A["free(ptr)"] --> B["Compute origin node<br/>(ptr - base) / region_size<br/>O(1), no syscall"]
B --> C{"Object type?"}
C -- "small, local" --> D["push to per-thread freelist<br/>(no locks)"]
C -- "small, remote" --> E["push to origin node's<br/>Treiber stack (lock-free CAS)"]
C -- "large (> 256 KB)" --> F["cache mmap region for reuse<br/>(evict old if full)"]
```
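
The origin-node computation in the diagram is plain integer arithmetic. A sketch with made-up `BASE` and `REGION_SIZE` constants (the real values come from the single mapping created at init):

```rust
// Illustrative constants: a 64 GiB sub-region per node starting at BASE.
const BASE: usize = 0x7f00_0000_0000;
const REGION_SIZE: usize = 64 << 30;

/// O(1) origin lookup: one subtraction and one division, no syscall.
fn origin_node(ptr: usize) -> usize {
    (ptr - BASE) / REGION_SIZE
}

fn main() {
    assert_eq!(origin_node(BASE + 4096), 0);        // low in node 0's slice
    assert_eq!(origin_node(BASE + (130 << 30)), 2); // 130 GiB in -> node 2
}
```

If `REGION_SIZE` is a power of two, the division compiles to a shift, which is what makes this cheap enough for every `free`.
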
### Heap layout
A single contiguous virtual region is mapped at init and divided equally among NUMA nodes. Each sub-region is bound to its physical node via `mbind`. This design enables origin-node identification through simple integer division on any pointer.
```mermaid
block-beta
columns 8
block:n0["Node 0"]:2
s0["small"]
b0["big"]
end
block:n1["Node 1"]:2
s1["small"]
b1["big"]
end
block:n2["Node 2"]:2
s2["small"]
b2["big"]
end
block:nN["Node N"]:2
sN["small"]
bN["big"]
end
bp0["bpSmall"] space bp1["bpSmall"] space bp2["bpSmall"] space bpN["bpSmall"] space
bp0 --> s0
bp1 --> s1
bp2 --> s2
bpN --> sN
```
For the full design document with Mermaid diagrams and benchmark details, see [docs/architecture_design.md](docs/architecture_design.md).
## Benchmarks
Single-threaded alloc+dealloc (steady state, lower is better):

| Size | NUMAlloc | allocator A | allocator B | allocator C |
|---|---|---|---|---|
| 8 B | **7.4 ns** | 13.8 ns | 10.9 ns | 9.2 ns |
| 64 B | **7.4 ns** | 15.4 ns | 11.1 ns | 9.4 ns |
| 1 KB | **7.4 ns** | 31.2 ns | 15.7 ns | 9.8 ns |
| 16 KB | **7.3 ns** | 31.7 ns | 19.3 ns | 23.2 ns |
| 64 KB | **7.7 ns** | 692 ns | 19.4 ns | 104 ns |
| 256 KB | **8.4 ns** | 707 ns | 956 ns | 105 ns |

Bulk alloc+free (1000 items, single-threaded, lower is better):

| Size | NUMAlloc | allocator A | allocator B | allocator C |
|---|---|---|---|---|
| 64 B | **8.2 us** | 16.8 us | 8.0 us | 17.1 us |
| 4 KB | **24.3 us** | 88.1 us | 24.4 us | 46.5 us |
| 64 KB | 57.2 us | 709 us | 43.5 us | 508 us |
| 256 KB | **37.4 us** | 704 us | 155 us | 545 us |

Multi-threaded alloc+dealloc (10,000 ops/thread, lower is better):

| Workload | NUMAlloc | allocator A | allocator B | allocator C |
|---|---|---|---|---|
| 64 B, 4 threads | **153 us** | 1.2 ms | 172 us | 158 us |
| 1 KB, 8 threads | **213 us** | 2.9 ms | 293 us | 221 us |
| 4 KB, 8 threads | **213 us** | 3.3 ms | 337 us | 245 us |

Axum HTTP server benchmark (4 threads, 100 connections, 10 s, higher is better):

| Endpoint | NUMAlloc | system | mimalloc |
|---|---|---|---|
| `/small` ~32 B | **49,124 rps** | 48,286 rps | 48,310 rps |
| `/medium` ~256 B | **49,135 rps** | 48,892 rps | 49,017 rps |
| `/large` ~16 KB | **46,600 rps** | 46,244 rps | 46,511 rps |
| `/bulk` ~64 KB | **37,558 rps** | 35,121 rps | 37,155 rps |

Memory usage during the Axum benchmark (lower is better):

| Allocator | Memory |
|---|---|
| system | **11 MB** |
| numalloc | 16 MB |
| mimalloc | 27 MB |

Run benchmarks yourself with `cargo bench` or `cd examples/axum-bench && bash bench.sh`.
## Design Principles
- **Hot path = zero synchronization.** Per-thread freelists are single-owner, no atomics, no syscalls.
- **Lock-free over locks.** Shared per-node heaps use Treiber stacks with CAS, not mutexes.
- **Batch everything.** Drain and refill operations move objects in bulk via chain-insert, amortizing CAS cost.
- **Adaptive caching.** Per-thread cache thresholds scale with object size -- small objects cache up to 2048 items, large objects cache 64.
- **Intrusive data structures.** Free blocks store their `next` pointer in the freed memory itself -- zero extra overhead.
- **Explicit safety.** Every `unsafe` block carries a `// SAFETY:` comment. `NonNull<T>` over `*mut T`. `UnsafeCell` for interior mutability.
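
The intrusive-freelist principle can be shown in isolation (a sketch; `FreeBlock` and `FreeList` are illustrative, not the crate's types): each free block's first word holds the `next` pointer, so the list occupies no memory beyond the blocks themselves.

```rust
use std::ptr::NonNull;

struct FreeBlock {
    next: Option<NonNull<FreeBlock>>, // lives inside the freed memory
}

struct FreeList {
    head: Option<NonNull<FreeBlock>>,
}

impl FreeList {
    fn push(&mut self, block: NonNull<FreeBlock>) {
        // SAFETY: the block is free, so we own it and may write its
        // header word.
        unsafe { (*block.as_ptr()).next = self.head };
        self.head = Some(block);
    }

    fn pop(&mut self) -> Option<NonNull<FreeBlock>> {
        let block = self.head?;
        // SAFETY: head was linked in by push() and is still free.
        self.head = unsafe { (*block.as_ptr()).next };
        Some(block)
    }
}

fn main() {
    let mut list = FreeList { head: None };
    // Leaked boxes stand in for freed blocks carved from a region.
    let a = NonNull::from(Box::leak(Box::new(FreeBlock { next: None })));
    let b = NonNull::from(Box::leak(Box::new(FreeBlock { next: None })));
    list.push(a);
    list.push(b);
    assert_eq!(list.pop(), Some(b)); // LIFO: last freed, first reused
    assert_eq!(list.pop(), Some(a));
    assert_eq!(list.pop(), None);
}
```

LIFO reuse is also cache-friendly: the most recently freed block is the one most likely to still be in cache when it is handed out again.
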
## When to Use NUMAlloc
**Good fit:**
- Multi-threaded server applications on NUMA hardware (2+ sockets)
- Workloads with many small allocations and high thread counts
- Environments with transparent huge pages enabled
**Not ideal for:**
- Single-threaded applications
- Heavy cross-node producer-consumer patterns
- Asymmetric or heterogeneous memory topologies
## Contributing
Contributions are welcome! Please ensure your changes pass:
```sh
cargo fmt
cargo clippy -- -D warnings
cargo test
```
## License
See [LICENSE](LICENSE) for details.
## References
Based on: Yang et al., *"NUMAlloc: A Faster NUMA Memory Allocator,"* ISMM 2023, ACM.