# NUMAlloc
**A blazing-fast, NUMA-aware memory allocator written in pure Rust.**
NUMAlloc is a drop-in replacement for the global allocator, purpose-built for Non-Uniform Memory Access (NUMA) machines. It pins threads and memory to NUMA nodes, routes freed objects back to their origin node, and shares huge pages incrementally -- delivering fewer remote memory accesses, fewer TLB misses, and lower latency than general-purpose allocators.
> **Note:** This project has not been tested in production. Any feedback is welcome — please use with caution.
## Features
- **Zero-cost NUMA awareness** - O(1) origin-node lookup via pointer arithmetic, no syscalls on the hot path
- **Origin-aware deallocation** - freed objects return to their origin node's freelist, eliminating remote reuse
- **Lock-free concurrency** - Treiber stacks with ABA-safe generation tags for inter-thread communication
- **Incremental huge page sharing** - threads on the same node share 2 MB transparent huge pages in 32 KB–256 KB increments
- **Memory safe** - built with Rust's ownership model; every `unsafe` block is documented with safety invariants
- **Minimal dependencies** - only `libc` for POSIX syscalls, no proc macros, no heavy frameworks
- **Drop-in `GlobalAlloc`** - a two-line switch for your global allocator
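
The ABA-safe tagged stacks behind the lock-free bullet can be sketched in miniature. This is illustrative only (`TaggedStack`, `Node`, and `pack` are made-up names, not the crate's real types): a generation counter packed into the head word makes a stale compare-and-swap fail even when a node index is popped and pushed back.

```rust
use std::sync::atomic::{AtomicU32, AtomicU64, Ordering};

const NIL: u32 = u32::MAX;

struct Node {
    value: u64,
    next: AtomicU32, // index of the next node, NIL at the bottom
}

/// Lock-free stack over a fixed arena of nodes. The head packs
/// (generation << 32) | index, so a CAS that observes a recycled
/// index still fails: the generation no longer matches (no ABA).
struct TaggedStack {
    nodes: Vec<Node>,
    head: AtomicU64,
}

fn pack(generation: u32, index: u32) -> u64 {
    ((generation as u64) << 32) | index as u64
}

impl TaggedStack {
    fn new(values: &[u64]) -> Self {
        let nodes = values
            .iter()
            .map(|&v| Node { value: v, next: AtomicU32::new(NIL) })
            .collect();
        TaggedStack { nodes, head: AtomicU64::new(pack(0, NIL)) }
    }

    fn push(&self, idx: u32) {
        loop {
            let head = self.head.load(Ordering::Acquire);
            let (generation, top) = ((head >> 32) as u32, head as u32);
            self.nodes[idx as usize].next.store(top, Ordering::Relaxed);
            // Bump the generation on every successful CAS.
            if self.head
                .compare_exchange(head, pack(generation.wrapping_add(1), idx),
                                  Ordering::Release, Ordering::Relaxed)
                .is_ok()
            {
                return;
            }
        }
    }

    fn pop(&self) -> Option<u64> {
        loop {
            let head = self.head.load(Ordering::Acquire);
            let (generation, top) = ((head >> 32) as u32, head as u32);
            if top == NIL {
                return None;
            }
            let next = self.nodes[top as usize].next.load(Ordering::Relaxed);
            if self.head
                .compare_exchange(head, pack(generation.wrapping_add(1), next),
                                  Ordering::Release, Ordering::Relaxed)
                .is_ok()
            {
                return Some(self.nodes[top as usize].value);
            }
        }
    }
}

fn main() {
    let stack = TaggedStack::new(&[10, 20, 30]);
    stack.push(0);
    stack.push(1);
    stack.push(2);
    assert_eq!(stack.pop(), Some(30)); // LIFO order
    assert_eq!(stack.pop(), Some(20));
    assert_eq!(stack.pop(), Some(10));
    assert_eq!(stack.pop(), None);
}
```

A production version would CAS on real pointers (hence 128-bit or index-based tagging); the index-into-arena form shown here is one common way to keep the tag and pointer in a single 64-bit word.
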
## Quick Start
Add to your `Cargo.toml`:
```toml
[dependencies]
numalloc = "0.1"
```
Set as global allocator:
```rust
use numalloc::NumaAlloc;
#[global_allocator]
static ALLOC: NumaAlloc = NumaAlloc::new();
fn main() {
    // All allocations now go through NUMAlloc.
    let v: Vec<u64> = vec![1, 2, 3];
    println!("{v:?}");
}
```
## Architecture
NUMAlloc uses a five-layer allocation hierarchy, from fastest to slowest:
```mermaid
graph LR
A["Thread Cache<br/>(lock-free)"] --> B["Node Heap<br/>(lock-free CAS)"]
B --> C["Region<br/>(atomic bump)"]
C --> D["Large Cache<br/>(per-thread)"]
D --> E["OS mmap"]
```
### Allocation path (small objects, <= 256 KB)
```mermaid
graph TD
A["malloc(size)"] --> B["Per-Thread Freelist"]
B -- hit --> C["return pointer<br/>(zero synchronization)"]
B -- miss --> D["Per-Node Treiber Stack"]
D -- hit --> E["batch-pop up to 64 objects,<br/>return one"]
D -- miss --> F["Region Bump Allocator"]
F --> G["carve bag (32 KB–256 KB),<br/>fill freelist, return one"]
```
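
The cascade above can be modeled in a few lines of ordinary Rust. This is a sketch, not the crate's implementation: a `Mutex`-guarded `Vec` stands in for the lock-free per-node Treiber stack, a guarded counter for the atomic bump allocator, and blocks are plain integers rather than pointers. All names (`NodeHeap`, `ThreadCache`) are illustrative.

```rust
use std::sync::Mutex;

const BATCH: usize = 64; // objects moved per refill

struct NodeHeap {
    stack: Mutex<Vec<usize>>, // stand-in for the lock-free Treiber stack
    bump: Mutex<usize>,       // stand-in for the atomic bump cursor
}

struct ThreadCache {
    freelist: Vec<usize>, // single-owner: no synchronization needed
}

impl ThreadCache {
    fn alloc(&mut self, heap: &NodeHeap) -> usize {
        // 1. Per-thread freelist: the zero-synchronization hot path.
        if let Some(block) = self.freelist.pop() {
            return block;
        }
        // 2. Per-node shared stack: batch-pop up to BATCH blocks at once.
        {
            let mut shared = heap.stack.lock().unwrap();
            let take = shared.len().min(BATCH);
            let start = shared.len() - take;
            self.freelist.extend(shared.drain(start..));
        }
        if let Some(block) = self.freelist.pop() {
            return block;
        }
        // 3. Region bump allocator: carve a fresh bag of blocks,
        //    keep all but one in the thread-local freelist.
        let mut cursor = heap.bump.lock().unwrap();
        let bag = *cursor;
        *cursor += BATCH;
        self.freelist.extend(bag + 1..bag + BATCH);
        bag
    }
}

fn main() {
    let heap = NodeHeap {
        stack: Mutex::new(vec![100, 101]),
        bump: Mutex::new(0),
    };
    let mut cache = ThreadCache { freelist: Vec::new() };
    assert_eq!(cache.alloc(&heap), 101); // batch-popped from the node stack
    assert_eq!(cache.alloc(&heap), 100); // served by the thread freelist
    assert_eq!(cache.alloc(&heap), 0);   // both empty: bump-carve a bag
    assert_eq!(cache.alloc(&heap), 63);  // rest of the bag is cached locally
}
```

The point of the structure is visible even in this toy: only level 1 runs per allocation in steady state, and levels 2 and 3 amortize their cost by moving `BATCH` objects at a time.
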
### Allocation path (large objects, > 256 KB)
```mermaid
graph TD
A["malloc(size)"] --> B["Per-Thread Large Cache"]
B -- hit --> C["reuse cached mmap region<br/>(~7.5 ns)"]
B -- miss --> D["OS mmap"]
D --> E["fresh mapping,<br/>mbind to thread's node"]
```
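
A sketch of the large-object path, assuming a simple size-keyed cache (`LargeCache` is an illustrative name, not the crate's API). A `Vec<u8>` stands in for an mmap'd region and dropping it for `munmap`; the real miss path would also `mbind` the fresh mapping to the current thread's NUMA node.

```rust
use std::collections::HashMap;

struct LargeCache {
    cached: HashMap<usize, Vec<Vec<u8>>>, // size -> reusable regions
    per_size_cap: usize,                  // eviction threshold
}

impl LargeCache {
    fn alloc(&mut self, size: usize) -> Vec<u8> {
        // Hit: hand back a cached mapping of the same size, no syscall.
        if let Some(regions) = self.cached.get_mut(&size) {
            if let Some(region) = regions.pop() {
                return region;
            }
        }
        // Miss: "mmap" a fresh region (modeled by a zeroed Vec).
        vec![0u8; size]
    }

    fn free(&mut self, region: Vec<u8>) {
        let regions = self.cached.entry(region.len()).or_default();
        if regions.len() < self.per_size_cap {
            regions.push(region); // keep for reuse
        }
        // else: drop the region, i.e. hand it back to the OS
    }
}

fn main() {
    let mut cache = LargeCache { cached: HashMap::new(), per_size_cap: 4 };
    let big = cache.alloc(512 * 1024);   // miss: fresh "mapping"
    let addr = big.as_ptr();
    cache.free(big);                     // cached, not unmapped
    let again = cache.alloc(512 * 1024); // hit: the same region comes back
    assert_eq!(again.as_ptr(), addr);
}
```

Caching whole mappings is what turns a ~microsecond `mmap`/`munmap` round trip into the nanosecond-scale reuse the diagram claims.
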
### Deallocation path
```mermaid
graph TD
A["free(ptr)"] --> B["Compute origin node<br/>(ptr - base) / region_size<br/>O(1), no syscall"]
B --> C{"Object type?"}
C -- "small, local" --> D["push to per-thread freelist<br/>(no locks)"]
C -- "small, remote" --> E["push to origin node's<br/>Treiber stack (lock-free CAS)"]
C -- "large (> 256 KB)" --> F["cache mmap region for reuse<br/>(evict old if full)"]
```
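
The origin-node computation in the diagram is plain integer arithmetic. A sketch with made-up `BASE` and `REGION_SIZE` constants (the real values come from the single mapping created at init):

```rust
// Illustrative constants: a 64 GiB sub-region per node starting at BASE.
const BASE: usize = 0x7f00_0000_0000;
const REGION_SIZE: usize = 64 << 30;

/// O(1) origin lookup: one subtraction and one division, no syscall.
fn origin_node(ptr: usize) -> usize {
    (ptr - BASE) / REGION_SIZE
}

fn main() {
    assert_eq!(origin_node(BASE + 4096), 0);        // low in node 0's slice
    assert_eq!(origin_node(BASE + (130 << 30)), 2); // 130 GiB in -> node 2
}
```

If `REGION_SIZE` is a power of two, the division compiles to a shift, which is what makes this cheap enough for every `free`.
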
### Heap layout
A single contiguous virtual region is mapped at init and divided equally among NUMA nodes. Each sub-region is bound to its physical node via `mbind`. This design enables origin-node identification through simple integer division on any pointer.
```mermaid
block-beta
columns 8
block:n0["Node 0"]:2
s0["small"]
b0["big"]
end
block:n1["Node 1"]:2
s1["small"]
b1["big"]
end
block:n2["Node 2"]:2
s2["small"]
b2["big"]
end
block:nN["Node N"]:2
sN["small"]
bN["big"]
end
bp0["bpSmall"] space bp1["bpSmall"] space bp2["bpSmall"] space bpN["bpSmall"] space
bp0 --> s0
bp1 --> s1
bp2 --> s2
bpN --> sN
```
For the full design document with Mermaid diagrams and benchmark details, see [docs/architecture_design.md](docs/architecture_design.md).
## Benchmarks
Single-threaded alloc+dealloc (steady state, lower is better):

| Size | NUMAlloc | allocator A | allocator B | allocator C |
|---|---|---|---|---|
| 8 B | **7.4 ns** | 13.8 ns | 10.9 ns | 9.2 ns |
| 64 B | **7.4 ns** | 15.4 ns | 11.1 ns | 9.4 ns |
| 1 KB | **7.4 ns** | 31.2 ns | 15.7 ns | 9.8 ns |
| 16 KB | **7.3 ns** | 31.7 ns | 19.3 ns | 23.2 ns |
| 64 KB | **7.7 ns** | 692 ns | 19.4 ns | 104 ns |
| 256 KB | **8.4 ns** | 707 ns | 956 ns | 105 ns |

Bulk alloc+free (1000 items, single-threaded, lower is better):

| Size | NUMAlloc | allocator A | allocator B | allocator C |
|---|---|---|---|---|
| 64 B | **8.2 us** | 16.8 us | 8.0 us | 17.1 us |
| 4 KB | **24.3 us** | 88.1 us | 24.4 us | 46.5 us |
| 64 KB | 57.2 us | 709 us | 43.5 us | 508 us |
| 256 KB | **37.4 us** | 704 us | 155 us | 545 us |

Multi-threaded alloc+dealloc (10,000 ops/thread, lower is better):

| Workload | NUMAlloc | allocator A | allocator B | allocator C |
|---|---|---|---|---|
| 64 B, 4 threads | **153 us** | 1.2 ms | 172 us | 158 us |
| 1 KB, 8 threads | **213 us** | 2.9 ms | 293 us | 221 us |
| 4 KB, 8 threads | **213 us** | 3.3 ms | 337 us | 245 us |

Axum HTTP server benchmark (4 threads, 100 connections, 10 s, higher is better):

| Endpoint | NUMAlloc | system | mimalloc |
|---|---|---|---|
| `/small` ~32 B | **49,124 rps** | 48,286 rps | 48,310 rps |
| `/medium` ~256 B | **49,135 rps** | 48,892 rps | 49,017 rps |
| `/large` ~16 KB | **46,600 rps** | 46,244 rps | 46,511 rps |
| `/bulk` ~64 KB | **37,558 rps** | 35,121 rps | 37,155 rps |

Memory usage during the Axum benchmark (lower is better):

| Allocator | Memory |
|---|---|
| system | **11 MB** |
| numalloc | 16 MB |
| mimalloc | 27 MB |

Run benchmarks yourself with `cargo bench` or `cd examples/axum-bench && bash bench.sh`.
## Design Principles
- **Hot path = zero synchronization.** Per-thread freelists are single-owner, no atomics, no syscalls.
- **Lock-free over locks.** Shared per-node heaps use Treiber stacks with CAS, not mutexes.
- **Batch everything.** Drain and refill operations move objects in bulk via chain-insert, amortizing CAS cost.
- **Adaptive caching.** Per-thread cache thresholds scale with object size -- small objects cache up to 2048 items, large objects cache 64.
- **Intrusive data structures.** Free blocks store their `next` pointer in the freed memory itself -- zero extra overhead.
- **Explicit safety.** Every `unsafe` block carries a `// SAFETY:` comment. `NonNull<T>` over `*mut T`. `UnsafeCell` for interior mutability.
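
The intrusive-freelist principle can be shown in isolation (a sketch; `FreeBlock` and `FreeList` are illustrative, not the crate's types): each free block's first word holds the `next` pointer, so the list occupies no memory beyond the blocks themselves.

```rust
use std::ptr::NonNull;

struct FreeBlock {
    next: Option<NonNull<FreeBlock>>, // lives inside the freed memory
}

struct FreeList {
    head: Option<NonNull<FreeBlock>>,
}

impl FreeList {
    fn push(&mut self, block: NonNull<FreeBlock>) {
        // SAFETY: the block is free, so we own it and may write its
        // header word.
        unsafe { (*block.as_ptr()).next = self.head };
        self.head = Some(block);
    }

    fn pop(&mut self) -> Option<NonNull<FreeBlock>> {
        let block = self.head?;
        // SAFETY: head was linked in by push() and is still free.
        self.head = unsafe { (*block.as_ptr()).next };
        Some(block)
    }
}

fn main() {
    let mut list = FreeList { head: None };
    // Leaked boxes stand in for freed blocks carved from a region.
    let a = NonNull::from(Box::leak(Box::new(FreeBlock { next: None })));
    let b = NonNull::from(Box::leak(Box::new(FreeBlock { next: None })));
    list.push(a);
    list.push(b);
    assert_eq!(list.pop(), Some(b)); // LIFO: last freed, first reused
    assert_eq!(list.pop(), Some(a));
    assert_eq!(list.pop(), None);
}
```

LIFO reuse is also cache-friendly: the most recently freed block is the one most likely to still be in cache when it is handed out again.
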
## When to Use NUMAlloc
**Good fit:**
- Multi-threaded server applications on NUMA hardware (2+ sockets)
- Workloads with many small allocations and high thread counts
- Environments with transparent huge pages enabled
**Not ideal for:**
- Single-threaded applications
- Heavy cross-node producer-consumer patterns
- Asymmetric or heterogeneous memory topologies
## Contributing
Contributions are welcome! Please ensure your changes pass:
```sh
cargo fmt
cargo clippy -- -D warnings
cargo test
```
## License
See [LICENSE](LICENSE) for details.
## References
Based on: Yang et al., *"NUMAlloc: A Faster NUMA Memory Allocator,"* ISMM 2023, ACM.