# NUMAlloc: Architecture Design Document
## 1. Overview
NUMAlloc is a replacement memory allocator purpose-built for Non-Uniform Memory Access (NUMA) machines. It intercepts standard `malloc`/`free` calls via `LD_PRELOAD`, so applications need zero source-code changes. The allocator is built around three core ideas:
1. **Binding-based memory management** — threads and memory regions are pinned to NUMA nodes at startup.
2. **Origin-aware allocation and deallocation** — every heap object carries an implicit "origin node," and freed objects always return to their origin node's freelist.
3. **Incremental sharing of huge pages** — threads on the same node share transparent huge pages (THP) in small increments, combining TLB-miss reduction with low memory waste.
The result: 15.7% faster than mimalloc, 20.9% faster than glibc, with 9× fewer remote memory accesses and 18× fewer TLB misses on average across 25 benchmarks.
---
## 2. The Problem: Why NUMA Needs a Dedicated Allocator
### What is NUMA?
In a NUMA machine, each CPU socket (node) has its own local memory. Accessing local memory is fast; accessing another node's memory ("remote access") is 2–3× slower. A typical server might have 2–8 nodes, each with 16–32 cores.
### Why existing allocators fail on NUMA
Traditional allocators (glibc, TCMalloc, jemalloc, mimalloc) were designed for uniform memory. They cause four categories of NUMA performance problems:
| Problem | Root cause | Consequence |
|---|---|---|
| **Expensive node lookups** | Checking which node a thread runs on requires a syscall (~10,000 cycles) | Allocators skip the check and often guess wrong |
| **Blind deallocation** | Freed objects go to the deallocating thread's cache regardless of which node the memory belongs to | Reuse of freed memory triggers remote accesses |
| **Thread migration** | The OS can move a thread to a different node mid-execution | All of a thread's "local" memory instantly becomes remote |
| **No huge page coordination** | Allocators either ignore THP or give each thread a private 2 MB superblock | Wasted memory or missed TLB optimization |
NUMAlloc tackles all four.
---
## 3. High-Level Architecture
The system is composed of four layers, from hardware up to the application:
```mermaid
graph TB
subgraph Application Layer
A[Application Code<br/>malloc / free / realloc]
end
subgraph NUMAlloc Interceptor
B[LD_PRELOAD Interception<br/>Redirects all alloc/dealloc calls]
end
subgraph NUMAlloc Core
C[Thread Binding Manager]
D[Origin-Aware Allocator]
E[Incremental Sharing Engine]
F[Object Movement Engine]
end
subgraph OS / Hardware
G[mbind / set_mempolicy syscalls]
H[NUMA Hardware Topology<br/>Nodes, Cores, Interconnects]
end
A --> B
B --> C
B --> D
B --> E
C --> G
D --> F
E --> G
G --> H
```
---
## 4. Heap Layout
NUMAlloc's heap is a single large contiguous virtual memory region, divided equally among all NUMA nodes. This layout is the foundation for every other design decision.
```mermaid
graph LR
subgraph "Contiguous Virtual Memory Region"
subgraph Node_1["Node 1 Region"]
S1[Small Objects<br/>bpSmall pointer]
B1[Big Objects<br/>bpBig pointer]
end
subgraph Node_2["Node 2 Region"]
S2[Small Objects<br/>bpSmall pointer]
B2[Big Objects<br/>bpBig pointer]
end
subgraph Node_N["Node N Region"]
SN[Small Objects<br/>bpSmall pointer]
BN[Big Objects<br/>bpBig pointer]
end
end
```
### Key properties
- **Each region is bound** to its corresponding physical node via the `mbind` syscall. Region 1 → Node 1, Region 2 → Node 2, etc.
- **Node lookup is a division**: given a pointer `p`, its origin node is `(p - heap_base) / region_size`. No syscall needed (see the sketch after this list).
- **Small objects** (up to 256 KB) are managed in BiBOP-style bags (32 KB for objects up to 16 KB; larger classes use a bag equal to the object size), each holding objects of one size class.
- **Big objects** (> 256 KB) are backed by individual `mmap` regions with a per-thread large object cache to avoid repeated syscalls.
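A minimal sketch of that lookup in C, with `heap_base` and `region_size` as illustrative globals fixed at initialization:

```c
#include <stddef.h>
#include <stdint.h>

static uintptr_t heap_base;    /* start of the contiguous heap (set at init) */
static size_t    region_size;  /* bytes in each per-node region (set at init) */

/* Origin node of any heap pointer: one subtraction and one division,
 * zero syscalls. */
static inline int origin_node(const void *p) {
    return (int)(((uintptr_t)p - heap_base) / region_size);
}
```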
### Per-Thread and Per-Node Heaps
Each thread gets a private per-thread heap (no locks needed). Each node has a shared per-node heap (lock-protected, but accessed less frequently).
```mermaid
graph TD
subgraph "Node 1"
direction TB
T1["Thread 0<br/>PerThreadHeap"]
T2["Thread 2<br/>PerThreadHeap"]
T3["Thread 4<br/>PerThreadHeap"]
NH1["PerNodeHeap<br/>(shared, lock-protected)"]
UR1["Unallocated Region<br/>bpSmall / bpBig"]
end
subgraph "Node 2"
direction TB
T4["Thread 1<br/>PerThreadHeap"]
T5["Thread 3<br/>PerThreadHeap"]
T6["Thread 5<br/>PerThreadHeap"]
NH2["PerNodeHeap<br/>(shared, lock-protected)"]
UR2["Unallocated Region<br/>bpSmall / bpBig"]
end
T1 -->|overflow| NH1
T2 -->|overflow| NH1
T3 -->|overflow| NH1
NH1 -->|refill| UR1
T4 -->|overflow| NH2
T5 -->|overflow| NH2
T6 -->|overflow| NH2
NH2 -->|refill| UR2
```
Each PerThreadHeap has one freelist per size class. When a per-thread freelist overflows (too many freed objects), it drains to the per-node heap. When it's empty, it refills from the per-node heap or from unallocated memory.
---
## 5. Core Component: Binding-Based Memory Management
### Thread Binding
At thread creation, NUMAlloc intercepts `pthread_create` and pins the new thread to a specific NUMA node (not a specific core — the OS can still schedule within the node).
```mermaid
flowchart TD
START([Thread Created]) --> INTERCEPT[NUMAlloc intercepts<br/>pthread_create]
INTERCEPT --> POLICY{Binding policy?}
POLICY -->|Node-interleaved<br/>default| INTERLEAVE["Bind to node = thread_id % num_nodes<br/>Threads spread evenly across nodes"]
POLICY -->|Node-saturate| SATURATE["Fill node 0 first,<br/>then node 1, etc."]
POLICY -->|Custom| CUSTOM["Read binding from<br/>config file"]
INTERLEAVE --> BOUND[Thread bound to node<br/>via sched_setaffinity]
SATURATE --> BOUND
CUSTOM --> BOUND
BOUND --> RUN([Thread runs<br/>always on its assigned node])
```
**Node-interleaved binding** (default): Thread 0 → Node 0, Thread 1 → Node 1, Thread 2 → Node 0, Thread 3 → Node 1, ... This distributes work evenly. Benchmarks show it is ~19% faster than node-saturate when all cores are utilized.
**Node-saturate binding**: Fill one node completely before moving to the next. Better when using fewer cores than available.
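A hedged sketch of the default policy using libnuma and `pthread_setaffinity_np`; `bind_to_node` and `my_node` are illustrative names, not NUMAlloc's actual API:

```c
#define _GNU_SOURCE
#include <numa.h>      /* link with -lnuma */
#include <pthread.h>
#include <sched.h>

static __thread int my_node = -1;   /* cached node: no getcpu() needed later */

static void bind_to_node(int thread_id) {
    int num_nodes = numa_num_configured_nodes();
    my_node = thread_id % num_nodes;        /* node-interleaved policy */

    /* The affinity mask covers every core of the node, not one core,
     * so the OS can still schedule freely within the node. */
    struct bitmask *cpus = numa_allocate_cpumask();
    numa_node_to_cpus(my_node, cpus);

    cpu_set_t set;
    CPU_ZERO(&set);
    for (unsigned i = 0; i < cpus->size; i++)
        if (numa_bitmask_isbitset(cpus, i))
            CPU_SET(i, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    numa_free_cpumask(cpus);
}
```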
### Benefits of binding
Once a thread is bound, NUMAlloc knows its node from a thread-local variable — no expensive `getcpu()` / `get_mempolicy()` syscalls. This knowledge powers everything else.
---
## 6. Core Component: Origin-Aware Memory Management
This is the mechanism that eliminates remote accesses during both allocation and deallocation.
### Allocation Path
```mermaid
flowchart TD
REQ([malloc request]) --> SIZE{Object size?}
SIZE -->|"Small (≤ 256KB)"| PTH["Check PerThreadHeap<br/>freelist for this size class"]
SIZE -->|"Big (> 256KB)"| BIG["Check PerThreadHeap<br/>large object cache"]
PTH --> AVAIL1{Objects<br/>available?}
AVAIL1 -->|Yes| RETURN1([Return object<br/>from thread cache])
AVAIL1 -->|No| PNH["Check PerNodeHeap<br/>freelist for this size class"]
PNH --> AVAIL2{Objects<br/>available?}
AVAIL2 -->|Yes| RETURN2([Move batch to<br/>PerThreadHeap, return one])
AVAIL2 -->|No| FRESH["Allocate from unallocated<br/>region via bpSmall pointer"]
FRESH --> RETURN3([Return fresh memory<br/>from local node])
BIG --> AVAIL3{Objects<br/>available?}
AVAIL3 -->|Yes| RETURN4([Return big object])
AVAIL3 -->|No| FRESH_BIG["Allocate from unallocated<br/>region via bpBig pointer"]
FRESH_BIG --> RETURN5([Return fresh big object])
```
**Every step allocates from the current thread's node.** The per-thread freelist holds local objects. The per-node freelist holds local objects returned by origin-aware deallocation. The unallocated region is pre-bound to the local node.
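The flowchart compresses to a few lines of C. In this hedged sketch every helper (`size_class`, `thread_freelist_pop`, `node_refill_batch`, `bump_alloc_bag`) is illustrative shorthand for the steps above, not NUMAlloc's real internals:

```c
#include <stddef.h>

extern __thread int my_node;                 /* cached at thread-binding time */

int   size_class(size_t size);               /* size -> class index */
void *thread_freelist_pop(int cls);          /* per-thread list, no lock */
int   node_refill_batch(int node, int cls);  /* move a batch under the node lock */
void *bump_alloc_bag(int node, int cls);     /* fresh bag via bpSmall */

/* Small-object malloc: all three fallbacks stay on the caller's node. */
void *alloc_small(size_t size) {
    int   cls = size_class(size);
    void *p   = thread_freelist_pop(cls);    /* 1. per-thread freelist */
    if (p) return p;
    if (node_refill_batch(my_node, cls))     /* 2. per-node freelist   */
        return thread_freelist_pop(cls);
    return bump_alloc_bag(my_node, cls);     /* 3. unallocated region  */
}
```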
### Deallocation Path (Origin-Aware)
This is what makes NUMAlloc unique. When `free(ptr)` is called, NUMAlloc computes the origin node of `ptr` and compares it to the calling thread's node.
```mermaid
flowchart TD
FREE([free ptr]) --> COMPUTE["Compute origin node<br/>node = offset / region_size"]
COMPUTE --> COMPARE{Origin node ==<br/>thread's node?}
COMPARE -->|"Yes — local object"| LOCAL_SIZE{Object size?}
LOCAL_SIZE -->|Small| LOCAL_THREAD["Return to PerThreadHeap<br/>freelist (no lock needed)"]
LOCAL_SIZE -->|Big| LOCAL_NODE["Cache in PerThreadHeap<br/>large object cache"]
COMPARE -->|"No — remote object"| REMOTE["Return to ORIGIN node's<br/>PerNodeHeap freelist<br/>(lock required)"]
LOCAL_THREAD --> DONE([Done])
LOCAL_NODE --> DONE
REMOTE --> DONE
```
**Why this matters**: Traditional allocators put freed objects into the deallocating thread's cache. If Thread A (on Node 0) frees memory that was allocated on Node 1, the object stays in Thread A's cache. Next time Thread A reuses it, every access is a remote access. NUMAlloc avoids this by sending the object back to Node 1's freelist.
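A minimal sketch of this free path; all helpers are hypothetical stand-ins for the structures described above:

```c
#include <pthread.h>
#include <stdbool.h>

extern __thread int my_node;                  /* set when the thread is bound */

int  origin_node(const void *p);              /* (p - heap_base) / region_size */
bool is_small(const void *p);                 /* small size class (<= 256 KB)? */
void thread_freelist_push(void *p);           /* per-thread list, no lock */
void thread_large_cache_put(void *p);         /* per-thread big-object cache */
void node_freelist_push(int node, void *p);   /* per-node list, lock held */
pthread_mutex_t *node_lock(int node);

/* Origin-aware free: local objects stay in the thread cache; remote
 * objects are sent back to their origin node's freelist. */
void numalloc_free(void *ptr) {
    int origin = origin_node(ptr);            /* pure arithmetic, no syscall */
    if (origin == my_node) {
        if (is_small(ptr)) thread_freelist_push(ptr);
        else               thread_large_cache_put(ptr);
    } else {
        pthread_mutex_lock(node_lock(origin));
        node_freelist_push(origin, ptr);
        pthread_mutex_unlock(node_lock(origin));
    }
}
```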
---
## 7. Core Component: Incremental Sharing of Huge Pages
Transparent Huge Pages (THP) reduce TLB misses by mapping 2 MB pages instead of 4 KB pages. But they cause memory waste: a single 8-byte allocation can consume 2 MB of physical memory.
### How NUMAlloc handles THP
```mermaid
flowchart LR
subgraph "Traditional approach<br/>(Scalloc, TEMERAIRE)"
T1_OLD["Thread 1 gets<br/>2MB superblock"]
T2_OLD["Thread 2 gets<br/>2MB superblock"]
T3_OLD["Thread 3 gets<br/>2MB superblock"]
end
subgraph "NUMAlloc: Incremental Sharing"
HP["2MB Huge Page<br/>(shared by all threads on same node)"]
T1_NEW["Thread 1 takes<br/>32KB bag"]
T2_NEW["Thread 2 takes<br/>32KB bag"]
T3_NEW["Thread 3 takes<br/>32KB bag"]
HP --> T1_NEW
HP --> T2_NEW
HP --> T3_NEW
end
```
**Key rules:**
1. Threads on the **same node** share huge pages. This is safe because all accesses are local.
2. Each thread takes only **32 KB at a time** ("one bag"), not the full 2 MB.
3. Different size classes can share the same huge page, reducing fragmentation.
4. Metadata memory (e.g., `PerBagSizeInfo`, 8 bytes per bag) is explicitly allocated from normal 4 KB pages via `madvise`, preventing metadata from wasting huge-page space.
This gives NUMAlloc 4.5× lower memory consumption than Scalloc and memory overhead comparable to TCMalloc's TEMERAIRE, while keeping the TLB benefits.
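A sketch of the carving step, assuming a per-node atomic bump pointer (types and names are illustrative):

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define BAG_SIZE (32 * 1024)    /* one bag = 32 KB */

typedef struct {
    _Atomic uintptr_t bpSmall;  /* bump pointer into the node's region */
    uintptr_t         end;      /* end of the node's small-object area */
} node_region_t;

/* Hand out one 32 KB bag. Consecutive bags taken by different threads
 * on the same node land on the same 2 MB huge page until it fills,
 * so no thread ever claims a whole huge page for itself. */
void *take_bag(node_region_t *r) {
    uintptr_t bag = atomic_fetch_add(&r->bpSmall, BAG_SIZE);
    return (bag + BAG_SIZE <= r->end) ? (void *)bag : NULL;
}
```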
---
## 8. Efficient Object Movement
Objects must move between per-thread and per-node freelists to prevent memory blowup (where one thread hoards freed memory that other threads need). NUMAlloc uses two data structures for this:
### Per-Thread Freelist: Tail + nth Pointer
```mermaid
graph LR
HEAD["Head<br/>(most recent)"] --> N1[obj] --> N2[obj] --> N3[obj] --> NTH["nth pointer"] --> N4[obj] --> N5[obj] --> TAIL["Tail<br/>(least recent)"] --> NULL[NULL]
style NTH fill:#f9cb42,stroke:#854f0b,color:#412402
style TAIL fill:#f9cb42,stroke:#854f0b,color:#412402
```
When draining to the per-node heap, NUMAlloc moves the **least recently used** objects (between `nth` and `Tail`) in a single pointer swap — no traversal needed. This keeps hot objects in the thread cache and avoids polluting the CPU cache with cold data.
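A sketch of that constant-time drain; field names are illustrative:

```c
#include <stddef.h>

typedef struct obj { struct obj *next; } obj_t;

typedef struct {
    obj_t *head, *nth, *tail;   /* nth marks the hot/cold boundary */
    int    count;
} thread_freelist_t;

/* Detach the cold chain (everything after the nth object) with two
 * pointer writes; the caller pushes it to the per-node heap in one
 * locked operation. nth is re-established as new objects are freed. */
obj_t *drain_cold_half(thread_freelist_t *l, obj_t **cold_tail) {
    obj_t *cold  = l->nth->next;
    *cold_tail   = l->tail;
    l->tail      = l->nth;      /* the list now ends at the nth object */
    l->nth->next = NULL;
    l->count    -= l->count / 2;
    return cold;
}
```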
### Per-Node Freelist: Circular Array of Sub-Lists
```mermaid
graph TD
subgraph "Circular Array"
SL1["Sub-list 1<br/>Head → Tail"]
SL2["Sub-list 2<br/>Head → Tail"]
SL3["Sub-list 3<br/>Head → Tail"]
SL4["Sub-list 4<br/>Head → Tail"]
end
THREAD["Thread needs<br/>objects"] -->|"Takes entire<br/>sub-list at once"| SL1
```
When a thread needs objects from the per-node heap, it grabs an **entire sub-list** (one Head–Tail pair) in a single lock-guarded operation. No traversal of the shared list is needed, which minimizes lock contention when many threads compete.
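A sketch of the whole-sub-list transfer; capacity and field names are illustrative:

```c
#include <pthread.h>

typedef struct obj { struct obj *next; } obj_t;
typedef struct { obj_t *head, *tail; } sublist_t;

#define NUM_SUBLISTS 8            /* illustrative capacity */

typedef struct {
    pthread_mutex_t lock;
    sublist_t sub[NUM_SUBLISTS];  /* circular array of sub-lists */
    int take_idx, put_idx, count;
} node_freelist_t;

/* Hand an entire sub-list to the caller: one lock acquisition, two
 * index updates, and no walk over the shared objects. */
int take_sublist(node_freelist_t *nf, sublist_t *out) {
    pthread_mutex_lock(&nf->lock);
    if (nf->count == 0) {
        pthread_mutex_unlock(&nf->lock);
        return 0;                             /* nothing to take */
    }
    *out = nf->sub[nf->take_idx];
    nf->take_idx = (nf->take_idx + 1) % NUM_SUBLISTS;
    nf->count--;
    pthread_mutex_unlock(&nf->lock);
    return 1;
}
```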
---
## 9. Metadata Layout
NUMAlloc keeps all metadata local to the node that uses it:
| Structure | Location | Size / contents |
|---|---|---|
| `PerBagSizeInfo` | Separate area, normal pages (via `madvise`) | 8 bytes per bag (bags are 32 KB–256 KB depending on size class) |
| `PerBigObjectSizeInfo` | Linked list within each node's region | Variable (size + availability per big object) |
| `PerThreadHeap` | Thread-local storage, on the thread's node | One freelist head per size class |
| `PerNodeHeap` | Within each node's region | Freelists + circular array of sub-lists |
Local metadata means that even metadata access is a local memory operation — no hidden remote accesses from bookkeeping.
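One plausible realization of the metadata placement: map the metadata area separately and opt it out of THP. The `MADV_NOHUGEPAGE` flag is an assumption here; the text above only specifies normal pages via `madvise`:

```c
#include <stddef.h>
#include <sys/mman.h>

/* Map a metadata area on normal 4 KB pages so tiny descriptors
 * (8 bytes per bag) never pin a 2 MB huge page. */
void *map_metadata_area(size_t bytes) {
    void *m = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (m != MAP_FAILED)
        madvise(m, bytes, MADV_NOHUGEPAGE);  /* stay on 4 KB pages */
    return m;
}
```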
---
## 10. Initialization Sequence
```mermaid
sequenceDiagram
participant App as Application
participant NA as NUMAlloc
participant OS as Linux Kernel
App->>NA: Process starts (LD_PRELOAD)
NA->>OS: Query NUMA topology<br/>(numa_node_to_cpus)
OS-->>NA: Node-to-core mapping
NA->>OS: mmap large contiguous region
OS-->>NA: Virtual address range
NA->>OS: mbind region 1 → node 1<br/>mbind region 2 → node 2 ...
NA->>NA: Initialize PerNodeHeaps<br/>Set bpSmall, bpBig pointers
NA->>OS: madvise metadata region<br/>(use normal pages, not THP)
Note over App,NA: Ready to serve allocations
App->>NA: pthread_create(thread_fn)
NA->>OS: Bind thread to node<br/>(sched_setaffinity)
NA->>NA: Initialize PerThreadHeap<br/>for new thread
```
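A sketch of the region-binding step from the sequence above; it assumes at most 64 nodes so a single `unsigned long` holds the nodemask:

```c
#include <numaif.h>   /* mbind(2); link with -lnuma */
#include <stddef.h>

/* Bind one slice of the contiguous heap to one node. */
int bind_region_to_node(void *base, size_t region_size, int node) {
    unsigned long nodemask = 1UL << node;    /* single-node mask */
    return (int)mbind(base, region_size, MPOL_BIND,
                      &nodemask, sizeof(nodemask) * 8, 0);
}
```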
---
## 11. Large Object Cache
Objects larger than 256 KB bypass the bag-based small-object path and are backed by individual `mmap` regions. Without caching, every large alloc/dealloc pair incurs two syscalls (`mmap` + `munmap`), which costs 1–2 microseconds per pair.
### Architecture
The large object cache is a **per-thread, heap-allocated** structure that stores recently freed mmap regions for reuse. It is kept separate from `PerThreadHeap` so that the hot small-object freelists remain in a compact, cache-friendly struct (~320 bytes).
```mermaid
flowchart TD
ALLOC([alloc_large]) --> CACHE_CHECK{"Cache hit?<br/>(exact size match)"}
CACHE_CHECK -->|Yes| REUSE["Reuse cached mmap region<br/>Write header, return pointer"]
CACHE_CHECK -->|No| MMAP["mmap fresh region<br/>mbind to thread's node"]
FREE([dealloc_large]) --> CACHE_PUT{"Cache has room?"}
CACHE_PUT -->|Yes| STORE["Store in cache<br/>madvise if region > 512 KB"]
CACHE_PUT -->|"No (full)"| EVICT["Evict oldest entry (munmap)<br/>Store new entry"]
```
### Design Decisions
| Decision | Rationale |
|---|---|
| **Heap-allocated cache** | Keeps `PerThreadHeap` at ~320 bytes for L1-cache-friendly small-object access. The 16 KB cache array lives in a separate System-allocated block. |
| **Exact size matching** | The `alloc_size` (page-rounded total mmap size) is deterministic for a given `(size, align)` pair. Exact matching avoids wasting memory from oversized reuse. |
| **Packed array with swap-remove** | Entries are stored contiguously from index 0..count. `take()` finds at index 0 (common case: all same size) and swap-removes in O(1). `put()` appends in O(1). |
| **Eviction on full** | When the cache is full, the oldest entries are evicted via `munmap` to make room. This allows the cache to adapt when allocation patterns change. |
| **Conditional madvise** | `madvise(MADV_FREE/MADV_DONTNEED)` is only called for regions > 512 KB. Below this threshold, the syscall overhead (~1 microsecond) exceeds the memory savings. |
| **1024 slots, 512 MiB limit** | Handles bulk workloads (1000+ items) while bounding total virtual memory retained per thread. |
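A sketch of the cache's take/put pair as a packed array with swap-remove; constants and field names are illustrative, not the actual implementation:

```c
#include <stddef.h>
#include <sys/mman.h>

#define CACHE_SLOTS 1024

typedef struct { void *addr; size_t alloc_size; } entry_t;
typedef struct { entry_t e[CACHE_SLOTS]; int count; } large_cache_t;

/* Exact-size lookup. Index 0 hits first (common case: all entries the
 * same size); removal swaps the last entry into the hole in O(1). */
void *cache_take(large_cache_t *c, size_t alloc_size) {
    for (int i = 0; i < c->count; i++) {
        if (c->e[i].alloc_size == alloc_size) {
            void *p = c->e[i].addr;
            c->e[i] = c->e[--c->count];        /* swap-remove */
            return p;
        }
    }
    return NULL;                               /* miss: caller mmaps fresh */
}

/* Append if there is room; otherwise evict the entry at index 0
 * (approximately the oldest) via munmap to make space. */
void cache_put(large_cache_t *c, void *addr, size_t alloc_size) {
    if (c->count == CACHE_SLOTS) {
        munmap(c->e[0].addr, c->e[0].alloc_size);
        c->e[0] = c->e[--c->count];
    }
    c->e[c->count].addr = addr;
    c->e[c->count].alloc_size = alloc_size;
    c->count++;
}
```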
### Impact
The large object cache eliminates syscall overhead for repeated large allocations. Combined with the extended size classes (up to 256 KB), objects that previously required individual `mmap` regions now go through the fast freelist path:
| Benchmark | Before (per-call `mmap`) | After (cached) | Speedup |
|---|---|---|---|
| alloc_dealloc 64 KB | 1.9 µs | 7.7 ns | **247×** |
| alloc_dealloc 256 KB | 2.15 µs | 8.4 ns | **256×** |
| bulk 1000 × 64 KB | 3.0 ms | 57.2 µs | **52×** |
| bulk 1000 × 256 KB | 3.94 ms | 37.4 µs | **105×** |
| page-aligned 64 KB (×100) | 378 µs | 2.7 µs | **140×** |
| realloc 4 KB → 64 KB | 2.38 µs | 78 ns | **31×** |
---
## 12. Performance Results
### Micro-benchmarks (single-threaded alloc+dealloc, steady state)
| Size | NUMAlloc | Allocator B | Allocator C | Allocator D |
|---|---|---|---|---|
| 8 B | **7.4 ns** | 13.8 ns | 10.9 ns | 9.2 ns |
| 64 B | **7.4 ns** | 15.4 ns | 11.1 ns | 9.4 ns |
| 256 B | **7.3 ns** | 17.3 ns | 12.4 ns | 9.3 ns |
| 1 KB | **7.4 ns** | 31.2 ns | 15.7 ns | 9.8 ns |
| 4 KB | **7.4 ns** | 25.1 ns | 20.4 ns | 11.3 ns |
| 16 KB | **7.3 ns** | 31.7 ns | 19.3 ns | 23.2 ns |
| 64 KB | **7.7 ns** | 692 ns | 19.4 ns | 104 ns |
| 256 KB | **8.4 ns** | 707 ns | 956 ns | 105 ns |
NUMAlloc is the fastest allocator across all tested sizes, with a nearly flat 7.3–8.4 ns profile from 8 B through 256 KB, thanks to the per-thread freelists (small objects) and the large object cache (large objects).
### Multi-threaded alloc+dealloc (10,000 ops/thread)
| Workload | NUMAlloc | Allocator B | Allocator C | Allocator D |
|---|---|---|---|---|
| 64 B, 2 threads | **108 µs** | 658 µs | 146 µs | 128 µs |
| 64 B, 4 threads | **152 µs** | 1.1 ms | 177 µs | 159 µs |
| 1 KB, 8 threads | **208 µs** | 2.36 ms | 299 µs | 232 µs |
| 4 KB, 2 threads | **106 µs** | 305 µs | 243 µs | 151 µs |
| 4 KB, 8 threads | **239 µs** | 469 µs | 373 µs | 254 µs |
NUMAlloc scales well with thread count: its per-thread freelists need no synchronization at all, and per-node transfers move whole sub-lists under a briefly held lock (Section 8).
### Bulk alloc+free (1000 items, single-threaded)
| Size | NUMAlloc | Allocator B | Allocator C | Allocator D |
|---|---|---|---|---|
| 64 B | **8.2 µs** | 16.8 µs | 8.0 µs | 17.1 µs |
| 4 KB | **24.3 µs** | 88.1 µs | 24.4 µs | 46.5 µs |
| 64 KB | **57.2 µs** | 709 µs | 43.5 µs | 508 µs |
| 256 KB | **37.4 µs** | 704 µs | 155 µs | 545 µs |
### NUMA-specific results (8-node Intel Xeon, 128 cores)
| Metric | vs. glibc | vs. mimalloc |
|---|---|---|
| Average speedup | 20.9% faster | 15.7% faster |
| Best case (fluidanimate) | 5.3× faster | 4.6× faster |
| Remote accesses | 9× fewer | 8× fewer |
| TLB misses | 18× fewer | — |
| Scalability at 128 threads | 88× speedup (NUMAlloc) | 75× (mimalloc) |
---
## 13. Limitations and Trade-offs
1. **Higher memory consumption with THP enabled** — The large initial `mmap` triggers huge-page backing. NUMAlloc uses 17.6% more memory than glibc (but 4.5× less than Scalloc). Could be improved with TEMERAIRE-style page management.
2. **Thread binding is mandatory** — NUMAlloc pins threads to nodes. This doesn't conflict with OS scheduling (threads can still move between cores within a node), but it does prevent cross-node migration. For server workloads with thread pools, the node-interleaved binding works well.
3. **Remote deallocation bottleneck** — When many threads free objects originating from the same remote node, the per-node freelist lock can become contended. This is visible in the `larson` benchmark where mimalloc edges ahead at 128 threads.
4. **No support for heterogeneous memory** — NUMAlloc assumes all nodes have equal memory capacity and divides the heap evenly.
---
## 14. Comparison with Other Allocators
```mermaid
graph TD
subgraph "NUMA-unaware"
GLIBC["glibc<br/>Sequential allocator<br/>No NUMA support"]
JEMALLOC["jemalloc<br/>BiBOP-style<br/>No NUMA awareness"]
HOARD["Hoard<br/>Per-thread heaps<br/>No NUMA awareness"]
end
subgraph "Partial NUMA support"
TCNUMA["TCMalloc-NUMA<br/>Per-node freelists<br/>Many mbind calls<br/>No metadata locality"]
MIMALLOC["mimalloc<br/>Per-page freelists<br/>Segment-level NUMA<br/>No thread binding"]
TBB["Intel TBB<br/>NUMA-optimized scheduler<br/>Separate allocator"]
end
subgraph "Full NUMA design"
NUMALLOC["NUMAlloc<br/>Thread + memory binding<br/>Origin-aware dealloc<br/>Incremental THP sharing<br/>Local metadata"]
end
GLIBC -.->|"adds node freelists"| TCNUMA
TCNUMA -.->|"adds binding +<br/>origin tracking"| NUMALLOC
MIMALLOC -.->|"adds thread binding +<br/>THP sharing"| NUMALLOC
```
---
## 15. When to Use NUMAlloc
**Good fit:**
- Multi-threaded server applications on NUMA hardware (2+ sockets)
- Applications with many small allocations and high thread counts
- Workloads where threads primarily use their own allocated memory
- Environments where THP is enabled
**Poor fit:**
- Single-threaded applications (no NUMA benefit)
- Applications that intentionally share objects across threads on different nodes (producer-consumer across sockets)
- Systems with asymmetric NUMA topologies or heterogeneous memory
---
*Based on: Yang et al., "NUMAlloc: A Faster NUMA Memory Allocator," ISMM 2023, ACM.*