numalloc 0.1.2

A blazing-fast, NUMA-aware memory allocator written in pure Rust
# NUMAlloc: Architecture Design Document

## 1. Overview

NUMAlloc is a replacement memory allocator purpose-built for Non-Uniform Memory Access (NUMA) machines. It intercepts standard `malloc`/`free` calls via `LD_PRELOAD`, so applications need zero source-code changes. The allocator is built around three core ideas:

1. **Binding-based memory management** — threads and memory regions are pinned to NUMA nodes at startup.
2. **Origin-aware allocation and deallocation** — every heap object carries an implicit "origin node," and freed objects always return to their origin node's freelist.
3. **Incremental sharing of huge pages** — threads on the same node share transparent huge pages (THP) in small increments, combining TLB-miss reduction with low memory waste.

The result: 15.7% faster than mimalloc, 20.9% faster than glibc, with 9× fewer remote memory accesses and 18× fewer TLB misses on average across 25 benchmarks.

---

## 2. The Problem: Why NUMA Needs a Dedicated Allocator

### What is NUMA?

In a NUMA machine, each CPU socket (node) has its own local memory. Accessing local memory is fast; accessing another node's memory ("remote access") is 2–3× slower. A typical server might have 2–8 nodes, each with 16–32 cores.

### Why existing allocators fail on NUMA

Traditional allocators (glibc, TCMalloc, jemalloc, mimalloc) were designed for uniform memory. They cause four categories of NUMA performance problems:

| Problem                       | Root cause                                                                                         | Impact                                                    |
|-------------------------------|----------------------------------------------------------------------------------------------------|-----------------------------------------------------------|
| **Expensive node lookups**    | Checking which node a thread runs on requires a syscall (~10,000 cycles)                           | Allocators avoid checking, so they guess wrong            |
| **Blind deallocation**        | Freed objects go to the deallocating thread's cache regardless of which node the memory belongs to | Reuse of freed memory triggers remote accesses            |
| **Thread migration**          | The OS can move a thread to a different node mid-execution                                         | All of a thread's "local" memory suddenly becomes remote  |
| **No huge page coordination** | Allocators either ignore THP or give each thread a private 2MB superblock                          | Wasted memory or missed TLB optimization                  |

NUMAlloc tackles all four.

---

## 3. High-Level Architecture

The system is composed of four layers, from hardware up to the application:

```mermaid
graph TB
    subgraph Application Layer
        A[Application Code<br/>malloc / free / realloc]
    end

    subgraph NUMAlloc Interceptor
        B[LD_PRELOAD Interception<br/>Redirects all alloc/dealloc calls]
    end

    subgraph NUMAlloc Core
        C[Thread Binding Manager]
        D[Origin-Aware Allocator]
        E[Incremental Sharing Engine]
        F[Object Movement Engine]
    end

    subgraph OS / Hardware
        G[mbind / set_mempolicy syscalls]
        H[NUMA Hardware Topology<br/>Nodes, Cores, Interconnects]
    end

    A --> B
    B --> C
    B --> D
    B --> E
    C --> G
    D --> F
    E --> G
    G --> H
```

---

## 4. Heap Layout

NUMAlloc's heap is a single large contiguous virtual memory region, divided equally among all NUMA nodes. This layout is the foundation for every other design decision.

```mermaid
graph LR
    subgraph "Contiguous Virtual Memory Region"
        subgraph Node_1["Node 1 Region"]
            S1[Small Objects<br/>bpSmall pointer]
            B1[Big Objects<br/>bpBig pointer]
        end
        subgraph Node_2["Node 2 Region"]
            S2[Small Objects<br/>bpSmall pointer]
            B2[Big Objects<br/>bpBig pointer]
        end
        subgraph Node_N["Node N Region"]
            SN[Small Objects<br/>bpSmall pointer]
            BN[Big Objects<br/>bpBig pointer]
        end
    end
```

### Key properties

- **Each region is bound** to its corresponding physical node via the `mbind` syscall. Region 1 → Node 1, Region 2 → Node 2, etc.
- **Node lookup is a division**: given a pointer `p`, its origin node = `(p - heap_base) / region_size`. No syscall needed.
- **Small objects** (up to 256 KB) are managed in BiBOP-style bags (32 KB for objects up to 16 KB; larger classes use a bag equal to the object size), each holding objects of one size class.
- **Big objects** (> 256 KB) are backed by individual `mmap` regions with a per-thread large object cache to avoid repeated syscalls.
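Because the heap is one contiguous region divided equally among nodes, the lookup reduces to pure arithmetic. A minimal sketch (the `NUM_NODES` and `REGION_SIZE` constants here are illustrative, not the allocator's real configuration, and a 64-bit address space is assumed):

```rust
// Illustrative layout constants; the real values are derived at startup
// from the NUMA topology and the size of the initial mmap.
const NUM_NODES: usize = 4;
const REGION_SIZE: usize = 64 << 30; // 64 GiB per node region (assumed)

/// Origin node of a heap pointer: one subtraction and one division,
/// no `getcpu()` / `get_mempolicy()` syscall.
fn origin_node(heap_base: usize, ptr: usize) -> usize {
    debug_assert!(ptr >= heap_base && ptr - heap_base < NUM_NODES * REGION_SIZE);
    (ptr - heap_base) / REGION_SIZE
}

fn main() {
    let base = 0x7f00_0000_0000usize; // pretend mmap return value
    assert_eq!(origin_node(base, base), 0);
    assert_eq!(origin_node(base, base + REGION_SIZE + 4096), 1);
    assert_eq!(origin_node(base, base + 3 * REGION_SIZE), 3);
    println!("ok");
}
```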

### Per-Thread and Per-Node Heaps

Each thread gets a private per-thread heap (no locks needed). Each node has a shared per-node heap (lock-protected, but accessed less frequently).

```mermaid
graph TD
    subgraph "Node 1"
        direction TB
        T1["Thread 0<br/>PerThreadHeap"]
        T2["Thread 2<br/>PerThreadHeap"]
        T3["Thread 4<br/>PerThreadHeap"]
        NH1["PerNodeHeap<br/>(shared, lock-protected)"]
        UR1["Unallocated Region<br/>bpSmall / bpBig"]
    end

    subgraph "Node 2"
        direction TB
        T4["Thread 1<br/>PerThreadHeap"]
        T5["Thread 3<br/>PerThreadHeap"]
        T6["Thread 5<br/>PerThreadHeap"]
        NH2["PerNodeHeap<br/>(shared, lock-protected)"]
        UR2["Unallocated Region<br/>bpSmall / bpBig"]
    end

    T1 -->|overflow| NH1
    T2 -->|overflow| NH1
    T3 -->|overflow| NH1
    NH1 -->|refill| UR1

    T4 -->|overflow| NH2
    T5 -->|overflow| NH2
    T6 -->|overflow| NH2
    NH2 -->|refill| UR2
```

Each PerThreadHeap has one freelist per size class. When a per-thread freelist overflows (too many freed objects), it drains to the per-node heap. When it's empty, it refills from the per-node heap or from unallocated memory.
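The overflow/refill movement can be sketched as a toy model. This is not numalloc's implementation: `MAX_PER_THREAD` and `BATCH` are made-up watermarks, and object addresses are stand-in `usize` values:

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

// Illustrative watermarks; the real allocator tunes these per size class.
const MAX_PER_THREAD: usize = 8; // drain when the per-thread list exceeds this
const BATCH: usize = 4;          // objects moved per drain/refill

struct PerNodeList(Mutex<VecDeque<usize>>); // shared, lock-protected

struct PerThreadList {
    objs: VecDeque<usize>, // private, no lock
}

impl PerThreadList {
    /// free(): push locally; on overflow, drain the oldest batch to the node.
    fn push(&mut self, obj: usize, node: &PerNodeList) {
        self.objs.push_front(obj);
        if self.objs.len() > MAX_PER_THREAD {
            let mut shared = node.0.lock().unwrap();
            for _ in 0..BATCH {
                if let Some(cold) = self.objs.pop_back() {
                    shared.push_back(cold);
                }
            }
        }
    }

    /// malloc(): pop locally; refill a batch from the node heap when empty.
    fn pop(&mut self, node: &PerNodeList) -> Option<usize> {
        if self.objs.is_empty() {
            let mut shared = node.0.lock().unwrap();
            for _ in 0..BATCH {
                if let Some(o) = shared.pop_front() {
                    self.objs.push_front(o);
                }
            }
        }
        self.objs.pop_front()
    }
}

fn main() {
    let node = PerNodeList(Mutex::new(VecDeque::new()));
    let mut t = PerThreadList { objs: VecDeque::new() };
    for addr in 0..=MAX_PER_THREAD {
        t.push(addr, &node); // the last push overflows and drains a batch
    }
    assert_eq!(t.objs.len(), MAX_PER_THREAD + 1 - BATCH);
    assert_eq!(node.0.lock().unwrap().len(), BATCH);
    t.objs.clear();
    assert!(t.pop(&node).is_some()); // refilled a batch from the node heap
    println!("ok");
}
```

Note that the common paths (a push that does not overflow, a pop that does not refill) never touch the lock.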

---

## 5. Core Component: Binding-Based Memory Management

### Thread Binding

At thread creation, NUMAlloc intercepts `pthread_create` and pins the new thread to a specific NUMA node (not a specific core — the OS can still schedule within the node).

```mermaid
flowchart TD
    START([Thread Created]) --> INTERCEPT[NUMAlloc intercepts<br/>pthread_create]
    INTERCEPT --> POLICY{Binding policy?}
    POLICY -->|Node-interleaved<br/>default| INTERLEAVE["Bind to node = thread_id % num_nodes<br/>Threads spread evenly across nodes"]
    POLICY -->|Node-saturate| SATURATE["Fill node 0 first,<br/>then node 1, etc."]
    POLICY -->|Custom| CUSTOM["Read binding from<br/>config file"]
    INTERLEAVE --> BOUND[Thread bound to node<br/>via sched_setaffinity]
    SATURATE --> BOUND
    CUSTOM --> BOUND
    BOUND --> RUN([Thread runs<br/>always on its assigned node])
```

**Node-interleaved binding** (default): Thread 0 → Node 0, Thread 1 → Node 1, Thread 2 → Node 0, Thread 3 → Node 1, ... This distributes work evenly. Benchmarks show it's ~19% faster than node-saturate for full-core utilization.

**Node-saturate binding**: Fill one node completely before moving to the next. Better when using fewer cores than available.
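Both policies reduce to arithmetic over thread IDs. A sketch, with illustrative helper names and an assumed `threads_per_node` parameter (the real implementation then applies the chosen node via `sched_setaffinity`):

```rust
/// Node-interleaved (default): spread threads round-robin across nodes.
fn interleaved(thread_id: usize, num_nodes: usize) -> usize {
    thread_id % num_nodes
}

/// Node-saturate: fill each node's cores before moving to the next.
fn saturate(thread_id: usize, threads_per_node: usize, num_nodes: usize) -> usize {
    (thread_id / threads_per_node).min(num_nodes - 1)
}

fn main() {
    // 2 nodes: interleaved alternates, saturate fills node 0 first.
    let il: Vec<_> = (0..6).map(|t| interleaved(t, 2)).collect();
    assert_eq!(il, [0, 1, 0, 1, 0, 1]);
    let sat: Vec<_> = (0..6).map(|t| saturate(t, 3, 2)).collect();
    assert_eq!(sat, [0, 0, 0, 1, 1, 1]);
    println!("ok");
}
```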

### Benefits of binding

Once a thread is bound, NUMAlloc knows its node from a thread-local variable — no expensive `getcpu()` / `get_mempolicy()` syscalls. This knowledge powers everything else.

---

## 6. Core Component: Origin-Aware Memory Management

This is the mechanism that eliminates remote accesses during both allocation and deallocation.

### Allocation Path

```mermaid
flowchart TD
    REQ([malloc request]) --> SIZE{Object size?}
    SIZE -->|"Small (≤ 256KB)"| PTH["Check PerThreadHeap<br/>freelist for this size class"]
    SIZE -->|"Big (> 256KB)"| BIG["Check PerThreadHeap<br/>large object cache"]

    PTH --> AVAIL1{Objects<br/>available?}
    AVAIL1 -->|Yes| RETURN1([Return object<br/>from thread cache])
    AVAIL1 -->|No| PNH["Check PerNodeHeap<br/>freelist for this size class"]

    PNH --> AVAIL2{Objects<br/>available?}
    AVAIL2 -->|Yes| RETURN2([Move batch to<br/>PerThreadHeap, return one])
    AVAIL2 -->|No| FRESH["Allocate from unallocated<br/>region via bpSmall pointer"]

    FRESH --> RETURN3([Return fresh memory<br/>from local node])

    BIG --> AVAIL3{Objects<br/>available?}
    AVAIL3 -->|Yes| RETURN4([Return big object])
    AVAIL3 -->|No| FRESH_BIG["Allocate from unallocated<br/>region via bpBig pointer"]
    FRESH_BIG --> RETURN5([Return fresh big object])
```

**Every step allocates from the current thread's node.** The per-thread freelist holds local objects. The per-node freelist holds local objects returned by origin-aware deallocation. The unallocated region is pre-bound to the local node.

### Deallocation Path (Origin-Aware)

This is what makes NUMAlloc unique. When `free(ptr)` is called, NUMAlloc computes the origin node of `ptr` and compares it to the calling thread's node.

```mermaid
flowchart TD
    FREE([free ptr]) --> COMPUTE["Compute origin node<br/>node = offset / region_size"]
    COMPUTE --> COMPARE{Origin node ==<br/>thread's node?}

    COMPARE -->|"Yes — local object"| LOCAL_SIZE{Object size?}
    LOCAL_SIZE -->|Small| LOCAL_THREAD["Return to PerThreadHeap<br/>freelist (no lock needed)"]
    LOCAL_SIZE -->|Big| LOCAL_NODE["Cache in PerThreadHeap<br/>large object cache"]

    COMPARE -->|"No — remote object"| REMOTE["Return to ORIGIN node's<br/>PerNodeHeap freelist<br/>(lock required)"]

    LOCAL_THREAD --> DONE([Done])
    LOCAL_NODE --> DONE
    REMOTE --> DONE
```

**Why this matters**: Traditional allocators put freed objects into the deallocating thread's cache. If Thread A (on Node 0) frees memory that was allocated on Node 1, the object stays in Thread A's cache. Next time Thread A reuses it, every access is a remote access. NUMAlloc avoids this by sending the object back to Node 1's freelist.
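The routing decision in the flowchart above is again pure arithmetic plus one comparison. In this sketch, `route_free`, `FreeTarget`, and the constants are hypothetical names for illustration:

```rust
const REGION_SIZE: usize = 1 << 30;    // illustrative per-node region size
const SMALL_MAX: usize = 256 * 1024;   // small/big object threshold

enum FreeTarget {
    LocalThreadFreelist,       // small + local: no lock needed
    LocalBigCache,             // big + local: per-thread large object cache
    RemoteNodeFreelist(usize), // remote: origin node's PerNodeHeap (lock)
}

fn route_free(heap_base: usize, ptr: usize, size: usize, my_node: usize) -> FreeTarget {
    let origin = (ptr - heap_base) / REGION_SIZE;
    if origin == my_node {
        if size <= SMALL_MAX {
            FreeTarget::LocalThreadFreelist
        } else {
            FreeTarget::LocalBigCache
        }
    } else {
        FreeTarget::RemoteNodeFreelist(origin)
    }
}

fn main() {
    let base = 0usize; // stand-in heap base for the sketch
    assert!(matches!(route_free(base, 2 * REGION_SIZE + 64, 128, 2),
                     FreeTarget::LocalThreadFreelist));
    assert!(matches!(route_free(base, REGION_SIZE, 1 << 20, 1),
                     FreeTarget::LocalBigCache));
    assert!(matches!(route_free(base, 2 * REGION_SIZE + 64, 128, 0),
                     FreeTarget::RemoteNodeFreelist(2)));
    println!("ok");
}
```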

---

## 7. Core Component: Incremental Sharing of Huge Pages

Transparent Huge Pages (THP) reduce TLB misses by mapping 2 MB pages instead of 4 KB pages. But they cause memory waste: a single 8-byte allocation can consume 2 MB of physical memory.

### How NUMAlloc handles THP

```mermaid
flowchart LR
    subgraph "Traditional approach<br/>(Scalloc, TEMERAIRE)"
        T1_OLD["Thread 1 gets<br/>2MB superblock"] 
        T2_OLD["Thread 2 gets<br/>2MB superblock"]
        T3_OLD["Thread 3 gets<br/>2MB superblock"]
    end

    subgraph "NUMAlloc: Incremental Sharing"
        HP["2MB Huge Page<br/>(shared by all threads on same node)"]
        T1_NEW["Thread 1 takes<br/>32KB bag"]
        T2_NEW["Thread 2 takes<br/>32KB bag"]
        T3_NEW["Thread 3 takes<br/>32KB bag"]
        HP --> T1_NEW
        HP --> T2_NEW
        HP --> T3_NEW
    end
```

**Key rules:**

1. Threads on the **same node** share huge pages. This is safe because all accesses are local.
2. Each thread takes only **32 KB at a time** ("one bag"), not the full 2 MB.
3. Different size classes can share the same huge page, reducing fragmentation.
4. Metadata memory (e.g., `PerBagSizeInfo`, 8 bytes per bag) is explicitly allocated from normal 4 KB pages via `madvise`, preventing metadata from wasting huge-page space.

This gives NUMAlloc 4.5× lower memory consumption than Scalloc and memory overhead similar to TCMalloc's TEMERAIRE, while still getting the TLB benefits.

---

## 8. Efficient Object Movement

Objects must move between per-thread and per-node freelists to prevent memory blowup (where one thread hoards freed memory that other threads need). NUMAlloc uses two data structures for this:

### Per-Thread Freelist: Tail + nth Pointer

```mermaid
graph LR
    HEAD["Head<br/>(most recent)"] --> N1[obj] --> N2[obj] --> N3[obj] --> NTH["nth pointer"] --> N4[obj] --> N5[obj] --> TAIL["Tail<br/>(least recent)"] --> NULL[NULL]
    
    style NTH fill:#f9cb42,stroke:#854f0b,color:#412402
    style TAIL fill:#f9cb42,stroke:#854f0b,color:#412402
```

When draining to the per-node heap, NUMAlloc moves the **least recently used** objects (between `nth` and `Tail`) in a single pointer swap — no traversal needed. This keeps hot objects in the thread cache and avoids polluting the CPU cache with cold data.
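A minimal unsafe-Rust sketch of the O(1) detach, assuming an intrusive freelist where each free object's first word holds the next pointer (`drain_after` is an illustrative name, not numalloc's API):

```rust
// Intrusive freelist node: the free object's memory itself stores the link.
struct Node {
    next: *mut Node,
}

/// Detach everything after `nth` in two writes — no traversal.
/// Returns (head, tail) of the drained cold span; the caller's new tail
/// becomes `nth`. All pointers are assumed valid and non-null.
unsafe fn drain_after(nth: *mut Node, tail: *mut Node) -> (*mut Node, *mut Node) {
    let drained_head = (*nth).next;
    (*nth).next = std::ptr::null_mut();
    (drained_head, tail)
}

fn main() {
    // Build a 6-node list; `nth` points at the 3rd node.
    let mut nodes: Vec<Box<Node>> =
        (0..6).map(|_| Box::new(Node { next: std::ptr::null_mut() })).collect();
    for i in 0..5 {
        let next = &mut *nodes[i + 1] as *mut Node;
        nodes[i].next = next;
    }
    let nth = &mut *nodes[2] as *mut Node;
    let tail = &mut *nodes[5] as *mut Node;
    let (dh, dt) = unsafe { drain_after(nth, tail) };
    // The hot prefix now ends at nth; the cold suffix (nodes 3..=5) drained.
    assert_eq!(dh, &mut *nodes[3] as *mut Node);
    assert_eq!(dt, tail);
    assert!(unsafe { (*nth).next.is_null() });
    println!("ok");
}
```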

### Per-Node Freelist: Circular Array of Sub-Lists

```mermaid
graph TD
    subgraph "Circular Array"
        SL1["Sub-list 1<br/>Head → Tail"] 
        SL2["Sub-list 2<br/>Head → Tail"]
        SL3["Sub-list 3<br/>Head → Tail"]
        SL4["Sub-list 4<br/>Head → Tail"]
    end

    THREAD["Thread needs<br/>objects"] -->|"Takes entire<br/>sub-list at once"| SL1
```

When a thread needs objects from the per-node heap, it grabs an **entire sub-list** (one Head–Tail pair) in a single lock-guarded operation. The shared list is never traversed, which minimizes lock contention when many threads compete.
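A sketch with a `Mutex`-guarded `Vec` standing in for the circular array (each sub-list is modeled as a `Vec` of object addresses). The point is that `put` and `take` each move a whole batch under one lock acquisition:

```rust
use std::sync::Mutex;

// Stand-in for the per-node circular array of sub-lists; capacity
// management is omitted for brevity.
struct PerNodeFreelist {
    sublists: Mutex<Vec<Vec<usize>>>,
}

impl PerNodeFreelist {
    /// A draining thread donates a whole batch: one push under the lock.
    fn put(&self, batch: Vec<usize>) {
        self.sublists.lock().unwrap().push(batch);
    }

    /// A refilling thread takes an entire sub-list: one pop under the lock,
    /// never walking the shared objects themselves.
    fn take(&self) -> Option<Vec<usize>> {
        self.sublists.lock().unwrap().pop()
    }
}

fn main() {
    let fl = PerNodeFreelist { sublists: Mutex::new(Vec::new()) };
    fl.put(vec![0x1000, 0x1040, 0x1080]);
    fl.put(vec![0x2000, 0x2040]);
    let got = fl.take().unwrap();
    assert_eq!(got.len(), 2); // the whole second sub-list, in one operation
    println!("ok");
}
```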

---

## 9. Metadata Layout

NUMAlloc keeps all metadata local to the node that uses it:

| Metadata               | Location                                    | Size                                                   |
|------------------------|---------------------------------------------|--------------------------------------------------------|
| `PerBagSizeInfo`       | Separate area, normal pages (via `madvise`) | 8 bytes per bag (32 KB–256 KB depending on size class) |
| `PerBigObjectSizeInfo` | Linked list within each node's region       | Variable (size + availability per big object)          |
| `PerThreadHeap`        | Thread-local storage, on thread's node      | One freelist head per size class                       |
| `PerNodeHeap`          | Within each node's region                   | Freelists + circular array of sub-lists                |

Local metadata means that even metadata access is a local memory operation — no hidden remote accesses from bookkeeping.

---

## 10. Initialization Sequence

```mermaid
sequenceDiagram
    participant App as Application
    participant NA as NUMAlloc
    participant OS as Linux Kernel

    App->>NA: Process starts (LD_PRELOAD)
    NA->>OS: Query NUMA topology<br/>(numa_node_to_cpus)
    OS-->>NA: Node-to-core mapping
    NA->>OS: mmap large contiguous region
    OS-->>NA: Virtual address range
    NA->>OS: mbind region 1 → node 1<br/>mbind region 2 → node 2 ...
    NA->>NA: Initialize PerNodeHeaps<br/>Set bpSmall, bpBig pointers
    NA->>OS: madvise metadata region<br/>(use normal pages, not THP)
    
    Note over App,NA: Ready to serve allocations
    
    App->>NA: pthread_create(thread_fn)
    NA->>OS: Bind thread to node<br/>(sched_setaffinity)
    NA->>NA: Initialize PerThreadHeap<br/>for new thread
```

---

## 11. Large Object Cache

Objects larger than 256 KB bypass the bag-based small-object path and are backed by individual `mmap` regions. Without caching, every large alloc/dealloc pair incurs two syscalls (`mmap` + `munmap`), costing 1–2 microseconds per pair.

### Architecture

The large object cache is a **per-thread, heap-allocated** structure that stores recently freed mmap regions for reuse. It is kept separate from `PerThreadHeap` so that the hot small-object freelists remain in a compact, cache-friendly struct (~320 bytes).

```mermaid
flowchart TD
    ALLOC([alloc_large]) --> CACHE_CHECK{"Cache hit?<br/>(exact size match)"}
    CACHE_CHECK -->|Yes| REUSE["Reuse cached mmap region<br/>Write header, return pointer"]
    CACHE_CHECK -->|No| MMAP["mmap fresh region<br/>mbind to thread's node"]

    FREE([dealloc_large]) --> CACHE_PUT{"Cache has room?"}
    CACHE_PUT -->|Yes| STORE["Store in cache<br/>madvise if region > 512 KB"]
    CACHE_PUT -->|"No (full)"| EVICT["Evict oldest entry (munmap)<br/>Store new entry"]
```

### Design Decisions

| Decision                          | Rationale                                                                                                                                                       |
|-----------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **Heap-allocated cache**          | Keeps `PerThreadHeap` at ~320 bytes for L1-cache-friendly small-object access. The 16 KB cache array lives in a separate System-allocated block.                |
| **Exact size matching**           | The `alloc_size` (page-rounded total mmap size) is deterministic for a given `(size, align)` pair. Exact matching avoids wasting memory from oversized reuse.   |
| **Packed array with swap-remove** | Entries are stored contiguously in slots 0..count. `take()` scans from index 0 (in the common case every cached entry has the same size, so the first slot matches) and swap-removes in O(1); `put()` appends in O(1).            |
| **Eviction on full**              | When the cache is full, the oldest entries are evicted via `munmap` to make room. This allows the cache to adapt when allocation patterns change.               |
| **Conditional madvise**           | `madvise(MADV_FREE/MADV_DONTNEED)` is only called for regions > 512 KB. Below this threshold, the syscall overhead (~1 microsecond) exceeds the memory savings. |
| **1024 slots, 512 MiB limit**     | Handles bulk workloads (1000+ items) while bounding total virtual memory retained per thread.                                                                   |
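The packed-array-with-swap-remove idea can be sketched as below. The 8-slot capacity, field names, and the evict-slot-0 policy are illustrative simplifications (the text above says 1024 slots, and the real cache tracks eviction order):

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
struct Entry {
    addr: usize,       // mmap'ed region start
    alloc_size: usize, // page-rounded total size (the exact-match key)
}

struct LargeObjectCache {
    slots: [Option<Entry>; 8], // entries packed in slots 0..count
    count: usize,
}

impl LargeObjectCache {
    fn new() -> Self {
        Self { slots: [None; 8], count: 0 }
    }

    /// Exact-size lookup; O(1) in the common all-one-size case.
    fn take(&mut self, alloc_size: usize) -> Option<Entry> {
        let i = (0..self.count)
            .find(|&i| self.slots[i].unwrap().alloc_size == alloc_size)?;
        let hit = self.slots[i];
        self.count -= 1;
        self.slots[i] = self.slots[self.count]; // swap-remove: fill the hole
        self.slots[self.count] = None;
        hit
    }

    /// Append; when full, returns an evicted entry for the caller to munmap.
    fn put(&mut self, e: Entry) -> Option<Entry> {
        if self.count == self.slots.len() {
            let evicted = self.slots[0]; // simplified eviction choice
            self.slots[0] = Some(e);
            evicted
        } else {
            self.slots[self.count] = Some(e);
            self.count += 1;
            None
        }
    }
}

fn main() {
    let mut c = LargeObjectCache::new();
    c.put(Entry { addr: 0xA000, alloc_size: 64 * 1024 });
    c.put(Entry { addr: 0xB000, alloc_size: 256 * 1024 });
    let hit = c.take(64 * 1024).unwrap();
    assert_eq!(hit.addr, 0xA000);
    assert_eq!(c.count, 1); // array stays packed after the swap-remove
    assert!(c.take(64 * 1024).is_none()); // exact size match only
    println!("ok");
}
```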

### Impact

The large object cache eliminates syscall overhead for repeated large allocations. Combined with the extended size classes (up to 256 KB), objects that previously required individual `mmap` regions now go through the fast freelist path:

| Benchmark                  | Before (mmap)     | After             | Improvement |
|----------------------------|-------------------|-------------------|-------------|
| alloc_dealloc 64 KB        | 1.9 microseconds  | 7.7 ns            | **247x**    |
| alloc_dealloc 256 KB       | 2.15 microseconds | 8.4 ns            | **256x**    |
| bulk 1000x 64 KB           | 3.0 ms            | 57.2 microseconds | **52x**     |
| bulk 1000x 256 KB          | 3.94 ms           | 37.4 microseconds | **105x**    |
| page-aligned 64 KB (×100)  | 378 microseconds  | 2.7 microseconds  | **140x**    |
| realloc 4 KB to 64 KB      | 2.38 microseconds | 78 ns             | **31x**     |

---

## 12. Performance Results

### Micro-benchmarks (single-threaded alloc+dealloc, steady state)

| Size   | numalloc   | system (glibc) | mimalloc | jemalloc |
|--------|------------|----------------|----------|----------|
| 8 B    | **7.4 ns** | 13.8 ns        | 10.9 ns  | 9.2 ns   |
| 64 B   | **7.4 ns** | 15.4 ns        | 11.1 ns  | 9.4 ns   |
| 256 B  | **7.3 ns** | 17.3 ns        | 12.4 ns  | 9.3 ns   |
| 1 KB   | **7.4 ns** | 31.2 ns        | 15.7 ns  | 9.8 ns   |
| 4 KB   | **7.4 ns** | 25.1 ns        | 20.4 ns  | 11.3 ns  |
| 16 KB  | **7.3 ns** | 31.7 ns        | 19.3 ns  | 23.2 ns  |
| 64 KB  | **7.7 ns** | 692 ns         | 19.4 ns  | 104 ns   |
| 256 KB | **8.4 ns** | 707 ns         | 956 ns   | 105 ns   |

numalloc is the fastest allocator across all tested sizes, with a flat ~7.5 ns profile from 8 B through 256 KB thanks to the per-thread freelist (small) and large object cache (large).

### Multi-threaded alloc+dealloc (10,000 ops/thread)

| Config          | numalloc             | system           | mimalloc         | jemalloc         |
|-----------------|----------------------|------------------|------------------|------------------|
| 64 B, 2 threads | **108 microseconds** | 658 microseconds | 146 microseconds | 128 microseconds |
| 64 B, 4 threads | **152 microseconds** | 1.1 ms           | 177 microseconds | 159 microseconds |
| 1 KB, 8 threads | **208 microseconds** | 2.36 ms          | 299 microseconds | 232 microseconds |
| 4 KB, 2 threads | **106 microseconds** | 305 microseconds | 243 microseconds | 151 microseconds |
| 4 KB, 8 threads | **239 microseconds** | 469 microseconds | 373 microseconds | 254 microseconds |

numalloc scales well with thread count: per-thread freelists need no synchronization at all, and per-node freelists are touched only in batched, lock-guarded transfers.

### Bulk alloc+free (1000 items, single-threaded)

| Size   | numalloc               | system            | mimalloc             | jemalloc          |
|--------|------------------------|-------------------|----------------------|-------------------|
| 64 B   | **8.2 microseconds**   | 16.8 microseconds | 8.0 microseconds     | 17.1 microseconds |
| 4 KB   | **24.3 microseconds**  | 88.1 microseconds | 24.4 microseconds    | 46.5 microseconds |
| 64 KB  | **57.2 microseconds**  | 709 microseconds  | 43.5 microseconds    | 508 microseconds  |
| 256 KB | **37.4 microseconds**  | 704 microseconds  | 155 microseconds     | 545 microseconds  |

### NUMA-specific results (8-node Intel Xeon, 128 cores)

| Metric                     | NUMAlloc vs. glibc | NUMAlloc vs. mimalloc |
|----------------------------|--------------------|-----------------------|
| Average speedup            | 20.9% faster       | 15.7% faster          |
| Best case (fluidanimate)   | 5.3x faster        | 4.6x faster           |
| Remote accesses            | 9x fewer           | 8x fewer              |
| TLB misses                 | 18x fewer          | --                    |
| Scalability at 128 threads | 88x speedup        | 75x (mimalloc)        |

---

## 13. Limitations and Trade-offs

1. **Higher memory consumption with THP enabled** — The large initial `mmap` triggers huge-page backing. NUMAlloc uses 17.6% more memory than glibc (but 4.5× less than Scalloc). Could be improved with TEMERAIRE-style page management.

2. **Thread binding is mandatory** — NUMAlloc pins threads to nodes. This doesn't conflict with OS scheduling (threads can still move between cores within a node), but it does prevent cross-node migration. For server workloads with thread pools, the node-interleaved binding works well.

3. **Remote deallocation bottleneck** — When many threads free objects originating from the same remote node, the per-node freelist lock can become contended. This is visible in the `larson` benchmark where mimalloc edges ahead at 128 threads.

4. **No support for heterogeneous memory** — NUMAlloc assumes all nodes have equal memory capacity and divides the heap evenly.

---

## 14. Comparison with Other Allocators

```mermaid
graph TD
    subgraph "NUMA-unaware"
        GLIBC["glibc<br/>Sequential allocator<br/>No NUMA support"]
        JEMALLOC["jemalloc<br/>BiBOP-style<br/>No NUMA awareness"]
        HOARD["Hoard<br/>Per-thread heaps<br/>No NUMA awareness"]
    end

    subgraph "Partial NUMA support"
        TCNUMA["TCMalloc-NUMA<br/>Per-node freelists<br/>Many mbind calls<br/>No metadata locality"]
        MIMALLOC["mimalloc<br/>Per-page freelists<br/>Segment-level NUMA<br/>No thread binding"]
        TBB["Intel TBB<br/>NUMA-optimized scheduler<br/>Separate allocator"]
    end

    subgraph "Full NUMA design"
        NUMALLOC["NUMAlloc<br/>Thread + memory binding<br/>Origin-aware dealloc<br/>Incremental THP sharing<br/>Local metadata"]
    end

    GLIBC -.->|"adds node freelists"| TCNUMA
    TCNUMA -.->|"adds binding +<br/>origin tracking"| NUMALLOC
    MIMALLOC -.->|"adds thread binding +<br/>THP sharing"| NUMALLOC
```

---

## 15. When to Use NUMAlloc

**Good fit:**
- Multi-threaded server applications on NUMA hardware (2+ sockets)
- Applications with many small allocations and high thread counts
- Workloads where threads primarily use their own allocated memory
- Environments where THP is enabled

**Poor fit:**
- Single-threaded applications (no NUMA benefit)
- Applications that intentionally share objects across threads on different nodes (producer-consumer across sockets)
- Systems with asymmetric NUMA topologies or heterogeneous memory

---

*Based on: Yang et al., "NUMAlloc: A Faster NUMA Memory Allocator," ISMM 2023, ACM.*