Module allocator

Caching CUDA memory allocator.

CudaAllocator wraps a GpuDevice and provides a PyTorch-style caching memory allocator with:

Block splitting: oversized free blocks are split, remainder returned to the pool.
Block coalescing: adjacent freed blocks are merged to reduce fragmentation.
Stream-aware reuse: blocks track which CUDA streams have used them; a block is only reused when all recorded stream work is complete.
Dual pools: small (<= 1 MiB) and large (> 1 MiB) allocations are kept in separate pools so that small allocations do not fragment large contiguous regions.
Statistics: memory_allocated, max_memory_allocated, memory_reserved, allocation/free counters.
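
The splitting and coalescing behaviour listed above can be illustrated with a minimal sketch. FreeList and its methods are hypothetical names invented for this example (they are not part of this module), and the sketch tracks plain byte offsets inside a single segment rather than real device pointers.

```rust
use std::collections::BTreeMap;

/// Hypothetical free list for one segment: start offset -> length in bytes.
/// Illustrative only; this is not the module's `Block` type.
#[derive(Debug, Default)]
pub struct FreeList {
    free: BTreeMap<usize, usize>,
}

impl FreeList {
    /// Create a free list covering one fresh segment of `len` bytes.
    pub fn with_segment(len: usize) -> Self {
        let mut free = BTreeMap::new();
        free.insert(0, len);
        Self { free }
    }

    /// First-fit allocation with splitting: the unused tail of the chosen
    /// block is returned to the free list immediately.
    pub fn alloc(&mut self, size: usize) -> Option<usize> {
        let (start, len) = self
            .free
            .iter()
            .find(|(_, len)| **len >= size)
            .map(|(s, l)| (*s, *l))?;
        self.free.remove(&start);
        if len > size {
            self.free.insert(start + size, len - size);
        }
        Some(start)
    }

    /// Free with coalescing: merge with the adjacent free block on either side.
    pub fn free(&mut self, mut start: usize, mut len: usize) {
        // Merge with the following block if it begins where this one ends.
        if let Some(next_len) = self.free.remove(&(start + len)) {
            len += next_len;
        }
        // Merge with the preceding block if it ends where this one begins.
        let prev = self.free.range(..start).next_back().map(|(&s, &l)| (s, l));
        if let Some((prev_start, prev_len)) = prev {
            if prev_start + prev_len == start {
                self.free.remove(&prev_start);
                start = prev_start;
                len += prev_len;
            }
        }
        self.free.insert(start, len);
    }

    /// Snapshot of the free blocks as (start, len) pairs, ordered by start.
    pub fn blocks(&self) -> Vec<(usize, usize)> {
        self.free.iter().map(|(&s, &l)| (s, l)).collect()
    }
}
```

Allocating two 256-byte blocks from a 1024-byte segment and freeing both should leave a single 1024-byte free block again, which is the fragmentation-reducing effect coalescing is meant to have.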

§Design

The allocator is a CPU-side data structure that manages block metadata. Actual GPU memory allocation and deallocation are delegated to the GpuDevice (cudarc). The caching layer sits between callers and the driver: it intercepts frees to retain memory for reuse, and it serves allocations from the cache when possible.
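
To make the layering concrete, here is a minimal sketch of a cache that sits in front of a driver. The Driver trait, CountingDriver, and CachingLayer are all hypothetical names for this example; the real module delegates to GpuDevice, whose API is not shown here. The sketch omits splitting, pools, and streams and only shows that a free is retained and a later allocation is served without a driver call.

```rust
use std::collections::{BTreeSet, HashMap};

/// Hypothetical stand-in for the driver side of the allocator.
pub trait Driver {
    fn raw_alloc(&mut self, size: usize) -> u64;
    fn raw_free(&mut self, ptr: u64);
}

/// Fake driver that hands out increasing addresses and counts calls.
#[derive(Default)]
pub struct CountingDriver {
    next: u64,
    pub allocs: usize,
    pub frees: usize,
}

impl Driver for CountingDriver {
    fn raw_alloc(&mut self, size: usize) -> u64 {
        self.allocs += 1;
        let ptr = self.next;
        self.next += size as u64;
        ptr
    }

    fn raw_free(&mut self, _ptr: u64) {
        self.frees += 1;
    }
}

/// Hypothetical caching layer: frees are kept in `free` instead of being
/// returned to the driver.
pub struct CachingLayer<D: Driver> {
    pub driver: D,
    free: BTreeSet<(usize, u64)>, // (size, ptr), ordered by size for best fit
    live: HashMap<u64, usize>,    // ptr -> size of blocks currently handed out
}

impl<D: Driver> CachingLayer<D> {
    pub fn new(driver: D) -> Self {
        Self { driver, free: BTreeSet::new(), live: HashMap::new() }
    }

    /// Serve from the cache when a large-enough block exists, else ask the driver.
    pub fn alloc(&mut self, size: usize) -> u64 {
        let hit = self.free.range((size, 0)..).next().copied();
        let (block_size, ptr) = match hit {
            Some(entry) => {
                self.free.remove(&entry);
                entry
            }
            None => (size, self.driver.raw_alloc(size)),
        };
        self.live.insert(ptr, block_size);
        ptr
    }

    /// Intercept the free: the block goes back to the cache, not the driver.
    pub fn free(&mut self, ptr: u64) {
        if let Some(size) = self.live.remove(&ptr) {
            self.free.insert((size, ptr));
        }
    }

    /// Return every cached block to the driver.
    pub fn empty_cache(&mut self) {
        for (_, ptr) in std::mem::take(&mut self.free) {
            self.driver.raw_free(ptr);
        }
    }
}
```

In this sketch, an alloc/free/alloc sequence of the same size reaches the fake driver only once; the driver sees a free only when the cache is explicitly emptied.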

The design follows PyTorch’s CUDACachingAllocator (c10/cuda/). Key constants match PyTorch:

MIN_BLOCK_SIZE = 512 bytes
SMALL_SIZE = 1 MiB (threshold between small/large pools)
SMALL_BUFFER = 2 MiB (small pool segment size)
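LARGE_BUFFER = 20 MiB (large pool segment size for allocations between 1 and 10 MiB)
MIN_LARGE_ALLOC = 10 MiB (threshold above which allocations are rounded to ROUND_LARGE rather than served from a LARGE_BUFFER segment)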
ROUND_LARGE = 2 MiB (rounding for large allocations)

§Thread safety

CudaAllocator is Send + Sync. Internal state is protected by a Mutex. The critical section is short (BTreeSet lookup + pointer bookkeeping).
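
The locking pattern described here can be sketched as follows. SharedPool, PoolState, and the helper functions are hypothetical names for this example and do not reflect the actual fields of CudaAllocator; the point is only that keeping all mutable state behind one Mutex lets methods take &self and makes the type Send + Sync, with a critical section limited to a BTreeSet lookup and an insert or remove.

```rust
use std::collections::BTreeSet;
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical inner state: free blocks ordered by (size, address).
#[derive(Default)]
struct PoolState {
    free: BTreeSet<(usize, u64)>,
}

/// Hypothetical allocator shell: all mutable state lives behind one Mutex.
#[derive(Default)]
pub struct SharedPool {
    state: Mutex<PoolState>,
}

impl SharedPool {
    /// Return a block to the pool; the lock is held for a single insert.
    pub fn release(&self, size: usize, ptr: u64) {
        self.state.lock().unwrap().free.insert((size, ptr));
    }

    /// Take the smallest cached block of at least `size` bytes, if any.
    pub fn acquire(&self, size: usize) -> Option<u64> {
        let mut state = self.state.lock().unwrap();
        let entry = state.free.range((size, 0)..).next().copied()?;
        state.free.remove(&entry);
        Some(entry.1)
    }

    pub fn cached_blocks(&self) -> usize {
        self.state.lock().unwrap().free.len()
    }
}

/// Compile-time check that a type can be shared across threads.
pub fn assert_send_sync<T: Send + Sync>() {}

/// Release `n` blocks of 512 bytes, one per thread, into the shared pool.
pub fn release_from_threads(pool: &Arc<SharedPool>, n: u64) {
    let handles: Vec<_> = (0..n)
        .map(|i| {
            let pool = Arc::clone(pool);
            thread::spawn(move || pool.release(512, i * 512))
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
}
```

Calling assert_send_sync::<SharedPool>() compiles only if the type is Send + Sync, which is a cheap way to keep that guarantee from regressing.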

§CL-323

Structs§

Block: Metadata for a contiguous region of GPU memory.
CudaAllocator: A caching GPU memory allocator with block pools, splitting, coalescing, and stream-aware reuse.
StreamId: Opaque identifier for a CUDA stream.
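
As a rough illustration of how block metadata might carry stream usage, the sketch below defines stand-in Block and StreamId types. The field names, record_stream, and try_reclaim are assumptions made for this example and may not match the real structs. The is_idle callback is a placeholder for whatever mechanism the allocator uses to learn that a stream's recorded work has finished (PyTorch uses CUDA events for this).

```rust
use std::collections::HashSet;

/// Stand-in for the module's opaque stream identifier (layout assumed).
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct StreamId(pub u64);

/// Stand-in block metadata; the fields are assumptions for illustration.
#[derive(Debug)]
pub struct Block {
    pub ptr: u64,
    pub size: usize,
    /// Stream the block was allocated on.
    pub owner: StreamId,
    /// Other streams that have used the block since it was allocated.
    pub stream_uses: HashSet<StreamId>,
}

impl Block {
    pub fn new(ptr: u64, size: usize, owner: StreamId) -> Self {
        Self { ptr, size, owner, stream_uses: HashSet::new() }
    }

    /// Note that `stream` has touched this block. Uses on the owning
    /// stream are already ordered, so they are not recorded.
    pub fn record_stream(&mut self, stream: StreamId) {
        if stream != self.owner {
            self.stream_uses.insert(stream);
        }
    }

    /// Drop every recorded stream whose work is finished according to
    /// `is_idle`, and report whether the block may now be reused.
    pub fn try_reclaim(&mut self, is_idle: impl Fn(StreamId) -> bool) -> bool {
        self.stream_uses.retain(|s| !is_idle(*s));
        self.stream_uses.is_empty()
    }
}
```

A freed block with pending uses on other streams would stay out of the pool until try_reclaim returns true.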

Constants§

LARGE_BUFFER: Large pool segment size for allocations between 1 and 10 MiB.
MIN_BLOCK_SIZE: Minimum block size — all allocations are rounded up to at least this.
MIN_LARGE_ALLOC: Allocations between SMALL_SIZE and MIN_LARGE_ALLOC are served from a LARGE_BUFFER (20 MiB) segment requested from the driver, which reduces the number of driver calls.
ROUND_LARGE: Round up large allocations to this granularity.
SMALL_BUFFER: Segment size for small pool allocations from the driver.
SMALL_SIZE: Largest allocation that goes into the small pool.

Functions§

get_allocation_size: Determine how many bytes to request from the driver for a given request size (after rounding). Small allocations are packed into SMALL_BUFFER segments; mid-range into LARGE_BUFFER; large are rounded to ROUND_LARGE.
round_size: Round size up to an allocation-friendly boundary.
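
The sizing rules can be sketched as below. The function bodies follow the logic of PyTorch's CUDACachingAllocator and are an assumption about how this module behaves, not a copy of its source; in particular, the exact boundary comparisons (<= SMALL_SIZE, < MIN_LARGE_ALLOC) and the usize signatures are taken from PyTorch's convention. The round_up helper is hypothetical.

```rust
pub const MIN_BLOCK_SIZE: usize = 512;
pub const SMALL_SIZE: usize = 1 << 20; // 1 MiB
pub const SMALL_BUFFER: usize = 2 << 20; // 2 MiB
pub const LARGE_BUFFER: usize = 20 << 20; // 20 MiB
pub const MIN_LARGE_ALLOC: usize = 10 << 20; // 10 MiB
pub const ROUND_LARGE: usize = 2 << 20; // 2 MiB

/// Hypothetical helper: round `size` up to the next multiple of `multiple`.
fn round_up(size: usize, multiple: usize) -> usize {
    (size + multiple - 1) / multiple * multiple
}

/// Round a request up to a multiple of MIN_BLOCK_SIZE, with
/// MIN_BLOCK_SIZE as the floor (assumed to match PyTorch's rule).
pub fn round_size(size: usize) -> usize {
    if size < MIN_BLOCK_SIZE {
        MIN_BLOCK_SIZE
    } else {
        round_up(size, MIN_BLOCK_SIZE)
    }
}

/// Bytes to request from the driver for an already-rounded size.
pub fn get_allocation_size(size: usize) -> usize {
    if size <= SMALL_SIZE {
        // Small requests are packed into a shared small-pool segment.
        SMALL_BUFFER
    } else if size < MIN_LARGE_ALLOC {
        // Mid-range requests share one large segment.
        LARGE_BUFFER
    } else {
        // Large requests get their own segment, rounded to ROUND_LARGE.
        round_up(size, ROUND_LARGE)
    }
}
```

Under these assumptions, a 1000-byte request is rounded to 1024 bytes and carved out of a 2 MiB segment, a 5 MiB request is carved out of a 20 MiB segment, and an 11 MiB request causes a 12 MiB driver allocation.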