Module allocator

Caching CUDA memory allocator.

CudaAllocator wraps a GpuDevice and provides a PyTorch-style caching memory allocator with:

  • Block splitting: oversized free blocks are split, remainder returned to the pool.
  • Block coalescing: adjacent freed blocks are merged to reduce fragmentation.
  • Stream-aware reuse: blocks track which CUDA streams have used them; a block is only reused when all recorded stream work is complete.
  • Dual pools: small (<=1 MiB) and large (>1 MiB) allocations are kept in separate pools so that small allocations do not fragment large contiguous regions.
  • Statistics: memory_allocated, max_memory_allocated, memory_reserved, allocation/free counters.
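The splitting and coalescing behaviour listed above can be illustrated with a small CPU-side sketch. This is not the crate's implementation: `FreeBlock`, `FreeList`, and both method names are hypothetical, addresses are plain `usize` values, and the first-fit search is chosen only to keep the example short.

```rust
use std::collections::BTreeMap;

/// Hypothetical block metadata: a byte range in device address space.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct FreeBlock {
    addr: usize,
    size: usize,
}

/// Hypothetical free list keyed by start address, so neighbours are easy to find.
#[derive(Default)]
struct FreeList {
    by_addr: BTreeMap<usize, usize>, // start address -> size in bytes
}

impl FreeList {
    /// Return a block to the free list, merging it with adjacent free blocks.
    fn insert_coalescing(&mut self, mut block: FreeBlock) {
        // Merge with the predecessor if it ends exactly where `block` starts.
        let prev = self
            .by_addr
            .range(..block.addr)
            .next_back()
            .map(|(&a, &s)| (a, s));
        if let Some((prev_addr, prev_size)) = prev {
            if prev_addr + prev_size == block.addr {
                self.by_addr.remove(&prev_addr);
                block = FreeBlock { addr: prev_addr, size: prev_size + block.size };
            }
        }
        // Merge with the successor if it starts exactly where `block` ends.
        let end = block.addr + block.size;
        if let Some(next_size) = self.by_addr.remove(&end) {
            block.size += next_size;
        }
        self.by_addr.insert(block.addr, block.size);
    }

    /// Take the first free block that fits; the unused tail goes back to the pool.
    fn take_split(&mut self, size: usize) -> Option<FreeBlock> {
        let (addr, avail) = self
            .by_addr
            .iter()
            .find(|entry| *entry.1 >= size)
            .map(|(&a, &s)| (a, s))?;
        self.by_addr.remove(&addr);
        if avail > size {
            self.by_addr.insert(addr + size, avail - size);
        }
        Some(FreeBlock { addr, size })
    }
}
```

Freeing the ranges `[0, 512)`, `[1024, 1536)`, and then `[512, 1024)` leaves a single 1536-byte free block, and a later 1000-byte request splits it and keeps the 536-byte remainder in the list.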

§Design

This is a CPU-side data structure that manages block metadata. Actual GPU memory allocation and deallocation is delegated to the GpuDevice (cudarc). The caching layer sits between callers and the driver: it intercepts frees to retain memory for reuse and serves allocations from the cache when possible.
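A minimal sketch of that interception pattern, under stated assumptions: `FakeDriver` stands in for the real GpuDevice, `CachingLayer` and its methods are hypothetical names, blocks are `(size, addr)` tuples, and splitting and stream tracking are left out for brevity.

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for the driver; it only hands out fake addresses.
#[derive(Default)]
struct FakeDriver {
    next_addr: usize,
    calls: usize,
}

impl FakeDriver {
    fn alloc(&mut self, size: usize) -> usize {
        let addr = self.next_addr;
        self.next_addr += size;
        self.calls += 1;
        addr
    }
}

/// Hypothetical caching layer: free blocks ordered by (size, addr) for best-fit lookup.
#[derive(Default)]
struct CachingLayer {
    driver: FakeDriver,
    cached: BTreeSet<(usize, usize)>,
}

impl CachingLayer {
    /// Serve from the cache when a large-enough block exists, otherwise call the driver.
    fn alloc(&mut self, size: usize) -> (usize, usize) {
        // Smallest cached block whose size is at least `size`.
        let hit = self.cached.range((size, 0)..).next().copied();
        match hit {
            Some(block) => {
                self.cached.remove(&block);
                block
            }
            None => (size, self.driver.alloc(size)),
        }
    }

    /// Intercept the free: keep the block for reuse instead of releasing it.
    fn free(&mut self, block: (usize, usize)) {
        self.cached.insert(block);
    }
}
```

In this sketch a block freed and then requested again is returned without a second driver call, which is the effect the caching layer is meant to achieve.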

The design follows PyTorch’s CUDACachingAllocator (c10/cuda/). Key constants match PyTorch:

  • MIN_BLOCK_SIZE = 512 bytes
  • SMALL_SIZE = 1 MiB (threshold between small/large pools)
  • SMALL_BUFFER = 2 MiB (small pool segment size)
  • MIN_LARGE_ALLOC = 10 MiB
  • ROUND_LARGE = 2 MiB (rounding for large allocations)

§Thread safety

CudaAllocator is Send + Sync. Internal state is protected by a Mutex. The critical section is short (BTreeSet lookup + pointer bookkeeping).
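The locking pattern described here can be sketched as follows. `SharedAllocator`, `State`, and the helper functions are hypothetical and track only a byte counter; the point is that wrapping the mutable state in a `Mutex` makes the outer type `Send + Sync` and keeps each critical section to a few bookkeeping operations.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical internal state; the real allocator also holds block pools.
#[derive(Default)]
struct State {
    allocated_bytes: usize,
}

/// Hypothetical allocator shell: all mutation goes through the Mutex.
#[derive(Default)]
struct SharedAllocator {
    state: Mutex<State>,
}

impl SharedAllocator {
    /// Short critical section: lock, update the counter, unlock on drop.
    fn record_alloc(&self, size: usize) {
        let mut state = self.state.lock().unwrap();
        state.allocated_bytes += size;
    }

    fn memory_allocated(&self) -> usize {
        self.state.lock().unwrap().allocated_bytes
    }
}

/// Compiles only if `T` is both Send and Sync.
fn assert_send_sync<T: Send + Sync>() {}

/// Record `per_thread` 512-byte allocations from each of `threads` threads.
fn hammer(threads: usize, per_thread: usize) -> usize {
    let alloc = Arc::new(SharedAllocator::default());
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let alloc = Arc::clone(&alloc);
            thread::spawn(move || {
                for _ in 0..per_thread {
                    alloc.record_alloc(512);
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    alloc.memory_allocated()
}
```

Calling `assert_send_sync::<SharedAllocator>()` checks the bounds at compile time, and `hammer` shows that concurrent updates through the lock are not lost.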

§CL-323

Structs§

Block
Metadata for a contiguous region of GPU memory.
CudaAllocator
A caching GPU memory allocator with block pools, splitting, coalescing, and stream-aware reuse.
StreamId
Opaque identifier for a CUDA stream.

Constants§

LARGE_BUFFER
Large pool segment size (20 MiB), used for allocations between 1 MiB and 10 MiB.
MIN_BLOCK_SIZE
Minimum block size — all allocations are rounded up to at least this.
MIN_LARGE_ALLOC
Large-pool threshold (10 MiB): allocations between SMALL_SIZE and MIN_LARGE_ALLOC are served from a 20 MiB LARGE_BUFFER segment, which reduces the number of driver calls.
ROUND_LARGE
Round up large allocations to this granularity.
SMALL_BUFFER
Segment size for small pool allocations from the driver.
SMALL_SIZE
Largest allocation that goes into the small pool.

Functions§

get_allocation_size
Determine how many bytes to request from the driver for a given request size (after rounding). Small allocations are packed into SMALL_BUFFER segments; mid-range into LARGE_BUFFER; large are rounded to ROUND_LARGE.
round_size
Round size up to an allocation-friendly boundary.
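The two sizing functions can be sketched from the constants documented on this page. The `usize -> usize` signatures and the exact boundary comparisons (`<=` versus `<`) are assumptions modelled on PyTorch's CUDACachingAllocator, not taken from this crate's source.

```rust
// Values as documented on this page.
const MIN_BLOCK_SIZE: usize = 512;
const SMALL_SIZE: usize = 1 << 20; // 1 MiB
const SMALL_BUFFER: usize = 2 << 20; // 2 MiB
const LARGE_BUFFER: usize = 20 << 20; // 20 MiB
const MIN_LARGE_ALLOC: usize = 10 << 20; // 10 MiB
const ROUND_LARGE: usize = 2 << 20; // 2 MiB

/// Round `value` up to the next multiple of `multiple` (which must be non-zero).
fn round_up(value: usize, multiple: usize) -> usize {
    (value + multiple - 1) / multiple * multiple
}

/// Assumed behaviour: at least MIN_BLOCK_SIZE, otherwise a multiple of it.
fn round_size(size: usize) -> usize {
    if size < MIN_BLOCK_SIZE {
        MIN_BLOCK_SIZE
    } else {
        round_up(size, MIN_BLOCK_SIZE)
    }
}

/// Assumed behaviour: bytes to request from the driver for an already-rounded size.
fn get_allocation_size(size: usize) -> usize {
    if size <= SMALL_SIZE {
        SMALL_BUFFER // small requests are packed into 2 MiB segments
    } else if size < MIN_LARGE_ALLOC {
        LARGE_BUFFER // mid-range requests share a 20 MiB segment
    } else {
        round_up(size, ROUND_LARGE) // large requests are rounded to 2 MiB
    }
}
```

Under these assumptions a 513-byte request rounds to 1024 bytes and is served from a 2 MiB segment, a 5 MiB request triggers a 20 MiB driver allocation, and an 11 MiB request is rounded up to 12 MiB.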