Caching CUDA memory allocator.
CudaAllocator wraps a GpuDevice and provides a PyTorch-style caching
memory allocator with:
- Block splitting: oversized free blocks are split, remainder returned to the pool.
- Block coalescing: adjacent freed blocks are merged to reduce fragmentation.
- Stream-aware reuse: blocks track which CUDA streams have used them; a block is only reused when all recorded stream work is complete.
- Dual pools: small (<1 MiB) and large (>=1 MiB) allocations are kept in separate pools to avoid small allocations fragmenting large contiguous regions.
- Statistics: memory_allocated, max_memory_allocated, memory_reserved, and allocation/free counters.
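Block splitting and coalescing, as listed above, can be illustrated with a minimal sketch. It is not the crate's implementation: FreeList, alloc, and release are hypothetical names, blocks are identified by a plain byte offset within a single segment, and the first-fit scan is chosen for brevity rather than taken from the source.

```rust
use std::collections::BTreeMap;

/// Hypothetical free list for one segment: block offset -> block size in bytes.
#[derive(Default)]
struct FreeList {
    blocks: BTreeMap<usize, usize>,
}

impl FreeList {
    /// Take `size` bytes from the first free block that is large enough.
    /// If the block is oversized, the remainder goes back into the pool.
    fn alloc(&mut self, size: usize) -> Option<usize> {
        // Linear first-fit scan, for brevity only.
        let (offset, block_size) = self
            .blocks
            .iter()
            .map(|(&o, &s)| (o, s))
            .find(|&(_, s)| s >= size)?;
        self.blocks.remove(&offset);
        if block_size > size {
            // Split: the tail of the block stays free.
            self.blocks.insert(offset + size, block_size - size);
        }
        Some(offset)
    }

    /// Return a block to the pool, merging it with adjacent free blocks.
    fn release(&mut self, mut offset: usize, mut size: usize) {
        // Merge with the next block if it starts exactly where this one ends.
        if let Some(next_size) = self.blocks.remove(&(offset + size)) {
            size += next_size;
        }
        // Merge with the previous block if it ends exactly where this one starts.
        let prev = self
            .blocks
            .range(..offset)
            .next_back()
            .map(|(&o, &s)| (o, s));
        if let Some((prev_offset, prev_size)) = prev {
            if prev_offset + prev_size == offset {
                self.blocks.remove(&prev_offset);
                offset = prev_offset;
                size += prev_size;
            }
        }
        self.blocks.insert(offset, size);
    }
}
```

Keying the map by offset makes both neighbours of a freed block cheap to find, which is what keeps the merge step short.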
§Design
This is a CPU-side data structure that manages block metadata. Actual GPU
memory allocation/deallocation is delegated to the GpuDevice (cudarc).
The caching layer sits between callers and the driver, intercepting frees
to retain memory for reuse and serving allocs from the cache when possible.
The design follows PyTorch’s CUDACachingAllocator (c10/cuda/). Key
constants match PyTorch:
- MIN_BLOCK_SIZE = 512 bytes
- SMALL_SIZE = 1 MiB (threshold between small/large pools)
- SMALL_BUFFER = 2 MiB (small pool segment size)
- MIN_LARGE_ALLOC = 10 MiB
- ROUND_LARGE = 2 MiB (rounding for large allocations)
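The caching behaviour described in this section (frees are retained, allocations are served from the cache when possible) can be sketched as follows. This is an illustration under stated assumptions, not the crate's code: FakeDriver stands in for the GpuDevice and only hands out made-up addresses, and Cache, alloc, and release are hypothetical names. Splitting, pools, and stream tracking are left out.

```rust
use std::collections::BTreeSet;

/// Stand-in for the driver: returns fake addresses and counts calls.
#[derive(Default)]
struct FakeDriver {
    next_addr: usize,
    alloc_calls: usize,
}

impl FakeDriver {
    fn alloc(&mut self, size: usize) -> usize {
        let addr = self.next_addr;
        self.next_addr += size;
        self.alloc_calls += 1;
        addr
    }
}

/// Hypothetical caching layer. A block is a (size, address) pair.
#[derive(Default)]
struct Cache {
    driver: FakeDriver,
    /// Free blocks ordered by (size, address), so best-fit is a range query.
    cached: BTreeSet<(usize, usize)>,
}

impl Cache {
    /// Serve from the cache when a large-enough block exists,
    /// otherwise fall through to the driver.
    fn alloc(&mut self, size: usize) -> (usize, usize) {
        // Smallest cached block whose size is >= the request.
        if let Some(&hit) = self.cached.range((size, 0)..).next() {
            self.cached.remove(&hit);
            return hit;
        }
        (size, self.driver.alloc(size))
    }

    /// Intercept the free: keep the block for reuse instead of
    /// handing it back to the driver.
    fn release(&mut self, block: (usize, usize)) {
        self.cached.insert(block);
    }
}
```

In this sketch a second request after a release reuses the cached block, so the driver call counter does not increase.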
§Thread safety
CudaAllocator is Send + Sync. Internal state is protected by a Mutex.
The critical section is short (BTreeSet lookup + pointer bookkeeping).
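A minimal sketch of the locking pattern, assuming all mutable state sits behind a single std::sync::Mutex. SharedAllocator, Stats, and the method names are hypothetical and track only a subset of the statistics; the real CudaAllocator fields are not shown in this documentation.

```rust
use std::sync::Mutex;

/// Hypothetical statistics record (subset of the counters the module exposes).
#[derive(Default, Clone, Copy, Debug, PartialEq)]
struct Stats {
    memory_allocated: usize,
    max_memory_allocated: usize,
    num_allocs: u64,
    num_frees: u64,
}

/// Hypothetical allocator shell: one Mutex guards all bookkeeping.
#[derive(Default)]
struct SharedAllocator {
    state: Mutex<Stats>,
}

impl SharedAllocator {
    fn record_alloc(&self, bytes: usize) {
        // The lock is held only for the bookkeeping below.
        let mut s = self.state.lock().unwrap();
        s.memory_allocated += bytes;
        s.max_memory_allocated = s.max_memory_allocated.max(s.memory_allocated);
        s.num_allocs += 1;
    }

    fn record_free(&self, bytes: usize) {
        let mut s = self.state.lock().unwrap();
        s.memory_allocated = s.memory_allocated.saturating_sub(bytes);
        s.num_frees += 1;
    }

    /// Copy the counters out so callers never hold the lock.
    fn stats(&self) -> Stats {
        *self.state.lock().unwrap()
    }
}

/// Compiles only if `T` can be shared and sent across threads.
fn assert_send_sync<T: Send + Sync>() {}
```

Calling assert_send_sync::<SharedAllocator>() turns the Send + Sync claim into a compile-time check, which is a cheap way to guard it against regressions.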
§CL-323
Structs§
- Block - Metadata for a contiguous region of GPU memory.
- CudaAllocator - A caching GPU memory allocator with block pools, splitting, coalescing, and stream-aware reuse.
- StreamId - Opaque identifier for a CUDA stream.
Constants§
- LARGE_BUFFER - Large pool segment size for allocations between 1 and 10 MiB.
- MIN_BLOCK_SIZE - Minimum block size; all allocations are rounded up to at least this.
- MIN_LARGE_ALLOC - Allocations between SMALL_SIZE and MIN_LARGE_ALLOC use a 20 MiB segment from the driver (to reduce the number of driver calls).
- ROUND_LARGE - Round up large allocations to this granularity.
- SMALL_BUFFER - Segment size for small pool allocations from the driver.
- SMALL_SIZE - Largest allocation that goes into the small pool.
Functions§
- get_allocation_size - Determine how many bytes to request from the driver for a given request size (after rounding). Small allocations are packed into SMALL_BUFFER segments; mid-range into LARGE_BUFFER; large are rounded to ROUND_LARGE.
- round_size - Round size up to an allocation-friendly boundary.
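The two functions can be sketched from the descriptions above. This is a hedged re-implementation, not the crate's source: the signatures are assumptions, LARGE_BUFFER = 20 MiB is inferred from the MIN_LARGE_ALLOC description, rounding to a multiple of MIN_BLOCK_SIZE is an assumption about what "allocation-friendly" means, and a request of exactly SMALL_SIZE is treated as small because SMALL_SIZE is described as the largest small-pool allocation.

```rust
const MIB: usize = 1024 * 1024;

const MIN_BLOCK_SIZE: usize = 512;
const SMALL_SIZE: usize = MIB;
const SMALL_BUFFER: usize = 2 * MIB;
const LARGE_BUFFER: usize = 20 * MIB; // assumed from the "20 MiB segment" description
const MIN_LARGE_ALLOC: usize = 10 * MIB;
const ROUND_LARGE: usize = 2 * MIB;

/// Round `size` up to a multiple of MIN_BLOCK_SIZE, with MIN_BLOCK_SIZE as the floor.
fn round_size(size: usize) -> usize {
    if size < MIN_BLOCK_SIZE {
        MIN_BLOCK_SIZE
    } else {
        (size + MIN_BLOCK_SIZE - 1) / MIN_BLOCK_SIZE * MIN_BLOCK_SIZE
    }
}

/// Bytes to request from the driver for an already-rounded request size.
fn get_allocation_size(size: usize) -> usize {
    if size <= SMALL_SIZE {
        // Small requests share a fixed-size small-pool segment.
        SMALL_BUFFER
    } else if size < MIN_LARGE_ALLOC {
        // Mid-range requests share a larger fixed-size segment.
        LARGE_BUFFER
    } else {
        // Large requests get their own segment, rounded up to ROUND_LARGE.
        (size + ROUND_LARGE - 1) / ROUND_LARGE * ROUND_LARGE
    }
}
```

Under these assumptions a 513-byte request rounds to 1024 bytes and is carved out of a 2 MiB segment, while an 11 MiB request gets a dedicated 12 MiB segment.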