§candle-cuda-vmm
CUDA Virtual Memory Management bindings for elastic KV cache allocation in Candle.
This crate provides safe Rust bindings to CUDA’s Virtual Memory Management (VMM) APIs, enabling elastic memory allocation for LLM inference workloads. It integrates with the Candle deep learning framework and supports:
- Elastic KV Cache Allocation: Allocate memory on-demand rather than pre-allocating large static buffers
- Multi-Model Serving: Share GPU memory pools across multiple models with dynamic allocation
- Reduced TTFT: Faster time-to-first-token (1.2-28×) in multi-model scenarios vs static allocation
- Memory Efficiency: Optimal memory usage for bursty multi-tenant workloads
§Architecture
The crate is organized into several modules:
- error: Error types for VMM operations
- cuda_ffi: Low-level CUDA VMM FFI bindings
- physical_memory: Physical GPU memory allocation with RAII
- mapping: Virtual address space reservation and mapping operations
- virtual_memory: High-level elastic memory pool abstractions
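To make the layering concrete, the sketch below composes the lower layers by hand: reserve virtual address space, allocate a physical page, map it, and grant device access. The names come from this crate's re-exports, but every signature here (VirtualAddressRange::reserve, PhysicalMemoryHandle::new, start(), and the argument order of map_memory and set_memory_access) is an assumption for illustration, not verified API:

use candle_cuda_vmm::{
    AccessFlags, PhysicalMemoryHandle, Result, VirtualAddressRange,
    map_memory, set_memory_access,
};

fn map_one_page(device_id: i32) -> Result<()> {
    const PAGE: usize = 2 * 1024 * 1024;

    // Reserve a virtual address range; no physical memory is committed yet.
    // Hypothetical signature: reserve(size, device_id).
    let range = VirtualAddressRange::reserve(PAGE, device_id)?;

    // Allocate one physical page (RAII: freed when the handle drops).
    // Hypothetical signature: new(size, device_id).
    let page = PhysicalMemoryHandle::new(PAGE, device_id)?;

    // Map the physical page into the reserved range, then grant the device
    // read/write access. start() is assumed to return the base address.
    map_memory(range.start(), PAGE, &page)?;
    set_memory_access(range.start(), PAGE, device_id, AccessFlags::ReadWrite)?;

    Ok(())
}

The VirtualMemoryPool used in the Quick Start below wraps exactly this sequence behind a single allocate call.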
§Quick Start
use candle_cuda_vmm::{VirtualMemoryPool, Result};
use candle_core::Device;

fn main() -> Result<()> {
    let device = Device::new_cuda(0)?;

    // Create a pool with 128GB virtual capacity, 2MB pages
    let mut pool = VirtualMemoryPool::new(
        128 * 1024 * 1024 * 1024, // 128GB virtual
        2 * 1024 * 1024,          // 2MB pages
        device,
    )?;

    // Allocate 1GB of physical memory on-demand
    let addr = pool.allocate(0, 1024 * 1024 * 1024)?;
    println!("Allocated at virtual address: 0x{:x}", addr);

    // Physical memory usage: ~1GB
    println!("Physical usage: {} bytes", pool.physical_memory_usage());

    // Deallocate when done
    pool.deallocate(0, 1024 * 1024 * 1024)?;

    Ok(())
}
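The pattern the pool enables is incremental growth: map another physical page only when a sequence actually needs it. A minimal sketch of that loop, assuming the allocate(offset, len) semantics from the example above (the helper and its bookkeeping are illustrative, not part of the crate):

use candle_cuda_vmm::{Result, VirtualMemoryPool};

/// Page size; must match the page size the pool was created with.
const PAGE: usize = 2 * 1024 * 1024;

/// Map additional physical pages until at least `needed` bytes are backed.
/// `mapped` tracks how many bytes are already mapped from the start of the
/// pool's virtual range.
fn ensure_capacity(
    pool: &mut VirtualMemoryPool,
    mapped: &mut usize,
    needed: usize,
) -> Result<()> {
    while *mapped < needed {
        // Assumed semantics: allocate(offset, len) maps `len` bytes of
        // physical memory at byte offset `offset`, as in the Quick Start.
        pool.allocate(*mapped, PAGE)?;
        *mapped += PAGE;
    }
    Ok(())
}

Calling ensure_capacity once per generated token keeps physical usage proportional to the actual sequence length instead of the pre-allocated worst case.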
§Multi-Model Serving
use candle_cuda_vmm::{SharedMemoryPool, Result};
use candle_core::Device;

fn main() -> Result<()> {
    let device = Device::new_cuda(0)?;

    let mut shared_pool = SharedMemoryPool::new(
        32 * 1024 * 1024 * 1024, // 32GB global physical limit
        device,
    )?;

    // Register models with per-model memory limits
    shared_pool.register_model("llama-7b", 64 * 1024 * 1024 * 1024)?;
    shared_pool.register_model("gpt2", 32 * 1024 * 1024 * 1024)?;

    // Allocate for a specific model
    let _addr = shared_pool.allocate_for_model("llama-7b", 1024 * 1024 * 1024)?;

    Ok(())
}
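Note that the registered capacities (64GB + 32GB) deliberately exceed the 32GB physical limit: since physical pages are mapped only when a model actually allocates, the pool can over-subscribe virtual capacity and let bursty tenants share one physical budget.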
§Requirements
- CUDA 11.2 or later (the low-level CUDA VMM driver APIs were introduced in CUDA 10.2; this crate requires 11.2)
- NVIDIA GPU with Compute Capability 6.0+ (Pascal or newer)
- Rust 1.70+
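A runtime capability check is available before committing to VMM-backed allocation. A minimal sketch using the crate's is_vmm_supported function (listed under Functions below); the zero-argument, bool-returning signature is an assumption here:

use candle_cuda_vmm::is_vmm_supported;

fn main() {
    // Assumed signature: fn is_vmm_supported() -> bool (it might instead
    // take a device ordinal); check the function docs before relying on it.
    if is_vmm_supported() {
        println!("CUDA VMM supported: elastic allocation is available");
    } else {
        println!("CUDA VMM not supported: fall back to static allocation");
    }
}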
§Performance
Based on KVCached benchmarks:
- Allocation Latency: <100μs per 2MB page
- TTFT Improvement: 1.2-28× faster vs static allocation (multi-model scenarios)
- Memory Overhead: <5% metadata overhead
- Throughput: No degradation vs static allocation for single-model workloads
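For scale, these numbers imply that backing 1GB of KV cache on demand means mapping 512 pages of 2MB each, i.e. roughly 51ms of cumulative allocation latency at the quoted <100μs per page, and that cost is spread across a request's lifetime rather than paid as one upfront reservation.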
Re-exports§
- pub use error::Result;
- pub use error::VmmError;
- pub use physical_memory::PhysicalMemoryHandle;
- pub use mapping::VirtualAddressRange;
- pub use mapping::map_memory;
- pub use mapping::unmap_memory;
- pub use mapping::set_memory_access;
- pub use virtual_memory::VirtualMemoryPool;
- pub use virtual_memory::MemoryStats;
- pub use virtual_memory::GlobalMemoryStats;
- pub use cuda_ffi::AccessFlags;
Modules§
- cuda_ffi - Low-level CUDA Virtual Memory Management FFI bindings.
- error - Error types for CUDA Virtual Memory Management operations.
- mapping - Virtual address space reservation and memory mapping operations.
- physical_memory - Physical GPU memory allocation with RAII.
- virtual_memory - Virtual memory pool for elastic memory allocation.
Constants§
- VERSION - Library version.
Functions§
- is_vmm_supported - Check if CUDA VMM is supported on the current system.