candle-cuda-vmm

CUDA Virtual Memory Management bindings for elastic KV cache allocation in Candle.

Overview

candle-cuda-vmm provides safe Rust bindings to CUDA's Virtual Memory Management (VMM) APIs, enabling elastic memory allocation for LLM inference workloads. This crate is designed to integrate with the Candle deep learning framework and supports:

Elastic KV Cache Allocation: Allocate memory on-demand rather than pre-allocating large static buffers
Multi-Model Serving: Share GPU memory pools across multiple models with dynamic allocation
Reduced TTFT: Faster time-to-first-token (1.2-28×) in multi-model scenarios vs static allocation
Memory Efficiency: Optimal memory usage for bursty multi-tenant workloads

Features

Safe Rust wrappers around CUDA VMM APIs
RAII-based memory management (automatic cleanup via Drop trait)
Virtual address space reservation with on-demand physical page mapping
Multi-pool support for memory sharing across models
Comprehensive error handling

Quick Start

use candle_cuda_vmm::{VirtualMemoryPool, Result};
use candle_core::Device;

fn main() -> Result<()> {
    let device = Device::new_cuda(0)?;
    
    // Create a pool with 128GB virtual capacity, 2MB pages
    let mut pool = VirtualMemoryPool::new(
        128 * 1024 * 1024 * 1024, // 128GB virtual
        2 * 1024 * 1024,          // 2MB pages
        device,
    )?;
    
    // Allocate 1GB of physical memory on-demand
    let addr = pool.allocate(0, 1024 * 1024 * 1024)?;
    println!("Allocated at virtual address: 0x{:x}", addr);
    
    // Physical memory usage: ~1GB
    println!("Physical usage: {} bytes", pool.physical_memory_usage());
    
    // Deallocate when done
    pool.deallocate(0, 1024 * 1024 * 1024)?;
    
    Ok(())
}

Architecture

The crate provides two main abstractions:

VirtualMemoryPool

Elastic memory pool with virtual memory backing. Reserves large virtual address space but only allocates physical memory on-demand.

let mut pool = VirtualMemoryPool::new(capacity, page_size, device)?;
let addr = pool.allocate(offset, size)?;
pool.deallocate(offset, size)?;

SharedMemoryPool

Multi-model memory pool with global physical memory limits and per-model statistics.

let mut shared_pool = SharedMemoryPool::new(physical_limit, device)?;
shared_pool.register_model("llama-7b", virtual_capacity)?;
let addr = shared_pool.allocate_for_model("llama-7b", size)?;

Requirements

CUDA 11.2 or later (CUDA VMM APIs introduced in 11.2)
NVIDIA GPU with Compute Capability 6.0+ (Pascal or newer)
Rust 1.70+

Performance

Based on KVCached benchmarks:

Allocation Latency: <100μs per 2MB page
TTFT Improvement: 1.2-28× faster vs static allocation (multi-model scenarios)
Memory Overhead: <5% metadata overhead
Throughput: No degradation vs static allocation for single-model workloads

Use Cases

Lightbulb LLM Inference Engine

This crate was built to enable elastic KV cache management in the Lightbulb inference engine:

use candle_cuda_vmm::VirtualMemoryPool;

pub struct ElasticCacheBuilder {
    virtual_pool: VirtualMemoryPool,
    allocated_blocks: Vec<(usize, usize)>,
    // ...
}

impl ElasticCacheBuilder {
    pub fn allocate_for_tokens(&mut self, num_tokens: usize) -> Result<()> {
        let size = num_tokens * self.token_size();
        let offset = self.current_tokens * self.token_size();
        self.virtual_pool.allocate(offset, size)?;
        Ok(())
    }
}

Documentation

References

License

Licensed under either of:

Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)

at your option.

Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines.

candle-cuda-vmm 0.1.1