# candle-cuda-vmm
CUDA Virtual Memory Management bindings for elastic KV cache allocation in Candle.
## Overview
candle-cuda-vmm provides safe Rust bindings to CUDA's Virtual Memory Management (VMM) APIs, enabling elastic memory allocation for LLM inference workloads. This crate is designed to integrate with the Candle deep learning framework and supports:
- Elastic KV Cache Allocation: Allocate memory on-demand rather than pre-allocating large static buffers
- Multi-Model Serving: Share GPU memory pools across multiple models with dynamic allocation
- Reduced TTFT: Faster time-to-first-token (1.2-28×) in multi-model scenarios vs static allocation
- Memory Efficiency: Optimal memory usage for bursty multi-tenant workloads
## Features
- Safe Rust wrappers around CUDA VMM APIs
- RAII-based memory management (automatic cleanup via Drop trait)
- Virtual address space reservation with on-demand physical page mapping
- Multi-pool support for memory sharing across models
- Comprehensive error handling
## Quick Start
```rust
// Import paths reconstructed from context; the original snippet was truncated here.
use candle_cuda_vmm::VirtualMemoryPool;
use candle_core::Device;
```
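The original quick-start example did not survive intact in this copy. As a stand-in, here is a minimal pure-Rust simulation of the allocate-on-demand flow the crate provides: `MockPool` is a CPU-side sketch (no CUDA calls), so the names and behavior are illustrative assumptions, not the crate's actual API.

```rust
use std::collections::HashMap;

const PAGE: usize = 2 * 1024 * 1024; // 2 MiB, the granularity CUDA VMM maps at

/// CPU-side mock of an elastic pool: a large reserved virtual range in which
/// physical "pages" are committed only when an allocation actually needs them.
struct MockPool {
    reserved: usize,               // size of the reserved virtual range
    mapped: HashMap<usize, usize>, // allocation offset -> bytes committed
    next: usize,                   // bump offset into the virtual range
}

impl MockPool {
    fn new(reserved: usize) -> Self {
        Self { reserved, mapped: HashMap::new(), next: 0 }
    }

    /// Commit `bytes` rounded up to page granularity; returns the offset.
    fn allocate(&mut self, bytes: usize) -> Option<usize> {
        let committed = (bytes + PAGE - 1) / PAGE * PAGE; // round up to 2 MiB
        if self.next + committed > self.reserved {
            return None; // virtual range exhausted
        }
        let addr = self.next;
        self.next += committed;
        self.mapped.insert(addr, committed);
        Some(addr)
    }

    /// Release the physical pages backing `addr`; the virtual range stays reserved.
    fn deallocate(&mut self, addr: usize) -> bool {
        self.mapped.remove(&addr).is_some()
    }

    /// Physical memory currently committed.
    fn committed(&self) -> usize {
        self.mapped.values().sum()
    }
}

fn main() {
    let mut pool = MockPool::new(1 << 30); // "reserve" 1 GiB of virtual space
    let a = pool.allocate(3 * 1024 * 1024).unwrap(); // rounds up to 4 MiB committed
    assert_eq!(pool.committed(), 4 * 1024 * 1024);
    pool.deallocate(a);
    assert_eq!(pool.committed(), 0); // physical memory returned, reservation kept
    println!("ok");
}
```

The key property the mock demonstrates is that the reservation (`reserved`) is decoupled from physical commitment (`mapped`), which is what lets a KV cache grow without a huge up-front allocation.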
## Architecture
The crate provides two main abstractions:
### VirtualMemoryPool
Elastic memory pool with virtual memory backing. Reserves large virtual address space but only allocates physical memory on-demand.
```rust
// Signatures reconstructed; argument names are illustrative, not the exact API.
let mut pool = VirtualMemoryPool::new(device_ordinal, reserved_bytes)?; // reserve virtual space
let addr = pool.allocate(num_bytes)?; // map physical pages on demand
pool.deallocate(addr)?;               // unmap and release the physical pages
```
### SharedMemoryPool
Multi-model memory pool with global physical memory limits and per-model statistics.
```rust
// Signatures reconstructed; argument names are illustrative, not the exact API.
let mut shared_pool = SharedMemoryPool::new(global_limit_bytes)?;
shared_pool.register_model("model-a")?;
let addr = shared_pool.allocate_for_model("model-a", num_bytes)?;
```
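To make the accounting concrete, here is a minimal CPU-side sketch of a shared pool that enforces one global physical-memory budget while tracking per-model usage. It is a simulation with assumed semantics (names like `MockSharedPool` are invented), not the crate's implementation.

```rust
use std::collections::HashMap;

/// Toy shared pool: one global physical budget, per-model usage statistics.
struct MockSharedPool {
    limit: usize,                      // global physical budget in bytes
    used: usize,                       // bytes currently allocated across all models
    per_model: HashMap<String, usize>, // per-model allocation stats
}

impl MockSharedPool {
    fn new(limit: usize) -> Self {
        Self { limit, used: 0, per_model: HashMap::new() }
    }

    fn register_model(&mut self, name: &str) {
        self.per_model.entry(name.to_string()).or_insert(0);
    }

    /// Grant the request only if it fits under the shared global limit.
    fn allocate_for_model(&mut self, name: &str, bytes: usize) -> Result<(), String> {
        let usage = self.per_model.get_mut(name).ok_or("unregistered model")?;
        if self.used + bytes > self.limit {
            return Err("global limit exceeded".into());
        }
        self.used += bytes;
        *usage += bytes;
        Ok(())
    }
}

fn main() {
    let mut pool = MockSharedPool::new(10);
    pool.register_model("model-a");
    pool.register_model("model-b");
    assert!(pool.allocate_for_model("model-a", 6).is_ok());
    assert!(pool.allocate_for_model("model-b", 6).is_err()); // would exceed the shared limit
    assert_eq!(pool.per_model["model-a"], 6);
    println!("ok");
}
```

The point of the shared design is that the limit applies to the sum of all models' physical usage, so idle models give back headroom to busy ones instead of holding statically partitioned buffers.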
## Requirements
- CUDA 11.2 or later (the underlying VMM driver APIs date to CUDA 10.2; this crate targets 11.2+)
- NVIDIA GPU with Compute Capability 6.0+ (Pascal or newer)
- Rust 1.70+
## Performance
Based on KVCached benchmarks:
- Allocation Latency: <100μs per 2MB page
- TTFT Improvement: 1.2-28× faster vs static allocation (multi-model scenarios)
- Memory Overhead: <5% metadata overhead
- Throughput: No degradation vs static allocation for single-model workloads
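The per-page latency figure composes directly: growing a KV cache by `N` bytes touches `ceil(N / 2 MiB)` pages, so the worst-case mapping cost is bounded by `pages × 100 μs`. A quick back-of-the-envelope helper (the 100 μs bound is the benchmark figure above, not a guarantee):

```rust
const PAGE: usize = 2 * 1024 * 1024; // 2 MiB mapping granularity
const WORST_CASE_US_PER_PAGE: usize = 100; // benchmark upper bound quoted above

/// Upper bound, in microseconds, on the mapping cost of growing a cache by `bytes`.
fn worst_case_grow_us(bytes: usize) -> usize {
    let pages = (bytes + PAGE - 1) / PAGE; // pages touched, rounded up
    pages * WORST_CASE_US_PER_PAGE
}

fn main() {
    // Growing a KV cache by 256 MiB maps 128 pages -> at most ~12.8 ms.
    assert_eq!(worst_case_grow_us(256 * 1024 * 1024), 12_800);
    println!("ok");
}
```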
## Use Cases

### Lightbulb LLM Inference Engine
This crate was built to enable elastic KV cache management in the Lightbulb inference engine:
```rust
// Import path reconstructed; the rest of the integration example was truncated.
use candle_cuda_vmm::VirtualMemoryPool;
```
## Documentation

## References
- KVCached Paper: Prism Multi-LLM Serving with VMM
- NVIDIA CUDA Virtual Memory Management
- vAttention: Virtual Memory for PagedAttention
## License
Licensed under either of:
- Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
- MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)
at your option.
## Contributing
Contributions are welcome! Please see CONTRIBUTING.md for guidelines.