//! # candle-cuda-vmm
//!
//! CUDA Virtual Memory Management bindings for elastic KV cache allocation in Candle.
//!
//! This crate provides safe Rust bindings to CUDA's Virtual Memory Management (VMM) APIs,
//! enabling elastic memory allocation for LLM inference workloads. It integrates with the
//! Candle deep learning framework and supports:
//!
//! - **Elastic KV Cache Allocation**: Allocate memory on demand rather than pre-allocating
//!   large static buffers
//! - **Multi-Model Serving**: Share GPU memory pools across multiple models with dynamic
//!   allocation
//! - **Reduced TTFT**: 1.2-28× faster time-to-first-token in multi-model scenarios vs
//!   static allocation
//! - **Memory Efficiency**: Commits physical memory only where it is actually used,
//!   improving utilization for bursty multi-tenant workloads
//!
//! ## Architecture
//!
//! The crate is organized into several modules:
//!
//! - [`error`]: Error types for VMM operations
//! - [`cuda_ffi`]: Low-level CUDA VMM FFI bindings
//! - [`physical_memory`]: Physical GPU memory allocation with RAII
//! - [`mapping`]: Virtual address space reservation and mapping operations
//! - [`virtual_memory`]: High-level elastic memory pool abstractions
//!
//! ## Quick Start
//!
//! ```no_run
//! use candle_cuda_vmm::{VirtualMemoryPool, Result};
//! use candle_core::Device;
//!
//! fn main() -> Result<()> {
//!     let device = Device::new_cuda(0)?;
//!
//!     // Create a pool with 128GB virtual capacity, 2MB pages
//!     let mut pool = VirtualMemoryPool::new(
//!         128 * 1024 * 1024 * 1024, // 128GB virtual
//!         2 * 1024 * 1024,          // 2MB pages
//!         device,
//!     )?;
//!
//!     // Allocate 1GB of physical memory on-demand
//!     let addr = pool.allocate(0, 1024 * 1024 * 1024)?;
//!     println!("Allocated at virtual address: 0x{:x}", addr);
//!
//!     // Physical memory usage: ~1GB
//!     println!("Physical usage: {} bytes", pool.physical_memory_usage());
//!
//!     // Deallocate when done
//!     pool.deallocate(0, 1024 * 1024 * 1024)?;
//!
//!     Ok(())
//! }
//! ```
//!
//! ## Multi-Model Serving
//!
//! ```no_run
//! use candle_cuda_vmm::{SharedMemoryPool, Result};
//! use candle_core::Device;
//!
//! fn main() -> Result<()> {
//!     let device = Device::new_cuda(0)?;
//!     let mut shared_pool = SharedMemoryPool::new(
//!         32 * 1024 * 1024 * 1024, // 32GB global physical limit
//!         device,
//!     )?;
//!
//!     // Register models
//!     shared_pool.register_model("llama-7b", 64 * 1024 * 1024 * 1024)?;
//!     shared_pool.register_model("gpt2", 32 * 1024 * 1024 * 1024)?;
//!
//!     // Allocate for a specific model
//!     let _addr = shared_pool.allocate_for_model("llama-7b", 1024 * 1024 * 1024)?;
//!
//!     Ok(())
//! }
//! ```
//!
//! ## Requirements
//!
//! - CUDA 11.2 or later (the CUDA VMM driver APIs were introduced in CUDA 10.2; this crate targets 11.2+)
//! - NVIDIA GPU with Compute Capability 6.0+ (Pascal or newer)
//! - Rust 1.70+
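//!
//! VMM support can be probed at runtime before constructing a pool. A minimal
//! sketch using this crate's [`is_vmm_supported`] helper (defined at the bottom
//! of this file):
//!
//! ```no_run
//! if !candle_cuda_vmm::is_vmm_supported() {
//!     eprintln!("CUDA VMM is unavailable; fall back to static allocation");
//! }
//! ```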
//!
//! ## Performance
//!
//! Based on KVCached benchmarks:
//!
//! - **Allocation Latency**: <100μs per 2MB page
//! - **TTFT Improvement**: 1.2-28× faster vs static allocation (multi-model scenarios)
//! - **Memory Overhead**: <5% metadata overhead
//! - **Throughput**: No degradation vs static allocation for single-model workloads
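//!
//! As a rough way to reproduce the allocation-latency number on your own
//! hardware, one can time a single page allocation with the Quick Start API
//! (a sketch only; results vary by GPU and driver):
//!
//! ```no_run
//! use candle_cuda_vmm::{VirtualMemoryPool, Result};
//! use candle_core::Device;
//! use std::time::Instant;
//!
//! fn main() -> Result<()> {
//!     let device = Device::new_cuda(0)?;
//!     let mut pool = VirtualMemoryPool::new(
//!         128 * 1024 * 1024 * 1024, // 128GB virtual
//!         2 * 1024 * 1024,          // 2MB pages
//!         device,
//!     )?;
//!
//!     // Time one 2MB page allocation.
//!     let start = Instant::now();
//!     pool.allocate(0, 2 * 1024 * 1024)?;
//!     println!("2MB page mapped in {:?}", start.elapsed());
//!     Ok(())
//! }
//! ```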
// Module declarations (one per module listed in the Architecture section)
pub mod cuda_ffi;
pub mod error;
pub mod mapping;
pub mod physical_memory;
pub mod virtual_memory;

// Re-export main types
pub use error::Result;
pub use mapping::AccessFlags;
pub use physical_memory::PhysicalMemoryHandle;
pub use virtual_memory::{SharedMemoryPool, VirtualMemoryPool};
/// Library version.
pub const VERSION: &str = env!("CARGO_PKG_VERSION");
/// Check if CUDA VMM is supported on the current system.
///
/// # Returns
/// `true` if CUDA VMM is available, `false` otherwise.
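pub fn is_vmm_supported() -> bool {
    // NOTE: reconstructed body. It assumes `cuda_ffi` exposes a
    // `vmm_supported()` probe wrapping cuDeviceGetAttribute with
    // CU_DEVICE_ATTRIBUTE_VIRTUAL_MEMORY_MANAGEMENT_SUPPORTED; the helper
    // name and its signature are assumptions here, not confirmed API.
    cuda_ffi::vmm_supported().unwrap_or(false)
}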