Crate unillm_kv

Hybrid KV cache combining RadixAttention and PagedAttention

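This crate implements UniLLM’s memory management system, which combines RadixAttention-style token-level prefix sharing (see RadixCache) with PagedAttention-style paged allocation (see PagedKvAllocator).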

Structs

AdaptiveCachePolicy: Adaptive cache policy for managing tier allocation
CacheAnalysis
CacheHandle: Handle to a cache entry
CudaMemoryBackend: CUDA memory backend implementation
GpuAllocation: GPU memory allocation info
GpuAwareMemoryPool: GPU-aware memory pool that integrates with our hybrid cache
GpuDeviceProperties: GPU device properties
GpuDevicePtr: GPU device pointer with metadata
GpuIntegratedCache: GPU-integrated cache that combines hybrid caching with direct GPU memory management
GpuIntegratedCacheBuilder: Builder for GPU-integrated cache with different configurations
GpuIntegratedCacheStats
GpuMemoryStats
HipMemoryBackend: HIP memory backend implementation
HybridCacheStats
HybridKVCache: Main hybrid cache implementation
KVTensorPair: KV tensor pair representation
KvAllocatorStats: Memory usage statistics
KvBlock: A block of pages (typically 16 pages per block)
KvPage: A page in the KV cache
KvSequence: Sequence information for KV cache management
PagedKvAllocator: Paged KV allocator implementation
RadixCache: RadixCache implementation for L1 token-level sharing
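
The listing above names a prefix-sharing layer (RadixCache) and a paged allocator (PagedKvAllocator) but does not show how the two ideas fit together. The following is a minimal, self-contained sketch of that combination, not this crate's API: every name in it (ToyPagedAllocator, ToyNode, ToyHybridCache, map_sequence, PAGE_TOKENS) is hypothetical, and the page size of 4 tokens is an assumption chosen to keep the example small. Sequences that share a page-aligned token prefix are mapped to the same physical pages, and only the non-shared tail takes new pages from the pool.

```rust
use std::collections::HashMap;

/// Tokens stored per page. Assumed value for this sketch only.
const PAGE_TOKENS: usize = 4;

/// Hypothetical paged allocator: hands out page ids from a fixed pool.
struct ToyPagedAllocator {
    free: Vec<usize>,
}

impl ToyPagedAllocator {
    fn new(num_pages: usize) -> Self {
        // Reversed so that `pop` hands out page 0 first.
        Self { free: (0..num_pages).rev().collect() }
    }

    fn alloc(&mut self) -> Option<usize> {
        self.free.pop()
    }

    fn free_pages(&self) -> usize {
        self.free.len()
    }
}

/// Hypothetical prefix-tree node: one full page of tokens per edge.
struct ToyNode {
    page: usize,
    children: HashMap<Vec<u32>, ToyNode>,
}

/// Hypothetical hybrid cache: a prefix tree on top of a page pool.
struct ToyHybridCache {
    roots: HashMap<Vec<u32>, ToyNode>,
    alloc: ToyPagedAllocator,
}

impl ToyHybridCache {
    fn new(num_pages: usize) -> Self {
        Self { roots: HashMap::new(), alloc: ToyPagedAllocator::new(num_pages) }
    }

    /// Returns the page table for `tokens` and the number of pages reused
    /// from earlier sequences, or `None` if the pool is exhausted.
    /// (A real allocator would also roll back partial allocations on failure.)
    fn map_sequence(&mut self, tokens: &[u32]) -> Option<(Vec<usize>, usize)> {
        let mut pages = Vec::new();
        let mut reused = 0;
        let mut level = &mut self.roots;
        let mut chunks = tokens.chunks_exact(PAGE_TOKENS);

        for chunk in chunks.by_ref() {
            let key = chunk.to_vec();
            if level.contains_key(&key) {
                // Same prefix seen before: share the existing page.
                reused += 1;
            } else {
                let page = self.alloc.alloc()?;
                level.insert(key.clone(), ToyNode { page, children: HashMap::new() });
            }
            let node = level.get_mut(&key).unwrap();
            pages.push(node.page);
            level = &mut node.children;
        }

        // A trailing partial page is private to this sequence and not indexed.
        if !chunks.remainder().is_empty() {
            pages.push(self.alloc.alloc()?);
        }
        Some((pages, reused))
    }
}
```

With a pool of 8 pages, mapping the tokens 1 through 9 yields the page table [0, 1, 2] with no reuse. Mapping a second sequence that shares the first eight tokens and then continues with 10, 11 yields [0, 1, 3] with two pages reused, so the two sequences together occupy four pages instead of six.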