Crate unillm_kv

Hybrid KV cache combining RadixAttention and PagedAttention

This crate implements UniLLM’s memory management system, a three-tier KV cache that combines:

  • SGLang’s RadixAttention for token-level prefix sharing (L1)
  • vLLM’s PagedAttention for block-level efficiency (L2)
  • Compressed storage for cold data (L3)
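
The tier ordering above can be illustrated with a minimal sketch. It assumes a lookup probes the hottest tier first and falls through to colder tiers; the page does not state the crate's actual lookup order or data layout. `Tier` and `TieredStore` are hypothetical names for illustration, not the crate's `CacheTier` or `HybridKVCache`.

```rust
use std::collections::HashMap;

/// Hypothetical tier labels mirroring the L1/L2/L3 description.
/// Not the crate's `CacheTier` enum, whose variants are not shown here.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Tier {
    L1Radix,
    L2Paged,
    L3Compressed,
}

/// Toy three-tier store keyed by a token prefix (assumption: `u32` token ids).
#[derive(Default)]
struct TieredStore {
    l1: HashMap<Vec<u32>, Vec<f32>>,
    l2: HashMap<Vec<u32>, Vec<f32>>,
    /// Stand-in for compressed bytes; no real compression is performed.
    l3: HashMap<Vec<u32>, Vec<u8>>,
}

impl TieredStore {
    /// Returns the first tier that holds `key`, probing the hottest tier first.
    fn locate(&self, key: &[u32]) -> Option<Tier> {
        if self.l1.contains_key(key) {
            Some(Tier::L1Radix)
        } else if self.l2.contains_key(key) {
            Some(Tier::L2Paged)
        } else if self.l3.contains_key(key) {
            Some(Tier::L3Compressed)
        } else {
            None
        }
    }
}
```

A key present in more than one tier resolves to the hottest one, so a real policy would also need to decide when entries are promoted or demoted; that part is left out of the sketch.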

Structs

AdaptiveCachePolicy
Adaptive cache policy for managing tier allocation
CacheAnalysis
CacheHandle
Handle to a cache entry
CudaMemoryBackend
CUDA memory backend implementation
GpuAllocation
GPU memory allocation info
GpuAwareMemoryPool
GPU-aware memory pool that integrates with our hybrid cache
GpuDeviceProperties
GPU device properties
GpuDevicePtr
GPU device pointer with metadata
GpuIntegratedCache
GPU-integrated cache that combines hybrid caching with direct GPU memory management
GpuIntegratedCacheBuilder
Builder for GPU-integrated cache with different configurations
GpuIntegratedCacheStats
GpuMemoryStats
HipMemoryBackend
HIP memory backend implementation
HybridCacheStats
HybridKVCache
Main hybrid cache implementation
KVTensorPair
KV tensor pair representation
KvAllocatorStats
Memory usage statistics
KvBlock
A block of pages (typically 16 pages per block)
KvPage
A page in the KV cache
KvSequence
Sequence information for KV cache management
PagedKvAllocator
Paged KV allocator implementation
RadixCache
RadixCache implementation for L1 token-level sharing
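
Token-level prefix sharing, as described for `RadixCache`, can be sketched with a plain per-token trie. This is a simplification: a radix tree would compress single-child chains into one edge, and the crate's real structure and methods are not shown on this page. `PrefixTrie`, `TrieNode`, and the `u32` token type are assumptions made for illustration.

```rust
use std::collections::HashMap;

/// Assumption: tokens are `u32`; the crate's `TokenId` alias may differ.
type TokenId = u32;

/// One trie node per token; children are keyed by the next token.
#[derive(Default)]
struct TrieNode {
    children: HashMap<TokenId, TrieNode>,
}

/// Hypothetical prefix trie, not the crate's `RadixCache`.
#[derive(Default)]
struct PrefixTrie {
    root: TrieNode,
}

impl PrefixTrie {
    /// Inserts a token sequence, reusing nodes for any prefix already present.
    fn insert(&mut self, tokens: &[TokenId]) {
        let mut node = &mut self.root;
        for &token in tokens {
            node = node.children.entry(token).or_default();
        }
    }

    /// Counts how many leading tokens of `tokens` are already stored.
    fn shared_prefix_len(&self, tokens: &[TokenId]) -> usize {
        let mut node = &self.root;
        let mut len = 0;
        for token in tokens {
            match node.children.get(token) {
                Some(child) => {
                    node = child;
                    len += 1;
                }
                None => break,
            }
        }
        len
    }
}
```

In a KV cache, the shared prefix length is the number of tokens whose key/value tensors could be reused instead of recomputed for a new request.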

Enums

CachePolicy
CacheTier
Cache tier enumeration
GpuBackendType
GpuMemoryError
GPU memory allocation errors

Traits

GpuMemoryBackend
GPU backend abstraction for memory operations
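
A backend abstraction of this kind is usually a trait that hides device-specific allocation behind a common interface. The sketch below shows the general pattern only. The methods of `GpuMemoryBackend` are not listed on this page, so `ToyMemoryBackend`, `ToyMemoryError`, and `HostBackend` are hypothetical names, and the host-side bookkeeping stands in for real CUDA or HIP calls.

```rust
use std::collections::HashMap;

/// Hypothetical error type; not the crate's `GpuMemoryError`.
#[derive(Debug, PartialEq)]
enum ToyMemoryError {
    OutOfMemory,
    InvalidHandle,
}

/// Hypothetical backend trait; the real `GpuMemoryBackend` methods may differ.
trait ToyMemoryBackend {
    fn allocate(&mut self, bytes: usize) -> Result<u64, ToyMemoryError>;
    fn free(&mut self, handle: u64) -> Result<(), ToyMemoryError>;
    fn used_bytes(&self) -> usize;
}

/// Host-only mock that tracks sizes instead of touching a device.
struct HostBackend {
    capacity: usize,
    used: usize,
    next_handle: u64,
    live: HashMap<u64, usize>,
}

impl HostBackend {
    fn new(capacity: usize) -> Self {
        Self { capacity, used: 0, next_handle: 0, live: HashMap::new() }
    }
}

impl ToyMemoryBackend for HostBackend {
    fn allocate(&mut self, bytes: usize) -> Result<u64, ToyMemoryError> {
        // `used` never exceeds `capacity`, so this subtraction cannot underflow.
        if bytes > self.capacity - self.used {
            return Err(ToyMemoryError::OutOfMemory);
        }
        let handle = self.next_handle;
        self.next_handle += 1;
        self.live.insert(handle, bytes);
        self.used += bytes;
        Ok(handle)
    }

    fn free(&mut self, handle: u64) -> Result<(), ToyMemoryError> {
        // Removing the entry makes a second free of the same handle an error.
        let bytes = self.live.remove(&handle).ok_or(ToyMemoryError::InvalidHandle)?;
        self.used -= bytes;
        Ok(())
    }

    fn used_bytes(&self) -> usize {
        self.used
    }
}
```

Code written against the trait can then be exercised with the host mock in tests and switched to a device-backed implementation without changes at the call site.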

Type Aliases

GpuMemoryResult
GPU memory allocation result
SequenceId
Sequence ID
TokenId
Token ID type