Hybrid KV cache combining RadixAttention and PagedAttention
This crate implements UniLLM’s innovative memory management system that combines:
- SGLang’s RadixAttention for token-level prefix sharing (L1)
- vLLM’s PagedAttention for block-level efficiency (L2)
- Compressed storage for cold data (L3)
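To make the difference between token-level (L1) and block-level (L2) sharing concrete, here is a minimal sketch. The helper names `shared_prefix_len` and `reusable_block_tokens`, the use of `u32` for tokens, and the block size passed in are all assumptions for illustration; none of them are part of this crate's API.

```rust
/// Hypothetical helper: number of leading tokens two sequences have in common.
/// Token-level sharing can reuse cached KV entries for exactly this many tokens.
fn shared_prefix_len(cached: &[u32], request: &[u32]) -> usize {
    cached
        .iter()
        .zip(request.iter())
        .take_while(|(a, b)| a == b)
        .count()
}

/// Hypothetical helper: number of tokens reusable when sharing is only possible
/// in whole blocks of `tokens_per_block` tokens. The partially matched tail
/// block cannot be shared and is rounded away.
fn reusable_block_tokens(prefix_len: usize, tokens_per_block: usize) -> usize {
    assert!(tokens_per_block > 0, "block size must be non-zero");
    (prefix_len / tokens_per_block) * tokens_per_block
}
```

For a cached sequence `[1, 2, 3, 4, 5, 6, 7, 8]` and a request `[1, 2, 3, 4, 5, 6, 99]`, the shared prefix is 6 tokens. With an assumed block size of 4 tokens, only 4 of those 6 tokens fall in a fully matched block, which is the gap that token-level sharing is meant to close.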
Structs
- AdaptiveCachePolicy - Adaptive cache policy for managing tier allocation
- CacheAnalysis
- CacheHandle - Handle to a cache entry
- CudaMemoryBackend - CUDA memory backend implementation
- GpuAllocation - GPU memory allocation info
- GpuAwareMemoryPool - GPU-aware memory pool that integrates with our hybrid cache
- GpuDeviceProperties - GPU device properties
- GpuDevicePtr - GPU device pointer with metadata
- GpuIntegratedCache - GPU-integrated cache that combines hybrid caching with direct GPU memory management
- GpuIntegratedCacheBuilder - Builder for GPU-integrated cache with different configurations
- GpuIntegratedCacheStats
- GpuMemoryStats
- HipMemoryBackend - HIP memory backend implementation
- HybridCacheStats
- HybridKVCache - Main hybrid cache implementation
- KVTensorPair - KV tensor pair representation
- KvAllocatorStats - Memory usage statistics
- KvBlock - A block of pages (typically 16 pages per block)
- KvPage - A page in the KV cache
- KvSequence - Sequence information for KV cache management
- PagedKvAllocator - Paged KV allocator implementation
- RadixCache - RadixCache implementation for L1 token-level sharing
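As a rough illustration of what a paged allocator does, the following sketch hands out fixed-size pages from a free list. `ToyPagedAllocator`, its methods, and the `tokens_per_page` parameter are hypothetical names chosen for this example; the real `PagedKvAllocator`, `KvPage`, and `KvBlock` types may be organised quite differently.

```rust
/// Toy fixed-size page allocator (hypothetical; not this crate's `PagedKvAllocator`).
struct ToyPagedAllocator {
    /// Assumed capacity of one page, measured in tokens.
    tokens_per_page: usize,
    /// Indices of pages that are currently unused.
    free_pages: Vec<usize>,
}

impl ToyPagedAllocator {
    fn new(total_pages: usize, tokens_per_page: usize) -> Self {
        assert!(tokens_per_page > 0, "page size must be non-zero");
        // Stored in reverse so that `pop` hands out page 0 first.
        Self {
            tokens_per_page,
            free_pages: (0..total_pages).rev().collect(),
        }
    }

    /// Pages required to hold `tokens` tokens, rounding up.
    fn pages_needed(&self, tokens: usize) -> usize {
        (tokens + self.tokens_per_page - 1) / self.tokens_per_page
    }

    /// Reserve enough pages for `tokens` tokens, or return `None` if the pool
    /// cannot satisfy the request. Nothing is reserved on failure.
    fn allocate(&mut self, tokens: usize) -> Option<Vec<usize>> {
        let needed = self.pages_needed(tokens);
        if needed > self.free_pages.len() {
            return None;
        }
        let mut pages = Vec::with_capacity(needed);
        for _ in 0..needed {
            // Cannot fail: the length check above guarantees enough free pages.
            pages.push(self.free_pages.pop().expect("free page available"));
        }
        Some(pages)
    }

    /// Return previously allocated pages to the pool.
    fn release(&mut self, pages: Vec<usize>) {
        self.free_pages.extend(pages);
    }

    fn free_page_count(&self) -> usize {
        self.free_pages.len()
    }
}
```

Because every page has the same size, a sequence that grows by one token needs at most one extra page, and a released page can be reused by any other sequence without compaction. Grouping pages into blocks, as `KvBlock` does, is not modelled here.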
Enums
- CachePolicy
- CacheTier - Cache tier enumeration
- GpuBackendType
- GpuMemoryError - GPU memory allocation errors
Traits
- GpuMemoryBackend - GPU backend abstraction for memory operations
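The general shape of a backend abstraction like this can be sketched with a trait, an error enum, and a result alias. Everything below (`ToyMemoryBackend`, `ToyMemoryError`, `ToyResult`, `HostMockBackend`, and all method names) is invented for illustration and is not the signature of `GpuMemoryBackend`, `GpuMemoryError`, or `GpuMemoryResult`. The mock backend only does bookkeeping on the host and never touches a GPU.

```rust
use std::collections::HashMap;

/// Hypothetical error type, standing in for a GPU allocation error enum.
#[derive(Debug, PartialEq)]
enum ToyMemoryError {
    OutOfMemory { requested: usize, available: usize },
    InvalidHandle(u64),
}

/// Hypothetical result alias, so signatures do not repeat the error type.
type ToyResult<T> = Result<T, ToyMemoryError>;

/// Hypothetical backend trait: callers depend on this interface, not on a
/// specific vendor implementation.
trait ToyMemoryBackend {
    fn allocate(&mut self, bytes: usize) -> ToyResult<u64>;
    fn free(&mut self, handle: u64) -> ToyResult<()>;
    fn bytes_in_use(&self) -> usize;
}

/// Host-side mock that tracks allocation sizes by handle.
struct HostMockBackend {
    capacity: usize,
    next_handle: u64,
    live: HashMap<u64, usize>,
}

impl HostMockBackend {
    fn new(capacity: usize) -> Self {
        Self { capacity, next_handle: 0, live: HashMap::new() }
    }
}

impl ToyMemoryBackend for HostMockBackend {
    fn allocate(&mut self, bytes: usize) -> ToyResult<u64> {
        // `bytes_in_use` never exceeds `capacity`, so this cannot underflow.
        let available = self.capacity - self.bytes_in_use();
        if bytes > available {
            return Err(ToyMemoryError::OutOfMemory { requested: bytes, available });
        }
        let handle = self.next_handle;
        self.next_handle += 1;
        self.live.insert(handle, bytes);
        Ok(handle)
    }

    fn free(&mut self, handle: u64) -> ToyResult<()> {
        // Removing an unknown or already freed handle is reported as an error.
        self.live
            .remove(&handle)
            .map(|_| ())
            .ok_or(ToyMemoryError::InvalidHandle(handle))
    }

    fn bytes_in_use(&self) -> usize {
        self.live.values().sum()
    }
}
```

A design along these lines lets cache code be written once against the trait and tested with a mock, while vendor-specific implementations such as a CUDA or HIP backend are swapped in behind it.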
Type Aliases
- GpuMemoryResult - GPU memory allocation result
- SequenceId - Sequence ID
- TokenId - Token ID type