kyro-0.1.0 is not a library.
Kyro LLM Engine
Kyro is a high-throughput LLM serving engine written in Rust, inspired by vLLM and TGI. It leverages the candle ML framework for efficient tensor operations and tokio for high-concurrency async scheduling.
Key Features
- Continuous Batching: Iteration-level scheduling to maximize GPU throughput and eliminate queue wait times.
- PagedAttention: Virtual memory management for KV cache, eliminating memory fragmentation and enabling long-context serving.
- Prefix Caching (Radix Cache): Automatic reuse of KV cache for common prefixes (system prompts, multi-turn history), enabling near-zero Time-To-First-Token (TTFT).
- Chunked Prefill: Eliminates "Prefill Stall" by interleaving large prompt processing with active decode steps.
- Speculative Decoding: Accelerates generation by 2x using a lightweight draft model for token prediction and a target model for parallel verification.
- Distributed Inference: Support for Tensor Parallelism (TP) and Pipeline Parallelism (PP) to serve massive models across multiple GPUs.
- Quantization Support: Native support for FP8 (Hopper), AWQ (4-bit), and GGUF weight loading.
- Constrained Decoding: Structured JSON-mode and Regex-constrained output via grammar-based sampling.
- Multi-LoRA Support: Dynamic loading and switching of many task-specific adapters on a single base model.
- Observability: Real-time Prometheus metrics for TTFT, TBT (Time Between Tokens), and KV cache utilization.
Architecture
- Frontend (Axum): Handles HTTP requests, streaming SSE, and health/metrics endpoints.
- Scheduler (Continuous Batching): Manages request queues, prefix caching, and chunked prefill scheduling.
- Model (Candle): Optimized Transformer blocks with support for multiple quantization formats (FP8, AWQ, GGUF), PagedAttention kernels, and LoRA adapters.
- KV Cache (PagedAttention): Manages logical-to-physical block mapping via a Reference-Counted BlockManager, ensuring cached prefixes are protected from overwrite.
- Distributed (NCCL): Handles multi-node/multi-GPU synchronization via
All-Reduce.
Getting Started
Running the Engine
The API will be available at http://localhost:3000/v1/chat/completions.
Benchmarking
To stress test the engine under concurrent load:
API Documentation
- POST
/v1/chat/completions: OpenAI-compatible completions endpoint. - GET
/health: Liveness and readiness probe. - GET
/metrics: Prometheus-formatted engine metrics.