oxillama-runtime
Inference runtime for OxiLLaMa.
Orchestrates the complete inference pipeline: model loading, tokenization, forward pass execution, KV caching, and token sampling.
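The pipeline stages listed above can be pictured as a simple decode loop. The following is a minimal sketch, not this crate's API: the `ForwardPass` trait, `ToyModel`, `argmax`, and `generate` are hypothetical stand-ins invented for illustration. Tokenization and KV caching are omitted, so the toy model recomputes from the full token history on every step.

```rust
/// Hypothetical stand-in for a model: maps the token history to next-token logits.
/// This is NOT the crate's engine interface.
trait ForwardPass {
    fn logits(&mut self, tokens: &[u32]) -> Vec<f32>;
}

/// Toy model over a 4-token vocabulary that always favours (last_token + 1) % 4.
struct ToyModel;

impl ForwardPass for ToyModel {
    fn logits(&mut self, tokens: &[u32]) -> Vec<f32> {
        let next = tokens.last().map_or(0, |t| (t + 1) % 4) as usize;
        let mut out = vec![0.0; 4];
        out[next] = 1.0;
        out
    }
}

/// Greedy sampling: index of the largest logit (first one wins on ties).
fn argmax(logits: &[f32]) -> u32 {
    let mut best = 0;
    for (i, &v) in logits.iter().enumerate() {
        if v > logits[best] {
            best = i;
        }
    }
    best as u32
}

/// Decode loop: forward pass, sample, append; stop at `max_new` tokens or on `eos`.
fn generate<M: ForwardPass>(model: &mut M, prompt: &[u32], max_new: usize, eos: u32) -> Vec<u32> {
    let mut tokens = prompt.to_vec();
    for _ in 0..max_new {
        let logits = model.logits(&tokens);
        let next = argmax(&logits);
        tokens.push(next);
        if next == eos {
            break;
        }
    }
    tokens
}
```

Calling `generate(&mut ToyModel, &[0], 5, 3)` walks the toy vocabulary until it emits the end-of-sequence token 3. A real runtime would typically feed only the newest token to the forward pass and reuse cached keys and values for the earlier positions.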
Re-exports
pub use batched_attention::batched_flash_attention;
pub use beam_search::beam_generate;
pub use beam_search::BeamForwardPass;
pub use beam_search::BeamHypothesis;
pub use beam_search::BeamSearchConfig;
pub use beam_search::EngineBeamAdapter;
pub use embedding::PoolingMode;
pub use engine::EngineConfig;
pub use engine::InferenceEngine;
pub use engine::FLASH_ATTN_THRESHOLD;
pub use error::RuntimeError;
pub use error::RuntimeResult;
pub use flash_attention::flash_attention;
pub use flash_attention::flash_attention_forward;
pub use flash_attention::flash_attention_gqa;
pub use flash_attention::flash_attention_multi_head;
pub use flash_attention::FlashAttentionConfig;
pub use kv_cache::prefix::CachedKvState;
pub use kv_cache::prefix::PrefixCacheConfig;
pub use kv_cache::prefix::PrefixKvCache;
pub use kv_cache::KvCache;
pub use kv_cache::KvCacheSnapshot;
pub use kv_cache::VecBatchedKvView;
pub use kv_pool::KvCachePool;
pub use lora_loader::apply_lora;
pub use metrics::EngineMetrics;
pub use metrics::MetricsSnapshot;
pub use offload::FilePagerSource;
pub use offload::LayerPager;
pub use offload::MemoryPressureProbe;
pub use offload::OffloadPolicy;
pub use offload::PagerSource;
pub use offload::ResidentTensor;
pub use offload::TensorEntry;
pub use offload::TensorId;
pub use sampling::advanced::DryStage;
pub use sampling::advanced::EtaStage;
pub use sampling::advanced::TopAStage;
pub use sampling::advanced::TypicalPStage;
pub use sampling::advanced::XtcStage;
pub use sampling::chain::LogitBias;
pub use sampling::chain::SamplerChain;
pub use sampling::chain::SamplerStage;
pub use sampling::grammar::Grammar;
pub use sampling::grammar::GrammarError;
pub use sampling::grammar::GrammarState;
pub use sampling::grammar::JsonSchemaCompiler;
pub use sampling::sample;
pub use sampling::Sampler;
pub use sampling::SamplerConfig;
pub use scheduler::Scheduler;
pub use scheduler::SchedulerConfig;
pub use scheduler::MAX_DECODE_WAIT_MS;
pub use scheduler::PREFILL_CHUNK;
pub use sequence_pool::PoolError;
pub use sequence_pool::PoolResult;
pub use sequence_pool::SequencePool;
pub use sequence_pool::SequenceSlot;
pub use sequence_pool::SsmStatePool;
pub use speculative::SpeculativeConfig;
pub use speculative::SpeculativeDeltaSync;
pub use speculative::SpeculativeEngine;
pub use speculative_async::AsyncSpecConfig;
pub use speculative_async::RewindError;
pub use speculative_async::Rewindable;
pub use speculative_async::SpecStats;
pub use speculative_async::SpeculativeDecoder;
pub use tool_dispatch::no_op_dispatcher;
pub use tool_dispatch::NoOpDispatcher;
pub use tool_dispatch::ToolCall;
pub use tool_dispatch::ToolCallDetector;
pub use tool_dispatch::ToolCallGrammar;
pub use tool_dispatch::ToolDispatcher;
pub use tool_dispatch::ToolResult;
pub use tokenizer_bridge::TokenizerBridge;
Modules
- batched_attention: Batched decode-phase attention for continuous batching.
- beam_search: Beam search decoding for sequence generation.
- embedding: Embedding pooling modes and the pool_hidden_states kernel.
- engine: Main inference engine that orchestrates model loading and text generation.
- error: Error types for the inference runtime.
- flash_attention: Tiled flash-attention kernel for memory-efficient attention computation.
- kv_cache: Key-Value cache for transformer attention.
- kv_pool: Pooled KV-cache page allocator.
- lora_loader: Runtime API for applying a loaded LoRA adapter to an InferenceEngine.
- metrics: Engine metrics, with thread-safe counters for throughput and cache statistics.
- offload: CPU/disk offload with a pinned hot-layer set.
- sampling: Sampling strategies for next-token selection.
- scheduler: Continuous batching scheduler.
- sequence_pool: SSM runtime bridge, a polymorphic sequence-state pool.
- snapshot: Snapshot and resume for crate::engine::InferenceEngine sessions.
- speculative: Speculative decoding engine.
- speculative_async: Drafter-async speculative decoding.
- tokenizer_bridge: Tokenizer bridge that wraps the HuggingFace tokenizers crate.
- tool_dispatch: Tool-invocation runtime callbacks.
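As an illustration of what "sampling strategies for next-token selection" can involve, here is a minimal sketch of temperature scaling followed by top-k filtering. It is a generic technique, not the sampling module's actual implementation: `top_k_probs` and `pick` are hypothetical names, and the uniform random draw `u` is passed in by the caller so the sketch needs no external crate. It assumes `temperature > 0`.

```rust
/// Keep the `k` largest logits, apply temperature, and return
/// (token index, probability) pairs sorted by descending probability.
/// Hypothetical helper; assumes `temperature > 0`.
fn top_k_probs(logits: &[f32], k: usize, temperature: f32) -> Vec<(usize, f32)> {
    if logits.is_empty() {
        return Vec::new();
    }
    let mut indexed: Vec<(usize, f32)> = logits.iter().copied().enumerate().collect();
    // Stable sort, descending by logit; total_cmp gives a total order on f32.
    indexed.sort_by(|a, b| b.1.total_cmp(&a.1));
    indexed.truncate(k.max(1));
    // Softmax over the survivors, subtracting the max for numerical stability.
    let max = indexed[0].1;
    let weights: Vec<f32> = indexed
        .iter()
        .map(|&(_, l)| ((l - max) / temperature).exp())
        .collect();
    let sum: f32 = weights.iter().sum();
    indexed
        .iter()
        .zip(weights)
        .map(|(&(i, _), w)| (i, w / sum))
        .collect()
}

/// Inverse-CDF draw: `u` is a uniform sample in [0, 1) supplied by the caller.
/// Falls back to the last candidate if rounding leaves the total just below `u`.
fn pick(probs: &[(usize, f32)], u: f32) -> Option<usize> {
    let mut acc = 0.0;
    for &(i, p) in probs {
        acc += p;
        if u < acc {
            return Some(i);
        }
    }
    probs.last().map(|&(i, _)| i)
}
```

With logits `[1.0, 3.0, 2.0, 0.0]`, `k = 2`, and temperature 1.0, only tokens 1 and 2 survive, and token 1 receives the larger share of the probability mass. Lower temperatures sharpen the distribution toward the top candidate; higher ones flatten it.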
Structs
- KvSlot: A single request’s slot within the shared KV pool.
- LoadedLora: A fully loaded LoRA adapter, mapping tensor base names to their adapters.
- LoraStack: A stack of LoRA adapters applied in order.
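To make "a stack of LoRA adapters applied in order" concrete, the sketch below applies the standard LoRA update y = W x + (alpha / r) * B (A x) once per adapter. It is an assumption-laden illustration, not the layout or behaviour of LoadedLora or LoraStack: the `Adapter` struct, `matvec`, and `forward_with_stack` are hypothetical, matrices are plain row-major `Vec<Vec<f32>>`, and each adapter is assumed to have rank r >= 1. Whether the crate merges deltas into the weights or applies them at run time is not stated in the source.

```rust
/// Hypothetical low-rank adapter: `a` is r x d_in, `b` is d_out x r (row-major).
struct Adapter {
    a: Vec<Vec<f32>>,
    b: Vec<Vec<f32>>,
    alpha: f32,
}

/// Dense matrix-vector product for a row-major matrix.
fn matvec(m: &[Vec<f32>], x: &[f32]) -> Vec<f32> {
    m.iter()
        .map(|row| row.iter().zip(x).map(|(w, v)| w * v).sum::<f32>())
        .collect()
}

/// y = W x + sum_i (alpha_i / r_i) * B_i (A_i x), visiting adapters in stack order.
/// Assumes every adapter has rank >= 1 and dimensions consistent with `w`.
fn forward_with_stack(w: &[Vec<f32>], stack: &[Adapter], x: &[f32]) -> Vec<f32> {
    let mut y = matvec(w, x);
    for ad in stack {
        let rank = ad.a.len() as f32;
        let scale = ad.alpha / rank;
        // Project down to rank r, then back up to d_out.
        let delta = matvec(&ad.b, &matvec(&ad.a, x));
        for (yi, di) in y.iter_mut().zip(delta) {
            *yi += scale * di;
        }
    }
    y
}
```

Because the deltas are summed, the order of adapters does not change the result in this simplified additive form; ordering would matter only for schemes that compose adapters non-additively.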
Traits
- BatchedKvView: A view over the KV caches of multiple concurrent requests for batched decode attention.