Crate oxillama_runtime

oxillama-runtime

Inference runtime for OxiLLaMa.

Orchestrates the complete inference pipeline: model loading, tokenization, forward pass execution, KV caching, and token sampling.
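
The stages named above can be pictured with a small, self-contained sketch. It does not use this crate's API: `ToyKvCache`, `toy_tokenize`, `toy_forward`, `greedy_sample`, and `toy_generate` are hypothetical stand-ins invented for illustration, and the "model" is a trivial arithmetic rule rather than a transformer. It only shows how tokenization, prefill, a per-token forward pass that grows a cache, and sampling typically fit together in a decode loop.

```rust
/// Hypothetical stand-in for a KV cache: one entry per processed token.
#[derive(Default)]
struct ToyKvCache {
    entries: Vec<u32>,
}

/// Hypothetical byte-level tokenizer: each UTF-8 byte becomes one token id.
fn toy_tokenize(text: &str) -> Vec<u32> {
    text.bytes().map(u32::from).collect()
}

/// Hypothetical forward pass. Appends `token` to the cache, then returns
/// logits that favor (sum of cached tokens) mod `vocab_size`.
/// Assumes `vocab_size > 0`.
fn toy_forward(token: u32, cache: &mut ToyKvCache, vocab_size: usize) -> Vec<f32> {
    cache.entries.push(token);
    let sum: u32 = cache.entries.iter().sum();
    let favored = sum as usize % vocab_size;
    let mut logits = vec![0.0; vocab_size];
    logits[favored] = 1.0;
    logits
}

/// Greedy sampling: returns the index of the largest logit
/// (the first one on ties, 0 for an empty slice).
fn greedy_sample(logits: &[f32]) -> u32 {
    let mut best = 0;
    for (i, &l) in logits.iter().enumerate() {
        if l > logits[best] {
            best = i;
        }
    }
    best as u32
}

/// Toy decode loop: tokenize, prefill the cache with the prompt,
/// then alternate sampling and single-token forward passes.
fn toy_generate(prompt: &str, max_new_tokens: usize, vocab_size: usize) -> Vec<u32> {
    let mut cache = ToyKvCache::default();
    let mut logits = vec![0.0; vocab_size];

    // Prefill: run every prompt token through the model once.
    for &t in &toy_tokenize(prompt) {
        logits = toy_forward(t, &mut cache, vocab_size);
    }

    // Decode: each step reuses the cache instead of reprocessing the prompt.
    let mut out = Vec::with_capacity(max_new_tokens);
    for _ in 0..max_new_tokens {
        let next = greedy_sample(&logits);
        out.push(next);
        logits = toy_forward(next, &mut cache, vocab_size);
    }
    out
}
```

Calling `toy_generate("ab", 3, 256)` walks through prefill over the two prompt bytes and then three decode steps. In the real crate these roles are presumably covered by items such as `TokenizerBridge`, `InferenceEngine`, `KvCache`, and `Sampler`, whose actual signatures are documented on their own pages.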

Re-exports

pub use batched_attention::batched_flash_attention;
pub use beam_search::beam_generate;
pub use beam_search::BeamForwardPass;
pub use beam_search::BeamHypothesis;
pub use beam_search::BeamSearchConfig;
pub use beam_search::EngineBeamAdapter;
pub use embedding::PoolingMode;
pub use engine::EngineConfig;
pub use engine::InferenceEngine;
pub use engine::FLASH_ATTN_THRESHOLD;
pub use error::RuntimeError;
pub use error::RuntimeResult;
pub use flash_attention::flash_attention;
pub use flash_attention::flash_attention_forward;
pub use flash_attention::flash_attention_gqa;
pub use flash_attention::flash_attention_multi_head;
pub use flash_attention::FlashAttentionConfig;
pub use kv_cache::prefix::CachedKvState;
pub use kv_cache::prefix::PrefixCacheConfig;
pub use kv_cache::prefix::PrefixKvCache;
pub use kv_cache::KvCache;
pub use kv_cache::KvCacheSnapshot;
pub use kv_cache::VecBatchedKvView;
pub use kv_pool::KvCachePool;
pub use lora_loader::apply_lora;
pub use metrics::EngineMetrics;
pub use metrics::MetricsSnapshot;
pub use offload::FilePagerSource;
pub use offload::LayerPager;
pub use offload::MemoryPressureProbe;
pub use offload::OffloadPolicy;
pub use offload::PagerSource;
pub use offload::ResidentTensor;
pub use offload::TensorEntry;
pub use offload::TensorId;
pub use sampling::advanced::DryStage;
pub use sampling::advanced::EtaStage;
pub use sampling::advanced::TopAStage;
pub use sampling::advanced::TypicalPStage;
pub use sampling::advanced::XtcStage;
pub use sampling::chain::LogitBias;
pub use sampling::chain::SamplerChain;
pub use sampling::chain::SamplerStage;
pub use sampling::grammar::Grammar;
pub use sampling::grammar::GrammarError;
pub use sampling::grammar::GrammarState;
pub use sampling::grammar::JsonSchemaCompiler;
pub use sampling::sample;
pub use sampling::Sampler;
pub use sampling::SamplerConfig;
pub use scheduler::Scheduler;
pub use scheduler::SchedulerConfig;
pub use scheduler::MAX_DECODE_WAIT_MS;
pub use scheduler::PREFILL_CHUNK;
pub use sequence_pool::PoolError;
pub use sequence_pool::PoolResult;
pub use sequence_pool::SequencePool;
pub use sequence_pool::SequenceSlot;
pub use sequence_pool::SsmStatePool;
pub use speculative::SpeculativeConfig;
pub use speculative::SpeculativeDeltaSync;
pub use speculative::SpeculativeEngine;
pub use speculative_async::AsyncSpecConfig;
pub use speculative_async::RewindError;
pub use speculative_async::Rewindable;
pub use speculative_async::SpecStats;
pub use speculative_async::SpeculativeDecoder;
pub use tool_dispatch::no_op_dispatcher;
pub use tool_dispatch::NoOpDispatcher;
pub use tool_dispatch::ToolCall;
pub use tool_dispatch::ToolCallDetector;
pub use tool_dispatch::ToolCallGrammar;
pub use tool_dispatch::ToolDispatcher;
pub use tool_dispatch::ToolResult;
pub use tokenizer_bridge::TokenizerBridge;

Modules

batched_attention: Batched decode-phase attention for continuous batching.
beam_search: Beam search decoding for sequence generation.
embedding: Embedding pooling modes and the pool_hidden_states kernel.
engine: Main inference engine — orchestrates model loading and text generation.
error: Error types for the inference runtime.
flash_attention: Tiled flash-attention kernel for memory-efficient attention computation.
kv_cache: Key-Value cache for transformer attention.
kv_pool: Pooled KV-cache page allocator.
lora_loader: Runtime API for applying a loaded LoRA adapter to an InferenceEngine.
metrics: Engine metrics — thread-safe counters for throughput and cache statistics.
offload: CPU/disk offload with a pinned hot-layer set.
sampling: Sampling strategies for next-token selection.
scheduler: Continuous batching scheduler.
sequence_pool: SSM runtime bridge — polymorphic sequence-state pool.
snapshot: Snapshot and resume for crate::engine::InferenceEngine sessions.
speculative: Speculative decoding engine.
speculative_async: Drafter-async speculative decoding.
tokenizer_bridge: Tokenizer bridge that wraps the HuggingFace tokenizers crate.
tool_dispatch: Tool-invocation runtime callbacks.

Structs

KvSlot: A single request’s slot within the shared KV pool.
LoadedLora: A fully loaded LoRA adapter, mapping tensor base names to their adapters.
LoraStack: A stack of LoRA adapters applied in order.

Traits

BatchedKvView: A view over the KV caches of multiple concurrent requests for batched decode attention.
