Crate oxicuda_lm

oxicuda-lm — Large Language Model inference primitives.

This crate provides the model-layer abstractions for LLM inference: a BPE tokenizer, transformer layer building blocks with KV-cache, and complete GPT-2 and LLaMA-2/3 model implementations.

Architecture overview

 ┌─────────────────────────────────────────────────┐
 │               oxicuda-lm                        │
 │                                                 │
 │  ┌────────────┐  ┌──────────────────────────┐  │
 │  │ tokenizer  │  │       layer              │  │
 │  │            │  │  ┌──────────────────────┐│  │
 │  │ BpeTokenizer│  │  │ TokenEmbedding       ││  │
 │  │ Vocab      │  │  │ RotaryEmbedding (RoPE)││  │
 │  └────────────┘  │  │ MultiHeadAttention   ││  │
 │                  │  │   + LayerKvCache      ││  │
 │  ┌────────────┐  │  │ MlpFfn / SwiGluFfn   ││  │
 │  │  config    │  │  │ RmsNorm / LayerNorm   ││  │
 │  │            │  │  │ GptBlock / LlamaBlock ││  │
 │  │ GptConfig  │  │  │ PastKvCache          ││  │
 │  │ LlamaConfig│  │  └──────────────────────┘│  │
 │  └────────────┘  └──────────────────────────┘  │
 │                                                 │
 │  ┌────────────────────────────────────────────┐│
 │  │                 model                      ││
 │  │  Gpt2Model  ─── forward → logits + cache   ││
 │  │  LlamaModel ─── forward → logits + cache   ││
 │  └────────────────────────────────────────────┘│
 │                                                 │
 │  ┌────────────────────────────────────────────┐│
 │  │  ptx_kernels (5 GPU kernel PTX strings)    ││
 │  │  weights (ModelWeights, WeightTensor)       ││
 │  └────────────────────────────────────────────┘│
 └─────────────────────────────────────────────────┘
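The layer box lists both RmsNorm and LayerNorm. As a rough illustration of how the two differ, here is a minimal CPU sketch. The function names `layer_norm` and `rms_norm` are hypothetical and are not taken from this crate, and the learned scale and shift parameters are left out to keep it short.

```rust
// Hypothetical free functions for illustration; NOT the oxicuda-lm API.
// Learned scale/shift parameters are omitted on purpose.

/// LayerNorm core: subtract the mean, then divide by the standard deviation.
/// Assumes `x` is non-empty.
fn layer_norm(x: &[f32], eps: f32) -> Vec<f32> {
    let n = x.len() as f32;
    let mean = x.iter().sum::<f32>() / n;
    let var = x.iter().map(|v| (v - mean).powi(2)).sum::<f32>() / n;
    let inv_std = 1.0 / (var + eps).sqrt();
    x.iter().map(|v| (v - mean) * inv_std).collect()
}

/// RMSNorm core: divide by the root-mean-square, with no mean subtraction.
/// Assumes `x` is non-empty.
fn rms_norm(x: &[f32], eps: f32) -> Vec<f32> {
    let n = x.len() as f32;
    let mean_sq = x.iter().map(|v| v * v).sum::<f32>() / n;
    let inv_rms = 1.0 / (mean_sq + eps).sqrt();
    x.iter().map(|v| v * inv_rms).collect()
}
```

LayerNorm output is centered around zero, while RMSNorm only rescales the input, so the mean of the input survives up to a scale factor.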

Design

Pure Rust: no C toolchain or CUDA SDK is required at compile time.
CPU reference implementations: all forward passes are pure-Rust CPU implementations suitable for testing. The PTX kernel strings (see ptx_kernels) provide GPU acceleration when a CUDA driver is available at runtime.
No unwrap() in library code.
KV cache: all attention layers return an updated layer::PastKvCache so incremental decoding is fully supported.
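To make the KV-cache point concrete, the following is a minimal single-head sketch of incremental decoding. It does not use this crate's types: `ToyKvCache`, `attend`, `decode_step`, and `full_last_token` are hypothetical names, and the query, key, and value projections are taken to be the identity so the example stays short.

```rust
// Hypothetical toy types for illustration; NOT the oxicuda-lm API.
// Assumption: Q, K and V projections are the identity (q = k = v = x).

/// Keys and values seen so far for one attention layer.
#[derive(Default)]
struct ToyKvCache {
    keys: Vec<Vec<f32>>,
    values: Vec<Vec<f32>>,
}

fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Scaled dot-product attention from one query over the given positions.
fn attend(q: &[f32], keys: &[Vec<f32>], values: &[Vec<f32>]) -> Vec<f32> {
    let scale = 1.0 / (q.len() as f32).sqrt();
    let scores: Vec<f32> = keys.iter().map(|k| dot(q, k) * scale).collect();
    // Softmax with max subtraction for numerical stability.
    let max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = scores.iter().map(|s| (s - max).exp()).collect();
    let denom: f32 = exps.iter().sum();
    let width = values.first().map_or(0, |v| v.len());
    let mut out = vec![0.0; width];
    for (w, v) in exps.iter().zip(values) {
        for (o, x) in out.iter_mut().zip(v) {
            *o += (w / denom) * x;
        }
    }
    out
}

/// One decoding step: append the new token to the cache, then attend over it.
/// Only the new token is processed; earlier keys and values are reused.
fn decode_step(x: &[f32], cache: &mut ToyKvCache) -> Vec<f32> {
    cache.keys.push(x.to_vec());
    cache.values.push(x.to_vec());
    attend(x, &cache.keys, &cache.values)
}

/// Reference without a cache: recompute attention for the last token
/// over the whole prefix. Returns `None` for an empty sequence.
fn full_last_token(xs: &[Vec<f32>]) -> Option<Vec<f32>> {
    let q = xs.last()?;
    Some(attend(q, xs, xs))
}
```

Feeding tokens one at a time through `decode_step` gives the same last-token output as `full_last_token` on the whole sequence, which is the property that makes cached incremental decoding valid.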

Re-exports

pub use config::GptConfig;
pub use config::LlamaConfig;
pub use error::LmError;
pub use error::LmResult;
pub use handle::LmHandle;
pub use handle::SmVersion;
pub use layer::LayerKvCache;
pub use layer::LayerNorm;
pub use layer::LearnedPositionalEmbedding;
pub use layer::MlpFfn;
pub use layer::MultiHeadAttention;
pub use layer::PastKvCache;
pub use layer::RmsNorm;
pub use layer::RotaryEmbedding;
pub use layer::SwiGluFfn;
pub use layer::TokenEmbedding;
pub use model::Gpt2Model;
pub use model::LlamaModel;
pub use tokenizer::BpeBuilder;
pub use tokenizer::BpeTokenizer;
pub use tokenizer::Vocab;
pub use weights::ModelWeights;
pub use weights::WeightTensor;

Modules

config: Model configurations for GPT-2 and LLaMA family models.
error: Error types for the oxicuda-lm crate.
handle: Session handle for oxicuda-lm.
layer: Transformer layer building blocks.
model: Complete LLM model implementations.
ptx_kernels: PTX GPU kernel sources for LLM operations.
tokenizer: BPE tokenizer and vocabulary management.
weights: Model weight storage.
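For readers unfamiliar with what a BPE tokenizer does, here is a toy sketch of the merge loop. It is an assumption-laden illustration, not the tokenizer module's implementation: `bpe_encode` and the `ranks` table are hypothetical, the input is split into single characters, and one pair occurrence is merged per iteration.

```rust
use std::collections::HashMap;

// Hypothetical helper for illustration; NOT the oxicuda-lm tokenizer API.

/// Split `word` into characters, then repeatedly merge the adjacent pair
/// with the lowest rank in `ranks` until no known pair remains.
/// Ties on rank are broken by taking the leftmost pair.
fn bpe_encode(word: &str, ranks: &HashMap<(String, String), usize>) -> Vec<String> {
    let mut symbols: Vec<String> = word.chars().map(|c| c.to_string()).collect();
    loop {
        // Find (rank, index) of the best mergeable adjacent pair, if any.
        let best = symbols
            .windows(2)
            .enumerate()
            .filter_map(|(i, w)| {
                ranks.get(&(w[0].clone(), w[1].clone())).map(|r| (*r, i))
            })
            .min();
        let i = match best {
            Some((_, i)) => i,
            None => break,
        };
        // Replace the pair at positions i and i + 1 with the merged symbol.
        let merged = format!("{}{}", symbols[i], symbols[i + 1]);
        symbols[i] = merged;
        symbols.remove(i + 1);
    }
    symbols
}
```

With merges ranked `("l", "o")`, then `("lo", "w")`, then `("e", "r")`, the word `lower` is split into `low` and `er`. A real tokenizer would then map each symbol to an id through its vocabulary.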
