Crate oxicuda_lm

oxicuda-lm — Large Language Model inference primitives.

This crate provides the model-layer abstractions for LLM inference: a BPE tokenizer, transformer layer building blocks with KV-cache, and complete GPT-2 and LLaMA-2/3 model implementations.

§Architecture overview

 ┌────────────────────────────────────────────────────┐
 │                     oxicuda-lm                     │
 │                                                    │
 │  ┌──────────────┐  ┌────────────────────────────┐  │
 │  │ tokenizer    │  │           layer            │  │
 │  │              │  │  ┌────────────────────────┐│  │
 │  │ BpeTokenizer │  │  │ TokenEmbedding         ││  │
 │  │ Vocab        │  │  │ RotaryEmbedding (RoPE) ││  │
 │  └──────────────┘  │  │ MultiHeadAttention     ││  │
 │                    │  │   + LayerKvCache       ││  │
 │  ┌──────────────┐  │  │ MlpFfn / SwiGluFfn     ││  │
 │  │  config      │  │  │ RmsNorm / LayerNorm    ││  │
 │  │              │  │  │ GptBlock / LlamaBlock  ││  │
 │  │ GptConfig    │  │  │ PastKvCache            ││  │
 │  │ LlamaConfig  │  │  └────────────────────────┘│  │
 │  └──────────────┘  └────────────────────────────┘  │
 │                                                    │
 │  ┌──────────────────────────────────────────────┐  │
 │  │                    model                     │  │
 │  │  Gpt2Model  ─── forward → logits + cache     │  │
 │  │  LlamaModel ─── forward → logits + cache     │  │
 │  └──────────────────────────────────────────────┘  │
 │                                                    │
 │  ┌──────────────────────────────────────────────┐  │
 │  │  ptx_kernels (5 GPU kernel PTX strings)      │  │
 │  │  weights (ModelWeights, WeightTensor)        │  │
 │  └──────────────────────────────────────────────┘  │
 └────────────────────────────────────────────────────┘
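
As an illustration of the RotaryEmbedding (RoPE) block in the diagram, the following standalone sketch rotates one attention-head vector by a position-dependent angle. It is not the crate's API: the function name apply_rope, the adjacent-pair layout, and the base parameter are assumptions made for this example (some implementations pair element i with element i + d/2 instead).

```rust
/// Hypothetical helper, not part of oxicuda-lm.
/// Rotates each consecutive pair (x[2i], x[2i+1]) of one head vector by the
/// angle pos * base^(-2i/d), where d is the head dimension.
fn apply_rope(x: &mut [f32], pos: usize, base: f32) {
    let d = x.len();
    for i in 0..d / 2 {
        // Lower pair indices rotate faster; higher ones rotate slower.
        let theta = pos as f32 * base.powf(-2.0 * i as f32 / d as f32);
        let (sin, cos) = theta.sin_cos();
        let (a, b) = (x[2 * i], x[2 * i + 1]);
        // Standard 2-D rotation of the pair (a, b) by theta.
        x[2 * i] = a * cos - b * sin;
        x[2 * i + 1] = a * sin + b * cos;
    }
}
```

Because every pair is rotated by an angle proportional to the token position, the dot product between a rotated query and a rotated key depends only on the distance between their positions, which is the property attention relies on.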

§Design

  • Pure Rust: no C or CUDA SDK is required at compile time.
  • CPU reference implementations: all forward passes are pure-Rust CPU implementations suitable for testing. GPU acceleration comes from the PTX kernel strings (see ptx_kernels), which become usable once a CUDA driver is available at runtime.
  • No unwrap() in library code.
  • KV cache: all attention layers return an updated layer::PastKvCache so incremental decoding is fully supported.
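
To show why returning an updated cache makes incremental decoding possible, here is a minimal single-head sketch. ToyKvCache and attend_incremental are hypothetical names and do not reflect the crate's LayerKvCache or PastKvCache types; the sketch assumes only that a cache stores one key row and one value row per previously seen token.

```rust
/// Hypothetical per-layer cache (illustration only, not the crate's type):
/// one key row and one value row for every token seen so far.
#[derive(Default, Clone)]
struct ToyKvCache {
    keys: Vec<Vec<f32>>,
    values: Vec<Vec<f32>>,
}

/// Single-head scaled dot-product attention for the newest token only.
/// Appends (k, v) to the cache, then attends over every cached position.
fn attend_incremental(q: &[f32], k: &[f32], v: &[f32], cache: &mut ToyKvCache) -> Vec<f32> {
    cache.keys.push(k.to_vec());
    cache.values.push(v.to_vec());

    // Scores of the new query against all cached keys.
    let scale = 1.0 / (q.len() as f32).sqrt();
    let scores: Vec<f32> = cache
        .keys
        .iter()
        .map(|key| q.iter().zip(key).map(|(a, b)| a * b).sum::<f32>() * scale)
        .collect();

    // Numerically stable softmax over the cached positions.
    let max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = scores.iter().map(|s| (s - max).exp()).collect();
    let denom: f32 = exps.iter().sum();

    // Weighted sum of cached values.
    let mut out = vec![0.0; v.len()];
    for (w, value) in exps.iter().zip(&cache.values) {
        for (o, x) in out.iter_mut().zip(value) {
            *o += (w / denom) * x;
        }
    }
    out
}
```

Each decoding step passes in only the newest token's query, key, and value and reuses the rows already in the cache, so earlier tokens never need to be re-projected.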

Re-exports§

pub use config::GptConfig;
pub use config::LlamaConfig;
pub use error::LmError;
pub use error::LmResult;
pub use handle::LmHandle;
pub use handle::SmVersion;
pub use layer::LayerKvCache;
pub use layer::LayerNorm;
pub use layer::LearnedPositionalEmbedding;
pub use layer::MlpFfn;
pub use layer::MultiHeadAttention;
pub use layer::PastKvCache;
pub use layer::RmsNorm;
pub use layer::RotaryEmbedding;
pub use layer::SwiGluFfn;
pub use layer::TokenEmbedding;
pub use model::Gpt2Model;
pub use model::LlamaModel;
pub use tokenizer::BpeBuilder;
pub use tokenizer::BpeTokenizer;
pub use tokenizer::Vocab;
pub use weights::ModelWeights;
pub use weights::WeightTensor;

Modules§

config
Model configurations for GPT-2 and LLaMA family models.
error
Error types for the oxicuda-lm crate.
handle
Session handle for oxicuda-lm.
layer
Transformer layer building blocks.
model
Complete LLM model implementations.
ptx_kernels
PTX GPU kernel sources for LLM operations.
tokenizer
BPE tokenizer and vocabulary management.
weights
Model weight storage.
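
For readers unfamiliar with the BPE scheme that the tokenizer module is built around, the sketch below applies ranked merges to a single word. It is a generic illustration and not the BpeTokenizer or BpeBuilder API: the bpe_merge function, the shape of the ranks table, and the character-level starting alphabet are all assumptions.

```rust
use std::collections::HashMap;

/// Hypothetical helper, not part of oxicuda-lm.
/// Applies BPE merges to one word. `ranks` maps an adjacent symbol pair to
/// its merge priority; a lower rank is merged earlier.
fn bpe_merge(word: &str, ranks: &HashMap<(String, String), usize>) -> Vec<String> {
    // Start from individual characters.
    let mut symbols: Vec<String> = word.chars().map(|c| c.to_string()).collect();
    loop {
        // Find the adjacent pair with the lowest rank (leftmost on ties).
        let best = symbols
            .windows(2)
            .enumerate()
            .filter_map(|(i, pair)| {
                ranks
                    .get(&(pair[0].clone(), pair[1].clone()))
                    .map(|&rank| (rank, i))
            })
            .min();
        // Stop when no adjacent pair appears in the merge table.
        let Some((_, i)) = best else { break };
        // Merge the pair at position i into a single symbol.
        let merged = format!("{}{}", symbols[i], symbols[i + 1]);
        symbols[i] = merged;
        symbols.remove(i + 1);
    }
    symbols
}
```

With a table that ranks ("l", "o") first and ("lo", "w") second, the word "lower" is reduced to the symbols "low", "e", "r"; a production tokenizer would then map each symbol to an id through its vocabulary.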