oxicuda-lm — Large Language Model inference primitives.
This crate provides the model-layer abstractions for LLM inference: a BPE tokenizer, transformer layer building blocks with KV-cache, and complete GPT-2 and LLaMA-2/3 model implementations.
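As a rough illustration of what a BPE tokenizer does at encode time, the sketch below applies a ranked merge table to a sequence of symbols. It is a standalone toy, not the crate's `BpeTokenizer` API: the function name `bpe_merge` and the `ranks` table layout are assumptions made for this example.

```rust
use std::collections::HashMap;

/// Hypothetical helper (not part of oxicuda-lm): repeatedly merge the
/// adjacent symbol pair with the lowest rank until no known pair remains.
/// `ranks` maps a pair to its merge priority (lower merges first).
fn bpe_merge(
    mut symbols: Vec<String>,
    ranks: &HashMap<(String, String), usize>,
) -> Vec<String> {
    loop {
        // Scan adjacent pairs and remember the best (rank, index) seen so far.
        // The strict `<` keeps the leftmost pair when ranks tie.
        let mut best: Option<(usize, usize)> = None;
        for i in 0..symbols.len().saturating_sub(1) {
            let pair = (symbols[i].clone(), symbols[i + 1].clone());
            if let Some(&rank) = ranks.get(&pair) {
                if best.map_or(true, |(r, _)| rank < r) {
                    best = Some((rank, i));
                }
            }
        }
        // Stop once no adjacent pair appears in the merge table.
        let Some((_, i)) = best else { return symbols };
        // Fuse symbols[i] and symbols[i + 1] into one symbol.
        let right = symbols.remove(i + 1);
        symbols[i].push_str(&right);
    }
}
```

With the merges `("l", "o")` at rank 0 and `("lo", "w")` at rank 1, the symbols `l o w e r` collapse to `low e r`. A production tokenizer would map the resulting symbols to vocabulary IDs and typically uses a faster merge strategy than this quadratic scan.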
§Architecture overview
┌────────────────────────────────────────────────────┐
│                     oxicuda-lm                     │
│                                                    │
│  ┌──────────────┐  ┌────────────────────────────┐  │
│  │ tokenizer    │  │           layer            │  │
│  │              │  │ ┌────────────────────────┐ │  │
│  │ BpeTokenizer │  │ │ TokenEmbedding         │ │  │
│  │ Vocab        │  │ │ RotaryEmbedding (RoPE) │ │  │
│  └──────────────┘  │ │ MultiHeadAttention     │ │  │
│                    │ │   + LayerKvCache       │ │  │
│  ┌──────────────┐  │ │ MlpFfn / SwiGluFfn     │ │  │
│  │ config       │  │ │ RmsNorm / LayerNorm    │ │  │
│  │              │  │ │ GptBlock / LlamaBlock  │ │  │
│  │ GptConfig    │  │ │ PastKvCache            │ │  │
│  │ LlamaConfig  │  │ └────────────────────────┘ │  │
│  └──────────────┘  └────────────────────────────┘  │
│                                                    │
│  ┌──────────────────────────────────────────────┐  │
│  │                    model                     │  │
│  │ Gpt2Model  ─── forward → logits + cache      │  │
│  │ LlamaModel ─── forward → logits + cache      │  │
│  └──────────────────────────────────────────────┘  │
│                                                    │
│  ┌──────────────────────────────────────────────┐  │
│  │ ptx_kernels  (5 GPU kernel PTX strings)      │  │
│  │ weights      (ModelWeights, WeightTensor)    │  │
│  └──────────────────────────────────────────────┘  │
└────────────────────────────────────────────────────┘
§Design
- Pure Rust: no C/CUDA SDK at compile time.
- CPU reference implementations: all forward passes are pure-Rust CPU implementations suitable for testing. GPU acceleration is provided by the PTX kernel strings (see ptx_kernels) once a CUDA driver is available at runtime.
- No unwrap() in library code.
- KV cache: all attention layers return an updated layer::PastKvCache so incremental decoding is fully supported.
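To make the KV-cache and no-unwrap() points concrete, here is a minimal single-head sketch of one incremental decoding step on the CPU. It is a sketch under stated assumptions, not the crate's implementation: `ToyKvCache`, `ToyError`, and `attend_step` are hypothetical names, and keys, values, and queries are assumed to share one dimension.

```rust
/// Hypothetical single-head cache: one key and one value vector per past token.
#[derive(Default)]
struct ToyKvCache {
    keys: Vec<Vec<f32>>,
    values: Vec<Vec<f32>>,
}

/// Hypothetical error type; shape problems are returned instead of panicking.
#[derive(Debug, PartialEq)]
enum ToyError {
    DimMismatch { expected: usize, got: usize },
}

/// One decoding step: append the new token's key/value to the cache, then
/// run scaled dot-product attention of `q` over every cached position.
fn attend_step(
    cache: &mut ToyKvCache,
    q: &[f32],
    k: Vec<f32>,
    v: Vec<f32>,
) -> Result<Vec<f32>, ToyError> {
    let d = q.len();
    for got in [k.len(), v.len()] {
        if got != d {
            return Err(ToyError::DimMismatch { expected: d, got });
        }
    }
    cache.keys.push(k);
    cache.values.push(v);

    // Scores against all cached keys, scaled by sqrt(d).
    let scale = (d as f32).sqrt();
    let scores: Vec<f32> = cache
        .keys
        .iter()
        .map(|key| key.iter().zip(q).map(|(a, b)| a * b).sum::<f32>() / scale)
        .collect();

    // Numerically stable softmax over the scores.
    let max = scores.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = scores.iter().map(|s| (s - max).exp()).collect();
    let total: f32 = exps.iter().sum();

    // Output is the softmax-weighted sum of cached values.
    let mut out = vec![0.0; d];
    for (w, value) in exps.iter().zip(&cache.values) {
        for (o, x) in out.iter_mut().zip(value) {
            *o += (w / total) * x;
        }
    }
    Ok(out)
}
```

Because the cache only ever holds the current and earlier tokens, each step attends causally without an explicit mask, and per-step cost grows with the cached length rather than requiring the whole prefix to be recomputed.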
Re-exports§
pub use config::GptConfig;
pub use config::LlamaConfig;
pub use error::LmError;
pub use error::LmResult;
pub use handle::LmHandle;
pub use handle::SmVersion;
pub use layer::LayerKvCache;
pub use layer::LayerNorm;
pub use layer::LearnedPositionalEmbedding;
pub use layer::MlpFfn;
pub use layer::MultiHeadAttention;
pub use layer::PastKvCache;
pub use layer::RmsNorm;
pub use layer::RotaryEmbedding;
pub use layer::SwiGluFfn;
pub use layer::TokenEmbedding;
pub use model::Gpt2Model;
pub use model::LlamaModel;
pub use tokenizer::BpeBuilder;
pub use tokenizer::BpeTokenizer;
pub use tokenizer::Vocab;
pub use weights::ModelWeights;
pub use weights::WeightTensor;
Modules§
- config: Model configurations for GPT-2 and LLaMA family models.
- error: Error types for the oxicuda-lm crate.
- handle: Session handle for oxicuda-lm.
- layer: Transformer layer building blocks.
- model: Complete LLM model implementations.
- ptx_kernels: PTX GPU kernel sources for LLM operations.
- tokenizer: BPE tokenizer and vocabulary management.
- weights: Model weight storage.
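Among the layer building blocks, rotary position embedding (RoPE) is the least self-explanatory, so a standalone CPU sketch may help. The function name `apply_rope` is hypothetical, and the interleaved pairing of `(x[2i], x[2i+1])` is an assumption: some implementations instead pair index `i` with `i + d/2`, and the crate's `RotaryEmbedding` may use either convention.

```rust
/// Hypothetical helper (not part of oxicuda-lm): rotate consecutive pairs
/// (x[2i], x[2i+1]) in place by the angle pos * base^(-2i/d), d = x.len().
/// Returns an error and leaves `x` untouched if the length is odd.
fn apply_rope(x: &mut [f32], pos: usize, base: f32) -> Result<(), &'static str> {
    let d = x.len();
    if d % 2 != 0 {
        return Err("RoPE needs an even head dimension");
    }
    for (i, pair) in x.chunks_exact_mut(2).enumerate() {
        // Each pair gets its own frequency; later pairs rotate more slowly.
        let freq = base.powf(-(2.0 * i as f32) / d as f32);
        let (sin, cos) = (pos as f32 * freq).sin_cos();
        // Standard 2-D rotation of the pair (a, b).
        let (a, b) = (pair[0], pair[1]);
        pair[0] = a * cos - b * sin;
        pair[1] = a * sin + b * cos;
    }
    Ok(())
}
```

Position 0 leaves the vector unchanged, and every position preserves the vector's norm because each pair undergoes a pure rotation. A base of 10000.0 is the commonly used value; whether the crate's configs default to it is not stated here.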