<div align="center">
# boostr
<h3>ML framework in Rust. Write once. Run on any backend.</h3>
<p>
Production-grade LLM primitives — flash attention, quantization, MoE, state-space
models, KV caching, and distributed training — built on
<a href="https://github.com/ml-rust/numr">numr</a> so the same code runs on CPU, CUDA, and WebGPU.
</p>
<p>
<a href="https://docs.rs/boostr"><strong>Docs</strong></a>
·
<a href="https://crates.io/crates/boostr"><strong>Crate</strong></a>
·
<a href="#module-overview"><strong>Modules</strong></a>
·
<a href="#basic-usage"><strong>Example</strong></a>
·
<a href="CONTRIBUTING.md"><strong>Contributing</strong></a>
</p>
<p>
<a href="https://discord.gg/jBhFk9kHPg">
<img src="https://img.shields.io/discord/1453357769720594543?label=Discord&logo=discord&logoColor=white&color=5865F2" alt="Join the Discord">
</a>
</p>
<p>
<a href="https://github.com/ml-rust/boostr/actions/workflows/ci.yml">
<img src="https://img.shields.io/github/actions/workflow/status/ml-rust/boostr/ci.yml?branch=main&label=ci" alt="CI status">
</a>
<a href="https://crates.io/crates/boostr">
<img src="https://img.shields.io/crates/v/boostr" alt="crates.io">
</a>
<a href="https://docs.rs/boostr">
<img src="https://img.shields.io/docsrs/boostr" alt="docs.rs">
</a>
<a href="LICENSE">
<img src="https://img.shields.io/crates/l/boostr" alt="License">
</a>
<a href="https://github.com/ml-rust/boostr/stargazers">
<img src="https://img.shields.io/github/stars/ml-rust/boostr?style=social" alt="GitHub stars">
</a>
</p>
</div>
boostr extends [numr](https://github.com/ml-rust/numr) with production-grade ML primitives. It provides attention mechanisms, quantization support, model architectures, and inference infrastructure — all built on numr's foundational tensors, runtimes, and ops. No reimplementation. No wrappers. Pure extension traits.
## Why boostr
- **One codebase, every backend.** Write once against numr's `Runtime`; run on CPU (SIMD), CUDA (PTX), or WebGPU (WGSL) by switching a feature flag — no per-device dispatch, no rewrite.
- **No vendor lock-in.** Every kernel is native — no cuBLAS, cuDNN, or MKL. Flash attention, quantized matmul, and fused optimizers are all hand-written per backend.
- **Backends are a foundation concern.** Hardware support lives in numr, so new backends added there flow up to boostr automatically — the abstraction is built in, not bolted on per device.
- **Train and serve from one stack.** The same primitives power [oxidizr](https://github.com/ml-rust/oxidizr) training and [blazr](https://github.com/ml-rust/blazr) inference — no Python runtime, single-binary deployment.
## Who it's for
- **LLM trainers** — distributed training with ZeRO (stages 1/2/3), tensor and pipeline parallelism (1F1B, GPipe, ZeroBubble), mixed precision, and fused optimizers.
- **Inference engineers** — flash attention v2/v3, paged KV cache, continuous batching, speculative decoding, and prefix caching for high-throughput serving.
- **Quantization & compression researchers** — 26 GGUF-compatible formats with a dedicated `QuantTensor` type and per-backend dequant / quantized-matmul kernels.
- **Architecture researchers** — LLaMA, Mamba2 (SSD kernels), and hybrid transformer/SSM models, with an extensible system for custom architectures.
- **WebAssembly & edge developers** — the WebGPU backend targets consumer GPUs (Vulkan/Metal/DX12) with no CUDA dependency.
## Key Capabilities
### Quantization
- **26 formats** (GGUF-compatible): Q4_0, Q4_1, Q5_0, Q5_1, Q8_0, Q8_1, Q2K–Q8K, IQ1S–IQ4XS, TQ1_0, TQ2_0
- **QuantTensor type** for block-quantized data
- **Per-backend kernels**: Native SIMD (CPU), PTX (CUDA), WGSL (WebGPU)
- Zero-copy GGUF loading with memory mapping
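
To make "block-quantized" concrete, here is a minimal, std-only sketch of a Q8_0-style scheme: 32 values share one scale, and each value is stored as a signed 8-bit integer. It is an illustration, not boostr's `QuantTensor` layout. The names `BlockQ8`, `quantize_q8`, and `dequantize_q8` are hypothetical, and the scale is kept as `f32` for simplicity where GGUF's Q8_0 stores it as `f16`.

```rust
/// Number of values that share one scale.
const BLOCK: usize = 32;

/// One quantized block (hypothetical type, not boostr's `QuantTensor`).
/// Assumption: GGUF Q8_0 stores the scale as f16; f32 keeps this sketch std-only.
#[derive(Debug, Clone)]
struct BlockQ8 {
    scale: f32,
    qs: [i8; BLOCK],
}

/// Quantize a slice whose length is a multiple of `BLOCK`.
fn quantize_q8(x: &[f32]) -> Vec<BlockQ8> {
    assert!(x.len() % BLOCK == 0, "length must be a multiple of the block size");
    x.chunks_exact(BLOCK)
        .map(|chunk| {
            // The largest magnitude in the block maps to +/-127.
            let amax = chunk.iter().fold(0.0f32, |m, v| m.max(v.abs()));
            let scale = amax / 127.0;
            let inv = if scale > 0.0 { 1.0 / scale } else { 0.0 };
            let mut qs = [0i8; BLOCK];
            for (q, v) in qs.iter_mut().zip(chunk) {
                *q = (v * inv).round() as i8;
            }
            BlockQ8 { scale, qs }
        })
        .collect()
}

/// Reverse the mapping: value = q * scale.
fn dequantize_q8(blocks: &[BlockQ8]) -> Vec<f32> {
    blocks
        .iter()
        .flat_map(|b| b.qs.iter().map(move |&q| q as f32 * b.scale))
        .collect()
}

fn main() {
    let x: Vec<f32> = (0..64).map(|i| (i as f32 * 0.37).sin()).collect();
    let blocks = quantize_q8(&x);
    let y = dequantize_q8(&blocks);
    let max_err = x.iter().zip(&y).fold(0.0f32, |m, (a, b)| m.max((a - b).abs()));
    println!("blocks = {}, max abs error = {}", blocks.len(), max_err);
}
```

The round-trip error per value is bounded by half the block's scale, which is why smaller blocks or extra per-block metadata (as in the K-quant formats) trade storage for accuracy.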
### Attention
- **Flash Attention v2/v3** with fused QKV projection
- **Multi-Head Latent Attention (MLA)** — compressed KV cache
- **Grouped Query Attention (GQA)** and multi-head variants
- **Paged attention** for memory-efficient inference
- **Variable-length attention** with ragged tensors
- **Prefix caching** for context reuse
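
The idea that lets flash-style attention avoid materializing the full score matrix is an online softmax: keys are processed in tiles while a running maximum, denominator, and output are rescaled as needed. The sketch below shows that recurrence for a single query with scalar values. It is a plain-Rust illustration of the general technique, not boostr's kernel, and the function names are hypothetical.

```rust
/// Two-pass reference: sum_i softmax(scores)_i * values_i.
fn attend_reference(scores: &[f32], values: &[f32]) -> f32 {
    let m = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let denom: f32 = scores.iter().map(|s| (s - m).exp()).sum();
    scores
        .iter()
        .zip(values)
        .map(|(s, v)| (s - m).exp() / denom * v)
        .sum()
}

/// Streaming version: walk the keys in tiles, keeping a running max `m`,
/// running denominator `l`, and running unnormalized output `acc`.
/// Returns (output, log-sum-exp of the scores).
fn attend_online(scores: &[f32], values: &[f32], tile: usize) -> (f32, f32) {
    assert!(!scores.is_empty() && scores.len() == values.len() && tile > 0);
    let (mut m, mut l, mut acc) = (f32::NEG_INFINITY, 0.0f32, 0.0f32);
    for (s_tile, v_tile) in scores.chunks(tile).zip(values.chunks(tile)) {
        let tile_max = s_tile.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
        let m_new = m.max(tile_max);
        // Rescale everything accumulated under the old maximum.
        let correction = (m - m_new).exp();
        l *= correction;
        acc *= correction;
        for (s, v) in s_tile.iter().zip(v_tile) {
            let p = (s - m_new).exp();
            l += p;
            acc += p * v;
        }
        m = m_new;
    }
    (acc / l, m + l.ln())
}

fn main() {
    let scores = [0.5f32, -1.0, 2.0, 0.1, 3.5, -0.7, 1.2];
    let values = [1.0f32, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0];
    let reference = attend_reference(&scores, &values);
    let (online, lse) = attend_online(&scores, &values, 3);
    println!("reference = {}, online = {}, lse = {}", reference, online, lse);
}
```

Because each tile only needs the running triple `(m, l, acc)`, the tile size can be chosen to fit fast on-chip memory without changing the result beyond floating-point rounding.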
### Position Encodings
- **RoPE**: split-half and interleaved layouts
- **ALiBi** attention biases
- **YaRN** for length extrapolation
- Efficient fused implementations on all backends
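
The two RoPE layouts differ only in which elements are paired before rotation. The following sketch applies both to a single head vector. It assumes the common angle formula `pos * base^(-2i/d)`, and the function names are hypothetical rather than boostr's API.

```rust
/// Rotate the pair (a, b) by angle `theta`.
fn rotate(a: f32, b: f32, theta: f32) -> (f32, f32) {
    let (sin, cos) = theta.sin_cos();
    (a * cos - b * sin, a * sin + b * cos)
}

/// Angle for pair index `i` at position `pos`: pos * base^(-2i/dim).
fn angle(pos: usize, i: usize, dim: usize, base: f32) -> f32 {
    pos as f32 * base.powf(-2.0 * i as f32 / dim as f32)
}

/// Interleaved layout: pairs are (x[2i], x[2i + 1]).
fn rope_interleaved(x: &mut [f32], pos: usize, base: f32) {
    let dim = x.len();
    for i in 0..dim / 2 {
        let (a, b) = rotate(x[2 * i], x[2 * i + 1], angle(pos, i, dim, base));
        x[2 * i] = a;
        x[2 * i + 1] = b;
    }
}

/// Split-half layout: pairs are (x[i], x[i + dim/2]).
fn rope_split_half(x: &mut [f32], pos: usize, base: f32) {
    let dim = x.len();
    let half = dim / 2;
    for i in 0..half {
        let (a, b) = rotate(x[i], x[i + half], angle(pos, i, dim, base));
        x[i] = a;
        x[i + half] = b;
    }
}

fn main() {
    let mut a = [1.0f32, 2.0, 3.0, 4.0];
    let mut b = a;
    rope_interleaved(&mut a, 3, 10000.0);
    rope_split_half(&mut b, 3, 10000.0);
    println!("interleaved = {:?}", a);
    println!("split-half  = {:?}", b);
}
```

The layouts are equivalent up to a permutation of the head dimension, so the choice matters mainly when loading checkpoints that were trained with one convention or the other.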
### Model Architectures
- **LLaMA** — standard and tensor-parallelized
- **Mamba2** — state space models with SSD kernels
- **Hybrid** — mixed transformer/SSM models
- Extensible architecture system for custom models
### Neural Network Modules
- **Linear** — standard and quantized variants
- **Embedding** for token embeddings
- **LayerNorm**, **RMSNorm** with fused implementations
- **MoE layers** with expert routing and load balancing
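
Expert routing in an MoE layer usually means picking the top-k experts per token from a router's logits and renormalizing their weights. The sketch below shows that step for one token in plain Rust. `route_top_k` is a hypothetical helper, not boostr's routing API, and load balancing is left out.

```rust
/// Pick the `k` highest-scoring experts for one token and return
/// (expert index, weight) pairs whose weights sum to 1.
fn route_top_k(logits: &[f32], k: usize) -> Vec<(usize, f32)> {
    assert!(k >= 1 && k <= logits.len());
    // Sort expert indices by descending logit.
    let mut order: Vec<usize> = (0..logits.len()).collect();
    order.sort_by(|&a, &b| logits[b].total_cmp(&logits[a]));
    order.truncate(k);
    // Softmax over the selected experts only, shifted by the max for stability.
    let max = logits[order[0]];
    let weights: Vec<f32> = order.iter().map(|&i| (logits[i] - max).exp()).collect();
    let sum: f32 = weights.iter().sum();
    order
        .into_iter()
        .zip(weights)
        .map(|(i, w)| (i, w / sum))
        .collect()
}

fn main() {
    let router_logits = [0.1f32, 2.0, -1.0, 1.5];
    for (expert, weight) in route_top_k(&router_logits, 2) {
        println!("expert {} gets weight {}", expert, weight);
    }
}
```

The token's output is then the weighted sum of the selected experts' outputs, so only `k` of the experts run per token.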
### Inference Infrastructure
- **Paged KV cache** with block allocator for memory efficiency
- **Request scheduler** with continuous batching
- **Prefix caching** for prompt reuse
- **Speculative decoding** with adaptive draft depth and verification kernels
- **Flash decoding** for single-token decode (CUDA, auto-selected when S_q=1)
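
A paged KV cache with a block allocator can be pictured as a free list of fixed-size physical blocks plus a per-sequence block table. The sketch below models only that bookkeeping and holds no tensors. `BlockAllocator` and `Sequence` are hypothetical types for illustration and do not mirror boostr's `PagedKvCache` internals.

```rust
/// Hands out fixed-size physical blocks from a free list.
struct BlockAllocator {
    free: Vec<usize>,
    block_size: usize,
}

impl BlockAllocator {
    fn new(num_blocks: usize, block_size: usize) -> Self {
        assert!(block_size > 0);
        // Reversed so that `pop` hands out block 0 first.
        BlockAllocator { free: (0..num_blocks).rev().collect(), block_size }
    }
    fn alloc(&mut self) -> Option<usize> {
        self.free.pop()
    }
    fn release(&mut self, block: usize) {
        self.free.push(block);
    }
    fn free_blocks(&self) -> usize {
        self.free.len()
    }
}

/// Per-sequence table mapping logical token positions to physical blocks.
struct Sequence {
    block_table: Vec<usize>,
    len: usize,
}

impl Sequence {
    fn new() -> Self {
        Sequence { block_table: Vec::new(), len: 0 }
    }

    /// Reserve a slot for one more token.
    /// Returns (physical block, offset in block), or None if memory is exhausted.
    fn append(&mut self, alloc: &mut BlockAllocator) -> Option<(usize, usize)> {
        let offset = self.len % alloc.block_size;
        if offset == 0 {
            // The current block is full (or this is the first token).
            self.block_table.push(alloc.alloc()?);
        }
        self.len += 1;
        Some((*self.block_table.last()?, offset))
    }

    /// Return every block to the allocator when the request finishes.
    fn release(self, alloc: &mut BlockAllocator) {
        for block in self.block_table {
            alloc.release(block);
        }
    }
}

fn main() {
    let mut alloc = BlockAllocator::new(4, 2);
    let mut seq = Sequence::new();
    for _ in 0..3 {
        println!("slot = {:?}", seq.append(&mut alloc));
    }
    println!("free blocks = {}", alloc.free_blocks());
    seq.release(&mut alloc);
}
```

Because a sequence only ever holds whole blocks, at most `block_size - 1` slots are wasted per sequence, and finished requests return memory that a scheduler can hand to new ones immediately.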
### Training
- **Optimizers**: AdamW, Lamb, SGD with gradient clipping
- **Mixed precision** (AMP) with automatic loss scaling
- **Gradient accumulation** and checkpointing
- **Learning rate scheduling** (warmup, cosine, linear decay)
- **Distributed training**:
  - ZeRO stage 1/2/3 (sharding optimizer states, then gradients, then parameters)
  - Tensor parallelism with communicators
  - Pipeline parallelism (1F1B, GPipe, ZeroBubble schedules)
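
As an illustration of the warmup and cosine scheduling mentioned above, here is a small std-only function that computes the learning rate for a given step. The name `lr_at` and its parameters are assumptions made for this sketch. boostr's scheduler types may be shaped differently.

```rust
use std::f32::consts::PI;

/// Linear warmup to `peak` over `warmup` steps, then cosine decay to `min_lr`
/// by step `total`. Steps past `total` stay at `min_lr`.
fn lr_at(step: usize, warmup: usize, total: usize, peak: f32, min_lr: f32) -> f32 {
    if step < warmup {
        // Linear ramp: the last warmup step reaches `peak`.
        peak * (step + 1) as f32 / warmup as f32
    } else {
        let decay_steps = total.saturating_sub(warmup).max(1);
        let progress = ((step - warmup) as f32 / decay_steps as f32).min(1.0);
        min_lr + 0.5 * (peak - min_lr) * (1.0 + (PI * progress).cos())
    }
}

fn main() {
    for step in [0usize, 5, 9, 10, 55, 100] {
        println!("step {:>3}: lr = {}", step, lr_at(step, 10, 100, 1.0, 0.1));
    }
}
```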
### Model Loading
- **SafeTensors**: Zero-copy memory-mapped loading
- **GGUF**: Full format support with block-quantized tensors
- Format auto-detection
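
Format auto-detection can be done from the first few bytes of a file: GGUF files start with the ASCII magic `GGUF`, while SafeTensors files start with an 8-byte little-endian header length followed by a JSON object. The sketch below shows one way to sniff these. `detect_format` and `ModelFormat` are hypothetical names, and boostr's own detection logic may differ.

```rust
#[derive(Debug, PartialEq)]
enum ModelFormat {
    Gguf,
    SafeTensors,
    Unknown,
}

/// Guess the format from the leading bytes of a file.
fn detect_format(bytes: &[u8]) -> ModelFormat {
    // GGUF: 4-byte ASCII magic at offset 0.
    if bytes.starts_with(b"GGUF") {
        return ModelFormat::Gguf;
    }
    // SafeTensors: u64 little-endian header length, then a JSON header
    // whose first byte is '{'.
    if bytes.len() > 8 {
        let mut len_bytes = [0u8; 8];
        len_bytes.copy_from_slice(&bytes[..8]);
        let header_len = u64::from_le_bytes(len_bytes);
        if header_len >= 2 && bytes[8] == b'{' {
            return ModelFormat::SafeTensors;
        }
    }
    ModelFormat::Unknown
}

fn main() {
    let mut safetensors = 2u64.to_le_bytes().to_vec();
    safetensors.extend_from_slice(b"{}");
    println!("{:?}", detect_format(b"GGUF\x03\x00\x00\x00"));
    println!("{:?}", detect_format(&safetensors));
    println!("{:?}", detect_format(b"something else"));
}
```

A real loader would also validate the header length against the file size before trusting it. This sketch checks only the prefix.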
### Multi-Backend
- **CPU**: SIMD kernels (AVX2, NEON), native ops
- **CUDA**: PTX kernels, Flash Attention v2/v3, fused ops (CUDA 12.x)
- **WebGPU**: WGSL shaders, cross-platform GPU support
## Architecture
```
┌───────────────────────────────────────────────────────┐
│ boostr │
│ (attention, RoPE, MoE, quantization, model loaders) │
└──────────────────────────┬────────────────────────────┘
│
(uses)
│
┌──────────────────────────▼───────────────────────────┐
│ numr │
│ (tensors, ops, runtime, autograd, linalg, FFT) │
└──────────────────────────────────────────────────────┘
```
**Design principles:**
- **Extension traits**: ML ops (AttentionOps, RoPEOps) implemented on numr's clients — not new types
- **QuantTensor**: Separate type for quantized data with custom kernels
- **impl_generic**: Composite ops composed from numr primitives, same logic on all backends
- **Custom kernels**: Dequant, quantized matmul, fused attention use per-backend optimizations (SIMD/PTX/WGSL)
- **Vendor-agnostic**: No cuBLAS, cuDNN, or MKL; all native kernels
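
The extension-trait principle can be sketched without either crate: a new op is declared as a trait with a default body composed from primitives, then implemented on an existing client type instead of wrapping it. Everything below (`MockClient`, `PrimitiveOps`, `SoftmaxOps`) is a hypothetical stand-in and not the actual numr or boostr API.

```rust
/// Hypothetical stand-in for a backend client (not numr's real client type).
struct MockClient;

/// Hypothetical primitive ops, playing the role of the foundation crate's op traits.
trait PrimitiveOps {
    fn exp(&self, x: &[f32]) -> Vec<f32>;
    fn sum(&self, x: &[f32]) -> f32;
}

impl PrimitiveOps for MockClient {
    fn exp(&self, x: &[f32]) -> Vec<f32> {
        x.iter().map(|v| v.exp()).collect()
    }
    fn sum(&self, x: &[f32]) -> f32 {
        x.iter().sum()
    }
}

/// Extension trait: an ML-level op whose default body is composed only from
/// primitives, so the same logic runs on any client that provides them.
trait SoftmaxOps: PrimitiveOps {
    fn softmax(&self, x: &[f32]) -> Vec<f32> {
        let max = x.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
        let shifted: Vec<f32> = x.iter().map(|v| v - max).collect();
        let e = self.exp(&shifted);
        let total = self.sum(&e);
        e.iter().map(|v| v / total).collect()
    }
}

/// Opt the client in. A backend with a fused kernel could override `softmax` here.
impl SoftmaxOps for MockClient {}

fn main() {
    let client = MockClient;
    let probs = client.softmax(&[1.0, 2.0, 3.0]);
    println!("{:?}", probs);
}
```

The per-type `impl` (instead of a blanket impl) is a deliberate choice in this sketch: it leaves room for a backend to replace the generic composition with a hand-written kernel while callers keep using the same method.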
## Quick Start
### Installation
Add to `Cargo.toml`:
```toml
[dependencies]
boostr = "<latest-version>"
# With CUDA support (requires CUDA 12.x)
# boostr = { version = "0.1", features = ["cuda"] }
# With WebGPU support
# boostr = { version = "0.1", features = ["wgpu"] }
```
### Build
```bash
# CPU build
cargo build --release
# CUDA support (requires CUDA 12.x)
cargo build --release --features cuda
# WebGPU support
cargo build --release --features wgpu
# Run tests
cargo test
cargo test --features cuda
```
### Basic Usage
```rust
use boostr::*;
use boostr::ops::traits::attention::flash::FlashAttentionOps;
use numr::ops::RandomOps;
use numr::runtime::cpu::{CpuClient, CpuDevice};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = CpuClient::new(CpuDevice::new());

    // Create random tensors via numr's RandomOps
    let q = client.randn(&[1, 8, 32, 64], DType::F32)?; // [batch, heads, seq, dim]
    let k = client.randn(&[1, 8, 32, 64], DType::F32)?;
    let v = client.randn(&[1, 8, 32, 64], DType::F32)?;

    // Flash Attention forward pass
    let (output, _lse) = client.flash_attention_fwd(
        &q, &k, &v,
        8,    // num_heads
        8,    // num_kv_heads
        64,   // head_dim
        true, // causal
        0,    // window_size (0 = no sliding window)
        None, // kv_seq_len override
    )?;

    Ok(())
}
```
### Loading a Model
```rust
use boostr::format::Gguf;
use boostr::{CpuRuntime, DType};
use numr::runtime::Runtime;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Open a GGUF model file (with optional memory mapping)
    let mut gguf = Gguf::open("model.gguf")?;
    let metadata = gguf.metadata();
    let device = <CpuRuntime as Runtime>::Device::default();

    // Load tensors: quantized ones as QuantTensor, all others as f32
    for name in gguf.tensor_names().map(|s| s.to_string()).collect::<Vec<_>>() {
        let info = gguf.tensor_info(&name)?;
        if info.ggml_type.is_quantized() {
            let qt = gguf.load_tensor_quantized::<CpuRuntime>(&name, &device)?;
        } else {
            let t = gguf.load_tensor_f32::<CpuRuntime>(&name, &device)?;
        }
    }

    Ok(())
}
```
### Inference with KV Cache
```rust
use boostr::inference::PagedKvCache;
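
// Outline only: `client`, the size parameters, `seq_len`, `layer_idx`, `k`,
// and `v` are assumed to be defined by the surrounding model code.

// Create a paged KV cache for efficient inference
let mut kv_cache = PagedKvCache::new(
    &client,
    num_layers,
    batch_size,
    max_seq_len,
    head_dim,
)?;

// Process tokens with cache
for token_idx in 0..seq_len {
    // ... forward pass using kv_cache ...
    kv_cache.update(layer_idx, &k, &v)?;
}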
```
## Feature Flags
| Feature | Description | Dependencies |
| --- | --- | --- |
| `cpu` | CPU backend (default) | numr |
| `cuda` | CUDA GPU acceleration (CUDA 12.x) | numr/cuda, cudarc |
| `nccl` | Multi-GPU via NCCL | numr/nccl |
| `wgpu` | WebGPU cross-platform GPU | numr/wgpu |
| `distributed` | Distributed inference over nexar | nexar, anyhow, bytemuck |
| `f16` | Half-precision float support | numr/f16 |
| `fp8` | FP8 precision support | numr/fp8 |
| `tts-g2p` | Grapheme-to-phoneme via espeak-ng¹ | espeakng |
¹ Requires `libespeak-ng` available at runtime.
## Module Overview
- **`ops/`** — ML-specific operations (attention, RoPE, MoE, etc.)
- **`quant/`** — Quantized tensors and kernels (26 formats)
- **`nn/`** — Neural network modules (Linear, Embedding, LayerNorm, RMSNorm, MoE)
- **`model/`** — Model architectures (LLaMA, Mamba2, Hybrid)
- **`format/`** — Model loaders (SafeTensors, GGUF)
- **`inference/`** — Inference infrastructure (KV cache, scheduling, batching)
- **`optimizer/`** — Training optimizers (AdamW, Lamb, SGD)
- **`trainer/`** — Training utilities and distributed training (ZeRO, tensor/pipeline parallelism)
- **`distributed/`** — Multi-GPU coordination
## Performance
boostr's performance comes from the following design choices:
- **Fused kernels** — Attention, layer norm, optimizer steps compiled to single kernels
- **Custom quantization** — Per-format SIMD/PTX/WGSL kernels for dequant and quantized matmul
- **Memory efficiency** — Paged KV cache, prefix caching, gradient checkpointing
- **Distributed training** — ZeRO stages, tensor/pipeline parallelism with minimal communication overhead
- **Zero-copy loading** — Memory-mapped GGUF with quantized weights
## Ecosystem
boostr is part of the [ml-rust](https://github.com/ml-rust) organization:
- **[numr](https://github.com/ml-rust/numr)** — Foundational numerical computing (tensors, ops, linalg, FFT)
- **[boostr](https://github.com/ml-rust/boostr)** — ML framework (this project)
- **[oxidizr](https://github.com/ml-rust/oxidizr)** — Training framework for Mamba2, MLA, MoE (uses boostr)
- **[blazr](https://github.com/ml-rust/blazr)** — Inference server with OpenAI-compatible API (uses boostr)
- **[compressr](https://github.com/ml-rust/compressr)** — Model quantization and compression (uses boostr)
- **[splintr](https://github.com/ml-rust/splintr)** — High-performance BPE tokenizer
## Building from Source
### Requirements
- Rust 1.85+
- For CUDA: CUDA 12.x and cudarc dependencies
- For WebGPU: wgpu and platform GPU drivers
### Clone and Build
```bash
git clone https://github.com/ml-rust/boostr.git
cd boostr
# CPU
cargo build --release
# CUDA
cargo build --release --features cuda
# Run tests
cargo test --all-features
# Format and lint
cargo fmt --all
cargo clippy --all-targets
```
## Documentation
- [API Documentation](https://docs.rs/boostr) — Full reference for public API
- [numr Documentation](https://docs.rs/numr) — Tensor and runtime types
## Testing
```bash
# Run all tests
cargo test --all-features
# Specific test suite
cargo test ops::attention --all-features
# Verbose output
cargo test --all-features -- --nocapture
```
## Contributing
Contributions are welcome! See [CONTRIBUTING.md](CONTRIBUTING.md) for architecture conventions, the `impl_generic` pattern, and pull request guidance.
## License
Licensed under the Apache License, Version 2.0. See [LICENSE](LICENSE) for details.
## Acknowledgments
boostr builds on the numerical foundation provided by [numr](https://github.com/ml-rust/numr) and is designed to power production ML infrastructure across training (oxidizr) and inference (blazr).