# boostr
**ML framework built on numr — attention, quantization, model architectures.**
[crates.io](https://crates.io/crates/boostr) · [docs.rs](https://docs.rs/boostr) · [LICENSE](LICENSE)
boostr extends [numr](https://github.com/ml-rust/numr) with production-grade ML primitives. It provides attention mechanisms, quantization support, model architectures, and inference infrastructure — all built on numr's foundational tensors, runtimes, and ops. No reimplementation. No wrappers. Pure extension traits.
## Key Capabilities
### Quantization
- **26 formats** (GGUF-compatible): Q4_0, Q4_1, Q5_0, Q5_1, Q8_0, Q8_1, Q2_K–Q8_K, IQ1_S–IQ4_XS, TQ1_0, TQ2_0
- **QuantTensor type** for block-quantized data
- **Per-backend kernels**: Native SIMD (CPU), PTX (CUDA), WGSL (WebGPU)
- Zero-copy GGUF loading with memory mapping
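As a concrete illustration, Q8_0 (the simplest of these formats) stores blocks of 32 values with one shared scale. The sketch below shows that block math in plain Rust; it is illustrative only and does not reflect boostr's actual SIMD/PTX/WGSL kernels or the packed GGUF byte layout:

```rust
// Q8_0-style block quantization sketch (illustrative, not boostr's kernels):
// each block of 32 f32 values shares one scale d = max|x| / 127.
const BLOCK: usize = 32;

fn quantize_q8_0(block: &[f32; BLOCK]) -> (f32, [i8; BLOCK]) {
    let amax = block.iter().fold(0f32, |m, &v| m.max(v.abs()));
    let d = amax / 127.0;
    let inv = if d > 0.0 { 1.0 / d } else { 0.0 };
    let mut q = [0i8; BLOCK];
    for (qi, &x) in q.iter_mut().zip(block.iter()) {
        *qi = (x * inv).round() as i8; // always in [-127, 127]
    }
    (d, q)
}

fn dequantize_q8_0(d: f32, q: &[i8; BLOCK]) -> [f32; BLOCK] {
    let mut out = [0f32; BLOCK];
    for (o, &qi) in out.iter_mut().zip(q.iter()) {
        *o = qi as f32 * d;
    }
    out
}

fn main() {
    let mut x = [0f32; BLOCK];
    for (i, v) in x.iter_mut().enumerate() {
        *v = (i as f32 - 16.0) * 0.1;
    }
    let (d, q) = quantize_q8_0(&x);
    let y = dequantize_q8_0(d, &q);
    // round-trip error is bounded by half a quantization step
    for (a, b) in x.iter().zip(y.iter()) {
        assert!((a - b).abs() <= 0.5 * d + 1e-6);
    }
    println!("scale d = {d}");
}
```

The real formats additionally pack the scale as f16 and the weights into a fixed byte layout so blocks can be memory-mapped directly from a GGUF file.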
### Attention
- **Flash Attention v2/v3** with fused QKV projection
- **Multi-Head Latent Attention (MLA)** — compressed KV cache
- **Grouped Query Attention (GQA)** and multi-head variants
- **Paged attention** for memory-efficient inference
- **Variable-length attention** with ragged tensors
- **Prefix caching** for context reuse
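All of these variants compute the same underlying quantity. The naive single-head reference below shows that math in plain Rust; boostr's fused Flash Attention kernels produce the same result without ever materializing the full score matrix:

```rust
// Naive single-head scaled dot-product attention, for reference only.
// q, k, v are [seq_len][head_dim] row vectors.
fn attention(q: &[Vec<f32>], k: &[Vec<f32>], v: &[Vec<f32>], causal: bool) -> Vec<Vec<f32>> {
    let scale = 1.0 / (q[0].len() as f32).sqrt();
    q.iter().enumerate().map(|(i, qi)| {
        // scores for query i against every visible key
        let mut scores: Vec<f32> = k.iter().enumerate().map(|(j, kj)| {
            if causal && j > i {
                f32::NEG_INFINITY // masked: future positions are invisible
            } else {
                qi.iter().zip(kj).map(|(a, b)| a * b).sum::<f32>() * scale
            }
        }).collect();
        // numerically stable softmax over the scores
        let m = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
        let mut z = 0.0f32;
        for s in scores.iter_mut() {
            *s = (*s - m).exp();
            z += *s;
        }
        // weighted sum of value rows
        let mut out = vec![0.0f32; v[0].len()];
        for (w, vj) in scores.iter().zip(v) {
            for (o, x) in out.iter_mut().zip(vj) {
                *o += (w / z) * x;
            }
        }
        out
    }).collect()
}

fn main() {
    let q = vec![vec![1.0f32, 0.0], vec![0.0, 1.0]];
    let out = attention(&q, &q, &q, true);
    // with a causal mask the first query sees only the first key,
    // so its output is exactly v[0]
    assert!((out[0][0] - 1.0).abs() < 1e-6 && out[0][1].abs() < 1e-6);
    println!("{out:?}");
}
```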
### Position Encodings
- **RoPE**: Split-half, interleaved, ALiBi variants
- **YaRN** for length extrapolation
- Efficient fused implementations on all backends
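For orientation, the split-half RoPE variant rotates pairs `(x[i], x[i + d/2])` by a position-dependent angle. A minimal pure-Rust sketch (illustrative only; boostr applies the same math in fused per-backend kernels):

```rust
// Split-half RoPE sketch: pair (x[i], x[i + d/2]) is rotated by
// angle = pos * theta^(-2i/d). Rotation preserves the vector's norm.
fn rope_split_half(x: &mut [f32], pos: usize, theta: f32) {
    let d = x.len();
    let half = d / 2;
    for i in 0..half {
        let freq = theta.powf(-2.0 * i as f32 / d as f32);
        let (sin, cos) = (pos as f32 * freq).sin_cos();
        let (a, b) = (x[i], x[i + half]);
        x[i] = a * cos - b * sin;
        x[i + half] = a * sin + b * cos;
    }
}

fn main() {
    let mut x = vec![1.0f32, 2.0, 3.0, 4.0];
    let n0: f32 = x.iter().map(|v| v * v).sum();
    rope_split_half(&mut x, 7, 10000.0);
    let n1: f32 = x.iter().map(|v| v * v).sum();
    // rotations are norm-preserving
    assert!((n0 - n1).abs() < 1e-4);
    println!("{x:?}");
}
```

Because the encoding depends only on position deltas through dot products, keys cached at earlier positions never need to be re-rotated.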
### Model Architectures
- **LLaMA** — standard and tensor-parallelized
- **Mamba2** — state space models with SSD kernels
- **Hybrid** — mixed transformer/SSM models
- Extensible architecture system for custom models
### Neural Network Modules
- **Linear** — standard and quantized variants
- **Embedding** for token embeddings
- **LayerNorm**, **RMSNorm** with fused implementations
- **MoE layers** with expert routing and load balancing
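As a reference for what the fused RMSNorm kernel computes (this standalone sketch is illustrative only, not boostr's implementation):

```rust
// Reference RMSNorm: y_i = x_i / sqrt(mean(x^2) + eps) * gamma_i
fn rms_norm(x: &[f32], gamma: &[f32], eps: f32) -> Vec<f32> {
    let mean_sq = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let inv = 1.0 / (mean_sq + eps).sqrt();
    x.iter().zip(gamma).map(|(v, g)| v * inv * g).collect()
}

fn main() {
    let y = rms_norm(&[3.0, 4.0], &[1.0, 1.0], 0.0);
    // with unit gamma the output has unit root-mean-square
    let rms = (y.iter().map(|v| v * v).sum::<f32>() / 2.0).sqrt();
    assert!((rms - 1.0).abs() < 1e-5);
    println!("{y:?}");
}
```

Unlike LayerNorm, RMSNorm skips mean-centering and the bias term, which is why it fuses into a single cheap kernel.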
### Inference Infrastructure
- **Paged KV cache** with block allocator for memory efficiency
- **Request scheduler** with continuous batching
- **Prefix caching** for prompt reuse
- **Speculative decoding** with adaptive draft depth and verification kernels
- **Flash decoding** for single-token decode (CUDA, auto-selected when S_q=1)
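A paged KV cache replaces one contiguous `max_seq_len` buffer per sequence with fixed-size blocks handed out on demand. A toy free-list allocator (purely illustrative, not the internals of boostr's `PagedKvCache`) shows the core idea:

```rust
// Toy free-list block allocator for a paged KV cache (illustrative only).
// Sequences request fixed-size physical blocks as they grow, so KV memory
// is consumed page by page instead of being reserved up front.
struct BlockAllocator {
    free: Vec<usize>, // ids of free physical blocks
}

impl BlockAllocator {
    fn new(num_blocks: usize) -> Self {
        Self { free: (0..num_blocks).rev().collect() }
    }

    /// Hand out a free block id, or None when the pool is exhausted.
    fn alloc(&mut self) -> Option<usize> {
        self.free.pop()
    }

    /// Return a block to the pool (e.g. when a sequence finishes).
    fn release(&mut self, block: usize) {
        self.free.push(block);
    }
}

fn main() {
    let mut pool = BlockAllocator::new(2);
    assert_eq!(pool.alloc(), Some(0));
    assert_eq!(pool.alloc(), Some(1));
    assert_eq!(pool.alloc(), None); // pool exhausted
    pool.release(0);
    assert_eq!(pool.alloc(), Some(0)); // block reused after release
    println!("ok");
}
```

Prefix caching falls out of the same structure: two sequences sharing a prompt can map their page tables to the same physical blocks.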
### Training
- **Optimizers**: AdamW, Lamb, SGD with gradient clipping
- **Mixed precision** (AMP) with automatic loss scaling
- **Gradient accumulation** and checkpointing
- **Learning rate scheduling** (warmup, cosine, linear decay)
- **Distributed training**:
- ZeRO stage 1/2/3 (parameter/gradient/optimizer sharding)
- Tensor parallelism with communicators
- Pipeline parallelism (1F1B, Gpipe, ZeroBubble schedules)
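As one concrete piece of the training stack, a warmup-plus-cosine schedule can be sketched as follows (a hypothetical standalone helper, not boostr's scheduler API):

```rust
// Warmup + cosine-decay learning-rate schedule sketch (illustrative):
// linear ramp to `peak` over `warmup` steps, then cosine decay to `min_lr`.
// Assumes total > warmup.
fn lr_at(step: usize, warmup: usize, total: usize, peak: f32, min_lr: f32) -> f32 {
    if step < warmup {
        // linear ramp over steps 0 .. warmup-1
        peak * (step as f32 + 1.0) / warmup as f32
    } else {
        // cosine from peak (t = 0) down to min_lr (t = 1)
        let t = (step - warmup) as f32 / (total - warmup) as f32;
        min_lr + 0.5 * (peak - min_lr) * (1.0 + (std::f32::consts::PI * t).cos())
    }
}

fn main() {
    let (warmup, total, peak) = (100, 1000, 3e-4);
    assert!((lr_at(99, warmup, total, peak, 0.0) - peak).abs() < 1e-9);  // end of warmup
    assert!((lr_at(100, warmup, total, peak, 0.0) - peak).abs() < 1e-9); // peak
    assert!(lr_at(1000, warmup, total, peak, 0.0) < 1e-7);               // decayed to ~min_lr
    println!("ok");
}
```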
### Model Loading
- **SafeTensors**: Zero-copy memory-mapped loading
- **GGUF**: Full format support with block-quantized tensors
- Format auto-detection
### Multi-Backend
- **CPU**: SIMD kernels (AVX2, NEON), native ops
- **CUDA**: PTX kernels, Flash Attention v2/v3, fused ops (CUDA 12.x)
- **WebGPU**: WGSL shaders, cross-platform GPU support
## Architecture
```
┌──────────────────────────────────────────────────────┐
│ boostr │
│ (attention, RoPE, MoE, quantization, model loaders) │
└──────────────────────────┬──────────────────────────┘
│
(uses)
│
┌──────────────────────────▼──────────────────────────┐
│ numr │
│ (tensors, ops, runtime, autograd, linalg, FFT) │
└──────────────────────────────────────────────────────┘
```
**Design principles:**
- **Extension traits**: ML ops (AttentionOps, RoPEOps) implemented on numr's clients — not new types
- **QuantTensor**: Separate type for quantized data with custom kernels
- **impl_generic**: Composite ops composed from numr primitives, same logic on all backends
- **Custom kernels**: Dequant, quantized matmul, fused attention use per-backend optimizations (SIMD/PTX/WGSL)
- **Vendor-agnostic**: No cuBLAS, cuDNN, or MKL; all native kernels
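The extension-trait principle is plain Rust: new operations are attached to an existing type via a trait, rather than wrapping the type. The sketch below demonstrates the pattern with invented names (it is not boostr's real API):

```rust
// Extension-trait pattern sketch. `Client` stands in for a numr runtime
// client; the "extension crate" adds an op to it without a wrapper type.
struct Client;

trait SoftmaxOps {
    fn softmax(&self, x: &[f32]) -> Vec<f32>;
}

// The extension crate implements its trait for the foundation's type.
impl SoftmaxOps for Client {
    fn softmax(&self, x: &[f32]) -> Vec<f32> {
        let m = x.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
        let e: Vec<f32> = x.iter().map(|v| (v - m).exp()).collect();
        let z: f32 = e.iter().sum();
        e.into_iter().map(|v| v / z).collect()
    }
}

fn main() {
    // Callers bring the trait into scope and use the method directly —
    // the client type is unchanged, so it composes with everything else.
    let p = Client.softmax(&[1.0, 2.0, 3.0]);
    let sum: f32 = p.iter().sum();
    assert!((sum - 1.0).abs() < 1e-6);
    println!("{p:?}");
}
```

This is why boostr needs "no reimplementation, no wrappers": traits like `AttentionOps` and `RoPEOps` follow the same shape against numr's clients.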
## Quick Start
### Installation
Add to `Cargo.toml`:
```toml
[dependencies]
boostr = "<latest-version>"
# With CUDA support (requires CUDA 12.x)
# boostr = { version = "0.1", features = ["cuda"] }
# With WebGPU support
# boostr = { version = "0.1", features = ["wgpu"] }
```
### Build
```bash
# CPU build
cargo build --release
# CUDA support (requires CUDA 12.x)
cargo build --release --features cuda
# WebGPU support
cargo build --release --features wgpu
# Run tests
cargo test
cargo test --features cuda
```
### Basic Usage
```rust
use boostr::*;
use numr::ops::RandomOps;
use numr::runtime::cpu::CpuClient;
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = CpuClient::default();

    // Create random tensors via numr's RandomOps (re-exported by boostr)
    let queries = client.randn(&[1, 32, 8, 64], DType::F32)?;
    let keys = client.randn(&[1, 32, 8, 64], DType::F32)?;
    let values = client.randn(&[1, 32, 8, 64], DType::F32)?;

    // Use boostr's extension traits for multi-head attention
    use boostr::AttentionOps;
    let output = client.multi_head_attention(
        &queries, &keys, &values,
        None, // no causal mask
        1.0,  // scale
    )?;
    Ok(())
}
```
### Loading a Model
```rust
use boostr::format::Gguf;
use boostr::{CpuRuntime, DType};
use numr::runtime::Runtime;
// Open a GGUF model file (with optional memory mapping)
let mut gguf = Gguf::open("model.gguf")?;
let metadata = gguf.metadata();
let device = <CpuRuntime as Runtime>::Device::default();
// Load tensors — quantized as QuantTensor, others as f32
for name in gguf.tensor_names().map(|s| s.to_string()).collect::<Vec<_>>() {
    let info = gguf.tensor_info(&name)?;
    if info.ggml_type.is_quantized() {
        let qt = gguf.load_tensor_quantized::<CpuRuntime>(&name, &device)?;
    } else {
        let t = gguf.load_tensor_f32::<CpuRuntime>(&name, &device)?;
    }
}
```
### Inference with KV Cache
```rust
use boostr::inference::PagedKvCache;
// Create a paged KV cache for efficient inference
let mut kv_cache = PagedKvCache::new(
    &client,
    num_layers,
    batch_size,
    max_seq_len,
    head_dim,
)?;

// Process tokens with cache
for token_idx in 0..seq_len {
    // ... forward pass using kv_cache ...
    kv_cache.update(layer_idx, &k, &v)?;
}
```
## Feature Flags
| Feature | Description | Dependencies |
|---------|-------------|--------------|
| `cpu` | CPU backend (default) | numr |
| `cuda` | CUDA GPU acceleration | numr/cuda, cudarc |
| `nccl` | Multi-GPU via NCCL | numr/nccl |
| `wgpu` | WebGPU cross-platform GPU | numr/wgpu |
| `f16` | Half-precision float support | numr/f16 |
| `fp8` | FP8 precision support | numr/fp8 |
## Module Overview
- **`ops/`** — ML-specific operations (attention, RoPE, MoE, etc.)
- **`quant/`** — Quantized tensors and kernels (26 formats)
- **`nn/`** — Neural network modules (Linear, Embedding, LayerNorm, RMSNorm, MoE)
- **`model/`** — Model architectures (LLaMA, Mamba2, Hybrid)
- **`format/`** — Model loaders (SafeTensors, GGUF)
- **`inference/`** — Inference infrastructure (KV cache, scheduling, batching)
- **`optimizer/`** — Training optimizers (AdamW, Lamb, SGD)
- **`trainer/`** — Training utilities and distributed training (ZeRO, tensor/pipeline parallelism)
- **`distributed/`** — Multi-GPU coordination
## Performance
boostr provides production-grade performance through:
- **Fused kernels** — Attention, layer norm, optimizer steps compiled to single kernels
- **Custom quantization** — Per-format SIMD/PTX/WGSL kernels for dequant and quantized matmul
- **Memory efficiency** — Paged KV cache, prefix caching, gradient checkpointing
- **Distributed training** — ZeRO stages, tensor/pipeline parallelism with minimal communication overhead
- **Zero-copy loading** — Memory-mapped GGUF with quantized weights
## Ecosystem
boostr is part of the [ml-rust](https://github.com/ml-rust) organization:
- **[numr](https://github.com/ml-rust/numr)** — Foundational numerical computing (tensors, ops, linalg, FFT)
- **[boostr](https://github.com/ml-rust/boostr)** — ML framework (this project)
- **[oxidizr](https://github.com/ml-rust/oxidizr)** — Training framework for Mamba2, MLA, MoE (uses boostr)
- **[blazr](https://github.com/ml-rust/blazr)** — Inference server with OpenAI-compatible API (uses boostr)
- **[compressr](https://github.com/ml-rust/compressr)** — Model quantization and compression (uses boostr)
- **[splintr](https://github.com/ml-rust/splintr)** — High-performance BPE tokenizer
## Building from Source
### Requirements
- Rust 1.85+
- For CUDA: CUDA 12.x and cudarc dependencies
- For WebGPU: wgpu and platform GPU drivers
### Clone and Build
```bash
git clone https://github.com/ml-rust/boostr.git
cd boostr
# CPU
cargo build --release
# CUDA
cargo build --release --features cuda
# Run tests
cargo test --all-features
# Format and lint
cargo fmt --all
cargo clippy --all-targets
```
## Documentation
- [API Documentation](https://docs.rs/boostr) — Full reference for public API
- [numr Documentation](https://docs.rs/numr) — Tensor and runtime types
## Testing
```bash
# Run all tests
cargo test --all-features
# Specific test suite
cargo test ops::attention --all-features
# Verbose output
cargo test --all-features -- --nocapture
```
## Contributing
Contributions are welcome! Please see the main repository's contribution guidelines.
## License
Licensed under the Apache License, Version 2.0. See [LICENSE](LICENSE) for details.
## Acknowledgments
boostr builds on the numerical foundation provided by [numr](https://github.com/ml-rust/numr) and is designed to power production ML infrastructure across training (oxidizr) and inference (blazr).