boostr 0.2.0-beta.2

<div align="center">

# boostr

<h3>ML framework in Rust. Write once. Run on any backend.</h3>

<p>
Production-grade LLM primitives — flash attention, quantization, MoE, state-space
models, KV caching, and distributed training — built on
<a href="https://github.com/ml-rust/numr">numr</a> so the same code runs on CPU, CUDA, and WebGPU.
</p>

<p>
  <a href="https://docs.rs/boostr"><strong>Docs</strong></a>
  · 
  <a href="https://crates.io/crates/boostr"><strong>Crate</strong></a>
  ·
  <a href="#module-overview"><strong>Modules</strong></a>
  ·
  <a href="#basic-usage"><strong>Example</strong></a>
  ·
  <a href="CONTRIBUTING.md"><strong>Contributing</strong></a>
</p>

<p>
  <a href="https://discord.gg/jBhFk9kHPg">
    <img src="https://img.shields.io/discord/1453357769720594543?label=Discord&logo=discord&logoColor=white&color=5865F2" alt="Join the Discord">
  </a>
</p>

<p>
  <a href="https://github.com/ml-rust/boostr/actions/workflows/ci.yml">
    <img src="https://img.shields.io/github/actions/workflow/status/ml-rust/boostr/ci.yml?branch=main&label=ci" alt="CI status">
  </a>
  <a href="https://crates.io/crates/boostr">
    <img src="https://img.shields.io/crates/v/boostr" alt="crates.io">
  </a>
  <a href="https://docs.rs/boostr">
    <img src="https://img.shields.io/docsrs/boostr" alt="docs.rs">
  </a>
  <a href="LICENSE">
    <img src="https://img.shields.io/crates/l/boostr" alt="License">
  </a>
  <a href="https://github.com/ml-rust/boostr/stargazers">
    <img src="https://img.shields.io/github/stars/ml-rust/boostr?style=social" alt="GitHub stars">
  </a>
</p>

</div>

boostr extends [numr](https://github.com/ml-rust/numr) with production-grade ML primitives. It provides attention mechanisms, quantization support, model architectures, and inference infrastructure — all built on numr's foundational tensors, runtimes, and ops. No reimplementation. No wrappers. Pure extension traits.

## Why boostr

- **One codebase, every backend.** Write once against numr's `Runtime`; run on CPU (SIMD), CUDA (PTX), or WebGPU (WGSL) by switching a feature flag — no per-device dispatch, no rewrite.
- **No vendor lock-in.** Every kernel is native — no cuBLAS, cuDNN, or MKL. Flash attention, quantized matmul, and fused optimizers are all hand-written per backend.
- **Backends are a foundation concern.** Hardware support lives in numr, so new backends added there flow up to boostr automatically — the abstraction is built in, not bolted on per device.
- **Train and serve from one stack.** The same primitives power [oxidizr](https://github.com/ml-rust/oxidizr) training and [blazr](https://github.com/ml-rust/blazr) inference — no Python runtime, single-binary deployment.

## Who it's for

- **LLM trainers** — distributed training with ZeRO (stages 1/2/3), tensor and pipeline parallelism (1F1B, GPipe, ZeroBubble), mixed precision, and fused optimizers.
- **Inference engineers** — flash attention v2/v3, paged KV cache, continuous batching, speculative decoding, and prefix caching for high-throughput serving.
- **Quantization & compression researchers** — 26 GGUF-compatible formats with a dedicated `QuantTensor` type and per-backend dequant / quantized-matmul kernels.
- **Architecture researchers** — LLaMA, Mamba2 (SSD kernels), and hybrid transformer/SSM models, with an extensible system for custom architectures.
- **WebAssembly & edge developers** — the WebGPU backend targets consumer GPUs (Vulkan/Metal/DX12) with no CUDA dependency.

## Key Capabilities

### Quantization

- **26 formats** (GGUF-compatible): Q4_0, Q4_1, Q5_0, Q5_1, Q8_0, Q8_1, Q2K–Q8K, IQ1S–IQ4XS, TQ1_0, TQ2_0
- **QuantTensor type** for block-quantized data
- **Per-backend kernels**: Native SIMD (CPU), PTX (CUDA), WGSL (WebGPU)
- Zero-copy GGUF loading with memory mapping
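
To make "block-quantized" concrete, here is a minimal sketch of a Q8_0-style scheme, in which every 32 values share one scale and each value is stored as a signed 8-bit integer. It is an illustration only, not boostr's kernel code: the names `BlockQ8`, `quantize_q8`, and `dequantize_q8` are hypothetical, and the scale is kept as `f32` to avoid dependencies (GGUF stores it as f16).

```rust
/// Number of values that share one scale in a Q8_0-style block.
const BLOCK: usize = 32;

/// One quantized block: a scale plus 32 signed 8-bit values.
/// Assumption: f32 scale for simplicity (GGUF uses f16).
#[derive(Debug, Clone)]
struct BlockQ8 {
    scale: f32,
    quants: [i8; BLOCK],
}

/// Quantize a slice whose length is a multiple of BLOCK.
fn quantize_q8(values: &[f32]) -> Vec<BlockQ8> {
    assert!(values.len() % BLOCK == 0, "length must be a multiple of the block size");
    values
        .chunks_exact(BLOCK)
        .map(|chunk| {
            // The largest magnitude in the block maps to +/-127.
            let max_abs = chunk.iter().fold(0.0f32, |m, v| m.max(v.abs()));
            let scale = max_abs / 127.0;
            let inv = if scale > 0.0 { 1.0 / scale } else { 0.0 };
            let mut quants = [0i8; BLOCK];
            for (q, v) in quants.iter_mut().zip(chunk) {
                *q = (v * inv).round().clamp(-127.0, 127.0) as i8;
            }
            BlockQ8 { scale, quants }
        })
        .collect()
}

/// Reconstruct approximate f32 values from quantized blocks.
fn dequantize_q8(blocks: &[BlockQ8]) -> Vec<f32> {
    blocks
        .iter()
        .flat_map(|b| b.quants.iter().map(move |&q| q as f32 * b.scale))
        .collect()
}
```

The round-trip error per value is bounded by half the block's scale, which is why formats with smaller blocks or extra per-block offsets trade storage for accuracy.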

### Attention

- **Flash Attention v2/v3** with fused QKV projection
- **Multi-Head Latent Attention (MLA)** — compressed KV cache
- **Grouped Query Attention (GQA)** and multi-head variants
- **Paged attention** for memory-efficient inference
- **Variable-length attention** with ragged tensors
- **Prefix caching** for context reuse
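
The idea that lets flash attention avoid materializing the full score matrix is the online softmax: keys and values are processed tile by tile while a running maximum and running sum are kept and rescaled. The sketch below shows that recurrence for a single query vector on the CPU. It assumes nothing about boostr's kernels; `streaming_attention` and its parameters are hypothetical names chosen for illustration.

```rust
/// Attention output for one query vector, streaming over keys/values in tiles.
/// Returns (output, log-sum-exp of the scaled scores).
/// Hypothetical helper for illustration; not part of boostr's API.
fn streaming_attention(
    q: &[f32],
    keys: &[Vec<f32>],
    values: &[Vec<f32>],
    tile: usize,
) -> (Vec<f32>, f32) {
    assert!(tile > 0 && !keys.is_empty() && keys.len() == values.len());
    let scale = 1.0 / (q.len() as f32).sqrt();
    let mut running_max = f32::NEG_INFINITY;
    let mut running_sum = 0.0f32;
    let mut acc = vec![0.0f32; values[0].len()];

    for (k_tile, v_tile) in keys.chunks(tile).zip(values.chunks(tile)) {
        // Scores for this tile only; the full score row is never stored.
        let scores: Vec<f32> = k_tile
            .iter()
            .map(|k| scale * q.iter().zip(k).map(|(a, b)| a * b).sum::<f32>())
            .collect();
        let tile_max = scores.iter().copied().fold(f32::NEG_INFINITY, f32::max);
        let new_max = running_max.max(tile_max);

        // Rescale everything accumulated under the previous maximum.
        let correction = (running_max - new_max).exp();
        running_sum *= correction;
        acc.iter_mut().for_each(|a| *a *= correction);

        // Fold this tile's weighted values into the accumulator.
        for (s, v) in scores.iter().zip(v_tile) {
            let w = (s - new_max).exp();
            running_sum += w;
            for (a, x) in acc.iter_mut().zip(v) {
                *a += w * x;
            }
        }
        running_max = new_max;
    }

    // Normalize once at the end.
    acc.iter_mut().for_each(|a| *a /= running_sum);
    (acc, running_max + running_sum.ln())
}
```

Any tile size gives the same result up to floating-point rounding, so the tile can be chosen to fit fast on-chip memory.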

### Position Encodings

- **RoPE**: split-half and interleaved layouts
- **ALiBi** linear attention biases
- **YaRN** for length extrapolation
- Efficient fused implementations on all backends
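
The split-half and interleaved RoPE layouts differ only in which two elements of a head vector form a rotation pair. The following sketch shows both on a single vector; it is a plain-Rust illustration under the usual RoPE definition, and `apply_rope` is a hypothetical helper rather than a boostr function.

```rust
/// Rotate one head vector in place for position `pos`.
/// `interleaved = true` pairs (x[2i], x[2i+1]);
/// otherwise (split-half) pairs (x[i], x[i + d/2]).
/// Hypothetical helper for illustration; not part of boostr's API.
fn apply_rope(x: &mut [f32], pos: usize, base: f32, interleaved: bool) {
    let d = x.len();
    assert!(d % 2 == 0, "head dimension must be even");
    let half = d / 2;
    for i in 0..half {
        // Rotation angle for pair i: pos * base^(-2i/d).
        let theta = pos as f32 * base.powf(-2.0 * i as f32 / d as f32);
        let (sin, cos) = theta.sin_cos();
        let (a, b) = if interleaved {
            (2 * i, 2 * i + 1)
        } else {
            (i, i + half)
        };
        // Standard 2D rotation of the pair.
        let (xa, xb) = (x[a], x[b]);
        x[a] = xa * cos - xb * sin;
        x[b] = xa * sin + xb * cos;
    }
}
```

Because the two layouts are permutations of each other, weights exported for one convention must be permuted (or the matching layout selected) before use with the other.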

### Model Architectures

- **LLaMA** — standard and tensor-parallelized
- **Mamba2** — state space models with SSD kernels
- **Hybrid** — mixed transformer/SSM models
- Extensible architecture system for custom models

### Neural Network Modules

- **Linear** — standard and quantized variants
- **Embedding** for token embeddings
- **LayerNorm**, **RMSNorm** with fused implementations
- **MoE layers** with expert routing and load balancing
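
Expert routing in an MoE layer usually means picking the top-k experts per token from a router's logits and renormalizing their weights. The sketch below shows that step for one token; it is a generic illustration, `route_top_k` is a hypothetical name, and boostr's own routing and load-balancing logic may differ.

```rust
/// Pick the `k` highest-scoring experts for one token and return
/// (expert index, weight) pairs whose weights sum to 1.
/// Hypothetical helper for illustration; not part of boostr's API.
fn route_top_k(logits: &[f32], k: usize) -> Vec<(usize, f32)> {
    assert!(k >= 1 && k <= logits.len());
    // Sort expert indices by descending logit; the sort is stable,
    // so ties keep the lower index first.
    let mut order: Vec<usize> = (0..logits.len()).collect();
    order.sort_by(|&a, &b| logits[b].total_cmp(&logits[a]));
    order.truncate(k);

    // Softmax over the selected logits only.
    let max = logits[order[0]];
    let exps: Vec<f32> = order.iter().map(|&i| (logits[i] - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    order
        .into_iter()
        .zip(exps)
        .map(|(i, e)| (i, e / sum))
        .collect()
}
```

Each token's output is then the weighted sum of the selected experts' outputs, so only `k` experts run per token regardless of how many exist.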

### Inference Infrastructure

- **Paged KV cache** with block allocator for memory efficiency
- **Request scheduler** with continuous batching
- **Prefix caching** for prompt reuse
- **Speculative decoding** with adaptive draft depth and verification kernels
- **Flash decoding** for single-token decode (CUDA, auto-selected when S_q=1)

### Training

- **Optimizers**: AdamW, Lamb, SGD with gradient clipping
- **Mixed precision** (AMP) with automatic loss scaling
- **Gradient accumulation** and checkpointing
- **Learning rate scheduling** (warmup, cosine, linear decay)
- **Distributed training**:
  - ZeRO stage 1/2/3 (optimizer-state/gradient/parameter sharding)
  - Tensor parallelism with communicators
  - Pipeline parallelism (1F1B, GPipe, ZeroBubble schedules)
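
As one concrete example of the scheduling listed above, a linear warmup followed by cosine decay can be written as a pure function of the step count. This is a generic sketch of that common schedule; `lr_at` and its parameters are hypothetical and do not describe boostr's scheduler API.

```rust
use std::f32::consts::PI;

/// Learning rate at `step`: linear warmup to `peak` over `warmup` steps,
/// then cosine decay to `min_lr` at step `total`.
/// Hypothetical helper for illustration; not part of boostr's API.
fn lr_at(step: usize, warmup: usize, total: usize, peak: f32, min_lr: f32) -> f32 {
    if step < warmup {
        // Ramp up so that the last warmup step reaches the peak.
        return peak * (step + 1) as f32 / warmup as f32;
    }
    // Fraction of the decay phase completed, clamped to [0, 1].
    let decay_steps = total.saturating_sub(warmup).max(1) as f32;
    let progress = ((step - warmup) as f32 / decay_steps).min(1.0);
    let cosine = 0.5 * (1.0 + (PI * progress).cos());
    min_lr + (peak - min_lr) * cosine
}
```

With `warmup = 10`, `total = 100`, and `peak = 1.0`, the rate climbs from 0.1 at step 0 to 1.0 at step 9, then falls along a half cosine to `min_lr` at step 100.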

### Model Loading

- **SafeTensors**: Zero-copy memory-mapped loading
- **GGUF**: Full format support with block-quantized tensors
- Format auto-detection
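
Format detection can be done from the first bytes of a file: GGUF files start with the ASCII magic `GGUF`, while SafeTensors files start with an 8-byte little-endian header length followed by a JSON object. The sketch below is one plausible way to implement such a check. How boostr actually detects formats is not stated here, and `ModelFormat` and `detect_format` are hypothetical names.

```rust
/// Formats this sketch can recognize.
/// Hypothetical type for illustration; not part of boostr's API.
#[derive(Debug, PartialEq)]
enum ModelFormat {
    Gguf,
    SafeTensors,
    Unknown,
}

/// Guess the model format from the leading bytes of a file.
fn detect_format(bytes: &[u8]) -> ModelFormat {
    // GGUF files begin with the ASCII magic "GGUF".
    if bytes.starts_with(b"GGUF") {
        return ModelFormat::Gguf;
    }
    // SafeTensors: u64 little-endian header length, then a JSON header
    // whose first byte is '{'.
    if bytes.len() > 8 {
        let mut len = [0u8; 8];
        len.copy_from_slice(&bytes[..8]);
        let header_len = u64::from_le_bytes(len);
        if header_len > 0 && bytes[8] == b'{' {
            return ModelFormat::SafeTensors;
        }
    }
    ModelFormat::Unknown
}
```

Only the first nine bytes are inspected, so the check can run on a small read or on the start of a memory-mapped file before any tensor data is touched.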

### Multi-Backend

- **CPU**: SIMD kernels (AVX2, NEON), native ops
- **CUDA**: PTX kernels, Flash Attention v2/v3, fused ops (CUDA 12.x)
- **WebGPU**: WGSL shaders, cross-platform GPU support

## Architecture

```
┌───────────────────────────────────────────────────────┐
│                    boostr                             │
│   (attention, RoPE, MoE, quantization, model loaders) │
└──────────────────────────┬────────────────────────────┘
                           │
                        (uses)
                           │
┌──────────────────────────▼───────────────────────────┐
│                      numr                            │
│   (tensors, ops, runtime, autograd, linalg, FFT)     │
└──────────────────────────────────────────────────────┘
```

**Design principles:**

- **Extension traits**: ML ops (AttentionOps, RoPEOps) implemented on numr's clients — not new types
- **QuantTensor**: Separate type for quantized data with custom kernels
- **impl_generic**: Composite ops composed from numr primitives, same logic on all backends
- **Custom kernels**: Dequant, quantized matmul, fused attention use per-backend optimizations (SIMD/PTX/WGSL)
- **Vendor-agnostic**: No cuBLAS, cuDNN, or MKL; all native kernels
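
The extension-trait and `impl_generic` principles can be illustrated with a toy example: a composite op is written once against a small set of primitive ops, and a blanket impl makes it available on every client that provides those primitives. All names below (`ToyClient`, `BaseOps`, `SoftmaxOps`) are hypothetical stand-ins and not numr or boostr types.

```rust
/// Stand-in for a backend client owned by a foundation crate (hypothetical).
struct ToyClient;

/// Primitive ops the foundation crate would provide (hypothetical).
trait BaseOps {
    fn exp(&self, x: &[f32]) -> Vec<f32>;
    fn sum(&self, x: &[f32]) -> f32;
}

impl BaseOps for ToyClient {
    fn exp(&self, x: &[f32]) -> Vec<f32> {
        x.iter().map(|v| v.exp()).collect()
    }
    fn sum(&self, x: &[f32]) -> f32 {
        x.iter().sum()
    }
}

/// Extension trait: a composite op written once in terms of the primitives.
trait SoftmaxOps: BaseOps {
    fn softmax(&self, x: &[f32]) -> Vec<f32> {
        // Subtract the maximum for numerical stability.
        let max = x.iter().copied().fold(f32::NEG_INFINITY, f32::max);
        let shifted: Vec<f32> = x.iter().map(|v| v - max).collect();
        let e = self.exp(&shifted);
        let total = self.sum(&e);
        e.iter().map(|v| v / total).collect()
    }
}

/// Blanket impl: every client with the primitives gets the composite op.
impl<C: BaseOps> SoftmaxOps for C {}
```

Callers write `client.softmax(&x)` on the existing client type; a backend that wants a hand-tuned kernel would provide its own implementation instead of relying on the shared default.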

## Quick Start

### Installation

Add to `Cargo.toml`:

```toml
[dependencies]
boostr = "<latest-version>"

# With CUDA support (requires CUDA 12.x)
# boostr = { version = "<latest-version>", features = ["cuda"] }

# With WebGPU support
# boostr = { version = "<latest-version>", features = ["wgpu"] }
```

### Build

```bash
# CPU build
cargo build --release

# CUDA support (requires CUDA 12.x)
cargo build --release --features cuda

# WebGPU support
cargo build --release --features wgpu

# Run tests
cargo test
cargo test --features cuda
```

### Basic Usage

```rust
use boostr::*;
use boostr::ops::traits::attention::flash::FlashAttentionOps;
use numr::ops::RandomOps;
use numr::runtime::cpu::{CpuClient, CpuDevice};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = CpuClient::new(CpuDevice::new());

    // Create random tensors via numr's RandomOps
    let q = client.randn(&[1, 8, 32, 64], DType::F32)?; // [batch, heads, seq, dim]
    let k = client.randn(&[1, 8, 32, 64], DType::F32)?;
    let v = client.randn(&[1, 8, 32, 64], DType::F32)?;

    // Flash Attention forward pass
    let (output, _lse) = client.flash_attention_fwd(
        &q, &k, &v,
        8,    // num_heads
        8,    // num_kv_heads
        64,   // head_dim
        true, // causal
        0,    // window_size (0 = no sliding window)
        None, // kv_seq_len override
    )?;

    Ok(())
}
```
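
In the call above, `num_heads` equals `num_kv_heads`, which is ordinary multi-head attention. With grouped-query attention, several query heads share one KV head, and the `causal` and `window_size` arguments decide which key positions a query may see. The sketch below spells out both rules under common conventions. The helpers are hypothetical, and both the contiguous head grouping and the exact window boundary are assumptions rather than documented boostr behavior.

```rust
/// Map a query head to the KV head it reads from.
/// Assumption: query heads are grouped contiguously.
fn kv_head_for(q_head: usize, num_heads: usize, num_kv_heads: usize) -> usize {
    assert!(num_kv_heads > 0 && num_heads % num_kv_heads == 0);
    assert!(q_head < num_heads);
    q_head / (num_heads / num_kv_heads)
}

/// Whether the query at `q_pos` may attend to the key at `k_pos`.
/// `window == 0` means no sliding window, matching the comment in the example.
/// Assumption: a window of w covers positions within a distance below w.
fn can_attend(q_pos: usize, k_pos: usize, causal: bool, window: usize) -> bool {
    if causal && k_pos > q_pos {
        return false; // no looking at future tokens
    }
    if window > 0 && q_pos.abs_diff(k_pos) >= window {
        return false; // outside the sliding window
    }
    true
}
```

With 8 query heads and 2 KV heads, heads 0 to 3 read KV head 0 and heads 4 to 7 read KV head 1, which cuts KV cache size by a factor of four.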

### Loading a Model

```rust
use boostr::CpuRuntime;
use boostr::format::Gguf;
use numr::runtime::Runtime;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Open a GGUF model file (with optional memory mapping)
    let mut gguf = Gguf::open("model.gguf")?;
    let _metadata = gguf.metadata();
    let device = <CpuRuntime as Runtime>::Device::default();

    // Collect the names first so the loop body can borrow `gguf` again
    let names: Vec<String> = gguf.tensor_names().map(|s| s.to_string()).collect();

    // Load tensors: quantized ones as QuantTensor, the rest as f32
    for name in names {
        let info = gguf.tensor_info(&name)?;
        if info.ggml_type.is_quantized() {
            let _qt = gguf.load_tensor_quantized::<CpuRuntime>(&name, &device)?;
        } else {
            let _t = gguf.load_tensor_f32::<CpuRuntime>(&name, &device)?;
        }
    }

    Ok(())
}
```

### Inference with KV Cache

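```rust
use boostr::inference::PagedKvCache;

// Fragment: `client`, `num_layers`, `batch_size`, `max_seq_len`, `head_dim`,
// and `seq_len` are assumed to be defined by the surrounding code.

// Create a paged KV cache for efficient inference
let mut kv_cache = PagedKvCache::new(
    &client,
    num_layers,
    batch_size,
    max_seq_len,
    head_dim,
)?;

// Process tokens with cache
for token_idx in 0..seq_len {
    for layer_idx in 0..num_layers {
        // ... forward pass for this layer using kv_cache, producing `k` and `v` ...
        kv_cache.update(layer_idx, &k, &v)?;
    }
}
```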

## Feature Flags

| Feature       | Purpose                            | Dependencies            |
| ------------- | ---------------------------------- | ----------------------- |
| `cpu`         | CPU backend (default)              | numr                    |
| `cuda`        | CUDA GPU acceleration (CUDA 12.x)  | numr/cuda, cudarc       |
| `nccl`        | Multi-GPU via NCCL                 | numr/nccl               |
| `wgpu`        | WebGPU cross-platform GPU          | numr/wgpu               |
| `distributed` | Distributed inference over nexar   | nexar, anyhow, bytemuck |
| `f16`         | Half-precision float support       | numr/f16                |
| `fp8`         | FP8 precision support              | numr/fp8                |
| `tts-g2p`     | Grapheme-to-phoneme via espeak-ng¹ | espeakng                |

¹ Requires `libespeak-ng` available at runtime.

## Module Overview

- **`ops/`** — ML-specific operations (attention, RoPE, MoE, etc.)
- **`quant/`** — Quantized tensors and kernels (26 formats)
- **`nn/`** — Neural network modules (Linear, Embedding, LayerNorm, RMSNorm, MoE)
- **`model/`** — Model architectures (LLaMA, Mamba2, Hybrid)
- **`format/`** — Model loaders (SafeTensors, GGUF)
- **`inference/`** — Inference infrastructure (KV cache, scheduling, batching)
- **`optimizer/`** — Training optimizers (AdamW, Lamb, SGD)
- **`trainer/`** — Training utilities and distributed training (ZeRO, tensor/pipeline parallelism)
- **`distributed/`** — Multi-GPU coordination

## Performance

boostr provides production-grade performance through:

- **Fused kernels** — Attention, layer norm, optimizer steps compiled to single kernels
- **Custom quantization** — Per-format SIMD/PTX/WGSL kernels for dequant and quantized matmul
- **Memory efficiency** — Paged KV cache, prefix caching, gradient checkpointing
- **Distributed training** — ZeRO stages, tensor/pipeline parallelism with minimal communication overhead
- **Zero-copy loading** — Memory-mapped GGUF with quantized weights

## Ecosystem

boostr is part of the [ml-rust](https://github.com/ml-rust) organization:

- **[numr](https://github.com/ml-rust/numr)** — Foundational numerical computing (tensors, ops, linalg, FFT)
- **[boostr](https://github.com/ml-rust/boostr)** — ML framework (this project)
- **[oxidizr](https://github.com/ml-rust/oxidizr)** — Training framework for Mamba2, MLA, MoE (uses boostr)
- **[blazr](https://github.com/ml-rust/blazr)** — Inference server with OpenAI-compatible API (uses boostr)
- **[compressr](https://github.com/ml-rust/compressr)** — Model quantization and compression (uses boostr)
- **[splintr](https://github.com/ml-rust/splintr)** — High-performance BPE tokenizer

## Building from Source

### Requirements

- Rust 1.85+
- For CUDA: CUDA 12.x and cudarc dependencies
- For WebGPU: wgpu and platform GPU drivers

### Clone and Build

```bash
git clone https://github.com/ml-rust/boostr.git
cd boostr

# CPU
cargo build --release

# CUDA
cargo build --release --features cuda

# Run tests
cargo test --all-features

# Format and lint
cargo fmt --all
cargo clippy --all-targets
```

## Documentation

- [API Documentation](https://docs.rs/boostr) — Full reference for public API
- [numr Documentation](https://docs.rs/numr) — Tensor and runtime types

## Testing

```bash
# Run all tests
cargo test --all-features

# Specific test suite
cargo test ops::attention --all-features

# Verbose output
cargo test --all-features -- --nocapture
```

## Contributing

Contributions are welcome! See [CONTRIBUTING.md](CONTRIBUTING.md) for architecture conventions, the `impl_generic` pattern, and pull request guidance.

## License

Licensed under the Apache License, Version 2.0. See [LICENSE](LICENSE) for details.

## Acknowledgments

boostr builds on the numerical foundation provided by [numr](https://github.com/ml-rust/numr) and is designed to power production ML infrastructure across training (oxidizr) and inference (blazr).