# boostr
ML framework built on numr — attention, quantization, model architectures.
boostr extends numr with production-grade ML primitives. It provides attention mechanisms, quantization support, model architectures, and inference infrastructure — all built on numr's foundational tensors, runtimes, and ops. No reimplementation. No wrappers. Pure extension traits.
## Key Capabilities
### Quantization
- 26 formats (GGUF-compatible): Q4_0, Q4_1, Q5_0, Q5_1, Q8_0, Q8_1, Q2K–Q8K, IQ1S–IQ4XS, TQ1_0, TQ2_0
- QuantTensor type for block-quantized data
- Per-backend kernels: Native SIMD (CPU), PTX (CUDA), WGSL (WebGPU)
- Zero-copy GGUF loading with memory mapping
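To make block quantization concrete, here is a minimal Q8_0-style round trip in plain Rust: each block of 32 values stores one scale plus 32 signed bytes. This is a sketch of the format's idea, not boostr's `QuantTensor` API (real Q8_0 stores the scale as f16, and boostr's kernels are SIMD/PTX/WGSL):

```rust
// Sketch of Q8_0-style block quantization; names are illustrative.
const BLOCK: usize = 32;

struct BlockQ8_0 {
    scale: f32,         // real Q8_0 stores this as f16
    quants: [i8; BLOCK],
}

fn quantize_q8_0(block: &[f32; BLOCK]) -> BlockQ8_0 {
    // Choose the scale so the largest magnitude maps to 127.
    let max_abs = block.iter().fold(0f32, |m, &v| m.max(v.abs()));
    let scale = if max_abs == 0.0 { 1.0 } else { max_abs / 127.0 };
    let mut quants = [0i8; BLOCK];
    for (q, &v) in quants.iter_mut().zip(block) {
        *q = (v / scale).round().clamp(-127.0, 127.0) as i8;
    }
    BlockQ8_0 { scale, quants }
}

fn dequantize_q8_0(b: &BlockQ8_0) -> [f32; BLOCK] {
    let mut out = [0f32; BLOCK];
    for (o, &q) in out.iter_mut().zip(&b.quants) {
        *o = q as f32 * b.scale;
    }
    out
}

fn main() {
    let mut x = [0f32; BLOCK];
    for i in 0..BLOCK {
        x[i] = (i as f32) - 16.0;
    }
    let q = quantize_q8_0(&x);
    let y = dequantize_q8_0(&q);
    // Round-trip error is bounded by half a quantization step.
    for i in 0..BLOCK {
        assert!((x[i] - y[i]).abs() <= q.scale * 0.5 + 1e-6);
    }
}
```

The other formats (Q4_0, the K-quants, the IQ series) vary the block size, bit width, and scale layout, but follow the same block-plus-scale structure.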
### Attention
- Flash Attention v2/v3 with fused QKV projection
- Multi-Head Latent Attention (MLA) — compressed KV cache
- Grouped Query Attention (GQA) and multi-head variants
- Paged attention for memory-efficient inference
- Variable-length attention with ragged tensors
- Prefix caching for context reuse
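The GQA variant above reduces KV-cache size by letting groups of query heads share one KV head. The mapping is simple integer arithmetic; this sketch (function name is illustrative, not boostr's API) shows it:

```rust
// Grouped-query attention head mapping: each contiguous group of query
// heads reads the same KV head. Illustrative, not boostr's API.
fn kv_head_for(q_head: usize, n_q_heads: usize, n_kv_heads: usize) -> usize {
    assert!(n_q_heads % n_kv_heads == 0, "query heads must divide evenly");
    let group_size = n_q_heads / n_kv_heads;
    q_head / group_size
}

fn main() {
    // 32 query heads sharing 8 KV heads => groups of 4.
    assert_eq!(kv_head_for(0, 32, 8), 0);
    assert_eq!(kv_head_for(5, 32, 8), 1);
    assert_eq!(kv_head_for(31, 32, 8), 7);
    // Plain multi-head attention is the special case n_kv_heads == n_q_heads.
    assert_eq!(kv_head_for(13, 32, 32), 13);
}
```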
### Position Encodings

- RoPE: split-half and interleaved layouts; ALiBi positional biases
- YaRN for length extrapolation
- Efficient fused implementations on all backends
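For reference, the split-half RoPE layout rotates the pair `(x[i], x[i + d/2])` by an angle that depends on the position and the dimension index. A plain-Rust sketch of that rotation (boostr's fused kernels compute the same math per backend):

```rust
// Split-half RoPE: pair (i, i + d/2) is rotated by
// theta = pos * base^(-2i/d). Reference implementation, not a fused kernel.
fn rope_split_half(x: &mut [f32], pos: usize, theta_base: f32) {
    let d = x.len();
    let half = d / 2;
    for i in 0..half {
        let theta = pos as f32 * theta_base.powf(-2.0 * i as f32 / d as f32);
        let (sin, cos) = theta.sin_cos();
        let (a, b) = (x[i], x[i + half]);
        x[i] = a * cos - b * sin;
        x[i + half] = a * sin + b * cos;
    }
}

fn main() {
    // Position 0 is the identity rotation.
    let mut x = [1.0f32, 0.0, 0.0, 1.0];
    rope_split_half(&mut x, 0, 10000.0);
    assert_eq!(x, [1.0, 0.0, 0.0, 1.0]);

    // Rotation preserves the norm of each pair.
    let mut y = [3.0f32, 0.0, 4.0, 0.0];
    rope_split_half(&mut y, 7, 10000.0);
    let n = (y[0] * y[0] + y[2] * y[2]).sqrt();
    assert!((n - 5.0).abs() < 1e-4);
}
```

The interleaved layout rotates adjacent pairs `(x[2i], x[2i+1])` instead; only the indexing changes.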
### Model Architectures
- LLaMA — standard and tensor-parallelized
- Mamba2 — state space models with SSD kernels
- Hybrid — mixed transformer/SSM models
- Extensible architecture system for custom models
### Neural Network Modules
- Linear — standard and quantized variants
- Embedding for token embeddings
- LayerNorm, RMSNorm with fused implementations
- MoE layers with expert routing and load balancing
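As a reference point for the normalization modules, RMSNorm (the LLaMA-family norm) computes `y_i = x_i / sqrt(mean(x^2) + eps) * gamma_i`. A plain-Rust version of that formula (boostr's fused per-backend kernels compute the same quantity):

```rust
// Reference RMSNorm: normalize by the root-mean-square of the input,
// then apply a learned per-element gain. Not the fused kernel itself.
fn rms_norm(x: &[f32], gamma: &[f32], eps: f32) -> Vec<f32> {
    let mean_sq = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let inv_rms = 1.0 / (mean_sq + eps).sqrt();
    x.iter().zip(gamma).map(|(v, g)| v * inv_rms * g).collect()
}

fn main() {
    let x = [2.0, -2.0, 2.0, -2.0];
    let gamma = [1.0, 1.0, 1.0, 1.0];
    let y = rms_norm(&x, &gamma, 0.0);
    // mean(x^2) = 4, rms = 2, so every element normalizes to +/-1.
    assert!((y[0] - 1.0).abs() < 1e-6);
    assert!((y[1] + 1.0).abs() < 1e-6);
}
```

Unlike LayerNorm, RMSNorm skips the mean subtraction and bias, which is why it fuses into a single cheap kernel.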
### Inference Infrastructure
- Paged KV cache with block allocator for memory efficiency
- Request scheduler with continuous batching
- Prefix caching for prompt reuse
- Speculative decoding with adaptive draft depth and verification kernels
- Flash decoding for single-token decode (CUDA, auto-selected when S_q=1)
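The paged KV cache works by scattering a sequence's tokens across fixed-size blocks through a per-sequence block table, so cache memory grows in block-sized chunks instead of one contiguous allocation. A sketch of the addressing scheme (block size and names are illustrative, not boostr's layout):

```rust
// Paged KV cache addressing sketch: a block table maps logical block
// indices to physical blocks; offsets within a block are modular.
const BLOCK_SIZE: usize = 16;

/// Map a logical token index to (physical block id, offset within block).
fn locate(block_table: &[usize], token_idx: usize) -> (usize, usize) {
    let block = block_table[token_idx / BLOCK_SIZE];
    (block, token_idx % BLOCK_SIZE)
}

fn main() {
    // This sequence owns physical blocks 7, 2, and 9, in logical order.
    let block_table = [7usize, 2, 9];
    assert_eq!(locate(&block_table, 0), (7, 0));
    assert_eq!(locate(&block_table, 17), (2, 1));
    assert_eq!(locate(&block_table, 40), (9, 8));
}
```

Prefix caching falls out of this layout: two requests with a shared prompt can point their block tables at the same physical blocks.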
### Training
- Optimizers: AdamW, Lamb, SGD with gradient clipping
- Mixed precision (AMP) with automatic loss scaling
- Gradient accumulation and checkpointing
- Learning rate scheduling (warmup, cosine, linear decay)
- Distributed training:
  - ZeRO stage 1/2/3 (parameter/gradient/optimizer sharding)
  - Tensor parallelism with communicators
  - Pipeline parallelism (1F1B, GPipe, ZeroBubble schedules)
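The warmup-plus-cosine schedule mentioned above is a small formula worth seeing once: linear ramp to the peak rate, then a half-cosine decay to a floor. A sketch (parameter names are illustrative, not boostr's scheduler API):

```rust
// Warmup-then-cosine learning-rate schedule, formula only.
// Names are illustrative, not boostr's scheduler API.
fn lr_at(step: usize, warmup: usize, total: usize, peak: f32, min_lr: f32) -> f32 {
    if step < warmup {
        // Linear warmup from 0 up to the peak rate.
        peak * (step as f32 + 1.0) / warmup as f32
    } else {
        // Cosine decay from peak down to min_lr over the remaining steps.
        let t = (step - warmup) as f32 / (total - warmup) as f32;
        min_lr + 0.5 * (peak - min_lr) * (1.0 + (std::f32::consts::PI * t).cos())
    }
}

fn main() {
    let (warmup, total, peak, min_lr) = (100, 1000, 3e-4f32, 3e-5f32);
    // End of warmup hits the peak rate exactly.
    assert!((lr_at(99, warmup, total, peak, min_lr) - peak).abs() < 1e-9);
    // End of training lands on the floor.
    assert!((lr_at(total, warmup, total, peak, min_lr) - min_lr).abs() < 1e-7);
    // Monotonically decaying after warmup.
    assert!(lr_at(200, warmup, total, peak, min_lr)
        > lr_at(800, warmup, total, peak, min_lr));
}
```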
### Model Loading
- SafeTensors: Zero-copy memory-mapped loading
- GGUF: Full format support with block-quantized tensors
- Format auto-detection
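Format auto-detection can be done from the first bytes of the file: a GGUF file begins with the ASCII magic `GGUF`, while a SafeTensors file begins with a little-endian u64 header length followed by a JSON header that starts with `{`. This sketch shows the idea (illustrative logic, not boostr's loader code):

```rust
// Detect model file format from a prefix of the file's bytes.
// Illustrative sketch, not boostr's actual loader.
#[derive(Debug, PartialEq)]
enum ModelFormat {
    Gguf,
    SafeTensors,
    Unknown,
}

fn detect_format(prefix: &[u8]) -> ModelFormat {
    if prefix.len() >= 4 && &prefix[..4] == b"GGUF" {
        ModelFormat::Gguf
    } else if prefix.len() >= 9 && prefix[8] == b'{' {
        // Bytes 0..8 hold the SafeTensors JSON header length (u64 LE);
        // the JSON header itself starts at byte 8.
        ModelFormat::SafeTensors
    } else {
        ModelFormat::Unknown
    }
}

fn main() {
    assert_eq!(detect_format(b"GGUF\x03\x00\x00\x00"), ModelFormat::Gguf);

    let mut st = 64u64.to_le_bytes().to_vec();
    st.extend_from_slice(b"{\"layer.weight\":");
    assert_eq!(detect_format(&st), ModelFormat::SafeTensors);

    assert_eq!(detect_format(b"\x7fELF"), ModelFormat::Unknown);
}
```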
### Multi-Backend
- CPU: SIMD kernels (AVX2, NEON), native ops
- CUDA: PTX kernels, Flash Attention v2/v3, fused ops (CUDA 12.x)
- WebGPU: WGSL shaders, cross-platform GPU support
## Architecture

```text
┌─────────────────────────────────────────────────────┐
│                       boostr                        │
│ (attention, RoPE, MoE, quantization, model loaders) │
└──────────────────────────┬──────────────────────────┘
                           │
                        (uses)
                           │
┌──────────────────────────▼──────────────────────────┐
│                        numr                         │
│   (tensors, ops, runtime, autograd, linalg, FFT)    │
└─────────────────────────────────────────────────────┘
```
Design principles:
- Extension traits: ML ops (AttentionOps, RoPEOps) implemented on numr's clients — not new types
- QuantTensor: Separate type for quantized data with custom kernels
- impl_generic: Composite ops composed from numr primitives, same logic on all backends
- Custom kernels: Dequant, quantized matmul, fused attention use per-backend optimizations (SIMD/PTX/WGSL)
- Vendor-agnostic: No cuBLAS, cuDNN, or MKL; all native kernels
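The extension-trait principle can be shown in miniature: a new capability is attached to an existing type through a trait with a blanket impl, so no wrapper type is introduced. All names below (`Tensor`, `CpuTensor`, `AttentionExt`) are illustrative stand-ins, not boostr's real traits:

```rust
// The extension-trait pattern in miniature. Names are illustrative.

// "numr-side" core trait and type.
trait Tensor {
    fn scale(&self, s: f32) -> Self;
}

#[derive(Debug, PartialEq)]
struct CpuTensor(Vec<f32>);

impl Tensor for CpuTensor {
    fn scale(&self, s: f32) -> Self {
        CpuTensor(self.0.iter().map(|v| v * s).collect())
    }
}

// "boostr-side" extension trait with a blanket impl: every Tensor gains
// half() automatically, without being wrapped or reimplemented.
trait AttentionExt: Tensor + Sized {
    fn half(&self) -> Self {
        self.scale(0.5)
    }
}
impl<T: Tensor + Sized> AttentionExt for T {}

fn main() {
    let t = CpuTensor(vec![2.0, 4.0]);
    assert_eq!(t.half(), CpuTensor(vec![1.0, 2.0]));
}
```

This is why boostr needs "no reimplementation, no wrappers": the blanket impl composes the new op from the primitives the core type already exposes.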
## Quick Start
### Installation
Add to `Cargo.toml`:

```toml
[dependencies]
boostr = "<latest-version>"

# With CUDA support (requires CUDA 12.x)
# boostr = { version = "0.1", features = ["cuda"] }

# With WebGPU support
# boostr = { version = "0.1", features = ["wgpu"] }
```
### Build

```bash
# CPU build
cargo build --release

# CUDA support (requires CUDA 12.x)
cargo build --release --features cuda

# WebGPU support
cargo build --release --features wgpu

# Run tests
cargo test
```
### Basic Usage

The import paths below are illustrative sketches; consult the API documentation for the exact module locations:

```rust
// Module paths are assumptions, not verified against the published API.
use boostr::prelude::*;
use numr::ops::RandomOps;
use numr::runtime::cpu::CpuClient;
```
### Loading a Model

A sketch of GGUF loading; the paths and method signatures here are assumptions reconstructed from the surrounding comments, not the verified API:

```rust
// Illustrative paths and signatures.
use boostr::format::gguf::Gguf;
use numr::runtime::Runtime;

// Open a GGUF model file (with optional memory mapping)
let mut gguf = Gguf::open("model.gguf")?;
let metadata = gguf.metadata();
let device = Runtime::default();

// Load tensors — quantized as QuantTensor, others as f32
for name in gguf.tensor_names() {
    // ...
}
```
### Inference with KV Cache

A sketch; the constructor arguments and types are assumptions:

```rust
// Illustrative path and signature.
use boostr::inference::PagedKvCache;

// Create a paged KV cache for efficient inference
let mut kv_cache = PagedKvCache::new(cache_config)?;

// Process tokens with cache
for token_idx in 0..seq_len {
    // ...
}
```
## Feature Flags

| Feature | Purpose | Dependencies |
|---|---|---|
| `cpu` | CPU backend (default) | numr |
| `cuda` | CUDA GPU acceleration | numr/cuda, cudarc |
| `nccl` | Multi-GPU via NCCL | numr/nccl |
| `wgpu` | WebGPU cross-platform GPU | numr/wgpu |
| `f16` | Half-precision float support | numr/f16 |
| `fp8` | FP8 precision support | numr/fp8 |
## Module Overview

- `ops/` — ML-specific operations (attention, RoPE, MoE, etc.)
- `quant/` — Quantized tensors and kernels (26 formats)
- `nn/` — Neural network modules (Linear, Embedding, LayerNorm, RMSNorm, MoE)
- `model/` — Model architectures (LLaMA, Mamba2, Hybrid)
- `format/` — Model loaders (SafeTensors, GGUF)
- `inference/` — Inference infrastructure (KV cache, scheduling, batching)
- `optimizer/` — Training optimizers (AdamW, Lamb, SGD)
- `trainer/` — Training utilities and distributed training (ZeRO, tensor/pipeline parallelism)
- `distributed/` — Multi-GPU coordination
## Performance
boostr provides production-grade performance through:
- Fused kernels — Attention, layer norm, optimizer steps compiled to single kernels
- Custom quantization — Per-format SIMD/PTX/WGSL kernels for dequant and quantized matmul
- Memory efficiency — Paged KV cache, prefix caching, gradient checkpointing
- Distributed training — ZeRO stages, tensor/pipeline parallelism with minimal communication overhead
- Zero-copy loading — Memory-mapped GGUF with quantized weights
## Ecosystem
boostr is part of the ml-rust organization:
- numr — Foundational numerical computing (tensors, ops, linalg, FFT)
- boostr — ML framework (this project)
- oxidizr — Training framework for Mamba2, MLA, MoE (uses boostr)
- blazr — Inference server with OpenAI-compatible API (uses boostr)
- compressr — Model quantization and compression (uses boostr)
- splintr — High-performance BPE tokenizer
## Building from Source
### Requirements
- Rust 1.85+
- For CUDA: CUDA 12.x and cudarc dependencies
- For WebGPU: wgpu and platform GPU drivers
### Clone and Build

```bash
git clone <repository-url>
cd boostr

# CPU
cargo build --release

# CUDA
cargo build --release --features cuda

# Run tests
cargo test

# Format and lint
cargo fmt && cargo clippy
```
## Documentation
- API Documentation — Full reference for public API
- numr Documentation — Tensor and runtime types
## Testing

```bash
# Run all tests
cargo test

# Specific test suite
cargo test --test <suite-name>

# Verbose output
cargo test -- --nocapture
```
## Contributing
Contributions are welcome! Please see the main repository's contribution guidelines.
## License
Licensed under the Apache License, Version 2.0. See LICENSE for details.
## Acknowledgments
boostr builds on the numerical foundation provided by numr and is designed to power production ML infrastructure across training (oxidizr) and inference (blazr).