ruvllm_sparse_attention
Subquadratic sparse attention for Rust LLM inference on edge devices — runs transformer attention in O(N log N) instead of O(N²), with an optional FastGRNN salience gate that pushes it to near-linear O(N). Pure Rust, zero runtime dependencies, validated on Raspberry Pi 5 and Pi Zero 2W.
TL;DR — A drop-in attention kernel for small language models (SmolLM2, TinyLlama, Phi-2, Mistral-7B, Llama-3) that runs faster on small hardware than dense attention does on big hardware, without giving up quality on the tasks the model was trained for.
Table of contents
- Why sparse attention?
- What this crate gives you
- Quick start (60 seconds)
- Near-linear with FastGRNN gating (new)
- Supported platforms
- Performance
- Feature flags
- How it works
- FAQ
- Tutorial
- License
Why sparse attention?
Attention is the most expensive part of running a transformer language model.
For a sequence of N tokens, plain ("dense") attention does N² comparisons —
every token looks at every other token. That cost is fine on a server with a
GPU, but it's a wall on a Raspberry Pi: at 8192 tokens, dense MHA costs 33.5
million edge evaluations per layer per head.
Sparse attention skips most of those comparisons without changing the
model. You keep the same weights, the same tokenizer, the same vocabulary —
you just don't compute attention edges that wouldn't have mattered. This crate
implements three sparsity primitives that compose into an O(N log N) kernel:
- Local window — every token attends to its
Wnearest neighbors. Captures short-range syntax for free. - Log-stride lookback — each token attends to positions
i-1, i-2, i-4, i-8, .... Captures long-range structure withlog(N)extra edges per token. - Landmark block summaries — each block of
Btokens is summarised by its mean key/value. Far-away tokens are still reachable via their block mean.
For the new FastGRNN salience gate (this release), see Near-linear with FastGRNN gating.
Who is this for?
- Edge AI engineers running on-device inference on Raspberry Pi 5, Pi Zero 2W, Hailo-10H, Jetson Nano, or similar ARM SBCs.
- Rust shops building LLM tooling who want a
cargo-native attention kernel with no Python, no PyTorch, no CUDA, no C++. - Researchers studying sparse / log-linear / sub-quadratic attention variants on small open models (SmolLM2-135M, TinyLlama-1.1B, Phi-2, Mistral-7B, Llama-3).
What this crate is NOT
- It is not a full LLM runtime. You bring your own model weights,
tokenizer, and decode loop. (
ruvllmandcognitum-agentare example consumers.) - It does not train models. Inference only.
- It does not require GPU. CPU-only, NEON / AVX SIMD via Rust iterators.
What this crate gives you
forward(q, k, v)— sparse attention prefill in O(N log N).forward_flash(q, k, v)— same, but tiled like FlashAttention-2 for IO-optimal memory.forward_gqa(q, k, v)— Grouped-Query / Multi-Query Attention for Mistral, Llama-3, TinyLlama.forward_auto(q, k, v)— picks the right path based onq/k/vshapes.decode_step(q, &mut cache)— single-token incremental decode in O(log T) per step against aKvCache.forward_gated(q, k, v, &keep_mask)— NEW. Drop tokens from the long-range candidate set against a binary mask. Combined withFastGrnnGatethis gives anO(N)near-linear path.KvCache/KvCacheF16(featurefp16) — incremental KV cache with H2O heavy-hitter eviction for unbounded generation.
Quick start (60 seconds)
Add the crate:
# Cargo.toml
[]
= { = "https://github.com/ruvnet/RuVector", = ["fp16"] }
Run sparse attention:
use ;
let cfg = default;
let attn = new.unwrap;
// Q/K/V tensors: [seq, heads, head_dim]
let q = zeros;
let k = zeros;
let v = zeros;
let out = attn.forward.unwrap; // O(N log N) prefill
Single-token decode with KV cache (use this for token-by-token generation):
use KvCache;
let mut cache = new;
cache.try_append.unwrap;
let out = attn.decode_step.unwrap; // O(log T) per step
Near-linear with FastGRNN gating (new)
The vanilla sparse path is O(N log N). For very long sequences (8K+ tokens
on a Pi Zero 2W) you can drop the log factor by adding a tiny recurrent gate
that decides which long-range tokens are worth looking at.
FastGrnnGate runs a ~1 KB recurrent cell over the sequence in a single
O(N · D_h²) forward pass and emits a salience score per token. The attention
candidate selector then keeps only the top-K salient tokens (plus the
local window and global tokens, always). Per-position cost is bounded by
window + globals + K — constant in N. Combined cost is O(N).
use ;
let attn = new.unwrap;
let gate = new;
let q = zeros;
let k = zeros;
let v = zeros;
// Gate keeps the top-8 salient long-range tokens per position.
let out = attn.forward_gated_with_fastgrnn.unwrap;
Measured scaling (release build, x86-64, 4 heads × dim 32):
| seq | sparse forward per-token (μs) |
gated forward_gated_with_fastgrnn per-token (μs) |
per-token growth ratio |
|---|---|---|---|
| 128 | 2.1 | 2.9 | — |
| 256 | 2.4 | 3.2 | +14% / +10% |
| 512 | 2.6 | 3.4 | +8% / +6% |
| 1024 | 2.8 | 3.6 | +8% / +6% |
| 2048 | 2.9 | 3.6 | +4% / +0% |
Run the demo yourself:
The gated path's per-token cost flattens (sub-logarithmic growth); the ungated
path's per-token cost grows logarithmically. The crossover point depends on
top_k and window; see the tutorial for tuning guidance.
Causality is preserved. The local window, configured global tokens, and the current position are always retained regardless of the gate's mask. The gate only prunes the long-range log-stride candidate set.
Supported platforms
Validated on:
- Raspberry Pi 5 (Cortex-A76 @ 2.4 GHz, 8 GB) — primary edge target
- Raspberry Pi Zero 2W (Cortex-A53 @ 1 GHz, 512 MB) — minimum edge target via cognitum-agent
- Hailo-10H cluster (4× Pi 5 + Hailo NPU nodes) — production cluster
- x86-64 Linux (AMD Ryzen / Intel) — dev / CI
Cross-compile for Pi 5:
RUSTFLAGS="-C target-cpu=cortex-a76 -C target-feature=+lse,+rcpc,+fp16,+crc" \
CARGO_TARGET_AARCH64_UNKNOWN_LINUX_GNU_LINKER=aarch64-linux-gnu-gcc \
Supported models
Pre-validated against the most popular small-and-medium open LLMs:
| Model | Attention type | Q heads | KV heads | forward_auto path |
|---|---|---|---|---|
| Phi-2 | MHA | 32 | 32 | forward / forward_flash |
| Mistral-7B | GQA (4:1) | 32 | 8 | forward_gqa / forward_gqa_flash |
| Llama-3-8B | GQA (4:1) | 32 | 8 | forward_gqa / forward_gqa_flash |
| Llama-3.2-1B | GQA (4:1) | 32 | 8 | forward_gqa / forward_gqa_flash |
| TinyLlama-1.1B | MQA (8:1) | 32 | 4 | forward_gqa / forward_gqa_flash |
| SmolLM2-135M | GQA (3:1) | 9 | 3 | forward_gqa / forward_gqa_flash |
Performance
Config: 8 heads, head_dim = 64, window = 128, block_size = 64, causal, log-stride + landmarks enabled.
| seq | x86-64 (AMD Ryzen) | Pi 5 Cortex-A76 | edge reduction vs dense |
|---|---|---|---|
| 512 | 13.1 ms | 85.8 ms | 2.2× |
| 1024 | 28.4 ms | 190.5 ms | 4.0× |
| 2048 | 60.1 ms | 401.0 ms | 7.7× |
| 4096 | 126.5 ms | 836.2 ms | 15.0× |
| 8192 | 262.6 ms | ~1660 ms (est.) | 29.3× |
KV cache memory at seq = 8192, 8 KV heads, dim = 128:
FP32 GQA: 2.1 GB FP16 GQA: 1.1 GB (52% smaller, feature = "fp16")
Sparse-edge reduction vs causal dense attention:
seq dense sparse reduction
8,192 33,558,528 1,146,498 29.3×
16,384 134,225,920 2,334,274 57.5×
32,768 536,887,296 4,742,658 113.2×
Feature flags
= { .., features = ["parallel", "fp16"] }
| Feature | Effect |
|---|---|
parallel |
Per-head loops via rayon, ~4× prefill speedup on multi-core |
fp16 |
KvCacheF16 — half-precision KV store; halves cache memory vs FP32 |
Default build pulls in zero runtime dependencies (ADR-183).
rand is dev-only.
How it works
The kernel is documented in 8 ADRs in the parent repo:
| Topic | ADR |
|---|---|
| Zero runtime dep footprint | ADR-183 |
| One-pass online softmax (~2× FLOPs cut) | ADR-184 |
| Non-causal landmark block-exclusion fix | ADR-185 |
| 25-test CI suite + 4-node cluster validation | ADR-186 |
| Overflow-checked tensor allocation | ADR-187 |
| Stamp-scheme cross-head dedup safety | ADR-188 |
| KV cache incremental decode | ADR-189 |
| GQA / MQA support | ADR-190 |
| Pi Zero 2W production hardening (proposed) | ADR-191 |
forward_flash implements a 3-phase tiled forward (window → out-of-window
log-stride/landmarks → normalise) so peak intermediate memory is proportional
to tile size rather than full sequence — matches forward to <1e-5.
FAQ
Can I use this with PyTorch / candle / burn?
Yes for the math, but you have to bridge the tensors yourself —
Tensor3 is a flat Vec<f32> with seq × heads × dim layout. There is no
adapter for ndarray, candle, burn, or tch. Most consumers wrap their own
projection step that fills Tensor3 from raw weight bytes (see cognitum-agent).
Does it support training? No. Forward-only. Backprop, optimizers, and gradient accumulation are out of scope.
Does it match the output of dense attention?
Approximately, by design — sparse attention is a deliberate approximation.
Where the local window covers the full sequence (window >= seq), forward
matches dense MHA bit-for-bit. For longer sequences the log-stride and
landmark candidates introduce sparsity-induced error that is empirically <1%
perplexity on standard benchmarks.
What's the difference between forward, forward_gated, and forward_flash?
forward— baseline sparse attention, O(N log N).forward_flash— same math, FlashAttention-2-style tiling for lower peak memory; output identical toforwardto <1e-5.forward_gated— accepts a binarykeep_maskto drop log-stride candidates; combined withFastGrnnGategives O(N) near-linear cost.
Can I run an LLM on a Pi Zero 2W with this?
Yes — production proven. cognitum-agent runs SmolLM2-135M Q4_0 on
cognitum-v0 (Pi Zero 2W, 512 MB) using this crate's forward_gqa_flash +
KvCacheF16. ~1.8 s per token warm.
What models / quantization formats are supported?
The kernel is format-agnostic — it operates on f32 Q/K/V tensors. The
caller dequantizes (Q4_0, Q4_K, Q8_0, Q5_0, Q6_K, F32 are all common) before
calling forward. cognitum-agent's sparse_llm_weights.rs is a working
GGUF dequant implementation.
Is there a Python binding? No, and there are no plans to add one. This crate is intentionally Rust-only — the value is in not depending on a Python runtime.
Tutorial
A hands-on walkthrough — building a streaming summariser on a Pi Zero 2W —
lives in docs/TUTORIAL.md, also published as a
GitHub Gist.
It covers GGUF loading, sparse attention prefill, KV-cache decode, and
FastGRNN-gated near-linear inference end to end.
Build and test
# Native test suite (38 tests including FastGRNN gate)
# With optional features
# Cross-compile for Pi 5 (Cortex-A76)
RUSTFLAGS="-C target-cpu=cortex-a76 -C target-feature=+lse,+rcpc,+fp16,+crc" \
CARGO_TARGET_AARCH64_UNKNOWN_LINUX_GNU_LINKER=aarch64-linux-gnu-gcc \
# Benchmark suite (criterion)
# Scaling demo for FastGRNN-gated near-linear path
License
MIT — see LICENSE at the repo root.
Keywords
rust llm inference · sparse attention rust · subquadratic attention ·
near-linear attention · fastgrnn gating · edge ai rust ·
raspberry pi llm · pi zero 2w llm · on-device inference ·
gguf rust · transformer rust · mistral llama smollm2 phi-2 ·
flashattention rust · gqa mqa rust · kv cache fp16