oxibonsai-model
Qwen3 Transformer implementation for 1-bit and ternary Bonsai inference.
Implements the full autoregressive forward pass for the Qwen3 architecture family
(Bonsai-8B/4B/1.7B in Q1_0_g128 and TernaryBonsai-8B/4B/1.7B in TQ2) — token
embedding, Grouped Query Attention with RoPE, SwiGLU MLP, RMSNorm, paged
KV-cache, and Metal/CUDA full-forward integration via oxibonsai-kernels.
Status: Stable — 673 tests passing (cargo nextest run -p oxibonsai-model)
Version: 0.1.4
Part of the OxiBonsai project.
Features
Core Transformer
BonsaiModel— full Qwen3 forward pass: token embedding → N transformer blocks → final RMSNorm → LM headTransformerBlock— attention sublayer + SwiGLU FFN sublayer with residual connections (block/)- RMSNorm, RoPE (base=1M), Grouped Query Attention (32 Q / 8 KV heads, head_dim=128)
- SwiGLU FFN (gate/up → SiLU × gate → down projection)
CausalMaskand sliding-window attention
Model Variants & Registry
ModelVariant::Bonsai{8B, 4B, 1_7B}— Q1_0_g128 1-bit weightsModelVariant::TernaryBonsai{8B, 4B, 1_7B}— TQ2 ternary weightsModelSpec,CapabilityProfile,all_specs()inmodel_variants.rs- Architecture auto-detection from GGUF metadata in
model_registry.rs Qwen3Config+ModelConfigBuilderfor custom configs
Weight Loading
- GGUF loader with tensor-name mapping (
gguf_loader.rs,convert/name_map.rs) - Q1 loader path via
oxibonsai-kernelsblocks LinearTernarylayer +load_ternary_blocks+load_ternary_embedding+OutputWeight::Ternary(TQ2)- Safetensors loading support
KV Cache
- Standard
KvCachewith position-indexed storage PagedKvCache— vLLM-style block-based cache for high utilizationKvCacheFp16— K/V stored in f16 (halves cache memory)KvCacheQuant— Q8/Q4 quantized KV cache
Advanced Attention
- Flash attention / fused kernel (
attention_fused.rs) - Flash decoding (
flash_decode.rs) - Attention sink (
attention_sink.rs) - Cross-attention (
cross_attention.rs) - Sparse attention: local window, BigBird, Longformer, dilated (
sparse_attention.rs) - ALiBi and YaRN RoPE variants
- RoPE scaling: YaRN, linear, DynamicNTK, LLaMA 3.1, LongRoPE
Training & Fine-tuning Utilities
- LoRA + LoRA trainer (
lora.rs,lora_trainer.rs) - Mixture-of-Experts:
MoeRouter,MoeExpert, Mixture-of-Depths - Optimizers, LR schedulers, losses, gradient / gradient checkpointing
- Pruning, calibration
Quantization & Export
- Dynamic quantization: DynamicQ8_0, DynamicQ4_0, DynamicQ4_1
quantize_int8.rs,quantize_ternary.rsexportersExportFormat::TernaryG128inexport.rs- Checkpoint save/load (OXCK binary format)
Model Conversion
- ONNX MatMulNBits (bits=2) ingestion via
oxibonsai convert --onnx— reads onnx-community Ternary releases and repacks as GGUF TQ2_0_g128 - Qwen3 ONNX tensor role mapping: automatically maps ONNX node names to the Qwen3 weight layout (embedding, QKV projections, gate/up/down FFN, RMSNorm scales)
Scaling & Inference
- Tensor parallelism, pipeline parallelism, multi-GPU utilities
- Chunked prefill, prefix cache, disk cache
- Weight tying (
TiedEmbedding) for embedding/LM head sharing - Model merging: SLERP, TIES, DARE, task vector
- Speculative draft model support
- Compression utilities
Feature Flags
| Flag | Description |
|---|---|
wasm |
WASM-safe build |
metal |
Metal GPU backend (macOS) — full-forward integration with oxibonsai-kernels |
native-cuda |
CUDA GPU backend (NVIDIA) |
Usage
[]
= "0.1.4"
License
Apache-2.0 — COOLJAPAN OU