# BitMamba
A 1.58-bit Mamba language model with an infinite context window, implemented in Rust.
## Features
- **Infinite Context Window** - Mamba's SSM maintains a fixed-size state regardless of sequence length
- **1.58-bit Weights** - BitNet-style quantization for efficient inference
- **CPU Inference** - No GPU required
- **OpenAI-Compatible API** - Drop-in replacement for the OpenAI API; works with Cline, Continue, etc.
- **Streaming Support** - Server-Sent Events for real-time token generation
## Installation
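If the crate is published to crates.io (the crate name here is an assumption), installation is the usual one-liner:

```bash
# Crate name assumed; check crates.io or the repository
cargo install bitmamba
```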
Or build from source:
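Assuming the source lives in a GitHub repository matching the Hugging Face namespace (URL not confirmed here):

```bash
# Repository URL assumed from the Hugging Face namespace
git clone https://github.com/rileyseaburg/bitmamba
cd bitmamba
cargo build --release
```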
## Usage

### CLI
```bash
# Run inference directly (exact flags not shown here; consult the binary's help output)
cargo run --release
```
### OpenAI-Compatible Server
```bash
# Start the server (the `serve` subcommand is an assumption; listens on port 8000)
cargo run --release -- serve
```
The server runs at `http://localhost:8000` with these endpoints:
| Endpoint | Method | Description |
|---|---|---|
| `/v1/models` | GET | List available models |
| `/v1/chat/completions` | POST | Chat completions (streaming supported) |
| `/v1/completions` | POST | Text completions |
| `/health` | GET | Health check |
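Because the API mirrors OpenAI's, any OpenAI-style client or plain `curl` works; the sketch below assumes the model id is the Hugging Face name:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "rileyseaburg/bitmamba-student",
    "messages": [{"role": "user", "content": "Write hello world in Rust."}],
    "stream": false
  }'
```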
### Configure with Cline/Continue
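Point the extension's OpenAI-compatible provider at the local server. For Continue, a minimal `config.json` entry might look like this (field names follow Continue's OpenAI provider schema; verify against your version):

```json
{
  "models": [
    {
      "title": "BitMamba (local)",
      "provider": "openai",
      "model": "rileyseaburg/bitmamba-student",
      "apiBase": "http://localhost:8000/v1"
    }
  ]
}
```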
### As a Library
```rust
use bitmamba::BitMamba; // import path assumed; the original snippet was truncated
```
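A fuller sketch of library use, assuming a `from_pretrained`-style constructor; every type and method name below is hypothetical, so check the crate's documentation for the real API:

```rust
// Hypothetical API sketch: names are assumptions, not the crate's verified API.
use bitmamba::{BitMamba, Tokenizer};

fn main() {
    let model = BitMamba::from_pretrained("rileyseaburg/bitmamba-student");
    let tokenizer = Tokenizer::from_pretrained("rileyseaburg/bitmamba-student");

    let prompt = tokenizer.encode("fn main() {");
    let output = model.generate(&prompt, 64); // generate up to 64 new tokens
    println!("{}", tokenizer.decode(&output));
}
```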
## Model

The default model is [`rileyseaburg/bitmamba-student`](https://huggingface.co/rileyseaburg/bitmamba-student) on Hugging Face, a 278M-parameter BitMamba model distilled from Qwen2.5-Coder-1.5B.
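To fetch the weights locally with the standard Hugging Face CLI:

```bash
huggingface-cli download rileyseaburg/bitmamba-student
```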
### Architecture
- Hidden Size: 768
- Layers: 12 BitMamba blocks
- State Size: 16 (SSM state dimension)
- Expand Factor: 2
- Vocab Size: 151,665 (Qwen tokenizer)
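The same numbers, collected as a config struct for reference (field names are illustrative, not the crate's actual API):

```rust
// Illustrative only; field names are assumptions, not the crate's API.
struct BitMambaConfig {
    hidden_size: usize, // 768
    n_layers: usize,    // 12 BitMamba blocks
    d_state: usize,     // 16 (SSM state dimension)
    expand: usize,      // 2 (inner width = expand * hidden_size = 1536)
    vocab_size: usize,  // 151_665 (Qwen tokenizer)
}
```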
### BitMamba Block

```text
Input -> RMSNorm -> BitLinear (in_proj) -> Conv1d -> SiLU -> SSM Scan -> Gate -> BitLinear (out_proj) -> Residual
```
The SSM Scan is the key component that enables infinite context:
```text
// Fixed-size state, O(1) memory per token
h = dA * h + dB * x   // state update
y = h @ C + D * x     // output
```
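A minimal sketch of that recurrence in Rust for a single channel (shapes simplified; the real scan runs across all channels, with input-dependent dA, dB, and C):

```rust
/// One step of the SSM recurrence for a single channel.
/// The state `h` has a fixed size, so memory per token is O(1)
/// no matter how long the sequence grows.
fn ssm_step(h: &mut [f32], da: &[f32], db: &[f32], c: &[f32], d: f32, x: f32) -> f32 {
    let mut y = 0.0f32;
    for i in 0..h.len() {
        h[i] = da[i] * h[i] + db[i] * x; // state update: h = dA * h + dB * x
        y += h[i] * c[i];                // readout:      y = h @ C
    }
    y + d * x                            // skip connection: + D * x
}
```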
## Performance
| Metric | Value |
|---|---|
| Parameters | 278M |
| Memory (inference) | ~1.1 GB |
| Context Window | Unlimited |
| Quantization | 1.58-bit weights |
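The "1.58-bit" figure is log2(3): each weight stores one of three values {-1, 0, +1}. Below is a sketch of the absmean quantizer from the BitNet b1.58 paper; the exact scheme this crate uses may differ:

```rust
/// BitNet b1.58-style "absmean" quantization: scale by the mean absolute
/// weight, then round and clip each weight to {-1, 0, +1}.
/// Sketch only; the crate's actual quantizer may differ.
fn quantize_ternary(w: &[f32]) -> (Vec<i8>, f32) {
    let scale = w.iter().map(|v| v.abs()).sum::<f32>() / w.len() as f32;
    let q: Vec<i8> = w
        .iter()
        .map(|v| (v / (scale + 1e-6)).round().clamp(-1.0, 1.0) as i8)
        .collect();
    (q, scale) // dequantize as: w ≈ (q as f32) * scale
}
```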
## Citation

If you use BitMamba in your research, please cite this repository.
## Related Work

- [Mamba: Linear-Time Sequence Modeling with Selective State Spaces](https://arxiv.org/abs/2312.00752)
- [BitNet: Scaling 1-bit Transformers for Large Language Models](https://arxiv.org/abs/2310.11453)
- [The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits](https://arxiv.org/abs/2402.17764)
## License

MIT License - see [LICENSE](LICENSE) for details.