# BitMamba
A 1.58-bit Mamba language model with an **infinite context window**, implemented in Rust.
[![Crates.io](https://img.shields.io/crates/v/bitmamba.svg)](https://crates.io/crates/bitmamba)
[![Documentation](https://docs.rs/bitmamba/badge.svg)](https://docs.rs/bitmamba)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
## Features
- **Infinite Context Window** - Mamba's SSM maintains a fixed-size state regardless of sequence length
- **1.58-bit Weights** - BitNet-style ternary quantization for efficient inference (see the sketch after this list)
- **CPU Inference** - No GPU required
- **OpenAI-Compatible API** - Drop-in replacement for OpenAI API, works with Cline, Continue, etc.
- **Streaming Support** - Server-Sent Events for real-time token generation
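For intuition about the 1.58-bit weights, here is a minimal sketch of BitNet-style "absmean" ternary quantization, which rounds each weight to {-1, 0, +1} with a per-tensor scale as described in the BitNet b1.58 paper. It is illustrative only; the function name is hypothetical and bitmamba's actual scheme may differ in detail.

```rust
/// Illustrative absmean ternary quantization (after BitNet b1.58);
/// the function name and layout are hypothetical, not bitmamba's API.
fn quantize_ternary(weights: &[f32]) -> (Vec<i8>, f32) {
    // Per-tensor scale: mean absolute value of the weights.
    let scale = weights.iter().map(|w| w.abs()).sum::<f32>() / weights.len() as f32;
    let quantized = weights
        .iter()
        .map(|&w| (w / (scale + 1e-6)).round().clamp(-1.0, 1.0) as i8)
        .collect();
    (quantized, scale) // dequantize as `q as f32 * scale`
}
```

Each ternary weight carries log2(3) ≈ 1.58 bits of information, which is where the name comes from.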
## Installation
```bash
cargo install bitmamba
```
Or build from source:
```bash
git clone https://github.com/rileyseaburg/bitmamba
cd bitmamba
cargo build --release
```
## Usage
### CLI
```bash
# Run inference directly
bitmamba
```
### OpenAI-Compatible Server
```bash
# Start the server
bitmamba-server
```
The server runs at `http://localhost:8000` with these endpoints:
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/v1/models` | GET | List available models |
| `/v1/chat/completions` | POST | Chat completions (streaming supported) |
| `/v1/completions` | POST | Text completions |
| `/health` | GET | Health check |
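As a quick smoke test, the sketch below POSTs a chat completion using the `reqwest` and `serde_json` crates (with reqwest's `blocking` and `json` features enabled); these crates are assumptions for the example, not bitmamba dependencies. The request shape follows the OpenAI API, and the model name matches the config shown in the next section.

```rust
use serde_json::json;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let body = json!({
        "model": "bitmamba-student",
        "messages": [{ "role": "user", "content": "Write a haiku about Rust." }]
    });
    // POST to the chat completions endpoint and print the raw JSON response.
    let response = reqwest::blocking::Client::new()
        .post("http://localhost:8000/v1/chat/completions")
        .json(&body)
        .send()?
        .text()?;
    println!("{response}");
    Ok(())
}
```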
### Configure with Cline/Continue
```json
{
  "apiProvider": "openai-compatible",
  "baseUrl": "http://localhost:8000/v1",
  "model": "bitmamba-student"
}
```
### As a Library
```rust
fn main() -> anyhow::Result<()> {
    // Load the default model and tokenizer.
    let (model, tokenizer) = bitmamba::load()?;

    let prompt = "def fibonacci(n):";
    // `true` asks the tokenizer to add its special tokens.
    let tokens = tokenizer.encode(prompt, true)?;

    // Generate up to 50 new tokens at temperature 0.7.
    let output = model.generate(tokens.get_ids(), 50, 0.7)?;
    println!("{}", tokenizer.decode(&output, true)?);
    Ok(())
}
```
## Model
The default model is [`rileyseaburg/bitmamba-student`](https://huggingface.co/rileyseaburg/bitmamba-student) on Hugging Face, a 278M-parameter BitMamba model distilled from Qwen2.5-Coder-1.5B.
### Architecture
- **Hidden Size**: 768
- **Layers**: 12 BitMamba blocks
- **State Size**: 16 (SSM state dimension)
- **Expand Factor**: 2
- **Vocab Size**: 151,665 (Qwen tokenizer)
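As a rough mental model, these hyperparameters might be grouped as below; the struct and field names are hypothetical, not bitmamba's actual types.

```rust
// Hypothetical summary of the hyperparameters above; not bitmamba's actual types.
struct BitMambaConfig {
    hidden_size: usize, // 768
    n_layers: usize,    // 12 BitMamba blocks
    d_state: usize,     // 16 (SSM state dimension)
    expand: usize,      // 2 => inner projection width = 2 * 768 = 1536
    vocab_size: usize,  // 151_665 (Qwen tokenizer)
}
```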
### BitMamba Block
```
Input -> RMSNorm -> BitLinear (in_proj) -> Conv1d -> SiLU -> SSM Scan -> Gate -> BitLinear (out_proj) -> Residual
```
The **SSM Scan** is the key component that enables infinite context:
```text
// Fixed-size state, O(1) memory per token
h = dA * h + dB * x   // state update
y = h @ C + D * x     // output
```
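To make the recurrence concrete, here is a runnable single-channel sketch (diagonal A, with `dA`, `dB`, `C`, `D` reduced to scalars); the real kernel vectorizes this over channels and state dimensions, and the function name is illustrative.

```rust
/// Single-channel selective-scan sketch; illustrative, not the crate's kernel.
/// The hidden state `h` is a single scalar here, so memory stays O(1) per
/// token no matter how long `xs` is.
fn ssm_scan(xs: &[f32], d_a: f32, d_b: f32, c: f32, d: f32) -> Vec<f32> {
    let mut h = 0.0_f32; // fixed-size state
    xs.iter()
        .map(|&x| {
            h = d_a * h + d_b * x; // state update
            c * h + d * x          // output
        })
        .collect()
}
```

Because nothing in the loop grows with sequence length, the context window is bounded only by how much information the fixed-size state can carry.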
## Performance
| Metric | Value |
|--------|-------|
| Parameters | 278M |
| Memory (inference) | ~1.1 GB |
| Context Window | Unlimited (fixed-size state) |
| Quantization | 1.58-bit weights |
## Citation
If you use BitMamba in your research, please cite:
```bibtex
@software{bitmamba2024,
  author = {Seaburg, Riley},
  title = {BitMamba: 1.58-bit Mamba with Infinite Context},
  year = {2024},
  url = {https://github.com/rileyseaburg/bitmamba}
}
```
## Related Work
- [Mamba: Linear-Time Sequence Modeling with Selective State Spaces](https://arxiv.org/abs/2312.00752)
- [BitNet: Scaling 1-bit Transformers for Large Language Models](https://arxiv.org/abs/2310.11453)
- [The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits](https://arxiv.org/abs/2402.17764)
## License
MIT License - see [LICENSE](LICENSE) for details.