# BitMamba
A 1.58-bit Mamba language model with an infinite context window, implemented in Rust.
## Features
- **Infinite Context Window** - Mamba's SSM maintains a fixed-size state regardless of sequence length
- **1.58-bit Weights** - BitNet-style quantization for efficient inference
- **CPU Inference** - No GPU required
- **OpenAI-Compatible API** - Drop-in replacement for the OpenAI API; works with Cline, Continue, etc.
- **Streaming Support** - Server-Sent Events for real-time token generation
## Installation
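If the crate is published to crates.io (the crate name here is an assumption), installation is the usual one-liner:

```bash
# Crate name assumed; check crates.io or the repository
cargo install bitmamba
```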
Or build from source:
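Assuming the source lives in a GitHub repository matching the Hugging Face namespace (URL not confirmed here):

```bash
# Repository URL assumed from the Hugging Face namespace
git clone https://github.com/rileyseaburg/bitmamba
cd bitmamba
cargo build --release
```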
## Usage

### CLI
```bash
# Run inference directly (exact flags not shown here; consult the binary's help output)
cargo run --release
```
### OpenAI-Compatible Server
```bash
# Start the server (the `serve` subcommand is an assumption; listens on port 8000)
cargo run --release -- serve
```
The server runs at `http://localhost:8000` with these endpoints:
| Endpoint | Method | Description |
|---|---|---|
| `/v1/models` | GET | List available models |
| `/v1/chat/completions` | POST | Chat completions (streaming supported) |
| `/v1/completions` | POST | Text completions |
| `/health` | GET | Health check |
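Because the API mirrors OpenAI's, any OpenAI-style client or plain `curl` works; the sketch below assumes the model id is the Hugging Face name:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "rileyseaburg/bitmamba-student",
    "messages": [{"role": "user", "content": "Write hello world in Rust."}],
    "stream": false
  }'
```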
### Configure with Cline/Continue
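Point the extension's OpenAI-compatible provider at the local server. For Continue, a minimal `config.json` entry might look like this (field names follow Continue's OpenAI provider schema; verify against your version):

```json
{
  "models": [
    {
      "title": "BitMamba (local)",
      "provider": "openai",
      "model": "rileyseaburg/bitmamba-student",
      "apiBase": "http://localhost:8000/v1"
    }
  ]
}
```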
### As a Library
```rust
use bitmamba::BitMamba; // import path assumed; the original snippet was truncated
```
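A fuller sketch of library use, assuming a `from_pretrained`-style constructor; every type and method name below is hypothetical, so check the crate's documentation for the real API:

```rust
// Hypothetical API sketch: names are assumptions, not the crate's verified API.
use bitmamba::{BitMamba, Tokenizer};

fn main() {
    let model = BitMamba::from_pretrained("rileyseaburg/bitmamba-student");
    let tokenizer = Tokenizer::from_pretrained("rileyseaburg/bitmamba-student");

    let prompt = tokenizer.encode("fn main() {");
    let output = model.generate(&prompt, 64); // generate up to 64 new tokens
    println!("{}", tokenizer.decode(&output));
}
```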
## Model

The default model is [`rileyseaburg/bitmamba-student`](https://huggingface.co/rileyseaburg/bitmamba-student) on Hugging Face, a 278M-parameter BitMamba model distilled from Qwen2.5-Coder-1.5B.
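To fetch the weights locally with the standard Hugging Face CLI:

```bash
huggingface-cli download rileyseaburg/bitmamba-student
```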
### Architecture
- Hidden Size: 768
- Layers: 12 BitMamba blocks
- State Size: 16 (SSM state dimension)
- Expand Factor: 2
- Vocab Size: 151,665 (Qwen tokenizer)
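The same numbers, collected as a config struct for reference (field names are illustrative, not the crate's actual API):

```rust
// Illustrative only; field names are assumptions, not the crate's API.
struct BitMambaConfig {
    hidden_size: usize, // 768
    n_layers: usize,    // 12 BitMamba blocks
    d_state: usize,     // 16 (SSM state dimension)
    expand: usize,      // 2 (inner width = expand * hidden_size = 1536)
    vocab_size: usize,  // 151_665 (Qwen tokenizer)
}
```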
### BitMamba Block

```text
Input -> RMSNorm -> BitLinear (in_proj) -> Conv1d -> SiLU -> SSM Scan -> Gate -> BitLinear (out_proj) -> Residual
```
The SSM Scan is the key component that enables infinite context:
```text
// Fixed-size state, O(1) memory per token
h = dA * h + dB * x   // state update
y = h @ C + D * x     // output
```
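A minimal sketch of that recurrence in Rust for a single channel (shapes simplified; the real scan runs across all channels, with input-dependent dA, dB, and C):

```rust
/// One step of the SSM recurrence for a single channel.
/// The state `h` has a fixed size, so memory per token is O(1)
/// no matter how long the sequence grows.
fn ssm_step(h: &mut [f32], da: &[f32], db: &[f32], c: &[f32], d: f32, x: f32) -> f32 {
    let mut y = 0.0f32;
    for i in 0..h.len() {
        h[i] = da[i] * h[i] + db[i] * x; // state update: h = dA * h + dB * x
        y += h[i] * c[i];                // readout:      y = h @ C
    }
    y + d * x                            // skip connection: + D * x
}
```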
## Performance
| Metric | Value |
|---|---|
| Parameters | 278M |
| Memory (inference) | ~1.1 GB |
| Context Window | Unlimited |
| Quantization | 1.58-bit weights |
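The "1.58-bit" figure is log2(3): each weight stores one of three values {-1, 0, +1}. Below is a sketch of the absmean quantizer from the BitNet b1.58 paper; the exact scheme this crate uses may differ:

```rust
/// BitNet b1.58-style "absmean" quantization: scale by the mean absolute
/// weight, then round and clip each weight to {-1, 0, +1}.
/// Sketch only; the crate's actual quantizer may differ.
fn quantize_ternary(w: &[f32]) -> (Vec<i8>, f32) {
    let scale = w.iter().map(|v| v.abs()).sum::<f32>() / w.len() as f32;
    let q: Vec<i8> = w
        .iter()
        .map(|v| (v / (scale + 1e-6)).round().clamp(-1.0, 1.0) as i8)
        .collect();
    (q, scale) // dequantize as: w ≈ (q as f32) * scale
}
```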
## Citation

If you use BitMamba in your research, please cite this repository.
## Related Work

- [Mamba: Linear-Time Sequence Modeling with Selective State Spaces](https://arxiv.org/abs/2312.00752)
- [BitNet: Scaling 1-bit Transformers for Large Language Models](https://arxiv.org/abs/2310.11453)
- [The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits](https://arxiv.org/abs/2402.17764)
## License

MIT License - see [LICENSE](LICENSE) for details.