# BitMamba

A 1.58-bit Mamba language model with **infinite context window**, implemented in Rust.

[![Crates.io](https://img.shields.io/crates/v/bitmamba.svg)](https://crates.io/crates/bitmamba)
[![Documentation](https://docs.rs/bitmamba/badge.svg)](https://docs.rs/bitmamba)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

## Features

- **Infinite Context Window** - Mamba's SSM maintains fixed-size state regardless of sequence length
- **1.58-bit Weights** - BitNet-style quantization for efficient inference  
- **CPU Inference** - No GPU required
- **OpenAI-Compatible API** - Drop-in replacement for OpenAI API, works with Cline, Continue, etc.
- **Streaming Support** - Server-Sent Events for real-time token generation

## Installation

```bash
cargo install bitmamba
```

Or build from source:

```bash
git clone https://github.com/rileyseaburg/bitmamba
cd bitmamba
cargo build --release
```

## Usage

### CLI

```bash
# Run inference directly
bitmamba
```

### OpenAI-Compatible Server

```bash
# Start the server
bitmamba-server
```

The server runs at `http://localhost:8000` with these endpoints:

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/v1/models` | GET | List available models |
| `/v1/chat/completions` | POST | Chat completions (streaming supported) |
| `/v1/completions` | POST | Text completions |
| `/health` | GET | Health check |
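
Requests and responses follow the OpenAI chat completions schema, so existing OpenAI clients should work unchanged. A minimal `curl` sketch, assuming the server honors the standard `model`, `messages`, `max_tokens`, and `temperature` fields:

```bash
# Ask the local server for a chat completion
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "bitmamba-student",
    "messages": [{"role": "user", "content": "Write a Fibonacci function in Python."}],
    "max_tokens": 128,
    "temperature": 0.7
  }'
```

Set `"stream": true` in the body to receive tokens as Server-Sent Events instead of a single JSON response.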

### Configure with Cline/Continue

```json
{
  "apiProvider": "openai-compatible",
  "baseUrl": "http://localhost:8000/v1",
  "model": "bitmamba-student"
}
```

### As a Library

```rust
fn main() -> anyhow::Result<()> {
    // Load the default model weights and tokenizer
    let (model, tokenizer) = bitmamba::load()?;

    let prompt = "def fibonacci(n):";
    let tokens = tokenizer.encode(prompt, true)?;
    // generate(token_ids, max_new_tokens, temperature)
    let output = model.generate(tokens.get_ids(), 50, 0.7)?;

    println!("{}", tokenizer.decode(&output, true)?);
    Ok(())
}
```

## Model

The default model is [`rileyseaburg/bitmamba-student`](https://huggingface.co/rileyseaburg/bitmamba-student) on Hugging Face, a 278M parameter BitMamba model distilled from Qwen2.5-Coder-1.5B.

### Architecture

- **Hidden Size**: 768
- **Layers**: 12 BitMamba blocks
- **State Size**: 16 (SSM state dimension)
- **Expand Factor**: 2
- **Vocab Size**: 151,665 (Qwen tokenizer)

### BitMamba Block

```
Input -> RMSNorm -> BitLinear (in_proj) -> Conv1d -> SiLU -> SSM Scan -> Gate -> BitLinear (out_proj) -> Residual
```
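
Each **BitLinear** layer stores its weights as ternary values in {-1, 0, +1} (log2 3 ≈ 1.58 bits per weight). A minimal sketch of BitNet b1.58-style absmean quantization, assuming a single per-tensor scale; the function names are illustrative and the crate's actual kernels may differ:

```rust
/// Quantize weights to {-1, 0, +1} with one per-tensor scale
/// (absmean scheme: w_q = round_clip(w / mean(|w|), -1, 1)).
/// Illustrative sketch, not the crate's actual kernel.
fn quantize_ternary(weights: &[f32]) -> (Vec<i8>, f32) {
    // Scale = mean absolute value; epsilon avoids divide-by-zero.
    let scale = weights.iter().map(|w| w.abs()).sum::<f32>() / weights.len() as f32 + 1e-6;
    let q = weights
        .iter()
        .map(|w| (w / scale).round().clamp(-1.0, 1.0) as i8)
        .collect();
    (q, scale)
}

/// Matrix-vector product with ternary weights: only additions,
/// subtractions, and one final rescale; no multiplies in the inner loop.
fn bitlinear_matvec(q: &[i8], scale: f32, x: &[f32], rows: usize) -> Vec<f32> {
    let cols = x.len();
    (0..rows)
        .map(|r| {
            let mut acc = 0.0f32;
            for c in 0..cols {
                match q[r * cols + c] {
                    1 => acc += x[c],
                    -1 => acc -= x[c],
                    _ => {}
                }
            }
            acc * scale
        })
        .collect()
}
```

Replacing multiplications with additions in the inner loop is what makes 1.58-bit weights cheap enough for CPU inference.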

The **SSM Scan** is the key component that enables infinite context:

```
// Fixed-size state h: O(1) memory per token
h = dA * h + dB * x   // state update
y = C · h + D * x     // output
```
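
A runnable Rust sketch of that recurrence for a single channel. For brevity it assumes the discretized scalars `dA` and `dB` are precomputed per step and `C` is fixed (in Mamba proper, `B` and `C` are input-dependent); the function name is illustrative:

```rust
/// Selective-scan recurrence for one channel. The state `h` has a fixed
/// d_state = 16 entries and never grows, however many tokens are processed.
fn ssm_scan(xs: &[f32], da: &[f32], db: &[f32], c: &[f32; 16], d: f32) -> Vec<f32> {
    let mut h = [0.0f32; 16]; // fixed-size SSM state
    xs.iter()
        .zip(da.iter().zip(db))
        .map(|(&x, (&da_t, &db_t))| {
            let mut y = d * x; // skip term D * x
            for (h_i, &c_i) in h.iter_mut().zip(c) {
                *h_i = da_t * *h_i + db_t * x; // h = dA * h + dB * x
                y += c_i * *h_i;               // y += C · h
            }
            y
        })
        .collect()
}
```

The state buffer is the model's entire memory of the sequence, which is why context length is unbounded: token 1,000,000 costs exactly as much to process as token 1.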

## Performance

| Metric | Value |
|--------|-------|
| Parameters | 278M |
| Memory (inference) | ~1.1 GB |
| Context Window | Unlimited |
| Quantization | 1.58-bit weights |
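
The unlimited context window follows from the architecture numbers above: assuming an f32 state and d_inner = 2 × 768 = 1536, the total recurrent state is about 12 × 1536 × 16 × 4 bytes ≈ 1.2 MB, and it stays that size at any sequence length, whereas a transformer's KV cache grows linearly with context.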

## Citation

If you use BitMamba in your research, please cite:

```bibtex
@software{bitmamba2024,
  author = {Seaburg, Riley},
  title = {BitMamba: 1.58-bit Mamba with Infinite Context},
  year = {2024},
  url = {https://github.com/rileyseaburg/bitmamba}
}
```

## Related Work

- [Mamba: Linear-Time Sequence Modeling with Selective State Spaces](https://arxiv.org/abs/2312.00752)
- [BitNet: Scaling 1-bit Transformers for Large Language Models](https://arxiv.org/abs/2310.11453)
- [The Era of 1-bit LLMs](https://arxiv.org/abs/2402.17764)

## License

MIT License - see [LICENSE](LICENSE) for details.