oxbitnet 0.5.2

Run BitNet b1.58 ternary LLMs with wgpu

Part of 0xBitNet — also available as 0xbitnet (npm) and oxbitnet (Python).

Quick Start

use oxbitnet::BitNet;
use futures::StreamExt;
use std::io::Write as _;

#[tokio::main]
async fn main() -> oxbitnet::Result<()> {
    // Load a GGUF model with the default load options.
    let mut model = BitNet::load("model.gguf", Default::default()).await?;

    let options = oxbitnet::GenerateOptions {
        max_tokens: 256,
        temperature: 0.7,
        top_k: 40,
        repeat_penalty: 1.1,
        ..Default::default()
    };

    let mut stream = model.generate_chat(
        &[
            oxbitnet::ChatMessage { role: "system".into(), content: "You are a helpful assistant.".into() },
            oxbitnet::ChatMessage { role: "user".into(), content: "Hello!".into() },
        ],
        options,
    );

    // Print tokens as they arrive; flush so output streams immediately.
    while let Some(token) = stream.next().await {
        print!("{token}");
        std::io::stdout().flush().ok();
    }

    // Release GPU buffers and other resources.
    model.dispose();
    Ok(())
}
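For readers unfamiliar with the sampling parameters above, here is a minimal, illustrative sketch of what `temperature`, `top_k`, and `repeat_penalty` typically control. This is not oxbitnet's implementation; the `sample` function and its greedy final pick are simplifications for demonstration (a real sampler draws from the softmax of the surviving candidates and treats negative logits differently under the penalty).

```rust
// Illustration only: a simplified top-k + temperature + repeat-penalty step.
fn sample(logits: &[f32], prev: &[usize], temperature: f32, top_k: usize, repeat_penalty: f32) -> usize {
    let mut scored: Vec<(usize, f32)> = logits
        .iter()
        .enumerate()
        .map(|(i, &l)| {
            // Penalize tokens already generated (repeat_penalty > 1.0 lowers their score).
            let l = if prev.contains(&i) { l / repeat_penalty } else { l };
            // Temperature rescales logits: < 1.0 sharpens, > 1.0 flattens the distribution.
            (i, l / temperature)
        })
        .collect();

    // Keep only the top_k highest-scoring candidates.
    scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    scored.truncate(top_k);

    // Greedy pick for determinism here; real samplers draw randomly from the survivors.
    scored[0].0
}

fn main() {
    let logits = [1.0, 3.0, 2.0];
    // Token 1 has the highest logit but was just emitted, so the penalty demotes it.
    println!("{}", sample(&logits, &[1], 0.7, 2, 10.0)); // prints 2
}
```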

Features

  • Native wgpu — Vulkan, Metal, and DX12 backends selected automatically
  • GGUF loading — Handles I2_S ternary packing (Microsoft BitNet fork)
  • Streaming — Token-by-token via impl Stream<Item = String>
  • Chat templates — Built-in LLaMA 3 chat formatting
  • Disk caching — Models cached at ~/.cache/.0xbitnet/
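To give a feel for the I2_S feature above: BitNet b1.58 constrains weights to {-1, 0, +1}, so each weight needs only two bits and four weights fit in one byte. The sketch below shows one plausible unpacking scheme; the bit order (LSB-first) and the code mapping (0, 1, 2 → -1, 0, +1) are assumptions for illustration, not the authoritative I2_S layout.

```rust
// Sketch: unpack four ternary weights from one byte.
// Bit order and code mapping are assumed, not taken from the GGUF spec.
fn unpack4(byte: u8) -> [i8; 4] {
    let mut out = [0i8; 4];
    for (i, w) in out.iter_mut().enumerate() {
        let code = (byte >> (2 * i)) & 0b11; // two bits per weight, LSB-first
        *w = code as i8 - 1;                 // assumed mapping: 0,1,2 -> -1,0,+1
    }
    out
}

fn main() {
    // 0x86 = 0b10_00_01_10 -> fields (LSB-first) 2,1,0,2 -> [+1, 0, -1, +1]
    println!("{:?}", unpack4(0x86)); // prints [1, 0, -1, 1]
}
```

This 4x packing (plus a per-block scale) is what lets ternary models fit in a fraction of the memory of fp16 weights.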

Example

Interactive chat CLI:

cargo run --example chat --release

Point it at a specific model file or URL with --url:

cargo run --example chat --release -- --url /path/to/model.gguf

License

MIT