candelabra 0.1.2

Desktop-friendly GGUF LLaMA inference wrapper for Candle and Hugging Face Hub

candelabra

candelabra is a small Rust crate for desktop applications that want to run quantized GGUF models (LLaMA, Qwen, Phi, Gemma, etc.) with candle-core, candle-transformers, and hf-hub.

It focuses on the pieces GUI apps usually need:

  • Hugging Face downloads that respect the local hf-hub cache
  • tokenizer loading helpers
  • automatic Metal or CUDA fallback to CPU
  • reusable loaded model state
  • token streaming with cancellation support

Current Scope

candelabra natively supports quantized GGUF checkpoints with dynamic architecture detection. Supported architectures include:

  • llama / mistral / mixtral / gemma / gemma2
  • phi3
  • qwen2 (Qwen 2, Qwen 2.5, QwQ)
  • qwen3 / qwen3moe
  • gemma3
  • glm4

That means the crate is a good fit if you want a lightweight Rust API for local desktop inference on models such as Qwen 2.5 or SmolLM GGUF variants. It abstracts the various candle_transformers::models paths behind a single unified Model type.
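As a rough sketch, architecture detection amounts to mapping the GGUF architecture string (e.g. the general.architecture metadata key) onto a model family. The enum and function names below are hypothetical illustrations, not the crate's actual internals:

```rust
// Hypothetical model-family enum; the crate's real dispatch targets the
// corresponding candle_transformers::models implementations.
enum Arch {
    Llama, // also covers mistral / mixtral / gemma / gemma2
    Phi3,
    Qwen2,
    Qwen3,
    Qwen3Moe,
    Gemma3,
    Glm4,
}

// Map a GGUF "general.architecture" string to a supported family.
fn detect(arch: &str) -> Option<Arch> {
    match arch {
        "llama" | "mistral" | "mixtral" | "gemma" | "gemma2" => Some(Arch::Llama),
        "phi3" => Some(Arch::Phi3),
        "qwen2" => Some(Arch::Qwen2),
        "qwen3" => Some(Arch::Qwen3),
        "qwen3moe" => Some(Arch::Qwen3Moe),
        "gemma3" => Some(Arch::Gemma3),
        "glm4" => Some(Arch::Glm4),
        _ => None,
    }
}

fn main() {
    println!("qwen2 supported: {}", detect("qwen2").is_some());
}
```

An unrecognized string maps to None, which is where a loader would report an unsupported-architecture error rather than guessing.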

Installation

Add the crate to your Cargo.toml:

[dependencies]
candelabra = "0.1"

Quick Start

use candelabra::{
    download_model,
    load_tokenizer_from_repo,
    run_inference,
    InferenceConfig,
    Model,
};
use std::sync::{
    Arc,
    atomic::AtomicBool,
};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let model_path = download_model(
        "Qwen/Qwen2.5-0.5B-Instruct-GGUF",
        "qwen2.5-0.5b-instruct-q4_k_m.gguf",
    )?;
    let tokenizer = load_tokenizer_from_repo("Qwen/Qwen2.5-0.5B-Instruct")?;
    let mut model = Model::load(&model_path)?;
    let cancel_token = Arc::new(AtomicBool::new(false));

    let mut config = InferenceConfig::default();
    config.prompt = "<|im_start|>user\nTell me a story about a helpful robot.<|im_end|>\n<|im_start|>assistant\n".to_string();

    let result = run_inference(
        &mut model,
        &tokenizer,
        &config,
        cancel_token,
        |token| {
            print!("{token}");
            Ok(())
        },
    )?;

    println!("\n{:.2} tokens/s", result.tokens_per_second);
    Ok(())
}
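Because run_inference takes an Arc<AtomicBool>, a GUI can cancel generation from another thread by flipping the flag. The sketch below shows the pattern with a stand-in loop in place of real inference (fake_inference is an illustrative helper, not part of the crate's API):

```rust
use std::sync::{Arc, atomic::{AtomicBool, Ordering}};
use std::thread;
use std::time::Duration;

// Stand-in for the inference loop: emits "tokens" until the flag is set.
fn fake_inference(cancel: Arc<AtomicBool>) -> u64 {
    let mut tokens = 0;
    while !cancel.load(Ordering::Relaxed) {
        tokens += 1;
        thread::sleep(Duration::from_millis(5));
    }
    tokens
}

fn main() {
    let cancel = Arc::new(AtomicBool::new(false));
    let worker = {
        let cancel = Arc::clone(&cancel);
        thread::spawn(move || fake_inference(cancel))
    };

    // A GUI "Stop" button would do exactly this from the main thread:
    thread::sleep(Duration::from_millis(50));
    cancel.store(true, Ordering::Relaxed);

    println!("generated {} tokens before cancel", worker.join().unwrap());
}
```

The same Arc can be cloned into as many UI handlers as needed; a Relaxed store is sufficient since the flag carries no other data.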

Main API

  • download_model() downloads a model file through the local Hugging Face cache.
  • download_model_with_progress() and download_model_with_channel() emit progress updates suitable for UI progress bars.
  • load_tokenizer_from_repo() downloads and loads tokenizer.json.
  • Model::load() loads a quantized GGUF model onto the best available device, dynamically instantiating the correct candle architecture based on the file's GGUF metadata.
  • run_inference() streams generated tokens through a callback.
  • run_inference_with_channel() streams generated tokens over a Tokio channel.
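On the consuming side, the channel-based variants fit the usual worker-thread pattern: a background task sends updates while the UI thread drains them. The sketch below uses a std::sync::mpsc channel and a hypothetical Progress struct as a stand-in; the crate's real update type and channel flavor (e.g. Tokio) may differ:

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical progress payload; the crate's real update type may differ.
struct Progress {
    downloaded: u64,
    total: u64,
}

// Percentage helper for rendering a progress bar.
fn percent(downloaded: u64, total: u64) -> u64 {
    downloaded * 100 / total
}

fn main() {
    let (tx, rx) = mpsc::channel::<Progress>();

    // Stand-in for download_model_with_channel: a worker sends updates.
    thread::spawn(move || {
        for step in 1..=4u64 {
            tx.send(Progress { downloaded: step * 25, total: 100 }).ok();
        }
        // The sender drops here, closing the channel and ending the loop below.
    });

    // The UI thread drains updates and renders a progress bar.
    for p in rx {
        println!("{:>3}%", percent(p.downloaded, p.total));
    }
}
```

In a real desktop app the receiving loop would live in the event loop or an async task rather than blocking the main thread.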

Platform Notes

  • On macOS, the crate prefers Metal and falls back to CPU.
  • On non-macOS platforms, the crate prefers CUDA and falls back to CPU.
  • The public device_used string is intended to be easy to surface directly in desktop UIs.
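The preference order above can be pictured as a compile-time platform check; this is only a sketch of the documented behavior (the crate's actual selection happens inside Model::load via candle's device API):

```rust
// Sketch of the documented preference order, using a cfg! check.
// Either backend falls back to CPU if unavailable at runtime.
fn preferred_backend() -> &'static str {
    if cfg!(target_os = "macos") {
        "Metal"
    } else {
        "CUDA"
    }
}

fn main() {
    println!("preferred: {}, fallback: CPU", preferred_backend());
}
```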

License

Licensed under either of these, at your option:

  • Apache License, Version 2.0
  • MIT license