# candelabra
candelabra is a small Rust crate for desktop applications that want to run
quantized GGUF models (LLaMA, Qwen, Phi, Gemma, etc.) with candle-core,
candle-transformers, and hf-hub.
It focuses on the pieces GUI apps usually need:
- Hugging Face downloads that respect the local `hf-hub` cache
- tokenizer loading helpers
- automatic Metal or CUDA fallback to CPU
- reusable loaded model state
- token streaming with cancellation support
## Current Scope
candelabra natively supports quantized GGUF checkpoints with dynamic architecture detection.
Supported architectures include:
- `llama` / `mistral` / `mixtral`
- `gemma` / `gemma2`
- `phi3`
- `qwen2` (Qwen 2, Qwen 2.5, QwQ)
- `qwen3` / `qwen3moe`
- `gemma3`
- `glm4`
That means the crate is a good fit if you want a lightweight Rust API for local
desktop inference on models such as Qwen 2.5 or SmolLM GGUF variants.
It abstracts the various `candle_transformers::models` paths behind a single unified `Model` type.
## Installation

Add the crate to your Cargo.toml:

```toml
[dependencies]
candelabra = "0.1"
```
## Quick Start

```rust
use candelabra::{download_model, load_tokenizer_from_repo};
use candelabra::{Model, run_inference};
```
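The pieces above can be combined into an end-to-end sketch. The repo id, GGUF filename, and the exact signatures of `download_model`, `load_tokenizer_from_repo`, `Model::load`, and `run_inference` below are illustrative assumptions, not taken verbatim from the crate's documentation:

```rust
use candelabra::{download_model, load_tokenizer_from_repo};
use candelabra::{Model, run_inference};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Fetch a GGUF checkpoint through the local hf-hub cache
    // (repo id and filename are placeholders).
    let model_path = download_model(
        "Qwen/Qwen2.5-0.5B-Instruct-GGUF",
        "qwen2.5-0.5b-instruct-q4_k_m.gguf",
    )?;
    let tokenizer = load_tokenizer_from_repo("Qwen/Qwen2.5-0.5B-Instruct")?;

    // Loads onto Metal/CUDA when available, otherwise CPU, picking the
    // architecture from the GGUF metadata.
    let mut model = Model::load(&model_path)?;
    println!("running on {}", model.device_used);

    // Stream generated tokens through a callback.
    run_inference(&mut model, &tokenizer, "Write a haiku about Rust.", |token| {
        print!("{token}");
        true // return false to cancel generation
    })?;
    Ok(())
}
```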
## Main API

- `download_model()` downloads a model file through the local Hugging Face cache.
- `download_model_with_progress()` and `download_model_with_channel()` emit progress updates suitable for UI progress bars.
- `load_tokenizer_from_repo()` downloads and loads `tokenizer.json`.
- `Model::load()` loads a quantized GGUF model onto the best available device, dynamically instantiating the correct candle architecture based on its metadata.
- `run_inference()` streams generated tokens through a callback.
- `run_inference_with_channel()` streams generated tokens over a Tokio channel.
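The channel-based variant is the natural fit for GUI apps, where generation runs on a background task and tokens are forwarded to the event loop. A minimal sketch, assuming `run_inference_with_channel` returns a `tokio::sync::mpsc` receiver of token strings (its exact signature is an assumption):

```rust
use candelabra::{run_inference_with_channel, Model};

// Consume streamed tokens as they arrive; in a real app you would
// append each token to a text widget instead of printing it.
async fn stream_reply(model: &mut Model, prompt: &str) {
    let mut rx = run_inference_with_channel(model, prompt);
    while let Some(token) = rx.recv().await {
        print!("{token}");
    }
}
```

Dropping the receiver is the usual way to signal cancellation with channel-based streaming: the sending side sees the channel close and stops generating.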
## Platform Notes
- On macOS, the crate prefers Metal and falls back to CPU.
- On non-macOS platforms, the crate prefers CUDA and falls back to CPU.
- The public `device_used` string is intended to be easy to surface directly in desktop UIs.
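For example, `device_used` can feed a status-bar label directly; a small sketch (the exact values of the string, e.g. "Metal", "CUDA", or "CPU", are assumptions):

```rust
use candelabra::Model;

// Build a human-readable label for a status bar or about dialog.
fn device_label(model: &Model) -> String {
    format!("Inference device: {}", model.device_used)
}
```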
## License
Licensed under either of these, at your option:
- Apache License, Version 2.0
- MIT license