Crate candelabra

Candelabra - a desktop-friendly wrapper around Candle for quantized GGUF models (LLaMA, Qwen, Phi, Gemma).

This crate provides:

  • Async model downloads with progress reporting
  • Automatic hardware detection with Metal/CUDA/CPU fallback
  • Reusable model/tokenizer state for repeated inference runs
  • A small, GUI-friendly API for token streaming and cancellation
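
The cancellation hook in the last bullet is a plain `Arc<AtomicBool>` shared between the GUI and the generation loop. A minimal, self-contained sketch of that pattern (independent of candelabra; the `generate` function here only simulates token emission):

```rust
use std::sync::{
    atomic::{AtomicBool, Ordering},
    Arc,
};
use std::thread;
use std::time::Duration;

/// Simulated generation loop: emits "tokens" until the flag is set.
fn generate(cancel: Arc<AtomicBool>) -> Vec<String> {
    let mut tokens = Vec::new();
    for i in 0.. {
        if cancel.load(Ordering::Relaxed) {
            break; // cooperative cancellation, checked once per token
        }
        tokens.push(format!("tok{i}"));
        thread::sleep(Duration::from_millis(1));
    }
    tokens
}

fn main() {
    let cancel = Arc::new(AtomicBool::new(false));
    let flag = cancel.clone();
    // Worker thread generates; the "GUI" side cancels after a short delay.
    let handle = thread::spawn(move || generate(flag));
    thread::sleep(Duration::from_millis(20));
    cancel.store(true, Ordering::Relaxed);
    let tokens = handle.join().unwrap();
    assert!(!tokens.is_empty());
}
```

Because the flag is only polled between tokens, cancellation is cooperative: the loop finishes the token in flight and then stops.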

§Scope

candelabra supports multi-architecture inference for quantized GGUF files, dynamically extracting the architecture string from the file's metadata (llama, phi3, qwen2, etc.) to invoke the proper candle-transformers backend.
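
Conceptually, this dispatch is a match on the GGUF `general.architecture` metadata value. The sketch below is illustrative only; the backend names on the right are placeholders, not the crate's actual internals:

```rust
/// Illustrative mapping from a GGUF `general.architecture` string to a
/// quantized backend name. Placeholder names; the real crate dispatches
/// to candle-transformers model implementations.
fn backend_for(arch: &str) -> Option<&'static str> {
    match arch {
        "llama" => Some("quantized_llama"),
        "phi3" => Some("quantized_phi3"),
        "qwen2" => Some("quantized_qwen2"),
        "gemma" | "gemma2" => Some("quantized_gemma"),
        _ => None, // unsupported architecture -> error in the real crate
    }
}

fn main() {
    assert_eq!(backend_for("llama"), Some("quantized_llama"));
    assert_eq!(backend_for("unknown"), None);
}
```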

§Example

use candelabra::{download_model, load_tokenizer_from_repo, Model, InferenceConfig, run_inference};
use std::sync::{Arc, atomic::AtomicBool};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Download the GGUF weights (or reuse the local Hugging Face cache).
    let model_path = download_model(
        "bartowski/SmolLM2-360M-Instruct-GGUF",
        "SmolLM2-360M-Instruct-Q4_K_M.gguf",
    )?;
    // The tokenizer ships with the original (non-GGUF) model repository.
    let tokenizer = load_tokenizer_from_repo("HuggingFaceTB/SmolLM2-360M-Instruct")?;
    let mut model = Model::load(&model_path)?;
    // Set this flag from another thread to stop generation early.
    let cancel_token = Arc::new(AtomicBool::new(false));
    let config = InferenceConfig::default();

    // The closure is invoked with each token as it is generated.
    let _result = run_inference(
        &mut model,
        &tokenizer,
        &config,
        cancel_token,
        |_| Ok(()),
    )?;

    Ok(())
}

Structs§

DownloadProgress
Progress information emitted during model downloads.
InferenceConfig
Configuration for inference runs.
InferenceResult
Result produced by a completed inference run.
Model
A loaded model ready for inference.

Enums§

CandelabraError
Error type for all candelabra operations.
DeviceType
The type of compute device available for inference.

Functions§

check_model_cached
Checks if a model file is already present in the local Hugging Face cache.
download_model
Downloads a model file via hf-hub and returns the local cached path.
download_model_with_channel
Downloads a model file with progress reporting via Tokio channel.
download_model_with_progress
Downloads a model file with progress reporting via callback.
download_tokenizer
Downloads a tokenizer file via hf-hub and returns the local cached path.
download_tokenizer_with_channel
Downloads a tokenizer file with progress reporting via Tokio channel.
download_tokenizer_with_progress
Downloads a tokenizer file with progress reporting via callback.
get_best_device
Returns the best available compute device with automatic fallback.
get_device
Returns the best available compute device with detailed error information.
load_tokenizer
Loads a tokenizer from disk.
load_tokenizer_from_repo
Downloads and loads a tokenizer from the Hugging Face cache.
run_inference
Runs inference using reusable model and tokenizer state.
run_inference_with_channel
Runs inference and streams tokens over a Tokio channel.
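
The `_with_channel` variants stream tokens over a Tokio channel instead of a callback. The producer/consumer shape can be illustrated with `std::sync::mpsc` as a stand-in (an assumption for the sketch; the actual functions use a Tokio sender, and `fake_inference` here only simulates generation):

```rust
use std::sync::mpsc;
use std::thread;

/// Stand-in for a channel-based inference run: sends each generated
/// token through the channel, then drops the sender so the receiver's
/// iterator ends, signalling completion.
fn fake_inference(tx: mpsc::Sender<String>) {
    for tok in ["Hello", ",", " world"] {
        if tx.send(tok.to_string()).is_err() {
            return; // receiver dropped -> stop generating
        }
    }
}

fn main() {
    let (tx, rx) = mpsc::channel();
    let worker = thread::spawn(move || fake_inference(tx));
    // Consumer side: drain tokens as they arrive and append to the display.
    let text: String = rx.iter().collect();
    worker.join().unwrap();
    assert_eq!(text, "Hello, world");
}
```

Treating a closed channel as "receiver went away, stop generating" is what lets a GUI cancel a stream simply by dropping its receiver.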

Type Aliases§

Result