metal-candle

Production-quality Rust ML crate for Apple Silicon - LoRA training, inference, text generation, and semantic embeddings using Candle with Metal backend

Overview

metal-candle is a pure Rust machine learning library designed specifically for Apple Silicon, providing production-ready tools for:

  • LoRA Training: Fine-tune transformer models efficiently using Low-Rank Adaptation
  • Model Loading: Safetensors format with comprehensive validation
  • Text Generation: High-level Generator API with streaming, repetition penalty, and stop conditions
  • Semantic Embeddings: Sentence-transformers (E5, MiniLM, MPNet) for RAG and search
  • Metal Acceleration: Native Metal backend for optimal M-series chip performance

Why metal-candle?

  • 25.9x Faster than MLX: Beats Apple's official ML framework for embeddings
  • Single Binary: No Python runtime or virtual environments required
  • Pure Rust: Type-safe ML with compile-time guarantees
  • Production Ready: 216 tests, 84.7% coverage, 100% API documentation
  • Ergonomic API: Builder patterns, sensible defaults, clear error messages

Performance

metal-candle demonstrates exceptional performance on Apple Silicon:

Task        | Batch Size   | metal-candle    | MLX          | Speedup
------------|--------------|-----------------|--------------|--------
Embeddings  | 100 docs     | 4.4ms           | 113.5ms      | 25.9x πŸš€
Embeddings  | Single query | 3.9ms           | 7.7ms        | 2.0x
Throughput  | -            | 22,831 docs/sec | 881 docs/sec | 25.9x

Near constant-time batching: scaling from 1 to 100 documents increases latency by only ~13% (3.9ms β†’ 4.4ms)

See BENCHMARKS.md for detailed performance analysis and methodology.

Installation

Add to your Cargo.toml:

[dependencies]
metal-candle = "1.2"

Requirements: Rust 1.75+, Apple Silicon Mac (M1/M2/M3/M4), macOS 12.0+

Quick Start

Text Generation

use metal_candle::inference::{Generator, GeneratorConfig, SamplingStrategy};
use metal_candle::models::Qwen;

// Load the model (`config` and `vb` are the model configuration and weights,
// e.g. a Candle VarBuilder over safetensors files, prepared elsewhere)
let model = Qwen::new(&config, vb)?;

// Configure generation
let gen_config = GeneratorConfig {
    max_tokens: 128,
    sampling: SamplingStrategy::TopP { p: 0.95 },
    temperature: 0.7,
    repetition_penalty: 1.1,  // Reduce repetition
    stop_on_eos: true,
    eos_token_id: Some(151643),  // Qwen EOS token
    ..Default::default()
};

// Generate tokens
let mut generator = Generator::new(Box::new(model), gen_config)?;
let output_ids = generator.generate(&input_ids)?;

// Or use streaming for real-time generation
generator.generate_stream(&input_ids, |token| {
    print!("{} ", token);
    true // Continue generation
})?;
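
The callback contract above (return true to keep generating, false to stop) also allows stop conditions on the caller's side. A minimal sketch, assuming generate_stream accepts an FnMut closure; the 32-token budget here is a hypothetical caller-side choice, not a crate setting:

// Stop streaming once a caller-side token budget is reached.
let mut emitted = 0usize;
generator.generate_stream(&input_ids, |token| {
    emitted += 1;
    print!("{} ", token);
    emitted < 32  // returning false halts generation
})?;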

Semantic Embeddings (RAG & Search)

use metal_candle::embeddings::{EmbeddingModel, EmbeddingModelType};
use metal_candle::Device;

// Load embedding model with Metal acceleration (25.9x faster than MLX!)
let device = Device::new_metal(0)?;
let model = EmbeddingModel::from_pretrained(
    EmbeddingModelType::E5SmallV2,
    device,
)?;

// Generate embeddings for semantic search
let texts = vec![
    "Rust is a systems programming language",
    "Python is a high-level language",
];
let embeddings = model.encode(&texts)?;  // [batch, 384] in 3.9ms

// Batch processing: 100 docs in 4.4ms (22,831 docs/sec throughput)
let large_corpus = load_documents()?;
let batch_embeddings = model.encode(&large_corpus)?;
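
Because the embedding models apply L2 normalization (see Features below), cosine similarity reduces to a dot product, so ranking documents against a query is a single matmul. A minimal sketch, assuming encode returns a candle_core::Tensor of shape [batch, 384]; rank_by_similarity is a hypothetical helper, not part of the crate:

use candle_core::Tensor;

/// Rank documents against a single query embedding by cosine similarity.
/// Assumes both tensors are already L2-normalized, so dot product == cosine.
fn rank_by_similarity(query: &Tensor, docs: &Tensor) -> candle_core::Result<Vec<(usize, f32)>> {
    // [1, dim] x [dim, n_docs] -> [1, n_docs]
    let scores: Vec<f32> = query.matmul(&docs.t()?)?.squeeze(0)?.to_vec1()?;
    // Pair each score with its document index and sort best-first.
    let mut ranked: Vec<(usize, f32)> = scores.into_iter().enumerate().collect();
    ranked.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap_or(std::cmp::Ordering::Equal));
    Ok(ranked)
}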

LoRA Training

use metal_candle::training::{
    LoRAAdapter, LoRAAdapterConfig, TargetModule,
    Trainer, TrainingConfig, LRScheduler
};

// Create LoRA adapter
let lora_config = LoRAAdapterConfig {
    rank: 8,
    alpha: 16.0,
    dropout: 0.0,
    target_modules: vec![TargetModule::QProj, TargetModule::VProj],
};
let adapter = LoRAAdapter::new(&model, lora_config, &device)?;

// Configure and train
let training_config = TrainingConfig {
    num_epochs: 3,
    lr_scheduler: LRScheduler::warmup_cosine(100, 1000, 1e-4, 1e-6),
    ..Default::default()
};
let trainer = Trainer::new(adapter, training_config)?;
let metrics = trainer.train(&dataset)?;
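
For intuition, a warmup-cosine schedule like the one above ramps the learning rate linearly during warmup and then decays it along a cosine curve toward a floor. A standalone sketch of that shape (illustrative only; the parameter order in LRScheduler::warmup_cosine is assumed here to be warmup steps, total steps, peak LR, minimum LR):

// Illustration of a warmup-cosine learning-rate schedule, not the crate's internals.
fn warmup_cosine_lr(step: usize, warmup: usize, total: usize, peak: f64, min: f64) -> f64 {
    if step < warmup {
        // Linear warmup from 0 to the peak learning rate.
        peak * step as f64 / warmup as f64
    } else {
        // Cosine decay from peak down to min over the remaining steps.
        let progress = ((step - warmup) as f64 / (total - warmup) as f64).min(1.0);
        min + 0.5 * (peak - min) * (1.0 + (std::f64::consts::PI * progress).cos())
    }
}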

Features

Training: LoRA layers with dropout, AdamW optimizer, LR schedulers (Constant, Linear, Cosine, WarmupCosine), checkpoint management, gradient flow, cross-entropy loss with label smoothing

Inference: KV-cache (~173 MB for 2048 tokens), multiple sampling strategies (Greedy, Top-k, Top-p, Temperature), repetition penalty, streaming generation with callbacks, stop conditions (EOS tokens, custom tokens)
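
The repetition penalty listed above is the standard logit transform: tokens already present in the context are made less likely before sampling. An illustrative sketch (not the crate's internal code):

// Penalize previously generated tokens: divide positive logits by the penalty,
// multiply negative ones, so repeated tokens become less probable.
fn apply_repetition_penalty(logits: &mut [f32], prev_tokens: &[u32], penalty: f32) {
    for &tok in prev_tokens {
        if let Some(logit) = logits.get_mut(tok as usize) {
            *logit = if *logit > 0.0 { *logit / penalty } else { *logit * penalty };
        }
    }
}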

Models: Qwen2.5-Coder architecture, safetensors format, transformer components (RoPE, GQA, MLP), builder pattern with dtype conversion

Embeddings (feature: embeddings): Sentence transformers (E5-small-v2, MiniLM-L6-v2, MPNet-base-v2), HuggingFace Hub integration, mean pooling, L2 normalization, Metal acceleration
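
Mean pooling and L2 normalization, as listed above, turn per-token hidden states into a single unit-length sentence vector. An illustrative candle sketch (not the crate's internal code), assuming hidden states of shape [batch, seq, dim] and a float attention mask of shape [batch, seq] with 1.0 for real tokens:

use candle_core::Tensor;

// Mask-aware mean pooling followed by L2 normalization.
fn mean_pool_l2(hidden: &Tensor, mask: &Tensor) -> candle_core::Result<Tensor> {
    let mask = mask.unsqueeze(2)?;                         // [batch, seq, 1]
    let summed = hidden.broadcast_mul(&mask)?.sum(1)?;     // [batch, dim]
    let counts = mask.sum(1)?;                             // [batch, 1]
    let mean = summed.broadcast_div(&counts)?;             // masked mean
    let norm = mean.sqr()?.sum_keepdim(1)?.sqrt()?;        // L2 norm per row
    mean.broadcast_div(&norm)                              // unit-length embeddings
}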

Quality: 254 tests (179 lib + 75 doc), β‰₯80% code coverage enforced, strict clippy pedantic linting, 100% API documentation, CI/CD on Apple Silicon

Architecture

Built on Candle with Metal backend:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                  metal-candle (Public API)                  β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  Training          β”‚  Inference        β”‚  Models            β”‚
β”‚  β€’ LoRAAdapter     β”‚  β€’ KVCache        β”‚  β€’ ModelLoader     β”‚
β”‚  β€’ Trainer         β”‚  β€’ Sampling       β”‚  β€’ Qwen            β”‚
β”‚  β€’ AdamW           β”‚  β€’ Generator      β”‚  β€’ Config          β”‚
β”‚  β€’ Schedulers      β”‚                   β”‚                    β”‚
β”‚  β€’ Checkpoint      β”‚  Embeddings       β”‚                    β”‚
β”‚                    β”‚  β€’ EmbeddingModel β”‚                    β”‚
β”‚                    β”‚  β€’ E5/MiniLM/MPNetβ”‚                    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                              β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                      Candle Framework                       β”‚
β”‚  β€’ Tensor operations  β€’ Metal backend  β€’ Autograd           β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                              β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                       Apple Metal API                       β”‚
β”‚  (GPU acceleration on Apple Silicon)                        β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

See ARCHITECTURE.md for detailed architecture documentation.

Documentation

Examples

Example            | Description
-------------------|---------------------------------------------
generate_text.rs   | Text generation with streaming and sampling
train_lora.rs      | End-to-end LoRA training
embeddings_demo.rs | Semantic search with embeddings
inference_demo.rs  | KV-cache and sampling demo
load_model.rs      | Model loading and inspection

Run examples:

cargo run --example generate_text
cargo run --example train_lora
cargo run --example embeddings_demo --features embeddings

Development

See CONTRIBUTING.md for detailed development setup, testing guidelines, and coding standards.

Quick start:

# Clone and build
git clone https://github.com/GarthDB/metal-candle.git
cd metal-candle
cargo build

# Run tests and checks
cargo test
cargo clippy -- -D warnings
cargo fmt --check

Quality standards enforced: Zero clippy warnings (pedantic), β‰₯80% code coverage, 100% API documentation, all tests passing.

Roadmap

v1.1 βœ… Complete

  • βœ… Foundation & Metal Backend
  • βœ… Model Loading & Architecture (Qwen2.5-Coder)
  • βœ… LoRA Training Pipeline
  • βœ… Inference & Text Generation
  • βœ… High-level Generator API
  • βœ… Advanced sampling strategies with repetition penalty
  • βœ… Streaming generation with callbacks
  • βœ… Semantic embeddings (E5, MiniLM, MPNet)
  • βœ… Quality & Documentation

v1.2+ (Future)

  • Generator KV-cache optimization (incremental token passing)
  • Custom fused softmax kernel (Issue #27)
  • GGUF format support
  • Additional model architectures (LLaMA, Mistral)
  • Quantization (4-bit, 8-bit)
  • Flash Attention integration
  • Multi-GPU support

Contributing

Contributions are welcome! Please see CONTRIBUTING.md for code quality standards, testing requirements, and PR process.

Quick checklist:

  • cargo clippy -- -D warnings passes
  • cargo test passes
  • cargo fmt applied
  • New code has tests
  • Public APIs documented
  • No unwrap() in library code

License

Licensed under the Apache License, Version 2.0 (LICENSE or http://www.apache.org/licenses/LICENSE-2.0).

The Apache License provides explicit patent protection, which is important for production machine learning libraries.

Acknowledgments

Known Advisories

This project has two transitive dependencies flagged as unmaintained (not security issues):

  • number_prefix (via hf-hub β†’ indicatif)
  • paste (via candle-core β†’ gemm/metal)

These come from major, trusted dependencies (Candle, HuggingFace) and pose no known security risk; they will be resolved when the upstream crates update. See deny.toml for details.

Support


Status: βœ… v1.1.0 Released - Production Ready
Maintained by: @GarthDB
License: Apache-2.0