# metal-candle

Production-quality Rust ML library for Apple Silicon: LoRA training, text generation, and semantic embeddings.
## Overview
Pure Rust machine learning library optimized for Apple Silicon:
- **LoRA Training**: Fine-tune transformer models efficiently
- **Text Generation**: Streaming, multiple sampling strategies, repetition penalty
- **Semantic Embeddings**: E5, MiniLM, MPNet models for RAG and search
- **Metal Acceleration**: Native GPU acceleration on M-series chips
**Why metal-candle?** 25.9x faster than MLX for embeddings, single-binary deployment, type-safe ML, and production-ready quality (407 tests, 81.6% coverage).
## Performance
metal-candle demonstrates exceptional performance on Apple Silicon:
| Task | Batch Size | metal-candle | MLX | Speedup |
|---|---|---|---|---|
| Embeddings | 100 docs | 4.4ms | 113.5ms | 25.9x |
| Embeddings | Single query | 3.9ms | 7.7ms | 2.0x |
| Throughput | - | 22,831 docs/sec | 881 docs/sec | 25.9x |
**Near constant-time performance**: going from batch size 1 to 100 increases latency by only 13% (3.9ms → 4.4ms).
See BENCHMARKS.md for detailed performance analysis and methodology.
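As a sanity check, the speedup and throughput columns follow directly from the measured batch timings (small differences from the table come from rounding to one decimal place):

```rust
fn main() {
    // Batch-of-100 timings from the table above, in milliseconds
    let metal_candle_ms = 4.4_f64;
    let mlx_ms = 113.5_f64;

    // Speedup is the ratio of the two latencies
    let speedup = mlx_ms / metal_candle_ms;

    // Throughput: 100 documents divided by the batch latency in seconds
    let throughput = 100.0 / (metal_candle_ms / 1000.0);

    println!("speedup: {speedup:.1}x, throughput: {throughput:.0} docs/sec");
}
```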
## Installation

Add the crate to your `Cargo.toml`:

```toml
[dependencies]
metal-candle = "1.2"  # or latest from crates.io
```

**Requirements**: Rust 1.75+, Apple Silicon (M1/M2/M3/M4), macOS 12.0+
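Async streaming (used later in the Quick Start) is gated behind a Cargo feature; assuming the feature is named `streaming` as the Quick Start note suggests, enabling it looks like:

```toml
[dependencies]
metal-candle = { version = "1.2", features = ["streaming"] }
```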
## Quick Start

### Text Generation
Identifiers, paths, and arguments below are illustrative reconstructions; see the API reference for exact signatures.

```rust
use metal_candle::{Generator, GeneratorConfig};
use metal_candle::models::Qwen;

// Load model
let model = Qwen::new("path/to/model.safetensors", &device)?;

// Configure generation
let gen_config = GeneratorConfig::default();

// Generate tokens
let mut generator = Generator::new(model, gen_config)?;
let output_ids = generator.generate(&prompt_ids)?;

// Or use streaming for real-time generation (v1.3.0+)
generator.generate_stream(&prompt_ids)?;
// Async streaming is also available (requires the 'streaming' feature)
```
### Semantic Embeddings (RAG & Search)
The snippet below is an illustrative sketch (the model ID and helper names are placeholders):

```rust
use metal_candle::EmbeddingModel;
use candle_core::Device;

// Load embedding model with Metal acceleration (25.9x faster than MLX!)
let device = Device::new_metal(0)?;
let model = EmbeddingModel::from_pretrained("intfloat/e5-small-v2", &device)?;

// Generate embeddings for semantic search
let texts = vec!["query: how do I sort a Vec?", "passage: Use slice::sort()."];
let embeddings = model.encode(&texts)?; // [batch, 384] in 3.9ms

// Batch processing: 100 docs in 4.4ms (22,831 docs/sec throughput)
let large_corpus = load_documents()?;
let batch_embeddings = model.encode(&large_corpus)?;
```
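Embeddings like these are typically ranked with cosine similarity. A minimal, dependency-free sketch, with toy 4-dimensional vectors standing in for the model's 384-dimensional output:

```rust
/// Cosine similarity between two embedding vectors.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (norm_a * norm_b)
}

fn main() {
    let query = [0.1_f32, 0.9, 0.2, 0.0];
    let doc_a = [0.1_f32, 0.8, 0.3, 0.1]; // close to the query
    let doc_b = [0.9_f32, 0.0, 0.1, 0.4]; // unrelated

    let sim_a = cosine_similarity(&query, &doc_a);
    let sim_b = cosine_similarity(&query, &doc_b);
    println!("doc_a: {sim_a:.3}, doc_b: {sim_b:.3}");
}
```

Ranking a corpus is then just sorting documents by similarity to the query embedding.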
### LoRA Training
Type and argument names here are reconstructed for illustration; consult the API docs for the real signatures.

```rust
use metal_candle::training::{LoRAAdapter, LoRAAdapterConfig, Trainer, TrainingConfig};

// Create LoRA adapter
let lora_config = LoRAAdapterConfig::default();
let adapter = LoRAAdapter::new(&model, &lora_config)?;

// Configure and train
let training_config = TrainingConfig::default();
let trainer = Trainer::new(adapter, training_config)?;
let metrics = trainer.train(&dataset)?;
```
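LoRA's efficiency rests on simple arithmetic: the `d_out × d_in` base weight stays frozen, and only two low-rank factors are trained (`A`: `rank × d_in`, `B`: `d_out × rank`). A sketch with illustrative dimensions (not Qwen's actual shapes; the ~0.03% figure cited later is measured over the whole model, where most weights carry no adapter at all):

```rust
/// Trainable parameter count for a LoRA adapter on one d_out x d_in weight.
fn lora_params(d_out: usize, d_in: usize, rank: usize) -> usize {
    // A: rank x d_in, B: d_out x rank; the base weight stays frozen
    rank * d_in + d_out * rank
}

fn main() {
    let (d_out, d_in, rank) = (4096, 4096, 8); // illustrative sizes
    let full = d_out * d_in;
    let lora = lora_params(d_out, d_in, rank);
    println!(
        "full: {full} params, LoRA: {lora} params ({:.2}% of the layer)",
        100.0 * lora as f64 / full as f64
    );
}
```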
### LoRA Adapter Management (v1.3.0+)
Registry and method names below are illustrative; see the API reference.

```rust
use metal_candle::training::{AdapterRegistry, LoRAAdapter};

// Create registry for managing multiple adapters
let mut registry = AdapterRegistry::new();

// Load task-specific adapters
let code_adapter = LoRAAdapter::new(&model, &code_config)?;
let chat_adapter = LoRAAdapter::new(&model, &chat_config)?;
registry.add_adapter("code", code_adapter)?;
registry.add_adapter("chat", chat_adapter)?;

// Switch between adapters without reloading the base model
registry.activate("code")?;
// ... use model for code generation ...
registry.activate("chat")?;
// ... use model for chat ...

// Memory efficient: adapters are ~0.03% of base model size
println!("active: {:?}", registry.active_adapter());
```
## Features
- Training: LoRA with dropout, AdamW optimizer, learning rate schedulers, checkpoint management, adapter registry (v1.3.0+)
- Inference: KV-cache, multiple sampling strategies, streaming generation (sync & async), repetition penalty, rich token metadata (v1.3.0+)
- Models: Qwen2.5-Coder, safetensors format, transformer components (RoPE, GQA, MLP)
- Embeddings: E5, MiniLM, MPNet with HuggingFace Hub integration
- Quality: 407 tests, 81.6% coverage, strict clippy linting, 100% API documentation
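The repetition penalty listed above is conceptually simple. Here is a self-contained sketch of the usual CTRL-style formulation (not necessarily metal-candle's exact implementation): logits of already-emitted tokens are dampened before sampling.

```rust
/// CTRL-style repetition penalty: dampen logits of tokens already generated.
fn apply_repetition_penalty(logits: &mut [f32], generated: &[usize], penalty: f32) {
    for &tok in generated {
        let l = &mut logits[tok];
        // Both branches shrink the token's post-softmax probability:
        // positive logits are divided, negative logits are multiplied.
        if *l > 0.0 {
            *l /= penalty;
        } else {
            *l *= penalty;
        }
    }
}

fn main() {
    let mut logits = [2.0_f32, -1.0, 0.5, 3.0];
    apply_repetition_penalty(&mut logits, &[0, 1], 1.5);
    println!("{logits:?}"); // tokens 0 and 1 are now less likely
}
```

In practice the `generated` list should be deduplicated so each token is penalized only once.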
## Architecture
Built on Candle with Metal backend:
```
┌───────────────────────────────────────────────────────────────┐
│                   metal-candle (Public API)                   │
├───────────────────────────────────────────────────────────────┤
│  Training         │  Inference          │  Models             │
│  • LoRAAdapter    │  • KVCache          │  • ModelLoader      │
│  • Trainer        │  • Sampling         │  • Qwen             │
│  • AdamW          │  • Generator        │  • Config           │
│  • Schedulers     │                     │                     │
│  • Checkpoint     │  Embeddings         │                     │
│                   │  • EmbeddingModel   │                     │
│                   │  • E5/MiniLM/MPNet  │                     │
└───────────────────────────────────────────────────────────────┘
                                │
┌───────────────────────────────────────────────────────────────┐
│                        Candle Framework                       │
│   • Tensor operations   • Metal backend   • Autograd          │
└───────────────────────────────────────────────────────────────┘
                                │
┌───────────────────────────────────────────────────────────────┐
│                        Apple Metal API                        │
│              (GPU acceleration on Apple Silicon)              │
└───────────────────────────────────────────────────────────────┘
```
See ARCHITECTURE.md for detailed architecture documentation.
## Documentation
- API Reference - Complete API documentation
- Architecture Guide - System design and implementation details
- Contributing Guide - Development standards and guidelines
- Benchmarks - Performance analysis and methodology
- Project Plan - Development roadmap and future plans
## Examples
| Example | Description |
|---|---|
| `generate_text.rs` | Text generation with streaming and sampling |
| `train_lora.rs` | End-to-end LoRA training |
| `embeddings_demo.rs` | Semantic search with embeddings |
| `inference_demo.rs` | KV-cache and sampling demo |
| `load_model.rs` | Model loading and inspection |
Run examples:

```sh
cargo run --release --example generate_text
```
## Development

```sh
cargo test && cargo clippy
```

See CONTRIBUTING.md for full guidelines. Quality standards: zero clippy warnings (pedantic), ≥80% coverage, 100% API docs.
## Roadmap
See ROADMAP.md for detailed release plans and NEXT_STEPS.md for immediate priorities.
### Upcoming Releases
- v1.3.1 (Jan 2025): ApplyAdapter implementation, streaming benchmarks
- v1.4.0 (Feb 2025): GGUF format support
- v1.5.0 (Mar 2025): LLaMA/Mistral architectures
- v1.6.0 (Apr 2025): 4-bit/8-bit quantization
- v1.7.0 (May 2025): Flash Attention
- v2.0.0 (Q3 2025): Multi-GPU support
Track progress on the v1.3+ Feature Roadmap project board. Vote with 👍 on issues you'd like to see prioritized!
## Contributing
Contributions welcome! See CONTRIBUTING.md for development standards and testing requirements.
## License
Licensed under Apache-2.0 (LICENSE). Provides explicit patent protection for production ML.
## Acknowledgments
- Built on the excellent Candle framework by Hugging Face
- Inspired by MLX and llama.cpp
- LoRA implementation based on the LoRA paper (Hu et al., 2021)
## Known Advisories

Two transitive dependencies (`number_prefix`, `paste`) are flagged as unmaintained; neither is a security issue, and both come via trusted upstream crates (Candle, Hugging Face). See `deny.toml` for details.
## Support
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Documentation: ARCHITECTURE.md | CONTRIBUTING.md
Maintained by: @GarthDB