metal-candle
Production-quality Rust ML crate for Apple Silicon - LoRA training, inference, text generation, and semantic embeddings using Candle with Metal backend
Overview
metal-candle is a pure Rust machine learning library designed specifically for Apple Silicon, providing production-ready tools for:
- LoRA Training: Fine-tune transformer models efficiently using Low-Rank Adaptation
- Model Loading: Safetensors format with comprehensive validation
- Text Generation: High-level Generator API with streaming, repetition penalty, and stop conditions
- Semantic Embeddings: Sentence-transformers (E5, MiniLM, MPNet) for RAG and search
- Metal Acceleration: Native Metal backend for optimal M-series chip performance
Why metal-candle?
- 25.9x Faster than MLX: Beats Apple's official ML framework for embeddings
- Single Binary: No Python runtime or virtual environments required
- Pure Rust: Type-safe ML with compile-time guarantees
- Production Ready: 216 tests, 84.7% coverage, 100% API documentation
- Ergonomic API: Builder patterns, sensible defaults, clear error messages
Performance
metal-candle demonstrates exceptional performance on Apple Silicon:
| Task | Batch Size | metal-candle | MLX | Speedup |
|---|---|---|---|---|
| Embeddings | 100 docs | 4.4ms | 113.5ms | 25.9x |
| Embeddings | Single query | 3.9ms | 7.7ms | 2.0x |
| Throughput | - | 22,831 docs/sec | 881 docs/sec | 25.9x |
Near constant-time performance: increasing the batch size from 1 to 100 raises latency by only 13% (3.9ms → 4.4ms)
See BENCHMARKS.md for detailed performance analysis and methodology.
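As a sanity check, the throughput row follows from the batch-latency row by simple arithmetic:

```rust
fn main() {
    // Throughput = documents / latency (values from the table above).
    let docs = 100.0_f64;
    let latency_s = 4.4e-3; // 4.4 ms per 100-doc batch
    let docs_per_sec = docs / latency_s;
    println!("{docs_per_sec:.0}"); // ~22,727 docs/sec; the table's 22,831
                                   // comes from the unrounded latency
}
```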
Installation
Add to your Cargo.toml:
```toml
[dependencies]
metal-candle = "1.2"
```
Requirements: Rust 1.75+, Apple Silicon Mac (M1/M2/M3/M4), macOS 12.0+
Quick Start
Text Generation
```rust
// Illustrative sketch: exact import paths, struct fields, and arguments
// vary; consult the crate's API docs.
use metal_candle::{Generator, GeneratorConfig, Qwen};

// Load model
let model = Qwen::new(/* model path, device */)?;

// Configure generation
let gen_config = GeneratorConfig { /* sampling, penalties, stops, ... */ };

// Generate tokens
let mut generator = Generator::new(/* model, tokenizer, gen_config */)?;
let output_ids = generator.generate(/* prompt token ids */)?;

// Or use streaming for real-time generation
generator.generate_stream(/* prompt, per-token callback */)?;
```
Semantic Embeddings (RAG & Search)
```rust
// Illustrative sketch: exact import paths and arguments vary; see the API docs.
use metal_candle::EmbeddingModel;
use candle_core::Device;

// Load embedding model with Metal acceleration (25.9x faster than MLX!)
let device = Device::new_metal(0)?;
let model = EmbeddingModel::from_pretrained(/* model id, &device */)?;

// Generate embeddings for semantic search
let texts = vec!["first document", "second document"];
let embeddings = model.encode(&texts)?; // [batch, 384] in 3.9ms

// Batch processing: 100 docs in 4.4ms (22,831 docs/sec throughput)
let large_corpus = load_documents()?; // hypothetical helper
let batch_embeddings = model.encode(&large_corpus)?;
```
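Since the embedding models L2-normalize their outputs (see Features below), ranking documents against a query reduces to a dot product. A minimal sketch with hypothetical, already-normalized vectors standing in for real 384-dim embeddings:

```rust
/// Dot product of two equal-length vectors; for L2-normalized
/// embeddings this equals cosine similarity.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

fn main() {
    // Hypothetical 4-dim embeddings (unit-length), for illustration only.
    let query = [0.6_f32, 0.8, 0.0, 0.0];
    let docs = [
        ("rust ml", [0.6_f32, 0.8, 0.0, 0.0]), // same direction as query
        ("cooking", [0.0_f32, 0.0, 1.0, 0.0]), // orthogonal to query
    ];
    let mut ranked: Vec<_> = docs
        .iter()
        .map(|(name, v)| (*name, cosine(&query, v)))
        .collect();
    ranked.sort_by(|a, b| b.1.total_cmp(&a.1));
    assert_eq!(ranked[0].0, "rust ml"); // best match first
}
```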
LoRA Training
```rust
// Illustrative sketch: field names and arguments vary; see the API docs.
use metal_candle::{LoRAAdapter, LoRAAdapterConfig, Trainer, TrainingConfig};

// Create LoRA adapter
let lora_config = LoRAAdapterConfig { /* rank, alpha, dropout, ... */ };
let adapter = LoRAAdapter::new(/* &model, lora_config */)?;

// Configure and train
let training_config = TrainingConfig { /* lr, scheduler, epochs, ... */ };
let trainer = Trainer::new(/* adapter, training_config */)?;
let metrics = trainer.train(/* dataset */)?;
```
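The adapter follows the standard LoRA formulation: the frozen weight W is augmented with a low-rank update scaled by alpha/r, i.e. W' = W + (alpha/r)·B·A. A toy sketch of that math in plain arrays (no Candle; shapes and values are illustrative, not the crate's API):

```rust
/// Multiply an (m x k) matrix by a (k x n) matrix (row-major).
fn matmul(a: &[Vec<f32>], b: &[Vec<f32>]) -> Vec<Vec<f32>> {
    let (m, k, n) = (a.len(), b.len(), b[0].len());
    let mut out = vec![vec![0.0; n]; m];
    for i in 0..m {
        for p in 0..k {
            for j in 0..n {
                out[i][j] += a[i][p] * b[p][j];
            }
        }
    }
    out
}

fn main() {
    // W: 2x2 frozen weight; LoRA update with rank r = 1, alpha = 2.
    let w = vec![vec![1.0, 0.0], vec![0.0, 1.0]];
    let b = vec![vec![1.0], vec![0.0]]; // (2 x r)
    let a = vec![vec![0.0, 1.0]];       // (r x 2)
    let (alpha, r) = (2.0, 1.0);

    let delta = matmul(&b, &a); // B·A, a rank-1 update
    let scale = alpha / r;
    let w_prime: Vec<Vec<f32>> = w
        .iter()
        .zip(&delta)
        .map(|(wr, dr)| wr.iter().zip(dr).map(|(x, d)| x + scale * d).collect())
        .collect();
    assert_eq!(w_prime, vec![vec![1.0, 2.0], vec![0.0, 1.0]]);
}
```

Only A and B (r·(m+n) values) are trained, which is why LoRA fine-tuning is so much cheaper than updating the full weight.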
Features
Training: LoRA layers with dropout, AdamW optimizer, LR schedulers (Constant, Linear, Cosine, WarmupCosine), checkpoint management, gradient flow, cross-entropy loss with label smoothing
Inference: KV-cache (~173 MB for 2048 tokens), multiple sampling strategies (Greedy, Top-k, Top-p, Temperature), repetition penalty, streaming generation with callbacks, stop conditions (EOS tokens, custom tokens)
Models: Qwen2.5-Coder architecture, safetensors format, transformer components (RoPE, GQA, MLP), builder pattern with dtype conversion
Embeddings (feature: embeddings): Sentence transformers (E5-small-v2, MiniLM-L6-v2, MPNet-base-v2), HuggingFace Hub integration, mean pooling, L2 normalization, Metal acceleration
Quality: 254 tests (179 lib + 75 doc), ≥80% code coverage enforced, strict clippy pedantic linting, 100% API documentation, CI/CD on Apple Silicon
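Of the schedulers listed above, WarmupCosine is the least obvious; the usual formulation ramps the learning rate linearly over the warmup steps, then decays it along a half cosine. A standalone sketch of that schedule (function and parameter names are illustrative, not the crate's API):

```rust
use std::f64::consts::PI;

/// Learning rate at `step`: linear warmup, then cosine decay to zero.
fn warmup_cosine(base_lr: f64, warmup_steps: u32, total_steps: u32, step: u32) -> f64 {
    if step < warmup_steps {
        // Ramp from base_lr/warmup_steps up to base_lr.
        base_lr * f64::from(step + 1) / f64::from(warmup_steps)
    } else {
        // Half-cosine decay from base_lr down to 0.
        let progress =
            f64::from(step - warmup_steps) / f64::from(total_steps - warmup_steps);
        base_lr * 0.5 * (1.0 + (PI * progress).cos())
    }
}

fn main() {
    let lr = |s| warmup_cosine(1e-3, 10, 100, s);
    assert!(lr(0) < lr(9));                // still warming up
    assert!((lr(9) - 1e-3).abs() < 1e-9);  // peak at end of warmup
    assert!((lr(55) - 5e-4).abs() < 1e-9); // halfway through the decay
    assert!(lr(100) < 1e-6);               // essentially zero at the end
}
```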
Architecture
Built on Candle with Metal backend:
```
┌──────────────────────────────────────────────────────────────┐
│                  metal-candle (Public API)                   │
├──────────────────────────────────────────────────────────────┤
│  Training          │  Inference         │  Models            │
│  • LoRAAdapter     │  • KVCache         │  • ModelLoader     │
│  • Trainer         │  • Sampling        │  • Qwen            │
│  • AdamW           │  • Generator       │  • Config          │
│  • Schedulers      │                    │                    │
│  • Checkpoint      │  Embeddings        │                    │
│                    │  • EmbeddingModel  │                    │
│                    │  • E5/MiniLM/MPNet │                    │
└──────────────────────────────────────────────────────────────┘
                               │
┌──────────────────────────────────────────────────────────────┐
│                       Candle Framework                       │
│  • Tensor operations  • Metal backend  • Autograd            │
└──────────────────────────────────────────────────────────────┘
                               │
┌──────────────────────────────────────────────────────────────┐
│                       Apple Metal API                        │
│             (GPU acceleration on Apple Silicon)              │
└──────────────────────────────────────────────────────────────┘
```
See ARCHITECTURE.md for detailed architecture documentation.
Documentation
- API Reference - Complete API documentation
- Architecture Guide - System design and implementation details
- Contributing Guide - Development standards and guidelines
- Benchmarks - Performance analysis and methodology
- Project Plan - Development roadmap and future plans
Examples
| Example | Description |
|---|---|
| `generate_text.rs` | Text generation with streaming and sampling |
| `train_lora.rs` | End-to-end LoRA training |
| `embeddings_demo.rs` | Semantic search with embeddings |
| `inference_demo.rs` | KV-cache and sampling demo |
| `load_model.rs` | Model loading and inspection |
Run an example with `cargo run --example <name>`, e.g. `cargo run --example generate_text`.
Development
See CONTRIBUTING.md for detailed development setup, testing guidelines, and coding standards.
Quick start:

```sh
# Clone and build
git clone <repository-url>   # see the GitHub links under Support
cd metal-candle
cargo build

# Run tests and checks
cargo test
cargo clippy -- -D warnings
cargo fmt --check
```
Quality standards enforced: Zero clippy warnings (pedantic), β₯80% code coverage, 100% API documentation, all tests passing.
Roadmap
v1.1 ✅ Complete
- ✅ Foundation & Metal Backend
- ✅ Model Loading & Architecture (Qwen2.5-Coder)
- ✅ LoRA Training Pipeline
- ✅ Inference & Text Generation
- ✅ High-level Generator API
- ✅ Advanced sampling strategies with repetition penalty
- ✅ Streaming generation with callbacks
- ✅ Semantic embeddings (E5, MiniLM, MPNet)
- ✅ Quality & Documentation
v1.2+ (Future)
- Generator KV-cache optimization (incremental token passing)
- Custom fused softmax kernel (Issue #27)
- GGUF format support
- Additional model architectures (LLaMA, Mistral)
- Quantization (4-bit, 8-bit)
- Flash Attention integration
- Multi-GPU support
Contributing
Contributions are welcome! Please see CONTRIBUTING.md for code quality standards, testing requirements, and PR process.
Quick checklist:
- `cargo clippy -- -D warnings` passes
- `cargo test` passes
- `cargo fmt` applied
- New code has tests
- Public APIs documented
- No `unwrap()` in library code
License
Licensed under the Apache License, Version 2.0 (LICENSE or http://www.apache.org/licenses/LICENSE-2.0).
The Apache License provides explicit patent protection, which is important for production machine learning libraries.
Acknowledgments
- Built on the excellent Candle framework by Hugging Face
- Inspired by MLX and llama.cpp
- LoRA implementation based on LoRA paper
Known Advisories
This project has two transitive dependencies flagged as unmaintained (not security issues):
- `number_prefix` (via hf-hub → indicatif)
- `paste` (via candle-core → gemm/metal)
These are from major, trusted dependencies (Candle, HuggingFace) and pose no security risk. They will be resolved when upstream updates. See deny.toml for details.
Support
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Documentation: ARCHITECTURE.md | CONTRIBUTING.md
Status: ✅ v1.1.0 Released - Production Ready
Maintained by: @GarthDB
License: Apache-2.0