# candle-coreml
CoreML inference engine for Candle tensors - providing Apple CoreML integration for Rust machine learning applications.
## Overview
candle-coreml is a standalone crate that bridges Candle tensors with Apple's CoreML framework, enabling efficient on-device inference on macOS and iOS. Unlike generic CoreML bindings, this crate provides:
- Candle-specific integration - Direct tensor conversion and device validation
- Inference engine approach - CoreML as an inference backend, not a device type
- Apple Silicon optimization - Leverages unified memory architecture
- Production ready - Comprehensive error handling and testing
## Key Features
- ✅ Direct Candle tensor support - CPU and Metal tensor inference
- ✅ Device validation - Automatic device compatibility checking
- ✅ Unified memory - Efficient tensor conversion using M1/M2 architecture
- ✅ Error handling - Candle-compatible error types and messages
- ✅ Comprehensive testing - Unit tests, integration tests, and real model testing
- ✅ Cross-platform builds - Compiles on all platforms, runs on macOS
## Quick Start

Add to your `Cargo.toml`:

```toml
[dependencies]
candle-coreml = "0.3.1"
candle-core = "0.9.1"
```
Basic usage with `UnifiedModelLoader` (Recommended):

```rust
use candle_coreml::UnifiedModelLoader;

// Load model directly from HuggingFace with automatic setup
let loader = UnifiedModelLoader::new()?;
let mut model = loader.load_model("anemll/anemll-Qwen-Qwen3-0.6B-ctx512_0.3.4")?;

// Generate text using the new API: (prompt, max_tokens, temperature)
let response = model.complete_text("What is machine learning?", 128, 0.7)?;
println!("{response}");
```
Manual CoreML model loading:

```rust
use candle_core::{DType, Device, Tensor};
use candle_coreml::{Config, CoreMLModel};

// (Type names and paths in this snippet are illustrative.)

// Load model config (typically auto-generated)
let config = Config::load_from_file("path/to/config.json")?;

// Load CoreML model components
let model = CoreMLModel::load_from_file("path/to/model.mlmodelc", &config)?;

// Create input tensor
let input = Tensor::zeros((1, 128), DType::F32, &Device::Cpu)?;

// Run inference
let output = model.forward(&[&input])?;
```
## 🔥 ANEMLL Models: Multi-Component ANE Architecture
ANEMLL (pronounced "animal") provides state-of-the-art Apple Neural Engine optimizations for large language models. Our crate provides comprehensive support for ANEMLL's multi-component architecture.
### Why ANEMLL?
ANEMLL converts large models into multiple specialized components that maximize Apple Neural Engine utilization:
- 🚀 True ANE Acceleration: Models specifically optimized for Apple's Neural Engine
- 💾 Memory Efficiency: Component splitting reduces peak memory usage
- ⚡ Optimized Performance: Custom quantization (LUT4/LUT6) for ANE constraints
- 🔧 Production Ready: Used in real iOS/macOS apps via TestFlight
### Supported Models
| Model | Size | Context | Components | Status |
|---|---|---|---|---|
| Qwen 3 | 0.5B-7B | 512-32K | 3-part split | ✅ Fully Supported |
| Qwen 2.5 | 0.5B-7B | 512-32K | 3-part split | ✅ Fully Supported |
### Multi-Component Architecture

ANEMLL splits models into specialized components for optimal ANE performance:

```text
Input Tokens → [Embeddings] → [FFN Transformer] → [LM Head] → Output Logits
                    ↓                 ↓                ↓
              embeddings.       FFN_chunk_01.      lm_head.
               mlmodelc           mlmodelc         mlmodelc
```
Component Details:

1. **Embeddings Model** (`qwen_embeddings.mlmodelc`)
   - Converts token IDs to hidden representations
   - Output: `[batch, seq_len, hidden_dim]`

2. **FFN Model** (`qwen_FFN_PF_lut8_chunk_01of01.mlmodelc`)
   - Transformer feed-forward network with attention
   - Includes causal masking for autoregressive generation
   - Output: `[batch, seq_len, hidden_dim]`

3. **LM Head Model** (`qwen_lm_head_lut8.mlmodelc`)
   - Final linear layer producing vocabulary logits
   - Input: last-position hidden state `[batch, 1, hidden_dim]`
   - Output: `[batch, 1, vocab_size]`
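To make the data flow concrete, here is a hedged sketch of chaining the three components with Candle tensors. The `embeddings`, `ffn`, and `lm_head` handles and their `forward` calls follow the generic loading pattern from the Quick Start; all names here are placeholders, not the crate's exact API.

```rust
// Illustrative pipeline fragment; `input_ids` and `causal_mask` are assumed
// to be prepared Candle tensors.
// 1. Token IDs -> hidden representations
let hidden = embeddings.forward(&[&input_ids])?;      // [batch, seq_len, hidden_dim]
// 2. Transformer FFN with causal masking
let hidden = ffn.forward(&[&hidden, &causal_mask])?;  // [batch, seq_len, hidden_dim]
// 3. Last-position hidden state -> vocabulary logits
let last = hidden.narrow(1, hidden.dim(1)? - 1, 1)?;  // [batch, 1, hidden_dim]
let logits = lm_head.forward(&[&last])?;              // [batch, 1, vocab_size]
```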
### Quick Start with ANEMLL Models

```rust
use candle_coreml::UnifiedModelLoader;

// Load complete multi-component model with automatic setup
let loader = UnifiedModelLoader::new()?;
let mut model = loader.load_model("anemll/anemll-Qwen-Qwen3-0.6B-ctx512_0.3.4")?;

// Generate text using the new API methods: (prompt, max_tokens, temperature)
let response = model.complete_text("Explain the Apple Neural Engine.", 128, 0.7)?;

// Or use the more advanced generation method: (prompt, max_tokens, temp, top_k)
// (argument values are illustrative)
let tokens = model.generate_tokens_topk_temp("Hello", 64, 0.8, 50)?;
```
### Manual Component Loading

For advanced use cases, load components individually:

```rust
use candle_coreml::{CoreMLModel, ModelConfig, QwenModel};

// (Paths and argument lists in this snippet are illustrative.)

// Option 1: Load from directory with auto-generated config
let model_dir = "/path/to/downloaded/model";
let mut model = QwenModel::load_from_directory(model_dir)?;

// Option 2: Manual component loading with a ModelConfig
let config = ModelConfig::load_from_file("model_config.json")?;
let embeddings = CoreMLModel::load_from_file("embeddings.mlpackage", &config)?;
let ffn_prefill = CoreMLModel::load_from_file("ffn_prefill.mlpackage", &config)?;
let ffn_infer = CoreMLModel::load_from_file("ffn_infer.mlpackage", &config)?;
let lm_head = CoreMLModel::load_from_file("lm_head.mlpackage", &config)?;

// Use the high-level API for text generation
let response = model.complete_text("Hello", 64, 0.7)?;
```
Examples and Demos
# Recommended API demonstration
# Multi-component chat with Qwen models (downloads ~2GB models)
# Test thinking behavior and quality
# Performance comparisons
### Model Download and Setup

ANEMLL models are hosted on HuggingFace and downloaded automatically:

```bash
# Models are cached in ~/.cache/candle-coreml/
# First run downloads all components (~2GB for Qwen 0.6B)

# Available models:
# - anemll/anemll-Qwen-Qwen3-0.6B-ctx512_0.3.4
# - anemll/anemll-Qwen-Qwen2.5-0.5B-ctx512_0.3.4
# - More models available at: https://huggingface.co/anemll
```
### Performance Characteristics
- Context Length: Optimized for 512-2048 tokens (up to 32K supported)
- Quantization: LUT4/LUT6 optimizations for ANE constraints
- Memory: Component splitting reduces peak usage vs monolithic models
- Speed: True ANE acceleration vs GPU/CPU fallback
### Reference Implementation
ANEMLL provides reference apps showing production usage:
- TestFlight App: Join Beta
- iOS/macOS Support: Complete mobile deployment examples
- GitHub: ANEMLL Repository
### Integration with candle-coreml
Our crate provides the missing piece for Rust developers wanting to use ANEMLL's optimized models:
- ✅ Explicit component selection via file paths (no globbing/discovery)
- ✅ Pipeline orchestration with proper data flow
- ✅ Causal masking for transformer architectures
- ✅ HuggingFace integration for seamless model access
- ✅ Streaming generation with multi-component coordination
This makes ANEMLL's advanced ANE optimizations accessible to the entire Candle ecosystem.
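The causal masking mentioned above is the standard lower-triangular attention mask. As an illustrative sketch (not the crate's internal code), building one with candle-core in the `[1, 1, seq_len, context_len]` layout used by the configs later in this README could look like:

```rust
use candle_core::{Device, Result, Tensor};

// 0.0 where attention is allowed, -inf where future positions must be hidden.
fn build_causal_mask(seq_len: usize, context_len: usize, dev: &Device) -> Result<Tensor> {
    let mut data = vec![0f32; seq_len * context_len];
    for q in 0..seq_len {
        for k in (q + 1)..context_len {
            data[q * context_len + k] = f32::NEG_INFINITY;
        }
    }
    Tensor::from_vec(data, (1, 1, seq_len, context_len), dev)
}
```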
📚 Complete ANEMLL Integration Guide - Comprehensive documentation covering architecture, usage patterns, and production deployment.
## Architecture
This crate follows the inference engine pattern rather than treating CoreML as a device backend:
- Accepts: CPU and Metal tensors via Candle's unified memory
- Rejects: CUDA tensors with clear error messages
- Output: Tensors on the same device as input
- Conversion: Automatic F32/I64→I32 tensor conversion as needed
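A minimal sketch of that contract, reusing the generic model loading from the Quick Start (the commented-out assertion is illustrative, not a documented API guarantee):

```rust
use candle_core::{DType, Device, Tensor};

// CPU and Metal tensors are accepted; output stays on the input's device.
let device = Device::Cpu; // or Device::new_metal(0)?
let input = Tensor::zeros((1, 128), DType::F32, &device)?;
// let output = model.forward(&[&input])?;
// assert_eq!(output.device().location(), input.device().location());
// A CUDA-backed tensor would instead return a clear unsupported-device error.
```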
## Comparison with coreml-rs
| Feature | coreml-rs | candle-coreml |
|---|---|---|
| Bindings | swift-bridge | objc2 direct |
| Purpose | Generic CoreML | Candle tensor integration |
| API | Raw CoreML interface | Candle patterns (T5-like) |
| Error Handling | Generic | Candle error types |
| Device Support | Generic | CPU/Metal validation |
## 🔥 Complete Worked Example
👉 BERT CoreML Inference - Step-by-Step Guide
A comprehensive tutorial covering:
- Model download and compilation
- End-to-end inference pipeline
- Performance optimization tips
- ANE vs GPU vs CPU comparison
- Production deployment guidance
## ⚠️ When to Use CoreML (Important!)

### ✅ Use CoreML When:

- You have CoreML-specific models (`.mlpackage`/`.mlmodelc` files)
- You want Apple Neural Engine (ANE) acceleration for supported models
- You need Apple's automatic hardware selection (ANE → GPU → CPU)
- You're deploying specifically on Apple platforms
### ❌ Don't Use CoreML When:
- You can achieve the same performance with Metal/CPU backends
- Your model isn't optimized for Apple hardware
- You need cross-platform compatibility
- You're just starting with Candle (try CPU/Metal first)
### 🧠 Apple Neural Engine (ANE) Reality Check
Not all models run on the ANE! Apple's Neural Engine has strict requirements:
- Supported Operations: Only a subset of ML operations are ANE-optimized
- Model Architecture: Models must be specifically designed/optimized for ANE
- Data Types: Primarily supports certain quantized formats
- Model Size: Large models may fall back to GPU/CPU
Recommendation: Use Apple's pre-optimized models (like their optimized BERT) for guaranteed ANE acceleration, or stick with Metal/CPU backends for general use.
### 📊 Performance Hierarchy

```text
ANE (fastest, most efficient) > GPU/Metal (fast) > CPU (most compatible)
```
Apple automatically chooses the best available backend, but your model must be ANE-compatible to benefit from the fastest option.
## 🚀 Modern API: UnifiedModelLoader

The recommended approach for loading and using models is through the `UnifiedModelLoader`, which handles:
- Automatic HuggingFace Downloads: Models are downloaded and cached automatically
- Config Generation: Model configurations are generated from the downloaded files
- Validation: Comprehensive model validation and error checking
- Caching: Intelligent caching of both models and configurations
### UnifiedModelLoader Examples

```rust
use candle_coreml::UnifiedModelLoader;

// Create loader (initializes cache and config generation)
let loader = UnifiedModelLoader::new()?;

// Load any ANEMLL model from HuggingFace
let mut model = loader.load_model("anemll/anemll-Qwen-Qwen3-0.6B-ctx512_0.3.4")?;

// Available generation methods (argument values are illustrative):

// 1. High-level text completion (recommended)
let response = model.complete_text("Tell me about Rust.", 128, 0.7)?;

// 2. Advanced token generation with top-k sampling
let tokens = model.generate_tokens_topk_temp("Once upon a time", 64, 0.8, 50)?;

// 3. Single token prediction
let next_token = model.forward_text("The capital of France is")?;

// 4. Text generation with parameters
let result = model.generate_text_with_params("Explain AI.", 128, 0.9)?;
```
### QwenModel API Reference

The `QwenModel` type provides several methods for text generation:

| Method | Description | Use Case |
|---|---|---|
| `complete_text(prompt, max_tokens, temperature)` | **Recommended** - High-level text completion | General text generation |
| `generate_tokens_topk_temp(prompt, max_tokens, temp, top_k)` | Advanced generation with top-k sampling | Fine-tuned control over generation |
| `forward_text(text)` | Single token prediction | Next token prediction, embeddings |
| `generate_text_with_params(prompt, max_tokens, temperature)` | Text generation with custom parameters | Custom generation logic |
| `generate_tokens()` | **Deprecated** - Use `generate_tokens_topk_temp()` instead | Legacy compatibility only |
### Cache Management

Models and configs are cached automatically:

```rust
use candle_coreml::CacheManager;

// Models cached in:  ~/.cache/candle-coreml/models/
// Configs cached in: ~/.cache/candle-coreml/configs/

// Clear caches if needed
let cache = CacheManager::new()?;
// cache.clear_model_cache()?; // if needed
```
## Model Configuration System (Advanced Usage)

Complex multi-component language models (e.g. ANEMLL Qwen variants, custom fine-tunes) are described declaratively using a `ModelConfig` JSON file. This removes hardcoded shapes and enables:

- Explicit component file paths (no globbing)
- Per-component input/output tensor shapes & dtypes
- Multipart logits combination (auto-detected part count)
- Split vs unified FFN execution (`ffn_execution = "split" | "unified"`)
- Automatic detection of prefill mode (batched vs sequential single-token)
### Minimal Example

```json
{
  "model_info": { "model_type": "qwen", "path": "/path/to/model" },
  "shapes": { "batch_size": 64, "context_length": 256, "hidden_size": 1024, "vocab_size": 151669 },
  "components": {
    "embeddings": { "file_path": "embeddings.mlpackage", "inputs": { "input_ids": {"shape": [1,64], "data_type": "INT32", "name": "input_ids" } }, "outputs": { "hidden_states": {"shape": [1,64,1024], "data_type": "FLOAT16", "name": "hidden_states" } }, "functions": [] },
    "ffn_prefill": { "file_path": "ffn_prefill.mlpackage", "inputs": { "hidden_states": {"shape": [1,64,1024], "data_type": "FLOAT16","name":"hidden_states"}, "position_ids": {"shape":[64],"data_type":"INT32","name":"position_ids"}, "causal_mask": {"shape":[1,1,64,256],"data_type":"FLOAT16","name":"causal_mask"}, "current_pos": {"shape":[1],"data_type":"INT32","name":"current_pos"} }, "outputs": { "output_hidden_states": {"shape":[1,1,1024],"data_type":"FLOAT16","name":"output_hidden_states"} }, "functions":["prefill"] },
    "ffn_infer": { "file_path": "ffn_infer.mlpackage", "inputs": { "hidden_states": {"shape": [1,1,1024], "data_type": "FLOAT16","name":"hidden_states"}, "position_ids": {"shape":[1],"data_type":"INT32","name":"position_ids"}, "causal_mask": {"shape":[1,1,1,256],"data_type":"FLOAT16","name":"causal_mask"}, "current_pos": {"shape":[1],"data_type":"INT32","name":"current_pos"} }, "outputs": { "output_hidden_states": {"shape":[1,1,1024],"data_type":"FLOAT16","name":"output_hidden_states"} }, "functions":["infer"] },
    "lm_head": { "file_path": "lm_head.mlpackage", "inputs": { "hidden_states": {"shape":[1,1,1024],"data_type":"FLOAT16","name":"hidden_states" } }, "outputs": { "logits1": {"shape":[1,1,9480],"data_type":"FLOAT16","name":"logits1"}, "logits2": {"shape":[1,1,9479],"data_type":"FLOAT16","name":"logits2"} }, "functions": [] }
  },
  "ffn_execution": "split"
}
```
### Execution Modes

| Mode | When | Behavior |
|---|---|---|
| `unified` | Single CoreML package exposes `prefill` & `infer` functions | Shared file, one state, batched prefill then token-by-token infer |
| `split` | Separate `ffn_prefill` & `ffn_infer` model files | Distinct model files; state created from the prefill model and reused for infer |

If `ffn_execution` is omitted, the system infers `split` when `ffn_prefill.file_path != ffn_infer.file_path`.
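A minimal sketch of that fallback rule (illustrative, not the crate's actual code):

```rust
// Infer the FFN execution mode when `ffn_execution` is omitted from the config.
fn infer_ffn_execution(prefill_path: &str, infer_path: &str) -> &'static str {
    if prefill_path != infer_path { "split" } else { "unified" }
}

assert_eq!(infer_ffn_execution("ffn_prefill.mlpackage", "ffn_infer.mlpackage"), "split");
assert_eq!(infer_ffn_execution("ffn.mlpackage", "ffn.mlpackage"), "unified");
```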
### Prefill Modes

Prefill can be either batch (process the full sequence in one call) or sequential (one token at a time). Sequential mode is auto-enabled when the `ffn_prefill.hidden_states` shape has `seq_len == 1` (e.g. `[1,1,H]`), indicating a single-token CoreML prefill variant. This matches certain fine-tuned or distilled models exported with single-token kernels.
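A sketch of that detection rule, assuming the `[batch, seq_len, hidden]` layout described above (illustrative, not the crate's actual code):

```rust
// Sequential prefill is used when the prefill kernel accepts one token at a time.
fn is_sequential_prefill(prefill_hidden_states_shape: &[usize]) -> bool {
    matches!(prefill_hidden_states_shape, [_, 1, _])
}

assert!(is_sequential_prefill(&[1, 1, 1024]));   // single-token variant -> sequential
assert!(!is_sequential_prefill(&[1, 64, 1024])); // batched prefill
```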
### Multipart Logits

The LM head may output `logits1..logitsN`. The library detects the part count dynamically and stitches the parts into a contiguous logits tensor. No manual configuration is needed beyond listing the outputs.
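Conceptually the stitching is a concatenation along the vocabulary axis; a hedged sketch with candle-core (the crate's internal implementation may differ):

```rust
use candle_core::{Result, Tensor};

// Stitch logits1..logitsN (each [1, 1, part_vocab]) into one [1, 1, vocab_size]
// tensor by concatenating along the last axis.
fn combine_logits(parts: &[Tensor]) -> Result<Tensor> {
    Tensor::cat(parts, 2)
}
```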
### Validation

`ModelConfig::validate()` checks basic consistency; `validate_internal_wiring()` ensures adjacent component tensor shapes align (e.g. embeddings → ffn_prefill). Warnings are logged, but loading proceeds to aid iterative development.
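A hypothetical pre-flight check based on the description above (method names are from this section; exact signatures are assumptions):

```rust
use candle_coreml::ModelConfig;

let config = ModelConfig::load_from_file("configs/anemll-qwen3-0.6b.json")?;
config.validate()?;                // basic consistency checks
config.validate_internal_wiring(); // logs shape-alignment warnings; loading proceeds
```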
### Custom Model Guide

See CUSTOM_MODEL_GUIDE.md for deep-dive shape discovery tooling and advanced customization.

### Migrating From Globs

Legacy filename-pattern discovery has been removed. Always set `file_path` for each component; this avoids ambiguity and improves reproducibility.
### Troubleshooting

| Symptom | Likely Cause | Fix |
|---|---|---|
| `MultiArray shape (64) does not match shape (1)` | Prefill/infer mismatch between batch and single-token tensors | Ensure correct `ffn_prefill` / `ffn_infer` shapes, or switch to sequential mode by setting the prefill `hidden_states` shape to `[1,1,H]` |
| Missing logits concatenation | Outputs not named `logits*` | Rename outputs or manually post-process |
| Incorrect token length padding | Embeddings `input_ids` shape mismatch | Align `embeddings.inputs.input_ids.shape` with the expected max prefill length |
| LM head shape mismatch | `output_hidden_states` vs `lm_head.hidden_states` differ | Regenerate config with the discovery tool; fix shapes |

For detailed examples, see the `configs/` directory (e.g. `anemll-qwen3-0.6b.json`).
## Examples

The `examples/` directory demonstrates various usage patterns:
### 🌟 Recommended Starting Points

- `recommended_api_demo.rs` - **START HERE** - Shows the modern UnifiedModelLoader API
- `qwen_chat.rs` - Interactive chat using ANEMLL Qwen models
- `proper_quality_test.rs` - Model quality assessment

### 🔧 Advanced Examples

- `compare_loading_approaches.rs` - Compare old vs new loading methods
- `test_thinking_behavior.rs` - Test model reasoning capabilities
- `debug_token_mismatch.rs` - Debugging token generation issues

### 📚 Documentation Examples

- `WORKED_EXAMPLE.md` - Complete BERT inference tutorial (legacy)
- `qwen/README.md` - Qwen model documentation
Running Examples
# Start with the recommended API
# Interactive Qwen chat (downloads ~2GB on first run)
# Test model quality
# Compare loading approaches
### ✨ Key Features Demonstrated
- 🚀 UnifiedModelLoader: Automatic downloading, config generation, and caching
- 🧠 Multi-Component Architecture: ANEMLL's specialized model components
- ⚡ ANE Acceleration: True Apple Neural Engine optimization
- 🔧 Advanced Generation: Top-k sampling, temperature control, quality assessment
- 📦 HuggingFace Integration: Seamless model access from HuggingFace Hub
## Platform Support
- macOS: Full CoreML runtime support
- iOS: Full CoreML runtime support (when targeting iOS)
- Other platforms: Builds successfully, runtime features disabled
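A common way to honor this matrix in application code is to gate CoreML-dependent paths behind target checks. A hedged sketch (the `run_inference` function and the `anyhow` dependency are illustrative, not part of this crate):

```rust
// Compiles everywhere; the CoreML path only exists on Apple targets.
#[cfg(any(target_os = "macos", target_os = "ios"))]
fn run_inference() -> anyhow::Result<()> {
    // ... load a CoreML model with candle-coreml and run it ...
    Ok(())
}

#[cfg(not(any(target_os = "macos", target_os = "ios")))]
fn run_inference() -> anyhow::Result<()> {
    anyhow::bail!("CoreML runtime features are only available on macOS/iOS")
}
```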
## Contributing
This is an independent project providing CoreML integration for the Candle ecosystem. Contributions welcome!
## License
Licensed under either of Apache License, Version 2.0 or MIT license at your option.