# candle-coreml

CoreML inference engine for Candle tensors - providing Apple CoreML integration for Rust machine learning applications.

## Overview

`candle-coreml` is a standalone crate that bridges [Candle](https://github.com/huggingface/candle) tensors with Apple's CoreML framework, enabling efficient on-device inference on macOS and iOS. Unlike generic CoreML bindings, this crate provides:

- **Candle-specific integration** - Direct tensor conversion and device validation
- **Inference engine approach** - CoreML as an inference backend, not a device type
- **Apple Silicon optimization** - Leverages unified memory architecture
- **Production ready** - Comprehensive error handling and testing

## Key Features

- ✅ **Direct Candle tensor support** - CPU and Metal tensor inference
- ✅ **Device validation** - Automatic device compatibility checking  
- ✅ **Unified memory** - Efficient tensor conversion via Apple Silicon's unified memory architecture
- ✅ **Error handling** - Candle-compatible error types and messages
- ✅ **Comprehensive testing** - Unit tests, integration tests, and real model testing
- ✅ **Cross-platform builds** - Compiles on all platforms, runs on macOS and iOS

## Quick Start

Add to your `Cargo.toml`:

```toml
[dependencies]
candle-coreml = "0.3.1"
candle-core = "0.9.1"
```

Basic usage with UnifiedModelLoader (Recommended):

```rust
use candle_coreml::UnifiedModelLoader;

// Load model directly from HuggingFace with automatic setup
let loader = UnifiedModelLoader::new()?;
let mut model = loader.load_model("anemll/anemll-Qwen-Qwen3-0.6B-LUT888-ctx512_0.3.4")?;

// Generate text using the new API
let response = model.complete_text(
    "Hello, how are you?",
    50,   // max tokens  
    0.8,  // temperature
)?;

println!("Response: {}", response);
```

Manual CoreML model loading:

```rust
use candle_core::{DType, Device, Tensor};
use candle_coreml::{CoreMLModel, ModelConfig};

// Load model config (typically auto-generated)
let config = ModelConfig::load_from_file("model_config.json")?;

// Load the CoreML model
let model = CoreMLModel::load_from_file("model.mlpackage", &config)?;

// Create input tensor
let input = Tensor::zeros((1, 128), DType::I64, &Device::Cpu)?;

// Run inference
let output = model.forward(&[input])?;
```

## 🔥 ANEMLL Models: Multi-Component ANE Architecture

**[ANEMLL](https://github.com/Anemll/Anemll)** (pronounced "animal") provides state-of-the-art Apple Neural Engine optimizations for large language models. Our crate provides comprehensive support for ANEMLL's multi-component architecture.

### Why ANEMLL?

ANEMLL converts large models into **multiple specialized components** that maximize Apple Neural Engine utilization:

- **🚀 True ANE Acceleration**: Models specifically optimized for Apple's Neural Engine
- **💾 Memory Efficiency**: Component splitting reduces peak memory usage
- **⚡ Optimized Performance**: Custom quantization (LUT4/LUT6) for ANE constraints
- **🔧 Production Ready**: Used in real iOS/macOS apps via TestFlight

### Supported Models

| Model | Size | Context | Components | Status |
|-------|------|---------|------------|--------|
| Qwen 3 | 0.5B-7B | 512-32K | 3-part split | ✅ Fully Supported |
| Qwen 2.5 | 0.5B-7B | 512-32K | 3-part split | ✅ Fully Supported |

### Multi-Component Architecture

ANEMLL splits models into specialized components for optimal ANE performance:

```
Input Tokens → [Embeddings] → [FFN Transformer] → [LM Head] → Output Logits
               ↓              ↓                   ↓
               embeddings.    FFN_chunk_01.      lm_head.
               mlmodelc       mlmodelc           mlmodelc
```

#### Component Details:
1. **Embeddings Model** (`qwen_embeddings.mlmodelc`)
   - Converts token IDs to hidden representations
   - Output: `[batch, seq_len, hidden_dim]`

2. **FFN Model** (`qwen_FFN_PF_lut8_chunk_01of01.mlmodelc`) 
   - Transformer feed-forward network with attention
   - Includes causal masking for autoregressive generation
   - Output: `[batch, seq_len, hidden_dim]`

3. **LM Head Model** (`qwen_lm_head_lut8.mlmodelc`)
   - Final linear layer producing vocabulary logits
   - Input: Last position hidden state `[batch, 1, hidden_dim]`
   - Output: `[batch, 1, vocab_size]`
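
To make the data flow concrete, here is a minimal sketch of driving the three stages by hand with `CoreMLModel::forward`. The file names follow the layout above; the extra FFN inputs (position IDs, causal mask, current position) are elided for brevity, and in practice `QwenModel` orchestrates all of this for you:

```rust
use candle_core::{DType, Device, Tensor};
use candle_coreml::{CoreMLModel, ModelConfig};

let config = ModelConfig::load_from_file("model_config.json")?;
let embeddings = CoreMLModel::load_from_file("qwen_embeddings.mlmodelc", &config)?;
let ffn = CoreMLModel::load_from_file("qwen_FFN_PF_lut8_chunk_01of01.mlmodelc", &config)?;
let lm_head = CoreMLModel::load_from_file("qwen_lm_head_lut8.mlmodelc", &config)?;

// Stage 1: token IDs -> hidden states [batch, seq_len, hidden_dim]
let token_ids = Tensor::zeros((1, 64), DType::I64, &Device::Cpu)?;
let hidden = embeddings.forward(&[token_ids])?;

// Stage 2: transformer FFN with attention and causal masking
// (schematic: real models also take position_ids, causal_mask,
// and current_pos as inputs)
let hidden = ffn.forward(&[hidden])?;

// Stage 3: last-position hidden state -> logits [batch, 1, vocab_size]
let logits = lm_head.forward(&[hidden])?;
```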

### Quick Start with ANEMLL Models

```rust
use candle_coreml::UnifiedModelLoader;

// Load complete multi-component model with automatic setup
let loader = UnifiedModelLoader::new()?;
let mut model = loader.load_model("anemll/anemll-Qwen-Qwen3-0.6B-LUT888-ctx512_0.3.4")?;

// Generate text using the new API methods
let response = model.complete_text(
    "Hello, how are you?",
    50,   // max tokens
    0.8,  // temperature  
)?;

// Or use the more advanced generation method
let tokens = model.generate_tokens_topk_temp(
    "Hello, how are you?",
    50,   // max tokens
    0.8,  // temperature
    Some(50), // top_k
)?;
```

### Manual Component Loading

For advanced use cases, load components individually:

```rust
use candle_coreml::{CoreMLModel, ModelConfig, QwenModel};

// Option 1: Load from a directory with an auto-generated config
let model_dir = "/path/to/downloaded/model";
let mut model = QwenModel::load_from_directory(&model_dir, None)?;

// High-level text generation on the directory-loaded model
let response = model.complete_text("Hello!", 20, 0.7)?;

// Option 2: Load components individually with an explicit ModelConfig
// (for when you orchestrate the pipeline yourself)
let config = ModelConfig::load_from_file("model_config.json")?;
let embeddings = CoreMLModel::load_from_file("embeddings.mlpackage", &config)?;
let ffn_prefill = CoreMLModel::load_from_file("ffn_prefill.mlpackage", &config)?;
let ffn_infer = CoreMLModel::load_from_file("ffn_infer.mlpackage", &config)?;
let lm_head = CoreMLModel::load_from_file("lm_head.mlpackage", &config)?;
```

### Examples and Demos

```bash
# Recommended API demonstration
cargo run --example recommended_api_demo

# Multi-component chat with Qwen models (downloads ~2GB models)  
cargo run --example qwen_chat

# Test thinking behavior and quality
cargo run --example test_thinking_behavior
cargo run --example proper_quality_test

# Performance comparisons
cargo run --example compare_loading_approaches
```

### Model Download and Setup

ANEMLL models are hosted on HuggingFace and downloaded automatically:

```bash
# Models are cached in ~/.cache/candle-coreml/
# First run downloads all components (~2GB for Qwen 0.6B)

# Available models:
# - anemll/anemll-Qwen-Qwen3-0.6B-ctx512_0.3.4
# - anemll/anemll-Qwen-Qwen2.5-0.5B-ctx512_0.3.4
# - More models available at: https://huggingface.co/anemll
```

### Performance Characteristics

- **Context Length**: Optimized for 512-2048 tokens (up to 32K supported)
- **Quantization**: LUT4/LUT6 optimizations for ANE constraints  
- **Memory**: Component splitting reduces peak usage vs monolithic models
- **Speed**: True ANE acceleration vs GPU/CPU fallback

### Reference Implementation

ANEMLL provides reference apps showing production usage:
- **TestFlight App**: [Join Beta](https://testflight.apple.com/join/jrQq1D1C)
- **iOS/macOS Support**: Complete mobile deployment examples
- **GitHub**: [ANEMLL Repository](https://github.com/Anemll/Anemll)

### Integration with candle-coreml

Our crate provides the missing piece for Rust developers wanting to use ANEMLL's optimized models:

- ✅ **Explicit component selection via file paths (no globbing/discovery)**
- ✅ **Pipeline orchestration with proper data flow**  
- ✅ **Causal masking for transformer architectures**
- ✅ **HuggingFace integration for seamless model access**
- ✅ **Streaming generation with multi-component coordination**

This makes ANEMLL's advanced ANE optimizations accessible to the entire Candle ecosystem.

**📚 [Complete ANEMLL Integration Guide](ANEMLL_GUIDE.md)** - Comprehensive documentation covering architecture, usage patterns, and production deployment.

## Architecture

This crate follows the **inference engine** pattern rather than treating CoreML as a device backend:

- **Accepts**: CPU and Metal tensors via Candle's unified memory
- **Rejects**: CUDA tensors with clear error messages  
- **Output**: Tensors on the same device as input
- **Conversion**: Automatic F32/I64→I32 tensor conversion as needed
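
A short sketch of this contract, assuming a `model` loaded as in the Quick Start above:

```rust
use candle_core::{DType, Device, Tensor};

// CPU tensors pass device validation; the output stays on the input's device.
let cpu_input = Tensor::zeros((1, 128), DType::I64, &Device::Cpu)?; // I64 is converted to I32 internally
let output = model.forward(&[cpu_input])?;
assert!(matches!(output.device(), Device::Cpu));

// A CUDA tensor would be rejected up front with a Candle-compatible error.
```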

## Comparison with coreml-rs

| Feature | coreml-rs | candle-coreml |
|---------|-----------|---------------|
| Bindings | swift-bridge | objc2 direct |
| Purpose | Generic CoreML | Candle tensor integration |
| API | Raw CoreML interface | Candle patterns (T5-like) |
| Error Handling | Generic | Candle error types |
| Device Support | Generic | CPU/Metal validation |

## 🔥 Complete Worked Example

**👉 [BERT CoreML Inference - Step-by-Step Guide](examples/WORKED_EXAMPLE.md)**

A comprehensive tutorial covering:
- Model download and compilation
- End-to-end inference pipeline  
- Performance optimization tips
- ANE vs GPU vs CPU comparison
- Production deployment guidance

## ⚠️ When to Use CoreML (Important!)

### ✅ Use CoreML When:
- You have **CoreML-specific models** (`.mlpackage`/`.mlmodelc` files)
- You want **Apple Neural Engine (ANE)** acceleration for supported models
- You need Apple's automatic **hardware selection** (ANE → GPU → CPU)
- You're deploying specifically on **Apple platforms**

### ❌ Don't Use CoreML When:
- You can achieve the same performance with **Metal/CPU backends**
- Your model **isn't optimized** for Apple hardware  
- You need **cross-platform compatibility**
- You're **just starting** with Candle (try CPU/Metal first)

### 🧠 Apple Neural Engine (ANE) Reality Check

**Not all models run on the ANE!** Apple's Neural Engine has strict requirements:

- **Supported Operations**: Only a subset of ML operations are ANE-optimized
- **Model Architecture**: Models must be specifically designed/optimized for ANE  
- **Data Types**: Primarily supports certain quantized formats
- **Model Size**: Large models may fall back to GPU/CPU

**Recommendation**: Use Apple's pre-optimized models (like their optimized BERT) for guaranteed ANE acceleration, or stick with Metal/CPU backends for general use.

### 📊 Performance Hierarchy

```
ANE (fastest, most efficient) > GPU/Metal (fast) > CPU (most compatible)
```

Apple automatically chooses the best available backend, but your model must be ANE-compatible to benefit from the fastest option.

## 🚀 Modern API: UnifiedModelLoader

The recommended approach for loading and using models is through the `UnifiedModelLoader`, which handles:

- **Automatic HuggingFace Downloads**: Models are downloaded and cached automatically  
- **Config Generation**: Model configurations are generated from the downloaded files
- **Validation**: Comprehensive model validation and error checking
- **Caching**: Intelligent caching of both models and configurations

### UnifiedModelLoader Examples

```rust
use candle_coreml::UnifiedModelLoader;

// Create loader (initializes cache and config generation)
let loader = UnifiedModelLoader::new()?;

// Load any ANEMLL model from HuggingFace
let mut model = loader.load_model("anemll/anemll-Qwen-Qwen3-0.6B-LUT888-ctx512_0.3.4")?;

// Available generation methods:
// 1. High-level text completion (recommended)
let response = model.complete_text("Hello, world!", 50, 0.8)?;

// 2. Advanced token generation with top-k sampling  
let tokens = model.generate_tokens_topk_temp("Hello!", 20, 0.7, Some(40))?;

// 3. Single token prediction
let next_token = model.forward_text("Hello")?;

// 4. Text generation with parameters
let result = model.generate_text_with_params("Hello!", 30, 0.9)?;
```

### QwenModel API Reference

The `QwenModel` provides several methods for text generation:

| Method | Description | Use Case |
|--------|-------------|-----------|
| `complete_text(prompt, max_tokens, temperature)` | **Recommended** - High-level text completion | General text generation |
| `generate_tokens_topk_temp(prompt, max_tokens, temp, top_k)` | Advanced generation with top-k sampling | Fine-tuned control over generation |
| `forward_text(text)` | Single token prediction | Next token prediction, embeddings |  
| `generate_text_with_params(prompt, max_tokens, temperature)` | Text generation with custom parameters | Custom generation logic |
| ~~`generate_tokens()`~~ | **Deprecated** - Use `generate_tokens_topk_temp()` instead | Legacy compatibility only |

### Cache Management

Models and configs are cached automatically:

```rust
// Models cached in: ~/.cache/candle-coreml/models/  
// Configs cached in: ~/.cache/candle-coreml/configs/

// Clear caches if needed
use candle_coreml::CacheManager;
let cache = CacheManager::new()?;
// cache.clear_model_cache()?; // if needed
```

## Model Configuration System (Advanced Usage)

Complex multi-component language models (e.g. ANEMLL Qwen variants, custom fine-tunes) are described declaratively using a `ModelConfig` JSON file. This removes hardcoded shapes and enables:

- Explicit component file paths (no globbing)
- Per-component input/output tensor shapes & dtypes
- Multipart logits combination (auto-detected part count)
- Split vs unified FFN execution (`ffn_execution` = `split` | `unified`)
- Automatic detection of prefill mode (batched vs sequential single-token)

### Minimal Example

```jsonc
{
    "model_info": { "model_type": "qwen", "path": "/path/to/model" },
    "shapes": {
        "batch_size": 64,
        "context_length": 256,
        "hidden_size": 1024,
        "vocab_size": 151669
    },
    "components": {
        "embeddings": {
            "file_path": "embeddings.mlpackage",
            "inputs": {
                "input_ids": { "shape": [1, 64], "data_type": "INT32", "name": "input_ids" }
            },
            "outputs": {
                "hidden_states": { "shape": [1, 64, 1024], "data_type": "FLOAT16", "name": "hidden_states" }
            },
            "functions": []
        },
        "ffn_prefill": {
            "file_path": "ffn_prefill.mlpackage",
            "inputs": {
                "hidden_states": { "shape": [1, 64, 1024], "data_type": "FLOAT16", "name": "hidden_states" },
                "position_ids": { "shape": [64], "data_type": "INT32", "name": "position_ids" },
                "causal_mask": { "shape": [1, 1, 64, 256], "data_type": "FLOAT16", "name": "causal_mask" },
                "current_pos": { "shape": [1], "data_type": "INT32", "name": "current_pos" }
            },
            "outputs": {
                "output_hidden_states": { "shape": [1, 1, 1024], "data_type": "FLOAT16", "name": "output_hidden_states" }
            },
            "functions": ["prefill"]
        },
        "ffn_infer": {
            "file_path": "ffn_infer.mlpackage",
            "inputs": {
                "hidden_states": { "shape": [1, 1, 1024], "data_type": "FLOAT16", "name": "hidden_states" },
                "position_ids": { "shape": [1], "data_type": "INT32", "name": "position_ids" },
                "causal_mask": { "shape": [1, 1, 1, 256], "data_type": "FLOAT16", "name": "causal_mask" },
                "current_pos": { "shape": [1], "data_type": "INT32", "name": "current_pos" }
            },
            "outputs": {
                "output_hidden_states": { "shape": [1, 1, 1024], "data_type": "FLOAT16", "name": "output_hidden_states" }
            },
            "functions": ["infer"]
        },
        "lm_head": {
            "file_path": "lm_head.mlpackage",
            "inputs": {
                "hidden_states": { "shape": [1, 1, 1024], "data_type": "FLOAT16", "name": "hidden_states" }
            },
            "outputs": {
                "logits1": { "shape": [1, 1, 9480], "data_type": "FLOAT16", "name": "logits1" },
                "logits2": { "shape": [1, 1, 9479], "data_type": "FLOAT16", "name": "logits2" }
            },
            "functions": []
        }
    },
    "ffn_execution": "split"
}
```

### Execution Modes

| Mode | When | Behavior |
|------|------|----------|
| `unified` | Single CoreML package exposes `prefill` & `infer` functions | Shared file, one state, batched prefill then token-by-token infer |
| `split` | Separate `ffn_prefill` & `ffn_infer` model files | Distinct model files; state created from prefill model and reused for infer |

If `ffn_execution` is omitted, the system infers `split` when `ffn_prefill.file_path != ffn_infer.file_path`.
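
In other words, the fallback reduces to a path comparison. A hedged illustration of the rule (plain strings here, not the crate's internal types):

```rust
// Illustration only: how the execution mode falls back when unset.
fn effective_mode(ffn_execution: Option<&str>, prefill_path: &str, infer_path: &str) -> String {
    match ffn_execution {
        Some(mode) => mode.to_string(),                       // explicit setting wins
        None if prefill_path != infer_path => "split".into(), // separate files => split
        None => "unified".into(),
    }
}

assert_eq!(effective_mode(None, "ffn_prefill.mlpackage", "ffn_infer.mlpackage"), "split");
```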

### Prefill Modes

Prefill can run in batch mode (processing the full sequence in one call) or sequential mode (one token at a time). Sequential mode is auto-enabled when the `ffn_prefill.hidden_states` shape has `seq_len == 1` (e.g. `[1,1,H]`), indicating a single-token CoreML prefill variant; this matches certain fine-tuned or distilled models exported with single-token kernels.
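
The detection comes down to inspecting the second (sequence) dimension; a self-contained illustration of the rule described above:

```rust
// Illustration: sequential prefill is selected when the prefill component's
// `hidden_states` input carries one token per call, i.e. [1, 1, H].
fn is_sequential_prefill(prefill_hidden_states_shape: &[usize]) -> bool {
    prefill_hidden_states_shape.get(1) == Some(&1)
}

assert!(is_sequential_prefill(&[1, 1, 1024]));   // single-token kernel => sequential
assert!(!is_sequential_prefill(&[1, 64, 1024])); // batched kernel => batch prefill
```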

### Multipart Logits

The LM head may output `logits1..logitsN`. The library detects the part count dynamically and stitches the pieces into one contiguous logits tensor; no manual configuration is needed beyond listing the outputs.
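
Conceptually, the stitching is a concatenation along the vocabulary dimension. A minimal Candle sketch of the idea (the library does this internally, so you never call it yourself):

```rust
use candle_core::Tensor;

// parts[i] has shape [1, 1, vocab_chunk_i]; the result is [1, 1, vocab_size].
fn stitch_logits(parts: &[Tensor]) -> candle_core::Result<Tensor> {
    Tensor::cat(parts, 2) // concatenate along the last (vocabulary) dimension
}
```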

### Validation

`ModelConfig::validate()` checks basic consistency; `validate_internal_wiring()` ensures adjacent component tensor shapes align (e.g. embeddings → ffn_prefill). Warnings are logged but loading proceeds to aid iterative development.
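
In code, validation is a pair of calls after loading the config; a sketch (the exact signatures are assumptions and may differ slightly from the published API):

```rust
use candle_coreml::ModelConfig;

let config = ModelConfig::load_from_file("model_config.json")?;
config.validate()?; // basic consistency checks

// Verifies adjacent components line up (e.g. embeddings -> ffn_prefill);
// mismatches are logged as warnings and loading proceeds.
config.validate_internal_wiring();
```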

### Custom Model Guide

See `CUSTOM_MODEL_GUIDE.md` for deep-dive shape discovery tooling and advanced customization.

### Migrating From Globs

Legacy filename pattern discovery has been removed. Always set `file_path` for each component—this avoids ambiguity and improves reproducibility.

### Troubleshooting

| Symptom | Likely Cause | Fix |
|---------|--------------|-----|
| `MultiArray shape (64) does not match shape (1)` | Batch vs single-token shape mismatch between the prefill and infer components | Ensure the `ffn_prefill` / `ffn_infer` shapes are correct, or switch to sequential mode by setting the prefill `hidden_states` shape to `[1,1,H]` |
| Missing logits concatenation | Outputs not named `logits*` | Rename outputs or manually post-process |
| Incorrect token length padding | Embeddings `input_ids` shape mismatch | Align `embeddings.inputs.input_ids.shape` with expected max prefill length |
| LM head shape mismatch | `output_hidden_states` vs `lm_head.hidden_states` differ | Regenerate config with discovery tool; fix shapes |

For detailed examples see `configs/` directory (e.g. `anemll-qwen3-0.6b.json`).


## Examples

The `examples/` directory demonstrates various usage patterns:

### 🌟 Recommended Starting Points

- **[recommended_api_demo.rs](examples/recommended_api_demo.rs)** - **START HERE** - Shows the modern UnifiedModelLoader API
- **[qwen_chat.rs](examples/qwen_chat.rs)** - Interactive chat using ANEMLL Qwen models  
- **[proper_quality_test.rs](examples/proper_quality_test.rs)** - Model quality assessment

### 🔧 Advanced Examples

- **[compare_loading_approaches.rs](examples/compare_loading_approaches.rs)** - Compare old vs new loading methods
- **[test_thinking_behavior.rs](examples/test_thinking_behavior.rs)** - Test model reasoning capabilities  
- **[debug_token_mismatch.rs](examples/debug_token_mismatch.rs)** - Debugging token generation issues

### 📚 Documentation Examples

- **[WORKED_EXAMPLE.md](examples/WORKED_EXAMPLE.md)** - Complete BERT inference tutorial (legacy)
- **[qwen/README.md](examples/qwen/README.md)** - Qwen model documentation

### Running Examples

```bash
# Start with the recommended API
cargo run --example recommended_api_demo

# Interactive Qwen chat (downloads ~2GB on first run)
cargo run --example qwen_chat

# Test model quality  
cargo run --example proper_quality_test

# Compare loading approaches
cargo run --example compare_loading_approaches
```

### ✨ Key Features Demonstrated

- **🚀 UnifiedModelLoader**: Automatic downloading, config generation, and caching
- **🧠 Multi-Component Architecture**: ANEMLL's specialized model components  
- **⚡ ANE Acceleration**: True Apple Neural Engine optimization
- **🔧 Advanced Generation**: Top-k sampling, temperature control, quality assessment
- **📦 HuggingFace Integration**: Seamless model access from HuggingFace Hub

## Platform Support

- **macOS**: Full CoreML runtime support
- **iOS**: Full CoreML runtime support (when targeting iOS)
- **Other platforms**: Builds successfully, runtime features disabled
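
For cross-platform applications, a common pattern is to gate the CoreML path in your own code; a sketch using standard `cfg` attributes (nothing crate-specific):

```rust
// Use CoreML on Apple platforms; fall back to a plain Candle backend elsewhere.
#[cfg(any(target_os = "macos", target_os = "ios"))]
fn preferred_backend() -> &'static str { "coreml" }

#[cfg(not(any(target_os = "macos", target_os = "ios")))]
fn preferred_backend() -> &'static str { "cpu" }
```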

## Contributing

This is an independent project providing CoreML integration for the Candle ecosystem. Contributions welcome!

## License

Licensed under either the Apache License, Version 2.0 or the MIT license, at your option.