# rlx-coreml
Apple **CoreML / Neural Engine (ANE)** backend for RLX.
It lowers an RLX IR graph to a CoreML **ML Program** (the MIL dialect),
serialises it into a `.mlpackage`, and runs it through `CoreML.framework`.
CoreML's planner then schedules each op across the CPU, GPU, and Neural
Engine.
## Layout
| File | Purpose |
|---|---|
| `proto/coreml.proto` | focused subset of Apple's CoreML protobuf schema (exact field numbers) |
| `src/mil.rs` | IR → MIL `Program` lowering (host-portable, no FFI) |
| `src/mlpackage.rs` | `.mlpackage` bundle writer |
| `src/chip.rs` | ANE / chip introspection (`ane_available`, `chip_info`) |
| `csrc/coreml_shim.m` | Objective-C bridge over `CoreML.framework` (compiled by `build.rs`) |
| `src/ffi.rs`, `src/backend.rs` | execution (Apple platforms only) |
MIL emission and `.mlpackage` writing are pure Rust and build on every
host. Only execution is gated behind `target_os = "macos"`/`"ios"`.
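
As a rough illustration of that split, the sketch below shows how a host-portable function and a platform-gated one can coexist in one crate. The function names `emit_stub` and `can_execute` are hypothetical and are not part of this crate's API; only the `cfg` predicate mirrors the gating described above.

```rust
/// Host-portable piece: plain Rust, compiles on every target.
/// (Hypothetical stand-in for MIL emission.)
pub fn emit_stub(op_names: &[&str]) -> String {
    op_names.join(";")
}

/// Execution is only available on Apple targets.
#[cfg(any(target_os = "macos", target_os = "ios"))]
pub fn can_execute() -> bool {
    true
}

/// On every other host the crate still builds, but execution is absent.
#[cfg(not(any(target_os = "macos", target_os = "ios")))]
pub fn can_execute() -> bool {
    false
}
```

Callers on Linux or Windows can still produce a package with the portable half and ship it to an Apple device for execution.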
## Usage
```rust
use rlx_runtime::{Device, Session};
let mut compiled = Session::new(Device::Ane).compile(graph);
compiled.set_param("W", &weights);
let out = compiled.run(&[("x", &input)]);
```
Or use the executable directly, without going through `rlx_runtime`:
```rust
let mut exe = rlx_coreml::CoremlExecutable::compile(graph);
exe.set_param("W", &weights);
let out = exe.run(&[("x", &input)])?;
```
## Status
Ops lowered to MIL today:
- `const` / `param` (weights baked into the model: small consts inline, larger ones in `weights/weight.bin`)
- `matmul`
- element-wise binary: `add` / `sub` / `mul` / `real_div` / `maximum` /
`minimum` / `pow`
- activations: `relu` / `sigmoid` / `tanh` / `gelu` (exact + tanh-approx) /
`silu` / `exp` / `log` / `sqrt` / `rsqrt` / `abs` / `neg` / `sin` /
`cos` / `tan` / `atan` / `round`
- `softmax`, `layer_norm`, `rms_norm`, `group_norm`, `layer_norm_2d`,
`batch_norm` (inference) — composed where MIL lacks a native op
- `rope` (NeoX split-halves) + `axial_rope_2d` (SAM2-style, baked tables)
- `attention` (causal / bias / none masks, scale, logit softcap) composed
from primitives — a full transformer block runs end-to-end
- `compare` (all 6), `where`/select, `expand`, `cumsum`, `scatter_add`,
`top_k`, `lora_matmul`, `stop_gradient`
- MoE: `grouped_matmul` (gather-then-batched-matmul)
- quantized weights: `dequant_matmul`, `dequant_grouped_matmul` (MoE),
`dequant_moe_weights` — GGUF schemes Q8_0 / Q4_0 / Q2_K / Q3_K / Q4_K /
Q5_K / Q6_K / Q8_K, **host-dequantized to f32 at model-build time**
- `quantize` / `dequantize` (per-tensor & per-channel int8 fake-quant)
- SSM (recurrent, unrolled over the sequence): `selective_scan` (Mamba),
`gated_delta_net` (Qwen3.5) — verified against the CPU backend
- vision: `conv`, `conv_transpose_2d`, `pool` (max/avg), `resize_nearest_2x`
— NCHW
- `reduce` (sum/mean/max/min/prod), `concat`, `gather`, `narrow`, `cast`
- `reshape`, `transpose`
That's **42 of the IR's 106 op kinds** — the complete inference compute
surface (transformer + vision + MoE + quantized + SSM + elementwise).
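
To make "composed where MIL lacks a native op" concrete, here is a minimal sketch of how an RMS norm breaks down into element-wise and reduction primitives. It runs on plain slices rather than emitting MIL, and the exact primitive sequence this backend emits is an assumption; the function name `rms_norm` here is illustrative only.

```rust
/// RMS norm over one vector, written as the chain of primitives
/// a lowering could emit: mul -> reduce_mean -> add -> rsqrt -> mul -> mul.
pub fn rms_norm(x: &[f32], weight: &[f32], eps: f32) -> Vec<f32> {
    assert_eq!(x.len(), weight.len(), "weight must match the normalized axis");
    // mul: x * x
    let squared: Vec<f32> = x.iter().map(|v| v * v).collect();
    // reduce_mean over the last axis
    let mean = squared.iter().sum::<f32>() / x.len() as f32;
    // add eps, then rsqrt
    let inv_rms = 1.0 / (mean + eps).sqrt();
    // mul by the inverse RMS, then mul by the learned weight
    x.iter().zip(weight).map(|(v, w)| v * inv_rms * w).collect()
}
```

For `x = [2.0, 2.0]`, `weight = [1.0, 3.0]`, and `eps = 0.0`, the mean square is 4, the inverse RMS is 0.5, and the result is `[1.0, 3.0]`.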
GGUF-quantized weights arrive via `set_param_typed` (raw bytes) and are
dequantized on the host when the `.mlpackage` is baked. This makes
quantized models *run*, but the package then carries f32 weights, so it is
no smaller than an unquantized one (on-device `constexpr` dequant is a
later size optimization).
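
For a sense of what host dequantization involves, the sketch below decodes GGUF Q8_0, the simplest of the supported schemes: each 34-byte block holds a little-endian f16 scale followed by 32 signed 8-bit values, and each output is `scale * q`. This is an independent illustration, not this crate's dispatcher; `dequant_q8_0` and `f16_to_f32` are hypothetical names.

```rust
/// Convert IEEE 754 half-precision bits to f32 (stable Rust has no f16 type).
fn f16_to_f32(bits: u16) -> f32 {
    let sign = if bits & 0x8000 != 0 { -1.0f32 } else { 1.0 };
    let exp = i32::from((bits >> 10) & 0x1f);
    let frac = f32::from(bits & 0x03ff);
    match exp {
        // zero and subnormals: frac * 2^-24
        0 => sign * frac * 2f32.powi(-24),
        // infinities and NaNs
        0x1f => {
            if frac == 0.0 {
                sign * f32::INFINITY
            } else {
                f32::NAN
            }
        }
        // normal numbers: (1 + frac/1024) * 2^(exp - 15)
        _ => sign * (1.0 + frac / 1024.0) * 2f32.powi(exp - 15),
    }
}

/// Dequantize raw Q8_0 bytes to f32. Returns None if the input is not
/// a whole number of 34-byte blocks.
pub fn dequant_q8_0(raw: &[u8]) -> Option<Vec<f32>> {
    const BLOCK_BYTES: usize = 34; // 2-byte f16 scale + 32 int8 quants
    if raw.len() % BLOCK_BYTES != 0 {
        return None;
    }
    let mut out = Vec::with_capacity(raw.len() / BLOCK_BYTES * 32);
    for block in raw.chunks_exact(BLOCK_BYTES) {
        let scale = f16_to_f32(u16::from_le_bytes([block[0], block[1]]));
        // Reinterpret each byte as a signed quant and apply the block scale.
        out.extend(block[2..].iter().map(|&b| scale * f32::from(b as i8)));
    }
    Some(out)
}
```

The k-quant schemes follow the same idea with more elaborate block layouts (nested scales and packed sub-byte values), which this sketch does not cover.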
Weights with ≥ 10 elements are written to `weights/weight.bin` in CoreML's
**MILBlob format** and referenced from the proto by offset; smaller consts
stay inline. This keeps the protobuf small enough for CoreML to parse even
for LLM-scale models (an all-inline proto would blow past its limits).
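
The inline-versus-blob decision can be sketched as below. This only illustrates the 10-element threshold and offset bookkeeping; the real MILBlob format adds headers, per-entry metadata, and alignment, all of which are omitted here. `Placement` and `place` are hypothetical names, not this crate's API.

```rust
/// Where a weight tensor ends up after packaging.
#[derive(Debug, PartialEq)]
pub enum Placement {
    /// Small constants stay in the protobuf itself.
    Inline(Vec<f32>),
    /// Larger tensors live in the side file; the proto keeps only a reference.
    Blob { offset: u64, len: usize },
}

/// Element count at or above which a weight moves to the blob file.
pub const BLOB_THRESHOLD: usize = 10;

/// Place one weight tensor, appending to `blob` when it is large enough.
/// Simplification: raw little-endian f32s with no header or alignment.
pub fn place(weights: &[f32], blob: &mut Vec<u8>) -> Placement {
    if weights.len() < BLOB_THRESHOLD {
        return Placement::Inline(weights.to_vec());
    }
    let offset = blob.len() as u64;
    for w in weights {
        blob.extend_from_slice(&w.to_le_bytes());
    }
    Placement::Blob { offset, len: weights.len() }
}
```

With this layout the protobuf grows by a fixed-size reference per large tensor rather than by the tensor's contents.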
If the serialized `model.mlmodel` still exceeds protobuf's ~2 GiB message
cap (too many ops / large inline constants), packaging fails fast with an
explicit `CoremlError::TooLarge { bytes, limit }` rather than a downstream
CoreML parse error. `mlpackage::check_model_size` exposes the check.
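
A fail-fast size check of this kind might look like the sketch below. The error shape mirrors the `TooLarge { bytes, limit }` variant named above, but the type `SizeError`, the function `check_size`, and the exact limit constant are assumptions for illustration; the real signature of `mlpackage::check_model_size` may differ.

```rust
/// Illustrative error type; only the variant shape is taken from the text above.
#[derive(Debug, PartialEq)]
pub enum SizeError {
    TooLarge { bytes: u64, limit: u64 },
}

/// Assumed cap: protobuf message sizes are bounded by a signed 32-bit length.
pub const PROTO_LIMIT: u64 = i32::MAX as u64;

/// Reject a serialized model before it is handed to a parser that would
/// fail on it with a less helpful message.
pub fn check_size(bytes: u64, limit: u64) -> Result<(), SizeError> {
    if bytes > limit {
        Err(SizeError::TooLarge { bytes, limit })
    } else {
        Ok(())
    }
}
```

Carrying both `bytes` and `limit` in the error lets the caller report how far over the cap the model is.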
Introspection: `chip_info()`, `ane_available()`, and per-op device
routing via `MLComputePlan` (macOS 14.4+) on the loaded model.
### Not yet implemented (extension surface)
The 64 remaining op kinds are **not** a tail of inference compute — every
op a model executes during a forward pass is now covered. What's left is
categorically different work:
- **Training / backward (~30)**: every `*Backward*` op,
`SoftmaxCrossEntropy*`, and the `FakeQuantize` (QAT) family. CoreML is an
inference runtime; these cannot be lowered to a static prediction model.
- **Fusion-internal (~9)**: `Fused*`, `ElementwiseRegion`,
`TransformRegion`. Created only by the fusion pass, which CoreML doesn't
run, so the backend never sees them (it lowers the primitives instead).
- **Host / runtime-loop**: `Sample` + `Rng*` (decode loop), control flow
(`If`/`While`/`Scan`), `Custom`/`CustomFn` (need a CoreML op registry).
- **Int8-activation matmul**: `QMatMul`, `QConv2d`. Lowerable in principle
(fp32-integer + host weight dequant), but their I/O is int8, which can't
be a CoreML model input *or* output. They are only usable buried inside a
fully int8 `Quantize→…→Dequantize` subgraph (a niche onnx-static-quant
path; the common quant path is GGUF weight-quant, which is done).
- **No reference / no MIL op**: `DotGeneral` (the CPU backend itself
leaves it unimplemented), `Fft` (no MIL FFT), `DenseSolve` (no MIL
solver), `Im2Col` (a conv-internal detail), domain ops (`GaussianSplat*`,
`LogMel`, `WelchPeaks`).
Other limits: GGUF dequant covers only the 8 schemes wired into the
dispatcher (Q8_0, Q4_0, and the six k-quants listed above; IQ / ternary /
microscaling formats fall through), and weights are host-dequantized to
f32 (no on-device `constexpr` dequant yet); attention is rank-4
`[B,H,S,D]` only; RoPE requires last-dim == `head_dim`; SSM ops are
unrolled (static seq); input shapes must be static.
### Note on output layout
CoreML pads the inner dimension of rank-4 outputs for ANE alignment, so
output `MLMultiArray`s can be non-contiguous; the shim copies them via
logical (stride-aware) indexing rather than a flat `memcpy`.
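
The shim itself is Objective-C, but the idea translates directly. The sketch below shows a stride-aware copy in Rust: it walks the logical index space and reads each element at `sum(index[i] * stride[i])`, which skips any padding. The function name `copy_strided` is hypothetical, and strides are expressed in elements, not bytes.

```rust
/// Copy a possibly padded, strided buffer into a dense row-major Vec.
/// `shape` and `strides` are per-axis, with strides measured in elements.
pub fn copy_strided(src: &[f32], shape: &[usize], strides: &[usize]) -> Vec<f32> {
    assert_eq!(shape.len(), strides.len(), "one stride per axis");
    let total: usize = shape.iter().product();
    let mut out = Vec::with_capacity(total);
    let mut index = vec![0usize; shape.len()];
    for _ in 0..total {
        // Logical index -> physical offset, honoring padding in the strides.
        let offset: usize = index.iter().zip(strides).map(|(i, s)| i * s).sum();
        out.push(src[offset]);
        // Advance the index like an odometer, last axis fastest.
        for axis in (0..shape.len()).rev() {
            index[axis] += 1;
            if index[axis] < shape[axis] {
                break;
            }
            index[axis] = 0;
        }
    }
    out
}
```

For a `[1, 1, 2, 3]` output whose inner rows are padded to 8 elements (strides `[16, 16, 8, 1]`), a flat `memcpy` of 6 elements would pick up padding after the first row, while the loop above returns only the 6 logical values.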