rlx-coreml 0.2.8

Apple CoreML / Neural Engine (ANE) backend for RLX — lowers the IR to an ML Program (MIL) and runs it through CoreML.framework

rlx-coreml

Apple CoreML / Neural Engine (ANE) backend for RLX.

It lowers an RLX IR graph to a CoreML ML Program (the MIL dialect), serialises it into a .mlpackage, and runs it through CoreML.framework. CoreML's planner then schedules each op across the CPU, GPU, and Neural Engine.

Layout

path                          role
proto/coreml.proto            focused subset of Apple's CoreML protobuf schema (exact field numbers)
src/mil.rs                    IR → MIL Program lowering (host-portable, no FFI)
src/mlpackage.rs              .mlpackage bundle writer
src/chip.rs                   ANE / chip introspection (ane_available, chip_info)
csrc/coreml_shim.m            Objective-C bridge over CoreML.framework (compiled by build.rs)
src/ffi.rs, src/backend.rs    execution (Apple platforms only)

MIL emission and .mlpackage writing are pure Rust and build on every host. Only execution is gated behind target_os = "macos"/"ios".
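
The split between portable emission and Apple-only execution can be pictured with a cfg gate. This is a minimal sketch, not the crate's code: the function names emit_mil and execute are hypothetical, and whether the crate provides a fallback stub on other hosts or simply omits the symbol is not stated here.

```rust
/// Hypothetical stand-in for the portable half: builds on every host.
pub fn emit_mil(op_names: &[&str]) -> String {
    // Join op names into a toy "program" string; the real lowering emits a MIL protobuf.
    op_names.join(";")
}

/// Hypothetical stand-in for execution, compiled only on Apple targets.
#[cfg(any(target_os = "macos", target_os = "ios"))]
pub fn execute(program: &str) -> Result<usize, String> {
    // A real implementation would call into CoreML.framework here.
    Ok(program.len())
}

/// Assumed fallback on other hosts so callers still compile and get a clear error.
#[cfg(not(any(target_os = "macos", target_os = "ios")))]
pub fn execute(_program: &str) -> Result<usize, String> {
    Err("CoreML execution requires macOS or iOS".to_string())
}
```

With this shape, emit_mil(&["matmul", "relu"]) works on Linux or Windows, while execute only succeeds when the crate is compiled for macOS or iOS.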

Usage

use rlx_runtime::{Device, Session};

let mut compiled = Session::new(Device::Ane).compile(graph);
compiled.set_param("W", &weights);
let out = compiled.run(&[("x", &input)]);

or directly:

let mut exe = rlx_coreml::CoremlExecutable::compile(graph);
exe.set_param("W", &weights);
let out = exe.run(&[("x", &input)])?;

Status

Ops lowered to MIL today:

  • const / param (weights baked as inline immediate values)
  • matmul
  • element-wise binary: add / sub / mul / real_div / maximum / minimum / pow
  • activations: relu / sigmoid / tanh / gelu (exact + tanh-approx) / silu / exp / log / sqrt / rsqrt / abs / neg / sin / cos / tan / atan / round
  • softmax, layer_norm, rms_norm, group_norm, layer_norm_2d, batch_norm (inference) — composed where MIL lacks a native op
  • rope (NeoX split-halves) + axial_rope_2d (SAM2-style, baked tables)
  • attention (causal / bias / none masks, scale, logit softcap) composed from primitives — a full transformer block runs end-to-end
  • compare (all 6), where/select, expand, cumsum, scatter_add, top_k, lora_matmul, stop_gradient
  • MoE: grouped_matmul (gather-then-batched-matmul)
  • quantized weights: dequant_matmul, dequant_grouped_matmul (MoE), dequant_moe_weights — GGUF schemes Q8_0 / Q4_0 / Q2_K / Q3_K / Q4_K / Q5_K / Q6_K / Q8_K, host-dequantized to f32 at model-build time
  • quantize / dequantize (per-tensor & per-channel int8 fake-quant)
  • SSM (recurrent, unrolled over the sequence): selective_scan (Mamba), gated_delta_net (Qwen3.5) — verified against the CPU backend
  • vision: conv, conv_transpose_2d, pool (max/avg), resize_nearest_2x — NCHW
  • reduce (sum/mean/max/min/prod), concat, gather, narrow, cast
  • reshape, transpose

That's 42 of the IR's 106 op kinds — the complete inference compute surface (transformer + vision + MoE + quantized + SSM + elementwise).
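
To make one entry of the list concrete, here is the scalar math behind a NeoX-style (split-halves) RoPE. It is a reference sketch of the standard formulation, not the backend's MIL lowering (which composes the same result from tensor ops); the function name and parameters are illustrative assumptions.

```rust
/// Apply NeoX-style (split-halves) rotary embedding to one head vector in place.
/// `x.len()` is the head dimension and must be even; `pos` is the token position;
/// `base` is the frequency base (commonly 10000.0).
pub fn rope_neox(x: &mut [f32], pos: f32, base: f32) {
    let d = x.len();
    assert!(d % 2 == 0, "head_dim must be even");
    let half = d / 2;
    for i in 0..half {
        // Frequency for pair i: base^(-2i/d).
        let freq = base.powf(-2.0 * i as f32 / d as f32);
        let (sin, cos) = (pos * freq).sin_cos();
        // Split-halves pairing: element i rotates together with element i + d/2
        // (the interleaved variant would pair 2i with 2i + 1 instead).
        let (a, b) = (x[i], x[i + half]);
        x[i] = a * cos - b * sin;
        x[i + half] = a * sin + b * cos;
    }
}
```

At position 0 every angle is zero, so the vector is returned unchanged; that makes a convenient sanity check when comparing against another backend.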

GGUF-quantized weights arrive via set_param_typed (raw bytes) and are dequantized on the host when the .mlpackage is baked. Quantized models therefore run, but the package stores their weights as f32; on-device constexpr dequant is a later size optimization.
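
Host dequantization is easiest to see on the simplest scheme. The sketch below decodes Q8_0 as laid out by GGUF (blocks of one little-endian f16 scale followed by 32 signed 8-bit values). The function names are illustrative and this is not the crate's dispatcher; the other schemes use more involved block layouts.

```rust
/// Decode an IEEE 754 half-precision value (given as raw bits) to f32.
fn f16_to_f32(bits: u16) -> f32 {
    let sign = if bits & 0x8000 != 0 { -1.0f32 } else { 1.0 };
    let exp = ((bits >> 10) & 0x1f) as i32;
    let frac = (bits & 0x3ff) as f32;
    match exp {
        // Zero and subnormals: frac * 2^-24.
        0 => sign * frac * 2f32.powi(-24),
        // Infinity or NaN.
        0x1f => if frac == 0.0 { sign * f32::INFINITY } else { f32::NAN },
        // Normal numbers: (1 + frac/1024) * 2^(exp - 15).
        _ => sign * (1.0 + frac / 1024.0) * 2f32.powi(exp - 15),
    }
}

/// Values per Q8_0 block, and bytes per block (2-byte scale + 32 quants).
const QK8_0: usize = 32;
const BLOCK_BYTES: usize = 2 + QK8_0;

/// Dequantize raw Q8_0 bytes to f32: each value is scale * quant.
pub fn dequant_q8_0(raw: &[u8]) -> Vec<f32> {
    assert!(raw.len() % BLOCK_BYTES == 0, "input must be whole Q8_0 blocks");
    let mut out = Vec::with_capacity(raw.len() / BLOCK_BYTES * QK8_0);
    for block in raw.chunks_exact(BLOCK_BYTES) {
        let scale = f16_to_f32(u16::from_le_bytes([block[0], block[1]]));
        // Reinterpret each byte as a signed quant before scaling.
        out.extend(block[2..].iter().map(|&q| scale * (q as i8) as f32));
    }
    out
}
```

Each 34-byte block expands to 128 bytes of f32, which is the size cost the text refers to.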

Weights with ≥ 10 elements are written to weights/weight.bin in CoreML's MILBlob format and referenced from the proto by offset; smaller consts stay inline. This keeps the protobuf small enough for CoreML to parse even for LLM-scale models (an all-inline proto would blow past its limits).
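
The inline-versus-blob decision can be sketched as follows. The Placement type and place_weights function are hypothetical names for illustration, and the offsets are a plain running byte total; the real MILBlob file adds its own headers and alignment, which this sketch ignores.

```rust
/// Where a constant ends up in the package (hypothetical type for illustration).
#[derive(Debug, PartialEq)]
pub enum Placement {
    /// Stored directly in the protobuf.
    Inline,
    /// Stored in weights/weight.bin and referenced by byte offset.
    Blob { offset: u64 },
}

/// Element-count threshold stated in the text: 10 or more elements go to the blob file.
const BLOB_THRESHOLD: usize = 10;

/// Assign each f32 constant (given by its element count) to inline or blob storage.
pub fn place_weights(elem_counts: &[usize]) -> Vec<Placement> {
    let mut next_offset = 0u64;
    elem_counts
        .iter()
        .map(|&n| {
            if n >= BLOB_THRESHOLD {
                let placement = Placement::Blob { offset: next_offset };
                // Advance by the payload size only (no header or alignment padding).
                next_offset += (n * std::mem::size_of::<f32>()) as u64;
                placement
            } else {
                Placement::Inline
            }
        })
        .collect()
}
```

For element counts [4, 10, 3, 16], the second and fourth constants land in the blob file and the other two stay inline, so the protobuf grows only with the number of small constants.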

If the serialized model.mlmodel still exceeds protobuf's ~2 GiB message cap (too many ops / large inline constants), packaging fails fast with an explicit CoremlError::TooLarge { bytes, limit } rather than a downstream CoreML parse error. mlpackage::check_model_size exposes the check.
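
A fail-fast size check of this kind might look like the sketch below. The error variant name comes from the text, but the field types, the function signature, and the exact limit constant (2^31 - 1 bytes, the usual protobuf message cap) are assumptions rather than the crate's definitions.

```rust
/// Local stand-in mirroring the error variant named in the text (field types assumed).
#[derive(Debug, PartialEq)]
pub enum CoremlError {
    TooLarge { bytes: u64, limit: u64 },
}

/// Assumed cap: 2^31 - 1 bytes, just under 2 GiB.
pub const PROTOBUF_LIMIT: u64 = i32::MAX as u64;

/// Assumed shape of the check: compare the serialized size against the cap
/// before handing the model to CoreML.
pub fn check_model_size(bytes: u64) -> Result<(), CoremlError> {
    if bytes > PROTOBUF_LIMIT {
        Err(CoremlError::TooLarge { bytes, limit: PROTOBUF_LIMIT })
    } else {
        Ok(())
    }
}
```

Carrying both bytes and limit in the error lets the caller report how far over the cap the model is, which is more actionable than a generic parse failure.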

Introspection: chip_info(), ane_available(), and per-op device routing via MLComputePlan (macOS 14.4+) on the loaded model.

Not yet implemented (extension surface)

The 64 remaining op kinds are not a tail of inference compute — every op a model executes during a forward pass is now covered. What's left is categorically different work:

  • Training / backward (~30): every *Backward* op, SoftmaxCrossEntropy*, and the FakeQuantize (QAT) family. CoreML is an inference runtime; these cannot be lowered to a static prediction model.
  • Fusion-internal (~9): Fused*, ElementwiseRegion, TransformRegion. Created only by the fusion pass, which CoreML doesn't run, so the backend never sees them (it lowers the primitives instead).
  • Host / runtime-loop: Sample + Rng* (decode loop), control flow (If/While/Scan), Custom/CustomFn (need a CoreML op registry).
  • Int8-activation matmul: QMatMul, QConv2d. Lowerable in principle (fp32-integer + host weight dequant), but their I/O is int8, which can't be a CoreML model input or output. They are only usable buried inside a fully int8 Quantize→…→Dequantize subgraph (a niche ONNX static-quant path; the common quant path is GGUF weight-quant, which is done).
  • No reference / no MIL op: DotGeneral (the CPU backend itself leaves it unimplemented), Fft (no MIL FFT), DenseSolve (no MIL solver), Im2Col (a conv-internal detail), domain ops (GaussianSplat*, LogMel, WelchPeaks).

Other limits: GGUF dequant covers the 8 schemes wired into the dispatcher (Q8_0, Q4_0, and the six K-quants listed above; IQ, ternary, and microscaling formats fall through) and is host-dequantized to f32 (no on-device constexpr dequant yet); attention is rank-4 [B,H,S,D] only; RoPE requires last-dim == head_dim; SSM ops are unrolled (static sequence length); input shapes must be static.

Note on output layout

CoreML pads the inner dimension of rank-4 outputs for ANE alignment, so output MLMultiArrays can be non-contiguous; the shim copies them via logical (stride-aware) indexing rather than a flat memcpy.
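
The difference between a flat copy and a stride-aware copy can be sketched in Rust. The actual shim is Objective-C; the function below is an illustrative equivalent with a hypothetical name, assuming strides are expressed in elements.

```rust
/// Copy a rank-4 tensor out of a possibly padded buffer into a dense row-major Vec.
/// `strides` are in elements, one per dimension.
pub fn copy_strided(src: &[f32], shape: [usize; 4], strides: [usize; 4]) -> Vec<f32> {
    let [n, c, h, w] = shape;
    let mut out = Vec::with_capacity(n * c * h * w);
    for i in 0..n {
        for j in 0..c {
            for k in 0..h {
                for l in 0..w {
                    // Map the logical index to a physical offset via the strides,
                    // which skips any padding at the end of each row.
                    let off = i * strides[0] + j * strides[1] + k * strides[2] + l * strides[3];
                    out.push(src[off]);
                }
            }
        }
    }
    out
}
```

For a [1, 1, 2, 3] output whose rows are padded to 4 elements (strides [8, 8, 4, 1]), a flat memcpy of 6 elements would pick up the padding slot after the first row; the strided copy returns only the 6 logical values.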