rlx-coreml 0.2.10

Apple CoreML / Neural Engine (ANE) backend for RLX — lowers the IR to an ML Program (MIL) and runs it through CoreML.framework

rlx-coreml

Apple CoreML / Neural Engine (ANE) backend for RLX.

For GGUF on-device dequant, hybrid host segments, and env toggles, see docs/gguf-backend-paths.md (ANE section).

It lowers an RLX IR graph to a CoreML ML Program (the MIL dialect), serialises it into a .mlpackage, and runs it through CoreML.framework. CoreML's planner then schedules each op across the CPU, GPU, and Neural Engine.

Layout

path                        role
proto/coreml.proto          focused subset of Apple's CoreML protobuf schema (exact field numbers)
src/mil/                    IR → MIL Program lowering (host-portable, no FFI)
src/mlpackage.rs            .mlpackage bundle writer
src/hybrid.rs               segment-based host + CoreML execution planner
src/host_exec.rs            host ops (FFT, RNN, sampling, custom kernels, …)
src/op_registry.rs          Op::Custom kernel registry
src/chip.rs                 ANE / chip introspection (ane_available, chip_info)
csrc/coreml_shim.m          Objective-C bridge over CoreML.framework (compiled by build.rs)
src/ffi.rs, src/backend.rs  execution (Apple platforms only)

MIL emission and .mlpackage writing are pure Rust and build on every host. Only execution is gated behind target_os = "macos"/"ios".
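The platform gate described above can be sketched with standard `cfg` attributes. This is only an illustration of the pattern: `execution_supported` and `run_model` are hypothetical names, not functions exported by rlx-coreml, and the Apple branch is a placeholder rather than a real call into the bridge.

```rust
/// True when this build targets a platform where CoreML execution
/// could be compiled in (macOS or iOS).
pub fn execution_supported() -> bool {
    cfg!(any(target_os = "macos", target_os = "ios"))
}

/// Hypothetical entry point, Apple build: a real backend would call
/// into the CoreML.framework bridge here.
#[cfg(any(target_os = "macos", target_os = "ios"))]
pub fn run_model(path: &str) -> Result<String, String> {
    Ok(format!("would load and run {path}"))
}

/// Same signature on every other host, so callers compile everywhere
/// and get a runtime error instead of a missing symbol.
#[cfg(not(any(target_os = "macos", target_os = "ios")))]
pub fn run_model(path: &str) -> Result<String, String> {
    Err(format!("cannot run {path}: CoreML execution needs macOS or iOS"))
}
```

Keeping one signature on both sides of the gate lets host-portable code (MIL emission, bundle writing) link against the same API on Linux or Windows.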

Usage

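use rlx_runtime::{Device, Precision, Session};

// `graph`, `weights`, and `input` are assumed to be defined by the caller.

// Request F16 through the session precision.
let session = Session::new_with_precision(Device::Ane, Precision::F16);

// Compile the graph for the ANE, bind the parameter, then run inference.
let mut compiled = session.compile(graph);
compiled.set_param("W", &weights);
let out = compiled.run(&[("x", &input)]);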

Custom op on ANE (host segment):

use std::sync::Arc;
rlx_coreml::register_coreml_kernel(Arc::new(MyKernel)); // MyKernel: your kernel type
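To make the register-then-dispatch idea concrete, here is a toy, self-contained registry. Everything in it is an assumption made for illustration: the `HostKernel` trait, `KernelRegistry`, and `Doubler` are hypothetical and do not reflect the actual trait or types behind `register_coreml_kernel`.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical kernel interface; the real rlx-coreml trait may differ.
pub trait HostKernel: Send + Sync {
    fn name(&self) -> &str;
    fn run(&self, input: &[f32]) -> Vec<f32>;
}

/// Toy registry keyed by kernel name.
#[derive(Default)]
pub struct KernelRegistry {
    kernels: Mutex<HashMap<String, Arc<dyn HostKernel>>>,
}

impl KernelRegistry {
    /// Store the kernel under its own name, replacing any earlier entry.
    pub fn register(&self, kernel: Arc<dyn HostKernel>) {
        let name = kernel.name().to_string();
        self.kernels.lock().unwrap().insert(name, kernel);
    }

    /// Look the kernel up by name and run it; None if nothing is registered.
    pub fn dispatch(&self, name: &str, input: &[f32]) -> Option<Vec<f32>> {
        let kernel = self.kernels.lock().unwrap().get(name).cloned()?;
        Some(kernel.run(input))
    }
}

/// Example kernel: doubles every element.
pub struct Doubler;

impl HostKernel for Doubler {
    fn name(&self) -> &str {
        "doubler"
    }
    fn run(&self, input: &[f32]) -> Vec<f32> {
        input.iter().map(|x| x * 2.0).collect()
    }
}
```

Registering `Arc::new(Doubler)` and then calling `dispatch("doubler", &[1.0, 2.5])` runs the kernel on the host; an unknown name returns `None`.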

Environment knobs:

Variable                      Effect
RLX_COREML_UNITS              cpu / gpu / all; default is CPU+ANE
RLX_COREML_HOST_DEQUANT=1     bake full f32 weights at compile time (legacy path)
RLX_COREML_FLEXIBLE_INPUTS=1  emit CoreML ShapeRange on dynamic inputs
RLX_COREML_NATIVE_FLEX=1      ANE: one model + runtime shapes (skip DeferredExecutable)
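A minimal sketch of how such knobs might be read, assuming the values listed above. The `ComputeUnits` enum, the function names, the case-insensitive matching, and the fallback to CPU+ANE for unrecognised values are all assumptions for illustration; the crate's actual parsing is not shown here.

```rust
/// Compute-unit selection, mirroring the RLX_COREML_UNITS values.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ComputeUnits {
    Cpu,
    Gpu,
    All,
    CpuAndAne,
}

/// Map a raw RLX_COREML_UNITS value to a selection.
/// Unset or unrecognised values fall back to CPU+ANE (assumed behaviour).
pub fn parse_units(raw: Option<&str>) -> ComputeUnits {
    match raw.map(|s| s.trim().to_ascii_lowercase()).as_deref() {
        Some("cpu") => ComputeUnits::Cpu,
        Some("gpu") => ComputeUnits::Gpu,
        Some("all") => ComputeUnits::All,
        _ => ComputeUnits::CpuAndAne,
    }
}

/// The boolean toggles are documented as `=1`, so only "1" enables them.
pub fn flag_enabled(raw: Option<&str>) -> bool {
    raw.map(|s| s.trim() == "1").unwrap_or(false)
}

/// Read the selection from the process environment.
pub fn units_from_env() -> ComputeUnits {
    parse_units(std::env::var("RLX_COREML_UNITS").ok().as_deref())
}
```

Splitting the pure parser from the environment read keeps the logic testable without mutating process-wide state.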

Status

58 op kinds declared in COREML_SUPPORTED_OPS — the complete forward inference surface for transformer + vision + MoE + quantized + SSM graphs.

MIL-lowered ops

Element-wise, matmul, norms, attention (causal / bias / sliding window), vision (conv / pool / resize), MoE grouped_matmul, SSM (selective_scan, gated_delta_net), quantized dequant_* (with on-device block dequant for Q8_0 / Q4_0 / IQ4NL), quantize / dequantize, reductions, gather, rope, and the rest of the primitive set listed in rlx-runtime COREML_SUPPORTED_OPS.

On-device constexpr dequant also covers K-quants Q4_K / Q5_K / Q8_K and Q2_K / Q3_K / Q6_K (mul + optional sub; Q2/Q3/Q6 use per-element [nb,32] scale/offset tensors when sub-block scales vary within a 32-chunk). Legacy Q4_1 (scale + min per block) is included alongside Q4_0 / Q8_0 / IQ4NL.
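For readers unfamiliar with the block formats, the "mul + optional sub" shape can be sketched with host-side reference arithmetic. This is not the crate's MIL lowering: the function names are hypothetical, scales are taken as f32 for simplicity (GGUF stores them as f16), and the nibble order follows the usual GGML Q4_0 layout.

```rust
/// Q8_0-style block: 32 signed 8-bit values sharing one scale (mul only).
pub fn dequant_q8_0(scale: f32, q: &[i8; 32]) -> [f32; 32] {
    let mut out = [0.0f32; 32];
    for (o, &v) in out.iter_mut().zip(q.iter()) {
        *o = scale * v as f32;
    }
    out
}

/// Q4_0-style block: 16 bytes holding 32 unsigned 4-bit values.
/// Low nibbles are elements 0..16, high nibbles are elements 16..32.
/// Written as mul then sub: q * scale - 8 * scale.
pub fn dequant_q4_0(scale: f32, packed: &[u8; 16]) -> [f32; 32] {
    let mut out = [0.0f32; 32];
    let offset = scale * 8.0;
    for (j, &byte) in packed.iter().enumerate() {
        out[j] = scale * (byte & 0x0F) as f32 - offset;
        out[j + 16] = scale * (byte >> 4) as f32 - offset;
    }
    out
}

/// Generic per-element form: q * scale, minus an offset when one is given.
/// This mirrors the per-element scale/offset tensors mentioned above.
pub fn mul_sub(q: &[f32], scale: &[f32], offset: Option<&[f32]>) -> Vec<f32> {
    assert_eq!(q.len(), scale.len());
    q.iter()
        .enumerate()
        .map(|(i, &v)| {
            let scaled = v * scale[i];
            match offset {
                Some(off) => scaled - off[i],
                None => scaled,
            }
        })
        .collect()
}
```

Formats with a per-block minimum, such as Q4_1, fit the same shape once the minimum is folded into the offset with the appropriate sign.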

Host segments (hybrid runner)

Ops with no stable MIL lowering run on CPU between CoreML segments:

  • fft, log_mel, welch_peaks
  • sample, rng_normal, rng_uniform
  • native RNN / SSM: lstm, gru, rnn, mamba2 (via rlx_cpu reference kernels)
  • custom (via register_coreml_kernel)

Keeping gru / rnn / lstm on the host avoids unfusing them into huge MIL graphs.
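The segment split can be pictured as grouping consecutive ops by where they run. The sketch below is a simplified assumption: it treats the graph as a linear op list, uses the host op names from the list above, and introduces hypothetical `Target`, `Segment`, and `plan_segments` names. The planner in src/hybrid.rs works on the real IR and is likely more involved.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Target {
    CoreMl,
    Host,
}

#[derive(Debug, PartialEq, Eq)]
pub struct Segment {
    pub target: Target,
    pub ops: Vec<String>,
}

/// Op names that run on the host, taken from the list above.
const HOST_OPS: &[&str] = &[
    "fft", "log_mel", "welch_peaks", "sample", "rng_normal", "rng_uniform",
    "lstm", "gru", "rnn", "mamba2", "custom",
];

pub fn target_for(op: &str) -> Target {
    if HOST_OPS.iter().any(|&h| h == op) {
        Target::Host
    } else {
        Target::CoreMl
    }
}

/// Group consecutive ops with the same target into one segment.
pub fn plan_segments(ops: &[&str]) -> Vec<Segment> {
    let mut segments: Vec<Segment> = Vec::new();
    for &op in ops {
        let target = target_for(op);
        let extend = segments.last().map_or(false, |s| s.target == target);
        if extend {
            if let Some(last) = segments.last_mut() {
                last.ops.push(op.to_string());
            }
        } else {
            segments.push(Segment { target, ops: vec![op.to_string()] });
        }
    }
    segments
}
```

For the sequence matmul, fft, log_mel, softmax this yields three segments: one CoreML segment, one host segment holding both audio ops, and a final CoreML segment.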

P1–P5 infrastructure (2026-06)

  • FP16: LowerOptions::float_dtype, f16 blob weights, f16 CoreML I/O in shim, Precision::F16 via CompileOptions
  • Flexible shapes: MIL UnknownDimension + ShapeRange; runtime shape inference at predict; optional RLX_COREML_NATIVE_FLEX=1 skips deferred recompile on ANE
  • On-device dequant: Q8_0 / Q4_0 / IQ4NL / Q4_K / Q5_K / Q8_K in MIL; MoE grouped matmul too
  • Custom registry: op_registry.rs + hybrid dispatch

Any K-quant, IQ, TQ, MX, or NV format without a MIL mul+sub lowering is still dequantized to f32 on the host at bake time (in a hybrid segment or with RLX_COREML_HOST_DEQUANT=1).

Weights with ≥ 10 elements go to weights/weight.bin (MILBlob format).
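The size rule can be expressed as a small predicate. This sketch only encodes the threshold stated above; the `WeightStorage` enum and `storage_for` are hypothetical names, and treating smaller weights as inline constants is an assumption, since the text does not say where they go.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum WeightStorage {
    /// Assumed destination for small weights (not stated in the README).
    Inline,
    /// Written to weights/weight.bin in MILBlob format.
    BlobFile,
}

/// Element-count threshold from the README.
pub const BLOB_THRESHOLD: usize = 10;

/// Decide storage from a tensor shape; an empty shape is a scalar (1 element).
pub fn storage_for(shape: &[usize]) -> WeightStorage {
    let elements: usize = shape.iter().product();
    if elements >= BLOB_THRESHOLD {
        WeightStorage::BlobFile
    } else {
        WeightStorage::Inline
    }
}
```

A 3x3 tensor (9 elements) stays below the threshold, while a 2x5 tensor (10 elements) goes to the blob file.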

Introspection: chip_info(), ane_available(), MLComputePlan routing (macOS 14.4+).

Not in scope for ANE

Training / backward ops, control flow (if / while / scan), fusion internals (Fused*, ElementwiseRegion), CustomFn, QMatMul / QConv2d (int8 I/O), Gaussian splat family.

Output layout note

CoreML may pad rank-4 outputs for ANE alignment; the shim copies via stride-aware indexing, not flat memcpy.
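Why a flat memcpy is not enough can be shown with a small stride-aware copy. The real shim is Objective-C; this Rust sketch only illustrates the indexing, with a hypothetical function name and strides expressed in elements rather than bytes.

```rust
/// Copy a rank-4 tensor out of a possibly padded buffer into dense
/// row-major order. `strides` gives the element step for each dimension.
pub fn copy_strided_rank4(src: &[f32], shape: [usize; 4], strides: [usize; 4]) -> Vec<f32> {
    let [n, c, h, w] = shape;
    let mut dst = Vec::with_capacity(n * c * h * w);
    for i0 in 0..n {
        for i1 in 0..c {
            for i2 in 0..h {
                for i3 in 0..w {
                    // Address each element through the strides, so padding
                    // between rows or planes is skipped.
                    let offset = i0 * strides[0]
                        + i1 * strides[1]
                        + i2 * strides[2]
                        + i3 * strides[3];
                    dst.push(src[offset]);
                }
            }
        }
    }
    dst
}
```

With shape [1, 1, 2, 3] and rows padded to 4 elements (strides [8, 8, 4, 1]), the buffer [1, 2, 3, 0, 4, 5, 6, 0] copies out as [1, 2, 3, 4, 5, 6]; a flat copy of the first six elements would have carried the padding zero into the result.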