mlxrs 0.1.0

Safe Rust bindings for Apple's MLX array framework, with LM, VLM, audio, and embeddings support

Safe Rust bindings for Apple MLX — core arrays & ops, plus LM, VLM, audio, and embeddings

Overview

mlxrs provides safe, idiomatic Rust bindings to MLX — Apple's array framework for machine learning on Apple silicon — through the mlx-c FFI layer. As in MLX, operations build a lazy computation graph and run only when a result is read (to_vec, item) or eval is called, preserving the unified-memory CPU/GPU execution model.
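The lazy model described above can be illustrated with a toy, std-only sketch. The `Lazy` enum and its `add` / `eval` methods are hypothetical stand-ins, not part of mlxrs or MLX. They only show that building an expression performs no arithmetic until a result is read.

```rust
use std::cell::Cell;
use std::rc::Rc;

/// Toy lazy expression node (hypothetical; not the mlxrs `Array` type).
enum Lazy {
    Leaf(Vec<f32>),
    Add(Rc<Lazy>, Rc<Lazy>),
}

impl Lazy {
    /// Records an addition in the graph without computing anything.
    fn add(a: &Rc<Lazy>, b: &Rc<Lazy>) -> Rc<Lazy> {
        Rc::new(Lazy::Add(Rc::clone(a), Rc::clone(b)))
    }

    /// Walks the graph and computes the values.
    /// `evals` counts how many additions were actually performed.
    fn eval(&self, evals: &Cell<usize>) -> Vec<f32> {
        match self {
            Lazy::Leaf(v) => v.clone(),
            Lazy::Add(a, b) => {
                let (x, y) = (a.eval(evals), b.eval(evals));
                evals.set(evals.get() + 1);
                x.iter().zip(y.iter()).map(|(p, q)| p + q).collect()
            }
        }
    }
}

/// Builds `a + b`, then returns (additions done before the read, result, additions done after).
fn demo() -> (usize, Vec<f32>, usize) {
    let evals = Cell::new(0);
    let a = Rc::new(Lazy::Leaf(vec![1.0, 2.0, 3.0]));
    let b = Rc::new(Lazy::Leaf(vec![10.0, 20.0, 30.0]));
    let c = Lazy::add(&a, &b); // graph node only, no arithmetic yet
    let before = evals.get();
    let out = c.eval(&evals); // reading the result forces evaluation
    (before, out, evals.get())
}

fn main() {
    let (before, out, after) = demo();
    println!("additions before read: {before}, after read: {after}, result: {out:?}");
}
```

The counter is zero until `eval` runs. In this toy, `eval` plays the role that reading a result or calling eval plays in the description above.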

On top of the core Array / Dtype / ops surface, optional features port the higher-level support of MLX's companion projects — mlx-lm, mlx-vlm, mlx-audio, and mlx-embeddings (loaders, tokenizers, KV-caches, samplers, quantization, generation loops, audio DSP, pooling).

Installation

Core only:

[dependencies]
mlxrs = "0.1"

Or enable one of the higher-level feature sets:

# language models
[dependencies]
mlxrs = { version = "0.1", features = ["lm"] }

# vision-language models (implies lm)
[dependencies]
mlxrs = { version = "0.1", features = ["vlm"] }

# audio (implies lm)
[dependencies]
mlxrs = { version = "0.1", features = ["audio"] }

# embedding utilities
[dependencies]
mlxrs = { version = "0.1", features = ["embeddings"] }

Quick start

use mlxrs::Array;

fn main() -> mlxrs::Result<()> {
    let a = Array::from_slice::<f32>(&[1.0, 2.0, 3.0], &[3])?;
    let b = Array::from_slice::<f32>(&[10.0, 20.0, 30.0], &[3])?;

    // Ops build a lazy graph; reading the result forces evaluation.
    let mut c = a.add(&b)?;
    assert_eq!(c.to_vec::<f32>()?, vec![11.0, 22.0, 33.0]);

    Ok(())
}

The fallible a.add(&b)? form is always available; operator overloads (&a + &b) are opt-in behind unstable-ops-overload (see Caveats).
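To make the trade-off between the two forms concrete, here is a minimal std-only sketch. `Toy`, `ShapeMismatch`, and `try_add` are hypothetical names invented for illustration; they do not reflect how mlxrs implements its operators. The sketch shows only the general pattern: a fallible method returns `Result`, while an operator whose output is a plain value has no way to report failure other than panicking.

```rust
use std::ops::Add;

/// Toy 1-D array (hypothetical stand-in; not the mlxrs `Array` type).
#[derive(Debug, PartialEq)]
struct Toy(Vec<f32>);

/// Error returned when operand lengths differ.
#[derive(Debug, PartialEq)]
struct ShapeMismatch {
    left: usize,
    right: usize,
}

impl Toy {
    /// Fallible addition: a length mismatch is reported as `Err`.
    fn try_add(&self, rhs: &Toy) -> Result<Toy, ShapeMismatch> {
        if self.0.len() != rhs.0.len() {
            return Err(ShapeMismatch { left: self.0.len(), right: rhs.0.len() });
        }
        Ok(Toy(self.0.iter().zip(&rhs.0).map(|(a, b)| a + b).collect()))
    }
}

/// Operator form: with `Output = Toy`, a mismatch can only surface as a panic.
impl Add for &Toy {
    type Output = Toy;

    fn add(self, rhs: Self) -> Toy {
        self.try_add(rhs).expect("shape mismatch in `+`")
    }
}

fn main() {
    let a = Toy(vec![1.0, 2.0]);
    let b = Toy(vec![10.0, 20.0]);
    let short = Toy(vec![1.0]);

    // Matching shapes: both forms agree.
    assert_eq!(&a + &b, Toy(vec![11.0, 22.0]));

    // Mismatched shapes: the fallible form lets the caller decide what to do.
    match a.try_add(&short) {
        Ok(_) => unreachable!("lengths differ"),
        Err(e) => println!("handled: {e:?}"),
    }
    // `&a + &short` would panic here instead of returning an error.
}
```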

Features

Higher-level surfaces are off by default — enable what you need:

  • lm — language models: tokenizers (BPE / SPM / chat templates / tool parsing), KV-caches, samplers + logits processors, quantization, LoRA/DoRA, optimizers, and the generation loop.
  • vlm — vision-language models (implies lm): image preprocessing, prompt assembly, multimodal generation.
  • audio — audio (implies lm): STFT/mel DSP, WAV I/O, STT/TTS, playback.
  • embeddings — embedding-model loading, pooling, and the encode pipeline.

Finer-grained flags (individual tokenizer-*, gguf, llguidance, unstable-ops-overload) are listed in mlxrs/Cargo.toml and the API docs.

Platform support

mlxrs targets aarch64-apple-darwin (Apple silicon). Support for other targets (x86_64-apple-darwin, Linux + CUDA) and for distributed execution is on the roadmap.
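A downstream crate that also builds on other platforms may want to check for the supported target before taking an MLX-backed code path. The sketch below uses only the standard `cfg!` macro; the function name `is_apple_silicon_macos` is hypothetical and not part of mlxrs.

```rust
/// Returns true when this code was compiled for Apple silicon macOS,
/// i.e. the `aarch64-apple-darwin` target.
fn is_apple_silicon_macos() -> bool {
    cfg!(all(target_arch = "aarch64", target_os = "macos"))
}

fn main() {
    if is_apple_silicon_macos() {
        println!("target matches aarch64-apple-darwin");
    } else {
        println!("unsupported target: skip or replace MLX-backed code paths");
    }
}
```

The same predicate can be used as a `#[cfg(...)]` attribute to exclude code at compile time, and Cargo accepts it in a `[target.'cfg(...)'.dependencies]` table so the dependency is pulled in only on matching targets.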

Caveats

  • Array is !Send and !Sync — single-thread use only. The underlying C++ array_desc is shared by Array::try_clone (refcount-bumped) and mutates non-atomic state internally, so cross-thread sharing is unsound without external synchronization. Array does not implement Clone; the only duplication is the fallible try_clone. There is no shared-array wrapper — to use array data on another thread, extract owned data via to_vec / item (which yield Send values) and move that.
  • GPU work is serialized on a single stream per thread: the internal default stream is per-thread and maps to one Metal command queue per thread. mlxrs exposes a public Stream/Device API, but Stream is a thread-affine, non-RAII handle. It is !Send + !Sync, its Drop frees only the mlx-c handle box (mlx has no per-stream teardown), and Stream::new_on permanently grows mlx's process-global stream state. Allocate a bounded set of streams at startup, never one per request or task. The only reclaim path is the bulk, end-of-thread Stream::clear_current_thread_streams(), intended as a worker's last mlx action before it exits; there is no per-value lifetime control.
  • Async Metal kernel failures bypass Result and abort the process — the rc/sentinel chain only catches synchronous errors. A set_terminate-style recovery shim is not implementable (mlx-c exposes no hook); only diagnostics are planned.
  • Each thread that calls into mlxrs allocates a GPU stream that lives until process exit. mlxrs is designed for a small, long-lived worker pool — patterns that spawn a fresh OS thread per request will accumulate streams without bound.
  • Operator overloads (&a + &b, -&a, etc.) are gated behind the unstable-ops-overload feature flag and panic on shape/dtype errors. Library authors must NEVER enable it, even transitively: Cargo unifies features across the dependency graph, so a library that turns it on turns it on for every crate in the build. End-user binaries may opt in for prototyping. The fallible a.add(&b)? form is always available and is the load-bearing API.
  • Per-model architectures for the lm / vlm / audio / embeddings features are added per-usecase rather than bulk-ported from the upstream Python projects. These features ship the support surface (loaders, tokenizers, pooling, generation loops, processors, audio I/O) — not the model implementations.
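The threading caveats above suggest a particular shape for concurrent programs: a small, fixed set of long-lived workers, with thread-affine values kept inside each worker and only owned data sent between threads. The following std-only sketch models that shape without calling mlxrs. `ThreadLocalArray` and `run_pool` are hypothetical names, the `Rc` is there only to make the stand-in type `!Send`, and the doubling step stands in for real array work. Stream allocation is not modeled.

```rust
use std::rc::Rc;
use std::sync::mpsc;
use std::thread;

/// Hypothetical thread-affine value standing in for an array.
/// Holding an `Rc` makes it `!Send`, so it cannot leave its worker thread.
struct ThreadLocalArray(Rc<Vec<f32>>);

impl ThreadLocalArray {
    /// Copies the data out as an owned, `Send` vector.
    fn to_vec(&self) -> Vec<f32> {
        self.0.as_ref().clone()
    }
}

/// Runs `jobs` on a fixed pool of `workers` threads; results come back in job order.
fn run_pool(workers: usize, jobs: Vec<Vec<f32>>) -> Vec<Vec<f32>> {
    assert!(workers > 0, "pool needs at least one worker");
    let n = jobs.len();
    let (result_tx, result_rx) = mpsc::channel::<(usize, Vec<f32>)>();
    let mut job_txs = Vec::new();
    let mut handles = Vec::new();

    // Spawn the whole pool up front; no thread is created per job.
    for _ in 0..workers {
        let (job_tx, job_rx) = mpsc::channel::<(usize, Vec<f32>)>();
        let result_tx = result_tx.clone();
        job_txs.push(job_tx);
        handles.push(thread::spawn(move || {
            for (idx, input) in job_rx {
                // The thread-affine value is created, used, and dropped on this thread.
                let arr = ThreadLocalArray(Rc::new(input.iter().map(|x| x * 2.0).collect()));
                // Only owned, `Send` data crosses the channel.
                result_tx.send((idx, arr.to_vec())).expect("result receiver alive");
            }
        }));
    }
    drop(result_tx); // workers hold the remaining senders

    // Round-robin dispatch over the fixed pool.
    for (idx, job) in jobs.into_iter().enumerate() {
        job_txs[idx % workers].send((idx, job)).expect("worker alive");
    }
    drop(job_txs); // closing the job channels lets the workers finish

    let mut out = vec![Vec::new(); n];
    for (idx, values) in result_rx {
        out[idx] = values;
    }
    for handle in handles {
        handle.join().expect("worker panicked");
    }
    out
}

fn main() {
    let out = run_pool(2, vec![vec![1.0, 2.0], vec![3.0]]);
    println!("{out:?}");
}
```

Because the number of threads is fixed when the pool is created, any per-thread resource is allocated at most `workers` times, regardless of how many jobs are submitted.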

License

mlxrs is licensed under the terms of both the MIT license and the Apache License (Version 2.0).

See LICENSE-APACHE, LICENSE-MIT for details.

Copyright (c) 2026 FinDIT Studio authors.