pub use Array;
pub use ;
pub use ;
pub use ;
pub use IntoShape;
pub use Stream;
pub use version;
/// Model IO — safetensors + GGUF load/save (local files only).
/// Hand-written `core::arch` SIMD kernels for the host-CPU numeric
/// loops mlxrs runs itself (audio DSP / preprocessing) — *not* the
/// MLX-delegated tensor math. Scalar reference + `aarch64` NEON
/// backend behind a runtime-detection dispatcher.
///
/// **Always compiled** so any caller (e.g. `audio`) can rely on it —
/// there is no `simd` cargo feature. Whether the NEON backend runs is
/// gated purely on `#[cfg(target_arch = "aarch64")]` + runtime CPU
/// detection; on every other target the dispatchers route to the
/// always-compiled scalar path. The `--cfg mlxrs_force_scalar` build-time
/// escape hatch forces the scalar path even on a NEON-capable host.
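/**
A minimal sketch of the dispatch pattern described above, assuming a
hypothetical sum-of-squares kernel. The names `sum_sq`, `sum_sq_scalar`, and
`sum_sq_neon` are illustrative and are not taken from this crate; only the
`mlxrs_force_scalar` cfg name comes from the text above.

```rust
/// Scalar reference kernel (compiled on every target).
fn sum_sq_scalar(xs: &[f32]) -> f32 {
    xs.iter().map(|x| x * x).sum()
}

/// NEON backend: compiled only on aarch64 and only when the
/// `mlxrs_force_scalar` escape cfg is absent.
#[cfg(all(target_arch = "aarch64", not(mlxrs_force_scalar)))]
#[target_feature(enable = "neon")]
unsafe fn sum_sq_neon(xs: &[f32]) -> f32 {
    use core::arch::aarch64::*;
    let mut acc = vdupq_n_f32(0.0);
    let mut chunks = xs.chunks_exact(4);
    for c in &mut chunks {
        // Load four lanes and accumulate acc += v * v.
        let v = vld1q_f32(c.as_ptr());
        acc = vfmaq_f32(acc, v, v);
    }
    // Horizontal add of the four lanes, then the scalar tail.
    let mut total = vaddvq_f32(acc);
    for x in chunks.remainder() {
        total += x * x;
    }
    total
}

/// Dispatcher: takes the NEON path only if it was compiled in and the
/// CPU reports NEON at runtime; otherwise falls through to scalar.
pub fn sum_sq(xs: &[f32]) -> f32 {
    #[cfg(all(target_arch = "aarch64", not(mlxrs_force_scalar)))]
    {
        if std::arch::is_aarch64_feature_detected!("neon") {
            // SAFETY: NEON support was just confirmed at runtime.
            return unsafe { sum_sq_neon(xs) };
        }
    }
    sum_sq_scalar(xs)
}
```

Callers only ever see `sum_sq`; building with
`RUSTFLAGS="--cfg mlxrs_force_scalar"` removes the NEON branch entirely, so
the scalar path runs even on a NEON-capable host.
*/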
/// Function transforms — autograd (`value_and_grad`/`grad`/`vjp`/`jvp`),
/// custom-VJP overrides, gradient checkpointing, and bulk eval / async-eval.
/// Mirrors `mlx-swift`'s `MLX.Transforms` (`Transforms.swift`,
/// `Transforms+Eval.swift`, `Transforms+Grad.swift`) and `mlx.core`
/// autograd. Always compiled (no feature gate).
/// Language Model (LM) — text-only inference. Stub in M1; port lands in M3
/// (loader, tokenizer, sampling, generation loop). Per-model architectures
/// (Llama/Qwen/Mistral/etc.) are added per use case, not bulk-ported from
/// mlx-lm/models/.
/// Vision-Language Model (VLM) — multimodal inference. Stub in M1; port lands
/// in M4 (image processors, chat-template shims, loader). Per-model
/// architectures (Qwen-VL/LLaVA/etc.) are added per use case, not bulk-ported
/// from mlx-vlm/models/.
/// Audio (TTS/STT/STS) — speech inference. Stub in M1; port lands in M5
/// (audio I/O, pipeline scaffolding). Per-model architectures
/// (Whisper/Sesame/etc.) are added per use case, not bulk-ported from
/// mlx-audio/models/.
/// Embedding utilities — pooling strategies (+ unified dispatcher),
/// parameterized normalization, fused post-pool LayerNorm/RMSNorm
/// (applied to the pooled sentence vector, matching swift `Pooling`'s
/// `pool → optional layer/rms-norm → optional matryoshka truncation →
/// optional L2-normalize` pipeline; *not* token-level pre-pool
/// normalization, which is part of the model architecture and out of
/// scope), `sentence-transformers` pooling-config parsing, and similarity.
/// Ported (M3) from `mlx-embeddings` (`models/pooling.py`,
/// `models/base.py`, `utils.py`) and swift `MLXEmbedders`
/// (`Pooling.swift`, `MLXArray+Helper.swift`). Per-model architectures
/// (BERT/XLM-RoBERTa/Qwen3-embed/etc.), loaders, tokenizer integration,
/// model-id registries, and `generate`/`load` are out of scope
/// (no-model-arch rule).
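/**
A rough illustration of the post-pool pipeline order described above, written
against plain `Vec<f32>` rather than MLX arrays. The function names
(`mean_pool`, `l2_normalize`, `pool_sentence`) and their signatures are
hypothetical, and the optional LayerNorm/RMSNorm step is omitted for brevity.

```rust
/// Mean-pool token vectors, counting only tokens whose mask entry is non-zero.
fn mean_pool(tokens: &[Vec<f32>], mask: &[u8]) -> Vec<f32> {
    let dim = tokens.first().map_or(0, |t| t.len());
    let mut out = vec![0.0f32; dim];
    let mut count = 0.0f32;
    for (tok, &m) in tokens.iter().zip(mask) {
        if m != 0 {
            for (o, x) in out.iter_mut().zip(tok) {
                *o += x;
            }
            count += 1.0;
        }
    }
    if count > 0.0 {
        for o in &mut out {
            *o /= count;
        }
    }
    out
}

/// Scale a vector to unit L2 norm in place (zero vectors are left as-is).
fn l2_normalize(v: &mut [f32]) {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        for x in v.iter_mut() {
            *x /= norm;
        }
    }
}

/// pool -> optional matryoshka truncation -> optional L2-normalize.
fn pool_sentence(
    tokens: &[Vec<f32>],
    mask: &[u8],
    truncate_dim: Option<usize>,
    normalize: bool,
) -> Vec<f32> {
    let mut v = mean_pool(tokens, mask);
    if let Some(d) = truncate_dim {
        // Matryoshka truncation keeps only the leading `d` dimensions.
        v.truncate(d);
    }
    if normalize {
        l2_normalize(&mut v);
    }
    v
}
```

Truncation runs before normalization so that the shortened vector, not the
full-width one, ends up with unit length.
*/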
/// Tokenizer support — HF `tokenizers` integration, streaming detokenizers,
/// chat-template rendering, and tool-call parsing. Port lands in M3, ported
/// from `mlx-lm`'s `tokenizer_utils.py` + `chat_templates/` + `tool_parsers/`
/// and cross-referenced against `mlx-swift-lm`'s `MLXLMCommon` tokenizer /
/// tool abstractions. Model-specific tokenizer registration (the Python
/// `NewlineTokenizer`) is per-model architecture and intentionally out of
/// scope. Enabled transitively by `lm`, `vlm`, and `embeddings`.
/// Operator overloads (`&a + &b`, `&a - &b`, `&a * &b`, `&a / &b`, `-&a`).
/// Gated; OFF by default. Panics on shape/dtype error — see module docs.
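/**
A minimal sketch of the by-reference operator pattern, using a hypothetical
`Toy` type in place of the crate's array type. The `try_add` method and the
panic message are assumptions for illustration; the point is only that the
operator forwards to a fallible method and panics where that method would
return an error.

```rust
use std::ops::{Add, Neg};

#[derive(Debug, Clone, PartialEq)]
struct Toy {
    data: Vec<f32>,
}

impl Toy {
    /// Fallible element-wise add: errors on a length mismatch.
    fn try_add(&self, rhs: &Toy) -> Result<Toy, String> {
        if self.data.len() != rhs.data.len() {
            return Err(format!(
                "shape mismatch: {} vs {}",
                self.data.len(),
                rhs.data.len()
            ));
        }
        let data = self.data.iter().zip(&rhs.data).map(|(a, b)| a + b).collect();
        Ok(Toy { data })
    }
}

/// `&a + &b`: borrows both operands and panics if `try_add` fails.
impl Add<&Toy> for &Toy {
    type Output = Toy;
    fn add(self, rhs: &Toy) -> Toy {
        self.try_add(rhs).expect("operator `+` failed")
    }
}

/// `-&a`: element-wise negation; cannot fail for this toy type.
impl Neg for &Toy {
    type Output = Toy;
    fn neg(self) -> Toy {
        Toy { data: self.data.iter().map(|x| -x).collect() }
    }
}
```

Implementing the traits on references leaves both operands usable after the
expression; code that must not panic can call the fallible method directly.
*/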