Pre-load memory estimation (plan #35).
Borrowed from MAX’s max/python/max/pipelines/ pattern: model
peak memory is estimated before weights load. On Apple
Silicon this matters disproportionately — unified memory is
shared with the OS, so a model that “would fit on a 96 GB
Mac” can still OOM if you spawn it during a heavy Spotlight
re-index.
Three components:
- Activation working set: peak arena bytes. Already computed by rlx_opt::memory::plan_memory(&graph); we just expose it.
- Weight bytes: sum of registered weights from a WeightRegistry. Aliases (tied embeddings) don’t double-count.
- Per-batch input bytes: bytes the user is going to hand in via compiled.run(). Driven by graph inputs.
MemoryEstimate::peak_bytes is the sum. MemoryEstimate::fits_in takes a budget and returns the gating decision plus a structured reason.
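The three-part sum and the budget check can be sketched as follows. This is a minimal illustration, not the crate's actual definitions: the field names, the Result-based return type of fits_in, the contents of MemoryDeficit, and the weight_bytes helper with its (name, storage id, size) tuples are all assumptions made for the example.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for MemoryEstimate; field names are assumptions.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct MemoryEstimate {
    pub activation_bytes: u64,
    pub weight_bytes: u64,
    pub input_bytes: u64,
}

/// Hypothetical stand-in for MemoryDeficit: the structured reason a
/// budget check failed.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct MemoryDeficit {
    pub required_bytes: u64,
    pub budget_bytes: u64,
    pub short_by_bytes: u64,
}

impl MemoryEstimate {
    /// Peak is the sum of the three components (saturating, so an
    /// absurdly large estimate cannot wrap around).
    pub fn peak_bytes(&self) -> u64 {
        self.activation_bytes
            .saturating_add(self.weight_bytes)
            .saturating_add(self.input_bytes)
    }

    /// Gating decision: Ok(()) if the peak fits, otherwise the deficit.
    pub fn fits_in(&self, budget_bytes: u64) -> Result<(), MemoryDeficit> {
        let required = self.peak_bytes();
        if required <= budget_bytes {
            Ok(())
        } else {
            Err(MemoryDeficit {
                required_bytes: required,
                budget_bytes,
                short_by_bytes: required - budget_bytes,
            })
        }
    }
}

/// Sum weight sizes, counting each underlying buffer once.
/// Each entry is (name, storage id, size in bytes); aliases such as
/// tied embeddings share a storage id. This layout is an assumption.
pub fn weight_bytes(weights: &[(&str, u32, u64)]) -> u64 {
    let mut seen = HashSet::new();
    weights
        .iter()
        .filter(|(_, storage_id, _)| seen.insert(*storage_id))
        .map(|(_, _, bytes)| *bytes)
        .sum()
}
```

With two 100-byte entries sharing one storage id plus a separate 50-byte weight, weight_bytes yields 150 rather than 250, which is the "aliases don't double-count" behaviour described above.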
Structs§
- MemoryDeficit
- MemoryEstimate
- MoeOffloadEstimate: Estimate peak memory for running graph on a session bound to registry. Pure analysis: runs the memory planner internally and queries the registry for weight bytes; doesn’t compile or execute. MoE offload sizing (TIDE enable_predictive_expert_offload).
Functions§
- available_unified_memory: Available unified-memory budget on the running machine. On macOS reads hw.memsize via sysctl; everywhere else returns None so callers can fall back to a user-supplied budget.
- estimate
- estimate_moe_offload: Compute GPU expert budget from a memory budget (unified RAM or VRAM).
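The macOS-or-None behaviour of available_unified_memory, together with the caller-side fallback, might look roughly like the sketch below. It shells out to the sysctl command-line tool for simplicity; the real function may well call the sysctl API directly, and the names parse_memsize, available_unified_memory_sketch, and resolve_budget are hypothetical. Note that hw.memsize reports installed physical memory, not memory currently free.

```rust
use std::process::Command;

/// Parse the output of `sysctl -n hw.memsize`: a decimal byte count,
/// usually followed by a newline.
pub fn parse_memsize(output: &str) -> Option<u64> {
    output.trim().parse::<u64>().ok()
}

/// Hypothetical sketch: physical memory in bytes on macOS, None on
/// every other platform or if the sysctl call fails.
pub fn available_unified_memory_sketch() -> Option<u64> {
    if !cfg!(target_os = "macos") {
        return None;
    }
    let out = Command::new("sysctl")
        .args(&["-n", "hw.memsize"])
        .output()
        .ok()?;
    if !out.status.success() {
        return None;
    }
    parse_memsize(&String::from_utf8_lossy(&out.stdout))
}

/// Caller-side fallback: prefer the detected value, otherwise use a
/// user-supplied budget (for example from a CLI flag or config file).
pub fn resolve_budget(detected: Option<u64>, user_supplied: Option<u64>) -> Option<u64> {
    detected.or(user_supplied)
}
```

A caller would typically pass available_unified_memory_sketch() as the detected value and only reject a load when resolve_budget returns a budget the estimate does not fit in.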