Pre-load memory estimation (plan #35).
Borrowed from MAX’s max/python/max/pipelines/ pattern: model
peak memory is estimated before weights load. On Apple
Silicon this matters disproportionately — unified memory is
shared with the OS, so a model that “would fit on a 96 GB
Mac” can still OOM if you spawn it during a heavy Spotlight
re-index.
Three components:
- Activation working set: peak arena bytes. Already computed by rlx_opt::memory::plan_memory(&graph); we just expose it.
- Weight bytes: sum of registered weights from a WeightRegistry. Aliases (tied embeddings) don’t double-count.
- Per-batch input bytes: bytes the user is going to hand in via compiled.run(). Driven by graph inputs.
MemoryEstimate::peak_bytes is the sum. MemoryEstimate::fits_in takes a budget and returns the gating decision plus a structured reason.
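The sum and the gating check described above can be sketched as follows. This is a minimal illustration, not the crate's actual definitions: the type names EstimateSketch and DeficitSketch, their field names, and the Result-based return shape are all assumptions made for the example.

```rust
/// Hypothetical stand-in for MemoryEstimate; the field names are assumed.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct EstimateSketch {
    pub activation_bytes: u64, // peak arena bytes from the memory planner
    pub weight_bytes: u64,     // registered weights, aliases counted once
    pub input_bytes: u64,      // per-batch inputs handed in at run time
}

/// Hypothetical structured reason reported when the estimate does not fit.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct DeficitSketch {
    pub peak_bytes: u64,
    pub budget_bytes: u64,
    pub short_by_bytes: u64,
}

impl EstimateSketch {
    /// Peak is the plain sum of the three components.
    /// Saturating adds keep a pathological input from overflowing.
    pub fn peak_bytes(&self) -> u64 {
        self.activation_bytes
            .saturating_add(self.weight_bytes)
            .saturating_add(self.input_bytes)
    }

    /// Ok(()) when the peak fits in the budget, otherwise the deficit.
    pub fn fits_in(&self, budget_bytes: u64) -> Result<(), DeficitSketch> {
        let peak = self.peak_bytes();
        if peak <= budget_bytes {
            Ok(())
        } else {
            Err(DeficitSketch {
                peak_bytes: peak,
                budget_bytes,
                short_by_bytes: peak - budget_bytes,
            })
        }
    }
}
```

Returning the shortfall alongside the decision lets a caller report how much memory to free rather than only refusing to load.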
Structs§
- MemoryDeficit
- MemoryEstimate
- MoeOffloadEstimate
  Estimate peak memory for running graph on a session bound to registry. Pure analysis: runs the memory planner internally and queries the registry for weight bytes; doesn’t compile or execute. MoE offload sizing (TIDE enable_predictive_expert_offload).
Constants§
- DEFAULT_SOFT_MEMORY_FRACTION
  Default soft cap as a fraction of physical RAM (stay below OOM).
Functions§
- available_unified_memory
  Available unified-memory budget on the running machine. On macOS reads hw.memsize via sysctl; everywhere else returns None so callers can fall back to a user-supplied budget.
- estimate
- estimate_moe_offload
  Compute GPU expert budget from a memory budget (unified RAM or VRAM).
- llama_decode_bucket_compile_peak_bytes
  Conservative peak for one LLaMA decode graph compile with F32 params (3B class).
- llama_decode_oneshot_compile_peak_bytes
  Lazy per-step decode compile (GGUF on demand, no resident param cache).
- memory_headroom_bytes
  Bytes remaining before the soft budget (budget - current RSS).
- process_rss_bytes
  Current process resident set size, when available.
- soft_memory_budget_bytes
  Soft RSS budget: physical_ram * soft_memory_fraction(). Returns None when physical RAM is unknown (non-macOS without override).
- soft_memory_fraction
  Fraction of physical RAM treated as a soft working-set cap. Override with RLX_SOFT_MEMORY_FRACTION (e.g. 0.8).
- would_exceed_soft_budget
  True when current_rss + additional would exceed the soft budget. Unknown budget → false (do not block).
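The soft-budget arithmetic in the function summaries above can be sketched with pure helpers. The function names and signatures below are hypothetical and do not mirror the crate's API. The default fraction is passed in rather than guessed, and two behaviours are assumptions: overrides outside (0, 1] are rejected, and headroom saturates at zero.

```rust
/// Parse an override string such as "0.8". Falls back to `default` when the
/// value is unset, unparsable, or outside (0, 1] (the range check is assumed).
pub fn fraction_from(raw: Option<&str>, default: f64) -> f64 {
    raw.and_then(|s| s.trim().parse::<f64>().ok())
        .filter(|f| f.is_finite() && *f > 0.0 && *f <= 1.0)
        .unwrap_or(default)
}

/// Soft budget = physical_ram * fraction; None when physical RAM is unknown.
pub fn soft_budget(physical_ram: Option<u64>, fraction: f64) -> Option<u64> {
    physical_ram.map(|ram| (ram as f64 * fraction) as u64)
}

/// Bytes remaining before the soft budget (budget - current RSS).
/// Saturates at zero when RSS is already over budget (assumed behaviour).
pub fn headroom(budget: Option<u64>, current_rss: u64) -> Option<u64> {
    budget.map(|b| b.saturating_sub(current_rss))
}

/// True when current_rss + additional would exceed the soft budget.
/// An unknown budget never blocks.
pub fn would_exceed(budget: Option<u64>, current_rss: u64, additional: u64) -> bool {
    match budget {
        Some(b) => current_rss.saturating_add(additional) > b,
        None => false,
    }
}
```

In real use the override string would come from std::env::var("RLX_SOFT_MEMORY_FRACTION").ok(). Keeping the helpers free of environment and sysctl access makes the gating logic testable on any platform.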