Module memory_estimate

Pre-load memory estimation (plan #35).

Borrowed from MAX’s max/python/max/pipelines/ pattern: model peak memory is estimated before weights load. On Apple Silicon this matters disproportionately — unified memory is shared with the OS, so a model that “would fit on a 96 GB Mac” can still OOM if you spawn it during a heavy Spotlight re-index.

Three components:

Activation working set — peak arena bytes. Already computed by rlx_opt::memory::plan_memory(&graph); we just expose it.
Weight bytes — sum of registered weights from a WeightRegistry. Aliases (tied embeddings) don’t double-count.
Per-batch input bytes — bytes the user is going to hand in via compiled.run(). Driven by graph inputs.
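
The alias rule for weight bytes can be illustrated with a minimal sketch. Everything here is hypothetical: `WeightEntrySketch`, its `storage_id` field, and `weight_bytes_sketch` are illustration names, not the real WeightRegistry API. The sketch assumes aliases can be recognised because they share an identifier for the underlying storage.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for one registered weight (not the real registry entry type).
struct WeightEntrySketch {
    /// Assumption: aliased weights (e.g. tied embeddings) share one storage id.
    storage_id: u64,
    /// Size of the underlying buffer in bytes.
    bytes: u64,
}

/// Sum weight bytes, counting each underlying buffer only once.
fn weight_bytes_sketch(entries: &[WeightEntrySketch]) -> u64 {
    let mut seen = HashSet::new();
    entries
        .iter()
        // `HashSet::insert` returns false for an id already seen, so aliases are skipped.
        .filter(|e| seen.insert(e.storage_id))
        .map(|e| e.bytes)
        .sum()
}
```

With a token embedding and an output head registered against the same storage id, the shared buffer contributes its size once rather than twice.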

MemoryEstimate::peak_bytes is the sum. MemoryEstimate::fits_in takes a budget and returns the gating decision plus a structured reason.
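
As a rough sketch of how the sum and the gating decision relate, the following uses hypothetical mirror types. The field names, the `Result` return shape, and the contents of the deficit are assumptions for illustration; the real MemoryEstimate and MemoryDeficit definitions are not shown on this page.

```rust
/// Hypothetical mirror of the estimate; field names are assumptions.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct EstimateSketch {
    activation_bytes: u64,
    weight_bytes: u64,
    input_bytes: u64,
}

/// Hypothetical structured reason reported when the budget is too small.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct DeficitSketch {
    required: u64,
    budget: u64,
    short_by: u64,
}

impl EstimateSketch {
    /// Peak bytes as the sum of the three components (saturating to avoid overflow).
    fn peak_bytes(&self) -> u64 {
        self.activation_bytes
            .saturating_add(self.weight_bytes)
            .saturating_add(self.input_bytes)
    }

    /// Gate on a budget: Ok(()) when the peak fits, otherwise the shortfall.
    fn fits_in(&self, budget: u64) -> Result<(), DeficitSketch> {
        let required = self.peak_bytes();
        if required <= budget {
            Ok(())
        } else {
            Err(DeficitSketch { required, budget, short_by: required - budget })
        }
    }
}
```

Returning the shortfall alongside the decision lets a caller report how much memory is missing instead of only refusing to load.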

Structs§

MemoryDeficit
MemoryEstimate
MoeOffloadEstimate: MoE offload sizing (TIDE enable_predictive_expert_offload).

Constants§

DEFAULT_SOFT_MEMORY_FRACTION: Default soft cap as a fraction of physical RAM (stay below OOM).

Functions§

available_unified_memory: Available unified-memory budget on the running machine. On macOS reads hw.memsize via sysctl; everywhere else returns None so callers can fall back to a user-supplied budget.
estimate: Estimate peak memory for running graph on a session bound to registry. Pure analysis: runs the memory planner internally and queries the registry for weight bytes; does not compile or execute.
estimate_moe_offload: Compute GPU expert budget from a memory budget (unified RAM or VRAM).
llama_decode_bucket_compile_peak_bytes: Conservative peak for one LLaMA decode graph compile with F32 params (3B class).
llama_decode_oneshot_compile_peak_bytes: Lazy per-step decode compile (GGUF on demand, no resident param cache).
memory_headroom_bytes: Bytes remaining before the soft budget (budget - current RSS).
process_rss_bytes: Current process resident set size, when available.
soft_memory_budget_bytes: Soft RSS budget: physical_ram * soft_memory_fraction(). Returns None when physical RAM is unknown (non-macOS without override).
soft_memory_fraction: Fraction of physical RAM treated as a soft working-set cap. Override with RLX_SOFT_MEMORY_FRACTION (e.g. 0.8).
would_exceed_soft_budget: True when current_rss + additional would exceed the soft budget. Unknown budget → false (do not block).
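
The soft-budget helpers above amount to a few lines of arithmetic, sketched here as pure functions so the rules are easy to see. All names are hypothetical, the default fraction of 0.75 is an assumed placeholder (the real value of DEFAULT_SOFT_MEMORY_FRACTION is not stated on this page), and both the range check on the override and the clamping of headroom to zero are assumptions rather than documented behaviour.

```rust
/// Assumed placeholder; the real DEFAULT_SOFT_MEMORY_FRACTION is not given here.
const ASSUMED_DEFAULT_FRACTION: f64 = 0.75;

/// Parse an override such as "0.8"; fall back to the default when the value
/// is missing, unparsable, or outside (0, 1] (the range check is an assumption).
fn fraction_from_sketch(raw: Option<&str>) -> f64 {
    raw.and_then(|s| s.trim().parse::<f64>().ok())
        .filter(|f| *f > 0.0 && *f <= 1.0)
        .unwrap_or(ASSUMED_DEFAULT_FRACTION)
}

/// Soft budget = physical RAM * fraction; None when physical RAM is unknown.
fn soft_budget_sketch(physical_ram: Option<u64>, fraction: f64) -> Option<u64> {
    physical_ram.map(|ram| (ram as f64 * fraction) as u64)
}

/// Headroom = budget - current RSS, clamped at zero (clamping is an assumption).
fn headroom_sketch(budget: Option<u64>, rss: u64) -> Option<u64> {
    budget.map(|b| b.saturating_sub(rss))
}

/// True when rss + additional would exceed the budget; unknown budget never blocks.
fn would_exceed_sketch(budget: Option<u64>, rss: u64, additional: u64) -> bool {
    match budget {
        Some(b) => rss.saturating_add(additional) > b,
        None => false,
    }
}
```

In a real caller, the raw override would come from something like `std::env::var("RLX_SOFT_MEMORY_FRACTION").ok()`, passed in via `as_deref()`. Keeping the environment read outside the arithmetic makes the rules testable without touching process state.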