rlx-sam 0.2.4

Segment Anything Model (SAM v1) for RLX

SAM v1 — Meta's Segment Anything image-segmentation model.

Phasing

Phase 1 (this commit) lands the image encoder end-to-end:

  • Host-side preprocessing (resize-to-1024, ImageNet pixel normalization, zero-pad to 1024×1024, patch embedding via Conv2d-as-matmul).
  • IR graph for the 12 encoder blocks with windowed + global attention, decomposed relative position embeddings, plain GELU-tanh MLPs, pre-norm residual structure.
  • IR neck (Conv2d 1×1 → LN2d → Conv2d 3×3 → LN2d → [256, 64, 64]).
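The host-side steps in the first bullet can be sketched in plain Rust. This is an illustrative sketch, not the crate's code: the function names are hypothetical, the longest-side resize rule and the mean/std constants are assumed to follow Meta's reference SAM preprocessing, and the resampling itself is left out. `patch_rows` shows the Conv2d-as-matmul idea: with stride equal to kernel size, each patch becomes one row of a matrix that is multiplied by the flattened kernel.

```rust
/// Per-channel mean and std on the 0..255 scale (ImageNet statistics times 255).
/// Assumed to match Meta's reference SAM preprocessing.
const PIXEL_MEAN: [f32; 3] = [123.675, 116.28, 103.53];
const PIXEL_STD: [f32; 3] = [58.395, 57.12, 57.375];

/// Size after scaling the longest side to `target`, keeping the aspect ratio.
fn resized_dims(h: usize, w: usize, target: usize) -> (usize, usize) {
    let scale = target as f32 / h.max(w) as f32;
    let round = |x: usize| (x as f32 * scale + 0.5) as usize;
    (round(h), round(w))
}

/// Normalize an already-resized HWC RGB image and zero-pad it on the bottom
/// and right into a CHW buffer of shape [3, target, target].
fn normalize_and_pad(rgb: &[u8], h: usize, w: usize, target: usize) -> Vec<f32> {
    assert!(h <= target && w <= target);
    assert_eq!(rgb.len(), h * w * 3);
    // Padding stays at 0.0 because it is applied after normalization.
    let mut out = vec![0.0f32; 3 * target * target];
    for y in 0..h {
        for x in 0..w {
            for c in 0..3 {
                let v = rgb[(y * w + x) * 3 + c] as f32;
                out[(c * target + y) * target + x] = (v - PIXEL_MEAN[c]) / PIXEL_STD[c];
            }
        }
    }
    out
}

/// Flatten non-overlapping p×p patches into rows of length 3*p*p, so that a
/// stride-p Conv2d becomes [num_patches, 3*p*p] x [3*p*p, embed_dim].
/// The (channel, row, column) order within a row matches a conv kernel
/// stored as [out, in, kh, kw] and flattened over its last three axes.
fn patch_rows(chw: &[f32], side: usize, p: usize) -> Vec<f32> {
    assert_eq!(side % p, 0);
    assert_eq!(chw.len(), 3 * side * side);
    let grid = side / p;
    let mut rows = Vec::with_capacity(grid * grid * 3 * p * p);
    for gy in 0..grid {
        for gx in 0..grid {
            for c in 0..3 {
                for py in 0..p {
                    for px in 0..p {
                        let (y, x) = (gy * p + py, gx * p + px);
                        rows.push(chw[(c * side + y) * side + x]);
                    }
                }
            }
        }
    }
    rows
}
```

With `target = 1024` and a patch size of 16 (assumed here, as in ViT-B SAM), `patch_rows` yields 64×64 = 4096 rows of length 768, which lines up with the 64×64 spatial grid of the neck output.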

Phase 1 status: numerical parity with candle's ImageEncoderViT::forward() on real sam_vit_b_01ec64.safetensors weights, with max |Δ| = 7.15e-6 on the 1×256×64×64 image embeddings (full 12-layer ViT-B at 1024×1024 input). The Phase 1 bisection environment variables remain in tests/sam_parity.rs for future debugging:

  • RLX_SAM_DEBUG_DEPTH=N: run only the first N encoder blocks
  • RLX_SAM_DEBUG_NO_RELPOS=1: disable the decomposed relative position embeddings
  • RLX_SAM_DEBUG_FORCE_GLOBAL=1: force every block to use global attention
  • RLX_SAM_DEBUG_ZERO_RELH=1 / RLX_SAM_DEBUG_ZERO_RELW=1: zero a single rel_pos axis (data only; the matmul and add still execute)
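A parity test driven by these switches might read them as sketched below. This is a hypothetical illustration, not the contents of tests/sam_parity.rs: the `DebugFlags` struct, the function names, and the rule that only the exact value "1" enables a flag are all assumptions. `max_abs_diff` computes the max |Δ| metric quoted above.

```rust
use std::env;

/// Hypothetical container for the bisection switches listed above.
#[derive(Debug, Default, PartialEq)]
struct DebugFlags {
    depth: Option<usize>,
    no_relpos: bool,
    force_global: bool,
    zero_relh: bool,
    zero_relw: bool,
}

/// Build the flags from any key -> value lookup, so the parsing can be
/// exercised without touching the process environment.
fn parse_flags(get: impl Fn(&str) -> Option<String>) -> DebugFlags {
    // Assumption: a boolean switch is on only when set to exactly "1".
    let on = |name: &str| get(name).as_deref() == Some("1");
    DebugFlags {
        // An unset or unparsable depth means "run every block".
        depth: get("RLX_SAM_DEBUG_DEPTH").and_then(|v| v.parse().ok()),
        no_relpos: on("RLX_SAM_DEBUG_NO_RELPOS"),
        force_global: on("RLX_SAM_DEBUG_FORCE_GLOBAL"),
        zero_relh: on("RLX_SAM_DEBUG_ZERO_RELH"),
        zero_relw: on("RLX_SAM_DEBUG_ZERO_RELW"),
    }
}

/// Read the switches from the real environment.
fn read_debug_flags() -> DebugFlags {
    parse_flags(|k| env::var(k).ok())
}

/// Largest element-wise absolute difference between two equal-length buffers.
fn max_abs_diff(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    a.iter().zip(b).map(|(x, y)| (x - y).abs()).fold(0.0, f32::max)
}
```

Bisecting then amounts to rerunning the test with, for example, RLX_SAM_DEBUG_DEPTH=1 and raising N until `max_abs_diff` between the two encoders' outputs jumps.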

Phase 2 (next commit) lands the prompt encoder + mask decoder:

  • Random Fourier positional encoding, point/box/mask embeddings.
  • Two-way transformer between prompt tokens and image embeddings.
  • ConvTranspose2d upscaling (IR) + hypernetwork MLPs for mask + IoU output.
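For orientation, the random Fourier positional encoding in the first bullet can be sketched as follows. It assumes the formulation used in Meta's reference SAM: coordinates normalized to [0, 1] are mapped to [-1, 1], projected through a fixed Gaussian matrix, scaled by 2π, and passed through sin and cos. The function name and the row-major matrix layout are hypothetical, and the Gaussian matrix is supplied by the caller (in practice it would come from the checkpoint).

```rust
use std::f32::consts::PI;

/// Encode one point with random Fourier features.
///
/// `x` and `y` are normalized to [0, 1]. `gauss` is a row-major
/// [2, num_feats] matrix: row 0 multiplies x, row 1 multiplies y.
/// Returns `[sin(..); num_feats]` followed by `[cos(..); num_feats]`.
fn fourier_pe(x: f32, y: f32, gauss: &[f32], num_feats: usize) -> Vec<f32> {
    assert_eq!(gauss.len(), 2 * num_feats);
    // Map [0, 1] to [-1, 1] before projecting.
    let (cx, cy) = (2.0 * x - 1.0, 2.0 * y - 1.0);
    let mut out = vec![0.0f32; 2 * num_feats];
    for j in 0..num_feats {
        let phase = 2.0 * PI * (cx * gauss[j] + cy * gauss[num_feats + j]);
        out[j] = phase.sin();
        out[num_feats + j] = phase.cos();
    }
    out
}
```

With 128 features per half this gives a 256-dimensional encoding, matching the 256-channel image embedding. The same function evaluated on a 64×64 grid of cell centers would give a dense positional encoding for the image embeddings.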

The weight-key naming convention matches Meta's and candle's exactly, so the lmz/candle-sam safetensors checkpoints load without remapping.