SAM v1 — Meta's Segment Anything image-segmentation model.
Phasing
Phase 1 (this commit) lands the image encoder end-to-end:
- Host-side preprocessing (resize-to-1024, ImageNet pixel normalization, zero-pad to 1024×1024, patch embedding via Conv2d-as-matmul).
- IR graph for the 12 encoder blocks with windowed + global attention, decomposed relative position embeddings, plain GELU-tanh MLPs, pre-norm residual structure.
- IR neck (Conv2d 1×1 → LN2d → Conv2d 3×3 → LN2d → [256, 64, 64]).
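
The host-side normalization and padding steps can be sketched as follows. This is a minimal illustration, not the code in this commit: the function names are hypothetical, the input is assumed to be an already-resized CHW f32 buffer in the 0..255 range, and the constants are SAM's published pixel mean and std. Resampling and the patch-embedding matmul are left out.

```rust
// SAM's per-channel pixel mean and std (RGB, 0..255 scale).
const PIXEL_MEAN: [f32; 3] = [123.675, 116.28, 103.53];
const PIXEL_STD: [f32; 3] = [58.395, 57.12, 57.375];
const TARGET: usize = 1024;

/// Size after scaling the longest side to `TARGET`, keeping the aspect ratio.
/// (Hypothetical helper; rounding is to the nearest integer.)
fn resized_dims(h: usize, w: usize) -> (usize, usize) {
    let scale = TARGET as f32 / h.max(w) as f32;
    let new_h = (h as f32 * scale + 0.5) as usize;
    let new_w = (w as f32 * scale + 0.5) as usize;
    (new_h, new_w)
}

/// Normalize an already-resized CHW image and zero-pad it on the bottom and
/// right to 3 x 1024 x 1024. `pixels` holds 3 * h * w values.
fn normalize_and_pad(pixels: &[f32], h: usize, w: usize) -> Vec<f32> {
    assert!(h <= TARGET && w <= TARGET);
    assert_eq!(pixels.len(), 3 * h * w);
    // Padding stays at 0.0 because it is applied after normalization.
    let mut out = vec![0.0f32; 3 * TARGET * TARGET];
    for c in 0..3 {
        for y in 0..h {
            for x in 0..w {
                let v = pixels[(c * h + y) * w + x];
                out[(c * TARGET + y) * TARGET + x] = (v - PIXEL_MEAN[c]) / PIXEL_STD[c];
            }
        }
    }
    out
}
```

Padding after normalization means the padded region is exactly zero in the normalized space, rather than a negative constant.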
Phase 1 status: numerical parity, to within floating-point tolerance, with
candle's ImageEncoderViT::forward() on real sam_vit_b_01ec64.safetensors
weights: max |Δ| = 7.15e-6 on the 1×256×64×64 image embeddings
(full 12-layer ViT-B at 1024×1024 input). The Phase 1 bisect env vars
remain in tests/sam_parity.rs for future debugging:
- RLX_SAM_DEBUG_DEPTH=N: run only the first N encoder blocks.
- RLX_SAM_DEBUG_NO_RELPOS=1: disable decomposed relative pos.
- RLX_SAM_DEBUG_FORCE_GLOBAL=1: force every block to use global attn.
- RLX_SAM_DEBUG_ZERO_RELH=1 / RLX_SAM_DEBUG_ZERO_RELW=1: zero a single rel_pos axis (data only; the matmul + add still execute).
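
One way a test harness might read these switches is sketched below. The struct, its field names, and the rule that only the value "1" enables a flag are assumptions for illustration; the actual parsing in tests/sam_parity.rs may differ.

```rust
use std::env;

/// Hypothetical container for the bisect switches; names are illustrative.
#[derive(Debug, Default, PartialEq)]
struct SamDebugFlags {
    depth: Option<usize>,
    no_relpos: bool,
    force_global: bool,
    zero_relh: bool,
    zero_relw: bool,
}

impl SamDebugFlags {
    /// Build the flags from any key -> value lookup, so tests can supply
    /// values without touching the process environment.
    fn from_lookup(get: impl Fn(&str) -> Option<String>) -> Self {
        // Assumption: a boolean switch is on only when set to exactly "1".
        let on = |key: &str| get(key).as_deref() == Some("1");
        Self {
            // An unset or unparsable depth means "run all blocks".
            depth: get("RLX_SAM_DEBUG_DEPTH").and_then(|v| v.parse::<usize>().ok()),
            no_relpos: on("RLX_SAM_DEBUG_NO_RELPOS"),
            force_global: on("RLX_SAM_DEBUG_FORCE_GLOBAL"),
            zero_relh: on("RLX_SAM_DEBUG_ZERO_RELH"),
            zero_relw: on("RLX_SAM_DEBUG_ZERO_RELW"),
        }
    }

    /// Read the flags from the real process environment.
    fn from_env() -> Self {
        Self::from_lookup(|key| env::var(key).ok())
    }
}
```

With no variables set, every field keeps its default, so a normal test run exercises the full encoder.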
Phase 2 (next commit) lands the prompt encoder + mask decoder:
- Random Fourier positional encoding, point/box/mask embeddings.
- Two-way transformer between prompt tokens and image embeddings.
- ConvTranspose2d upscaling (IR) + hypernetwork MLPs for mask + IoU output.
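
For orientation, random Fourier positional encoding of a single point can be sketched as below. Phase 2 has not landed, so this is not the project's implementation: the function name and signature are hypothetical, and the Gaussian frequency matrix is passed in by the caller so the example needs no RNG crate.

```rust
use std::f32::consts::PI;

/// Encode one point with random Fourier features.
/// `x` and `y` are normalized to [0, 1]. `gauss` is a 2 x n row-major matrix
/// of Gaussian-sampled frequencies (row 0 for x, row 1 for y).
/// Returns 2 * n values: n sines followed by n cosines.
fn fourier_encode(x: f32, y: f32, gauss: &[f32], n: usize) -> Vec<f32> {
    assert_eq!(gauss.len(), 2 * n);
    // Map coordinates from [0, 1] to [-1, 1].
    let (cx, cy) = (2.0 * x - 1.0, 2.0 * y - 1.0);
    let mut out = vec![0.0f32; 2 * n];
    for j in 0..n {
        // Project onto the j-th random frequency, then scale by 2π.
        let proj = 2.0 * PI * (cx * gauss[j] + cy * gauss[n + j]);
        out[j] = proj.sin();
        out[n + j] = proj.cos();
    }
    out
}
```

A point at the image center maps to the origin after rescaling, so its encoding is all zeros in the sine half and all ones in the cosine half regardless of the frequency matrix.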
Weight key convention matches Meta / candle exactly so the
lmz/candle-sam safetensors checkpoints load with no remapping.