FP32 embedding-table lookup with reverse-mode autograd backward.
Used by hf2q’s ADR-020 Track 1 multi-layer model on GpuTape (iter-11d).
Forward: output[b, h] = embedding[ids[b], h]
Backward: d_embedding[id, h] = Σ_{b: ids[b] == id} dy[b, h]
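The two formulas above can be checked against a plain CPU reference. The following is a minimal sketch, not the module's Metal implementation: the function names `embedding_lookup_ref` and `embedding_scatter_add_ref` are hypothetical, and it assumes a row-major `[vocab, hidden]` table, `u32` token ids, and a row-major `[batch, hidden]` upstream gradient `dy`.

```rust
/// CPU reference for the forward lookup: output[b, h] = embedding[ids[b], h].
/// Assumes `embedding` is row-major [vocab, hidden]; returns row-major [batch, hidden].
pub fn embedding_lookup_ref(embedding: &[f32], ids: &[u32], hidden: usize) -> Vec<f32> {
    let mut out = Vec::with_capacity(ids.len() * hidden);
    for &id in ids {
        let row = id as usize * hidden;
        // Copy the whole embedding row selected by this token id.
        out.extend_from_slice(&embedding[row..row + hidden]);
    }
    out
}

/// CPU reference for the backward pass:
/// d_embedding[id, h] = sum over all b with ids[b] == id of dy[b, h].
/// Rows whose id never appears in `ids` stay zero.
pub fn embedding_scatter_add_ref(
    dy: &[f32],
    ids: &[u32],
    vocab: usize,
    hidden: usize,
) -> Vec<f32> {
    assert_eq!(dy.len(), ids.len() * hidden, "dy must be [batch, hidden]");
    let mut d_embedding = vec![0.0f32; vocab * hidden];
    for (b, &id) in ids.iter().enumerate() {
        let src = &dy[b * hidden..(b + 1) * hidden];
        let row = id as usize * hidden;
        let dst = &mut d_embedding[row..row + hidden];
        // Accumulate: repeated ids add their gradients into the same row.
        for (d, s) in dst.iter_mut().zip(src) {
            *d += *s;
        }
    }
    d_embedding
}
```

A reference like this is mainly useful in tests: run the GPU path and the CPU path on the same small inputs and compare the results elementwise within a floating-point tolerance.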
The existing shaders/embedding.metal covers QUANTIZED 4-bit/6-bit
lookup for inference; this module is the FP32-everywhere variant
needed by the autograd tape.
The backward kernel is O(vocab × hidden × batch), which is fine for the test fixtures (vocab ≤ a few hundred). Production-scale performance (vocab = 150k+) is left as a follow-up optimization, using either atomic float adds or a sort-segment-sum.
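One way to arrive at that O(vocab × hidden × batch) cost is a gather-style backward, where each output cell `(id, h)` scans the whole batch for matching ids. The sketch below is a CPU illustration under that assumption. The source does not state how the kernel is structured, and the name `embedding_backward_gather_ref` is hypothetical. The layout is assumed row-major, with `dy` shaped `[batch, hidden]`.

```rust
/// Gather-style backward on the CPU: one independent accumulation per
/// (id, h) output cell, each scanning the full batch.
/// Cost is vocab * hidden * batch comparisons, but no two cells ever
/// write to the same location, so a parallel version needs no atomics.
pub fn embedding_backward_gather_ref(
    dy: &[f32],
    ids: &[u32],
    vocab: usize,
    hidden: usize,
) -> Vec<f32> {
    assert_eq!(dy.len(), ids.len() * hidden, "dy must be [batch, hidden]");
    let mut d_embedding = vec![0.0f32; vocab * hidden];
    for v in 0..vocab {
        for h in 0..hidden {
            let mut acc = 0.0f32;
            // Scan every batch position and keep only those that looked up row v.
            for (b, &id) in ids.iter().enumerate() {
                if id as usize == v {
                    acc += dy[b * hidden + h];
                }
            }
            d_embedding[v * hidden + h] = acc;
        }
    }
    d_embedding
}
```

The trade-off is visible in the loop nest: the work grows with the vocabulary size even when most rows are never referenced. The follow-ups named above avoid that by doing work proportional to batch × hidden.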
Statics§
Functions§
- dispatch_embedding_lookup_f32 - Encode output[b, h] = embedding[ids[b], h].
- dispatch_embedding_scatter_add_f32 - Encode the embedding backward (scatter-add).
- register