pub fn embedding_ptx() -> &'static str
Returns the PTX assembly source for the embedding lookup kernel.
One thread per (token, dimension) pair. Each thread copies one f32 from the embedding weight table to the output.
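
The thread-to-element mapping described above can be sketched as a CPU reference, which may be useful for checking the kernel's output. This is an illustration only, not the kernel itself: the function name `embedding_lookup_reference`, the row-major `[vocab, dim]` weight layout, the `u32` token ids, and the flat one-dimensional thread index are all assumptions that the source does not state.

```rust
/// CPU reference for the per-thread work of an embedding lookup.
///
/// Assumptions (hypothetical, not taken from the kernel source):
/// - `weights` is row-major with shape `[vocab, dim]`
/// - the output is row-major with shape `[n_tokens, dim]`
/// - token ids are `u32`, and threads are numbered by a flat index
pub fn embedding_lookup_reference(
    weights: &[f32],
    token_ids: &[u32],
    dim: usize,
) -> Vec<f32> {
    let n_tokens = token_ids.len();
    let mut out = vec![0.0f32; n_tokens * dim];

    // Each loop iteration stands in for one GPU thread.
    for tid in 0..n_tokens * dim {
        let token = tid / dim; // which token this "thread" serves
        let d = tid % dim; // which embedding dimension it copies
        let row = token_ids[token] as usize; // row in the weight table
        // Copy exactly one f32 from the weight table to the output.
        out[tid] = weights[row * dim + d];
    }
    out
}
```

In this sketch an out-of-range token id panics on the slice index. Whether the real kernel performs a bounds check is not stated in the source.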