pub fn linear_ptx() -> &'static str
Returns the PTX assembly source for the linear projection kernel.
The kernel assigns one thread to each output element (batch_idx, out_feature). Each thread computes the dot product of x_row and w_row, then adds the bias.
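
The per-thread computation can be sketched as a CPU reference in Rust, which may be useful for checking kernel output. This is a minimal sketch, not the kernel itself: the function name `linear_reference`, the row-major layouts (`x` as `[batch, in_features]`, `w` as `[out_features, in_features]`), a bias indexed by output feature, and the `f32` element type are all assumptions that the description above does not confirm.

```rust
/// Hypothetical CPU reference for the computation the kernel is described as doing.
/// Assumed (not confirmed by the source): row-major storage, f32 elements,
/// x: [batch, in_features], w: [out_features, in_features], bias: [out_features].
pub fn linear_reference(
    x: &[f32],
    w: &[f32],
    bias: &[f32],
    batch: usize,
    in_features: usize,
    out_features: usize,
) -> Vec<f32> {
    // Validate the assumed shapes before indexing.
    assert_eq!(x.len(), batch * in_features);
    assert_eq!(w.len(), out_features * in_features);
    assert_eq!(bias.len(), out_features);

    let mut y = vec![0.0f32; batch * out_features];
    // Each (batch_idx, out_feature) pair plays the role of one GPU thread.
    for batch_idx in 0..batch {
        let x_row = &x[batch_idx * in_features..(batch_idx + 1) * in_features];
        for out_feature in 0..out_features {
            let w_row = &w[out_feature * in_features..(out_feature + 1) * in_features];
            // Dot product of x_row and w_row, then add the bias.
            let dot: f32 = x_row.iter().zip(w_row).map(|(a, b)| a * b).sum();
            y[batch_idx * out_features + out_feature] = dot + bias[out_feature];
        }
    }
    y
}
```

Because the two loops are independent across (batch_idx, out_feature) pairs, each iteration of the inner loop body maps naturally onto a single thread, which matches the one-thread-per-output-element scheme described above.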