Portable scalar transpose using a 64-bit gather and the classic 8x8 bit-matrix
transpose. Used as the fallback when no SIMD implementation is available.
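
For reference, the classic 8x8 bit-matrix transpose mentioned above can be sketched as follows. This is a minimal illustration and not necessarily the code used here: the name `transpose8x8` and the bit convention (row r is byte r of the word, column c is bit c of that byte) are assumptions made for the example.

```rust
/// Transpose an 8x8 bit matrix packed into a u64.
///
/// Assumed convention (hypothetical, for illustration): the bit at row `r`,
/// column `c` lives at bit index `8 * r + c`. After the call it lives at
/// bit index `8 * c + r`.
pub fn transpose8x8(mut x: u64) -> u64 {
    // Step 1: transpose every 2x2 block by swapping bits 7 positions apart.
    let mut t = (x ^ (x >> 7)) & 0x00AA_00AA_00AA_00AA;
    x ^= t ^ (t << 7);
    // Step 2: swap the off-diagonal 2x2 blocks inside every 4x4 block.
    t = (x ^ (x >> 14)) & 0x0000_CCCC_0000_CCCC;
    x ^= t ^ (t << 14);
    // Step 3: swap the two off-diagonal 4x4 blocks of the full matrix.
    t = (x ^ (x >> 28)) & 0x0000_0000_F0F0_F0F0;
    x ^= t ^ (t << 28);
    x
}
```

Under this convention, eight gathered bytes can be combined with `u64::from_le_bytes` so that byte r becomes row r before the transpose is applied. The function is its own inverse, so the same routine would serve both directions of an 8x8 tile.
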
Untranspose a T-width comparison mask (1024 bits) from FastLanes layout into logical row
order, dispatching to the best available implementation. For T = u64 this is the canonical
FastLanes bit untranspose; for narrower T it undoes the per-lane packing that unpack_cmp
produces for that width.