//! Vectorized Q4_K GEMV kernel with coalesced u32 loads (PAR-069)
//!
//! Improves memory-bandwidth utilization by loading packed weights as `u32` words:
//! - Each thread loads 4 consecutive bytes (8 nibbles = 8 Q4 values) with one load
//! - 32 threads x 4 bytes = 128 bytes per warp transaction (fully coalesced)
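//!
//! As a host-side illustration of the "4 bytes = 8 nibbles" arithmetic, here is a
//! minimal sketch of splitting one loaded word into its eight 4-bit values. The
//! helper `unpack_nibbles` is hypothetical and not part of this module; it assumes
//! little-endian byte order and emits the low nibble of each byte first. How the
//! resulting values map to element positions inside a Q4_K super-block depends on
//! the block layout and is not shown here.
//!
//! ```rust
//! // Hypothetical helper (illustration only): split one u32 into 8 nibbles.
//! fn unpack_nibbles(word: u32) -> [u8; 8] {
//!     let mut out = [0u8; 8];
//!     // Walk the 4 bytes in memory order (little-endian).
//!     for (i, byte) in word.to_le_bytes().iter().enumerate() {
//!         out[2 * i] = byte & 0x0F; // low nibble of this byte
//!         out[2 * i + 1] = byte >> 4; // high nibble of this byte
//!     }
//!     out
//! }
//!
//! // One thread's 4-byte load yields 8 Q4 values.
//! assert_eq!(unpack_nibbles(0x7654_3210), [0, 1, 2, 3, 4, 5, 6, 7]);
//! // 32 threads x 4 bytes covers 128 bytes per warp.
//! assert_eq!(32 * std::mem::size_of::<u32>(), 128);
//! ```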
//!
//! ## Submodules
//!
//! - [`build_ptx`]: PTX code generation for the kernel
/// Vectorized Q4_K GEMV kernel with coalesced u32 loads (PAR-069)
///
/// Improves memory-bandwidth utilization by loading packed weights as `u32` words:
/// - Each thread loads 4 consecutive bytes (8 nibbles = 8 Q4 values) with one load
/// - 32 threads x 4 bytes = 128 bytes per warp transaction (fully coalesced)