//! FP16 and Tensor Core Q4K Kernels
//!
//! High-performance quantized inference kernels optimized for memory bandwidth
//! and tensor core utilization.
//!
//! ## Kernels
//!
//! - [`Fp16Q4KGemvKernel`] - FP16 input/output Q4K GEMV with 4x bandwidth reduction
//! - [`TensorCoreQ4KGemmKernel`] - Tensor Core accelerated Q4K GEMM for batched decode
//! - [`MultiWarpTensorCoreQ4KGemmKernel`] - 4-warp WMMA Q4K GEMM (PMAT-045)
//! - [`InterleavedWmmaQ4KGemmKernel`] - Coalesced WMMA Q4K GEMM with interleaved weights (PMAT-091)
//! - [`W4a16WmmaQ4KGemmKernel`] - W4A16 (4-bit weight, FP16 activation) WMMA Q4K GEMM
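//!
//! ## Example
//!
//! A minimal usage sketch. The constructor and `launch` signature below are
//! assumptions for illustration only; see the individual kernel docs for the
//! actual API.
//!
//! ```ignore
//! // Hypothetical API: GEMV over Q4K weights with FP16 activations.
//! let kernel = Fp16Q4KGemvKernel::new(&device)?;
//! kernel.launch(&q4k_weights, &fp16_input, &mut fp16_output, rows, cols)?;
//! ```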
// NOTE: the submodule names below are assumed; align them with the actual
// file layout of this crate.
mod fp16_q4k_gemv;
mod interleaved_wmma_q4k_gemm;
mod multi_warp_tensor_core_q4k_gemm;
mod tensor_core_q4k_gemm;
mod w4a16_wmma_q4k_gemm;

pub use fp16_q4k_gemv::Fp16Q4KGemvKernel;
pub use interleaved_wmma_q4k_gemm::InterleavedWmmaQ4KGemmKernel;
pub use multi_warp_tensor_core_q4k_gemm::MultiWarpTensorCoreQ4KGemmKernel;
pub use tensor_core_q4k_gemm::TensorCoreQ4KGemmKernel;
pub use w4a16_wmma_q4k_gemm::W4a16WmmaQ4KGemmKernel;