Module feature_concat

ADR-021 K5: GPU feature-axis concat (single-chunk strided copy).

Each invocation copies one [T, src_dim] f32 row-major slab into its slice of the concatenated [T, dst_stride] destination, starting at column offset dst_offset. Launching the kernel once per chunk, with a different dst_offset each time, builds the full [T, Σ src_dim_i] concatenated tensor. This is exactly the shape that qwen3vl.cpp:186 ggml_concat(ctx0, embeddings, deepstack_features, 0) produces.
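The indexing described above can be illustrated with a CPU reference sketch. This is not the GPU kernel itself: the function names `concat_chunk_ref` and `concat_features_ref` are hypothetical, and the sketch assumes that dst_stride equals the sum of all src_dim values and that chunks are laid out left to right in call order.

```rust
/// CPU reference for one chunk (hypothetical helper, not the GPU kernel):
/// copies a [t, src_dim] row-major slab into columns
/// [dst_offset, dst_offset + src_dim) of a [t, dst_stride] row-major buffer.
fn concat_chunk_ref(
    src: &[f32],
    dst: &mut [f32],
    t: usize,
    src_dim: usize,
    dst_stride: usize,
    dst_offset: usize,
) {
    assert_eq!(src.len(), t * src_dim);
    assert_eq!(dst.len(), t * dst_stride);
    assert!(dst_offset + src_dim <= dst_stride);
    for row in 0..t {
        // Source rows are packed; destination rows are dst_stride apart.
        let src_row = &src[row * src_dim..(row + 1) * src_dim];
        let start = row * dst_stride + dst_offset;
        dst[start..start + src_dim].copy_from_slice(src_row);
    }
}

/// Builds the [t, sum of src_dim] tensor with one call per chunk.
/// Assumption: each chunk is given as (data, src_dim) and dst_offset is
/// the running sum of the preceding chunks' src_dim.
fn concat_features_ref(chunks: &[(&[f32], usize)], t: usize) -> Vec<f32> {
    let dst_stride: usize = chunks.iter().map(|&(_, d)| d).sum();
    let mut dst = vec![0.0f32; t * dst_stride];
    let mut dst_offset = 0;
    for &(src, src_dim) in chunks {
        concat_chunk_ref(src, &mut dst, t, src_dim, dst_stride, dst_offset);
        dst_offset += src_dim;
    }
    dst
}
```

For example, with T = 2, a chunk [[1, 2], [3, 4]] (src_dim = 2) followed by a chunk [[5], [6]] (src_dim = 1) yields the row-major destination [[1, 2, 5], [3, 4, 6]] with dst_stride = 3.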