ADR-021 K5: GPU feature-axis concat (single-chunk strided copy).
Each invocation copies one [T, src_dim] f32 row-major slab into
its slice of the concatenated [T, dst_stride] destination, at
column offset dst_offset. Launching once per chunk (with varying
dst_offset) builds the full [T, Σ src_dim_i] concatenated
tensor — exactly the shape qwen3vl.cpp:186
ggml_concat(ctx0, embeddings, deepstack_features, 0) produces.
Pure copy (no FP arithmetic) → AC-1 byte-identical.
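
The strided copy described above can be illustrated with a CPU-side reference. This is only a minimal sketch under stated assumptions: `feature_concat_ref` and `concat_chunks` are hypothetical names introduced here for illustration and are not part of this module's API, and the real kernel runs on the GPU. Parameter names mirror the ones used in the description (`src_dim`, `dst_stride`, `dst_offset`).

```rust
/// Hypothetical CPU reference (not this module's API): copy one
/// [n_tokens, src_dim] row-major chunk into the [n_tokens, dst_stride]
/// destination, starting at column `dst_offset` of every row.
fn feature_concat_ref(
    src: &[f32],
    dst: &mut [f32],
    n_tokens: usize,
    src_dim: usize,
    dst_stride: usize,
    dst_offset: usize,
) {
    assert_eq!(src.len(), n_tokens * src_dim);
    assert_eq!(dst.len(), n_tokens * dst_stride);
    assert!(dst_offset + src_dim <= dst_stride);

    for t in 0..n_tokens {
        let src_row = &src[t * src_dim..(t + 1) * src_dim];
        let start = t * dst_stride + dst_offset;
        // Plain memory copy, no floating-point arithmetic, so bits are preserved.
        dst[start..start + src_dim].copy_from_slice(src_row);
    }
}

/// Hypothetical driver: one call per chunk, advancing `dst_offset` by the
/// chunk's `src_dim`, yields the full [n_tokens, sum of src_dim] tensor.
fn concat_chunks(chunks: &[(&[f32], usize)], n_tokens: usize) -> Vec<f32> {
    let dst_stride: usize = chunks.iter().map(|&(_, dim)| dim).sum();
    let mut dst = vec![0.0f32; n_tokens * dst_stride];
    let mut dst_offset = 0;
    for &(src, src_dim) in chunks {
        feature_concat_ref(src, &mut dst, n_tokens, src_dim, dst_stride, dst_offset);
        dst_offset += src_dim;
    }
    dst
}
```

Because each row is moved with `copy_from_slice`, the output bits match the input bits exactly, which is the property the byte-identical criterion relies on.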
Statics§
Functions§
- dispatch_feature_concat_f32 - Copy one [n_tokens, src_dim] f32 row-major chunk into the [n_tokens, dst_stride] destination at column dst_offset.
- register