Module write_slice


write_slice plan — Phase 13.1 trailblazer.

write_slice(dest, source, ranges) -> dest:

dest[start_0..end_0, ..., start_{N-1}..end_{N-1}] = source
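As a host-side illustration of the assignment above, here is a minimal reference sketch. It assumes a contiguous row-major dest and half-open ranges (start_i..end_i). The function name `write_slice_ref` and its signature are hypothetical and are not part of this module's API; the real plan runs on device buffers.

```rust
/// Host-side reference for the write_slice assignment (illustrative only).
/// Assumes `dest` is contiguous and row-major, and each range is half-open.
pub fn write_slice_ref<T: Copy>(
    dest: &mut [T],
    dest_shape: &[usize],
    source: &[T],
    ranges: &[(usize, usize)],
) {
    let rank = dest_shape.len();
    assert_eq!(rank, ranges.len(), "one range per dest axis");
    assert_eq!(dest.len(), dest_shape.iter().product::<usize>());

    // Validate every range against its axis before touching dest.
    for (axis, &(start, end)) in ranges.iter().enumerate() {
        assert!(start <= end && end <= dest_shape[axis], "range out of bounds");
    }

    // Row-major strides of dest, in elements.
    let mut strides = vec![1usize; rank];
    for i in (0..rank.saturating_sub(1)).rev() {
        strides[i] = strides[i + 1] * dest_shape[i + 1];
    }

    // The slab is the region being written; source must match its size.
    let slab_shape: Vec<usize> = ranges.iter().map(|&(s, e)| e - s).collect();
    assert_eq!(source.len(), slab_shape.iter().product::<usize>());

    for (linear, &value) in source.iter().enumerate() {
        // Unravel the source index into slab coordinates, innermost axis first,
        // then shift each coordinate by its range start.
        let mut rem = linear;
        let mut offset = 0usize;
        for axis in (0..rank).rev() {
            let coord = rem % slab_shape[axis];
            rem /= slab_shape[axis];
            offset += (coord + ranges[axis].0) * strides[axis];
        }
        dest[offset] = value; // assign, never accumulate
    }
}
```

Writing a 2x2 source into rows 1..3 and columns 1..3 of a zeroed 3x4 dest leaves every element outside that window untouched.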

Assign semantics, not accumulate: the write overwrites the target region, which is what distinguishes WriteSlicePlan from ScatterAddPlan. This op drives the Fuel team’s persistent KV-cache append during autoregressive decoding (step 9c E.3.3 of their Phase 7.6 integration).

Dtype coverage spans the entire baracuda element bank via byte-width dispatch (sizeof(T) ∈ {1, 2, 4, 8, 16}), with a separate nibble-packed kernel for S4 / U4. The bound is T: DeviceRepr + Copy + 'static (the same as TensorRef), so one plan covers Element-family, IntElement-family, and FpElement-family dtypes uniformly.
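Byte-width dispatch can be pictured as choosing a kernel variant from the element size alone, since a copy only moves bytes and never inspects the numeric type. The sketch below is an assumption about the shape of that dispatch: the `ByteWidth` enum and `byte_width_of` function are hypothetical names, and the DeviceRepr bound is omitted so the example compiles on its own.

```rust
use std::mem::size_of;

/// Hypothetical kernel variants, one per supported element width in bytes.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ByteWidth {
    B1,
    B2,
    B4,
    B8,
    B16,
}

/// Map an element type to a width-specialised variant, or None when the
/// width is not in {1, 2, 4, 8, 16}.
pub fn byte_width_of<T: Copy + 'static>() -> Option<ByteWidth> {
    match size_of::<T>() {
        1 => Some(ByteWidth::B1),
        2 => Some(ByteWidth::B2),
        4 => Some(ByteWidth::B4),
        8 => Some(ByteWidth::B8),
        16 => Some(ByteWidth::B16),
        _ => None,
    }
}
```

Under this scheme f32, i32, and u32 all resolve to the same 4-byte variant, which is why a single plan can serve several element families.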

No backward — write_slice is non-differentiable in Fuel’s autograd model.

§Fast paths

1. Full-width minor axes: when ranges[i] == (0, dest_shape[i]) for all i > 0, the source maps to one contiguous chunk of dest starting at byte offset start_0 * stride[0] * sizeof(T). A single cuMemcpyDtoDAsync does the copy. This is the KV-cache append shape and the most performance-critical case.
2. Whole dest covered: when source-shape == dest-shape and ranges fully cover dest, a single cuMemcpyDtoDAsync copies the whole buffer (a degenerate form of case 1).
3. Otherwise: a generic per-slab-element kernel. One thread per source element computes the dest linear offset from the slab coordinate shifted by range_start.
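The three cases above can be sketched as a small host-side classifier. This is an illustration under stated assumptions, not the plan's actual selection code: `CopyPath` and `select_path` are hypothetical names, dest is assumed contiguous and row-major (so stride[0] is the product of the minor dimensions), and the whole-dest check runs first only because it is the stricter condition.

```rust
/// Hypothetical description of which copy strategy applies.
#[derive(Debug, PartialEq, Eq)]
pub enum CopyPath {
    /// Ranges cover all of dest: one device-to-device copy of the buffer.
    WholeDest { bytes: usize },
    /// Every axis after the first is full width: one contiguous chunk.
    ContiguousChunk { byte_offset: usize, bytes: usize },
    /// Anything else: per-element kernel.
    Generic,
}

pub fn select_path(
    dest_shape: &[usize],
    ranges: &[(usize, usize)],
    elem_size: usize,
) -> CopyPath {
    assert_eq!(dest_shape.len(), ranges.len(), "one range per dest axis");
    let rank = dest_shape.len();
    let is_full = |i: usize| ranges[i] == (0, dest_shape[i]);

    if (0..rank).all(|i| is_full(i)) {
        let elems: usize = dest_shape.iter().product();
        return CopyPath::WholeDest { bytes: elems * elem_size };
    }
    if (1..rank).all(|i| is_full(i)) {
        // Row-major stride of axis 0, in elements.
        let stride0: usize = dest_shape[1..].iter().product();
        let (start, end) = ranges[0];
        return CopyPath::ContiguousChunk {
            byte_offset: start * stride0 * elem_size,
            bytes: (end - start) * stride0 * elem_size,
        };
    }
    CopyPath::Generic
}
```

For a KV-cache of shape [8, 2, 4] with 2-byte elements, appending one position at index 3 gives a chunk of 16 bytes at byte offset 48, which is the single-memcpy case.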

§S4 / U4 constraint

Nibble-packed dtypes pack two elements per u8. To avoid a read-modify-write across a byte boundary, the trailblazer requires that both start_{N-1} and end_{N-1} on the innermost axis be even, so every written byte is fully owned by the slice. An innermost range with an odd start or end returns Error::Unsupported at select time.
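A minimal sketch of the parity check follows, assuming two 4-bit elements per byte. The `NibbleRangeError` type and `packed_byte_range` function are hypothetical stand-ins: the module itself reports Error::Unsupported, whose definition is not shown here.

```rust
/// Stand-in error for this sketch; the real plan returns Error::Unsupported.
#[derive(Debug, PartialEq, Eq)]
pub enum NibbleRangeError {
    OddInnermostBound { start: usize, end: usize },
}

/// Convert an innermost element range on a nibble-packed axis into a byte
/// range, rejecting bounds that would split a byte between two writers.
pub fn packed_byte_range(
    innermost: (usize, usize),
) -> Result<(usize, usize), NibbleRangeError> {
    let (start, end) = innermost;
    if start % 2 != 0 || end % 2 != 0 {
        return Err(NibbleRangeError::OddInnermostBound { start, end });
    }
    // Two elements per byte, so even element bounds map to whole bytes.
    Ok((start / 2, end / 2))
}
```

With even bounds the packed case reduces to a plain byte copy over the returned byte range; an odd bound would require preserving the neighbouring nibble, which is the read-modify-write the constraint rules out.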

Structs§

WriteSliceArgs: Args bundle for a write_slice launch.
WriteSliceDescriptor: Descriptor for a write_slice op.
WriteSlicePlan: write_slice plan.