pub struct Tile<const SIZE: usize> { /* private fields */ }
A thread block tile of SIZE threads.
All threads within a tile are guaranteed active — shuffle is always safe.
Created by partitioning a Warp<All> via warp.tile::<N>().
Supported Sizes
4, 8, 16, and 32, matching NVIDIA's cooperative groups API. A valid size must be a power of two that divides 32.
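The size restriction can be expressed with a marker trait, which is what the `Tile<SIZE>: ValidTileSize` bound on the implementation block suggests. The following is a minimal sketch under that assumption, not the crate's source: the `Tile` here is a stand-in unit struct (the real fields are private) and the `size` method is hypothetical.

```rust
/// Stand-in for the crate's `Tile`; the real type has private fields.
struct Tile<const SIZE: usize>;

/// Marker trait implemented only for the supported sizes (assumed layout).
trait ValidTileSize {}
impl ValidTileSize for Tile<4> {}
impl ValidTileSize for Tile<8> {}
impl ValidTileSize for Tile<16> {}
impl ValidTileSize for Tile<32> {}

/// Methods in this block exist only when the size is one of the four above.
impl<const SIZE: usize> Tile<SIZE>
where
    Tile<SIZE>: ValidTileSize,
{
    /// Hypothetical accessor, used here only to show the bound at work.
    fn size(&self) -> usize {
        SIZE
    }
}
```

With this shape, calling `size()` on a `Tile<8>` compiles, while the same call on a `Tile<3>` is rejected at compile time because `Tile<3>` does not implement `ValidTileSize`.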
Implementations
impl<const SIZE: usize> Tile<SIZE> where
Tile<SIZE>: ValidTileSize,
pub fn shuffle_xor<T: GpuValue + GpuShuffle>(&self, data: PerLane<T>, mask: u32) -> PerLane<T>
Shuffle XOR within the tile.
Each thread exchanges with the thread at (thread_rank XOR mask) within
the tile. Caller must ensure mask < SIZE (no automatic clamping).
Always safe: all SIZE threads in the tile participate.
On GPU: emits shfl.sync.bfly.b32 with c = ((32-SIZE)<<8)|0x1F,
confining the shuffle to SIZE-lane segments.
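The addressing rule and the `c` formula above can be checked on the CPU. The following is a minimal sketch, not the crate's implementation: `shfl_bfly_c` and `shuffle_xor_model` are hypothetical helper names, and a warp is modeled as a plain `[i32; 32]` array rather than a `PerLane<T>`.

```rust
/// Computes the `c` operand from the formula above for a tile of `size` lanes.
/// In the PTX encoding, bits 8..=12 carry the segment mask and bits 0..=4
/// carry the clamp value.
fn shfl_bfly_c(size: u32) -> u32 {
    ((32 - size) << 8) | 0x1F
}

/// CPU model of a tile-confined XOR shuffle over one 32-lane warp.
/// Lane `i` reads from lane `i ^ mask`. Because `size` is a power of two and
/// `mask < size`, the source lane stays inside the same `size`-lane segment.
fn shuffle_xor_model(data: [i32; 32], mask: usize, size: usize) -> [i32; 32] {
    assert!(size.is_power_of_two() && size <= 32);
    assert!(mask < size, "caller must keep mask below the tile size");
    let mut out = [0i32; 32];
    for lane in 0..32 {
        out[lane] = data[lane ^ mask];
    }
    out
}
```

For example, `shfl_bfly_c(8)` evaluates to `0x181F`, and with `size = 4` and `mask = 3` lane 4 reads from lane 7, which is still inside the second 4-lane segment.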
pub fn shuffle_down<T: GpuValue + GpuShuffle>(&self, data: PerLane<T>, delta: u32) -> PerLane<T>
Shuffle down within the tile (confined to tile-sized segments).
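What "confined to tile-sized segments" means can be illustrated with a small CPU model. This sketch assumes the same edge behavior as CUDA's width-limited `__shfl_down_sync`, where a lane whose source would fall outside its segment keeps its own value. The source does not state this for `shuffle_down`, and `shuffle_down_model` is a hypothetical name.

```rust
/// CPU model of a segment-confined shuffle down over one 32-lane warp.
/// A lane reads from `lane + delta` only when that source lane is still in
/// the same `size`-lane segment; otherwise it keeps its own value (assumed).
fn shuffle_down_model(vals: [i32; 32], delta: usize, size: usize) -> [i32; 32] {
    assert!(size.is_power_of_two() && size <= 32);
    let mut out = vals;
    for lane in 0..32 {
        if lane % size + delta < size {
            out[lane] = vals[lane + delta];
        }
    }
    out
}
```

With `size = 4` and `delta = 1`, lane 2 reads lane 3, while lane 3 has no in-segment source and keeps its own value instead of reading lane 4.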
pub fn reduce_sum<T: GpuValue + GpuShuffle + Add<Output = T>>(&self, data: PerLane<T>) -> T
Sum reduction across all tile lanes.
Uses butterfly reduction with log2(SIZE) shuffle-XOR steps.
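The butterfly pattern can be sketched on the CPU as follows. This is an illustrative model rather than the crate's code: `butterfly_sum_model` is a hypothetical name, a warp is a plain `[i32; 32]`, and the stride order (from SIZE/2 down to 1) is an assumption, since the source only states the number of steps.

```rust
/// CPU model of a butterfly sum over tiles of `size` lanes in one warp.
/// Each of the log2(size) steps adds the value held by lane `i ^ stride`,
/// so after the last step every lane holds its own tile's total.
fn butterfly_sum_model(mut vals: [i32; 32], size: usize) -> [i32; 32] {
    assert!(size.is_power_of_two() && size <= 32);
    let mut stride = size / 2;
    while stride >= 1 {
        // Snapshot first: on a GPU all lanes read before any lane writes.
        let prev = vals;
        for lane in 0..32 {
            vals[lane] = prev[lane] + prev[lane ^ stride];
        }
        stride /= 2;
    }
    vals
}
```

With lane values 0 through 31 and `size = 4`, lanes 0..=3 all end with 6 and lanes 4..=7 all end with 22, which shows that the sums stay separate per tile.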
pub fn inclusive_sum<T: GpuValue + GpuShuffle + Add<Output = T>>(&self, data: PerLane<T>) -> PerLane<T>
Deprecated: Not correct on any target (Hillis-Steele without lane_id guard). Use SimWarp for tested scan.
Inclusive prefix sum within the tile.
WARNING: Not correct on any target. On CPU, shfl_up is the identity,
so each stage doubles the value and the result is val × SIZE. On GPU,
lanes where lane_id < stride are clamped and receive their own value,
so they double it instead of preserving it. A correct version needs a
guard of the form if lane_id >= stride, which requires lane_id().
Retained for type-system demonstration.
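The effect of the missing guard can be reproduced with a CPU model of the GPU behavior described above. This is a sketch under stated assumptions, not the crate's code or a replacement for SimWarp: `shuffle_up_model` and `inclusive_sum_model` are hypothetical names, a warp is a plain `[i32; 32]`, and the lane's rank within its tile stands in for `lane_id`.

```rust
/// Tile-confined shuffle-up model: lanes whose rank in the tile is below
/// `delta` keep their own value, mirroring the clamping described above.
fn shuffle_up_model(vals: [i32; 32], delta: usize, size: usize) -> [i32; 32] {
    let mut out = vals;
    for lane in 0..32 {
        if lane % size >= delta {
            out[lane] = vals[lane - delta];
        }
    }
    out
}

/// Hillis-Steele inclusive scan per tile. With `guarded == false` every lane
/// adds the shuffled value, which reproduces the doubling bug; with
/// `guarded == true` only lanes with rank >= stride accumulate.
fn inclusive_sum_model(mut vals: [i32; 32], size: usize, guarded: bool) -> [i32; 32] {
    assert!(size.is_power_of_two() && size <= 32);
    let mut stride = 1;
    while stride < size {
        let up = shuffle_up_model(vals, stride, size);
        for lane in 0..32 {
            if !guarded || lane % size >= stride {
                vals[lane] += up[lane];
            }
        }
        stride *= 2;
    }
    vals
}
```

For a tile of four lanes holding 1, 2, 3, 4, the guarded variant yields 1, 3, 6, 10, while the unguarded variant yields 4, 6, 7, 10 because the low lanes add their own clamped value at each stage.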
impl Tile<32>
pub fn partition_16(&self) -> Tile<16>
Sub-partition into tiles of 16.
pub fn partition_8(&self) -> Tile<8>
Sub-partition into tiles of 8.
pub fn partition_4(&self) -> Tile<4>
Sub-partition into tiles of 4.
impl Tile<16>
pub fn partition_8(&self) -> Tile<8>
Sub-partition into tiles of 8.
pub fn partition_4(&self) -> Tile<4>
Sub-partition into tiles of 4.