The f32-uniform GPU arena. Like rlx-cuda / rlx-wgpu, every tensor is an
f32 slot at a byte offset in one contiguous buffer. We allocate the
arena as HOST_VISIBLE | HOST_COHERENT memory and keep it persistently
mapped, so host upload/readback is a plain memcpy with no staging
buffer or transfer command. (On discrete GPUs a DEVICE_LOCAL arena +
staging would give the GPU higher memory bandwidth; that is a documented
follow-up, and correctness comes first.)
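
To make the slot-and-offset layout concrete, here is a minimal sketch that runs without a GPU. It is not this crate's implementation: `Arena`, `Slot`, `alloc`, `upload`, and `readback` are hypothetical names, and a `Vec<f32>` stands in for the persistently mapped HOST_VISIBLE | HOST_COHERENT allocation. It also assumes simple bump allocation and ignores any device alignment requirements.

```rust
/// Hypothetical tensor handle: a byte offset into the arena plus a length.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Slot {
    pub byte_offset: usize,
    pub len: usize, // number of f32 elements
}

/// Stand-in for the mapped arena. A real backend would hold the pointer
/// obtained by mapping the device memory once; a Vec<f32> plays that role
/// here (assumption for illustration only).
pub struct Arena {
    mapped: Vec<f32>,
    next: usize, // bump cursor, counted in f32 elements
}

impl Arena {
    pub fn new(capacity_f32: usize) -> Self {
        Arena { mapped: vec![0.0; capacity_f32], next: 0 }
    }

    /// Reserve `len` f32 elements; returns None when the arena is full.
    pub fn alloc(&mut self, len: usize) -> Option<Slot> {
        let end = self.next.checked_add(len)?;
        if end > self.mapped.len() {
            return None;
        }
        let slot = Slot {
            byte_offset: self.next * std::mem::size_of::<f32>(),
            len,
        };
        self.next = end;
        Some(slot)
    }

    /// Host upload: a plain copy into the mapped range, no staging buffer.
    pub fn upload(&mut self, slot: Slot, data: &[f32]) {
        assert_eq!(data.len(), slot.len, "data must fill the slot exactly");
        let start = slot.byte_offset / std::mem::size_of::<f32>();
        self.mapped[start..start + slot.len].copy_from_slice(data);
    }

    /// Host readback: a plain copy out of the mapped range.
    pub fn readback(&self, slot: Slot) -> Vec<f32> {
        let start = slot.byte_offset / std::mem::size_of::<f32>();
        self.mapped[start..start + slot.len].to_vec()
    }
}
```

Because every tensor is f32, a slot's byte offset is always a multiple of 4, which is why the sketch can convert between byte offsets and element indices by a single division.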