oxiui-compute-wgpu — Pure-Rust wgpu GPU-compute abstraction for OxiUI

oxiui-compute-wgpu is the shared GPU-compute layer for the COOLJAPAN ecosystem. It consolidates the repeated Instance → Adapter → Device → Queue initialisation boilerplate that headless compute workloads (sparse solvers, Lattice-Boltzmann, Monte-Carlo) each duplicated, and adds typed buffers, buffer pooling and sub-allocation, pipeline caching, dispatch helpers, a WGSL preprocessor and validator, and a set of validated built-in compute kernels. #![forbid(unsafe_code)] is enforced crate-wide; pollster blocks on wgpu's async adapter/device requests so the public API stays synchronous; bytemuck handles zero-copy Pod casting between CPU and GPU.

Targets wgpu 29. See the wgpu 29 Notes for the renamed APIs.

Installation

[dependencies]
oxiui-compute-wgpu = "0.1"

wgpu, bytemuck, and pollster are re-exported, so a single dependency declaration is enough.

Quick Start

Configure a context with the builder, double every value in a buffer on the GPU with a DispatchBuilder, and read the result back:

use oxiui_compute_wgpu::{
    bytemuck, compute_pipeline, read_back, storage_buffer_init, wgpu,
    ComputeContext, DispatchBuilder,
};

// 1. Configure and build a compute context (Err on hosts with no GPU adapter).
let Ok(ctx) = ComputeContext::builder()
    .with_power_preference(wgpu::PowerPreference::HighPerformance)
    .build()
else {
    return; // no GPU — skip gracefully
};

// 2. Upload input data to a storage buffer.
let input: Vec<f32> = vec![1.0, 2.0, 3.0, 4.0];
let buffer = storage_buffer_init(&ctx.device, "values", bytemuck::cast_slice(&input));

// 3. Compile a WGSL compute shader (auto-layout from reflection).
const SHADER: &str = r#"
    @group(0) @binding(0) var<storage, read_write> data: array<f32>;

    @compute @workgroup_size(64)
    fn main(@builtin(global_invocation_id) gid: vec3<u32>) {
        if gid.x < arrayLength(&data) {
            data[gid.x] = data[gid.x] * 2.0;
        }
    }
"#;
let pipeline = compute_pipeline(&ctx.device, SHADER, "main");

// 4. Bind, dispatch (ceil-div grid), and submit in one call.
let bind_group = ctx.device.create_bind_group(&wgpu::BindGroupDescriptor {
    label: Some("values-bind"),
    layout: &pipeline.get_bind_group_layout(0),
    entries: &[wgpu::BindGroupEntry {
        binding: 0,
        resource: buffer.as_entire_binding(),
    }],
});

DispatchBuilder::new(&pipeline)
    .bind(0, &bind_group)
    .dispatch_1d(input.len() as u32, 64)
    .label("double")
    .submit(&ctx.device, &ctx.queue);

// 5. Read the results back to the CPU.
let output: Vec<f32> = read_back(&ctx.device, &ctx.queue, &buffer, input.len());
assert_eq!(output, vec![2.0, 4.0, 6.0, 8.0]);

API Overview

Module	Key exports
`context`	`ComputeContext`, `ContextBuilder`
`buffer`	`storage_buffer_init`, `uniform_buffer`, `staging_buffer`, `read_back`, `read_back_range`, `read_back_async`, `TypedBuffer<T>`, `BufferPool`, `SubAllocator`, `mapped_storage_init`
`pipeline`	`compute_pipeline`, `checked_compute_pipeline`, `PipelineCache`, `DispatchBuilder`, `dispatch_1d`, `dispatch_2d`, `dispatch_3d`, `validate_immediates`, `encode_indirect_dispatch`
`wgsl`	`preprocess`, `validate`, `SHADER_PREFIX_SUM`, `SHADER_REDUCTION_SUM`, `SHADER_HISTOGRAM`, `SHADER_MATMUL`
`error`	`ComputeError` (`NoAdapter`, `DeviceRequest`, `OutOfMemory`, `ShaderCompilation`, `Operation`)

Built-in Kernels

Four validated WGSL kernels ship as pub const source strings in wgsl. Each uses the entry point main_cs and is compiled with compute_pipeline (or checked_compute_pipeline).

Constant	Entry point	Constraints
`SHADER_PREFIX_SUM`	`main_cs`	Inclusive scan, `f32`. Single workgroup; input length ≤ 256. Dispatch one workgroup of size 256.
`SHADER_REDUCTION_SUM`	`main_cs`	Sum reduction, `f32 → f32`. Single workgroup; input length ≤ 256. Dispatch one workgroup of size 256.
`SHADER_HISTOGRAM`	`main_cs`	`u32` values → bin counts. Up to 256 bins; each element binned as `input[i] % num_bins`. Workgroup size 64; workgroup-local atomic histogram.
`SHADER_MATMUL`	`main_cs`	Tiled `f32` matmul, M×K · K×N → M×N (row-major). 16×16 shared-memory tiles; bind `MatDims { M, K, N }` uniform at `@binding(3)`. Dispatch `ceil(N/16) × ceil(M/16) × 1`.

wgpu 29 Notes

This crate targets wgpu 29. Relevant API changes from earlier versions:

Push constants are now immediates: wgpu::Features::IMMEDIATES, ComputePass::set_immediates, and Limits::max_immediate_size. Alignment is wgpu::IMMEDIATE_DATA_ALIGNMENT (4); validate with validate_immediates.
Instance::request_adapter returns Result (not Option).
Adapter::request_device takes a single &DeviceDescriptor argument.

Feature Flags

None. All functionality is on by default. GPU access is optional at runtime: ComputeContext::try_new returns None on headless hosts (CI, VMs without GPU pass-through), and ComputeContext::new / ContextBuilder::build return ComputeError::NoAdapter, so callers can skip gracefully rather than fail.

Related Crates

oxiui — the OxiUI facade crate.
oxiui-core — RenderBackend, DrawList, geometry types, shared require_gpu! macro.
oxiui-render-soft — CPU render backend; will use compute shaders for blur/dithering.
oxiui-render-wgpu — GPU render backend; shares the device/queue with this crate.
oxiui-text — text pipeline; GPU glyph rasterization is a planned consumer.
oxiui-theme — design tokens and ShadowSpec.

License

Apache-2.0 — COOLJAPAN OU (Team Kitasan)

oxiui-compute-wgpu 0.1.2