Crate ferrotorch_cubecl


Portable GPU backend for ferrotorch via CubeCL.

CubeCL compiles a single kernel definition to CUDA PTX, AMD HIP/ROCm, and WGPU (Vulkan/Metal/DX12). This crate wraps CubeCL’s runtime and dispatches real #[cube] kernels to the active backend, with no CPU fallbacks.

§Feature flags

Feature	Backend	GPU vendors
`cuda`	NVIDIA CUDA via PTX	NVIDIA
`wgpu`	WGPU (Vulkan/Metal)	AMD, Intel, Apple, …
`rocm`	AMD HIP (native)	AMD

Enable at least one backend feature to use GPU acceleration. Without any backend feature, CubeRuntime::new returns FerrotorchError::DeviceUnavailable and CubeRuntime::auto returns None.

§Example

use ferrotorch_cubecl::{CubeDevice, CubeRuntime};

// Auto-detect the best available backend
if let Some(rt) = CubeRuntime::auto() {
    println!("Using device: {:?}", rt.device());
}
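
Because a build without any backend feature makes CubeRuntime::auto return None and CubeRuntime::new return FerrotorchError::DeviceUnavailable, callers generally need a fallback branch. The sketch below illustrates that contract with stand-in types so that it compiles on its own. The variant shapes of CubeDevice, the signature of new, the backend preference order in auto, and the helper describe_backend are all assumptions made for illustration, not the crate's actual definitions.

```rust
// Stand-in types that mimic only the documented behaviour.
// They are NOT the real ferrotorch_cubecl definitions.
#[derive(Debug, Clone, Copy, PartialEq)]
enum CubeDevice {
    Cuda(usize), // assumed shape: backend plus device index
    Wgpu(usize),
    Rocm(usize),
}

#[derive(Debug, PartialEq)]
enum FerrotorchError {
    DeviceUnavailable,
}

#[derive(Debug)]
struct CubeRuntime {
    device: CubeDevice,
}

impl CubeRuntime {
    /// Assumed signature: succeeds only when the matching feature is compiled in.
    fn new(device: CubeDevice) -> Result<Self, FerrotorchError> {
        let enabled = match device {
            CubeDevice::Cuda(_) => cfg!(feature = "cuda"),
            CubeDevice::Wgpu(_) => cfg!(feature = "wgpu"),
            CubeDevice::Rocm(_) => cfg!(feature = "rocm"),
        };
        if enabled {
            Ok(Self { device })
        } else {
            Err(FerrotorchError::DeviceUnavailable)
        }
    }

    /// Tries each backend in an assumed preference order; None if none is enabled.
    fn auto() -> Option<Self> {
        [CubeDevice::Cuda(0), CubeDevice::Rocm(0), CubeDevice::Wgpu(0)]
            .iter()
            .find_map(|&d| Self::new(d).ok())
    }

    fn device(&self) -> CubeDevice {
        self.device
    }
}

/// Hypothetical caller: reports the backend or falls back gracefully.
fn describe_backend() -> String {
    match CubeRuntime::auto() {
        Some(rt) => format!("GPU backend: {:?}", rt.device()),
        None => "no GPU backend compiled in".to_string(),
    }
}
```

Compiled without any of the three features, describe_backend takes the None branch, which is the path a caller should treat as "run elsewhere or report an error" rather than a panic.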

§REQ status (per `.design/ferrotorch-cubecl/lib.md`)

Full evidence rows (impl + non-test production consumer + upstream cites) live in the design doc; this synopsis is a one-line summary per REQ.

REQ	Status	Evidence
REQ-1 (public module surface)	SHIPPED	`pub mod grammar/kernels/ops/quant/runtime/storage` in `lib.rs`; consumer `ferrotorch-xpu/src/lib.rs` imports `ferrotorch_cubecl::{CubeDevice, CubeRuntime, upload_f32, wrap_kernel_output}`
REQ-2 (feature-flag wiring)	SHIPPED	`cuda`/`wgpu`/`rocm` feature gates in `Cargo.toml` + `make_client` cfg arms in `runtime.rs`; no-backend path pinned by `runtime_construction_errors_without_backend` in `ops.rs`
REQ-3 (boundary re-exports)	SHIPPED	`pub use runtime::` / `storage::` / `quant::` / `grammar::` in `lib.rs`; consumers `ferrotorch-xpu/src/lib.rs` + `ferrotorch-grammar/src/gpu_dispatch.rs` reach names via `ferrotorch_cubecl::Foo`
REQ-4 (crate-internal launch helpers)	SHIPPED	`pub(crate) fn elementwise_launch_dims` + `pub(crate) fn debug_assert_handle_capacity` in `lib.rs`; consumers `kernels::run_unary`/`run_binary_handle`, `quant::dequantize_q4_0_to_gpu`, `grammar::compute_token_mask_dfa_to_gpu`
REQ-5 (lint baseline)	SHIPPED	`#![warn(clippy::all, clippy::pedantic)]` + `#![deny(rust_2018_idioms, missing_debug_implementations)]` at top of `lib.rs`; verified by `cargo clippy -p ferrotorch-cubecl --no-default-features -- -D warnings`
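
REQ-4 names two crate-internal helpers, elementwise_launch_dims and debug_assert_handle_capacity, without showing their signatures. As a rough illustration of what helpers of this kind commonly do, the sketch below computes a one-dimensional launch shape by ceiling division and checks that a buffer is large enough for the requested element count. The function names, parameters, and return types here are hypothetical and may differ from the crate's private implementation.

```rust
use std::convert::TryFrom;

/// Hypothetical 1-D launch shape: (number of cubes, units per cube)
/// such that cubes * units_per_cube >= num_elems.
fn launch_dims_1d(num_elems: usize, units_per_cube: u32) -> (u32, u32) {
    assert!(units_per_cube > 0, "units_per_cube must be non-zero");
    let per_cube = units_per_cube as usize;
    // Ceiling division; launch at least one cube so an empty input is still
    // a valid launch (the kernel is then expected to bounds-check).
    let cubes = num_elems.div_ceil(per_cube).max(1);
    let cubes = u32::try_from(cubes).expect("cube count exceeds u32");
    (cubes, units_per_cube)
}

/// Hypothetical capacity check: does a buffer of `handle_bytes` bytes hold
/// `num_elems` elements of `elem_size` bytes each? Overflow counts as "no".
fn handle_has_capacity(handle_bytes: usize, num_elems: usize, elem_size: usize) -> bool {
    num_elems
        .checked_mul(elem_size)
        .is_some_and(|needed| needed <= handle_bytes)
}
```

With 256 units per cube, 1000 elements need 4 cubes and 1025 elements need 5. The last cube is only partly filled, which is why elementwise kernels usually guard each access with an index-versus-length check.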

Re-exports§

pub use runtime::CubeClient;
pub use runtime::CubeDevice;
pub use runtime::CubeRuntime;
pub use storage::CubeclStorageHandle;
pub use storage::cubecl_handle_of;
pub use storage::upload_f32;
pub use storage::wrap_kernel_output;
pub use quant::GgufBlockKind;
pub use quant::dequantize_q4_0_to_gpu;
pub use quant::dequantize_q4_1_to_gpu;
pub use quant::dequantize_q5_0_to_gpu;
pub use quant::dequantize_q5_1_to_gpu;
pub use quant::dequantize_q8_0_to_gpu;
pub use quant::dequantize_q8_1_to_gpu;
pub use quant::split_q4_0_blocks;
pub use quant::split_q4_1_blocks;
pub use quant::split_q5_0_blocks;
pub use quant::split_q5_1_blocks;
pub use quant::split_q8_0_blocks;
pub use quant::split_q8_1_blocks;
pub use grammar::DfaMaskInputs;
pub use grammar::compute_token_mask_dfa_to_gpu;
pub use grammar::kernel_compute_token_mask_dfa;

Modules§

grammar: GPU constrained-decoding token-mask computation.
kernels: CubeCL kernel definitions used by ferrotorch-cubecl.
ops: Portable GPU operations that dispatch through a real CubeCL ComputeClient and run #[cube] kernels on the selected backend.
quant: GGUF quantized-weight dequantization on the GPU.
runtime: Unified runtime selection for CubeCL backends.
storage: Concrete CubeStorageHandle implementation for ferrotorch-cubecl.
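
For readers checking the output of the quant module, a CPU reference for one block format can be useful. The sketch below decodes Q4_0 data following the ggml/GGUF block layout as commonly documented: 32 weights per block, stored as a little-endian f16 scale followed by 16 bytes of packed 4-bit values, where low nibbles hold elements 0..16 and high nibbles hold elements 16..32. The function names f16_to_f32 and dequantize_q4_0_cpu are hypothetical and are not part of this crate; the signature of the crate's own dequantize_q4_0_to_gpu is not shown on this page.

```rust
const QK4_0: usize = 32; // weights per Q4_0 block
const Q4_0_BLOCK_BYTES: usize = 2 + QK4_0 / 2; // f16 scale + 16 packed bytes

/// Converts IEEE 754 half-precision bits to f32 (hypothetical helper).
fn f16_to_f32(bits: u16) -> f32 {
    let sign = if bits & 0x8000 != 0 { -1.0f32 } else { 1.0 };
    let exp = i32::from((bits >> 10) & 0x1F);
    let frac = f32::from(bits & 0x03FF);
    match exp {
        // Zero and subnormals: frac * 2^-24.
        0 => sign * frac * 2f32.powi(-24),
        // Infinity or NaN.
        0x1F => {
            if frac == 0.0 {
                sign * f32::INFINITY
            } else {
                f32::NAN
            }
        }
        // Normal numbers: (1 + frac / 2^10) * 2^(exp - 15).
        _ => sign * (1.0 + frac / 1024.0) * 2f32.powi(exp - 15),
    }
}

/// CPU reference dequantizer for raw Q4_0 bytes (hypothetical helper).
/// Returns None if the input is not a whole number of 18-byte blocks.
fn dequantize_q4_0_cpu(bytes: &[u8]) -> Option<Vec<f32>> {
    if bytes.len() % Q4_0_BLOCK_BYTES != 0 {
        return None;
    }
    let mut out = Vec::with_capacity(bytes.len() / Q4_0_BLOCK_BYTES * QK4_0);
    for block in bytes.chunks_exact(Q4_0_BLOCK_BYTES) {
        let d = f16_to_f32(u16::from_le_bytes([block[0], block[1]]));
        let qs = &block[2..];
        // Low nibbles are elements 0..16, stored with an offset of 8.
        out.extend(qs.iter().map(|&q| (f32::from(q & 0x0F) - 8.0) * d));
        // High nibbles are elements 16..32, same offset.
        out.extend(qs.iter().map(|&q| (f32::from(q >> 4) - 8.0) * d));
    }
    Some(out)
}
```

Comparing a GPU result against a reference like this on a handful of blocks is a cheap way to catch nibble-order or scale-decoding mistakes.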