Module kernel

Kernels.

Kernels are functions dispatched from the host that execute on the device. They are declared within modules, which create a shared scope between host and device. krnlc collects all modules and compiles them. krnl-core is a shared core library between both host and device.

use krnl::{
    macros::module,
    anyhow::Result,
    device::Device,
    buffer::{Buffer, Slice, SliceMut},
};

#[module]
mod kernels {
    #[cfg(not(target_arch = "spirv"))]
    use krnl::krnl_core;
    use krnl_core::macros::kernel;

    pub fn saxpy_impl(alpha: f32, x: f32, y: &mut f32) {
        *y += alpha * x;
    }

    // Item kernels for iterator patterns.
    #[kernel]
    pub fn saxpy(alpha: f32, #[item] x: f32, #[item] y: &mut f32) {
        saxpy_impl(alpha, x, y);
    }

    // General purpose kernels like CUDA / OpenCL.
    #[kernel]
    pub fn saxpy_global(alpha: f32, #[global] x: Slice<f32>, #[global] y: UnsafeSlice<f32>) {
        use krnl_core::buffer::UnsafeIndex;

        let global_id = kernel.global_id();
        if global_id < x.len().min(y.len()) {
            saxpy_impl(alpha, x[global_id], unsafe { y.unsafe_index_mut(global_id) });
        }
    }
}

fn saxpy(alpha: f32, x: Slice<f32>, mut y: SliceMut<f32>) -> Result<()> {
    if let Some((x, y)) = x.as_host_slice().zip(y.as_host_slice_mut()) {
        x.iter()
            .copied()
            .zip(y.iter_mut())
            .for_each(|(x, y)| kernels::saxpy_impl(alpha, x, y));
        return Ok(());
    }
    // Dispatch the item kernel:
    kernels::saxpy::builder()?
        .build(y.device())?
        .dispatch(alpha, x, y)
    // Or, alternatively, dispatch the general purpose kernel:
    // kernels::saxpy_global::builder()?
    //     .build(y.device())?
    //     .with_global_threads(y.len() as u32)
    //     .dispatch(alpha, x, y)
}

fn main() -> Result<()> {
    let alpha = 2f32;
    let x = vec![1f32];
    let y = vec![0f32];
    let device = Device::builder().build().ok().unwrap_or(Device::host());
    let x = Buffer::from(x).into_device(device.clone())?;
    let mut y = Buffer::from(y).into_device(device.clone())?;
    saxpy(alpha, x.as_slice(), y.as_slice_mut())?;
    let y = y.into_vec()?;
    println!("{y:?}");
    Ok(())
}

§krnlc

Kernels are compiled with krnlc.

Compile with krnlc or krnlc -p my-crate.

1. Runs the equivalent of cargo expand to locate all modules.
2. Generates a device crate under <target-dir>/krnlc/crates/<my-crate>.
3. Compiles the device crate with spirv-builder.
4. Processes the output, validates and optimizes with spirv-tools.
5. Writes out to “krnl-cache.rs”, which is imported by module and kernel macros.

The cache allows packages to build with stable Rust, without recompiling kernels downstream:

__krnl_cache!("0.1.0", "
abZy8000000@}Rn2BDn}/7A)@mn.pk3NcX{F^z.@l>4nckBq6ms3D<md5goETn#<^Op/(BnRwlm5HBw3ld4M0 ..
..
");

If the version of krnlc is incompatible with the krnl version, module will emit a compiler error.

§Toolchains

To locate modules, krnlc will use the nightly toolchain. Install it with:

rustup toolchain install nightly

To compile kernels with spirv-builder, a specific nightly is required:

rustup toolchain install nightly-2023-05-27
rustup component add --toolchain nightly-2023-05-27 rust-src rustc-dev llvm-tools-preview

§Installing

With spirv-tools from the LunarG Vulkan SDK installed (will save significant compile time):

cargo +nightly-2023-05-27 install krnlc --locked --no-default-features \
 --features use-installed-tools

Otherwise:

cargo +nightly-2023-05-27 install krnlc --locked

§Metadata

krnlc can read metadata from Cargo.toml:

[package.metadata.krnlc]
# enable default features when locating modules
default-features = false
# features to enable when locating modules
features = ["zoom", "zap"]

[package.metadata.krnlc.dependencies]
# source is inherited from host target
foo = { default-features = false, features = ["foo"] }
# keys are inherited if not provided
bar = {}
# private dependency
baz = { path = "baz" }

krnl-core is automatically included as a dependency.

§Modules

The module macro declares a shared host and device scope that is visible to krnlc. The spirv arch will be used by krnlc when compiling modules for the device.

use krnl::macros::module;

#[module]
mod kernels {
    #[cfg(not(target_arch = "spirv"))]
    use krnl::krnl_core;
    use krnl_core::macros::kernel;

    #[kernel]
    pub fn foo() {}
}

Modules must be within a module hierarchy, not within fn’s or impl blocks.

§Attributes

Additional options can be passed via attributes:

#[module]
// Does not compile the module with krnlc, used for krnl's docs.
#[krnl(no_build)]
// Override path to krnl when it isn't a dependency.
#[krnl(crate=foo::krnl)]
mod kernels {
    /* .. */
}

§Imports

Functions and other items are visible to other modules, and can be imported:

mod foo {
    #[module]
    pub mod bar {
        pub struct Bar;
    }
}

#[module]
mod baz {
    use super::foo::bar::Bar;
}

§Kernels

The kernel macro declares a function that executes on the device, dispatched from the host.

#[kernel]
fn foo<
    // Specialization Constants
    const U: i32,
    const V: f32,
    const W: u32,
>(
    // Kernel
    /* kernel: Kernel or ItemKernel */
    // Global Buffers
    #[global] a: Slice<f32>,
    #[global] b: UnsafeSlice<i32>,
    // Items
    #[item] c: f64,
    #[item] d: &mut u64,
    // Push Constants
    e: u8,
    f: i32,
    // Group Buffers
    #[group] g: UnsafeSlice<f32, 100>,
    #[group] h: UnsafeSlice<i32, { (W * 10 + 1) as usize }>,
) {
    /* .. */
}

§Items

Item kernels are a simple and safe abstraction for iterator patterns. Item kernels have an implicit ItemKernel argument.

Mapping a buffer with a fn:

fn scale_to_f32_impl(x: u8) -> f32 {
    x as f32 / 255.
}

#[kernel]
fn scale_to_f32(#[item] x: u8, #[item] y: &mut f32) {
    *y = scale_to_f32_impl(x);
}

if let Some(x) = x.as_host_slice() {
    let y: Vec<f32> = x.iter().copied().map(scale_to_f32_impl).collect();
    Ok(Buffer::from(y))
} else {
    let mut y = Buffer::zeros(x.device(), x.len())?;
    scale_to_f32::builder()?
        .build(x.device())?
        .dispatch(x, y.as_slice_mut())?;
    Ok(y)
}

§Push Constants

Scalar arguments without an attribute. Unlike SpecConstants, they are provided to .dispatch(..), and do not require rebuilding the kernel.

At least 128 bytes of push constants can be used, depending on the device. Each item or global argument requires 8 bytes of push constants.
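
As a rough illustration of the budget described above, the following sketch adds up the push constant bytes a kernel signature would need. The names `push_constant_bytes`, `fits_minimum`, and both constants are hypothetical helpers, not part of krnl, and the estimate ignores any alignment padding the compiler may insert.

```rust
/// Guaranteed minimum push constant space in bytes, per the text above.
/// (Hypothetical constant for illustration, not a krnl item.)
const MIN_PUSH_CONSTANT_BYTES: usize = 128;

/// Bytes taken by each item or global argument, per the text above.
const BYTES_PER_BUFFER_ARG: usize = 8;

/// Estimates the push constant bytes used by a kernel.
///
/// `scalar_sizes` holds the size in bytes of each scalar push constant
/// argument, `buffer_args` is the number of item and global arguments.
/// Alignment padding is ignored (an assumption of this sketch).
fn push_constant_bytes(scalar_sizes: &[usize], buffer_args: usize) -> usize {
    scalar_sizes.iter().sum::<usize>() + buffer_args * BYTES_PER_BUFFER_ARG
}

/// Returns true if the estimate fits within the guaranteed minimum.
fn fits_minimum(scalar_sizes: &[usize], buffer_args: usize) -> bool {
    push_constant_bytes(scalar_sizes, buffer_args) <= MIN_PUSH_CONSTANT_BYTES
}
```

Under these assumptions, a kernel like saxpy_global with one f32 scalar and two global buffers would need roughly 4 + 2 * 8 = 20 bytes, well within the minimum.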

§Groups, Subgroups, and Threads

Kernels without items have an implicit Kernel argument that uniquely identifies the group, subgroup, and thread.

Kernels are dispatched with groups of threads (CUDA thread blocks). Threads in a group are executed together, typically on the same processor with a shared L1 cache. This is exposed via group buffers.

Thread groups are composed of subgroups of threads (CUDA warps), similar to SIMD vector registers on a CPU. The number of threads per subgroup is a power of 2 between 1 and 128; typical values are 32 for NVIDIA and 64 for AMD. On a given device it may range between min_subgroup_threads and max_subgroup_threads. Each subgroup in a group has subgroup_threads threads, except when the threads per group is not an exact multiple of subgroup_threads, in which case the last subgroup has the remaining threads.
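
The way a group is partitioned into subgroups can be sketched on the host as follows. The function `subgroup_sizes` is a hypothetical helper written for this illustration and is not part of krnl; it only models the partitioning rule described above.

```rust
/// Splits `threads` (threads per group) into subgroup sizes, given
/// `subgroup_threads` threads per full subgroup.
///
/// Every subgroup is full except possibly the last, which holds the remainder.
/// Hypothetical helper for illustration, not part of krnl.
fn subgroup_sizes(threads: u32, subgroup_threads: u32) -> Vec<u32> {
    // Subgroup sizes are powers of 2 between 1 and 128, per the text above.
    assert!(subgroup_threads.is_power_of_two() && subgroup_threads <= 128);
    let full = threads / subgroup_threads;
    let remainder = threads % subgroup_threads;
    let mut sizes = vec![subgroup_threads; full as usize];
    if remainder != 0 {
        // The last subgroup gets whatever threads are left over.
        sizes.push(remainder);
    }
    sizes
}
```

For example, a group of 100 threads with 32 threads per subgroup would be modelled as three full subgroups and a final subgroup of 4 threads.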

§Global Buffers

Visible to all threads. A Slice argument binds to a Slice and an UnsafeSlice argument binds to a SliceMut, both provided to .dispatch(..).

For best performance, consecutive threads should access consecutive elements, allowing loads and stores to be coalesced into fewer memory transactions.
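
To make the access pattern advice concrete, here is a simplified host-side model that counts how many memory segments a set of threads touches. The function `segments_touched` and the notion of a fixed-size segment are assumptions of this sketch; real hardware coalescing rules differ between devices.

```rust
use std::collections::BTreeSet;

/// Counts the distinct memory segments touched when each thread in
/// `0..threads` accesses element `index_of(thread)`.
///
/// `segment_elems` is the number of elements per segment, a stand-in for the
/// width of one memory transaction (hypothetical model, real hardware differs).
fn segments_touched(
    threads: usize,
    segment_elems: usize,
    index_of: impl Fn(usize) -> usize,
) -> usize {
    (0..threads)
        .map(|thread| index_of(thread) / segment_elems)
        .collect::<BTreeSet<_>>()
        .len()
}
```

In this model, 32 threads reading consecutive elements (`|t| t`) with 32-element segments touch a single segment, while a stride of 32 (`|t| t * 32`) touches 32 separate segments.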

§Group Buffers

Shared with all threads in the group, initialized with zeros. Can be used to minimize accesses to global buffers.

The maximum amount of memory that can be used for group buffers depends on the device. Kernels exceeding this will fail to build.

Barriers should be used as necessary to synchronize access.

#[kernel]
fn group_sum(
    #[global] x: Slice<f32>,
    #[group] x_group: UnsafeSlice<f32, 64>,
    #[global] y: UnsafeSlice<f32>,
) {
    use krnl_core::{
        buffer::UnsafeIndex,
        spirv_std::arch::workgroup_memory_barrier_with_group_sync as group_barrier
    };

    let global_id = kernel.global_id();
    let group_id = kernel.group_id();
    let thread_id = kernel.thread_id();
    unsafe {
        *x_group.unsafe_index_mut(thread_id) = x[global_id];
        // Barriers are used to synchronize access to group memory.
        // This call must be reached by all active threads in the group!
        group_barrier();
    }
    if thread_id == 0 {
        let mut acc = 0f32;
        for i in 0 .. 64 {
            unsafe {
                acc += *x_group.unsafe_index(i);
            }
        }
        unsafe {
            *y.unsafe_index_mut(group_id) = acc;
        }
    }
}
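
When testing a kernel like group_sum, it can help to compare against a plain host implementation. The sketch below is one possible reference; `group_sum_host` is a hypothetical helper, not part of krnl, and it assumes the input length is an exact multiple of the threads per group, since the kernel above reads x[global_id] for every thread.

```rust
/// Host reference for the `group_sum` kernel above: sums each group of
/// `threads` consecutive elements of `x` into one output element.
///
/// Assumes `x.len()` is a multiple of `threads` (an assumption of this
/// sketch). Hypothetical helper, not part of krnl.
fn group_sum_host(x: &[f32], threads: usize) -> Vec<f32> {
    assert!(threads > 0 && x.len() % threads == 0);
    x.chunks_exact(threads)
        // Accumulate in index order, like the loop run by thread 0.
        .map(|group| group.iter().sum::<f32>())
        .collect()
}
```

With 64 threads per group, each output element would be the sum of one 64-element chunk of the input, matching what thread 0 of each group writes to y.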

§KernelBuilder

A kernel declaration is expanded to a mod with a custom KernelBuilder and Kernel.

pub mod saxpy {
    /// Builder for creating a [`Kernel`].
    ///
    /// See [`builder()`](builder).
    pub struct KernelBuilder { /* .. */ }

    /// Creates a builder.
    ///
    /// The builder is lazily created on first call.
    ///
    /// # Errors
    /// - The kernel wasn't compiled (with `#[krnl(no_build)]` applied to `#[module]`).
    pub fn builder() -> Result<KernelBuilder>;

    impl KernelBuilder {
        /// Threads per group.
        ///
        /// Defaults to [`DeviceInfo::default_threads()`](DeviceInfo::default_threads).
        pub fn with_threads(self, threads: u32) -> Self;
        /// Builds the kernel for `device`.
        ///
        /// The kernel is cached, so subsequent calls to `.build()` with identical
        /// builders (ie threads and spec constants) may avoid recompiling.
        ///
        /// # Errors
        /// - `device` doesn't have required features.
        /// - The kernel is not supported on `device`.
        /// - [`DeviceLost`].
        pub fn build(&self, device: Device) -> Result<Kernel>;
    }

    /// Kernel.
    pub struct Kernel<G = WithGroups<false>> { /* .. */ }

    impl<G> Kernel<G> {
        /// Threads per group.
        pub fn threads(&self) -> u32;
        /// Global threads to dispatch.
        ///
        /// Implicitly declares groups by rounding up to the next multiple of threads.
        pub fn with_global_threads(self, global_threads: u32) -> Kernel<WithGroups<true>>;
        /// Groups to dispatch.
        ///
        /// For item kernels, if not provided, is inferred based on item arguments.
        pub fn with_groups(self, groups: u32) -> Kernel<WithGroups<true>>;
    }

    impl Kernel<WithGroups<true>> {
        /// Dispatches the kernel.
        ///
        /// - Waits for immutable access to slice arguments.
        /// - Waits for mutable access to mutable slice arguments.
        /// - Blocks until the kernel is queued.
        ///
        /// # Errors
        /// - [`DeviceLost`].
        /// - The kernel could not be queued.
        pub fn dispatch(&self, alpha: f32, x: Slice<f32>, y: SliceMut<f32>) -> Result<()>;
    }
}

View the generated code and documentation with cargo doc. Also use --document-private-items if the item is private.

The builder() method returns a KernelBuilder for creating a Kernel. This will fail if the kernel wasn’t compiled, which happens when #[krnl(no_build)] is applied to the module. The builder is cached so that subsequent calls are trivial.

The number of threads per group can be set via .with_threads(..). It will default to DeviceInfo::default_threads() if not provided.

Building a kernel is an expensive operation, so it is cached within Device. Subsequent calls to .build(..) with identical builders (threads and spec constants) may avoid recompiling.
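
The caching behaviour can be pictured with a toy model keyed on the builder's settings. Everything in this sketch (`BuildKey`, `KernelCache`, the string standing in for a compiled kernel) is hypothetical; krnl's actual cache inside Device is not shown in this document and may work differently.

```rust
use std::collections::HashMap;

/// Key identifying a built kernel: threads per group plus spec constant values.
/// (Hypothetical model; krnl's internal cache key is not shown in the text.)
#[derive(Clone, PartialEq, Eq, Hash)]
struct BuildKey {
    threads: u32,
    spec_constants: Vec<u32>,
}

/// Toy per-device cache that "compiles" only on a miss.
#[derive(Default)]
struct KernelCache {
    built: HashMap<BuildKey, String>,
    compilations: usize,
}

impl KernelCache {
    /// Returns the cached kernel for `key`, compiling it on first use.
    fn build(&mut self, key: BuildKey) -> &str {
        if !self.built.contains_key(&key) {
            // Stand-in for the expensive compile step.
            self.compilations += 1;
            let kernel = format!(
                "kernel<threads={}, spec={:?}>",
                key.threads, key.spec_constants
            );
            self.built.insert(key.clone(), kernel);
        }
        &self.built[&key]
    }
}
```

In this model, building twice with the same threads and spec constants compiles once, while changing either value produces a new entry.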

§Features

Kernels implicitly declare Features based on the types and/or operations used. If the device does not support these features, .build(..) will return an error.

See DeviceInfo::features().

§Specialization

SpecConstants are declared like const generic parameters, but are not const when compiling in Rust. They may be used to define the length of a Group Buffer. At runtime, SpecConstants are provided to the builder via .specialize(..). During .build(..), they are converted to constants.

#[repr(u32)]
enum Op {
    Add = 1,
    Sub = 2,
}

#[kernel]
fn binary<const OP: u32>(
    #[item] a: f32,
    #[item] b: f32,
    #[item] c: &mut f32,
) {
    if OP == Op::Add as u32 {
        *c = a + b
    } else if OP == Op::Sub as u32 {
        *c = a - b
    } else {
        panic!("Invalid op: {OP}");
    }
}

binary::builder()?
    .specialize(Op::Add as u32)
    .build(device)?;

§Dispatch

Once built, the groups to dispatch may be set via .with_groups(..), or .with_global_threads(..) which rounds up to the next multiple of threads. Item kernels infer the global_threads based on the number of items.
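
The rounding performed for global threads can be sketched as a ceiling division. The helpers `groups_for` and `launched_threads` are hypothetical and only model the arithmetic described above; they are not krnl functions.

```rust
/// Number of groups needed to cover `global_threads` with `threads` per group.
/// Hypothetical helper modelling the rounding described above, not krnl's code.
fn groups_for(global_threads: u32, threads: u32) -> u32 {
    assert!(threads > 0, "threads per group must be nonzero");
    // Ceiling division without risking overflow in `global_threads + threads`.
    global_threads / threads + u32::from(global_threads % threads != 0)
}

/// Threads actually launched: `global_threads` rounded up to a multiple of
/// `threads`. Returned as `u64` so the product cannot overflow.
fn launched_threads(global_threads: u32, threads: u32) -> u64 {
    u64::from(groups_for(global_threads, threads)) * u64::from(threads)
}
```

For instance, 100 global threads with 64 threads per group would give 2 groups and 128 launched threads. Because more threads than elements may run, a general purpose kernel should check global_id against the buffer length, as saxpy_global does.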

The .dispatch(..) method blocks until the kernel is queued. One kernel can be queued while another is executing.

When a kernel begins executing, the device will begin processing one or more groups in parallel, until all groups have finished.

Synchronization is automatically performed as necessary between kernels and when transferring buffers to and from devices. Device::wait() can be used to explicitly wait for prior operations to complete.

§SPIR-V

Binary intermediate representation for graphics shaders that can be used with Vulkan. Kernels are implemented as compute shaders targeting Vulkan 1.2.

spirv-std is a std library for the spirv arch, for use with rust-gpu.

§Asm

The asm! macro can be used with the spirv arch, see inline-asm.

§DebugPrintf

debug_printf! and debug_printfln! will print formatted output to stderr.

#[kernel]
fn foo(x: f32) {
    use krnl_core::spirv_std; // spirv_std must be in scope
    use spirv_std::macros::debug_printfln;

    unsafe {
        debug_printfln!("Hello World!");
    }
}

Pass --debug-printf to krnlc to enable. DebugPrintf will disable many optimizations and include debug info, significantly increasing the size of both the cache and kernels at runtime.

The DebugPrintf Validation Layer must be active when the device is created or DebugPrintf instructions will be removed.

[Device(0@7f6f3c9724d0) crate::kernels::foo<threads=1>] Validation Information: [ UNASSIGNED-DEBUG-PRINTF ]
Object 0: handle = 0x7f6f3c9724d0, type = VK_OBJECT_TYPE_DEVICE; | MessageID = 0x92394c89 | Hello World!

§Panics

§Without DebugPrintf

Panics in kernels will abort the thread. This will not stop other threads from continuing, and the panic will not be caught from the host.

§With DebugPrintf

Kernels will block on completion, and return an error on panic. When a kernel thread panics, a message will be printed to stderr, including the device, the name, the panic message, and a backtrace of calls leading to the panic.

[Device(0@7f89289724d0) crate::kernels::foo<threads=2, N=4>] Validation Information: [ UNASSIGNED-DEBUG-PRINTF ] Object 0: handle = 0x7f89289b6070, type = VK_OBJECT_TYPE_QUEUE; | MessageID = 0x92394c89 | Command buffer (0x7f892896d7f0). Compute Dispatch Index 0. Pipeline (0x7f8928a95fb0). Shader Module (0x7f8928a9d500). Shader Instruction Index = 137.  Stage = Compute.  Global invocation ID (x, y, z) = (1, 0, 0 )
[Rust panicked at ~/.cargo/git/checkouts/krnl-699626729fecae20/db00d07/krnl-core/src/buffer.rs:169:20]
 index out of bounds: the len is 1 but the index is 1
      in <krnl_core::buffer::UnsafeSliceRepr<u32> as krnl_core::buffer::UnsafeIndex<usize>>::unsafe_index_mut
        called at ~/.cargo/git/checkouts/krnl-699626729fecae20/db00d07/krnl-core/src/buffer.rs:229:18
      by <krnl_core::buffer::BufferBase<krnl_core::buffer::UnsafeSliceRepr<u32>> as krnl_core::buffer::UnsafeIndex<usize>>::unsafe_index_mut
        called at src/kernels.rs:15:10
      by crate::kernels::foo::foo
        called at src/kernels.rs:11:1
      by crate::kernels::foo
        called at src/kernels.rs:12:8
      by crate::kernels::foo(__krnl_global_id = vec3(1, 0, 0), __krnl_groups = vec3(1, 1, 1), __krnl_group_id = vec3(0, 0, 0), __krnl_subgroups = 1, __krnl_subgroup_id = 0, __krnl_subgroup_threads = 32, __krnl_subgroup_thread_id = 1, __krnl_thread_id = vec3(1, 0, 0))
 Unable to find SPIR-V OpLine for source information.  Build shader with debug info to get source information.
thread 'foo' panicked at src/lib.rs:50:10:
called `Result::unwrap()` on an `Err` value: Kernel `crate::kernels::foo<threads=2, N=4>` panicked!

Note: The validation layer can be configured to redirect messages to stdout. This will prevent krnl from receiving a callback and returning an error in case of a panic.
