Crate custos


A minimal OpenCL, WGPU, CUDA and host CPU array manipulation engine / framework written in Rust. This crate provides the tools for executing custom array operations with the CPU, as well as with CUDA, WGPU and OpenCL devices.
This guide demonstrates how operations can be implemented for the compute devices: implement_operations.md.
To see this at a larger scale, look at custos-math or sliced.

Examples

custos itself only implements four Buffer operations: write, read, copy_slice and clear. In addition, there are unary (device-only) operations.
On the other hand, custos-math implements many more operations, including Matrix operations for a custom Matrix struct.
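
Below is a minimal sketch (not part of the crate docs) of the read/clear side of this API. It assumes the CPU device and the prelude re-exports, and that a Buffer can be constructed from a (device, array) pair and read back via read():

use custos::prelude::*;

fn main() {
    let device = CPU::new();

    // create (and implicitly write) a buffer on the CPU device
    let mut buf = Buffer::from((&device, [1., 2., 3., 4.]));

    // read the buffer's contents back
    assert_eq!(buf.read(), vec![1., 2., 3., 4.]);

    // clear sets every element to the default value (0.0 here)
    buf.clear();
    assert_eq!(buf.read(), vec![0.; 4]);
}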

Implementing an operation for the CPU device: if you want to implement your own operations for all compute devices, see implement_operations.md.

use std::ops::Mul;
use custos::prelude::*;

pub trait MulBuf<T, S: Shape = (), D: Device = Self>: Sized + Device {
    fn mul(&self, lhs: &Buffer<T, D, S>, rhs: &Buffer<T, D, S>) -> Buffer<T, Self, S>;
}

impl<T, S, D> MulBuf<T, S, D> for CPU
where
    T: Mul<Output = T> + Copy,
    S: Shape,
    D: MainMemory,
{
    fn mul(&self, lhs: &Buffer<T, D, S>, rhs: &Buffer<T, D, S>) -> Buffer<T, CPU, S> {
        // retrieve a (possibly cached) output buffer with the same length as the inputs
        let mut out = self.retrieve(lhs.len(), (lhs, rhs));

        // element-wise multiplication on the host
        for ((lhs, rhs), out) in lhs.iter().zip(&*rhs).zip(&mut out) {
            *out = *lhs * *rhs;
        }

        out
    }
}
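
A hypothetical usage of the MulBuf implementation above might look like the sketch below (not from the crate docs; it follows the Buffer construction and read() pattern used in the basic example further up):

fn main() {
    let device = CPU::new();

    let lhs = Buffer::from((&device, [1., 2., 3.]));
    let rhs = Buffer::from((&device, [4., 5., 6.]));

    // `mul` retrieves an output buffer and fills it element-wise
    let out = device.mul(&lhs, &rhs);
    assert_eq!(out.read(), vec![4., 10., 18.]);
}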

A lot more usage examples can be found in the tests and examples folder.

Re-exports

Modules

  • Provides tools for automatic differentiation.
  • This module defines all available compute devices.
  • This module includes macros and functions for executing operations on the CPU. They move the supplied (CUDA, OpenCL, WGPU, …) Buffers to the CPU and execute the operation on the CPU. Most of the time, you should actually implement the operation for the device natively, as it is typically faster.
  • Describes the type of allocation.
  • Contains traits for generic math.
  • Typical imports for using custos.
  • Exposes an API for static devices. The usage is similar to PyTorch, as Buffers are moved to the GPU or another compute device via .to_gpu, .to_cl, …

Macros

  • A macro that creates a CPU Buffer using the static CPU device.
  • If the current device supports unified memory, data is not deep-copied. This is way faster than cpu_exec, as new memory is not allocated.
  • If the current device supports unified memory, data is not deep-copied. This is way faster than cpu_exec_mut, as new memory is not allocated.
  • Moves n Buffers stored on another device to n CPU Buffers and executes an operation on the CPU.
  • Moves n Buffers stored on another device to n CPU Buffers and executes an operation on the CPU. The results are written back to the original Buffers.
  • Shadows all supplied Buffers with CPU Buffers.
  • Moves Buffers to CPU Buffers. The name of the new CPU Buffers are provided by the user. The new Buffers are declared as mutable.
  • Takes Buffers that have a host pointer and wraps them into CPU Buffers. The old Buffers are shadowed.
  • Takes Buffers that have a host pointer and wraps them into mutable CPU Buffers. New names for the CPU Buffers are provided by the user.

Structs

  • The underlying non-growable array structure of custos. A Buffer may be encapsulated in other data structures. By default, the Buffer is a f32 CPU Buffer with no statically known shape.
  • A CacheTrace is a list of nodes that shows which Buffers could use the same cache.
  • Used to reset the cache count.
  • The iterator used for setting the cache count.
  • A 1D shape.
  • A 2D shape.
  • A 3D shape.
  • Uses the global count as the next index for a Node.
  • A graph of Nodes. It is typically built up during the forward process. (calling device.retrieve(.., (lhs, rhs)))
  • A node in the Graph.
  • Uses the amount of nodes in the graph as the next index for a Node.
  • Makes it possible to use a single number in a Buffer.
  • Resolves to either a mathematical expression as string or a computed value. This is used to create generic kernels / operations over OpenCL, CUDA and CPU.

Enums

  • ‘Generic’ device errors that can occur on any device.

Constants

  • If the OpenCL device selected by the environment variable CUSTOS_CL_DEVICE_IDX supports unified memory, then this will be true. In your case, this is false.

Traits

  • Trait for adding a node to a graph.
  • This trait is for allocating memory on the implemented device.
  • Applies a function to a buffer and returns a new buffer.
  • Converts ranges into a start and end index.
  • Trait for implementing the clear() operation for the compute devices.
  • This trait is used to clone a buffer based on a specific device type.
  • A trait that allows combining math operations. (Similar to an Iterator)
  • custos v5 compatibility for “common pointers”. The common pointers contain the following pointers: host, opencl and cuda.
  • Trait for copying a slice of a buffer, to implement the slice() operation.
  • This trait is the base trait for every device.
  • All types of devices that can create Buffers.
  • A trait for downcasting errors.
  • Evaluates a combined (via Combiner) math operations chain to a value.
  • Returns a mutable reference to the graph.
  • If the Shape provides a fixed size, then this trait should be implemented.
  • If the Shape does not matter for a specific device Buffer, then this trait should be implemented.
  • Devices that can access the main memory / RAM of the host.
  • The shape may be 2D or ().
  • If the autograd feature is enabled, then this will be implemented for all types that implement TapeReturn. On the other hand, if the autograd feature is disabled, no Tape will be returnable.
  • If the no-std feature is disabled, this trait is implemented for all types that implement ToCLSource. In this case, no-std is disabled.
  • Returns the next index for a Node.
  • This trait is implemented for every pointer type.
  • Trait for reading buffers. Synchronization point for CUDA.
  • Used to shallow-copy a pointer. Use is discouraged.
  • Determines the shape of a Buffer. Shape is used to get the size and ND-Array for a stack allocated Buffer.
  • Evaluates a combined (via Combiner) math operations chain to a valid OpenCL C (and possibly CUDA) source string.
  • Converts a pointer to a different Shape.
  • Converts a &'static str to a Resolve.
  • Converts a value to a Resolve.
  • Applies the forward function of a new/cached Buffer and returns it. If the autograd feature is enabled, the gradient function is also calculated via the grad function.
  • Writes the unary gradient (with the chain rule) to the lhs_grad buffer.
  • Trait for creating Buffers with a Shape. The Shape is inferred from the array.
  • Trait for writing data to buffers.

Functions

  • range resets the cache count in every iteration. The cache count is used to retrieve the same allocation in each iteration. Without range, new memory is allocated in each iteration and is only freed when the device is dropped. A usage sketch follows below.
    To disable this caching behaviour, the realloc feature can be enabled.
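
The following is a rough sketch of that behaviour (not from the crate docs). It assumes that range accepts an upper bound, and that retrieve takes a length and a graph-node argument (with () usable as an empty node), as in the MulBuf example near the top of this page:

use custos::{range, prelude::*};

fn main() {
    let device = CPU::new();

    for _ in range(100) {
        // with `range`, the cache count is reset every iteration, so the same
        // cached allocation is handed out again instead of allocating new memory
        let _out: Buffer<f32> = device.retrieve(100, ());
    }
}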

Type Definitions

  • A type alias for Box<dyn std::error::Error + Send + Sync>.
  • A type alias for Result<T, Error>.

Attribute Macros

  • Expands a CPU implementation to a Stack and CPU implementation.