Crate custos

A minimal OpenCL, WGPU, CUDA and host CPU array manipulation engine / framework written in Rust. This crate provides the tools for executing custom array operations with the CPU, as well as with CUDA, WGPU and OpenCL devices.
This guide demonstrates how operations can be implemented for the compute devices: implement_operations.md
To see this at a larger scale, have a look at custos-math or at sliced.

§Examples

custos itself implements only four Buffer operations: write, read, copy_slice and clear. In addition, there are unary (device-only) operations.
custos-math, on the other hand, implements many more operations, including matrix operations for a custom Matrix struct.
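
A minimal sketch of these built-in operations on a CPU Buffer (the constructor and the clear / write / read calls follow the crate's README; exact signatures may differ between versions):

use custos::{Buffer, ClearBuf, Read, WriteBuf, CPU};

fn main() {
    let device = CPU::new();
    let mut a = Buffer::from((&device, [1, 2, 3, 4, 5, 6]));

    // clear sets every element to its default value (0)
    device.clear(&mut a);
    assert_eq!(device.read(&a), [0; 6]);

    // write new data into the existing allocation
    device.write(&mut a, &[7, 8, 9, 10, 11, 12]);
    assert_eq!(device.read(&a), [7, 8, 9, 10, 11, 12]);
}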

Implementing an operation for the CPU: if you want to implement your own operations for all compute devices, consider looking at implement_operations.md.

use std::ops::Mul;
use custos::prelude::*;

/// Element-wise multiplication of two Buffers, generic over the datatype T,
/// the shape S and the source device D.
pub trait MulBuf<T, S: Shape = (), D: Device = Self>: Sized + Device {
    fn mul(&self, lhs: &Buffer<T, D, S>, rhs: &Buffer<T, D, S>) -> Buffer<T, Self, S>;
}

impl<T, S, D> MulBuf<T, S, D> for CPU
where
    T: Mul<Output = T> + Copy,
    S: Shape,
    D: MainMemory,
{
    fn mul(&self, lhs: &Buffer<T, D, S>, rhs: &Buffer<T, D, S>) -> Buffer<T, CPU, S> {
        // retrieve a (possibly cached) output buffer and register the
        // operation in the graph
        let mut out = self.retrieve(lhs.len(), (lhs, rhs));

        // element-wise multiplication on the host
        for ((lhs, rhs), out) in lhs.iter().zip(&*rhs).zip(&mut out) {
            *out = *lhs * *rhs;
        }

        out
    }
}
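
For illustration, a hedged usage sketch of the MulBuf implementation above on a CPU device (CPU::new, Buffer::from and device.read follow the crate's README; mul is the trait method defined above):

use custos::prelude::*;
// (plus the MulBuf trait and impl from above)

fn main() {
    let device = CPU::new();

    let lhs = Buffer::from((&device, [1., 2., 3., 4.]));
    let rhs = Buffer::from((&device, [2., 2., 2., 2.]));

    // calls the CPU implementation of MulBuf shown above
    let out = device.mul(&lhs, &rhs);
    assert_eq!(device.read(&out), [2., 4., 6., 8.]);
}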

Many more usage examples can be found in the tests and examples folders.

Re-exports§

pub use devices::cpu::CPU;
pub use devices::opencl::OpenCL;
pub use devices::stack::Stack;
pub use devices::*;
pub use autograd::*;

Modules§

autograd
Provides tools for automatic differentiation.
devices
This module defines all available compute devices
exec_on_cpu
This module includes macros and functions for executing operations on the CPU. They move the supplied (CUDA, OpenCL, WGPU, …) Buffers to the CPU and run the operation there. Most of the time, you should implement the operation natively for the device instead, as that is typically faster.
flag
Describes the type of allocation.
number
Contains traits for generic math.
prelude
Typical imports for using custos.
static_api
Exposes an API for static devices. The usage is similar to PyTorch, as Buffers are moved to the GPU or another compute device via .to_gpu, .to_cl, …

Macros§

buf
A macro that creates a CPU Buffer using the static CPU device.
cl_cpu_exec_unified
If the current device supports unified memory, data is not deep-copied. This is much faster than cpu_exec, as no new memory is allocated.
cl_cpu_exec_unified_mut
If the current device supports unified memory, data is not deep-copied. This is much faster than cpu_exec_mut, as no new memory is allocated.
cpu_exec
Moves n Buffers stored on another device to n CPU Buffers and executes an operation on the CPU.
cpu_exec_mut
Moves n Buffers stored on another device to n CPU Buffers and executes an operation on the CPU. The results are written back to the original Buffers.
to_cpu
Shadows all supplied Buffers with CPU Buffers.
to_cpu_mut
Moves Buffers to CPU Buffers. The names of the new CPU Buffers are provided by the user. The new Buffers are declared as mutable.
to_raw_host
Takes Buffers that have a host pointer and wraps them into CPU Buffers. The old Buffers are shadowed.
to_raw_host_mut
Takes Buffers that have a host pointer and wraps them into mutable CPU Buffers. New names for the CPU Buffers are provided by the user.

Structs§

Buffer
The underlying non-growable array structure of custos. A Buffer may be encapsulated in other data structures. By default, the Buffer is an f32 CPU Buffer with no statically known shape.
CacheTrace
A CacheTrace is a list of nodes that shows which Buffers could use the same cache.
Count
Used to reset the cache count.
CountIntoIter
The iterator used for setting the cache count.
Dim1
A 1D shape.
Dim2
A 2D shape.
Dim3
A 3D shape.
GlobalCount
Uses the global count as the next index for a Node.
Graph
A graph of Nodes. It is typically built up during the forward process. (calling device.retrieve(.., (lhs, rhs)))
Node
A node in the Graph.
NodeCount
Uses the number of nodes in the graph as the next index for a Node.
Num
Makes it possible to use a single number in a Buffer.
Resolve
Resolves to either a mathematical expression as a string or a computed value. This is used to create generic kernels / operations over OpenCL, CUDA and CPU.

Enums§

DeviceError
‘Generic’ device errors that can occur on any device.

Constants§

UNIFIED_CL_MEM
This is true if the OpenCL device selected by the environment variable CUSTOS_CL_DEVICE_IDX supports unified memory.

Traits§

AddGraph
Trait for adding a node to a graph.
Alloc
This trait is for allocating memory on the implemented device.
ApplyFunction
Applies a function to a buffer and returns a new buffer.
AsRangeArg
Converts ranges into a start and end index.
ClearBuf
Trait for implementing the clear() operation for the compute devices.
CloneBuf
This trait is used to clone a buffer based on a specific device type.
Combiner
A trait that allows combining math operations (similar to an Iterator).
CommonPtrs
custos v5 compatibility for “common pointers”. The common pointers contain the host, OpenCL and CUDA pointers.
CopySlice
Trait for copying a slice of a buffer, to implement the slice() operation.
Device
This trait is the base trait for every device.
DevicelessAble
All types of devices that can create Buffers.
ErrorKind
A trait for downcasting errors.
Eval
Evaluates a combined (via Combiner) math operations chain to a value.
GraphReturn
Returns a mutable reference to the graph.
IsConstDim
If the Shape provides a fixed size, this trait should be implemented.
IsShapeIndep
If the Shape does not matter for a specific device Buffer, this trait should be implemented.
MainMemory
Devices that can access the main memory / RAM of the host.
MayDim2
The shape may be 2D or ().
MayTapeReturn
If the autograd feature is enabled, this is implemented for all types that implement TapeReturn. If the autograd feature is disabled, no Tape will be returnable.
MayToCLSource
If the no-std feature is disabled, this trait is implemented for all types that implement ToCLSource.
NodeIdx
Returns the next index for a Node.
PtrType
This trait is implemented for every pointer type.
Read
Trait for reading buffers. Synchronization point for CUDA.
ShallowCopy
Used to shallow-copy a pointer. Use is discouraged.
Shape
Determines the shape of a Buffer. Shape is used to get the size and ND-Array for a stack allocated Buffer.
ToCLSource
Evaluates a combined (via Combiner) math operations chain to a valid OpenCL C (and possibly CUDA) source string.
ToDim
Converts a pointer to a different Shape.
ToMarker
Converts a &'static str to a Resolve.
ToVal
Converts a value to a Resolve.
UnaryElementWiseMayGrad
Applies the forward function of a new/cached Buffer and returns it. If the autograd feature is enabled, the gradient function is also calculated via the grad function.
UnaryGrad
Writes the unary gradient (using the chain rule) to the lhs_grad buffer.
WithShape
Trait for creating Buffers with a Shape. The Shape is inferred from the array.
WriteBuf
Trait for writing data to buffers.

Functions§

range
range resets the cache count in every iteration. The cache count is used to retrieve the same allocation in each iteration. Without range, new memory is allocated in each iteration and is only freed when the device is dropped.
To disable this caching behaviour, the realloc feature can be enabled.
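
For example, a sketch of the intended usage, reusing the MulBuf implementation from the example above (that range(100) yields the iteration indices is an assumption based on the crate's examples):

use custos::prelude::*;
// (plus the MulBuf trait and impl from the example above)

fn main() {
    let device = CPU::new();
    let lhs = Buffer::from((&device, [1., 2., 3.]));
    let rhs = Buffer::from((&device, [4., 5., 6.]));

    for _ in range(100) {
        // the retrieve call inside mul returns the same cached allocation
        // in every iteration, because range resets the cache count
        let _out = device.mul(&lhs, &rhs);
    }
}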

Type Aliases§

Error
A type alias for Box<dyn std::error::Error + Send + Sync>.
Result
A type alias for Result<T, Error>.

Attribute Macros§

impl_stack
Expands a CPU implementation to a Stack and CPU implementation.