A minimal OpenCL, WGPU, CUDA and host CPU array manipulation engine / framework written in Rust.
This crate provides the tools for executing custom array operations with the CPU, as well as with CUDA, WGPU and OpenCL devices.
This guide demonstrates how operations can be implemented for the compute devices: implement_operations.md.
To see this at a larger scale, look at custos-math or sliced.
Examples
custos only implements four Buffer operations: write, read, copy_slice and clear. There are also unary (device only) operations.
On the other hand, custos-math implements a lot more operations, including Matrix operations for a custom Matrix struct.
Implement an operation for CPU:
If you want to implement your own operations for all compute devices, consider looking here: implement_operations.md
use std::ops::Mul;
use custos::prelude::*;

// Element-wise multiplication of two Buffers, implemented per device.
pub trait MulBuf<T, S: Shape = (), D: Device = Self>: Sized + Device {
    fn mul(&self, lhs: &Buffer<T, D, S>, rhs: &Buffer<T, D, S>) -> Buffer<T, Self, S>;
}

impl<T, S, D> MulBuf<T, S, D> for CPU
where
    T: Mul<Output = T> + Copy,
    S: Shape,
    D: MainMemory,
{
    fn mul(&self, lhs: &Buffer<T, D, S>, rhs: &Buffer<T, D, S>) -> Buffer<T, CPU, S> {
        // Output Buffer with the same length as lhs; (lhs, rhs) are passed along as its inputs.
        let mut out = self.retrieve(lhs.len(), (lhs, rhs));

        for ((lhs, rhs), out) in lhs.iter().zip(&*rhs).zip(&mut out) {
            *out = *lhs * *rhs;
        }

        out
    }
}
A lot more usage examples can be found in the tests and examples folder.
Re-exports
pub use devices::cpu::CPU;
pub use devices::opencl::OpenCL;
pub use devices::stack::Stack;
pub use devices::*;
pub use autograd::*;
Modules
- Provides tools for automatic differentiation.
- This module defines all available compute devices
- This module includes macros and functions for executing operations on the CPU. They move the supplied (CUDA, OpenCL, WGPU, …) Buffers to the CPU and execute the operation on the CPU. Most of the time, you should actually implement the operation for the device natively, as it is typically faster.
- Describes the type of allocation.
- Contains traits for generic math.
- Typical imports for using custos.
- Exposes an API for static devices. The usage is similar to PyTorch, as Buffers are moved to the GPU or another compute device via .to_gpu, .to_cl, …
Macros
- A macro that creates a CPU Buffer using the static CPU device.
- If the current device supports unified memory, data is not deep-copied. This is way faster than cpu_exec, as new memory is not allocated.
- If the current device supports unified memory, data is not deep-copied. This is way faster than cpu_exec_mut, as new memory is not allocated.
- Moves n Buffers stored on another device to n CPU Buffers and executes an operation on the CPU.
- Moves n Buffers stored on another device to n CPU Buffers and executes an operation on the CPU. The results are written back to the original Buffers.
- Shadows all supplied Buffers to CPU Buffers.
- Moves Buffers to CPU Buffers. The names of the new CPU Buffers are provided by the user. The new Buffers are declared as mutable.
- Takes Buffers having a host pointer and wraps them into CPU Buffers. The old Buffers are shadowed.
- Takes Buffers having a host pointer and wraps them into mutable CPU Buffers. New names for the CPU Buffers are provided by the user.
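The difference between the copying macros and the unified-memory variants described above can be pictured with a small, self-contained sketch. It does not use custos: `FakeDeviceBuffer`, `exec_on_cpu_mut` and `exec_unified_mut` are hypothetical names invented for illustration, and device transfers are only simulated with a counter.

```rust
/// Hypothetical stand-in for memory owned by a non-CPU device (not a custos type).
struct FakeDeviceBuffer {
    storage: Vec<f32>,
    /// Number of simulated device <-> host transfers.
    copies: usize,
}

impl FakeDeviceBuffer {
    fn new(data: &[f32]) -> Self {
        Self { storage: data.to_vec(), copies: 0 }
    }

    /// Simulates a device-to-host transfer into freshly allocated host memory.
    fn read(&mut self) -> Vec<f32> {
        self.copies += 1;
        self.storage.clone()
    }

    /// Simulates a host-to-device transfer.
    fn write(&mut self, host: &[f32]) {
        self.copies += 1;
        self.storage.copy_from_slice(host);
    }
}

/// Copying path: move the data to the host, run the operation, write the result back.
fn exec_on_cpu_mut(buf: &mut FakeDeviceBuffer, op: impl FnOnce(&mut [f32])) {
    let mut host = buf.read();
    op(&mut host);
    buf.write(&host);
}

/// Unified-memory path: the host can address the storage directly, so nothing is copied.
fn exec_unified_mut(buf: &mut FakeDeviceBuffer, op: impl FnOnce(&mut [f32])) {
    op(&mut buf.storage);
}

fn main() {
    let double = |xs: &mut [f32]| xs.iter_mut().for_each(|x| *x *= 2.0);

    let mut copied = FakeDeviceBuffer::new(&[1.0, 2.0, 3.0]);
    exec_on_cpu_mut(&mut copied, double);

    let mut unified = FakeDeviceBuffer::new(&[1.0, 2.0, 3.0]);
    exec_unified_mut(&mut unified, double);

    println!("copying path: {:?}, transfers: {}", copied.storage, copied.copies);
    println!("unified path: {:?}, transfers: {}", unified.storage, unified.copies);
}
```

Both paths produce the same values; only the copying path pays for two transfers and a temporary host allocation, which is the cost the unified-memory macros are described as avoiding.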
Structs
- The underlying non-growable array structure of custos. A Buffer may be encapsulated in other data structures. By default, the Buffer is an f32 CPU Buffer with no statically known shape.
- A CacheTrace is a list of nodes that shows which Buffers could use the same cache.
- Used to reset the cache count.
- The iterator used for setting the cache count.
- A 1D shape.
- A 2D shape.
- A 3D shape.
- Uses the global count as the next index for a Node.
- A graph of Nodes. It is typically built up during the forward process (calling device.retrieve(.., (lhs, rhs))).
- A node in the Graph.
- Uses the number of nodes in the graph as the next index for a Node.
- Makes it possible to use a single number in a Buffer.
- Resolves to either a mathematical expression as a string or a computed value. This is used to create generic kernels / operations over OpenCL, CUDA and CPU.
Enums
- ‘generic’ device errors that can occur on any device.
Constants
- If the OpenCL device selected by the environment variable CUSTOS_CL_DEVICE_IDX supports unified memory, then this will be true. In your case, this is false.
Traits
- Trait for adding a node to a graph.
- This trait is for allocating memory on the implemented device.
- Applies a function to a buffer and returns a new buffer.
- Converts ranges into a start and end index.
- Trait for implementing the clear() operation for the compute devices.
- This trait is used to clone a buffer based on a specific device type.
- A trait that allows combining math operations. (Similar to an Iterator)
- custos v5 compatibility for “common pointers”. The common pointers contain the following pointers: host, opencl and cuda.
- Trait for copying a slice of a buffer, to implement the slice() operation.
- This trait is the base trait for every device.
- All types of devices that can create Buffers.
- A trait for downcasting errors.
- Evaluates a combined (via Combiner) math operations chain to a value.
- Returns a mutable reference to the graph.
- If the Shape provides a fixed size, then this trait should be implemented. Forgot how this is useful.
- Devices that can access the main memory / RAM of the host.
- The shape may be 2D or ().
- If the autograd feature is enabled, then this will be implemented for all types that implement TapeReturn. On the other hand, if the autograd feature is disabled, no Tape will be returnable.
- If the no-std feature is disabled, this trait is implemented for all types that implement ToCLSource. In this case, no-std is disabled.
- Returns the next index for a Node.
- This trait is implemented for every pointer type.
- Trait for reading buffers. Synchronization point for CUDA.
- Used to shallow-copy a pointer. Use is discouraged.
- Determines the shape of a Buffer. Shape is used to get the size and ND-Array for a stack allocated Buffer.
- Evaluates a combined (via Combiner) math operations chain to a valid OpenCL C (and possibly CUDA) source string.
- Converts a pointer to a different Shape.
- Converts a &'static str to a Resolve.
- Converts a value to a Resolve.
- Applies the forward function of a new/cached Buffer and returns it. If the autograd feature is enabled, the gradient function is also calculated via the grad function.
- Writes the unary gradient (with chain rule) to the lhs_grad buffer.
- Trait for writing data to buffers.
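Several of the traits above describe one chain of math operations that can be evaluated to a value on the host or turned into a kernel source string. The following is a minimal sketch of that general idea only. It is not custos' implementation; the `Expr` enum and its `eval` and `to_source` methods are hypothetical names chosen for illustration.

```rust
/// Hypothetical expression tree over a single input `x` (not a custos type).
enum Expr {
    Var,
    Const(f64),
    Add(Box<Expr>, Box<Expr>),
    Mul(Box<Expr>, Box<Expr>),
    Sin(Box<Expr>),
}

impl Expr {
    /// Host path: compute the value of the chain for a concrete `x`.
    fn eval(&self, x: f64) -> f64 {
        match self {
            Expr::Var => x,
            Expr::Const(c) => *c,
            Expr::Add(a, b) => a.eval(x) + b.eval(x),
            Expr::Mul(a, b) => a.eval(x) * b.eval(x),
            Expr::Sin(a) => a.eval(x).sin(),
        }
    }

    /// Kernel path: render the same chain as a C-like source expression.
    fn to_source(&self) -> String {
        match self {
            Expr::Var => "x".to_string(),
            Expr::Const(c) => format!("{:?}", c),
            Expr::Add(a, b) => format!("({} + {})", a.to_source(), b.to_source()),
            Expr::Mul(a, b) => format!("({} * {})", a.to_source(), b.to_source()),
            Expr::Sin(a) => format!("sin({})", a.to_source()),
        }
    }
}

/// Builds sin(x * 2.0) + 1.0.
fn example() -> Expr {
    Expr::Add(
        Box::new(Expr::Sin(Box::new(Expr::Mul(
            Box::new(Expr::Var),
            Box::new(Expr::Const(2.0)),
        )))),
        Box::new(Expr::Const(1.0)),
    )
}

fn main() {
    let expr = example();
    println!("value at x = 0: {}", expr.eval(0.0));
    println!("source: {}", expr.to_source());
}
```

Writing the operation once and choosing the back end afterwards is what lets a single definition serve both a CPU loop and a generated GPU kernel.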
Functions
range resets the cache count in every iteration. The cache count is used to retrieve the same allocation in each iteration. Not adding range results in allocating new memory in each iteration, which is only freed when the device is dropped.
To disable this caching behaviour, the realloc feature can be enabled.
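The effect of resetting a cache count can be illustrated with a self-contained sketch. This is a simplified model under stated assumptions, not custos' cache: `AllocCache`, `retrieve`, `reset_count` and `allocations_after` are hypothetical names, and every request is assumed to be served from the slot indexed by the current count.

```rust
use std::collections::HashMap;

/// Hypothetical allocation cache keyed by a running count (not a custos type).
struct AllocCache {
    /// Index of the next allocation request.
    count: usize,
    /// Cached allocations, one per index.
    slots: HashMap<usize, Vec<f32>>,
    /// How many fresh allocations were made in total.
    allocations: usize,
}

impl AllocCache {
    fn new() -> Self {
        Self { count: 0, slots: HashMap::new(), allocations: 0 }
    }

    /// Returns the allocation stored at the current count, creating it if missing.
    fn retrieve(&mut self, len: usize) -> &mut Vec<f32> {
        let idx = self.count;
        self.count += 1;
        if !self.slots.contains_key(&idx) {
            self.allocations += 1;
            self.slots.insert(idx, vec![0.0; len]);
        }
        self.slots.get_mut(&idx).expect("slot exists after insertion")
    }

    /// Resetting the count makes the next requests reuse the first slots again.
    fn reset_count(&mut self) {
        self.count = 0;
    }
}

/// Runs `iterations` loop bodies that each request two buffers.
fn allocations_after(iterations: usize, reset_each_iteration: bool) -> usize {
    let mut cache = AllocCache::new();
    for _ in 0..iterations {
        if reset_each_iteration {
            cache.reset_count();
        }
        cache.retrieve(16)[0] = 1.0;
        cache.retrieve(16)[0] = 2.0;
    }
    cache.allocations
}

fn main() {
    println!("with reset: {} allocations", allocations_after(100, true));
    println!("without reset: {} allocations", allocations_after(100, false));
}
```

In this model, 100 iterations with a reset need only 2 allocations, while 100 iterations without one accumulate 200, all of which stay alive until the cache itself is dropped.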
Type Definitions
- A type alias for Box<dyn std::error::Error + Send + Sync>
- A type alias for Result<T, Error>.
Attribute Macros
- Expands a CPU implementation to a Stack and CPU implementation.