A minimal OpenCL, WGPU, CUDA and host CPU array manipulation engine / framework written in Rust.
This crate provides the tools for executing custom array operations with the CPU, as well as with CUDA, WGPU and OpenCL devices.
This guide demonstrates how operations can be implemented for the compute devices: implement_operations.md
To see this at a larger scale, look at custos-math or sliced.
Examples
custos only implements four Buffer operations: write, read, copy_slice and clear.
There are also unary (device only) operations.
On the other hand, custos-math implements a lot more operations, including Matrix operations for a custom Matrix struct.
Implement an operation for CPU:
If you want to implement your own operations for all compute devices, consider looking here: implement_operations.md
use std::ops::Mul;
use custos::prelude::*;

pub trait MulBuf<T, S: Shape = (), D: Device = Self>: Sized + Device {
    fn mul(&self, lhs: &Buffer<T, D, S>, rhs: &Buffer<T, D, S>) -> Buffer<T, Self, S>;
}

impl<T, S, D> MulBuf<T, S, D> for CPU
where
    T: Mul<Output = T> + Copy,
    S: Shape,
    D: MainMemory,
{
    fn mul(&self, lhs: &Buffer<T, D, S>, rhs: &Buffer<T, D, S>) -> Buffer<T, CPU, S> {
        let mut out = self.retrieve(lhs.len(), (lhs, rhs));

        for ((lhs, rhs), out) in lhs.iter().zip(&*rhs).zip(&mut out) {
            *out = *lhs * *rhs;
        }

        out
    }
}

A lot more usage examples can be found in the tests and examples folder.
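The pattern above (one trait per operation, implemented once per device) can be tried out without any compute backend. The following is a minimal std-only sketch, not custos code: `ToyCpu`, `ToyBuf`, `MulToyBuf` and `square` are hypothetical names invented for illustration, and the real `Buffer` and `CPU` types carry more type parameters and caching logic than shown here.

```rust
/// Hypothetical stand-in for a compute device; not part of custos.
#[derive(Debug, Default)]
pub struct ToyCpu;

/// Hypothetical stand-in for `Buffer`: simply owns its data on the host.
#[derive(Debug, Clone, PartialEq)]
pub struct ToyBuf<T> {
    data: Vec<T>,
}

impl<T> ToyBuf<T> {
    pub fn new(data: Vec<T>) -> Self {
        Self { data }
    }

    pub fn as_slice(&self) -> &[T] {
        &self.data
    }
}

/// One trait per operation; each device provides its own implementation.
pub trait MulToyBuf<T> {
    fn mul(&self, lhs: &ToyBuf<T>, rhs: &ToyBuf<T>) -> ToyBuf<T>;
}

impl<T: std::ops::Mul<Output = T> + Copy> MulToyBuf<T> for ToyCpu {
    fn mul(&self, lhs: &ToyBuf<T>, rhs: &ToyBuf<T>) -> ToyBuf<T> {
        assert_eq!(lhs.data.len(), rhs.data.len(), "length mismatch");
        // Element-wise product, collected into a fresh host allocation.
        let data = lhs.data.iter().zip(&rhs.data).map(|(&l, &r)| l * r).collect();
        ToyBuf::new(data)
    }
}

/// Code written against the trait works for every device that implements it.
pub fn square<T, D: MulToyBuf<T>>(device: &D, x: &ToyBuf<T>) -> ToyBuf<T> {
    device.mul(x, x)
}

fn main() {
    let device = ToyCpu;
    let lhs = ToyBuf::new(vec![1, 2, 3]);
    let rhs = ToyBuf::new(vec![4, 5, 6]);

    let out = device.mul(&lhs, &rhs);
    println!("{:?}", out.as_slice());
    println!("{:?}", square(&device, &lhs).as_slice());
}
```

Functions such as `square` only name the operation trait in their bounds, so adding another device later would not require touching them.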
Re-exports
pub use devices::cpu::CPU;
pub use devices::opencl::OpenCL;
pub use devices::stack::Stack;
pub use devices::*;
pub use autograd::*;
Modules
- Provides tools for automatic differentiation.
- This module defines all available compute devices
- This module includes macros and functions for executing operations on the CPU. They move the supplied (CUDA, OpenCL, WGPU, …) Buffers to the CPU and execute the operation on the CPU. Most of the time, you should actually implement the operation for the device natively, as it is typically faster.
- Describes the type of allocation.
- Contains traits for generic math.
- Typical imports for using custos.
- Exposes an API for static devices. The usage is similar to PyTorch, as Buffers are moved to the GPU or another compute device via .to_gpu, .to_cl, …
Macros
- A macro that creates a CPU Buffer using the static CPU device.
- If the current device supports unified memory, data is not deep-copied. This is way faster than cpu_exec, as new memory is not allocated.
- If the current device supports unified memory, data is not deep-copied. This is way faster than cpu_exec_mut, as new memory is not allocated.
- Moves n Buffers stored on another device to n CPU Buffers and executes an operation on the CPU.
- Moves n Buffers stored on another device to n CPU Buffers and executes an operation on the CPU. The results are written back to the original Buffers.
- Shadows all supplied Buffers to CPU Buffers.
- Moves Buffers to CPU Buffers. The names of the new CPU Buffers are provided by the user. The new Buffers are declared as mutable.
- Takes Buffers having a host pointer and wraps them into CPU Buffers. The old Buffers are shadowed.
- Takes Buffers having a host pointer and wraps them into mutable CPU Buffers. New names for the CPU Buffers are provided by the user.
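The "move to CPU, execute, write back" fallback described by these macros can be sketched in plain Rust. This is an illustration under stated assumptions, not the macros' actual expansion: `DeviceBuf`, `exec_on_cpu` and `exec_on_cpu_mut` are hypothetical names, and a `Vec` stands in for device memory.

```rust
/// Hypothetical device-side buffer; a `Vec` stands in for device memory.
pub struct DeviceBuf {
    device_mem: Vec<f32>,
}

impl DeviceBuf {
    pub fn from_host(data: &[f32]) -> Self {
        Self { device_mem: data.to_vec() }
    }

    /// Deep-copies the device memory into a new host allocation.
    pub fn read_to_host(&self) -> Vec<f32> {
        self.device_mem.clone()
    }

    /// Copies host data back into device memory (lengths must match).
    pub fn write_from_host(&mut self, host: &[f32]) {
        self.device_mem.copy_from_slice(host);
    }
}

/// Moves the buffer to the host, runs `op` there and returns a new device buffer.
pub fn exec_on_cpu(buf: &DeviceBuf, op: impl Fn(&[f32]) -> Vec<f32>) -> DeviceBuf {
    let host = buf.read_to_host();
    DeviceBuf::from_host(&op(&host))
}

/// Runs `op` on a host copy, then writes the result back to the original buffer.
pub fn exec_on_cpu_mut(buf: &mut DeviceBuf, op: impl Fn(&mut [f32])) {
    let mut host = buf.read_to_host();
    op(&mut host);
    buf.write_from_host(&host);
}

fn main() {
    let mut buf = DeviceBuf::from_host(&[1.0, 2.0, 3.0]);

    let doubled = exec_on_cpu(&buf, |x| x.iter().map(|v| v * 2.0).collect());
    println!("{:?}", doubled.read_to_host());

    exec_on_cpu_mut(&mut buf, |x| x.iter_mut().for_each(|v| *v += 1.0));
    println!("{:?}", buf.read_to_host());
}
```

Every call here pays for a host allocation and a copy in each direction. The unified-memory variants described above avoid exactly that cost, because the host pointer of the existing Buffer is wrapped instead of copied.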
Structs
- The underlying non-growable array structure of custos. A Buffer may be encapsulated in other data structures. By default, the Buffer is an f32 CPU Buffer with no statically known shape.
- A CacheTrace is a list of nodes that shows which Buffers could use the same cache.
- Used to reset the cache count.
- The iterator used for setting the cache count.
- A 1D shape.
- A 2D shape.
- A 3D shape.
- Uses the global count as the next index for a Node.
- A graph of Nodes. It is typically built up during the forward process (calling device.retrieve(.., (lhs, rhs))).
- A node in the Graph.
- Uses the number of nodes in the graph as the next index for a Node.
- Makes it possible to use a single number in a Buffer.
- Resolves to either a mathematical expression as a string or a computed value. This is used to create generic kernels / operations over OpenCL, CUDA and CPU.
Enums
- ‘generic’ device errors that can occur on any device.
Constants
- If the OpenCL device selected by the environment variable CUSTOS_CL_DEVICE_IDX supports unified memory, then this will be true. In your case, this is false.
Traits
- Trait for adding a node to a graph.
- This trait is for allocating memory on the implemented device.
- Applies a function to a buffer and returns a new buffer.
- Converts ranges into a start and end index.
- Trait for implementing the clear() operation for the compute devices.
- This trait is used to clone a buffer based on a specific device type.
- A trait that allows combining math operations. (Similar to an Iterator)
- custos v5 compatibility for “common pointers”. The common pointers contain the following pointers: host, OpenCL and CUDA.
- Trait for copying a slice of a buffer, to implement the slice() operation.
- This trait is the base trait for every device.
- All types of devices that can create Buffers.
- A trait for downcasting errors.
- Evaluates a combined (via Combiner) math operations chain to a value.
- Returns a mutable reference to the graph.
- If the Shape provides a fixed size, then this trait should be implemented. Forgot how this is useful.
- Devices that can access the main memory / RAM of the host.
- The shape may be 2D or ().
- If the autograd feature is enabled, then this will be implemented for all types that implement TapeReturn. On the other hand, if the autograd feature is disabled, no Tape will be returnable.
- If the no-std feature is disabled, this trait is implemented for all types that implement ToCLSource. In this case, no-std is disabled.
- Returns the next index for a Node.
- This trait is implemented for every pointer type.
- Trait for reading buffers. Synchronization point for CUDA.
- Used to shallow-copy a pointer. Use is discouraged.
- Determines the shape of a Buffer. Shape is used to get the size and ND-Array for a stack allocated Buffer.
- Evaluates a combined (via Combiner) math operations chain to a valid OpenCL C (and possibly CUDA) source string.
- Converts a pointer to a different Shape.
- Converts a &'static str to a Resolve.
- Converts a value to a Resolve.
- Applies the forward function of a new/cached Buffer and returns it. If the autograd feature is enabled, the gradient function is also calculated via the grad function.
- Writes the unary gradient (with chain rule) to the lhs_grad buffer.
- Trait for writing data to buffers.
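Several of the traits above describe a chain of math operations that evaluates either to a value or to a kernel source string. A minimal std-only sketch of that idea follows. It is not the crate's API: `Combine`, `Expr`, `scale_shift` and `kernel_source` are hypothetical names, and the kernel text is only an assumed shape for an OpenCL C kernel.

```rust
/// Hypothetical combinator trait: one chain of calls, two interpretations.
pub trait Combine: Sized {
    fn plus(self, rhs: f32) -> Self;
    fn times(self, rhs: f32) -> Self;
}

/// Interpretation 1: compute the value directly on the host.
impl Combine for f32 {
    fn plus(self, rhs: f32) -> f32 {
        self + rhs
    }

    fn times(self, rhs: f32) -> f32 {
        self * rhs
    }
}

/// Interpretation 2: keep the expression as source text for a kernel.
#[derive(Debug, Clone, PartialEq)]
pub struct Expr(pub String);

impl Combine for Expr {
    fn plus(self, rhs: f32) -> Expr {
        Expr(format!("({} + {:?})", self.0, rhs))
    }

    fn times(self, rhs: f32) -> Expr {
        Expr(format!("({} * {:?})", self.0, rhs))
    }
}

/// A unary operation written once, generic over both interpretations.
pub fn scale_shift<C: Combine>(x: C) -> C {
    x.times(2.0).plus(1.0)
}

/// Embeds the expression into an (assumed) element-wise kernel template.
pub fn kernel_source(body: &Expr) -> String {
    format!(
        "__kernel void apply(__global float* x) {{ size_t i = get_global_id(0); x[i] = {}; }}",
        body.0
    )
}

fn main() {
    // Evaluate on the CPU.
    println!("{}", scale_shift(3.0f32));

    // Generate source text from the very same function.
    let body = scale_shift(Expr("x[i]".to_string()));
    println!("{}", kernel_source(&body));
}
```

Writing the operation once and choosing the interpretation through the argument type is what lets a single closure serve a CPU loop and a generated kernel alike.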
Functions
- range resets the cache count in every iteration. The cache count is used to retrieve the same allocation in each iteration. Not adding range results in allocating new memory in each iteration, which is only freed when the device is dropped.
To disable this caching behaviour, the realloc feature can be enabled.
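The effect of resetting the cache count can be modelled with a small std-only sketch. This is a simplified model and not the custos cache: `CountCache` and `run` are hypothetical names, and the model assumes every iteration requests buffers in the same order and with the same lengths.

```rust
/// Hypothetical allocation cache keyed by a running count; not the custos implementation.
#[derive(Default)]
pub struct CountCache {
    count: usize,
    slots: Vec<Vec<f32>>,
}

impl CountCache {
    /// Returns the slot index for the current count and allocates only if
    /// no slot exists for it yet. Assumes `len` matches on repeated use.
    pub fn retrieve(&mut self, len: usize) -> usize {
        let idx = self.count;
        if idx == self.slots.len() {
            self.slots.push(vec![0.0; len]);
        }
        self.count += 1;
        idx
    }

    /// Models what the range iterator does at the start of each iteration.
    pub fn reset_count(&mut self) {
        self.count = 0;
    }

    pub fn allocations(&self) -> usize {
        self.slots.len()
    }
}

/// Runs `iters` iterations that each retrieve two buffers and
/// returns the total number of allocations made.
pub fn run(iters: usize, reset_each_iter: bool) -> usize {
    let mut cache = CountCache::default();
    for _ in 0..iters {
        if reset_each_iter {
            cache.reset_count();
        }
        let _a = cache.retrieve(4);
        let _b = cache.retrieve(4);
    }
    cache.allocations()
}

fn main() {
    println!("with reset: {} allocations", run(10, true));
    println!("without reset: {} allocations", run(10, false));
}
```

In this model, ten iterations with a reset reuse the same two slots, while ten iterations without a reset keep twenty allocations alive until the cache itself is dropped.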
Type Definitions
- A type alias for Box<dyn std::error::Error + Send + Sync>
- A type alias for Result<T, Error>.
Attribute Macros
- Expands a CPU implementation to a Stack and CPU implementation.