CUDA Async
CUDA Async lets programmers compose DAGs of CUDA operations and execute them asynchronously on multiple devices using any async Rust runtime (such as tokio).
The design consists of three key pieces:
- Device operations, which are composed using the DeviceOperation API.
- Scheduling, which is done via an implementation of the SchedulingPolicy trait. The schedule operation maps instances of DeviceOperation to DeviceFuture.
- Future submission/execution, which is carried out by awaiting on the DeviceFuture type within an async context.
Device Operations
The DeviceOperation<Output=T> trait exposes an API for composing device operations.
A given implementation of DeviceOperation can be converted into DeviceFuture, which implements Future<Output=T>.
The DeviceFuture type can either be spawned or awaited upon by any async runtime in Rust.
All functions in the api module construct implementations of DeviceOperation<Output=T>.
If you do this:
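The original example code is not preserved here; what follows is a minimal sketch of the composition described below, assuming a saxpy kernel, a Tensor<f32> type, and the arc/partition/apply operations named in the surrounding prose (exact paths and signatures are assumptions, not the crate's confirmed API):

```rust
// Sketch only: crate paths and signatures are assumed, not confirmed.
async fn run_saxpy(a: f32, x: Tensor<f32>, y: Tensor<f32>) {
    let x = x.arc();        // Arc<Tensor<f32>>: shared, read-only on device
    let y = y.partition();  // Partition<Tensor<f32>>: disjoint mutable tiles
    // apply composes the saxpy launch as a DeviceOperation; awaiting it
    // schedules the operation on the default device and executes it.
    let (_a, _x, _y) = apply(saxpy, (a, x, y)).await;
}
```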
The apply operation applies the user-defined kernel operation saxpy on the arguments (a, x, y).
The arc operation converts x into an Arc<Tensor<f32>>, and the partition operation converts
y into a Partition<Tensor<f32>>.
The kernel launcher passes Arc<Tensor<f32>> and Partition<Tensor<f32>> to tile kernels as &Tensor<f32> and partitioned sub-tensors &mut Tensor<f32>, respectively
(see Kernel Launch for details on how user-defined kernel arguments are safely prepared).
impl DeviceOperation objects on which await is called are implicitly converted into futures.
This implicit conversion schedules the operation to execute on the default global device,
and await submits the resulting future for execution, suspending the current task until execution is complete.
The unzip operation is the inverse of zip:
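The original example is not preserved; a sketch of how unzip might be used to keep only y, with names and signatures assumed from the surrounding prose:

```rust
// Sketch: unzip splits a tuple-valued operation into one operation per element.
async fn saxpy_keep_y(a: f32, x: Tensor<f32>, y: Tensor<f32>) {
    let op = apply(saxpy, (a, x.arc(), y.partition()));
    let (_a, _x, y) = op.unzip();
    // Awaiting only y executes saxpy and discards a and x.
    let y = y.await;
    // unpartition().arc() yields an Arc<Tensor<f32>> that multiple device
    // operations can then take as an argument in parallel.
    let _y = y.unpartition().arc();
}
```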
The above example discards a and x after executing saxpy by awaiting on just y.
unpartition().arc() can be used to convert a Partition<Tensor<f32>> into an Arc<Tensor<f32>>,
subsequently allowing y to be used as an argument to multiple device operations in parallel.
An Arc<Tensor<f32>> can be converted into a Tensor<f32> in the usual way, which
requires the reference count of the Arc<Tensor<f32>> to be 1.
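The "usual way" here is the standard library's Arc::try_unwrap, which succeeds only when there is exactly one strong reference. A small self-contained illustration, using a Vec<f32> in place of Tensor<f32>:

```rust
use std::sync::Arc;

fn main() {
    // One reference: try_unwrap recovers the owned value.
    let shared = Arc::new(vec![1.0f32, 2.0, 3.0]);
    let owned: Vec<f32> = Arc::try_unwrap(shared).expect("single owner");
    assert_eq!(owned, vec![1.0, 2.0, 3.0]);

    // Two references: try_unwrap fails and hands the Arc back.
    let a = Arc::new(42);
    let b = Arc::clone(&a);
    let a = Arc::try_unwrap(a).unwrap_err();
    drop(b);
    // The count is one again, so unwrapping now succeeds.
    assert_eq!(Arc::try_unwrap(a).unwrap(), 42);
}
```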
Scheduling
The DeviceFuture struct represents a scheduled device operation. A scheduled operation has resources assigned to it.
You can use the DeviceOperation::schedule method to schedule a device operation on a particular device:
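The original example is not preserved; a sketch of explicit scheduling as described below, where the device handle and policy lookup are assumptions:

```rust
// Sketch: schedule assigns resources using device 1's global policy;
// execution is deferred until the resulting DeviceFuture is awaited.
let z = apply(saxpy, (a, x.arc(), y.partition()));
let fut = z.schedule(device(1));   // DeviceFuture: resources assigned
let _out = fut.await;              // submit and execute
```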
The above invocation of schedule uses the global policy defined for device 1 to assign resources to z.
Actual execution is deferred until await is invoked.
Efficient Execution
Consider the following program:
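The program in question is not preserved; a sketch of the inefficient pattern, with one await (and hence one runtime synchronization) per operation (names assumed from earlier examples):

```rust
// Sketch: two dependent saxpy launches, each awaited separately.
let (_a, _x1, y) = apply(saxpy, (a, x1.arc(), y.partition())).await;
// The runtime synchronizes here...
let (_b, _x2, _y) = apply(saxpy, (b, x2.arc(), y)).await;
// ...and again here.
```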
The above implementation is correct but inefficient:
Whenever we invoke await on a DeviceOperation, we require synchronization with the async runtime.
We can instead submit a single future for execution, letting the scheduling policy order device operations
and synchronize with the async runtime once:
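The original code is not preserved; a sketch of the single-submission version, composing both launches into one operation DAG before awaiting (names and combinators assumed):

```rust
// Sketch: compose first, await once; the scheduling policy orders the
// operations, and the runtime is notified once via a host callback.
let op1 = apply(saxpy, (a, x1.arc(), y.partition()));
let (_a, _x1, y) = op1.unzip();
let op2 = apply(saxpy, (b, x2.arc(), y));
let (_b, _x2, _y) = op2.await;  // single synchronization point
```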
The default scheduling policy executes the above composition of operations on the same stream, which executes operations in order without synchronization overhead. The runtime is notified when the computations are complete via a CUDA host callback, which is placed on the stream after all operations are submitted to the stream.
A similar procedure takes place when join is called on a tuple of operations prior to await.
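For independent operations, the same single-synchronization pattern might look like the following; the tuple join form is assumed from the sentence above:

```rust
// Sketch: two independent launches joined into one future, awaited once.
let op1 = apply(saxpy, (a, x1.arc(), y1.partition()));
let op2 = apply(saxpy, (b, x2.arc(), y2.partition()));
let (_r1, _r2) = (op1, op2).join().await;
```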
Kernel Launch
Consider the following saxpy kernel:
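The kernel source is not preserved; a sketch of what a tile-level saxpy might look like, given the argument types described below (the Tensor indexing API is an assumption):

```rust
// Sketch: a tile kernel receives x read-only and a disjoint mutable
// sub-tensor of y, so concurrent tiles cannot alias.
fn saxpy(a: f32, x: &Tensor<f32>, y: &mut Tensor<f32>) {
    for i in 0..y.len() {
        y[i] = a * x[i] + y[i];
    }
}
```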
The kernel expects a reference to x, and a mutable reference to y.
We provide a reference to x by wrapping it in an Arc on the host-side.
The arc method can be called directly on a host-side tensor to obtain Arc<Tensor<T>>.
To provide safe mutable access to y, it must be partitioned into sub-tensors on the host-side.
The partition method can be called on a host-side tensor to obtain an impl Partition<Tensor<T>>.
When an impl Partition<Tensor<T>> is passed to a kernel function, it provides mutable access to a disjoint sub-tensor
in the tensor partition.
Rules:
- arc and partition can be called on an impl DeviceOperation<Output=Tensor<T>> or directly on a Tensor<T>.
- If you have a Partition<Tensor<T>>, you can unwrap it into Tensor<T> or call arc on it to obtain an Arc<Tensor<T>>.
- If you have an Arc<Tensor<T>>, you can only unwrap it into Tensor<T> and partition it if there is exactly one reference to it.
For example, to partition a 16x16 matrix into 4x4 sub-matrices and pass it to the given saxpy function, we do:
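The example code is not preserved; a sketch of the 16x16-into-4x4 partition described above, where the constructor and partition-shape arguments are assumptions:

```rust
// Sketch: partition y into 4x4 sub-matrices; each tile kernel invocation
// receives one sub-matrix as &mut Tensor<f32>.
let x = x.arc();              // x: Tensor<f32> of shape [16, 16]
let y = y.partition([4, 4]);  // 4x4 tiles of the 16x16 matrix
let (_a, _x, _y) = apply(saxpy, (a, x, y)).await;
```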
For an in-depth example, check out the data-parallel MLP example here.