/*
* SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
* SPDX-License-Identifier: Apache-2.0
*/
//! # cuTile Rust
//!
//! A high-performance GPU computing library that lets you write Rust code compiled to CUDA
//! kernels. cuTile Rust combines the safety and ergonomics of Rust with the performance of
//! hand-tuned GPU code.
//!
//! ## Overview
//!
//! cuTile Rust provides a complete stack for GPU programming in Rust:
//!
//! - **Core types**: GPU tensors, tiles, and partitions for structured data access
//! - **Async execution**: Modern async/await syntax for GPU operations with automatic scheduling
//! - **Kernel compilation**: Rust → MLIR → PTX/CUBIN compilation pipeline with caching
//! - **High-level API**: Familiar NumPy-like operations for tensor creation and manipulation
//! - **Built-in kernels**: Optimized implementations of common operations (GEMM, matrix-vector, etc.)
//!
//! ## Quick Start
//!
//! ### Creating and manipulating tensors
//!
//! ```rust,ignore
//! use cutile::api;
//!
//! // Create tensors on GPU
//! let x = api::ones::<f32>([1024]).await;
//! let y = api::zeros::<f32>([1024]).await;
//! let z = api::arange::<f32>(256).await;
//!
//! // All operations are async and execute on GPU
//! ```
//!
//! ### Writing custom GPU kernels
//!
//! ```rust,ignore
//! use cutile::prelude::*;
//!
//! #[cutile::module]
//! mod kernels {
//!     use cutile::core::*;
//!
//!     #[cutile::entry]
//!     fn vector_add<T: ElementType, const N: i32>(
//!         z: &mut Tensor<T, {[N]}>,
//!         x: &Tensor<T, {[-1]}>,
//!         y: &Tensor<T, {[-1]}>,
//!     ) {
//!         let tile_x = load_tile_like_1d(x, z);
//!         let tile_y = load_tile_like_1d(y, z);
//!         z.store(tile_x + tile_y);
//!     }
//! }
//!
//! // Use the kernel with partitioned tensors
//! let x = api::ones([256]).partition([64]);
//! let y = api::ones([256]).partition([64]);
//! let z = api::zeros([256]).partition([64]);
//!
//! let result = zip!(z, x, y)
//!     .apply(kernels::vector_add_apply)
//!     .generics(vec!["f32".to_string(), "64".to_string()])
//!     .await; // Automatically launches with grid (4, 1, 1)
//! ```
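//!
//! The grid in the comment above follows from the partition: 256 elements split into
//! 64-element tiles yields 4 blocks. A minimal sketch of that arithmetic in plain Rust
//! (`grid_dim_1d` is a hypothetical helper, not part of the cutile API):
//!
//! ```rust
//! /// Number of blocks needed to cover `total` elements with `tile`-sized tiles.
//! fn grid_dim_1d(total: usize, tile: usize) -> usize {
//!     // Round up so a partial trailing tile still gets a block.
//!     total.div_ceil(tile)
//! }
//!
//! assert_eq!(grid_dim_1d(256, 64), 4); // grid (4, 1, 1)
//! ```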
//!
//! ### Async GPU pipelines
//!
//! ```rust,ignore
//! use cutile::api;
//!
//! async fn compute_pipeline() -> Tensor<f32> {
//!     // All operations execute asynchronously on GPU
//!     let x = api::randn(0.0, 1.0, [1024, 1024]).await;
//!     let y = api::randn(0.0, 1.0, [1024, 1024]).await;
//!
//!     // Use built-in kernels
//!     use cutile::kernels::linalg::gemm_apply;
//!
//!     let x_part = x.partition([128, 128]);
//!     let y_part = y.partition([128, 128]);
//!     let z = api::zeros([1024, 1024]).partition([128, 128]);
//!
//!     zip!(z, x_part, y_part)
//!         .apply(gemm_apply)
//!         .generics(vec!["128".to_string(), "128".to_string(),
//!                        "32".to_string(), "1024".to_string()])
//!         .unpartition()
//!         .await
//! }
//! ```
//!
//! ## Module Organization
//!
//! - [`core`] - Core GPU types: `Tile`, `Tensor`, `Partition`, and tile operations
//! - [`api`] - High-level tensor creation and manipulation functions
//! - [`tensor`] - GPU tensor type, partitioning, and memory management
//! - [`tile_async`] - Async execution primitives and kernel compilation infrastructure
//! - [`kernels`] - Pre-built optimized kernels (GEMM, creation ops, etc.)
//!
//! ## Key Concepts
//!
//! ### Tensors and Partitions
//!
//! A [`Tensor`](tensor::Tensor) represents data on the GPU. It can be partitioned into tiles
//! that map to CUDA thread blocks:
//!
//! ```rust,ignore
//! // Create a tensor and partition it into 64-element tiles
//! let x = api::ones([256]).partition([64]); // Creates 4 partitions
//! ```
//!
//! ### Tiles
//!
//! Within kernels, you work with [`Tile`](core::Tile) values that represent data in registers
//! or shared memory. Tiles support efficient element-wise operations and broadcasting:
//!
//! ```rust,ignore
//! let tile_a = load_tile_mut(tensor_a);
//! let tile_b = load_tile_mut(tensor_b);
//! let result = tile_a * tile_b + tile_a; // All operations on GPU registers
//! ```
//!
//! ### Async Execution
//!
//! All GPU operations return futures that can be awaited. The runtime automatically manages
//! dependencies, kernel launches, and memory transfers:
//!
//! ```rust,ignore
//! // Operations compose naturally with async/await
//! let x = api::zeros([1024]).await;
//! let y = my_kernel(x).await;
//! let z = another_kernel(y).await;
//! ```
//!
//! ## Feature Flags
//!
//! This crate currently has no optional features.
//!
//! ## Safety
//!
//! cuTile Rust uses unsafe code internally for FFI with CUDA and for performance-critical operations.
//! The public API is designed to be safe, with compile-time guarantees about memory access patterns
//! through the type system (e.g., partition shapes must divide tensor shapes evenly).
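//!
//! For illustration, the divisibility rule amounts to a check like the following
//! (a plain-Rust sketch, not the actual cutile implementation):
//!
//! ```rust
//! /// Returns true if every partition extent evenly divides the matching tensor extent.
//! fn divides_evenly(tensor_shape: &[usize], partition_shape: &[usize]) -> bool {
//!     tensor_shape.len() == partition_shape.len()
//!         && tensor_shape
//!             .iter()
//!             .zip(partition_shape)
//!             .all(|(t, p)| *p > 0 && t % p == 0)
//! }
//!
//! assert!(divides_evenly(&[256], &[64]));   // 256 / 64 = 4 partitions
//! assert!(!divides_evenly(&[256], &[60]));  // 60 does not divide 256
//! ```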
//!
//! ## Examples
//!
//! See the `cutile-examples` crate for more comprehensive examples including:
//! - Matrix multiplication (GEMM)
//! - Async execution patterns
//! - Custom kernel development
//!
//! ## Performance
//!
//! cuTile Rust kernels can achieve performance competitive with hand-written CUDA:
//! - Zero-cost abstractions: Rust compiles through MLIR to optimized PTX
//! - Compile-time specialization: Tile shapes and types are compile-time constants
//! - Kernel caching: Compiled kernels are cached per-device for reuse
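//!
//! The per-device cache can be pictured as a map keyed by device and kernel identity.
//! This is a simplified sketch; the real cache layout in cutile may differ:
//!
//! ```rust
//! use std::collections::HashMap;
//!
//! /// Key: (device ordinal, kernel name, generic arguments as strings).
//! type KernelKey = (u32, String, Vec<String>);
//!
//! /// Look up a compiled kernel (e.g. CUBIN bytes), compiling and caching on a miss.
//! fn get_or_compile<'a>(
//!     cache: &'a mut HashMap<KernelKey, Vec<u8>>,
//!     key: KernelKey,
//!     compile: impl FnOnce() -> Vec<u8>,
//! ) -> &'a [u8] {
//!     cache.entry(key).or_insert_with(compile)
//! }
//!
//! let mut cache = HashMap::new();
//! let bytes = get_or_compile(&mut cache, (0, "gemm".into(), vec![]), || vec![0x7f]);
//! assert_eq!(bytes, &[0x7f]);
//! ```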
//!
//! ## Learning Resources
//!
//! For comprehensive guides and tutorials, see the [cuTile Rust Book](https://nihalpasham.github.io/cutile-book/):
//! - **User Guide** - Complete walkthrough from hello world to advanced kernels
//! - **Async Execution Deep Dive** - Understanding the async execution model
//! - **Architecture & Design** - How cutile works under the hood
pub use core;
pub use cuda_async;
pub use cutile_compiler;
pub use module;
pub use half;
pub use num_traits;
pub use cuda_core;
// TODO (hme): Coordinate with Candle about our dependence on this.
pub use candle_core;