fil_rustacuda/lib.rs
//! This crate provides a safe, user-friendly wrapper around the CUDA Driver API.
//!
//! # CUDA Terminology:
//!
//! ## Devices and Hosts:
//!
//! This crate and its documentation use the terms "device" and "host" frequently, so it's worth
//! explaining them in more detail. A device refers to a CUDA-capable GPU or similar device and its
//! associated external memory space. The host is the CPU and its associated memory space. Data
//! must be transferred from host memory to device memory before the device can use it for
//! computations, and the results must then be transferred back to host memory.
//!
//! ## Contexts, Modules, Streams and Functions:
//!
//! A CUDA context is akin to a process on the host - it contains all of the state for working with
//! a device, all memory allocations, etc. Each context is associated with a single device.
//!
//! A Module is similar to a shared-object library - it is a piece of compiled code which exports
//! functions and global values. Functions can be loaded from modules and launched on a device as
//! one might load a function from a shared-object file and call it. Functions are also known as
//! kernels and the two terms will be used interchangeably.
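//!
//! For instance, loading a module and looking up a kernel by name might look like this (a
//! sketch; the `add.ptx` path and the `sum` function name are placeholders):
//!
//! ```no_run
//! # use rustacuda::prelude::*;
//! # use std::ffi::CString;
//! # fn main() -> Result<(), Box<dyn std::error::Error>> {
//! # let _context = rustacuda::quick_init()?;
//! // Load a compiled PTX module from a file, then fetch one of its kernels by name.
//! let module = Module::load_from_file(&CString::new("./resources/add.ptx")?)?;
//! let _sum = module.get_function(&CString::new("sum")?)?;
//! # Ok(())
//! # }
//! ```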
//!
//! A Stream is akin to a thread - asynchronous work such as kernel execution can be queued into a
//! stream. Work within a single stream will execute sequentially in the order that it was
//! submitted, and may interleave with work from other streams.
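//!
//! For example (a minimal sketch), two independent streams whose work may overlap on the device:
//!
//! ```no_run
//! # use rustacuda::prelude::*;
//! # fn main() -> Result<(), Box<dyn std::error::Error>> {
//! # let _context = rustacuda::quick_init()?;
//! let stream_a = Stream::new(StreamFlags::NON_BLOCKING, None)?;
//! let stream_b = Stream::new(StreamFlags::NON_BLOCKING, None)?;
//! // ...queue kernel launches or async copies into each stream here...
//! // Block the host until all work queued in each stream has completed.
//! stream_a.synchronize()?;
//! stream_b.synchronize()?;
//! # Ok(())
//! # }
//! ```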
//!
//! ## Grids, Blocks and Threads:
//!
//! CUDA devices typically execute kernel functions on many threads in parallel. These threads can
//! be grouped into thread blocks, which share an area of fast hardware memory known as shared
//! memory. Thread blocks can be one-, two-, or three-dimensional, which is helpful when working
//! with multi-dimensional data such as images. Thread blocks are then grouped into grids, which
//! can also be one-, two-, or three-dimensional.
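//!
//! The `GridSize` and `BlockSize` types in the `function` module describe these dimensions.
//! For example (a sketch with arbitrary sizes), a 2D configuration covering a hypothetical
//! 1024x768 image with 8x8-thread blocks:
//!
//! ```
//! use rustacuda::function::{BlockSize, GridSize};
//!
//! let block = BlockSize::xy(8, 8);
//! // Round up so that every pixel is covered even when the image size is not a multiple of 8.
//! let grid = GridSize::xy((1024 + 7) / 8, (768 + 7) / 8);
//! ```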
//!
//! CUDA devices often contain multiple separate processors. Each processor is capable of executing
//! many threads simultaneously, but they must be from the same thread block. Thus, it is important
//! to ensure that the grid size is large enough to provide work for all processors. On the other
//! hand, if the thread blocks are too small, each processor will be under-utilized and the
//! code will be unable to make effective use of shared memory.
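//!
//! A common rule of thumb (a sketch, with an arbitrary block size) is to fix the block size and
//! derive the grid size from the input length with a round-up division:
//!
//! ```
//! let len: u32 = 1_000_000; // number of data elements to process
//! let block_size: u32 = 128; // threads per block
//! // Round up so that all len elements are covered even when len % block_size != 0.
//! let grid_size = (len + block_size - 1) / block_size;
//! assert_eq!(grid_size, 7813);
//! ```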
//!
//! # Usage:
//!
//! Before using RustaCUDA, you must install the CUDA development libraries for your system. Version
//! 8.0 or newer is required. You must also have a CUDA-capable GPU installed with the appropriate
//! drivers.
//!
//! Add the following to your `Cargo.toml`:
//!
//! ```text
//! [dependencies]
//! rustacuda = "0.1"
//! rustacuda_derive = "0.1"
//! rustacuda_core = "0.1"
//! ```
//!
//! And this to your crate root:
//!
//! ```text
//! #[macro_use]
//! extern crate rustacuda;
//!
//! #[macro_use]
//! extern crate rustacuda_derive;
//!
//! extern crate rustacuda_core;
//! ```
//!
//! Finally, set the `CUDA_LIBRARY_PATH` environment variable to the location of your CUDA libraries.
//! For example, on Windows (MINGW):
//!
//! ```text
//! export CUDA_LIBRARY_PATH="C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v9.1\lib\x64"
//! ```
//!
//! # Examples
//!
//! ## Adding two numbers on the device:
//!
//! First, download the `resources/add.ptx` file from the RustaCUDA repository and place it in
//! the resources directory for your application.
//!
//! ```
//! #[macro_use]
//! extern crate rustacuda;
//! extern crate rustacuda_core;
//!
//! use rustacuda::prelude::*;
//! use rustacuda::memory::DeviceBox;
//! use std::error::Error;
//! use std::ffi::CString;
//!
//! fn main() -> Result<(), Box<dyn Error>> {
//!     // Initialize the CUDA API
//!     rustacuda::init(CudaFlags::empty())?;
//!
//!     // Get the first device
//!     let device = Device::get_device(0)?;
//!
//!     // Create a context associated with this device
//!     let context = Context::create_and_push(
//!         ContextFlags::MAP_HOST | ContextFlags::SCHED_AUTO, device)?;
//!
//!     // Load the module containing the function we want to call
//!     let module_data = CString::new(include_str!("../resources/add.ptx"))?;
//!     let module = Module::load_from_string(&module_data)?;
//!
//!     // Create a stream to submit work to
//!     let stream = Stream::new(StreamFlags::NON_BLOCKING, None)?;
//!
//!     // Allocate space on the device and copy numbers to it.
//!     let mut x = DeviceBox::new(&10.0f32)?;
//!     let mut y = DeviceBox::new(&20.0f32)?;
//!     let mut result = DeviceBox::new(&0.0f32)?;
//!
//!     // Launching kernels is unsafe since Rust can't enforce safety - think of kernel launches
//!     // as a foreign-function call. In this case, it is - this kernel is written in CUDA C.
//!     unsafe {
//!         // Launch the `sum` function with one block containing one thread on the given stream.
//!         launch!(module.sum<<<1, 1, 0, stream>>>(
//!             x.as_device_ptr(),
//!             y.as_device_ptr(),
//!             result.as_device_ptr(),
//!             1 // Length
//!         ))?;
//!     }
//!
//!     // The kernel launch is asynchronous, so we wait for the kernel to finish executing
//!     stream.synchronize()?;
//!
//!     // Copy the result back to the host
//!     let mut result_host = 0.0f32;
//!     result.copy_to(&mut result_host)?;
//!
//!     println!("Sum is {}", result_host);
//!     # assert_eq!(30, result_host as u32);
//!
//!     Ok(())
//! }
//! ```

#![warn(
    missing_docs,
    missing_debug_implementations,
    unused_import_braces,
    unused_results,
    unused_qualifications
)]
// TODO: Add the missing_doc_code_examples warning, switch these to Deny later.

// Allow clippy lints
#![allow(unknown_lints, clippy::new_ret_no_self)]

#[macro_use]
extern crate bitflags;
//extern crate cuda_sys;
extern crate rustacuda_core;

#[allow(unused_imports, clippy::useless_attribute)]
#[macro_use]
extern crate rustacuda_derive;
#[doc(hidden)]
pub use rustacuda_derive::*;

pub mod context;
pub mod device;
pub mod error;
pub mod event;
pub mod function;
pub mod memory;
pub mod module;
pub mod prelude;
pub mod stream;

mod derive_compile_fail;

use crate::context::{Context, ContextFlags};
use crate::device::Device;
use crate::error::{CudaResult, ToResult};
use cuda_driver_sys::{cuDriverGetVersion, cuInit};

bitflags! {
    /// Bit flags for initializing the CUDA driver. Currently, no flags are defined,
    /// so `CudaFlags::empty()` is the only valid value.
    pub struct CudaFlags: u32 {
        // We need to give bitflags at least one constant.
        #[doc(hidden)]
        const _ZERO = 0;
    }
}

/// Initialize the CUDA Driver API.
///
/// This must be called before any other RustaCUDA (or CUDA) function is called. Typically, this
/// should be at the start of your program. All other functions will fail unless the API is
/// initialized first.
///
/// The `flags` parameter is used to configure the CUDA API. Currently no flags are defined, so
/// it must be `CudaFlags::empty()`.
pub fn init(flags: CudaFlags) -> CudaResult<()> {
    unsafe { cuInit(flags.bits()).to_result() }
}

/// Shortcut for initializing the CUDA Driver API and creating a CUDA context with default settings
/// for the first device.
///
/// This is useful for testing or just setting up a basic CUDA context quickly. Users with more
/// complex needs (multiple devices, custom flags, etc.) should use `init` and create their own
/// context.
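///
/// # Example
///
/// A sketch of typical usage - keep the returned context alive for as long as you use the API:
///
/// ```no_run
/// # fn main() -> Result<(), rustacuda::error::CudaError> {
/// let _context = rustacuda::quick_init()?;
/// // Use RustaCUDA here. The context is destroyed when `_context` is dropped.
/// # Ok(())
/// # }
/// ```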
pub fn quick_init() -> CudaResult<Context> {
    init(CudaFlags::empty())?;
    let device = Device::get_device(0)?;
    Context::create_and_push(ContextFlags::MAP_HOST | ContextFlags::SCHED_AUTO, device)
}

/// Struct representing the CUDA API version number.
#[derive(Debug, Hash, Eq, PartialEq, Ord, PartialOrd, Copy, Clone)]
pub struct CudaApiVersion {
    version: i32,
}

impl CudaApiVersion {
    /// Returns the latest CUDA version supported by the CUDA driver.
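    ///
    /// # Example
    ///
    /// ```no_run
    /// # fn main() -> Result<(), rustacuda::error::CudaError> {
    /// # rustacuda::init(rustacuda::CudaFlags::empty())?;
    /// let version = rustacuda::CudaApiVersion::get()?;
    /// println!("CUDA driver version: {}.{}", version.major(), version.minor());
    /// # Ok(())
    /// # }
    /// ```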
    pub fn get() -> CudaResult<CudaApiVersion> {
        unsafe {
            let mut version: i32 = 0;
            cuDriverGetVersion(&mut version as *mut i32).to_result()?;
            Ok(CudaApiVersion { version })
        }
    }

    /// Return the major version number - e.g. the 9 in version 9.2
    #[inline]
    pub fn major(self) -> i32 {
        self.version / 1000
    }

    /// Return the minor version number - e.g. the 2 in version 9.2
    #[inline]
    pub fn minor(self) -> i32 {
        (self.version % 1000) / 10
    }
}

#[cfg(test)]
mod test {
    use super::*;

    #[test]
    fn test_api_version() {
        let version = CudaApiVersion { version: 9020 };
        assert_eq!(version.major(), 9);
        assert_eq!(version.minor(), 2);
    }

    #[test]
    fn test_init_twice() {
        init(CudaFlags::empty()).unwrap();
        init(CudaFlags::empty()).unwrap();
    }
}

// Fake module with a private trait used to prevent outside code from implementing certain traits.
pub(crate) mod private {
    pub trait Sealed {}
}
264}