//! Static and Dynamic shared memory handling.
use crate::gpu_only;
/// Statically allocates a buffer large enough for `len` elements of `array_type`, yielding
/// a `*mut array_type` that points to uninitialized shared memory. `len` must be a constant expression.
///
/// Note that this allocates the memory __statically__: it expands to a `static` in the `shared` address space.
/// Calling this macro multiple times in a loop will therefore always yield the same buffer. Separate
/// invocations of the macro, however, yield distinct buffers.
///
/// The data is uninitialized by default, so you must be careful not to read it before it has been written.
/// The exact semantics of "uninitialized" on the GPU (i.e. whether it yields unknown data or whether reading it at all is undefined behavior)
/// are not well defined, so even if the type is valid for any backing memory, make sure not to read uninitialized data.
///
/// # Safety
///
/// Shared memory usage is fundamentally unsafe and impossible to prove correct statically, so
/// the burden of correctness is on the user. Among the things you must ensure when using
/// shared memory are:
/// - Shared memory is only shared __within a thread block__, not across the entire device; it is
/// therefore unsound to rely on it for sharing data across more than one block.
/// - You must write to the shared buffer before reading from it as the data is uninitialized by default.
/// - [`thread::sync_threads`](crate::thread::sync_threads) must be called before relying on the results of other
/// threads; it ensures every thread in the block has reached that point before any proceeds. For example, call it
/// after writing to the buffer and before reading another thread's data.
/// - No access may be out of bounds; this usually means making sure the number of threads and their dimensions are correct.
///
/// It is suggested to run your executable under `cuda-memcheck` (or its successor, `compute-sanitizer`) to make sure your usage of shared memory is correct.
///
/// # Examples
///
/// ```no_run
/// # use cuda_std::*;
/// #[kernel]
/// pub unsafe fn reverse_array(d: *mut i32, n: usize) {
///     let s = shared_array![i32; 64];
///     let t = thread::thread_idx_x() as usize;
///     let tr = n - t - 1;
///     *s.add(t) = *d.add(t);
///     thread::sync_threads();
///     *d.add(t) = *s.add(tr);
/// }
/// ```
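// The barrier in the example above is what makes it correct: every thread must
// finish writing to the shared buffer before any thread reads the mirrored
// element. The two-phase structure can be checked on the host with a plain,
// non-GPU Rust sketch (not part of this crate; the sequential loops stand in
// for the threads, and the loop boundary plays the role of the barrier):

```rust
// Host-side sketch mirroring the kernel's indexing: each "thread" t copies
// d[t] into a scratch buffer, then reads back the mirrored index n - t - 1.
fn reverse_via_scratch(d: &mut [i32]) {
    let n = d.len();
    let mut s = vec![0i32; n]; // stands in for the shared buffer

    // Phase 1: every thread writes its element before the barrier.
    for t in 0..n {
        s[t] = d[t];
    }
    // On the GPU, thread::sync_threads() would go here.

    // Phase 2: each thread reads the mirrored element written by another thread.
    for t in 0..n {
        d[t] = s[n - t - 1];
    }
}

fn main() {
    let mut d = vec![1, 2, 3, 4, 5];
    reverse_via_scratch(&mut d);
    assert_eq!(d, vec![5, 4, 3, 2, 1]);
    println!("{:?}", d);
}
```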
/// Gets a pointer to the dynamic shared memory that was allocated by the caller of the kernel. The
/// data is left uninitialized.
///
/// **Calling this function multiple times will yield the same pointer**.
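// A minimal, non-runnable usage sketch of the accessor documented above. The
// function name `dynamic_shared_mem` and the assumption that the kernel launch
// reserved enough dynamic shared memory for one `f32` per thread are
// illustrative, not confirmed by this file:

```rust
// Sketch only: names and launch-side sizing are assumptions.
#[kernel]
pub unsafe fn scale(out: *mut f32, factor: f32) {
    // Every call returns the same base pointer into the dynamic region,
    // so the pointer can be fetched wherever it is convenient.
    let buf: *mut f32 = dynamic_shared_mem();
    let t = thread::thread_idx_x() as usize;
    *buf.add(t) = *out.add(t) * factor;
    thread::sync_threads();
    *out.add(t) = *buf.add(t);
}
```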