1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
// SPDX-License-Identifier: Apache-2.0
// SPDX-FileCopyrightText: Copyright the Vortex contributors
//! An array that partially "patches" another array with new values.
//!
//! # Background
//!
//! This is meant to be the foundation of a fully data-parallel patching strategy, based on the
//! work published in ["G-ALP" from Hepkema et al.](https://ir.cwi.nl/pub/35205/35205.pdf)
//!
//! Patching is common when an encoding almost completely covers an array save a few exceptions.
//! In that case, rather than avoid the encoding entirely, it's preferable to
//!
//! * Replace unencodable values with fillers (zeros, frequent values, nulls, etc.)
//! * Wrap the array with a `PatchedArray` signaling that when the original array is executed,
//! some of the decoded values must be overwritten.
//!
//! In Vortex, the FastLanes bit-packing encoding is often the terminal node in an encoding tree,
//! and FastLanes has an intrinsic chunking of 1024 elements. Thus, 1024 elements is pervasively
//! a useful unit of chunking throughout Vortex, and so we use 1024 as a chunk size here
//! as well.
//!
//! # Details
//!
//! To patch an array, we first divide it into a set of chunks of length 1024, and then within
//! each chunk, we assign each position to a lane. The number of lanes depends on the width of
//! the underlying type.
//!
//! Thus, rather than sorting patch indices and values by their global offset, they are sorted
//! primarily by their chunk, and then subsequently by their lanes.
//!
//! The Patched array layout has 4 children
//!
//! * `inner`: the inner array is the one containing encoded values, including the filler values
//! that need to be patched over at execution time
//! * `lane_offsets`: this is an indexing buffer that allows you to see into ranges of the other
//! two children
//! * `indices`: An array of `u16` chunk indices, indicating where within the chunk should the value
//! be overwritten by the patch value
//! * `values`: The child array containing the patch values, which should be inserted over
//! the values of the `inner` at the locations provided by `indices`
//!
//! `indices` and `values` are aligned and accessed together.
//!
//! ```text
//!
//! chunk 0 chunk 0 chunk 0 chunk 0 chunk 0 chunk 0
//! lane 0 lane 1 lane 2 lane 3 lane 4 lane 5
//! ┌────────────┬────────────┬────────────┬────────────┬────────────┬────────────┐
//! lane_offsets │ 0 │ 0 │ 2 │ 2 │ 3 │ 5 │ ...
//! └─────┬──────┴─────┬──────┴─────┬──────┴──────┬─────┴──────┬─────┴──────┬─────┘
//! │ │ │ │ │ │
//! │ │ │ │ │ │
//! ┌─────┴────────────┘ └──────┬──────┘ ┌──────┘ └─────┐
//! │ │ │ │
//! │ │ │ │
//! │ │ │ │
//! ▼────────────┬────────────┬────────────▼────────────▼────────────┬────────────▼
//! indices │ │ │ │ │ │ │
//! │ │ │ │ │ │ │
//! ├────────────┼────────────┼────────────┼────────────┼────────────┼────────────┤
//! values │ │ │ │ │ │ │
//! │ │ │ │ │ │ │
//! └────────────┴────────────┴────────────┴────────────┴────────────┴────────────┘
//! ```
//!
//! It turns out that this layout is optimal for executing patching on GPUs, because the
//! `lane_offsets` allows each thread in a warp to seek to its patches in constant time.
pub use *;
use ByteBuffer;
pub use *;
/// Patches that have been transposed into GPU format.
/// Number of lanes used at patch time for a value of type `V`.
///
/// This is *NOT* equal to the number of FastLanes lanes for the type `V`, rather this is going to
/// correspond to how many "lanes" we will end up copying data on.
///
/// When applied on the CPU, this configuration doesn't really matter. On the GPU, it is based
/// on the number of patches involved here.
const