ordbog/lib.rs
1// Copyright 2021 Graydon Hoare <graydon@pobox.com>
2// Licensed under the MIT and Apache-2.0 licenses.
3
4//! # Ordbog
5//!
6//! This is a small crate providing a single special-purpose lossy compresison
7//! code, designed for use as a "scan accelerator" for database storage. Such
8//! codes are not replacements for underlying values; rather they provide cheap
9//! approximate answers to predicates that may be sufficient to elide accessing
10//! the underlying data, similar to the way bloom filters can elide lookups, but
11//! supporting more general predicates (eg. tabulations, range-queries).
12//!
13//! Put another way: rewriting a query on the underlying data values to a query
14//! on codes can produce false positives -- requiring a secondary query of the
15//! underlying data -- but no false negatives. And for half the codes in a given
16//! dictionary (the "exact" codes, assigned to high-frequency inputs), they also
17//! do not produce false positives.
18//!
19//! The codes are "cheap" (i.e. actually useful for acceleration) for three
20//! reasons:
21//!
22//! 1. They are small, so conserve memory bandwidth: 1 or 2 bytes per code,
23//! vs. 8 bytes for an underlying float/u64 value, or more for a string,
24//! high resolution timestamp, uuid or large-decimal type.
25//!
26//! 2. They are simple integers, where the underlying data may be something
27//! more costly to process.
28//!
29//! 3. They are SIMD-friendly: an AVX2 scan can look at 16 or 32 codes at a
30//! time, and a GPU scan can look at hundreds at a time.
31//!
32//! The crate is equally usable for numeric, textual or categorical data. All it
33//! needs is something ordered. It includes wrapper types for floating point.
34//!
35//! The codes it produces have the following characteristics:
36//!
37//! 1. Each code value is logically 8 or 16 bits (depending on the `Mode`
38//! enum). The user decides whether to operate with 8 or 16 bits: 8 bit
39//! codes should be used for memory-only scans, to elide 64-byte cache-line
40//! accesses; 16 bit codes should be used for disk scans, to elide 4k page
41//! accesses.
42//!
43//! 2. Code value 0 is unused, so that subsequent compression can use it as a
44//! sentinel or missing-value code.
45//!
46//! 3. All other codes alternate between even/exact (representing a specific
47//! value in the input) and odd/inexact (representing an open interval of
48//! possible input values). Values 1 and 0xff (or 0xffff, or whatever the
49//! final odd code is in the dictionary) thus encode one-sided lower and
50//! upper open intervals.
51//!
52//! 4. Codes are assigned to cover a _sample_ provided by the user, which is
53//! internally sorted and then partitioned into equal-sized bins, including
54//! duplicates. Then each run of duplicates within a bin is counted. The
55//! sample value with the longest run -- i.e. the highest-frequency sample
56//! value -- within each bin is given an (even) exact code. Then an (odd)
57//! inexact code is given to each open interval of sample values between
58//! sample values that were given exact codes. The provided sample should
59//! therefore be big enough to be representative of the total input; but if
60//! it is not representative, encoding still works, it just loses
61//! efficiency.
62//!
63//! 5. The assigned codes imply order and preserve equality, specifically:
64//! - `code(a) < code(b)` implies `a < b`
65//! - `a < b` implies `code(a) <= code(b)`
66//! - `a == b` implies `code(a) == code(b)`
67//!
68//! ## Reference
69//!
70//! Brian Hentschel, Michael S. Kester, and Stratos Idreos. 2018. Column
71//! Sketches: A Scan Accelerator for Rapid and Robust Predicate Evaluation. In
72//! Proceedings of the 2018 International Conference on Management of Data
73//! (SIGMOD '18). Association for Computing Machinery, New York, NY, USA,
74//! 857–872.
75//!
76//! DOI: <https://doi.org/10.1145/3183713.3196911>
77//!
78//! <https://stratos.seas.harvard.edu/files/stratos/files/sketches.pdf>
79//!
80//! ## Name
81//!
82//! Wikitionary (Danish):
83//!
84//! > Noun: ordbog (singular definite ordbogen, plural indefinite ordbøger)
85//! > 1. dictionary, lexicon
86//! >
87//! > Etymology: From ord ("word") + bog ("book"). Compare Swedish ordbok,
88//! > English wordbook, German Wörterbuch.
89
90use float_ord::FloatOrd;
91use std::fmt::Debug;
92
93/// Wrapper that supplies a Default (1.0) value around [FloatOrd]. This is the
94/// type to use for a [Dict] of underlying [f64] values.
95#[derive(Clone, PartialEq, Eq, PartialOrd, Ord, Debug)]
96pub struct DictF64(pub FloatOrd<f64>);
97
98impl Default for DictF64 {
99 fn default() -> Self {
100 DictF64(FloatOrd(1.0))
101 }
102}
103
104/// Wrapper that supplies a Default (1.0) value around [FloatOrd]. This is the
105/// type to use for a [Dict] of underlying [f32] values.
106#[derive(Clone, PartialEq, Eq, PartialOrd, Ord, Debug)]
107pub struct DictF32(pub FloatOrd<f32>);
108
109impl Default for DictF32 {
110 fn default() -> Self {
111 DictF32(FloatOrd(1.0))
112 }
113}
114
115/// Wrapper for a [Dict] code value. If the [Dict] was
116/// built with [Mode::Byte], this will have values ranging only
117/// over `[1,255]`. If the [Dict] was built with [Mode::Word],
118/// this will have values ranging over `[1,65535]`.
119#[derive(Copy, Clone, PartialEq, Eq, PartialOrd, Ord, Debug)]
120pub struct Code(pub u16);
121impl Code {
122
123 /// Return true iff the code is an _exact_ code, i.e. a code which
124 /// represents a single underlying value rather than a range of possible
125 /// values. This is true iff the code is an even number.
126 pub fn is_exact(&self) -> bool {
127 (self.0 & 1) == 0
128 }
129}
130
131/// Indicates whether to build a small [Dict] of up to 255 values
132/// or a larger one of up to 65535 values.
133#[derive(Copy, Clone, PartialEq, Eq, PartialOrd, Ord, Debug)]
134pub enum Mode {
135 /// Build a [Dict] with up to 255 codes ranging over `[1,255]`. This mode is
136 /// most appropriate when building a sketch that elides accesses to smaller
137 /// underlying storage blocks like 64-byte cache lines, where the small
138 /// repertoire of codes (and thus higher per-element chance of a false
139 /// positive) is offset by the small size of each storage block (and thus
140 /// small number of elements).
141 Byte,
142 /// Build a [Dict] with up to 65535 codes ranging over `[1,65535]`. This
143 /// mode is most appropriate when building a sketch that elides access to
144 /// larger underlying storage blocks like 4096-byte pages, where the larger
145 /// number of elements per storage block demands a comparatively low false
146 /// positive probability per element.
147 Word,
148}
149impl Mode {
150 /// Returns the count of exact codes in the mode: either `127` for
151 /// [Mode::Byte] or `32767` for [Mode::Word].
152 pub fn num_exact_codes(&self) -> usize {
153 match self {
154 Mode::Byte => 127,
155 Mode::Word => 32767,
156 }
157 }
158 /// Returns the maximum exact code in the mode: either `0xfe` for
159 /// [Mode::Byte] or `0xfffe` for [Mode::Word].
160 pub fn max_exact_code(&self) -> Code {
161 match self {
162 Mode::Byte => Code(0xfe),
163 Mode::Word => Code(0xfffe),
164 }
165 }
166 /// Returns the maximum inexact code in the mode: either `0xff` for
167 /// [Mode::Byte] or `0xffff` for [Mode::Word].
168 pub fn max_inexact_code(&self) -> Code {
169 match self {
170 Mode::Byte => Code(0xff),
171 Mode::Word => Code(0xffff),
172 }
173 }
174}
175
176/// Trait expressing requirements for the types of underlying values
177/// that can be encoded in a [Dict].
178pub trait ValReq : Ord + Clone + Default /*+ Debug*/ {}
179impl<T> ValReq for T where T : Ord + Clone + Default /*+ Debug*/ {}
180
181struct Cluster<T: ValReq> {
182 value: T,
183 count: usize,
184}
185/// A dictionary over an underlying type `T` conforming to [ValReq]. The
186/// dictionary maps underlying values to [Code]s to use in a sketch, using
187/// [Dict::encode].
188pub struct Dict<T: ValReq> {
189
190 /// The mode the dictionary was built in.
191 pub mode: Mode,
192
193 /// A sorted vector of the values assigned exact codes in the dictionary.
194 /// Implicitly defines both exact and inexact code values based on the
195 /// positions of exact codes in the vector.
196 pub codes: Vec<T>,
197}
198
199impl<T: ValReq> Dict<T> {
200 fn clusters(sorted_sample: &Vec<T>) -> Vec<Cluster<T>> {
201 let mut clu = Vec::with_capacity(sorted_sample.len());
202 if !sorted_sample.is_empty() {
203 let mut curr = &sorted_sample[0];
204 let mut count = 0;
205 for i in sorted_sample.iter() {
206 if *i == *curr {
207 count += 1;
208 } else {
209 let value = (*curr).clone();
210 clu.push(Cluster { value, count });
211 curr = i;
212 count = 1
213 }
214 }
215 let value = (*curr).clone();
216 clu.push(Cluster { value, count });
217 }
218 clu
219 }
220
221 /// Look up the code for a value of the underlying value type `T`.
222 pub fn encode(&self, query: &T) -> Code {
223 // The `self.code` array stores the input values assigned to "exact"
224 // codes, counting upwards from code 2. Thus a successful binary search
225 // landing at `idx` returns exact code `2*(idx+1)`. An unsuccessful
226 // binary search lands on the _next_ exact code greater than the query
227 // value, so we subtract 1 from that code to denote the inexact code
228 // covering the range below that next exact code.
229 let code = match self.codes.binary_search(query) {
230 Ok(idx) => 2 * (idx + 1),
231 Err(idx) => (2 * (idx + 1)) - 1,
232 };
233 assert!(code <= 0xffff);
234 Code(code as u16)
235 }
236
237 fn assign_codes_with_step(codestep: usize, clu: &Vec<Cluster<T>>) -> Vec<T> {
238 let mut codes = Vec::new();
239 let mut first_idx = 0;
240 while first_idx < clu.len() {
241 let mut last_idx = first_idx;
242 let mut idx_with_max_val = first_idx;
243 let mut cluster_count_sum = 0;
244 while last_idx < clu.len() && cluster_count_sum < codestep {
245 if clu[idx_with_max_val].count < clu[last_idx].count {
246 idx_with_max_val = last_idx;
247 }
248 cluster_count_sum += clu[last_idx].count;
249 last_idx += 1;
250 }
251 codes.push(clu[idx_with_max_val].value.clone());
252 // FIXME: boundary condition might be wrong here, I think?
253 // does this skip the end of each cluster? Why is life full
254 // of boundary errors? *Sobs* I am such a fool.
255 first_idx = last_idx + 1;
256 }
257 codes
258 }
259
260 fn assign_codes_with_minimal_step(
261 samplesize: usize,
262 ncodes: usize,
263 clu: &Vec<Cluster<T>>,
264 ) -> Vec<T> {
265 assert!(samplesize != 0);
266 assert!(ncodes != 0);
267 assert!(ncodes < samplesize);
268
269 // Each code should cover at least codestep worth of the sample.
270 let mut codestep = samplesize / ncodes;
271
272 /*
273 println!(
274 "each of {} target codes should span {} samples",
275 ncodes, codestep
276 );
277 */
278
279 // We start with a basic dictionary with each code covering `codestep`
280 // sample vaules, calculated by taking elements from the cluster list.
281 let mut codes = Self::assign_codes_with_step(codestep, &clu);
282
283 // Unfortunately it's possible some of those clusters overshoot the
284 // `codestep`, giving us codes that cover too many sample values and
285 // therefore giving us too few overall codes. To correct for this, we
286 // want to iterate a few times (up to 8 times -- ad-hoc limit)
287 // estimating the error, reducing the `codestep` and re-encoding, to try
288 // to get as close as possible (without going over) the target number of
289 // codes.
290 for _ in 0..=8 {
291 assert!(!codes.is_empty());
292
293 // If we hit the target we're done.
294 if codes.len() == ncodes {
295 break;
296 }
297
298 // If we overshot the target, truncate the best attempt and return.
299 if codes.len() > ncodes {
300 codes.resize(ncodes, <T as Default>::default());
301 break;
302 }
303
304 // Otherwise estimate, reduce, and (if it's an improvement) accept.
305 assert!(codes.len() < ncodes);
306 let bias = (codes.len() * 10000) / ncodes;
307 codestep *= bias;
308 codestep /= 10000;
309 /*
310 println!(
311 "wrong number of codes ({}), adjusting step to {}",
312 codes.len(),
313 codestep
314 );
315 */
316 let next_codes = Self::assign_codes_with_step(codestep, &clu);
317 if next_codes.len() <= ncodes {
318 codes = next_codes;
319 } else {
320 break;
321 }
322 }
323 codes
324 }
325
326 /// Build a dictionary with a given [Mode] over a provided sample of the
327 /// underlying value type.
328 ///
329 /// The provided sample should be representative of the overall data, and
330 /// specifically should contain duplicates at a similar proportion to those
331 /// found in the overall data. If the sample is not representative, code
332 /// accuracy will degrade (thus false positives will increase) but nothing
333 /// else will go wrong.
334 ///
335 /// This function will sort the sample, so the sample should be small enough
336 /// that the caller can tolerate the running time of sorting it. Otherwise
337 /// the larger the sample, the more accurate the codes.
338 pub fn new(mode: Mode, mut sample: Vec<T>) -> Self {
339 // println!("beginning building dictionary from {} samples", sample.len());
340
341 // For an empty sample we haven't much to work with; assign exact code 2
342 // for the default value in the target type. Any value less than default
343 // will code as 1, any value greater as 3. That's it.
344 if sample.is_empty() {
345 // println!("empty sample, using 1-element default");
346 let codes = vec![<T as Default>::default()];
347 return Self { mode, codes };
348 }
349
350 // If we have a real sample, we want to sort it both to assign
351 // order-preserving codes and to cluster it for frequency analysis.
352 sample.sort_unstable();
353
354 // Do the frequency analysis.
355 let clu = Self::clusters(&sample);
356 assert!(!clu.is_empty());
357
358 /*
359 if clu.len() == sample.len()
360 {
361 println!("samples are all unique");
362 }
363 else
364 {
365 println!("reduced {} samples to {} clusters", sample.len(), clu.len());
366 }
367 */
368
369 let ncodes = mode.num_exact_codes();
370
371 // If there are the same or fewer clusters than the codespace, we can
372 // just assign one code per cluster, there's no need for anything
373 // fancier.
374 if clu.len() <= ncodes {
375 /*
376 println!(
377 "fewer clusters ({}) than target codes {}, using clusters",
378 clu.len(), ncodes);
379 */
380 let codes = clu.into_iter().map(|c| c.value).collect();
381 return Self { mode, codes };
382 }
383 let codes = Self::assign_codes_with_minimal_step(sample.len(), ncodes, &clu);
384 // println!("finished building dictionary with {} exact codes", codes.len());
385 Self { mode, codes }
386 }
387}