ordbog/
lib.rs

1// Copyright 2021 Graydon Hoare <graydon@pobox.com>
2// Licensed under the MIT and Apache-2.0 licenses.
3
4//! # Ordbog
5//!
6//! This is a small crate providing a single special-purpose lossy compresison
7//! code, designed for use as a "scan accelerator" for database storage. Such
8//! codes are not replacements for underlying values; rather they provide cheap
9//! approximate answers to predicates that may be sufficient to elide accessing
10//! the underlying data, similar to the way bloom filters can elide lookups, but
11//! supporting more general predicates (eg. tabulations, range-queries).
12//!
13//! Put another way: rewriting a query on the underlying data values to a query
14//! on codes can produce false positives -- requiring a secondary query of the
15//! underlying data -- but no false negatives. And for half the codes in a given
16//! dictionary (the "exact" codes, assigned to high-frequency inputs), they also
17//! do not produce false positives.
18//!
19//! The codes are "cheap" (i.e. actually useful for acceleration) for three
20//! reasons:
21//!
22//!   1. They are small, so conserve memory bandwidth: 1 or 2 bytes per code,
23//!      vs. 8 bytes for an underlying float/u64 value, or more for a string,
24//!      high resolution timestamp, uuid or large-decimal type.
25//!
26//!   2. They are simple integers, where the underlying data may be something
27//!      more costly to process.
28//!
29//!   3. They are SIMD-friendly: an AVX2 scan can look at 16 or 32 codes at a
30//!      time, and a GPU scan can look at hundreds at a time.
31//!
32//! The crate is equally usable for numeric, textual or categorical data. All it
33//! needs is something ordered. It includes wrapper types for floating point.
34//!
35//! The codes it produces have the following characteristics:
36//!
37//!   1. Each code value is logically 8 or 16 bits (depending on the `Mode`
38//!      enum). The user decides whether to operate with 8 or 16 bits: 8 bit
39//!      codes should be used for memory-only scans, to elide 64-byte cache-line
40//!      accesses; 16 bit codes should be used for disk scans, to elide 4k page
41//!      accesses.
42//!
43//!   2. Code value 0 is unused, so that subsequent compression can use it as a
44//!      sentinel or missing-value code.
45//!
46//!   3. All other codes alternate between even/exact (representing a specific
47//!      value in the input) and odd/inexact (representing an open interval of
48//!      possible input values). Values 1 and 0xff (or 0xffff, or whatever the
49//!      final odd code is in the dictionary) thus encode one-sided lower and
50//!      upper open intervals.
51//!
52//!   4. Codes are assigned to cover a _sample_ provided by the user, which is
53//!      internally sorted and then partitioned into equal-sized bins, including
54//!      duplicates. Then each run of duplicates within a bin is counted. The
55//!      sample value with the longest run -- i.e. the highest-frequency sample
56//!      value -- within each bin is given an (even) exact code. Then an (odd)
57//!      inexact code is given to each open interval of sample values between
58//!      sample values that were given exact codes. The provided sample should
59//!      therefore be big enough to be representative of the total input; but if
60//!      it is not representative, encoding still works, it just loses
61//!      efficiency.
62//!
63//!   5. The assigned codes imply order and preserve equality, specifically:
64//!        - `code(a) < code(b)` implies `a < b`
65//!        - `a < b` implies `code(a) <= code(b)`
66//!        - `a == b` implies `code(a) == code(b)`
67//!
68//! ## Reference
69//!
70//! Brian Hentschel, Michael S. Kester, and Stratos Idreos. 2018. Column
71//! Sketches: A Scan Accelerator for Rapid and Robust Predicate Evaluation. In
72//! Proceedings of the 2018 International Conference on Management of Data
73//! (SIGMOD '18). Association for Computing Machinery, New York, NY, USA,
74//! 857–872.
75//!
76//! DOI: <https://doi.org/10.1145/3183713.3196911>
77//!
78//! <https://stratos.seas.harvard.edu/files/stratos/files/sketches.pdf>
79//!
80//! ## Name
81//!
82//! Wikitionary (Danish):
83//!
84//! > Noun: ordbog (singular definite ordbogen, plural indefinite ordbøger)
85//! > 1. dictionary, lexicon
86//! >
87//! > Etymology: From ord ("word") +‎ bog ("book"). Compare Swedish ordbok,
88//! > English wordbook, German Wörterbuch.
89
90use float_ord::FloatOrd;
91use std::fmt::Debug;
92
93/// Wrapper that supplies a Default (1.0) value around [FloatOrd]. This is the
94/// type to use for a [Dict] of underlying [f64] values.
95#[derive(Clone, PartialEq, Eq, PartialOrd, Ord, Debug)]
96pub struct DictF64(pub FloatOrd<f64>);
97
98impl Default for DictF64 {
99    fn default() -> Self {
100        DictF64(FloatOrd(1.0))
101    }
102}
103
104/// Wrapper that supplies a Default (1.0) value around [FloatOrd]. This is the
105/// type to use for a [Dict] of underlying [f32] values.
106#[derive(Clone, PartialEq, Eq, PartialOrd, Ord, Debug)]
107pub struct DictF32(pub FloatOrd<f32>);
108
109impl Default for DictF32 {
110    fn default() -> Self {
111        DictF32(FloatOrd(1.0))
112    }
113}
114
115/// Wrapper for a [Dict] code value. If the [Dict] was
116/// built with [Mode::Byte], this will have values ranging only
117/// over `[1,255]`. If the [Dict] was built with [Mode::Word],
118/// this will have values ranging over `[1,65535]`.
119#[derive(Copy, Clone, PartialEq, Eq, PartialOrd, Ord, Debug)]
120pub struct Code(pub u16);
121impl Code {
122
123    /// Return true iff the code is an _exact_ code, i.e. a code which
124    /// represents a single underlying value rather than a range of possible
125    /// values. This is true iff the code is an even number.
126    pub fn is_exact(&self) -> bool {
127        (self.0 & 1) == 0
128    }
129}
130
131/// Indicates whether to build a small [Dict] of up to 255 values
132/// or a larger one of up to 65535 values.
133#[derive(Copy, Clone, PartialEq, Eq, PartialOrd, Ord, Debug)]
134pub enum Mode {
135    /// Build a [Dict] with up to 255 codes ranging over `[1,255]`. This mode is
136    /// most appropriate when building a sketch that elides accesses to smaller
137    /// underlying storage blocks like 64-byte cache lines, where the small
138    /// repertoire of codes (and thus higher per-element chance of a false
139    /// positive) is offset by the small size of each storage block (and thus
140    /// small number of elements).
141    Byte,
142    /// Build a [Dict] with up to 65535 codes ranging over `[1,65535]`. This
143    /// mode is most appropriate when building a sketch that elides access to
144    /// larger underlying storage blocks like 4096-byte pages, where the larger
145    /// number of elements per storage block demands a comparatively low false
146    /// positive probability per element.
147    Word,
148}
149impl Mode {
150    /// Returns the count of exact codes in the mode: either `127` for
151    /// [Mode::Byte] or `32767` for [Mode::Word].
152    pub fn num_exact_codes(&self) -> usize {
153        match self {
154            Mode::Byte => 127,
155            Mode::Word => 32767,
156        }
157    }
158    /// Returns the maximum exact code in the mode: either `0xfe` for
159    /// [Mode::Byte] or `0xfffe` for [Mode::Word].
160    pub fn max_exact_code(&self) -> Code {
161        match self {
162            Mode::Byte => Code(0xfe),
163            Mode::Word => Code(0xfffe),
164        }
165    }
166    /// Returns the maximum inexact code in the mode: either `0xff` for
167    /// [Mode::Byte] or `0xffff` for [Mode::Word].
168    pub fn max_inexact_code(&self) -> Code {
169        match self {
170            Mode::Byte => Code(0xff),
171            Mode::Word => Code(0xffff),
172        }
173    }
174}
175
176/// Trait expressing requirements for the types of underlying values
177/// that can be encoded in a [Dict].
178pub trait ValReq : Ord + Clone + Default /*+ Debug*/ {}
179impl<T> ValReq for T where T : Ord + Clone + Default /*+ Debug*/ {}
180
181struct Cluster<T: ValReq> {
182    value: T,
183    count: usize,
184}
185/// A dictionary over an underlying type `T` conforming to [ValReq]. The
186/// dictionary maps underlying values to [Code]s to use in a sketch, using
187/// [Dict::encode].
188pub struct Dict<T: ValReq> {
189
190    /// The mode the dictionary was built in.
191    pub mode: Mode,
192
193    /// A sorted vector of the values assigned exact codes in the dictionary.
194    /// Implicitly defines both exact and inexact code values based on the
195    /// positions of exact codes in the vector.
196    pub codes: Vec<T>,
197}
198
199impl<T: ValReq> Dict<T> {
200    fn clusters(sorted_sample: &Vec<T>) -> Vec<Cluster<T>> {
201        let mut clu = Vec::with_capacity(sorted_sample.len());
202        if !sorted_sample.is_empty() {
203            let mut curr = &sorted_sample[0];
204            let mut count = 0;
205            for i in sorted_sample.iter() {
206                if *i == *curr {
207                    count += 1;
208                } else {
209                    let value = (*curr).clone();
210                    clu.push(Cluster { value, count });
211                    curr = i;
212                    count = 1
213                }
214            }
215            let value = (*curr).clone();
216            clu.push(Cluster { value, count });
217        }
218        clu
219    }
220
221    /// Look up the code for a value of the underlying value type `T`.
222    pub fn encode(&self, query: &T) -> Code {
223        // The `self.code` array stores the input values assigned to "exact"
224        // codes, counting upwards from code 2. Thus a successful binary search
225        // landing at `idx` returns exact code `2*(idx+1)`. An unsuccessful
226        // binary search lands on the _next_ exact code greater than the query
227        // value, so we subtract 1 from that code to denote the inexact code
228        // covering the range below that next exact code.
229        let code = match self.codes.binary_search(query) {
230            Ok(idx) => 2 * (idx + 1),
231            Err(idx) => (2 * (idx + 1)) - 1,
232        };
233        assert!(code <= 0xffff);
234        Code(code as u16)
235    }
236
237    fn assign_codes_with_step(codestep: usize, clu: &Vec<Cluster<T>>) -> Vec<T> {
238        let mut codes = Vec::new();
239        let mut first_idx = 0;
240        while first_idx < clu.len() {
241            let mut last_idx = first_idx;
242            let mut idx_with_max_val = first_idx;
243            let mut cluster_count_sum = 0;
244            while last_idx < clu.len() && cluster_count_sum < codestep {
245                if clu[idx_with_max_val].count < clu[last_idx].count {
246                    idx_with_max_val = last_idx;
247                }
248                cluster_count_sum += clu[last_idx].count;
249                last_idx += 1;
250            }
251            codes.push(clu[idx_with_max_val].value.clone());
252            // FIXME: boundary condition might be wrong here, I think?
253            // does this skip the end of each cluster? Why is life full
254            // of boundary errors? *Sobs* I am such a fool.
255            first_idx = last_idx + 1;
256        }
257        codes
258    }
259
260    fn assign_codes_with_minimal_step(
261        samplesize: usize,
262        ncodes: usize,
263        clu: &Vec<Cluster<T>>,
264    ) -> Vec<T> {
265        assert!(samplesize != 0);
266        assert!(ncodes != 0);
267        assert!(ncodes < samplesize);
268
269        // Each code should cover at least codestep worth of the sample.
270        let mut codestep = samplesize / ncodes;
271
272        /*
273        println!(
274            "each of {} target codes should span {} samples",
275            ncodes, codestep
276        );
277        */
278
279        // We start with a basic dictionary with each code covering `codestep`
280        // sample vaules, calculated by taking elements from the cluster list.
281        let mut codes = Self::assign_codes_with_step(codestep, &clu);
282
283        // Unfortunately it's possible some of those clusters overshoot the
284        // `codestep`, giving us codes that cover too many sample values and
285        // therefore giving us too few overall codes. To correct for this, we
286        // want to iterate a few times (up to 8 times -- ad-hoc limit)
287        // estimating the error, reducing the `codestep` and re-encoding, to try
288        // to get as close as possible (without going over) the target number of
289        // codes.
290        for _ in 0..=8 {
291            assert!(!codes.is_empty());
292
293            // If we hit the target we're done.
294            if codes.len() == ncodes {
295                break;
296            }
297
298            // If we overshot the target, truncate the best attempt and return.
299            if codes.len() > ncodes {
300                codes.resize(ncodes, <T as Default>::default());
301                break;
302            }
303
304            // Otherwise estimate, reduce, and (if it's an improvement) accept.
305            assert!(codes.len() < ncodes);
306            let bias = (codes.len() * 10000) / ncodes;
307            codestep *= bias;
308            codestep /= 10000;
309            /*
310            println!(
311                "wrong number of codes ({}), adjusting step to {}",
312                codes.len(),
313                codestep
314            );
315            */
316            let next_codes = Self::assign_codes_with_step(codestep, &clu);
317            if next_codes.len() <= ncodes {
318                codes = next_codes;
319            } else {
320                break;
321            }
322        }
323        codes
324    }
325
326    /// Build a dictionary with a given [Mode] over a provided sample of the
327    /// underlying value type.
328    ///
329    /// The provided sample should be representative of the overall data, and
330    /// specifically should contain duplicates at a similar proportion to those
331    /// found in the overall data. If the sample is not representative, code
332    /// accuracy will degrade (thus false positives will increase) but nothing
333    /// else will go wrong.
334    ///
335    /// This function will sort the sample, so the sample should be small enough
336    /// that the caller can tolerate the running time of sorting it. Otherwise
337    /// the larger the sample, the more accurate the codes.
338    pub fn new(mode: Mode, mut sample: Vec<T>) -> Self {
339        // println!("beginning building dictionary from {} samples", sample.len());
340
341        // For an empty sample we haven't much to work with; assign exact code 2
342        // for the default value in the target type. Any value less than default
343        // will code as 1, any value greater as 3. That's it.
344        if sample.is_empty() {
345            // println!("empty sample, using 1-element default");
346            let codes = vec![<T as Default>::default()];
347            return Self { mode, codes };
348        }
349
350        // If we have a real sample, we want to sort it both to assign
351        // order-preserving codes and to cluster it for frequency analysis.
352        sample.sort_unstable();
353
354        // Do the frequency analysis.
355        let clu = Self::clusters(&sample);
356        assert!(!clu.is_empty());
357
358        /*
359        if clu.len() == sample.len()
360        {
361            println!("samples are all unique");
362        }
363        else
364        {
365            println!("reduced {} samples to {} clusters", sample.len(), clu.len());
366        }
367        */
368
369        let ncodes = mode.num_exact_codes();
370
371        // If there are the same or fewer clusters than the codespace, we can
372        // just assign one code per cluster, there's no need for anything
373        // fancier.
374        if clu.len() <= ncodes {
375            /*
376            println!(
377                "fewer clusters ({}) than target codes {}, using clusters",
378                clu.len(), ncodes);
379            */
380            let codes = clu.into_iter().map(|c| c.value).collect();
381            return Self { mode, codes };
382        }
383        let codes = Self::assign_codes_with_minimal_step(sample.len(), ncodes, &clu);
384        // println!("finished building dictionary with {} exact codes", codes.len());
385        Self { mode, codes }
386    }
387}