// Copyright 2021 Graydon Hoare <graydon@pobox.com>
// Licensed under the MIT and Apache-2.0 licenses.

//! # Ordbog
//!
//! This is a small crate providing a single special-purpose lossy compression
//! code, designed for use as a "scan accelerator" for database storage. Such
//! codes are not replacements for underlying values; rather they provide cheap
//! approximate answers to predicates that may be sufficient to elide accessing
//! the underlying data, similar to the way Bloom filters can elide lookups, but
//! supporting more general predicates (e.g. tabulations, range queries).
//!
//! Put another way: rewriting a query on the underlying data values to a query
//! on codes can produce false positives -- requiring a secondary query of the
//! underlying data -- but no false negatives. And half the codes in a given
//! dictionary (the "exact" codes, assigned to high-frequency inputs) produce no
//! false positives either.
//!
//! The codes are "cheap" (i.e. actually useful for acceleration) for three
//! reasons:
//!
//!   1. They are small, so conserve memory bandwidth: 1 or 2 bytes per code,
//!      vs. 8 bytes for an underlying float/u64 value, or more for a string,
//!      high resolution timestamp, uuid or large-decimal type.
//!
//!   2. They are simple integers, where the underlying data may be something
//!      more costly to process.
//!
//!   3. They are SIMD-friendly: an AVX2 scan can look at 16 or 32 codes at a
//!      time, and a GPU scan can look at hundreds at a time.
//!
//! The crate is equally usable for numeric, textual or categorical data. All it
//! needs is something ordered. It includes wrapper types for floating point.
//!
//! The codes it produces have the following characteristics:
//!
//!   1. Each code value is logically 8 or 16 bits (depending on the `Mode`
//!      enum). The user decides whether to operate with 8 or 16 bits: 8 bit
//!      codes should be used for memory-only scans, to elide 64-byte cache-line
//!      accesses; 16 bit codes should be used for disk scans, to elide 4k page
//!      accesses.
//!
//!   2. Code value 0 is unused, so that subsequent compression can use it as a
//!      sentinel or missing-value code.
//!
//!   3. All other codes alternate between even/exact (representing a specific
//!      value in the input) and odd/inexact (representing an open interval of
//!      possible input values). Values 1 and 0xff (or 0xffff, or whatever the
//!      final odd code is in the dictionary) thus encode one-sided lower and
//!      upper open intervals.
//!
//!   4. Codes are assigned to cover a _sample_ provided by the user, which is
//!      internally sorted and then partitioned into equal-sized bins, including
//!      duplicates. Then each run of duplicates within a bin is counted. The
//!      sample value with the longest run -- i.e. the highest-frequency sample
//!      value -- within each bin is given an (even) exact code. Then an (odd)
//!      inexact code is given to each open interval of sample values between
//!      sample values that were given exact codes. The provided sample should
//!      therefore be big enough to be representative of the total input; but if
//!      it is not representative, encoding still works, just less efficiently.
//!
//!   5. The assigned codes imply order and preserve equality, specifically:
//!        - `code(a) < code(b)` implies `a < b`
//!        - `a < b` implies `code(a) <= code(b)`
//!        - `a == b` implies `code(a) == code(b)`
//!
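//! ## Example
//!
//! A minimal usage sketch (the sample values are illustrative; the specific
//! codes assigned depend on the sample you provide):
//!
//! ```
//! use ordbog::{Code, Dict, Mode};
//!
//! // Build a byte-mode dictionary from a small sample of u64 values.
//! let dict = Dict::new(Mode::Byte, vec![5u64, 5, 5, 10, 20, 20, 30]);
//!
//! // Sampled values receive exact (even) codes.
//! assert!(dict.encode(&20).is_exact());
//!
//! // Values below every sampled value fall into the lowest open interval,
//! // which is the inexact code 1.
//! assert_eq!(dict.encode(&0), Code(1));
//!
//! // Codes preserve order: a < b implies code(a) <= code(b).
//! assert!(dict.encode(&5) <= dict.encode(&30));
//! ```
//!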
//! ## Reference
//!
//! Brian Hentschel, Michael S. Kester, and Stratos Idreos. 2018. Column
//! Sketches: A Scan Accelerator for Rapid and Robust Predicate Evaluation. In
//! Proceedings of the 2018 International Conference on Management of Data
//! (SIGMOD '18). Association for Computing Machinery, New York, NY, USA,
//! 857–872.
//!
//! DOI: <https://doi.org/10.1145/3183713.3196911>
//!
//! <https://stratos.seas.harvard.edu/files/stratos/files/sketches.pdf>
//!
//! ## Name
//!
//! Wiktionary (Danish):
//!
//! > Noun: ordbog (singular definite ordbogen, plural indefinite ordbøger)
//! > 1. dictionary, lexicon
//! >
//! > Etymology: From ord ("word") +‎ bog ("book"). Compare Swedish ordbok,
//! > English wordbook, German Wörterbuch.

use float_ord::FloatOrd;
use std::fmt::Debug;

/// Wrapper that supplies a Default (1.0) value around [FloatOrd]. This is the
/// type to use for a [Dict] of underlying [f64] values.
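///
/// A minimal usage sketch (illustrative values), wrapping [f64]s before
/// building the dictionary:
///
/// ```
/// use ordbog::{Dict, DictF64, Mode};
/// use float_ord::FloatOrd;
///
/// let sample: Vec<DictF64> = [0.5, 1.5, 1.5, 2.5]
///     .iter()
///     .map(|v| DictF64(FloatOrd(*v)))
///     .collect();
/// let dict = Dict::new(Mode::Byte, sample);
/// // The sampled value 1.5 receives an exact code.
/// assert!(dict.encode(&DictF64(FloatOrd(1.5))).is_exact());
/// ```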
#[derive(Clone, PartialEq, Eq, PartialOrd, Ord, Debug)]
pub struct DictF64(pub FloatOrd<f64>);

impl Default for DictF64 {
    fn default() -> Self {
        DictF64(FloatOrd(1.0))
    }
}

/// Wrapper that supplies a Default (1.0) value around [FloatOrd]. This is the
/// type to use for a [Dict] of underlying [f32] values.
#[derive(Clone, PartialEq, Eq, PartialOrd, Ord, Debug)]
pub struct DictF32(pub FloatOrd<f32>);

impl Default for DictF32 {
    fn default() -> Self {
        DictF32(FloatOrd(1.0))
    }
}

/// Wrapper for a [Dict] code value. If the [Dict] was
/// built with [Mode::Byte], this will have values ranging only
/// over `[1,255]`. If the [Dict] was built with [Mode::Word],
/// this will have values ranging over `[1,65535]`.
#[derive(Copy, Clone, PartialEq, Eq, PartialOrd, Ord, Debug)]
pub struct Code(pub u16);
impl Code {

    /// Return true iff the code is an _exact_ code, i.e. a code which
    /// represents a single underlying value rather than a range of possible
    /// values. This is true iff the code is an even number.
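    ///
    /// For example (even codes are exact, odd codes are inexact):
    ///
    /// ```
    /// use ordbog::Code;
    /// assert!(Code(4).is_exact());
    /// assert!(!Code(3).is_exact());
    /// ```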
    pub fn is_exact(&self) -> bool {
        (self.0 & 1) == 0
    }
}

/// Indicates whether to build a small [Dict] of up to 255 values
/// or a larger one of up to 65535 values.
#[derive(Copy, Clone, PartialEq, Eq, PartialOrd, Ord, Debug)]
pub enum Mode {
    /// Build a [Dict] with up to 255 codes ranging over `[1,255]`. This mode is
    /// most appropriate when building a sketch that elides accesses to smaller
    /// underlying storage blocks like 64-byte cache lines, where the small
    /// repertoire of codes (and thus higher per-element chance of a false
    /// positive) is offset by the small size of each storage block (and thus
    /// small number of elements).
    Byte,
    /// Build a [Dict] with up to 65535 codes ranging over `[1,65535]`. This
    /// mode is most appropriate when building a sketch that elides access to
    /// larger underlying storage blocks like 4096-byte pages, where the larger
    /// number of elements per storage block demands a comparatively low false
    /// positive probability per element.
    Word,
}
impl Mode {
    /// Returns the count of exact codes in the mode: either `127` for
    /// [Mode::Byte] or `32767` for [Mode::Word].
    pub fn num_exact_codes(&self) -> usize {
        match self {
            Mode::Byte => 127,
            Mode::Word => 32767,
        }
    }
    /// Returns the maximum exact code in the mode: either `0xfe` for
    /// [Mode::Byte] or `0xfffe` for [Mode::Word].
    pub fn max_exact_code(&self) -> Code {
        match self {
            Mode::Byte => Code(0xfe),
            Mode::Word => Code(0xfffe),
        }
    }
    /// Returns the maximum inexact code in the mode: either `0xff` for
    /// [Mode::Byte] or `0xffff` for [Mode::Word].
    pub fn max_inexact_code(&self) -> Code {
        match self {
            Mode::Byte => Code(0xff),
            Mode::Word => Code(0xffff),
        }
    }
}

/// Trait expressing requirements for the types of underlying values
/// that can be encoded in a [Dict].
pub trait ValReq : Ord + Clone + Default /*+ Debug*/ {}
impl<T> ValReq for T where T : Ord + Clone + Default /*+ Debug*/ {}

struct Cluster<T: ValReq> {
    value: T,
    count: usize,
}
/// A dictionary over an underlying type `T` conforming to [ValReq]. The
/// dictionary maps underlying values to [Code]s to use in a sketch, using
/// [Dict::encode].
pub struct Dict<T: ValReq> {

    /// The mode the dictionary was built in.
    pub mode: Mode,

    /// A sorted vector of the values assigned exact codes in the dictionary.
    /// Implicitly defines both exact and inexact code values based on the
    /// positions of exact codes in the vector.
    pub codes: Vec<T>,
}

impl<T: ValReq> Dict<T> {
    fn clusters(sorted_sample: &Vec<T>) -> Vec<Cluster<T>> {
        let mut clu = Vec::with_capacity(sorted_sample.len());
        if !sorted_sample.is_empty() {
            let mut curr = &sorted_sample[0];
            let mut count = 0;
            for i in sorted_sample.iter() {
                if *i == *curr {
                    count += 1;
                } else {
                    let value = (*curr).clone();
                    clu.push(Cluster { value, count });
                    curr = i;
                    count = 1;
                }
            }
            let value = (*curr).clone();
            clu.push(Cluster { value, count });
        }
        clu
    }

    /// Look up the code for a value of the underlying value type `T`.
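    ///
    /// A brief sketch of the exact/inexact distinction (illustrative sample):
    ///
    /// ```
    /// use ordbog::{Dict, Mode};
    ///
    /// let dict = Dict::new(Mode::Byte, vec![10u64, 20, 30]);
    /// // 20 was in the sample, so it encodes to an exact (even) code.
    /// assert!(dict.encode(&20).is_exact());
    /// // 15 falls between sampled values, so it encodes to an inexact (odd) code.
    /// assert!(!dict.encode(&15).is_exact());
    /// ```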
    pub fn encode(&self, query: &T) -> Code {
        // The `self.codes` vector stores the input values assigned to "exact"
        // codes, counting upwards from code 2. Thus a successful binary search
        // landing at `idx` returns exact code `2*(idx+1)`. An unsuccessful
        // binary search lands on the _next_ exact code greater than the query
        // value, so we subtract 1 from that code to denote the inexact code
        // covering the range below that next exact code.
        let code = match self.codes.binary_search(query) {
            Ok(idx) => 2 * (idx + 1),
            Err(idx) => (2 * (idx + 1)) - 1,
        };
        assert!(code <= 0xffff);
        Code(code as u16)
    }

    fn assign_codes_with_step(codestep: usize, clu: &Vec<Cluster<T>>) -> Vec<T> {
        // Guard against a zero `codestep` (possible when the adjustment loop in
        // `assign_codes_with_minimal_step` rounds the step down), which would
        // otherwise prevent `first_idx` from advancing below.
        let codestep = codestep.max(1);
        let mut codes = Vec::new();
        let mut first_idx = 0;
        while first_idx < clu.len() {
            let mut last_idx = first_idx;
            let mut idx_with_max_val = first_idx;
            let mut cluster_count_sum = 0;
            while last_idx < clu.len() && cluster_count_sum < codestep {
                if clu[idx_with_max_val].count < clu[last_idx].count {
                    idx_with_max_val = last_idx;
                }
                cluster_count_sum += clu[last_idx].count;
                last_idx += 1;
            }
            codes.push(clu[idx_with_max_val].value.clone());
            // The inner loop leaves `last_idx` pointing one past the last
            // cluster it consumed, so the next group starts exactly there;
            // advancing any further would silently drop a cluster.
            first_idx = last_idx;
        }
        codes
    }

    fn assign_codes_with_minimal_step(
        samplesize: usize,
        ncodes: usize,
        clu: &Vec<Cluster<T>>,
    ) -> Vec<T> {
        assert!(samplesize != 0);
        assert!(ncodes != 0);
        assert!(ncodes < samplesize);

        // Each code should cover at least codestep worth of the sample.
        let mut codestep = samplesize / ncodes;

        /*
        println!(
            "each of {} target codes should span {} samples",
            ncodes, codestep
        );
        */

        // We start with a basic dictionary with each code covering `codestep`
        // sample values, calculated by taking elements from the cluster list.
        let mut codes = Self::assign_codes_with_step(codestep, &clu);

        // Unfortunately it's possible some of those clusters overshoot the
        // `codestep`, giving us codes that cover too many sample values and
        // therefore giving us too few overall codes. To correct for this, we
        // want to iterate a few times (at most nine passes -- an ad-hoc limit)
        // estimating the error, reducing the `codestep` and re-encoding, to try
        // to get as close as possible (without going over) the target number of
        // codes.
        for _ in 0..=8 {
            assert!(!codes.is_empty());

            // If we hit the target we're done.
            if codes.len() == ncodes {
                break;
            }

            // If we overshot the target, truncate the best attempt and return.
            if codes.len() > ncodes {
                codes.resize(ncodes, <T as Default>::default());
                break;
            }

            // Otherwise estimate, reduce, and (if it's an improvement) accept.
            assert!(codes.len() < ncodes);
            let bias = (codes.len() * 10000) / ncodes;
            codestep *= bias;
            codestep /= 10000;
            /*
            println!(
                "wrong number of codes ({}), adjusting step to {}",
                codes.len(),
                codestep
            );
            */
            let next_codes = Self::assign_codes_with_step(codestep, &clu);
            if next_codes.len() <= ncodes {
                codes = next_codes;
            } else {
                break;
            }
        }
        codes
    }

    /// Build a dictionary with a given [Mode] over a provided sample of the
    /// underlying value type.
    ///
    /// The provided sample should be representative of the overall data, and
    /// specifically should contain duplicates at a similar proportion to those
    /// found in the overall data. If the sample is not representative, code
    /// accuracy will degrade (thus false positives will increase) but nothing
    /// else will go wrong.
    ///
    /// This function will sort the sample, so the sample should be small enough
    /// that the caller can tolerate the running time of sorting it. Otherwise
    /// the larger the sample, the more accurate the codes.
    pub fn new(mode: Mode, mut sample: Vec<T>) -> Self {
        // println!("beginning building dictionary from {} samples", sample.len());

        // For an empty sample we haven't much to work with; assign exact code 2
        // for the default value in the target type. Any value less than default
        // will code as 1, any value greater as 3. That's it.
        if sample.is_empty() {
            // println!("empty sample, using 1-element default");
            let codes = vec![<T as Default>::default()];
            return Self { mode, codes };
        }

        // If we have a real sample, we want to sort it both to assign
        // order-preserving codes and to cluster it for frequency analysis.
        sample.sort_unstable();

        // Do the frequency analysis.
        let clu = Self::clusters(&sample);
        assert!(!clu.is_empty());

        /*
        if clu.len() == sample.len()
        {
            println!("samples are all unique");
        }
        else
        {
            println!("reduced {} samples to {} clusters", sample.len(), clu.len());
        }
        */

        let ncodes = mode.num_exact_codes();

        // If there are no more clusters than exact codes, we can simply assign
        // one code per cluster; there's no need for anything fancier.
        if clu.len() <= ncodes {
            /*
            println!(
                "fewer clusters ({}) than target codes {}, using clusters",
                clu.len(), ncodes);
            */
            let codes = clu.into_iter().map(|c| c.value).collect();
            return Self { mode, codes };
        }
        let codes = Self::assign_codes_with_minimal_step(sample.len(), ncodes, &clu);
        // println!("finished building dictionary with {} exact codes", codes.len());
        Self { mode, codes }
    }
}
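
#[cfg(test)]
mod tests {
    use super::*;

    // A small, non-exhaustive sanity check sketching the documented
    // properties: the most frequent sample value should receive an exact
    // code, and codes should preserve the order of the underlying values.
    #[test]
    fn order_and_exactness() {
        // 1000 distinct values, with 500 appearing far more often than the rest.
        let mut sample: Vec<u64> = (0..1000u64).collect();
        sample.extend(std::iter::repeat(500u64).take(100));

        let dict = Dict::new(Mode::Byte, sample);

        // The heavily repeated value should have been given an exact code.
        assert!(dict.encode(&500).is_exact());

        // Order preservation: a < b implies code(a) <= code(b).
        let mut prev = dict.encode(&0);
        for i in 1..1000u64 {
            let cur = dict.encode(&i);
            assert!(prev <= cur);
            prev = cur;
        }
    }
}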