Module repdef

Module repdef 

Source
Expand description

Utilities for rep-def levels

Repetition and definition levels are a way to encode multipile validity / offsets arrays into a single buffer. They are a form of “zipping” buffers together that takes advantage of the fact that, if the outermost array is invalid, then the validity of the inner items is irrelevant.

Note: the concept of repetition & definition levels comes from the Dremel paper and has been implemented in Apache Parquet. However, the implementation here is not necessarily compatible with Parquet. For example, we use 0 to represent the “inner-most” item and Parquet uses 0 to represent the “outer-most” item.

§Repetition Levels

With repetition levels we convert a sparse array of offsets into a dense array of levels. These levels are marked non-zero whenever a new list begins. In other words, given the list array with 3 rows [{<0,1>, <>, <2>}, {<3>}, {}], [], [{<4>}] we would have three offsets arrays:

Outer-most ([]): [0, 3, 3, 4] Middle ({}): [0, 3, 4, 4, 5] Inner (<>): [0, 2, 2, 3, 4, 5] Values : [0, 1, 2, 3, 4]

We can convert these into repetition levels as follows:

ValuesRepetition
03
10
-1
21
32
-2
-3
40

Note: We actually have MORE repetition levels than values. This is because the repetition levels need to be able to represent empty lists.

§Definition Levels

Definition levels are simpler. We can think of them as zipping together various validity bitmaps (from different levels of nesting) into a single buffer. For example, we could zip the arrays [1, 1, 0, 0] and [1, 0, 1, 0] into [11, 10, 01, 00]. However, 00 and 01 are redundant. If the outer level is null then the validity of the inner levels is irrelevant. To save space we instead encode a “level” which is the “depth” of the null. Let’s look at a more complete example:

Array: [{“middle”: {“inner”: 1]}}, NULL, {“middle”: NULL}, {“middle”: {“inner”: NULL}}]

In Arrow we would have the following validity arrays: Outer validity : 1, 0, 1, 1 Middle validity: 1, ?, 0, 1 Inner validity : 1, ?, ?, 0 Values : 1, ?, ?, ?

The ? values are undefined in the Arrow format. We can convert these into definition levels as follows:

ValuesDefinition
10
-3
-2
-1

§Compression

Note that we only need 2 bits of definition levels to represent 3 levels of nesting. Definition levels are always more compact than the input validity arrays. However, compressed levels are not necessarily more compact than the compressed validity arrays.

Repetition levels are more complex. If there are very large lists then a sparse array of offsets (which has one element per list) might be more compact than a dense array of repetition levels (which has one element per list value, possibly even more if there are empty lists).

However, both repetition levels and definition levels are typically very compressible with RLE.

However, in Lance we don’t always take advantage of that compression because we want to be able to zip rep-def levels together with our values. This gives us fewer IOPS when accessing row values.

§Utilities in this Module

  • RepDefBuilder - Extracts validity and offset information from Arrow arrays. We use this as we shred the incoming data into primitive leaf arrays. We don’t immediately convert into rep-def because we need to share and cheaply clone the builder when we have structs (each struct child shares some parent validity / offset information) The serialize method is called once all data has been received to create the final rep-def levels.

  • SerializerContext - This is an internal utility that helps with serializing rep-def levels.

  • CompositeRepDefUnraveler - This structure is used to reverse the process. It starts with a set of rep-def levels and then uses the unravel_validity and unravel_offsets methods to produce validity buffers and offset buffers. It is “composite” because we may be combining sets of rep-def buffers from multiple locations (e.g. multiple blocks in a mini-block encoded file).

  • RepDefSlicer - This is a utility that helps with slicing rep-def buffers. These buffers are “kind of” transparent (maps 1:1 with the values in the array) but not exactly because of the special (empty/null) lists. The slicer helps with this issue and is used when slicing a rep-def buffer into mini-blocks.

  • build_control_word_iterator - This takes in rep-def levels and returns an iterator that returns byte-padded “control words” which are used when creating full-zip encoded data.

  • ControlWordParser - This parser can parse the control words returned by build_control_word_iterator and is used when decoding full-zip encoded data.

Structs§

BinaryControlWordIterator
A ControlWordIterator when there are both repetition and definition levels
CompositeRepDefUnraveler
As we decode we may extract rep/def information from multiple pages (or multiple chunks within a page).
ControlWordDesc
Describes the properties of a control word
NilaryControlWordIterator
A ControlWordIterator when there are no repetition or definition levels
RepDefBuilder
A structure used to collect validity buffers and offsets from arrow arrays and eventually create repetition and definition levels
RepDefSlicer
Slices a level buffer into pieces
RepDefUnraveler
Starts with serialized repetition and definition levels and unravels them into validity buffers and offsets buffers
SerializedRepDefs
Represents repetition and definition levels that have been serialized into a pair of (optional) level buffers
UnaryControlWordIterator
A ControlWordIterator when there are only definition levels or only repetition levels

Enums§

ControlWordIterator
An iterator that generates control words from repetition and definition levels
ControlWordParser
A parser to unwrap control words into repetition and definition levels
DefinitionInterpretation
This tells us how an array handles definition. Given a stack of these and a nested array and a set of definition levels we can calculate how we should interpret the definition levels.

Functions§

build_control_word_iterator
Builds a ControlWordIterator from repetition and definition levels by first calculating the width needed and then creating the iterator with the appropriate width

Type Aliases§

LevelBuffer