Expand description
Utilities for rep-def levels
Repetition and definition levels are a way to encode multipile validity / offsets arrays into a single buffer. They are a form of “zipping” buffers together that takes advantage of the fact that, if the outermost array is invalid, then the validity of the inner items is irrelevant.
Note: the concept of repetition & definition levels comes from the Dremel paper and has been implemented in Apache Parquet. However, the implementation here is not necessarily compatible with Parquet. For example, we use 0 to represent the “inner-most” item and Parquet uses 0 to represent the “outer-most” item.
§Repetition Levels
With repetition levels we convert a sparse array of offsets into a dense array of levels. These levels are marked non-zero whenever a new list begins. In other words, given the list array with 3 rows [{<0,1>, <>, <2>}, {<3>}, {}], [], [{<4>}] we would have three offsets arrays:
Outer-most ([]): [0, 3, 3, 4] Middle ({}): [0, 3, 4, 4, 5] Inner (<>): [0, 2, 2, 3, 4, 5] Values : [0, 1, 2, 3, 4]
We can convert these into repetition levels as follows:
| Values | Repetition |
|---|---|
| 0 | 3 |
| 1 | 0 |
| - | 1 |
| 2 | 1 |
| 3 | 2 |
| - | 2 |
| - | 3 |
| 4 | 0 |
Note: We actually have MORE repetition levels than values. This is because the repetition levels need to be able to represent empty lists.
§Definition Levels
Definition levels are simpler. We can think of them as zipping together various validity bitmaps (from different levels of nesting) into a single buffer. For example, we could zip the arrays [1, 1, 0, 0] and [1, 0, 1, 0] into [11, 10, 01, 00]. However, 00 and 01 are redundant. If the outer level is null then the validity of the inner levels is irrelevant. To save space we instead encode a “level” which is the “depth” of the null. Let’s look at a more complete example:
Array: [{“middle”: {“inner”: 1]}}, NULL, {“middle”: NULL}, {“middle”: {“inner”: NULL}}]
In Arrow we would have the following validity arrays: Outer validity : 1, 0, 1, 1 Middle validity: 1, ?, 0, 1 Inner validity : 1, ?, ?, 0 Values : 1, ?, ?, ?
The ? values are undefined in the Arrow format. We can convert these into definition levels as follows:
| Values | Definition |
|---|---|
| 1 | 0 |
| - | 3 |
| - | 2 |
| - | 1 |
§Compression
Note that we only need 2 bits of definition levels to represent 3 levels of nesting. Definition levels are always more compact than the input validity arrays. However, compressed levels are not necessarily more compact than the compressed validity arrays.
Repetition levels are more complex. If there are very large lists then a sparse array of offsets (which has one element per list) might be more compact than a dense array of repetition levels (which has one element per list value, possibly even more if there are empty lists).
However, both repetition levels and definition levels are typically very compressible with RLE.
However, in Lance we don’t always take advantage of that compression because we want to be able to zip rep-def levels together with our values. This gives us fewer IOPS when accessing row values.
§Utilities in this Module
-
RepDefBuilder- Extracts validity and offset information from Arrow arrays. We use this as we shred the incoming data into primitive leaf arrays. We don’t immediately convert into rep-def because we need to share and cheaply clone the builder when we have structs (each struct child shares some parent validity / offset information) Theserializemethod is called once all data has been received to create the final rep-def levels. -
SerializerContext- This is an internal utility that helps with serializing rep-def levels. -
CompositeRepDefUnraveler- This structure is used to reverse the process. It starts with a set of rep-def levels and then uses theunravel_validityandunravel_offsetsmethods to produce validity buffers and offset buffers. It is “composite” because we may be combining sets of rep-def buffers from multiple locations (e.g. multiple blocks in a mini-block encoded file). -
RepDefSlicer- This is a utility that helps with slicing rep-def buffers. These buffers are “kind of” transparent (maps 1:1 with the values in the array) but not exactly because of the special (empty/null) lists. The slicer helps with this issue and is used when slicing a rep-def buffer into mini-blocks. -
build_control_word_iterator- This takes in rep-def levels and returns an iterator that returns byte-padded “control words” which are used when creating full-zip encoded data. -
ControlWordParser- This parser can parse the control words returned bybuild_control_word_iteratorand is used when decoding full-zip encoded data.
Structs§
- Binary
Control Word Iterator - A
ControlWordIteratorwhen there are both repetition and definition levels - Composite
RepDef Unraveler - As we decode we may extract rep/def information from multiple pages (or multiple chunks within a page).
- Control
Word Desc - Describes the properties of a control word
- Nilary
Control Word Iterator - A
ControlWordIteratorwhen there are no repetition or definition levels - RepDef
Builder - A structure used to collect validity buffers and offsets from arrow arrays and eventually create repetition and definition levels
- RepDef
Slicer - Slices a level buffer into pieces
- RepDef
Unraveler - Starts with serialized repetition and definition levels and unravels them into validity buffers and offsets buffers
- Serialized
RepDefs - Represents repetition and definition levels that have been serialized into a pair of (optional) level buffers
- Unary
Control Word Iterator - A
ControlWordIteratorwhen there are only definition levels or only repetition levels
Enums§
- Control
Word Iterator - An iterator that generates control words from repetition and definition levels
- Control
Word Parser - A parser to unwrap control words into repetition and definition levels
- Definition
Interpretation - This tells us how an array handles definition. Given a stack of these and a nested array and a set of definition levels we can calculate how we should interpret the definition levels.
Functions§
- build_
control_ word_ iterator - Builds a
ControlWordIteratorfrom repetition and definition levels by first calculating the width needed and then creating the iterator with the appropriate width