parquet 57.1.0

Apache Parquet implementation in Rust
Documentation
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
<!---
  Licensed to the Apache Software Foundation (ASF) under one
  or more contributor license agreements.  See the NOTICE file
  distributed with this work for additional information
  regarding copyright ownership.  The ASF licenses this file
  to you under the Apache License, Version 2.0 (the
  "License"); you may not use this file except in compliance
  with the License.  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing,
  software distributed under the License is distributed on an
  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  KIND, either express or implied.  See the License for the
  specific language governing permissions and limitations
  under the License.
-->

# Thrift serialization in the parquet crate

For both performance and flexibility reasons, this crate uses custom Thrift parsers and
serialization mechanisms. For many of the objects defined by the Parquet specification macros
are used to generate the objects as well as the code to serialize them. But in certain instances
(performance bottlenecks, additions to the spec, etc.), it becomes necessary to implement the
serialization code manually. This document serves to document both the standard usage of the
Thrift macros, as well as how to implement custom encoders and decoders.

## Thrift macros

The Parquet specification utilizes Thrift enums, unions, and structs, defined by an Interface
Description Language (IDL). This IDL is usually parsed by a Thrift code generator to produce
language specific structures and serialization/deserialization code. This crate, however, uses
Rust macros to perform the same function. In addition to skipping creation of additional duplicate
structures, doing so allows for customizations that produce more performant code, as well as the
ability to pick and choose which fields to process.

### Enums

Thrift enums are the simplest structure, and are logically identical to Rust enums with unit
variants. The IDL description will look like

```
enum Type {
  BOOLEAN = 0;
  INT32 = 1;
  INT64 = 2;
  INT96 = 3;
  FLOAT = 4;
  DOUBLE = 5;
  BYTE_ARRAY = 6;
  FIXED_LEN_BYTE_ARRAY = 7;
}
```

The `thrift_enum` macro can be used in this instance.

```rust
thrift_enum!(
enum Type {
  BOOLEAN = 0;
  INT32 = 1;
  INT64 = 2;
  INT96 = 3;
  FLOAT = 4;
  DOUBLE = 5;
  BYTE_ARRAY = 6;
  FIXED_LEN_BYTE_ARRAY = 7;
}
);
```

which will produce a public Rust enum

```rust
pub enum Type {
  BOOLEAN,
  INT32,
  INT64,
  INT96,
  FLOAT,
  DOUBLE,
  BYTE_ARRAY,
  FIXED_LEN_BYTE_ARRAY,
}
```

All Rust `enum`s produced with this macro will have `pub` visibility.

### Unions

Thrift unions are a special kind of struct in which only a single field is populated. In this
regard they are much like Rust enums which can have a mix of unit and tuple variants. Because of
this flexibility, specifying unions is a little bit trickier.

Often times a union will be defined for which all the variants are typed with empty structs. For
example the `TimeUnit` union used for `LogicalType`s.

```
struct MilliSeconds {}
struct MicroSeconds {}
struct NanoSeconds {}
union TimeUnit {
  1: MilliSeconds MILLIS
  2: MicroSeconds MICROS
  3: NanoSeconds NANOS
}
```

When serialized, these empty structs become a single `0` (to mark the end of the struct). As an
optimization, and to allow for a simpler interface, the `thrift_union_all_empty` macro can be used.

```rust
thrift_union_all_empty!(
union TimeUnit {
  1: MilliSeconds MILLIS
  2: MicroSeconds MICROS
  3: NanoSeconds NANOS
}
);
```

This macro will ignore the types specified for each variant, and will produce the following Rust
`enum`:

```rust
pub enum TimeUnit {
    MILLIS,
    MICROS,
    NANOS,
}
```

For unions with mixed variant types, some modifications to the IDL are necessary. Take the
definition of `ColumnCryptoMetadata`.

```
struct EncryptionWithFooterKey {
}

struct EncryptionWithColumnKey {
  /** Column path in schema **/
  1: required list<string> path_in_schema

  /** Retrieval metadata of column encryption key **/
  2: optional binary key_metadata
}

union ColumnCryptoMetaData {
  1: EncryptionWithFooterKey ENCRYPTION_WITH_FOOTER_KEY
  2: EncryptionWithColumnKey ENCRYPTION_WITH_COLUMN_KEY
}
```

The `ENCRYPTION_WITH_FOOTER_KEY` variant is typed with an empty struct, while
`ENCRYPTION_WITH_COLUMN_KEY` has the type of a struct with fields. In this case, the `thrift_union`
macro is used.

```rust
thrift_union!(
union ColumnCryptoMetaData {
  1: ENCRYPTION_WITH_FOOTER_KEY
  2: (EncryptionWithColumnKey) ENCRYPTION_WITH_COLUMN_KEY
}
);
```

Here, the type has been omitted for `ENCRYPTION_WITH_FOOTER_KEY` to indicate it should be a unit
variant, while the type for `ENCRYPTION_WITH_COLUMN_KEY` is enclosed in parens. The parens are
necessary to provide a semantic clue to the macro that the identifier is a type. The above will
produce the Rust enum

```rust
pub enum ColumnCryptoMetaData {
    ENCRYPTION_WITH_FOOTER_KEY,
    ENCRYPTION_WITH_COLUMN_KEY(EncryptionWithColumnKey),
}
```

All Rust `enum`s produced with either macro will have `pub` visibility. `thrift_union` also allows
for lifetime annotations, but this capability is not currently utilized.

### Structs

The `thrift_struct` macro is used for structs. This macro is a little more flexible than the others
because it allows for the visibility to be specified, and also allows for lifetimes to be specified
for the defined structs as well as their fields. An example of this is the `SchemaElement` struct.
This is defined in this crate as

```rust
thrift_struct!(
pub(crate) struct SchemaElement<'a> {
  1: optional Type r#type;
  2: optional i32 type_length;
  3: optional Repetition repetition_type;
  4: required string<'a> name;
  5: optional i32 num_children;
  6: optional ConvertedType converted_type;
  7: optional i32 scale
  8: optional i32 precision
  9: optional i32 field_id;
  10: optional LogicalType logical_type
}
);
```

Here the `string` field `name` is given a lifetime annotation, which is then propagated to the
struct definition. Without this annotation, the resultant field would be a `String` type, rather
than a string slice. The visibility of this struct (and all fields) will be `pub(crate)`. The
resultant Rust struct will be

```rust
pub(crate) struct SchemaElement<'a> {
    pub(crate) r#type: Type, // here we've changed the name `type` to `r#type` to avoid reserved words
    pub(crate) type_length: i32,
    pub(crate) repetition_type: Repetition,
    pub(crate) name: &'a str,
    ...
}
```

The lifetime annotations can also be added to list elements, as in

```rust
thrift_struct!(
struct FileMetaData<'a> {
  /** Version of this file **/
  1: required i32 version
  2: required list<'a><SchemaElement> schema;
  3: required i64 num_rows
  4: required list<'a><RowGroup> row_groups
  5: optional list<KeyValue> key_value_metadata
  6: optional string created_by
  7: optional list<ColumnOrder> column_orders;
  8: optional EncryptionAlgorithm encryption_algorithm
  9: optional binary footer_signing_key_metadata
}
);
```

Note that the lifetime annotation precedes the element type specification.

## Serialization traits

Serialization is performed via several Rust traits. On the deserialization, objects implement
the `ReadThrift` trait. This defines a `read_thrift` function that takes a
`ThriftCompactInputProtocol` I/O object as an argument. The `read_thrift` function performs
all steps necessary to deserialize the object from the input stream, and is usually produced by
one of the macros mentioned above.

On the serialization side, the `WriteThrift` and `WriteThriftField` traits are used in conjunction
with a `ThriftCompactOutputProtocol` struct. As above, the Thrift macros produce the necessary
implementations needed to perform serialization.

While the macros can be used in most circumstances, sometimes more control is needed. The following
sections provide information on how to provide custom implementations for the serialization
traits.

### ReadThrift Customization

Thrift enums are serialized as a single `i32` value. The process of reading an enum is straightforward:
read the enum discriminant, and then match on the possible values. For instance, reading the
`ConvertedType` enum becomes:

```rust
impl<'a, R: ThriftCompactInputProtocol<'a>> ReadThrift<'a, R> for ConvertedType {
    fn read_thrift(prot: &mut R) -> Result<Self> {
        let val = prot.read_i32()?;
        Ok(match val {
            0 => Self::UTF8,
            1 => Self::MAP,
            2 => Self::MAP_KEY_VALUE,
            ...
            21 => Self::INTERVAL,
            _ => return Err(general_err!("Unexpected ConvertedType {}", val)),
        })
    }
}
```

The default behavior is to return an error when an unexpected field is encountered. One could,
however, provide an `Unknown` variant if forward compatibility is neeeded in the case of an
evolving enum.

Deserializing structs is more involved, but still fairly easy. A thrift struct is serialized as
repeated `(field_id,field_type,field)` tuples. The `field_id` and `field_type` usually occupy a
single byte, followed by the Thrift encoded field. Because only 4 bits are available for the id,
encoders usually will instead use deltas from the preceding field. If the delta will exceed 15,
then the `field_id` nibble will be set to `0`, and the `field_id` will instead be encoded as a
varint, following the `field_type`. Fields will generally be read in a loop, with the `field_id`
and `field_type` read first, and then the `field_id` used to determine which field to read.
When a `field_id` of `0` is encountered, this marks the end of the struct and processing ceases.
Here is an example of the processing loop:

```rust
    let mut last_field_id = 0i16;
    loop {
        // read the field id and field type. break if we encounter `Stop`
        let field_ident = prot.read_field_begin(last_field_id)?;
        if field_ident.field_type == FieldType::Stop {
            break;
        }
        // match on the field id
        match field_ident.id {
            1 => {
                let val = i32::read_thrift(&mut *prot)?;
                num_values = Some(val);
            }
            2 => {
                let val = Encoding::read_thrift(&mut *prot)?;
                encoding = Some(val);
            }
            3 => {
                let val = Encoding::read_thrift(&mut *prot)?;
                definition_level_encoding = Some(val);
            }
            4 => {
                let val = Encoding::read_thrift(&mut *prot)?;
                repetition_level_encoding = Some(val);
            }
            // Thrift structs are meant to be forward compatible, so do not error
            // here. Instead, simply skip unknown fields.
            _ => {
                prot.skip(field_ident.field_type)?;
            }
        };
        // set the last seen field id to calculate the next field_id
        last_field_id = field_ident.id;
    }
```

Thrift unions are encoded as structs, but only a single field will be encoded. The loop above
can be eliminated, and only the `match` on the id performed. A subsequent call to
`read_field_begin` must return `Stop`, or an error should be returned. Here's an example from
the decoding of the `LogicalType` union:

```rust
    // read the discriminant, error if it is `0`
    let field_ident = prot.read_field_begin(0)?;
    if field_ident.field_type == FieldType::Stop {
        return Err(general_err!("received empty union from remote LogicalType"));
    }
    let ret = match field_ident.id {
        1 => {
            prot.skip_empty_struct()?;
            Self::String
        }
        ...
        _ => {
            // LogicalType needs to be forward compatible, so we have defined an `_Unknown`
            // variant for it. This can return an error if forward compatibility is not desired.
            prot.skip(field_ident.field_type)?;
            Self::_Unknown {
                field_id: field_ident.id,
            }
        }
    };
    // test to ensure there is only one field present
    let field_ident = prot.read_field_begin(field_ident.id)?;
    if field_ident.field_type != FieldType::Stop {
        return Err(general_err!(
            "Received multiple fields for union from remote LogicalType"
        ));
    }
```

### WriteThrift Customization

On the serialization side, there are two traits to implement. The first, `WriteThrift`, is used
for actually serializing the object. The other, `WriteThriftField`, handles serializing objects
as struct fields.

Serializing enums is as simple as writing the discriminant as an `i32`. For example, here is the
custom serialization code for `ConvertedType`:

```rust
impl WriteThrift for ConvertedType {
    const ELEMENT_TYPE: ElementType = ElementType::I32;

    fn write_thrift<W: Write>(&self, writer: &mut ThriftCompactOutputProtocol<W>) -> Result<()> {
        // because we've added NONE, the variant values are off by 1, so correct that here
        writer.write_i32(*self as i32 - 1)
    }
}
```

Structs and unions are serialized by field. When performing the serialization, one needs to keep
track of the last field that has been written, as this is needed to calculate the delta in the
Thrift field header. For required fields this is not strictly necessary, but when writing
optional fields it is. A typical `write_thrift` implementation will look like:

```rust
    fn write_thrift<W: Write>(&self, writer: &mut ThriftCompactOutputProtocol<W>) -> Result<()> {
        // required field f1
        self.f1.write_thrift_field(writer, 1, 0)?; // field_id == 1, last_field_id == 0
        // required field f2
        self.f2.write_thrift_field(writer, 2, 1)?; // field_id == 2, last_field_id == 1
        // final required field f3, we now save the last_field_id, which is returned by write_thrift_field
        let mut last_field_id = self.f3.write_thrift_field(writer, 3, 2)?; // field_id == 3, last_field_id == 2

        // optional field f4
        if let Some(val) = self.f4.as_ref() {
            last_field_id = val.write_thrift_field(writer, 4, last_field_id)?;
        }
        // optional field f5
        if let Some(val) = self.f5.as_ref() {
            last_field_id = val.write_thrift_field(writer, 5, last_field_id)?;
        }
        // write end of struct
        writer.write_struct_end()
    }
```

In most instances, the `WriteThriftField` implementation can be handled by the `write_thrift_field`
macro. The first argument is the unqualified name of an object that implements `WriteThrift`, and
the second is the field type (which will be `FieldType::Struct` for Thrift structs and unions,
and `FieldType::I32` for Thrift enums).

```rust
write_thrift_field!(MyNewStruct, FieldType::Struct);
```

which expands to:

```rust
impl WriteThriftField for MyNewStruct {
    fn write_thrift_field<W: Write>(
        &self,
        writer: &mut ThriftCompactOutputProtocol<W>,
        field_id: i16,
        last_field_id: i16,
    ) -> Result<i16> {
        writer.write_field_begin(FieldType::Struct, field_id, last_field_id)?;
        self.write_thrift(writer)?;
        Ok(field_id)
    }
}
```

### Handling for lists

Lists of serialized objects can usually be read using `parquet_thrift::read_thrift_vec` and written
using the `WriteThrift::write_thrift` implementation for vectors of objects that implement
`WriteThrift`.

When reading a list, one first reads the list header which will provide the number of elements
that have been encoded, and then read elements one at a time.

```rust
    // read the list header
    let list_ident = prot.read_list_begin()?;
    // allocate vector with enough capacity
    let mut page_locations = Vec::with_capacity(list_ident.size as usize);
    // read elements
    for _ in 0..list_ident.size {
        page_locations.push(read_page_location(prot)?);
    }
```

Writing is simply the reverse: write the list header, and then serialize the elements:

```rust
    // write the list header
    writer.write_list_begin(ElementType::Struct, page_locations.len)?;
    // write the elements
    for i in 0..len {
        page_locations[i].write_thrift(writer)?;
    }
```

## More examples

For more examples, the easiest thing to do is to [expand](https://github.com/dtolnay/cargo-expand)
the thrift macros. For instance, to see the implementations generated in the `basic` module, type:

```sh
% cargo expand -p parquet --lib --all-features basic
```