Read and write Vortex layouts, a serialization of Vortex arrays.
A layout is a serialized array which is stored in some linear and contiguous block of memory. Layouts are recursively defined in terms of one of three kinds:
- The FlatLayout. A contiguously serialized array of buffers, with a specific in-memory Alignment.
- The StructLayout. Each column of a StructArray is sequentially laid out at known offsets. This permits reading a subset of columns in time linear in the number of kept columns.
- The ChunkedLayout. Each chunk of a ChunkedArray is sequentially laid out at known offsets. This permits reading a subset of rows in time linear in the number of kept rows.
A layout, alone, is not a standalone Vortex file because layouts are not self-describing. They contain neither a description of the kind of layout (e.g. flat, column of flat, chunked of column of flat) nor a data type (DType).
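The recursive definition above can be illustrated with a toy model. This is a minimal sketch under stated assumptions: `ToyLayout`, `kind`, and `ranges` are hypothetical names invented for illustration and are not the crate's real types or methods. It shows how a nested layout can be described ("chunked of column of flat") and why known offsets let a reader visit only the kept columns.

```rust
/// Hypothetical, simplified model of the three layout kinds.
/// These are NOT the crate's real types; they only mirror the prose above.
#[derive(Debug)]
enum ToyLayout {
    /// A contiguous byte range, given as (offset, length).
    Flat { offset: u64, len: u64 },
    /// Named columns, each with its own child layout.
    Struct(Vec<(String, ToyLayout)>),
    /// Row chunks, each with its own child layout.
    Chunked(Vec<ToyLayout>),
}

impl ToyLayout {
    /// Describes the nesting by following the first child at each level.
    fn kind(&self) -> String {
        match self {
            ToyLayout::Flat { .. } => "flat".to_string(),
            ToyLayout::Struct(cols) => match cols.first() {
                Some((_, child)) => format!("column of {}", child.kind()),
                None => "column".to_string(),
            },
            ToyLayout::Chunked(chunks) => match chunks.first() {
                Some(child) => format!("chunked of {}", child.kind()),
                None => "chunked".to_string(),
            },
        }
    }

    /// Collects the (offset, length) ranges needed for a column projection.
    /// Columns not named in `keep` are skipped without being visited, so the
    /// work is proportional to the number of kept columns.
    /// Simplification: `keep` is applied at every struct level.
    fn ranges(&self, keep: &[&str], out: &mut Vec<(u64, u64)>) {
        match self {
            ToyLayout::Flat { offset, len } => out.push((*offset, *len)),
            ToyLayout::Struct(cols) => {
                for (name, child) in cols {
                    if keep.contains(&name.as_str()) {
                        child.ranges(keep, out);
                    }
                }
            }
            ToyLayout::Chunked(chunks) => {
                for child in chunks {
                    child.ranges(keep, out);
                }
            }
        }
    }
}
```

In this sketch the description string and the data type live outside the layout's bytes, which matches the point that a layout alone is not self-describing.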
§Reading
Reading is implemented by VortexFile. It is “opened” by VortexOpenOptions, which can be provided with information about the file’s structure to save on IO before the actual data read. Once the file is open and has done the initial IO work to understand its own structure, it can be turned into a stream by calling VortexFile::scan with a Scan, which defines filtering and projection on the file.
The file manages IO-oriented work and CPU-oriented work on two different underlying runtimes, which are configurable and pluggable with multiple provided implementations (Tokio, Rayon, etc.).
It also caches buffers between stages of the scan, saving on duplicate IO. The cache can also be reused between scans of the same file (see SegmentCache).
§File Format
Succinctly, the file format specification is as follows:
1. Data is written first, in a form that is describable by a Layout (typically Array IPC Messages).
   a. To allow for more efficient IO & pruning, our writer implementation first writes the “data” arrays, and then writes the “metadata” arrays (i.e., per-column statistics).
2. We write what is collectively referred to as the “Footer”, which contains:
   a. An optional Schema, which if present is a valid flatbuffer representing a message::Schema.
   b. The Layout, which is a valid footer::Layout flatbuffer, and describes the physical byte ranges & relationships amongst those byte ranges that we wrote in part 1.
   c. The Postscript, which is a valid footer::Postscript flatbuffer, containing the absolute start offsets of the Schema & Layout flatbuffers within the file.
   d. The End-of-File marker, which is 8 bytes, and contains the u16 version, u16 postscript length, and 4 magic bytes.
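Because the End-of-File marker has a fixed size, a reader can start from the last 8 bytes and work backwards. The following is a minimal sketch of that first step, not the crate's reader. It assumes the three fields appear in the order listed (version, postscript length, magic), that the integers are little-endian, that the magic bytes are `VTXF`, and that the Postscript sits immediately before the marker. All four are assumptions; the crate's `EOF_SIZE`, `MAGIC_BYTES`, and `VERSION` constants are authoritative. `Eof`, `parse_eof`, and `postscript_range` are hypothetical names.

```rust
/// Size of the End-of-File marker, as stated in the format description.
const EOF_SIZE: usize = 8;
/// ASSUMPTION: illustrative magic value; check the crate's MAGIC_BYTES.
const ASSUMED_MAGIC: [u8; 4] = *b"VTXF";

/// Hypothetical decoded form of the 8-byte marker.
#[derive(Debug, PartialEq)]
struct Eof {
    version: u16,
    postscript_len: u16,
}

/// Decodes the trailing 8 bytes of `file`.
/// ASSUMPTION: layout is [version: u16 LE][postscript_len: u16 LE][magic: 4 bytes].
fn parse_eof(file: &[u8]) -> Result<Eof, String> {
    if file.len() < EOF_SIZE {
        return Err(format!("file too short: {} bytes", file.len()));
    }
    let tail = &file[file.len() - EOF_SIZE..];
    if &tail[4..8] != &ASSUMED_MAGIC[..] {
        return Err("magic bytes do not match".to_string());
    }
    Ok(Eof {
        version: u16::from_le_bytes([tail[0], tail[1]]),
        postscript_len: u16::from_le_bytes([tail[2], tail[3]]),
    })
}

/// Byte range of the Postscript, assuming it ends where the marker begins.
/// Returns `None` if the declared length does not fit in the file.
fn postscript_range(file_len: usize, eof: &Eof) -> Option<std::ops::Range<usize>> {
    let end = file_len.checked_sub(EOF_SIZE)?;
    let start = end.checked_sub(eof.postscript_len as usize)?;
    Some(start..end)
}
```

From the Postscript a reader would then obtain the absolute offsets of the Schema and Layout flatbuffers; decoding those flatbuffers is out of scope for this sketch.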
§Reified File Format
┌────────────────────────────┐
│                            │
│            Data            │
│    (Array IPC Messages)    │
│                            │
├────────────────────────────┤
│                            │
│   Per-Column Statistics    │
│                            │
├────────────────────────────┤
│                            │
│     Schema Flatbuffer      │
│                            │
├────────────────────────────┤
│                            │
│     Layout Flatbuffer      │
│                            │
├────────────────────────────┤
│                            │
│   Postscript Flatbuffer    │
│ (Schema & Layout Offsets)  │
│                            │
├────────────────────────────┤
│     8-byte End of File     │
│(Version, Postscript Length,│
│        Magic Bytes)        │
└────────────────────────────┘
A Parquet-style file format is realized by using a chunked layout containing column layouts containing chunked layouts containing flat layouts. The outer chunked layout represents row groups. The inner chunked layout represents pages.
All the chunks of a chunked layout and all the columns of a column layout need not use the same layout.
Anything implementing VortexReadAt, for example local files, byte buffers, and cloud storage, can be used as the “linear and contiguous memory”.
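The idea of abstracting over the storage can be sketched with a simplified positional-read trait. This is an illustration only: `ToyReadAt` and `read_tail` are hypothetical, the trait here is synchronous, and the real VortexReadAt trait may differ in its methods, signatures, and asynchrony. The sketch shows that code which reads the end of a file (for example, to find the footer) needs only "size" and "read these bytes at this offset", so any backend providing those two operations would do.

```rust
use std::io;

/// Hypothetical, simplified stand-in for a positional-read abstraction.
/// The crate's real VortexReadAt trait is NOT reproduced here.
trait ToyReadAt {
    /// Total number of addressable bytes.
    fn size(&self) -> u64;
    /// Reads exactly `len` bytes starting at `offset`.
    fn read_at(&self, offset: u64, len: usize) -> io::Result<Vec<u8>>;
}

/// An in-memory byte buffer is the simplest "linear and contiguous memory".
impl ToyReadAt for Vec<u8> {
    fn size(&self) -> u64 {
        self.len() as u64
    }

    fn read_at(&self, offset: u64, len: usize) -> io::Result<Vec<u8>> {
        let total = self.len() as u64;
        // Reject reads that overflow or run past the end of the buffer.
        let end = offset
            .checked_add(len as u64)
            .filter(|&e| e <= total)
            .ok_or_else(|| io::Error::new(io::ErrorKind::UnexpectedEof, "read past end"))?;
        Ok(self[offset as usize..end as usize].to_vec())
    }
}

/// Reads the last `n` bytes of any source, e.g. to locate a footer.
fn read_tail<R: ToyReadAt>(source: &R, n: usize) -> io::Result<Vec<u8>> {
    let offset = source.size().checked_sub(n as u64).ok_or_else(|| {
        io::Error::new(io::ErrorKind::UnexpectedEof, "source shorter than requested tail")
    })?;
    source.read_at(offset, n)
}
```

A local file or an object-store client could implement the same two operations, and `read_tail` would work unchanged.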
§Apache Arrow
If you ultimately seek Arrow arrays, VortexRecordBatchReader converts an open Vortex file into a RecordBatchReader.
Modules§
Structs§
- FileLayout - Captures the layout information of a Vortex file.
- Scan
- Segment - The location of a segment within a Vortex file.
- VortexFile - A Vortex file ready for reading.
- VortexOpenOptions - Open options for a Vortex file reader.
- VortexWriteOptions
Enums§
- ExecutionMode - The ExecutionMode describes how the CPU-bound layout evaluation tasks are executed. Typically, there is one task per file split (row-group).
- SplitBy - Defines how the Vortex file is split into batches for reading.
Constants§
- CHUNKED_LAYOUT_ID - The layout ID for a chunked layout
- COLUMNAR_LAYOUT_ID - The layout ID for a column layout
- EOF_SIZE - The size of the EOF marker in bytes
- FLAT_LAYOUT_ID - The layout ID for a flat layout
- MAGIC_BYTES - The magic bytes for a Vortex file
- MAX_FOOTER_SIZE - The maximum length of a Vortex footer in bytes
- V1_FOOTER_FBS_SIZE - The size of the footer in bytes in Vortex version 1
- VERSION - The current version of the Vortex file format
- VORTEX_FILE_EXTENSION - The extension for Vortex files