Sorting Parquet Writer
A Rust library for writing sorted Parquet files with bounded memory usage. Inspired by Parquet-Go's SortingWriter.
Features
- Globally sorted output via external merge sort (SortingParquetWriter)
- Per-row-group sorting for lighter-weight optimization (SortedGroupsParquetWriter)
- Bounded memory: configurable row buffer with automatic spill to temporary run files
- Streaming k-way merge: final merge reads one batch per run file at a time
- Progress tracking: callback-based progress reporting during the merge phase
- Supports int, uint, float, bool, string, and list column types
Quick Start
use ;
use parquet::file::properties::WriterProperties;
use SortingColumn;
use ;
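use std::sync::Arc;
let schema: SchemaRef = Arc::new(/* ... */);
let props = WriterProperties::builder()
    .set_sorting_columns(/* ... */)
    .build();
let file = std::fs::File::create(/* ... */).unwrap();
let mut writer = SortingParquetWriter::try_new(/* ... */).unwrap();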
// Write batches in any order — they will be sorted automatically
// writer.write(&batch)?;
// Finalize: merges all sorted runs into the output file
// let file = writer.finish()?;
Writers
SortingParquetWriter
Produces a globally sorted Parquet file using external merge sort:
- Write phase: buffers incoming RecordBatches in memory. When the configured FlushThreshold is reached (row count, byte size, or either), the buffer is sorted and flushed to a temporary run file on disk.
- Merge phase (finish()): all sorted run files are merged via a streaming k-way merge into the final output.
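The two phases above can be illustrated with a minimal, standard-library-only sketch. It is not this crate's implementation: `RunBuffer` is a hypothetical name, rows are plain `i64` values instead of `RecordBatch`es, and the sorted runs are kept in memory where the real writer would spill to temporary files.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Hypothetical stand-in for a sorting writer: buffers rows and spills a
/// sorted run whenever the buffer reaches `max_rows`. Runs stay in memory
/// here; a real writer would put them in temporary files on disk.
struct RunBuffer {
    max_rows: usize,
    buffer: Vec<i64>,
    runs: Vec<Vec<i64>>,
}

impl RunBuffer {
    fn new(max_rows: usize) -> Self {
        Self { max_rows, buffer: Vec::new(), runs: Vec::new() }
    }

    /// Write phase: accept rows in any order, spilling when the buffer is full.
    fn write(&mut self, rows: &[i64]) {
        for &row in rows {
            self.buffer.push(row);
            if self.buffer.len() >= self.max_rows {
                self.spill();
            }
        }
    }

    /// Sort the buffered rows and move them into a new run.
    fn spill(&mut self) {
        if self.buffer.is_empty() {
            return;
        }
        self.buffer.sort_unstable();
        self.runs.push(std::mem::take(&mut self.buffer));
    }

    /// Merge phase: k-way merge of all runs using a min-heap that holds
    /// only the current head of each run.
    fn finish(mut self) -> Vec<i64> {
        self.spill();
        let mut cursors: Vec<std::vec::IntoIter<i64>> =
            self.runs.into_iter().map(|run| run.into_iter()).collect();
        let mut heap = BinaryHeap::new();
        for (idx, cursor) in cursors.iter_mut().enumerate() {
            if let Some(value) = cursor.next() {
                heap.push(Reverse((value, idx)));
            }
        }
        let mut out = Vec::new();
        while let Some(Reverse((value, idx))) = heap.pop() {
            out.push(value);
            // Refill from the run that just yielded the smallest value.
            if let Some(next) = cursors[idx].next() {
                heap.push(Reverse((next, idx)));
            }
        }
        out
    }
}
```

Because the heap holds one entry per run, memory use during the merge grows with the number of runs rather than with the total number of rows.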
Configure via SortingWriterOptions:
use ;
let options = SortingWriterOptions ;
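As a rough illustration of a flush threshold that can trigger on row count, byte size, or either, the sketch below models the decision with a small enum. `ThresholdSketch` and its variant names are assumptions made for this example and are not the crate's actual `FlushThreshold` definition.

```rust
/// Hypothetical model of a flush threshold; the variant names are
/// assumptions, not the crate's API.
#[derive(Debug, Clone, Copy)]
enum ThresholdSketch {
    /// Flush once this many rows are buffered.
    Rows(usize),
    /// Flush once the buffer holds at least this many bytes.
    Bytes(usize),
    /// Flush as soon as either limit is reached.
    RowsOrBytes { rows: usize, bytes: usize },
}

impl ThresholdSketch {
    /// Returns true when the buffer should be sorted and spilled.
    fn should_flush(&self, buffered_rows: usize, buffered_bytes: usize) -> bool {
        match *self {
            ThresholdSketch::Rows(max) => buffered_rows >= max,
            ThresholdSketch::Bytes(max) => buffered_bytes >= max,
            ThresholdSketch::RowsOrBytes { rows, bytes } => {
                buffered_rows >= rows || buffered_bytes >= bytes
            }
        }
    }
}
```

A byte-based limit tends to track memory use more closely when row widths vary, while a row-based limit is easier to reason about for fixed-width data.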
Progress Tracking
Use finish_with_progress to monitor the merge phase:
use FinishProgress;
#
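The callback pattern can be sketched without the crate itself. In the example below, `ProgressSketch` and `merge_with_progress` are hypothetical names, and the fields reported are assumptions rather than the contents of the real `FinishProgress` type.

```rust
/// Hypothetical progress snapshot; field names are assumptions.
#[derive(Debug, Clone, Copy, PartialEq)]
struct ProgressSketch {
    rows_merged: usize,
    total_rows: usize,
}

/// Merges already-sorted runs and reports progress every `report_every`
/// rows, plus once more when the merge completes.
fn merge_with_progress<F>(runs: &[Vec<i64>], report_every: usize, mut on_progress: F) -> Vec<i64>
where
    F: FnMut(ProgressSketch),
{
    let total_rows: usize = runs.iter().map(Vec::len).sum();
    let mut positions = vec![0usize; runs.len()];
    let mut out = Vec::with_capacity(total_rows);
    while out.len() < total_rows {
        // Linear scan for the run with the smallest head; a heap would be
        // the better choice when there are many runs.
        let mut best: Option<usize> = None;
        for (idx, run) in runs.iter().enumerate() {
            if positions[idx] < run.len() {
                let better = match best {
                    None => true,
                    Some(b) => run[positions[idx]] < runs[b][positions[b]],
                };
                if better {
                    best = Some(idx);
                }
            }
        }
        let idx = best.expect("a non-exhausted run exists while rows remain");
        out.push(runs[idx][positions[idx]]);
        positions[idx] += 1;
        if out.len() % report_every.max(1) == 0 || out.len() == total_rows {
            on_progress(ProgressSketch { rows_merged: out.len(), total_rows });
        }
    }
    out
}
```

Taking the callback as `FnMut` lets the caller update a progress bar or counter captured from the surrounding scope.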
SortedGroupsParquetWriter
Sorts individual row groups without guaranteeing global sort order. Lighter weight than SortingParquetWriter — no temporary files needed. Useful when queries primarily filter within row groups.
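The difference from a global sort can be seen in a small sketch that sorts fixed-size chunks independently. The function names are hypothetical and plain `i64` rows stand in for Arrow data; this is an illustration of the idea, not the crate's code.

```rust
/// Sorts each chunk of `group_size` rows on its own, the way a
/// per-row-group sorter would, without merging across chunks.
fn sort_within_groups(rows: &[i64], group_size: usize) -> Vec<Vec<i64>> {
    rows.chunks(group_size.max(1))
        .map(|chunk| {
            let mut group = chunk.to_vec();
            group.sort_unstable();
            group
        })
        .collect()
}

/// Returns true when `values` is in non-decreasing order.
fn is_ascending(values: &[i64]) -> bool {
    values.windows(2).all(|pair| pair[0] <= pair[1])
}
```

For the input `[5, 1, 4, 3, 9, 2, 7]` with a group size of 3, every group comes out sorted, but the concatenation `[1, 4, 5, 2, 3, 9, 7]` is not, which is the trade-off this writer makes in exchange for skipping the spill and merge steps.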
Examples
sort-parquet — Sort a Parquet file
# With custom memory limit
sort-checker — Verify sort order
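Checking sort order amounts to comparing each row with its predecessor. The sketch below is an assumption about how such a check could look, not the sort-checker tool's actual code. It works on a single nullable `i64` column and honors the `descending` and `nulls_first` flags that Parquet's `SortingColumn` metadata carries; `compare` and `first_violation` are hypothetical names.

```rust
use std::cmp::Ordering;

/// Compares two nullable values under the given sort direction and
/// null placement.
fn compare(a: Option<i64>, b: Option<i64>, descending: bool, nulls_first: bool) -> Ordering {
    match (a, b) {
        (None, None) => Ordering::Equal,
        (None, Some(_)) => {
            if nulls_first { Ordering::Less } else { Ordering::Greater }
        }
        (Some(_), None) => {
            if nulls_first { Ordering::Greater } else { Ordering::Less }
        }
        (Some(x), Some(y)) => {
            if descending { y.cmp(&x) } else { x.cmp(&y) }
        }
    }
}

/// Returns the index of the first row that is out of order relative to
/// its predecessor, or `None` if the column is sorted.
fn first_violation(values: &[Option<i64>], descending: bool, nulls_first: bool) -> Option<usize> {
    values
        .windows(2)
        .position(|pair| compare(pair[0], pair[1], descending, nulls_first) == Ordering::Greater)
        .map(|i| i + 1)
}
```

Reporting the index of the first violation, rather than a plain boolean, makes it easier to locate the offending row in a large file.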
Limitations
- Only supports int, uint, float, bool, string, and list types. Other Arrow types will produce an error during the merge process.
License
Apache-2.0 OR MIT