Crate sorting_parquet_writer

§Sorting Parquet Writer

A library for writing sorted Parquet files with bounded memory usage, inspired by Parquet-Go’s SortingWriter.

§Writers

  • writers::SortingParquetWriter — produces a globally sorted Parquet file using external merge sort. Data is buffered in memory, periodically sorted and spilled to temporary run files, then merged via streaming k-way merge at finalization.

  • writers::SortedGroupsParquetWriter — sorts individual row groups without guaranteeing global order. No temporary files needed.
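
The difference between the two writers can be illustrated with a minimal, std-only sketch of external merge sort. This is not the crate's implementation: `make_sorted_runs` and `merge_runs` are hypothetical names, plain `i32` values stand in for rows, and in-memory vectors stand in for the temporary run files.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Sort each buffer-sized chunk independently; every sorted chunk is one "run".
/// Assumption: a real external sort would spill each run to a temporary file.
fn make_sorted_runs(input: &[i32], max_buffered_rows: usize) -> Vec<Vec<i32>> {
    assert!(max_buffered_rows > 0, "buffer must hold at least one row");
    input
        .chunks(max_buffered_rows)
        .map(|chunk| {
            let mut run = chunk.to_vec();
            run.sort_unstable();
            run
        })
        .collect()
}

/// Streaming k-way merge: the min-heap holds at most one pending value per run,
/// so the merge state grows with the number of runs, not the number of rows.
fn merge_runs(runs: &[Vec<i32>]) -> Vec<i32> {
    let mut heap = BinaryHeap::new();
    for (run_idx, run) in runs.iter().enumerate() {
        if let Some(&first) = run.first() {
            heap.push(Reverse((first, run_idx, 0usize)));
        }
    }

    let mut merged = Vec::with_capacity(runs.iter().map(Vec::len).sum());
    while let Some(Reverse((value, run_idx, pos))) = heap.pop() {
        merged.push(value);
        // Refill the heap from the run that just yielded a value.
        if let Some(&next) = runs[run_idx].get(pos + 1) {
            heap.push(Reverse((next, run_idx, pos + 1)));
        }
    }
    merged
}
```

Stopping after `make_sorted_runs` roughly corresponds to the sorted-groups behaviour: each run is ordered internally, but concatenating the runs is not globally sorted. Only the additional `merge_runs` pass yields a global order.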

§Sorting utilities

  • sorting: single-batch sorting primitives built on top of the arrow_row row format.

  • record_batch: in-memory k-way merge of pre-sorted RecordBatches.

§Progress tracking

writers::SortingParquetWriter::finish_with_progress accepts any writers::FinishProgressHandler (including closures) for monitoring the merge phase.
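
One common way for a handler trait to accept closures is a blanket implementation over `FnMut`. The sketch below shows that pattern under stated assumptions: `ProgressHandler`, `on_progress`, and `merge_with_progress` are hypothetical stand-ins, and the actual method names and arguments of `FinishProgressHandler` are not shown on this page and may differ.

```rust
/// Hypothetical stand-in for a progress-handler trait (not the crate's trait).
trait ProgressHandler {
    fn on_progress(&mut self, rows_done: usize, rows_total: usize);
}

/// Blanket impl: any `FnMut(usize, usize)` closure is usable as a handler.
impl<F: FnMut(usize, usize)> ProgressHandler for F {
    fn on_progress(&mut self, rows_done: usize, rows_total: usize) {
        self(rows_done, rows_total)
    }
}

/// Hypothetical merge loop that reports progress after each batch of rows.
fn merge_with_progress<H: ProgressHandler>(rows_total: usize, batch_rows: usize, mut handler: H) {
    assert!(batch_rows > 0, "batch size must be positive");
    let mut rows_done = 0;
    while rows_done < rows_total {
        // Clamp so the final report never exceeds the total.
        rows_done = (rows_done + batch_rows).min(rows_total);
        handler.on_progress(rows_done, rows_total);
    }
}
```

With this shape, a caller can pass either a struct that implements the trait (for example, one that drives a progress bar) or an inline closure such as `|done: usize, total: usize| eprintln!("{done}/{total}")`.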

§Errors

All fallible APIs return SortingParquetError, which transparently wraps Arrow, Parquet, and std::io::Error values alongside a few crate-specific variants.
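
As a rough illustration of what "transparently wraps" usually means in Rust, the following std-only sketch forwards `Display` and `source()` to the wrapped error and relies on `From` so that `?` converts automatically. `SketchError`, its `InvalidSortColumn` variant, and the helper functions are hypothetical; the real variants of `SortingParquetError` are not listed on this page.

```rust
use std::error::Error;
use std::fmt;
use std::io;

/// Hypothetical error type, not the crate's `SortingParquetError`.
#[derive(Debug)]
enum SketchError {
    /// Wrapped I/O error, displayed exactly as the inner error.
    Io(io::Error),
    /// Hypothetical crate-specific variant.
    InvalidSortColumn(usize),
}

impl fmt::Display for SketchError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            // "Transparent": delegate to the inner error's own message.
            SketchError::Io(inner) => fmt::Display::fmt(inner, f),
            SketchError::InvalidSortColumn(idx) => {
                write!(f, "invalid sort column index {idx}")
            }
        }
    }
}

impl Error for SketchError {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        match self {
            SketchError::Io(inner) => inner.source(),
            SketchError::InvalidSortColumn(_) => None,
        }
    }
}

impl From<io::Error> for SketchError {
    fn from(err: io::Error) -> Self {
        SketchError::Io(err)
    }
}

/// Simulated I/O step that can fail without touching the file system.
fn fallible_io(fail: bool) -> io::Result<u32> {
    if fail {
        Err(io::Error::new(io::ErrorKind::Other, "disk full"))
    } else {
        Ok(7)
    }
}

/// `?` converts `io::Error` into `SketchError` through the `From` impl.
fn spill_run(fail: bool) -> Result<u32, SketchError> {
    let rows = fallible_io(fail)?;
    Ok(rows)
}
```

Callers then match on a single error type while still seeing the original message of the underlying failure.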

§Example

use sorting_parquet_writer::writers::SortingParquetWriter;
use parquet::file::properties::WriterProperties;
use parquet::file::metadata::SortingColumn;
use arrow::datatypes::{Schema, Field, DataType, SchemaRef};
use std::sync::Arc;

let schema: SchemaRef = Arc::new(Schema::new(vec![
    Field::new("id", DataType::Int32, false),
]));
let props = WriterProperties::builder()
    .set_sorting_columns(Some(vec![SortingColumn {
        column_idx: 0, descending: false, nulls_first: false,
    }]))
    .build();

let file = std::fs::File::create("output.parquet").unwrap();
let mut writer = SortingParquetWriter::try_new(file, schema, props).unwrap();
// writer.write(&batch)?;
// let file = writer.finish()?;

Modules§

record_batch
In-memory k-way merge of pre-sorted RecordBatches.
sorting
Single-batch sorting primitives built on top of the arrow_row row format.
writers
Parquet writers that produce sorted output.

Enums§

SortingParquetError
The unified error type produced by this crate.