sorting_parquet_writer/lib.rs
//! # Sorting Parquet Writer
//!
//! A library for writing sorted Parquet files with bounded memory usage,
//! inspired by [Parquet-Go's SortingWriter](https://pkg.go.dev/github.com/parquet-go/parquet-go#SortingWriter).
//!
//! ## Writers
//!
//! - [`writers::SortingParquetWriter`]: produces a **globally sorted** Parquet file
//!   using external merge sort. Data is buffered in memory, periodically sorted and
//!   spilled to temporary run files, then merged via streaming k-way merge at finalization.
//!
//! - [`writers::SortedGroupsParquetWriter`]: sorts **individual row groups** without
//!   guaranteeing global order. No temporary files needed.
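//!
//! The k-way merge step described above can be illustrated with a minimal,
//! standalone sketch. It is not this crate's implementation, which merges Arrow
//! record batches streamed from run files. The helper name `merge_sorted_runs`
//! and the use of in-memory `Vec<i32>` runs are assumptions made for illustration only.
//!
//! ```rust
//! use std::cmp::Reverse;
//! use std::collections::BinaryHeap;
//!
//! /// Hypothetical helper: merges already-sorted runs into one sorted vector.
//! fn merge_sorted_runs(runs: Vec<Vec<i32>>) -> Vec<i32> {
//!     let total: usize = runs.iter().map(|run| run.len()).sum();
//!     let mut merged = Vec::with_capacity(total);
//!     // One cursor per run; each advances as that run's values are consumed.
//!     let mut cursors: Vec<_> = runs.into_iter().map(|run| run.into_iter()).collect();
//!     // Min-heap (via `Reverse`) holding the current head of every non-empty run.
//!     let mut heap = BinaryHeap::new();
//!     for (run_idx, cursor) in cursors.iter_mut().enumerate() {
//!         if let Some(value) = cursor.next() {
//!             heap.push(Reverse((value, run_idx)));
//!         }
//!     }
//!     // Pop the smallest head, emit it, and refill from the run it came from.
//!     while let Some(Reverse((value, run_idx))) = heap.pop() {
//!         merged.push(value);
//!         if let Some(next) = cursors[run_idx].next() {
//!             heap.push(Reverse((next, run_idx)));
//!         }
//!     }
//!     merged
//! }
//! ```
//!
//! The heap never holds more than one entry per run, so the merge state in this
//! sketch grows with the number of runs rather than with the total number of rows.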
//!
//! ## Sorting utilities
//!
//! - [`sorting::sort_record_batch`]: sorts a single [`RecordBatch`](arrow::array::RecordBatch)
//!   by the given sorting columns.
//! - [`record_batch::merge_sorted_batches`]: k-way merge of pre-sorted batches into one.
//!
//! ## Progress tracking
//!
//! [`writers::SortingParquetWriter::finish_with_progress`] accepts any
//! [`writers::FinishProgressHandler`] (including closures) for monitoring the merge phase.
//!
//! ## Errors
//!
//! All fallible APIs return [`SortingParquetError`], which transparently
//! wraps Arrow, Parquet, and `std::io` errors alongside a few crate-specific
//! variants.
//!
//! ## Example
//!
//! ```rust,no_run
//! use sorting_parquet_writer::writers::SortingParquetWriter;
//! use parquet::file::properties::WriterProperties;
//! use parquet::file::metadata::SortingColumn;
//! use arrow::datatypes::{Schema, Field, DataType, SchemaRef};
//! use std::sync::Arc;
//!
//! let schema: SchemaRef = Arc::new(Schema::new(vec![
//!     Field::new("id", DataType::Int32, false),
//! ]));
//! let props = WriterProperties::builder()
//!     .set_sorting_columns(Some(vec![SortingColumn {
//!         column_idx: 0, descending: false, nulls_first: false,
//!     }]))
//!     .build();
//!
//! let file = std::fs::File::create("output.parquet").unwrap();
//! let mut writer = SortingParquetWriter::try_new(file, schema, props).unwrap();
//! // writer.write(&batch)?;
//! // let file = writer.finish()?;
//! ```

mod error;
pub use error::*;
pub mod record_batch;
pub mod sorting;
#[cfg(test)]
pub mod test;
mod utils;
pub mod writers;