Provides an API for reading/writing Arrow RecordBatches and Arrays to/from Parquet files.

Apache Arrow is a cross-language development platform for in-memory data.

Example of writing an Arrow RecordBatch to a Parquet file:

use arrow::array::{ArrayRef, Int32Array};
use arrow::record_batch::RecordBatch;
use parquet::arrow::arrow_writer::ArrowWriter;
use parquet::file::properties::WriterProperties;
use std::fs::File;
use std::sync::Arc;

let ids = Int32Array::from(vec![1, 2, 3, 4]);
let vals = Int32Array::from(vec![5, 6, 7, 8]);
let batch = RecordBatch::try_from_iter(vec![
    ("id", Arc::new(ids) as ArrayRef),
    ("val", Arc::new(vals) as ArrayRef),
]).unwrap();

let file = File::create("data.parquet").unwrap();

// Default writer properties
let props = WriterProperties::builder().build();

let mut writer = ArrowWriter::try_new(file, batch.schema(), Some(props)).unwrap();

writer.write(&batch).expect("Writing batch");

// writer must be closed to write footer
writer.close().unwrap();

WriterProperties can be used to set Parquet file options:

use parquet::basic::{Compression, Encoding};
use parquet::file::properties::{WriterProperties, WriterVersion};

let props = WriterProperties::builder()
    // File compression
    .set_compression(Compression::SNAPPY)
    // Default encoding for all columns
    .set_encoding(Encoding::PLAIN)
    // Parquet format version
    .set_writer_version(WriterVersion::PARQUET_2_0)
    .build();
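
Settings can also be overridden for individual columns. A minimal sketch, reusing the "id" and "val" column names from the writing example above; column-level calls take a ColumnPath and override the file-level defaults:

use parquet::basic::{Compression, Encoding};
use parquet::file::properties::WriterProperties;
use parquet::schema::types::ColumnPath;

let props = WriterProperties::builder()
    // File-level default
    .set_compression(Compression::SNAPPY)
    // Per-column overrides
    .set_column_encoding(ColumnPath::from("id"), Encoding::DELTA_BINARY_PACKED)
    .set_column_compression(ColumnPath::from("val"), Compression::UNCOMPRESSED)
    .build();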

Example of reading a Parquet file into an Arrow RecordBatch:

use arrow::record_batch::RecordBatchReader;
use parquet::arrow::{ArrowReader, ParquetFileArrowReader, ProjectionMask};
use std::fs::File;

let file = File::open("data.parquet").unwrap();

let mut arrow_reader = ParquetFileArrowReader::try_new(file).unwrap();
let mask = ProjectionMask::leaves(arrow_reader.parquet_schema(), [0]);

println!("Converted arrow schema is: {}", arrow_reader.get_schema().unwrap());
println!("Arrow schema after projection is: {}",
    arrow_reader.get_schema_by_columns(mask.clone()).unwrap());

let unprojected = arrow_reader.get_record_reader(2048).unwrap();
println!("Unprojected reader schema: {}", unprojected.schema());

let record_batch_reader = arrow_reader.get_record_reader_by_columns(mask, 2048).unwrap();

// The reader yields Ok(RecordBatch) until the file is exhausted
for maybe_record_batch in record_batch_reader {
    let record_batch = maybe_record_batch.unwrap();
    println!("Read {} records.", record_batch.num_rows());
}

Re-exports

pub use self::arrow_reader::ArrowReader; (deprecated)
pub use self::arrow_reader::ParquetFileArrowReader; (deprecated)
pub use self::arrow_writer::ArrowWriter;
pub use self::async_reader::ParquetRecordBatchStreamBuilder;

Modules

arrow_reader: Contains reader which reads parquet data into arrow RecordBatch

arrow_writer: Contains writer which writes arrow data into parquet data

async_reader: Provides async API for reading parquet files as RecordBatches
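
As an example of the async API, batches can be streamed without blocking. A minimal sketch, assuming the crate is built with its async feature and that tokio and futures are available:

use futures::TryStreamExt;
use parquet::arrow::ParquetRecordBatchStreamBuilder;

#[tokio::main]
async fn main() {
    let file = tokio::fs::File::open("data.parquet").await.unwrap();

    // Reads the file footer asynchronously before any row data
    let builder = ParquetRecordBatchStreamBuilder::new(file).await.unwrap();
    let stream = builder.with_batch_size(2048).build().unwrap();

    // Each stream item is a Result<RecordBatch>
    let batches: Vec<_> = stream.try_collect().await.unwrap();
    println!("Read {} batches", batches.len());
}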

Structs

ProjectionMask: Identifies a set of columns within a potentially nested schema to project
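
A mask can be built from root columns or from leaf columns; the two differ once nested types are involved. A minimal sketch, reusing the data.parquet file from the examples above:

use parquet::arrow::ProjectionMask;
use parquet::file::reader::{FileReader, SerializedFileReader};
use std::fs::File;

let file = File::open("data.parquet").unwrap();
let reader = SerializedFileReader::new(file).unwrap();
let schema_descr = reader.metadata().file_metadata().schema_descr();

// For flat schemas the two masks are equivalent; they diverge once
// structs, lists, or maps introduce multiple leaves per root column
let by_root = ProjectionMask::roots(schema_descr, [0]);
let by_leaf = ProjectionMask::leaves(schema_descr, [0]);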

Constants

ARROW_SCHEMA_META_KEY: Schema metadata key used to store serialized Arrow IPC schema
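
A minimal sketch of checking whether a file carries this key, reusing the data.parquet file from the examples above:

use parquet::arrow::ARROW_SCHEMA_META_KEY;
use parquet::file::reader::{FileReader, SerializedFileReader};
use std::fs::File;

let file = File::open("data.parquet").unwrap();
let reader = SerializedFileReader::new(file).unwrap();

// An embedded Arrow IPC schema, if any, lives in the key-value metadata
if let Some(kv) = reader.metadata().file_metadata().key_value_metadata() {
    if kv.iter().any(|entry| entry.key == ARROW_SCHEMA_META_KEY) {
        println!("File carries an embedded Arrow schema");
    }
}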

Functions

arrow_to_parquet_schema: Convert arrow schema to parquet schema

parquet_to_arrow_schema: Convert Parquet schema to Arrow schema including optional metadata. Attempts to decode any existing Arrow schema metadata, falling back to converting the Parquet schema column-wise

parquet_to_arrow_schema_by_columns: Convert parquet schema to arrow schema including optional metadata, only preserving some leaf columns
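
A minimal sketch of a schema round trip through these functions; the exact argument types assumed here (a &Schema, and a &SchemaDescriptor with optional key-value metadata) are based on this crate version:

use arrow::datatypes::{DataType, Field, Schema};
use parquet::arrow::{arrow_to_parquet_schema, parquet_to_arrow_schema};

let arrow_schema = Schema::new(vec![
    Field::new("id", DataType::Int32, false),
    Field::new("val", DataType::Utf8, true),
]);

// Arrow -> Parquet: yields a SchemaDescriptor over the leaf columns
let parquet_schema = arrow_to_parquet_schema(&arrow_schema).unwrap();

// Parquet -> Arrow: with no embedded metadata the conversion is column-wise
let roundtrip = parquet_to_arrow_schema(&parquet_schema, None).unwrap();
println!("Round-tripped schema: {}", roundtrip);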