Crate orcxx


Rust wrapper for the Apache ORC C++ library.

Currently, it only allows reading files, not writing.

ORC, short for Optimized Row Columnar, is a column-oriented data storage format. As such, most of the APIs in this library operate on columns, rather than rows. In order to work on rows, readers need to “zip” columns together.
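To make the idea of "zipping" columns concrete, here is a minimal sketch that uses only the standard library, not this crate's types. The `Row` structure, its field names, and `zip_columns` are hypothetical and exist only for illustration; `None` stands for a null value.

```rust
/// One row rebuilt from two columns. Hypothetical type, not part of orcxx.
#[derive(Debug, PartialEq)]
struct Row {
    id: i64,
    name: Option<String>,
}

/// Zips two equally long columns into a vector of rows.
/// Panics if the columns do not have the same length.
fn zip_columns(ids: &[i64], names: &[Option<&str>]) -> Vec<Row> {
    assert_eq!(ids.len(), names.len(), "columns must have the same length");
    ids.iter()
        .zip(names.iter())
        .map(|(&id, &name)| Row {
            id,
            // Copy the borrowed string into an owned one, keeping nulls as None.
            name: name.map(|s| s.to_owned()),
        })
        .collect()
}
```

Calling `zip_columns(&[1, 2], &[Some("hi"), None])` yields two rows, the second one with a null name.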

§Compiling

By default, this crate uses a submodule pointing to an Apache ORC release, builds its C++ part (including vendored protobuf, lz4, zstd, …), and links against it. If the ORC_USE_SYSTEM_LIBRARIES environment variable is set instead, you need to make sure the dependencies are installed (apt-get install libprotoc-dev liblz4-dev libsnappy-dev libzstd-dev zlib1g-dev on Debian-based distributions).

§Usage principles

The reader module contains the entry points to parse a file. It reads into an OwnedColumnVectorBatch structure, which can be .borrow()ed to get a BorrowedColumnVectorBatch, which implements most of the operations.

This structure is untyped, and needs to be cast into the correct type, by calling try_into_longs(), try_into_strings(), try_into_structs(), etc.
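The cast-then-use pattern can be sketched with a small stand-in model. Everything below is hypothetical (`UntypedColumn`, `CastError`, and the method bodies are not this crate's implementation); it only shows how a `try_into_*` call either hands back a typed view or reports a mismatch.

```rust
/// Hypothetical stand-in for an untyped column; not the crate's type.
#[derive(Debug)]
enum UntypedColumn {
    Longs(Vec<Option<i64>>),
    Strings(Vec<Option<Vec<u8>>>),
}

/// Hypothetical error returned when the requested type does not match.
#[derive(Debug, PartialEq)]
struct CastError(&'static str);

impl UntypedColumn {
    /// Returns a typed view of the column if it holds integers.
    fn try_into_longs(&self) -> Result<&[Option<i64>], CastError> {
        match self {
            UntypedColumn::Longs(values) => Ok(values.as_slice()),
            _ => Err(CastError("not a column of longs")),
        }
    }

    /// Returns a typed view of the column if it holds byte strings.
    fn try_into_strings(&self) -> Result<&[Option<Vec<u8>>], CastError> {
        match self {
            UntypedColumn::Strings(values) => Ok(values.as_slice()),
            _ => Err(CastError("not a column of strings")),
        }
    }
}
```

A caller that knows the schema picks the matching method; one that does not can try several in turn and skip the columns that fail to cast.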

While this works when parsing files whose structure is known, this is not very practical. The StructuredRowReader offers an abstraction over RowReader, which reads the schema of the file (through selected_kind()) and dynamically casts the vectors into the right type, recursively, in a ColumnTree.
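The recursive, schema-driven cast can be illustrated with a simplified model. The types below (`Kind`, `Untyped`, `Tree`) and `build_tree` are hypothetical stand-ins written for this sketch; they are not the crate's definitions and cover only three column types.

```rust
/// Hypothetical schema node, standing in for the type tree read from a file.
#[derive(Debug)]
enum Kind {
    Long,
    Str,
    Struct(Vec<(String, Kind)>),
}

/// Hypothetical untyped column whose content is only known at run time.
#[derive(Debug)]
enum Untyped {
    Longs(Vec<i64>),
    Strs(Vec<String>),
    Structs(Vec<Untyped>),
}

/// Typed tree produced by walking the schema and the columns together.
#[derive(Debug, PartialEq)]
enum Tree {
    Long(Vec<i64>),
    Str(Vec<String>),
    Struct(Vec<(String, Tree)>),
}

/// Casts `column` according to `kind`, recursing into struct fields.
fn build_tree(kind: &Kind, column: Untyped) -> Result<Tree, String> {
    match (kind, column) {
        (Kind::Long, Untyped::Longs(values)) => Ok(Tree::Long(values)),
        (Kind::Str, Untyped::Strs(values)) => Ok(Tree::Str(values)),
        (Kind::Struct(fields), Untyped::Structs(children)) => {
            if fields.len() != children.len() {
                return Err("field count mismatch".to_owned());
            }
            let mut subtrees = Vec::with_capacity(fields.len());
            for ((name, field_kind), child) in fields.iter().zip(children) {
                // Each field is cast with its own schema node.
                subtrees.push((name.clone(), build_tree(field_kind, child)?));
            }
            Ok(Tree::Struct(subtrees))
        }
        (kind, column) => Err(format!(
            "schema {:?} does not match column {:?}",
            kind, column
        )),
    }
}
```

The schema drives the traversal, so the caller never has to name a column type by hand; a mismatch at any depth surfaces as an error from the top-level call.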

For row-oriented access, see the orcxx_derive crate, which allows #[derive(OrcDeserialize)] on structures in order to deserialize ORC files into a structure instance for each row. These structures can be deserialized either directly into vector batches with deserialize::OrcDeserialize::read_from_vector_batch, or iterated through row_iterator::RowIterator.

§Panics

May panic when requesting vector batches whose size does not fit in an isize; this includes vector batches for variable-sized columns (maps and lists). This is unlikely to happen on 64-bit machines (they would run out of memory first).

row_iterator::RowIterator panics when underlying calls to deserialize::OrcDeserialize::read_from_vector_batch return an error, so you may want to avoid RowIterator when working with untrusted data.
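The difference between a panicking iterator and a fallible call can be sketched with the standard library alone. The functions below (`decode`, `rows_or_panic`, `rows_checked`) are hypothetical and do not use this crate's types; decoding UTF-8 merely stands in for a deserialization step that can fail.

```rust
/// Hypothetical fallible decoding step: fails on bytes that are not UTF-8.
fn decode(raw: &[u8]) -> Result<String, std::str::Utf8Error> {
    std::str::from_utf8(raw).map(|s| s.to_owned())
}

/// Panicking style: the first decoding error aborts iteration with a panic.
fn rows_or_panic<'a>(raw: &'a [Vec<u8>]) -> impl Iterator<Item = String> + 'a {
    raw.iter().map(|r| decode(r).expect("could not decode row"))
}

/// Fallible style: the first decoding error is returned to the caller.
fn rows_checked(raw: &[Vec<u8>]) -> Result<Vec<String>, std::str::Utf8Error> {
    raw.iter().map(|r| decode(r)).collect()
}
```

With trusted input both styles return the same rows. With untrusted input only the second lets the caller handle the failure instead of unwinding.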

Panics may also happen when the C++ library doesn’t behave as expected. C++ exceptions, however, should be converted to Rust Results.

§Examples

See the orcxx_derive documentation for more high-level examples and documentation.

use orcxx::reader;
use orcxx::vector::ColumnVectorBatch;

let input_stream = reader::InputStream::from_local_file("orc/examples/TestOrcFile.test1.orc")
    .expect("Could not open");

let reader = reader::Reader::new(input_stream).expect("Could not read");

println!("{:#?}", reader.kind()); // Prints the type of columns in the file

let mut row_reader = reader.row_reader(&reader::RowReaderOptions::default()).unwrap();
let mut batch = row_reader.row_batch(1024);

let mut total_elements = 0;
let mut all_strings: Vec<String> = Vec::new();
while row_reader.read_into(&mut batch) {
    total_elements += (&batch).num_elements();

    let struct_vector = batch.borrow().try_into_structs().unwrap();
    let vectors = struct_vector.fields();

    for vector in vectors {
        match vector.try_into_strings() {
            Ok(string_vector) => {
                for s in string_vector.iter() {
                    all_strings.push(
                        std::str::from_utf8(s.unwrap_or(b"<null>"))
                        .unwrap().to_owned())
                }
            }
            Err(_) => {} // Not a string column; skip it
        }
    }
}

assert_eq!(total_elements, 2);
assert_eq!(
    all_strings,
    vec!["\0\u{1}\u{2}\u{3}\u{4}", "", "hi", "bye"]
        .iter()
        .map(|s| s.to_owned())
        .collect::<Vec<_>>()
);

Modules§

  • Helpers for the orcxx_derive crate.
  • Contains structures to represent ORC types.
  • Low-level column-oriented parser for ORC files.
  • Iterator on ORC rows.
  • Dynamically-typed column-oriented parser for ORC files.
  • Containers for columns of values of uniform types.

Structs§