Crate arrow

A native Rust implementation of Apache Arrow, a cross-language development platform for in-memory data.

DataType

Every Array in this crate has an associated DataType that specifies how its data is laid out in memory and represented. A central enum of this crate is therefore DataType, which contains the set of valid types in the specification, for example DataType::Utf8.

Array

The central trait of this crate is the dynamically typed Array, which represents a fixed-size, immutable, Send + Sync array of nullable elements. An example of such an array is UInt32Array. One way to think about an arrow Array is as an Arc<[Option<T>; len]>, where T can be anything ranging from an integer to a string, or even another Array.
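The mental model above can be written down with the standard library alone. This is only a conceptual sketch: ModelArray, model_array and is_null are hypothetical names, and the crate itself does not store one Option per element (it keeps values and a validity bitmap separately).

```rust
use std::sync::Arc;

/// Conceptual stand-in for an array of nullable u32 values.
/// Hypothetical type for illustration; not part of the arrow crate.
type ModelArray = Arc<[Option<u32>]>;

/// Moves the values into a shared, immutable, fixed-size slice.
fn model_array(values: Vec<Option<u32>>) -> ModelArray {
    Arc::from(values)
}

/// Mirrors the idea of is_null(index): a slot is null when it holds None.
fn is_null(array: &ModelArray, index: usize) -> bool {
    array[index].is_none()
}
```

Cloning a ModelArray with Arc::clone copies only a pointer, which is why immutable, reference-counted arrays are cheap to share across threads.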

Every Array exposes len() and data_type(), and the nullability of each of its elements can be obtained via is_null(index). A concrete implementation such as UInt32Array additionally offers typed access to its values through value(index):

use arrow::array::{Array, UInt32Array};
let array = UInt32Array::from(vec![Some(1), None, Some(3)]);
assert_eq!(array.len(), 3);
assert_eq!(array.value(0), 1);
assert_eq!(array.is_null(1), true);

To make the array dynamically typed, we wrap it in an Arc:

use std::sync::Arc;
use arrow::datatypes::DataType;
use arrow::array::{UInt32Array, ArrayRef};
let array: ArrayRef = Arc::new(array);
assert_eq!(array.len(), 3);
// array.value() is not available in the dynamically-typed version
assert_eq!(array.is_null(1), true);
assert_eq!(array.data_type(), &DataType::UInt32);

To downcast back to the concrete type, use as_any():

let array = array.as_any().downcast_ref::<UInt32Array>().unwrap();
assert_eq!(array.value(0), 1);

Memory and Buffers

Data in an Array is stored in ArrayData, which in turn is a collection of other ArrayData and Buffers. Buffer is the central struct that array implementations use to keep allocated memory and pointers. MutableBuffer is the mutable counterpart of Buffer. These are the lowest-level abstractions of this crate, and are used throughout it to efficiently allocate, write, read and deallocate memory.

Field, Schema and RecordBatch

Field is a struct that contains an array’s metadata (datatype and whether its values can be null), and a name. Schema is a vector of fields with optional metadata. Together, they form the basis of a schematic representation of a group of Arrays.

In fact, RecordBatch is a struct with a Schema and a vector of Arrays, all with the same len. A record batch is the highest-level struct that this crate currently offers and is broadly used to represent a table where each column is an Array.

Compute

This crate offers many operations (called kernels) that operate on Arrays, which you can find in compute::kernels. It has both vertical and horizontal operations, and some of them have a SIMD implementation.

Status

This crate implements most of the Arrow specification. Specifically, it supports the following types:

This crate also implements many common vertical operations:

as well as some horizontal operations, such as

Finally, this crate implements some readers and writers to different formats:

The parquet implementation is in a separate crate.

Modules

alloc

Defines memory-related functions, such as allocating, deallocating and reallocating memory regions, as well as cache and allocation alignments.

array

The central type in Apache Arrow is the array, represented by the Array trait. An array represents a known-length sequence of values all having the same type.

bitmap

Defines a bitmap, which is used to track which values in an Arrow array are null. This is called a “validity bitmap” in the Arrow documentation.

buffer

This module contains two main structs: Buffer and MutableBuffer. A buffer represents a contiguous memory region that can be shared via offsets.

compute

Computation kernels on Arrow Arrays

csv

Transfer data between the Arrow memory format and CSV (comma-separated values).

datatypes

Defines the logical data types of Arrow arrays.

error

Defines ArrowError for representing failures in various Arrow operations.

ffi

Contains declarations to bind to the C Data Interface.

ipc

json

Transfer data between the Arrow memory format and JSON line-delimited records. See the module level documentation for the reader and writer for usage examples.

record_batch

A two-dimensional batch of column-oriented data with a defined schema.

temporal_conversions

Conversion methods for dates and times.

tensor

Arrow Tensor Type, defined in format/Tensor.fbs.

util