Crate arrow_array

Source
Expand description

The central type in Apache Arrow are arrays, which are a known-length sequence of values all having the same type. This crate provides concrete implementations of each type, as well as an Array trait that can be used for type-erasure.

§Building an Array

Most Array implementations can be constructed directly from iterators or Vec

Int32Array::from(vec![1, 2]);
Int32Array::from(vec![Some(1), None]);
Int32Array::from_iter([1, 2, 3, 4]);
Int32Array::from_iter([Some(1), Some(2), None, Some(4)]);

StringArray::from(vec!["foo", "bar"]);
StringArray::from(vec![Some("foo"), None]);
StringArray::from_iter([Some("foo"), None]);
StringArray::from_iter_values(["foo", "bar"]);

ListArray::from_iter_primitive::<Int32Type, _, _>([
    Some(vec![Some(1), None, Some(3)]),
    None,
    Some(vec![])
]);

Additionally ArrayBuilder implementations can be used to construct arrays with a push-based interface

// Create a new builder with a capacity of 100
let mut builder = Int16Array::builder(100);

// Append a single primitive value
builder.append_value(1);
// Append a null value
builder.append_null();
// Append a slice of primitive values
builder.append_slice(&[2, 3, 4]);

// Build the array
let array = builder.finish();

assert_eq!(5, array.len());
assert_eq!(2, array.value(2));
assert_eq!(&array.values()[3..5], &[3, 4])

§Low-level API

Internally, arrays consist of one or more shared memory regions backed by a Buffer, the number and meaning of which depend on the array’s data type, as documented in the Arrow specification.

For example, the type Int16Array represents an array of 16-bit integers and consists of:

Similarly, the type StringArray represents an array of UTF-8 strings and consists of:

  • An optional NullBuffer identifying any null values
  • An offsets OffsetBuffer<i32> identifying valid UTF-8 sequences within the values buffer
  • A values Buffer of UTF-8 encoded string data

Array constructors such as PrimitiveArray::try_new provide the ability to cheaply construct an array from these parts, with functions such as PrimitiveArray::into_parts providing the reverse operation.

// Create a Int32Array from Vec without copying
let array = Int32Array::new(vec![1, 2, 3].into(), None);
assert_eq!(array.values(), &[1, 2, 3]);
assert_eq!(array.null_count(), 0);

// Create a StringArray from parts
let offsets = OffsetBuffer::new(vec![0, 5, 10].into());
let array = StringArray::new(offsets, b"helloworld".into(), None);
let values: Vec<_> = array.iter().map(|x| x.unwrap()).collect();
assert_eq!(values, &["hello", "world"]);

As Buffer, and its derivatives, can be created from Vec without copying, this provides an efficient way to not only interoperate with other Rust code, but also implement kernels optimised for the arrow data layout - e.g. by handling buffers instead of values.

§Zero-Copy Slicing

Given an Array of arbitrary length, it is possible to create an owned slice of this data. Internally this just increments some ref-counts, and so is incredibly cheap

let array = Int32Array::from_iter([1, 2, 3]);

// Slice with offset 1 and length 2
let sliced = array.slice(1, 2);
assert_eq!(sliced.values(), &[2, 3]);

§Downcasting an Array

Arrays are often passed around as a dynamically typed &dyn Array or ArrayRef. For example, RecordBatch stores columns as ArrayRef.

Whilst these arrays can be passed directly to the compute, csv, json, etc… APIs, it is often the case that you wish to interact with the concrete arrays directly.

This requires downcasting to the concrete type of the array:


// Safely downcast an `Array` to an `Int32Array` and compute the sum
// using native i32 values
fn sum_int32(array: &dyn Array) -> i32 {
    let integers: &Int32Array = array.as_any().downcast_ref().unwrap();
    integers.iter().map(|val| val.unwrap_or_default()).sum()
}

// Safely downcasts the array to a `Float32Array` and returns a &[f32] view of the data
// Note: the values for positions corresponding to nulls will be arbitrary (but still valid f32)
fn as_f32_slice(array: &dyn Array) -> &[f32] {
    array.as_any().downcast_ref::<Float32Array>().unwrap().values()
}

The cast::AsArray extension trait can make this more ergonomic


fn as_f32_slice(array: &dyn Array) -> &[f32] {
    array.as_primitive::<Float32Type>().values()
}

§Alternatives to ChunkedArray Support

The Rust implementation does not provide the ChunkedArray abstraction implemented by the Python and C++ Arrow implementations. The recommended alternative is to use one of the following:

  • Vec<ArrayRef> a simple, eager version of a ChunkedArray
  • impl Iterator<Item=ArrayRef> a lazy version of a ChunkedArray
  • impl Stream<Item=ArrayRef> a lazy async version of a ChunkedArray

Similar patterns can be applied at the RecordBatch level. For example, DataFusion makes extensive use of RecordBatchStream.

This approach integrates well into the Rust ecosystem, simplifies the implementation and encourages the use of performant lazy and async patterns.

use std::sync::Arc;
use arrow_array::{ArrayRef, Float32Array, RecordBatch, StringArray};
use arrow_array::cast::AsArray;
use arrow_array::types::Float32Type;
use arrow_schema::DataType;

let batches = [
   RecordBatch::try_from_iter(vec![
        ("label", Arc::new(StringArray::from(vec!["A", "B", "C"])) as ArrayRef),
        ("value", Arc::new(Float32Array::from(vec![0.1, 0.2, 0.3])) as ArrayRef),
    ]).unwrap(),
   RecordBatch::try_from_iter(vec![
        ("label", Arc::new(StringArray::from(vec!["D", "E"])) as ArrayRef),
        ("value", Arc::new(Float32Array::from(vec![0.4, 0.5])) as ArrayRef),
   ]).unwrap(),
];

let labels: Vec<&str> = batches
   .iter()
   .flat_map(|batch| batch.column(0).as_string::<i32>())
   .map(Option::unwrap)
   .collect();

let values: Vec<f32> = batches
   .iter()
   .flat_map(|batch| batch.column(1).as_primitive::<Float32Type>().values())
   .copied()
   .collect();

assert_eq!(labels, ["A", "B", "C", "D", "E"]);
assert_eq!(values, [0.1, 0.2, 0.3, 0.4, 0.5]);

Modules§

array
The concrete array definitions
builder
Defines push-based APIs for constructing arrays
cast
Defines helper functions for downcasting dyn Array to concrete types
iterator
Idiomatic iterators for Array
run_iterator
Idiomatic iterator for RunArray
temporal_conversions
Conversion methods for dates and times.
timezone
Timezone for timestamp arrays
types
Zero-sized types used to parameterize generic array implementations

Macros§

create_array
Creates an array from a literal slice of values, suitable for rapid testing and development.
downcast_dictionary_array
Downcast an Array to a DictionaryArray based on its DataType, accepts a number of subsequent patterns to match the data type
downcast_integer
Given one or more expressions evaluating to an integer DataType invokes the provided macro m with the corresponding integer ArrowPrimitiveType, followed by any additional arguments
downcast_integer_array
Given one or more expressions evaluating to an integer PrimitiveArray invokes the provided macro with the corresponding array, along with match statements for any non integer array types
downcast_primitive
Given one or more expressions evaluating to primitive DataType invokes the provided macro m with the corresponding ArrowPrimitiveType, followed by any additional arguments
downcast_primitive_array
Downcast an Array to a PrimitiveArray based on its DataType accepts a number of subsequent patterns to match the data type
downcast_run_array
Downcast an Array to a RunArray based on its DataType, accepts a number of subsequent patterns to match the data type
downcast_run_end_index
Given one or more expressions evaluating to an integer DataType invokes the provided macro m with the corresponding integer RunEndIndexType, followed by any additional arguments
downcast_temporal
Given one or more expressions evaluating to primitive DataType invokes the provided macro m with the corresponding ArrowPrimitiveType, followed by any additional arguments
downcast_temporal_array
Downcast an Array to a temporal PrimitiveArray based on its DataType accepts a number of subsequent patterns to match the data type
record_batch
Creates a record batch from literal slice of values, suitable for rapid testing and development.

Structs§

BooleanArray
An array of boolean values
DictionaryArray
An array of dictionary encoded values
FixedSizeBinaryArray
An array of fixed size binary arrays
FixedSizeListArray
An array of [fixed length lists], similar to JSON arrays (e.g. ["A", "B"]).
GenericByteArray
An array of variable length byte arrays
GenericByteViewArray
Variable-size Binary View Layout: An array of variable length bytes views.
GenericListArray
An array of variable length lists, similar to JSON arrays (e.g. ["A", "B", "C"]). This struct specifically represents the list layout. Refer to GenericListViewArray for the list-view layout.
GenericListViewArray
An array of variable length lists, specifically in the list-view layout.
MapArray
An array of key-value maps
NativeAdapter
An optional primitive value
NullArray
An array of null values
PrimitiveArray
An array of primitive values, of type ArrowPrimitiveType
RecordBatch
A two-dimensional batch of column-oriented data with a defined schema.
RecordBatchIterator
Generic implementation of RecordBatchReader that wraps an iterator.
RecordBatchOptions
Options that control the behaviour used when creating a RecordBatch.
RunArray
An array of run-end encoded values
Scalar
A wrapper around a single value Array that implements Datum and indicates compute kernels should treat this array as a scalar value (a single value).
StructArray
An array of structs
TypedDictionaryArray
A DictionaryArray typed on its child values array
TypedRunArray
A RunArray typed typed on its child values array
UnionArray
An array of values of varying types

Traits§

AnyDictionaryArray
A DictionaryArray with the key type erased
Array
An array in the arrow columnar format
ArrayAccessor
A generic trait for accessing the values of an Array
ArrowNativeTypeOp
Trait for ArrowNativeType that adds checked and unchecked arithmetic operations, and totally ordered comparison operations
ArrowNumericType
A subtype of primitive type that represents numeric values.
ArrowPrimitiveType
Trait for primitive values.
BinaryArrayType
A trait for Arrow String Arrays, currently three types are supported:
Datum
A possibly Scalar Array
OffsetSizeTrait
A type that can be used within a variable-size array to encode offset information
RecordBatchReader
Trait for types that can read RecordBatch’s.
RecordBatchWriter
Trait for types that can write RecordBatch’s.
StringArrayType
A trait for Arrow String Arrays, currently three types are supported:

Functions§

make_array
Constructs an array using the input data. Returns a reference-counted Array instance.
new_empty_array
Creates a new empty array
new_null_array
Creates a new array of data_type of length length filled entirely of NULL values

Type Aliases§

ArrayRef
A reference-counted reference to a generic Array
BinaryArray
A GenericBinaryArray of [u8] using i32 offsets
BinaryViewArray
A GenericByteViewArray of [u8]
Date32Array
A PrimitiveArray of days since UNIX epoch stored as i32
Date64Array
A PrimitiveArray of milliseconds since UNIX epoch stored as i64
Decimal128Array
A PrimitiveArray of 128-bit fixed point decimals
Decimal256Array
A PrimitiveArray of 256-bit fixed point decimals
DurationMicrosecondArray
A PrimitiveArray of elapsed durations in microseconds
DurationMillisecondArray
A PrimitiveArray of elapsed durations in milliseconds
DurationNanosecondArray
A PrimitiveArray of elapsed durations in nanoseconds
DurationSecondArray
A PrimitiveArray of elapsed durations in seconds
Float16Array
A PrimitiveArray of f16
Float32Array
A PrimitiveArray of f32
Float64Array
A PrimitiveArray of f64
GenericBinaryArray
A GenericBinaryArray for storing [u8]
GenericStringArray
A GenericByteArray for storing str
Int8Array
A PrimitiveArray of i8
Int8DictionaryArray
A DictionaryArray indexed by i8
Int16Array
A PrimitiveArray of i16
Int16DictionaryArray
A DictionaryArray indexed by i16
Int16RunArray
A RunArray with i16 run ends
Int32Array
A PrimitiveArray of i32
Int32DictionaryArray
A DictionaryArray indexed by i32
Int32RunArray
A RunArray with i32 run ends
Int64Array
A PrimitiveArray of i64
Int64DictionaryArray
A DictionaryArray indexed by i64
Int64RunArray
A RunArray with i64 run ends
IntervalDayTimeArray
A PrimitiveArray of “calendar” intervals in days and milliseconds
IntervalMonthDayNanoArray
A PrimitiveArray of “calendar” intervals in months, days, and nanoseconds.
IntervalYearMonthArray
A PrimitiveArray of “calendar” intervals in whole months
LargeBinaryArray
A GenericBinaryArray of [u8] using i64 offsets
LargeListArray
A GenericListArray of variable size lists, storing offsets as i64.
LargeListViewArray
A GenericListViewArray of variable size lists, storing offsets as i64.
LargeStringArray
A GenericStringArray of str using i64 offsets
ListArray
A GenericListArray of variable size lists, storing offsets as i32.
ListViewArray
A GenericListViewArray of variable size lists, storing offsets as i32.
StringArray
A GenericStringArray of str using i32 offsets
StringViewArray
A GenericByteViewArray that stores utf8 data
Time32MillisecondArray
A PrimitiveArray of milliseconds since midnight stored as i32
Time32SecondArray
A PrimitiveArray of seconds since midnight stored as i32
Time64MicrosecondArray
A PrimitiveArray of microseconds since midnight stored as i64
Time64NanosecondArray
A PrimitiveArray of nanoseconds since midnight stored as i64
TimestampMicrosecondArray
A PrimitiveArray of microseconds since UNIX epoch stored as i64
TimestampMillisecondArray
A PrimitiveArray of milliseconds since UNIX epoch stored as i64
TimestampNanosecondArray
A PrimitiveArray of nanoseconds since UNIX epoch stored as i64
TimestampSecondArray
A PrimitiveArray of seconds since UNIX epoch stored as i64
UInt8Array
A PrimitiveArray of u8
UInt8DictionaryArray
A DictionaryArray indexed by u8
UInt16Array
A PrimitiveArray of u16
UInt16DictionaryArray
A DictionaryArray indexed by u16
UInt32Array
A PrimitiveArray of u32
UInt32DictionaryArray
A DictionaryArray indexed by u32
UInt64Array
A PrimitiveArray of u64
UInt64DictionaryArray
A DictionaryArray indexed by u64