The central type in Apache Arrow is the array: a known-length sequence of values that all
have the same type. This crate provides concrete implementations of each array type, as
well as an Array trait that can be used for type erasure.
§Building an Array
Most Array implementations can be constructed directly from iterators or a Vec:
Int32Array::from(vec![1, 2]);
Int32Array::from(vec![Some(1), None]);
Int32Array::from_iter([1, 2, 3, 4]);
Int32Array::from_iter([Some(1), Some(2), None, Some(4)]);
StringArray::from(vec!["foo", "bar"]);
StringArray::from(vec![Some("foo"), None]);
StringArray::from_iter([Some("foo"), None]);
StringArray::from_iter_values(["foo", "bar"]);
ListArray::from_iter_primitive::<Int32Type, _, _>([
    Some(vec![Some(1), None, Some(3)]),
    None,
    Some(vec![]),
]);
Additionally, ArrayBuilder implementations can be used to construct arrays with a
push-based interface:
// Create a new builder with a capacity of 100
let mut builder = Int16Array::builder(100);
// Append a single primitive value
builder.append_value(1);
// Append a null value
builder.append_null();
// Append a slice of primitive values
builder.append_slice(&[2, 3, 4]);
// Build the array
let array = builder.finish();
assert_eq!(5, array.len());
assert_eq!(2, array.value(2));
assert_eq!(&array.values()[3..5], &[3, 4])
§Low-level API
Internally, arrays consist of one or more shared memory regions, each backed by a Buffer,
the number and meaning of which depend on the array’s data type, as documented in
the Arrow specification.
For example, the type Int16Array represents an array of 16-bit integers and consists of:
- An optional NullBuffer identifying any null values
- A contiguous ScalarBuffer<i16> of values
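To make this layout concrete, here is a simplified, std-only model of a nullable 16-bit integer column. The names Int16Column, validity, get and null_count are hypothetical and exist only for illustration. The sketch also assumes a Vec<bool> for validity, whereas the real NullBuffer packs validity into a bitmap.

```rust
/// Simplified model of a nullable i16 column (hypothetical type, not the crate's).
struct Int16Column {
    /// One slot per element, including null positions.
    values: Vec<i16>,
    /// `None` means the column has no nulls; otherwise `true` marks a valid slot.
    validity: Option<Vec<bool>>,
}

impl Int16Column {
    /// Returns the value at `i`, or `None` if that slot is null.
    fn get(&self, i: usize) -> Option<i16> {
        let valid = self.validity.as_ref().map_or(true, |v| v[i]);
        valid.then(|| self.values[i])
    }

    /// Counts the null slots.
    fn null_count(&self) -> usize {
        self.validity
            .as_ref()
            .map_or(0, |v| v.iter().filter(|valid| !**valid).count())
    }
}
```

Note that a null position still occupies a slot in the values storage. Its contents are arbitrary and must be ignored.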
Similarly, the type StringArray represents an array of UTF-8 strings and consists of:
- An optional NullBuffer identifying any null values
- An offsets OffsetBuffer<i32> identifying valid UTF-8 sequences within the values buffer
- A values Buffer of UTF-8 encoded string data
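The relationship between the offsets and the values buffer can be sketched with plain slices. The helpers string_at and string_count below are hypothetical and use only std. They assume the offsets are monotonically increasing and that each range covers valid UTF-8.

```rust
/// Returns the `i`-th string of a simplified offsets + values layout.
/// Element `i` occupies `values[offsets[i]..offsets[i + 1]]`.
fn string_at<'a>(offsets: &[i32], values: &'a [u8], i: usize) -> &'a str {
    let start = offsets[i] as usize;
    let end = offsets[i + 1] as usize;
    // Panics on invalid UTF-8; the sketch assumes the data was validated when built.
    std::str::from_utf8(&values[start..end]).unwrap()
}

/// The number of elements is one less than the number of offsets.
fn string_count(offsets: &[i32]) -> usize {
    offsets.len().saturating_sub(1)
}
```

With offsets [0, 5, 10] over the bytes of "helloworld", element 0 is "hello" and element 1 is "world". No per-string allocation is needed.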
Array constructors such as PrimitiveArray::try_new make it cheap to construct an array
from these parts, and functions such as PrimitiveArray::into_parts provide the reverse
operation.
// Create an Int32Array from a Vec without copying
let array = Int32Array::new(vec![1, 2, 3].into(), None);
assert_eq!(array.values(), &[1, 2, 3]);
assert_eq!(array.null_count(), 0);
// Create a StringArray from parts
let offsets = OffsetBuffer::new(vec![0, 5, 10].into());
let array = StringArray::new(offsets, b"helloworld".into(), None);
let values: Vec<_> = array.iter().map(|x| x.unwrap()).collect();
assert_eq!(values, &["hello", "world"]);
As Buffer, and its derivatives, can be created from a Vec without copying, this provides
an efficient way not only to interoperate with other Rust code, but also to implement kernels
optimised for the Arrow data layout, e.g. by handling buffers instead of individual values.
§Zero-Copy Slicing
Given an Array of arbitrary length, it is possible to create an owned slice of this
data. Internally this just increments some ref-counts, so it is very cheap:
let array = Int32Array::from_iter([1, 2, 3]);
// Slice with offset 1 and length 2
let sliced = array.slice(1, 2);
assert_eq!(sliced.values(), &[2, 3]);
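Why slicing is cheap can be illustrated with a small std-only model: a slice is a new handle holding the same reference-counted allocation plus an offset and a length. SharedInts is a hypothetical type for illustration and is not how the crate implements its buffers.

```rust
use std::sync::Arc;

/// Simplified model of a sliceable integer array (hypothetical type).
struct SharedInts {
    data: Arc<[i32]>,
    offset: usize,
    len: usize,
}

impl SharedInts {
    fn new(values: Vec<i32>) -> Self {
        let len = values.len();
        Self { data: values.into(), offset: 0, len }
    }

    /// Returns a view of `len` elements starting at `offset`, sharing the allocation.
    fn slice(&self, offset: usize, len: usize) -> Self {
        assert!(offset + len <= self.len, "slice out of bounds");
        Self {
            data: Arc::clone(&self.data), // bumps the ref-count, copies no values
            offset: self.offset + offset,
            len,
        }
    }

    /// The elements visible through this handle.
    fn values(&self) -> &[i32] {
        &self.data[self.offset..self.offset + self.len]
    }
}
```

Slicing a slice composes the offsets, so the cost stays constant regardless of how many elements are covered.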
§Downcasting an Array
Arrays are often passed around as a dynamically typed &dyn Array or ArrayRef.
For example, RecordBatch stores columns as ArrayRef.
Whilst these arrays can be passed directly to the compute, csv, json, etc. APIs,
you will often want to interact with the concrete arrays directly.
This requires downcasting to the concrete type of the array:
// Downcast an `Array` to an `Int32Array` and compute the sum using native
// i32 values, treating nulls as 0. Panics if the array is not an `Int32Array`.
fn sum_int32(array: &dyn Array) -> i32 {
    let integers: &Int32Array = array.as_any().downcast_ref().unwrap();
    integers.iter().map(|val| val.unwrap_or_default()).sum()
}
// Downcast the array to a `Float32Array` and return a &[f32] view of the data.
// Panics if the array is not a `Float32Array`.
// Note: the values for positions corresponding to nulls will be arbitrary (but still valid f32)
fn as_f32_slice(array: &dyn Array) -> &[f32] {
    array.as_any().downcast_ref::<Float32Array>().unwrap().values()
}
The cast::AsArray extension trait can make this more ergonomic:
fn as_f32_slice(array: &dyn Array) -> &[f32] {
    array.as_primitive::<Float32Type>().values()
}
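The downcasts above panic when the array has a different concrete type. The mechanism underneath is std::any::Any, and a fallible variant is easy to write. The following std-only sketch uses a hypothetical Column trait with IntColumn and FloatColumn as stand-ins for Array and its implementations.

```rust
use std::any::Any;

/// Hypothetical stand-in for a type-erased array trait.
trait Column {
    fn as_any(&self) -> &dyn Any;
}

struct IntColumn(Vec<i32>);
struct FloatColumn(Vec<f32>);

impl Column for IntColumn {
    fn as_any(&self) -> &dyn Any {
        self
    }
}

impl Column for FloatColumn {
    fn as_any(&self) -> &dyn Any {
        self
    }
}

/// Sums the column if it holds integers; returns `None` instead of panicking otherwise.
fn try_sum_int(column: &dyn Column) -> Option<i32> {
    let ints = column.as_any().downcast_ref::<IntColumn>()?;
    Some(ints.0.iter().sum())
}
```

Returning an Option lets the caller decide how to handle an unexpected type. The AsArray trait offers Option-returning variants in the same spirit, such as as_primitive_opt.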
§Alternatives to ChunkedArray Support
The Rust implementation does not provide the ChunkedArray abstraction implemented by the Python and C++ Arrow implementations. The recommended alternative is to use one of the following:
- Vec<ArrayRef>: a simple, eager version of a ChunkedArray
- impl Iterator<Item=ArrayRef>: a lazy version of a ChunkedArray
- impl Stream<Item=ArrayRef>: a lazy async version of a ChunkedArray
Similar patterns can be applied at the RecordBatch level. For example, DataFusion makes
extensive use of RecordBatchStream.
This approach integrates well into the Rust ecosystem, simplifies the implementation and encourages the use of performant lazy and async patterns.
use std::sync::Arc;
use arrow_array::{ArrayRef, Float32Array, RecordBatch, StringArray};
use arrow_array::cast::AsArray;
use arrow_array::types::Float32Type;
use arrow_schema::DataType;
let batches = [
    RecordBatch::try_from_iter(vec![
        ("label", Arc::new(StringArray::from(vec!["A", "B", "C"])) as ArrayRef),
        ("value", Arc::new(Float32Array::from(vec![0.1, 0.2, 0.3])) as ArrayRef),
    ]).unwrap(),
    RecordBatch::try_from_iter(vec![
        ("label", Arc::new(StringArray::from(vec!["D", "E"])) as ArrayRef),
        ("value", Arc::new(Float32Array::from(vec![0.4, 0.5])) as ArrayRef),
    ]).unwrap(),
];
let labels: Vec<&str> = batches
    .iter()
    .flat_map(|batch| batch.column(0).as_string::<i32>())
    .map(Option::unwrap)
    .collect();
let values: Vec<f32> = batches
    .iter()
    .flat_map(|batch| batch.column(1).as_primitive::<Float32Type>().values())
    .copied()
    .collect();
assert_eq!(labels, ["A", "B", "C", "D", "E"]);
assert_eq!(values, [0.1, 0.2, 0.3, 0.4, 0.5]);
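The example above holds all batches in memory at once. The lazy iterator variant can be sketched with std alone, using a Vec<i32> as a stand-in for one array chunk. The helpers chunks and sum_chunks are hypothetical names for illustration.

```rust
/// Lazily yields `total` consecutive integers in chunks of at most `chunk_size`.
/// Each `Vec<i32>` stands in for one array chunk (hypothetical helper).
fn chunks(total: i32, chunk_size: i32) -> impl Iterator<Item = Vec<i32>> {
    assert!(chunk_size > 0, "chunk_size must be positive");
    (0..total)
        .step_by(chunk_size as usize)
        .map(move |start| (start..(start + chunk_size).min(total)).collect())
}

/// Consumes chunks one at a time, so only one chunk needs to be alive at once.
fn sum_chunks(chunks: impl Iterator<Item = Vec<i32>>) -> i64 {
    chunks
        .map(|chunk| chunk.iter().map(|&v| i64::from(v)).sum::<i64>())
        .sum()
}
```

Because the consumer only requires an Iterator, the same function works whether the chunks come from memory, a file reader, or a generator like the one above.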
Modules§
- array: The concrete array definitions
- builder: Defines push-based APIs for constructing arrays
- cast: Defines helper functions for downcasting dyn Array to concrete types
- iterator: Idiomatic iterators for Array
- run_iterator: Idiomatic iterator for RunArray
- temporal_conversions: Conversion methods for dates and times.
- timezone: Timezone for timestamp arrays
- types: Zero-sized types used to parameterize generic array implementations
Macros§
- create_array: Creates an array from a literal slice of values, suitable for rapid testing and development.
- downcast_dictionary_array: Downcast an Array to a DictionaryArray based on its DataType; accepts a number of subsequent patterns to match the data type
- downcast_integer: Given one or more expressions evaluating to an integer DataType, invokes the provided macro m with the corresponding integer ArrowPrimitiveType, followed by any additional arguments
- downcast_integer_array: Given one or more expressions evaluating to an integer PrimitiveArray, invokes the provided macro with the corresponding array, along with match statements for any non-integer array types
- downcast_primitive: Given one or more expressions evaluating to a primitive DataType, invokes the provided macro m with the corresponding ArrowPrimitiveType, followed by any additional arguments
- downcast_primitive_array: Downcast an Array to a PrimitiveArray based on its DataType; accepts a number of subsequent patterns to match the data type
- downcast_run_array: Downcast an Array to a RunArray based on its DataType; accepts a number of subsequent patterns to match the data type
- downcast_run_end_index: Given one or more expressions evaluating to an integer DataType, invokes the provided macro m with the corresponding integer RunEndIndexType, followed by any additional arguments
- downcast_temporal: Given one or more expressions evaluating to a primitive DataType, invokes the provided macro m with the corresponding ArrowPrimitiveType, followed by any additional arguments
- downcast_temporal_array: Downcast an Array to a temporal PrimitiveArray based on its DataType; accepts a number of subsequent patterns to match the data type
- record_batch: Creates a record batch from a literal slice of values, suitable for rapid testing and development.
Structs§
- BooleanArray: An array of boolean values
- DictionaryArray: An array of dictionary encoded values
- FixedSizeBinaryArray: An array of fixed size binary arrays
- FixedSizeListArray: An array of fixed length lists, similar to JSON arrays (e.g. ["A", "B"]).
- GenericByteArray: An array of variable length byte arrays
- GenericByteViewArray: Variable-size Binary View Layout: An array of variable length byte views.
- GenericListArray: An array of variable length lists, similar to JSON arrays (e.g. ["A", "B", "C"]). This struct specifically represents the list layout. Refer to GenericListViewArray for the list-view layout.
- GenericListViewArray: An array of variable length lists, specifically in the list-view layout.
- MapArray: An array of key-value maps
- NativeAdapter: An optional primitive value
- NullArray: An array of null values
- PrimitiveArray: An array of primitive values, of type ArrowPrimitiveType
- RecordBatch: A two-dimensional batch of column-oriented data with a defined schema.
- RecordBatchIterator: Generic implementation of RecordBatchReader that wraps an iterator.
- RecordBatchOptions: Options that control the behaviour used when creating a RecordBatch.
- RunArray: An array of run-end encoded values
- Scalar: A wrapper around a single value Array that implements Datum and indicates compute kernels should treat this array as a scalar value (a single value).
- StructArray: An array of structs
- TypedDictionaryArray: A DictionaryArray typed on its child values array
- TypedRunArray: A RunArray typed on its child values array
- UnionArray: An array of values of varying types
Traits§
- AnyDictionaryArray: A DictionaryArray with the key type erased
- Array: An array in the arrow columnar format
- ArrayAccessor: A generic trait for accessing the values of an Array
- ArrowNativeTypeOp: Trait for ArrowNativeType that adds checked and unchecked arithmetic operations, and totally ordered comparison operations
- ArrowNumericType: A subtype of primitive type that represents numeric values.
- ArrowPrimitiveType: Trait for primitive values.
- BinaryArrayType: A trait for Arrow binary arrays; three types are currently supported.
- Datum: A possibly Scalar Array
- OffsetSizeTrait: A type that can be used within a variable-size array to encode offset information
- RecordBatchReader: Trait for types that can read RecordBatches.
- RecordBatchWriter: Trait for types that can write RecordBatches.
- StringArrayType: A trait for Arrow string arrays; three types are currently supported.
Functions§
- make_array: Constructs an array using the input data. Returns a reference-counted Array instance.
- new_empty_array: Creates a new empty array
- new_null_array: Creates a new array of type data_type and length length, filled entirely with NULL values
Type Aliases§
- ArrayRef: A reference-counted reference to a generic Array
- BinaryArray: A GenericBinaryArray of [u8] using i32 offsets
- BinaryViewArray: A GenericByteViewArray of [u8]
- Date32Array: A PrimitiveArray of days since UNIX epoch stored as i32
- Date64Array: A PrimitiveArray of milliseconds since UNIX epoch stored as i64
- Decimal128Array: A PrimitiveArray of 128-bit fixed point decimals
- Decimal256Array: A PrimitiveArray of 256-bit fixed point decimals
- DurationMicrosecondArray: A PrimitiveArray of elapsed durations in microseconds
- DurationMillisecondArray: A PrimitiveArray of elapsed durations in milliseconds
- DurationNanosecondArray: A PrimitiveArray of elapsed durations in nanoseconds
- DurationSecondArray: A PrimitiveArray of elapsed durations in seconds
- Float16Array: A PrimitiveArray of f16
- Float32Array: A PrimitiveArray of f32
- Float64Array: A PrimitiveArray of f64
- GenericBinaryArray: A GenericByteArray for storing [u8]
- GenericStringArray: A GenericByteArray for storing str
- Int8Array: A PrimitiveArray of i8
- Int8DictionaryArray: A DictionaryArray indexed by i8
- Int16Array: A PrimitiveArray of i16
- Int16DictionaryArray: A DictionaryArray indexed by i16
- Int16RunArray: A RunArray with i16 run ends
- Int32Array: A PrimitiveArray of i32
- Int32DictionaryArray: A DictionaryArray indexed by i32
- Int32RunArray: A RunArray with i32 run ends
- Int64Array: A PrimitiveArray of i64
- Int64DictionaryArray: A DictionaryArray indexed by i64
- Int64RunArray: A RunArray with i64 run ends
- IntervalDayTimeArray: A PrimitiveArray of “calendar” intervals in days and milliseconds
- IntervalMonthDayNanoArray: A PrimitiveArray of “calendar” intervals in months, days, and nanoseconds.
- IntervalYearMonthArray: A PrimitiveArray of “calendar” intervals in whole months
- LargeBinaryArray: A GenericBinaryArray of [u8] using i64 offsets
- LargeListArray: A GenericListArray of variable size lists, storing offsets as i64.
- LargeListViewArray: A GenericListViewArray of variable size lists, storing offsets as i64.
- LargeStringArray: A GenericStringArray of str using i64 offsets
- ListArray: A GenericListArray of variable size lists, storing offsets as i32.
- ListViewArray: A GenericListViewArray of variable size lists, storing offsets as i32.
- StringArray: A GenericStringArray of str using i32 offsets
- StringViewArray: A GenericByteViewArray that stores UTF-8 data
- Time32MillisecondArray: A PrimitiveArray of milliseconds since midnight stored as i32
- Time32SecondArray: A PrimitiveArray of seconds since midnight stored as i32
- Time64MicrosecondArray: A PrimitiveArray of microseconds since midnight stored as i64
- Time64NanosecondArray: A PrimitiveArray of nanoseconds since midnight stored as i64
- TimestampMicrosecondArray: A PrimitiveArray of microseconds since UNIX epoch stored as i64
- TimestampMillisecondArray: A PrimitiveArray of milliseconds since UNIX epoch stored as i64
- TimestampNanosecondArray: A PrimitiveArray of nanoseconds since UNIX epoch stored as i64
- TimestampSecondArray: A PrimitiveArray of seconds since UNIX epoch stored as i64
- UInt8Array: A PrimitiveArray of u8
- UInt8DictionaryArray: A DictionaryArray indexed by u8
- UInt16Array: A PrimitiveArray of u16
- UInt16DictionaryArray: A DictionaryArray indexed by u16
- UInt32Array: A PrimitiveArray of u32
- UInt32DictionaryArray: A DictionaryArray indexed by u32
- UInt64Array: A PrimitiveArray of u64
- UInt64DictionaryArray: A DictionaryArray indexed by u64