A complete, safe, native Rust implementation of Apache Arrow, a cross-language development platform for in-memory data.
Please see the arrow crates.io page for feature flags and tips to improve performance.
Crate Topology
The arrow project is implemented as multiple sub-crates, which are then re-exported by this top-level crate.
Crate authors can choose to depend on this top-level crate, or just the sub-crates they need.
The current list of sub-crates is:
- arrow-arith - arithmetic kernels
- arrow-array - type-safe arrow array abstractions
- arrow-buffer - buffer abstractions for arrow arrays
- arrow-cast - cast kernels for arrow arrays
- arrow-csv - read/write CSV to arrow format
- arrow-data - the underlying data of arrow arrays
- arrow-ipc - read/write IPC to arrow format
- arrow-json - read/write JSON to arrow format
- arrow-ord - ordering kernels for arrow arrays
- arrow-row - comparable row format
- arrow-schema - the logical types for arrow arrays
- arrow-select - selection kernels for arrow arrays
- arrow-string - string kernels for arrow arrays
This list is likely to grow as further functionality is split out from the top-level crate.
Some functionality is also distributed independently of this crate:
- arrow-flight - support for Arrow Flight RPC
- arrow-integration-test - support for Arrow JSON Test Format
- parquet - support for Apache Parquet
Columnar Format
The array module provides statically typed implementations of all the array types as defined by the Arrow Columnar Format.
For example, an Int32Array represents a nullable array of i32:
use arrow::array::{Array, Int32Array};

let array = Int32Array::from(vec![Some(1), None, Some(3)]);
assert_eq!(array.len(), 3);
assert_eq!(array.value(0), 1);
assert!(array.is_null(1));
let collected: Vec<_> = array.iter().collect();
assert_eq!(collected, vec![Some(1), None, Some(3)]);
// The null slot still occupies a position in the values buffer (0 here).
assert_eq!(array.values(), [1, 0, 3]);
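The last assertion above shows that a null slot still takes up a position in the values buffer. As a rough illustration of why, the following std-only sketch models a nullable array as a dense vector of values plus a validity bitmap. `NullableI32` and its methods are hypothetical names invented for this sketch and are not part of the arrow API. The bit layout (one bit per slot, least-significant bit first, 1 meaning valid) follows the Arrow Columnar Format.

```rust
/// Hypothetical, simplified model of a nullable i32 array (not the arrow type).
struct NullableI32 {
    /// One value per slot; null slots hold a placeholder (0 in this sketch).
    values: Vec<i32>,
    /// One bit per slot, least-significant bit first; 1 means "valid".
    validity: Vec<u8>,
}

impl NullableI32 {
    fn from_options(items: &[Option<i32>]) -> Self {
        let mut values = Vec::with_capacity(items.len());
        let mut validity = vec![0u8; (items.len() + 7) / 8];
        for (i, item) in items.iter().enumerate() {
            if item.is_some() {
                // Mark slot `i` as valid.
                validity[i / 8] |= 1u8 << (i % 8);
            }
            values.push(item.unwrap_or_default());
        }
        Self { values, validity }
    }

    fn is_null(&self, i: usize) -> bool {
        (self.validity[i / 8] & (1u8 << (i % 8))) == 0
    }

    fn get(&self, i: usize) -> Option<i32> {
        if self.is_null(i) {
            None
        } else {
            Some(self.values[i])
        }
    }
}

fn main() {
    let array = NullableI32::from_options(&[Some(1), None, Some(3)]);
    assert!(array.is_null(1));
    assert_eq!(array.get(0), Some(1));
    assert_eq!(array.values, vec![1, 0, 3]);
    assert_eq!(array.validity, vec![0b0000_0101u8]);
}
```

Keeping the values dense means slot `i` is always at position `i`, at the cost of storing a placeholder for each null.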
It is also possible to write generic code. For example, the following is generic over all primitively typed arrays:
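use std::iter::Sum;
use arrow::array::{Float32Array, PrimitiveArray, TimestampNanosecondArray};
use arrow::datatypes::ArrowPrimitiveType;

// Sums all values, treating nulls as the native type's default (zero).
fn sum<T>(array: &PrimitiveArray<T>) -> T::Native
where
    T: ArrowPrimitiveType,
    T::Native: Sum,
{
    array.iter().map(|v| v.unwrap_or_default()).sum()
}

assert_eq!(sum(&Float32Array::from(vec![1.1, 2.9, 3.])), 7.);
assert_eq!(sum(&TimestampNanosecondArray::from(vec![1, 2, 3])), 6);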
And the following is generic over all arrays with comparable values:
use arrow::array::{ArrayAccessor, ArrayIter, Int32Array, StringArray};

fn min<T: ArrayAccessor>(array: T) -> Option<T::Item>
where
    T::Item: Ord,
{
    // Skip nulls, then take the smallest remaining value.
    ArrayIter::new(array).flatten().min()
}

assert_eq!(min(&Int32Array::from(vec![4, 2, 1, 6])), Some(1));
assert_eq!(min(&StringArray::from(vec!["b", "a", "c"])), Some("a"));
For more examples, consult the arrow_array docs.
Type Erasure / Trait Objects
It is often the case that code wishes to handle any type of array, without necessarily knowing its concrete type. This use-case is catered for by a combination of Array and DataType, with the former providing a type-erased container for the array, and the latter identifying the concrete type of array.
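use arrow::array::{Array, Float32Array, StringArray};
use arrow::datatypes::DataType;

fn impl_string(array: &StringArray) {}
fn impl_f32(array: &Float32Array) {}

fn impl_dyn(array: &dyn Array) {
    // The DataType identifies which concrete array type to downcast to.
    match array.data_type() {
        DataType::Utf8 => impl_string(array.as_any().downcast_ref().unwrap()),
        DataType::Float32 => impl_f32(array.as_any().downcast_ref().unwrap()),
        _ => unimplemented!(),
    }
}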
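The dispatch above relies on `downcast_ref` returning `None` when the requested type does not match, which the example turns into a panic with `unwrap()`. As a minimal std-only sketch of the same pattern that reports unsupported types instead of panicking, the following uses a hypothetical `DynArray` trait and `Kind` enum as stand-ins for Array and DataType. None of these names come from the arrow API; only `std::any::Any` is real.

```rust
use std::any::Any;

/// Hypothetical stand-in for a data type tag.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Kind {
    Utf8,
    Float32,
    Int32,
}

/// Hypothetical stand-in for a type-erased array trait.
trait DynArray {
    fn kind(&self) -> Kind;
    fn as_any(&self) -> &dyn Any;
}

struct Utf8Values(Vec<String>);
struct Float32Values(Vec<f32>);
#[allow(dead_code)]
struct Int32Values(Vec<i32>);

impl DynArray for Utf8Values {
    fn kind(&self) -> Kind { Kind::Utf8 }
    fn as_any(&self) -> &dyn Any { self }
}

impl DynArray for Float32Values {
    fn kind(&self) -> Kind { Kind::Float32 }
    fn as_any(&self) -> &dyn Any { self }
}

impl DynArray for Int32Values {
    fn kind(&self) -> Kind { Kind::Int32 }
    fn as_any(&self) -> &dyn Any { self }
}

/// Returns the number of value bytes, or None for kinds this sketch does not handle.
fn byte_len(array: &dyn DynArray) -> Option<usize> {
    match array.kind() {
        Kind::Utf8 => {
            // `?` propagates None if the tag and the concrete type disagree.
            let strings = array.as_any().downcast_ref::<Utf8Values>()?;
            Some(strings.0.iter().map(|s| s.len()).sum())
        }
        Kind::Float32 => {
            let floats = array.as_any().downcast_ref::<Float32Values>()?;
            Some(floats.0.len() * std::mem::size_of::<f32>())
        }
        _ => None,
    }
}

fn main() {
    let strings = Utf8Values(vec!["a".to_string(), "bc".to_string()]);
    let floats = Float32Values(vec![1.0, 2.0]);
    let ints = Int32Values(vec![1, 2, 3]);
    assert_eq!(byte_len(&strings), Some(3));
    assert_eq!(byte_len(&floats), Some(8));
    assert_eq!(byte_len(&ints), None);
}
```

Whether to panic or return an error for an unsupported type is a design choice; library code usually prefers the latter.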
It is also common to want to write a function that returns one of a number of possible array implementations. ArrayRef is a type-alias for Arc<dyn Array> which is frequently used for this purpose:
use std::str::FromStr;
use std::sync::Arc;
use arrow::array::{Array, ArrayRef, Int32Array, PrimitiveArray};
use arrow::datatypes::{ArrowPrimitiveType, DataType, Int32Type, UInt32Type};

// Parses each string into T::Native; values that fail to parse become null.
fn parse_to_primitive<'a, T, I>(iter: I) -> PrimitiveArray<T>
where
    T: ArrowPrimitiveType,
    T::Native: FromStr,
    I: IntoIterator<Item = &'a str>,
{
    PrimitiveArray::from_iter(iter.into_iter().map(|val| T::Native::from_str(val).ok()))
}

fn parse_strings<'a, I>(iter: I, to_data_type: DataType) -> ArrayRef
where
    I: IntoIterator<Item = &'a str>,
{
    match to_data_type {
        DataType::Int32 => Arc::new(parse_to_primitive::<Int32Type, _>(iter)) as _,
        DataType::UInt32 => Arc::new(parse_to_primitive::<UInt32Type, _>(iter)) as _,
        _ => unimplemented!(),
    }
}

let array = parse_strings(["1", "2", "3"], DataType::Int32);
let integers = array.as_any().downcast_ref::<Int32Array>().unwrap();
assert_eq!(integers.values(), [1, 2, 3]);
Compute Kernels
The compute module provides optimised implementations of many common operations. For example, the parse_strings operation above could also be implemented as follows:
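use std::sync::Arc;
use arrow::array::{Array, ArrayRef, StringArray, UInt32Array};
use arrow::datatypes::DataType;
use arrow::error::Result;

fn parse_strings<'a, I>(iter: I, to_data_type: &DataType) -> Result<ArrayRef>
where
    I: IntoIterator<Item = &'a str>,
{
    // Build a string array first, then let the cast kernel do the parsing.
    let array = Arc::new(StringArray::from_iter(iter.into_iter().map(Some))) as _;
    arrow::compute::cast(&array, to_data_type)
}

let array = parse_strings(["1", "2", "3"], &DataType::UInt32).unwrap();
let integers = array.as_any().downcast_ref::<UInt32Array>().unwrap();
assert_eq!(integers.values(), [1, 2, 3]);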
This module also implements many common vertical operations:
- All mathematical binary operators, such as subtract
- All boolean binary operators, such as equality
- cast
- filter
- take and limit
- sort
- Some string operators, such as substring and length
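use arrow::array::{as_primitive_array, Int32Array};
use arrow::compute::{filter, gt_scalar};
use arrow::datatypes::Int32Type;

let array = Int32Array::from_iter(0..100);
// Build a boolean mask, then keep only the slots where it is true.
let predicate = gt_scalar(&array, 60).unwrap();
let filtered = filter(&array, &predicate).unwrap();
let expected = Int32Array::from_iter(61..100);
assert_eq!(&expected, as_primitive_array::<Int32Type>(&filtered));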
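The example above follows a two-step pattern: a comparison produces one boolean per slot, and a filter keeps the slots whose boolean is true. The following std-only sketch models that pattern with plain vectors. `gt_scalar_model` and `filter_model` are hypothetical names for this sketch, not arrow functions, and dropping slots whose mask entry is null is an assumption made here for illustration.

```rust
/// Hypothetical model of a comparison kernel: one boolean per input slot,
/// with nulls propagated to the output.
fn gt_scalar_model(values: &[Option<i32>], scalar: i32) -> Vec<Option<bool>> {
    values.iter().map(|v| v.map(|x| x > scalar)).collect()
}

/// Hypothetical model of a filter kernel: keeps the slots whose mask entry
/// is `Some(true)`. Dropping slots with a null mask entry is an assumption
/// of this sketch.
fn filter_model(values: &[Option<i32>], mask: &[Option<bool>]) -> Vec<Option<i32>> {
    assert_eq!(values.len(), mask.len(), "mask and values must have equal length");
    values
        .iter()
        .zip(mask)
        .filter(|(_, keep)| **keep == Some(true))
        .map(|(value, _)| *value)
        .collect()
}

fn main() {
    let mut values: Vec<Option<i32>> = (0..10).map(Some).collect();
    values[8] = None; // one null slot
    let mask = gt_scalar_model(&values, 6);
    let filtered = filter_model(&values, &mask);
    assert_eq!(filtered, vec![Some(7), Some(9)]);
}
```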
It also implements some horizontal operations.
Tabular Representation
It is common to want to group one or more columns together into a tabular representation. This is provided by RecordBatch, which combines a Schema and a corresponding list of ArrayRef.
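use std::sync::Arc;
use arrow::array::{ArrayRef, Float32Array, Int32Array};
use arrow::record_batch::RecordBatch;

let col_1 = Arc::new(Int32Array::from_iter([1, 2, 3])) as _;
let col_2 = Arc::new(Float32Array::from_iter([1., 6.3, 4.])) as _;
let batch = RecordBatch::try_from_iter([("col_1", col_1), ("col_2", col_2)]).unwrap();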
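A batch of this kind is only meaningful if every column has the same number of rows, which is why construction is fallible. The std-only sketch below illustrates that invariant with a hypothetical `Batch` type holding column names and type-erased columns. `Column`, `ColumnRef` and `Batch` are invented for this sketch and are much simpler than the crate's Schema and RecordBatch.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for a type-erased column; only its length matters here.
trait Column {
    fn len(&self) -> usize;
}

impl Column for Vec<i32> {
    fn len(&self) -> usize {
        self.as_slice().len()
    }
}

impl Column for Vec<f32> {
    fn len(&self) -> usize {
        self.as_slice().len()
    }
}

type ColumnRef = Arc<dyn Column>;

/// Hypothetical model of a batch: column names plus equally long columns.
struct Batch {
    names: Vec<String>,
    columns: Vec<ColumnRef>,
    num_rows: usize,
}

impl Batch {
    /// Validates that every name has a column and all columns share one length.
    fn try_new(names: Vec<String>, columns: Vec<ColumnRef>) -> Result<Self, String> {
        if names.len() != columns.len() {
            return Err(format!("{} names but {} columns", names.len(), columns.len()));
        }
        let num_rows = columns.first().map_or(0, |c| c.len());
        if let Some(i) = columns.iter().position(|c| c.len() != num_rows) {
            return Err(format!(
                "column '{}' has {} rows, expected {}",
                names[i],
                columns[i].len(),
                num_rows
            ));
        }
        Ok(Self { names, columns, num_rows })
    }

    /// Looks a column up by name.
    fn column(&self, name: &str) -> Option<&ColumnRef> {
        let index = self.names.iter().position(|n| n == name)?;
        self.columns.get(index)
    }
}

fn main() {
    let col_1: ColumnRef = Arc::new(vec![1i32, 2, 3]);
    let col_2: ColumnRef = Arc::new(vec![1.0f32, 6.3, 4.0]);
    let batch = Batch::try_new(vec!["col_1".into(), "col_2".into()], vec![col_1, col_2]).unwrap();
    assert_eq!(batch.num_rows, 3);
    assert_eq!(batch.column("col_2").map(|c| c.len()), Some(3));

    // Columns of different lengths are rejected.
    let short: ColumnRef = Arc::new(vec![1i32]);
    let long: ColumnRef = Arc::new(vec![1.0f32, 2.0]);
    assert!(Batch::try_new(vec!["a".into(), "b".into()], vec![short, long]).is_err());
}
```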
IO
This crate provides readers and writers for various formats to/from RecordBatch, including CSV, IPC and JSON (see the csv, ipc and json re-exports).
Parquet is published as a separate crate.
Memory and Buffers
Advanced users may wish to interact with the underlying buffers of an Array, for example, for FFI or high-performance conversion from other formats. This interface is provided by ArrayData, which stores the Buffer comprising an Array, and can be accessed with Array::data.
The APIs for constructing ArrayData come in safe and unsafe variants, with the former performing extensive, but potentially expensive, validation to ensure the buffers are well-formed.
An ArrayRef can be cheaply created from an ArrayData using make_array, or by using the appropriate From conversion on the concrete Array implementation.
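To illustrate the trade-off between validated and unchecked construction, here is a minimal std-only sketch of variable-length string data stored as an offsets buffer plus a values buffer. `StringData`, `try_new` and `new_unchecked` are hypothetical names chosen for this sketch and do not reproduce the ArrayData API. The checks shown are examples of the kind of validation a safe constructor might perform, not the crate's actual list.

```rust
/// Hypothetical model of string data: `offsets[i]..offsets[i + 1]` is the
/// byte range of string `i` inside `values`.
struct StringData {
    offsets: Vec<usize>,
    values: Vec<u8>,
}

impl StringData {
    /// Safe constructor: checks every invariant, which costs a pass over the data.
    fn try_new(offsets: Vec<usize>, values: Vec<u8>) -> Result<Self, String> {
        let end = match offsets.last() {
            Some(&end) => end,
            None => return Err("offsets must contain at least one entry".to_string()),
        };
        if offsets.windows(2).any(|w| w[0] > w[1]) {
            return Err("offsets must be non-decreasing".to_string());
        }
        if end > values.len() {
            return Err("last offset exceeds the values buffer".to_string());
        }
        for w in offsets.windows(2) {
            std::str::from_utf8(&values[w[0]..w[1]]).map_err(|e| e.to_string())?;
        }
        Ok(Self { offsets, values })
    }

    /// Unchecked constructor: skips validation entirely.
    ///
    /// # Safety
    /// The caller must guarantee everything `try_new` would have checked.
    unsafe fn new_unchecked(offsets: Vec<usize>, values: Vec<u8>) -> Self {
        Self { offsets, values }
    }

    fn len(&self) -> usize {
        self.offsets.len() - 1
    }

    fn value(&self, i: usize) -> &str {
        let bytes = &self.values[self.offsets[i]..self.offsets[i + 1]];
        // SAFETY: both constructors require each slice to be valid UTF-8.
        unsafe { std::str::from_utf8_unchecked(bytes) }
    }
}

fn main() {
    let data = StringData::try_new(vec![0, 1, 3], b"abc".to_vec()).unwrap();
    assert_eq!(data.len(), 2);
    assert_eq!(data.value(1), "bc");

    // Malformed buffers are rejected by the safe constructor.
    assert!(StringData::try_new(vec![0, 2, 1], b"abc".to_vec()).is_err());
    assert!(StringData::try_new(vec![0, 5], b"abc".to_vec()).is_err());

    // SAFETY: these are the same buffers that passed validation above.
    let trusted = unsafe { StringData::new_unchecked(vec![0, 1, 3], b"abc".to_vec()) };
    assert_eq!(trusted.value(0), "a");
}
```

The unchecked path is only appropriate when the buffers come from a source that already upholds the invariants, such as data that was validated earlier.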
Safety and Security
Like many crates, this crate makes use of unsafe where prudent. However, it endeavours to be sound. Specifically, it should not be possible to trigger undefined behaviour using safe APIs.
If you think you have found an instance where this is possible, please file a ticket in our issue tracker and it will be triaged and fixed. For more information on arrow’s use of unsafe, see here.
Higher-level Processing
This crate aims to provide reusable, low-level primitives for operating on columnar data. For more sophisticated query processing workloads, consider checking out DataFusion. This orchestrates the primitives exported by this crate into an embeddable query engine, with SQL and DataFrame frontends, and heavily influences this crate’s roadmap.
Re-exports
pub use arrow_csv as csv;
pub use arrow_ipc as ipc;
pub use arrow_json as json;
pub use arrow_row as row;
Modules
- … offsets.
- … ArrowError for representing failures in various Arrow operations.
- … format/Tensor.fbs.
Macros
- … Array to a DictionaryArray based on its DataType, accepts a number of subsequent patterns to match the data type
- … Array to a PrimitiveArray based on its DataType, accepts a number of subsequent patterns to match the data type