A complete, safe, native Rust implementation of Apache Arrow, a cross-language development platform for in-memory data.
Please see the arrow crates.io page for feature flags and tips to improve performance.
§Columnar Format
The array module provides statically typed implementations of all the array types defined
by the Arrow Columnar Format.
For example, an Int32Array represents a nullable array of i32:
use arrow::array::{Array, Int32Array};

let array = Int32Array::from(vec![Some(1), None, Some(3)]);
assert_eq!(array.len(), 3);
assert_eq!(array.value(0), 1);
assert!(array.is_null(1));
let collected: Vec<_> = array.iter().collect();
assert_eq!(collected, vec![Some(1), None, Some(3)]);
// The values buffer still holds a slot for the null entry
assert_eq!(array.values(), &[1, 0, 3]);

It is also possible to write generic code for different concrete types. For example, since the following function is generic over all primitively typed arrays, when invoked the Rust compiler will generate specialized implementations with optimized code for each concrete type.
use std::iter::Sum;
use arrow::array::{Float32Array, PrimitiveArray, TimestampNanosecondArray};
use arrow::datatypes::ArrowPrimitiveType;

// Sums all values, treating nulls as the native type's default value
fn sum<T>(array: &PrimitiveArray<T>) -> T::Native
where
    T: ArrowPrimitiveType,
    T::Native: Sum,
{
    array.iter().map(|v| v.unwrap_or_default()).sum()
}

assert_eq!(sum(&Float32Array::from(vec![1.1, 2.9, 3.])), 7.);
assert_eq!(sum(&TimestampNanosecondArray::from(vec![1, 2, 3])), 6);

The following uses ArrayAccessor to implement a generic function
over all arrays with comparable values:
use arrow::array::{ArrayAccessor, ArrayIter, Int32Array, StringArray};

// Returns the smallest non-null value, or None if there is none
fn min<T: ArrayAccessor>(array: T) -> Option<T::Item>
where
    T::Item: Ord,
{
    ArrayIter::new(array).flatten().min()
}

assert_eq!(min(&Int32Array::from(vec![4, 2, 1, 6])), Some(1));
assert_eq!(min(&StringArray::from(vec!["b", "a", "c"])), Some("a"));

For more examples and details, consult the arrow_array docs.
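The first example in this section shows values() returning &[1, 0, 3] even though the middle element is null. The following is a minimal plain-Rust sketch of why that can happen, assuming a simplified model in which a nullable array keeps a dense values buffer next to a separate validity mask. The NullableI32 type and its methods are hypothetical and are not part of the arrow crate, and a Vec<bool> stands in for the packed validity bitmap described by the Arrow Columnar Format.

```rust
/// Hypothetical, simplified model of a nullable i32 array.
/// Not the arrow crate's implementation: a `Vec<bool>` stands in
/// for the packed validity bitmap of the real format.
pub struct NullableI32 {
    pub values: Vec<i32>,
    pub validity: Vec<bool>,
}

impl NullableI32 {
    /// Stores a default value (0) in the slot of every null entry.
    pub fn from_options(items: &[Option<i32>]) -> Self {
        let values = items.iter().map(|v| v.unwrap_or_default()).collect();
        let validity = items.iter().map(|v| v.is_some()).collect();
        Self { values, validity }
    }

    /// A row is null when its validity entry is false.
    pub fn is_null(&self, i: usize) -> bool {
        !self.validity[i]
    }

    /// Recombines the two buffers into optional values.
    pub fn iter(&self) -> impl Iterator<Item = Option<i32>> + '_ {
        self.values
            .iter()
            .zip(&self.validity)
            .map(|(&v, &valid)| valid.then_some(v))
    }
}
```

In this model, reading the raw values buffer bypasses the validity mask, which is why a placeholder shows up in the null slot, while iteration consults the mask and yields None.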
§Type Erasure / Trait Objects
It is common to write code that handles any type of array, without necessarily
knowing its concrete type. This is done using the Array trait and using
DataType to determine the appropriate downcast_ref.
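The mechanism underneath this is Rust's standard std::any::Any downcasting, selected by a runtime type tag. The following is a minimal std-only sketch of that dispatch, not the arrow crate's code: the AnyArray trait, the Kind enum (standing in for DataType) and the three wrapper structs are all hypothetical names introduced for illustration.

```rust
use std::any::Any;

/// Hypothetical runtime type tag, standing in for a data type enum.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Kind {
    Utf8,
    Float32,
    Int64,
}

/// Hypothetical type-erased array trait; only the parts needed to
/// demonstrate downcasting are modelled.
pub trait AnyArray {
    fn as_any(&self) -> &dyn Any;
    fn kind(&self) -> Kind;
}

pub struct Strings(pub Vec<String>);
pub struct Floats(pub Vec<f32>);
pub struct Ints(pub Vec<i64>);

impl AnyArray for Strings {
    fn as_any(&self) -> &dyn Any { self }
    fn kind(&self) -> Kind { Kind::Utf8 }
}

impl AnyArray for Floats {
    fn as_any(&self) -> &dyn Any { self }
    fn kind(&self) -> Kind { Kind::Float32 }
}

impl AnyArray for Ints {
    fn as_any(&self) -> &dyn Any { self }
    fn kind(&self) -> Kind { Kind::Int64 }
}

/// Dispatches on the runtime tag, then downcasts to the concrete type.
/// Returns None for kinds this function does not handle.
pub fn describe(array: &dyn AnyArray) -> Option<String> {
    match array.kind() {
        Kind::Utf8 => {
            let s: &Strings = array.as_any().downcast_ref()?;
            Some(format!("{} strings", s.0.len()))
        }
        Kind::Float32 => {
            let f: &Floats = array.as_any().downcast_ref()?;
            Some(format!("sum = {}", f.0.iter().sum::<f32>()))
        }
        _ => None,
    }
}
```

The sketch returns None where the downcast or the tag does not match, rather than panicking. The arrow-specific form of this dispatch is shown next.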
use arrow::array::{Array, Float32Array, StringArray};
use arrow::datatypes::DataType;

fn impl_string(array: &StringArray) {}
fn impl_f32(array: &Float32Array) {}

fn impl_dyn(array: &dyn Array) {
    match array.data_type() {
        // downcast `dyn Array` to concrete `StringArray`
        DataType::Utf8 => impl_string(array.as_any().downcast_ref().unwrap()),
        // downcast `dyn Array` to concrete `Float32Array`
        DataType::Float32 => impl_f32(array.as_any().downcast_ref().unwrap()),
        _ => unimplemented!(),
    }
}

You can use the AsArray extension trait to facilitate downcasting:
use arrow::array::{Array, AsArray, Float32Array, StringArray};
use arrow::datatypes::DataType;

fn impl_string(array: &StringArray) {}
fn impl_f32(array: &Float32Array) {}

fn impl_dyn(array: &dyn Array) {
    match array.data_type() {
        DataType::Utf8 => impl_string(array.as_string()),
        DataType::Float32 => impl_f32(array.as_primitive()),
        _ => unimplemented!(),
    }
}

It is also common to want to write a function that returns one of a number of possible
array implementations. ArrayRef is a type-alias for Arc<dyn Array>
which is frequently used for this purpose:
use std::str::FromStr;
use std::sync::Arc;
use arrow::array::{Array, ArrayRef, Int32Array, PrimitiveArray};
use arrow::datatypes::{ArrowPrimitiveType, DataType, Int32Type, UInt32Type};

// Parses each string into T::Native; unparseable inputs become null
fn parse_to_primitive<'a, T, I>(iter: I) -> PrimitiveArray<T>
where
    T: ArrowPrimitiveType,
    T::Native: FromStr,
    I: IntoIterator<Item = &'a str>,
{
    PrimitiveArray::from_iter(iter.into_iter().map(|val| T::Native::from_str(val).ok()))
}

// Selects the concrete array type at runtime and erases it behind ArrayRef
fn parse_strings<'a, I>(iter: I, to_data_type: DataType) -> ArrayRef
where
    I: IntoIterator<Item = &'a str>,
{
    match to_data_type {
        DataType::Int32 => Arc::new(parse_to_primitive::<Int32Type, _>(iter)) as _,
        DataType::UInt32 => Arc::new(parse_to_primitive::<UInt32Type, _>(iter)) as _,
        _ => unimplemented!(),
    }
}

let array = parse_strings(["1", "2", "3"], DataType::Int32);
let integers = array.as_any().downcast_ref::<Int32Array>().unwrap();
assert_eq!(integers.values(), &[1, 2, 3]);
§Compute Kernels
The compute module provides optimised implementations of many common operations.
For example, the parse_strings operation above could also be implemented as follows:
use arrow::array::{Array, ArrayRef, StringArray, UInt32Array};
use arrow::datatypes::DataType;
use arrow::error::ArrowError;

// Builds a StringArray, then lets the cast kernel do the parsing
fn parse_strings<'a, I>(iter: I, to_data_type: &DataType) -> Result<ArrayRef, ArrowError>
where
    I: IntoIterator<Item = &'a str>,
{
    let array = StringArray::from_iter(iter.into_iter().map(Some));
    arrow::compute::cast(&array, to_data_type)
}

let array = parse_strings(["1", "2", "3"], &DataType::UInt32).unwrap();
let integers = array.as_any().downcast_ref::<UInt32Array>().unwrap();
assert_eq!(integers.values(), &[1, 2, 3]);

This module also implements many common vertical operations:
- All mathematical binary operators, such as sub
- All boolean binary operators such as equality
- cast
- filter
- take
- sort
- some string operators such as substring and length
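Two of the operations listed above, a comparison and filter, are often combined: the comparison yields one boolean per row, and filter copies the rows whose boolean is true. The following is a plain-Rust sketch of that two-step pattern under simplifying assumptions (no nulls, plain slices instead of arrays). The gt_scalar and filter_by_mask functions are hypothetical models written for illustration and are not the crate's kernels.

```rust
/// Hypothetical model of a comparison kernel: compares every element
/// against a scalar and returns one boolean per row.
pub fn gt_scalar(values: &[i32], scalar: i32) -> Vec<bool> {
    values.iter().map(|v| *v > scalar).collect()
}

/// Hypothetical model of a filter kernel: copies the rows whose mask
/// entry is true into a new vector. Panics if the lengths differ.
pub fn filter_by_mask(values: &[i32], mask: &[bool]) -> Vec<i32> {
    assert_eq!(values.len(), mask.len(), "mask must have one entry per row");
    values
        .iter()
        .zip(mask)
        .filter(|(_, keep)| **keep)
        .map(|(v, _)| *v)
        .collect()
}
```

The example that follows performs the same two steps with the crate's gt and filter kernels.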
use arrow::array::{AsArray, Int32Array};
use arrow::compute::filter;
use arrow::compute::kernels::cmp::gt;
use arrow::datatypes::Int32Type;

let array = Int32Array::from_iter(0..100);
// Create a 32-bit integer scalar (single) value:
let scalar = Int32Array::new_scalar(60);
// find all rows in the array that are greater than 60
let predicate = gt(&array, &scalar).unwrap();
// copy all matching rows into a new array
let filtered = filter(&array, &predicate).unwrap();
let expected = Int32Array::from_iter(61..100);
assert_eq!(&expected, filtered.as_primitive::<Int32Type>());

As well as some horizontal operations, such as:
§Tabular Representation
It is common to want to group one or more columns together into a tabular representation. This
is provided by RecordBatch which combines a Schema
and a corresponding list of ArrayRef.
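use std::sync::Arc;
use arrow::array::{ArrayRef, Float32Array, Int32Array};
use arrow::record_batch::RecordBatch;

let col_1: ArrayRef = Arc::new(Int32Array::from_iter([1, 2, 3]));
let col_2: ArrayRef = Arc::new(Float32Array::from_iter([1., 6.3, 4.]));
// The schema is inferred from the column names and array types
let batch = RecordBatch::try_from_iter([("col_1", col_1), ("col_2", col_2)]).unwrap();
§Pretty Printing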
See the util::pretty module (requires the prettyprint crate feature).
§IO
This crate provides readers and writers for various formats to/from RecordBatch.
Parquet is published as a separate crate.
§Serde Compatibility
arrow_json::reader::Decoder provides a mechanism to convert arbitrary, serde-compatible
structures into RecordBatch.
Whilst likely less performant than implementing a custom builder, as described in arrow_array::builder, this provides a simple mechanism to get up and running quickly:
use std::sync::Arc;
use serde::Serialize;
use arrow::array::AsArray;
use arrow::datatypes::{DataType, Field, Int32Type, Schema};
use arrow::json::ReaderBuilder;

#[derive(Serialize)]
struct MyStruct {
    int32: i32,
    string: String,
}

// Field names must match the serialized field names of MyStruct
let schema = Schema::new(vec![
    Field::new("int32", DataType::Int32, false),
    Field::new("string", DataType::Utf8, false),
]);
let rows = vec![
    MyStruct { int32: 5, string: "bar".to_string() },
    MyStruct { int32: 8, string: "foo".to_string() },
];
let mut decoder = ReaderBuilder::new(Arc::new(schema)).build_decoder().unwrap();
decoder.serialize(&rows).unwrap();
let batch = decoder.flush().unwrap().unwrap();
// Expect batch containing two columns
let int32 = batch.column(0).as_primitive::<Int32Type>();
assert_eq!(int32.values(), &[5, 8]);
let string = batch.column(1).as_string::<i32>();
assert_eq!(string.value(0), "bar");
assert_eq!(string.value(1), "foo");
§Crate Topology
The arrow project is implemented as multiple sub-crates, which are then re-exported by
this top-level crate.
Crate authors can choose to depend on this top-level crate, or just the sub-crates they need.
The current list of sub-crates is:
- arrow-arith - arithmetic kernels
- arrow-array - type-safe arrow array abstractions
- arrow-buffer - buffer abstractions for arrow arrays
- arrow-cast - cast kernels for arrow arrays
- arrow-csv - read/write CSV to arrow format
- arrow-data - the underlying data of arrow arrays
- arrow-ipc - read/write IPC to arrow format
- arrow-json - read/write JSON to arrow format
- arrow-ord - ordering kernels for arrow arrays
- arrow-row - comparable row format
- arrow-schema - the logical types for arrow arrays
- arrow-select - selection kernels for arrow arrays
- arrow-string - string kernels for arrow arrays
Some functionality is also distributed independently of this crate:
- arrow-flight - support for Arrow Flight RPC
- arrow-integration-test - support for Arrow JSON Test Format
- parquet - support for Apache Parquet
§Safety and Security
Like many crates, this crate makes use of unsafe where prudent. However, it endeavours to be sound. Specifically, it should not be possible to trigger undefined behaviour using safe APIs.
If you think you have found an instance where this is possible, please file a ticket in our issue tracker and it will be triaged and fixed. For more information on arrow’s use of unsafe, see here.
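As an illustration of what soundness means here, the following is a minimal std-only sketch of a safe accessor layered over an unchecked one. The Values type and its methods are hypothetical and are not taken from this crate; the sketch only shows the general pattern in which every safe entry point validates its input before reaching unsafe code.

```rust
/// Hypothetical container used only to illustrate the pattern.
pub struct Values {
    data: Vec<i32>,
}

impl Values {
    pub fn new(data: Vec<i32>) -> Self {
        Self { data }
    }

    pub fn len(&self) -> usize {
        self.data.len()
    }

    /// Safe accessor: validates the index first, so no argument can
    /// cause undefined behaviour. Panics on an out-of-bounds index.
    pub fn value(&self, i: usize) -> i32 {
        assert!(i < self.len(), "index {i} out of bounds for length {}", self.len());
        // SAFETY: the assertion above guarantees `i < self.len()`.
        unsafe { self.value_unchecked(i) }
    }

    /// Unsafe accessor: skips the bounds check.
    ///
    /// # Safety
    /// The caller must ensure `i < self.len()`.
    pub unsafe fn value_unchecked(&self, i: usize) -> i32 {
        // SAFETY: upheld by the caller, per the contract above.
        unsafe { *self.data.get_unchecked(i) }
    }
}
```

Callers that only use value can at worst trigger a panic; the possibility of undefined behaviour is confined to callers who opt in with an unsafe block and take on the documented contract.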
§Higher-level Processing
This crate aims to provide reusable, low-level primitives for operating on columnar data. For more sophisticated query processing workloads, consider checking out DataFusion. This orchestrates the primitives exported by this crate into an embeddable query engine, with SQL and DataFrame frontends, and heavily influences this crate’s roadmap.
Re-exports§
- pub use arrow_csv as csv; (crate feature csv)
- pub use arrow_ipc as ipc; (crate feature ipc)
- pub use arrow_json as json; (crate feature json)
- pub use arrow_pyarrow as pyarrow; (crate feature pyarrow)
- pub use arrow_row as row;
Modules§
- alloc: Defines the low-level Allocation API for shared memory regions
- array: Statically typed implementations of Arrow Arrays
- buffer: Types of shared memory region
- compute: Computation kernels on Arrow Arrays
- datatypes: Defines the logical data types of Arrow arrays.
- error: Defines ArrowError for representing failures in various Arrow operations.
- ffi (crate feature ffi): Contains declarations to bind to the C Data Interface.
- ffi_stream (crate feature ffi): Contains declarations to bind to the C Stream Interface.
- record_batch: Contains the RecordBatch type and associated traits
- temporal_conversions: Conversion methods for dates and times.
- tensor: Arrow Tensor Type, defined in format/Tensor.fbs.
- util: Utility functions for working with Arrow data
Macros§
- downcast_dictionary_array: Downcast an Array to a DictionaryArray based on its DataType, accepts a number of subsequent patterns to match the data type
- downcast_primitive_array: Downcast an Array to a PrimitiveArray based on its DataType, accepts a number of subsequent patterns to match the data type
Constants§
- ARROW_VERSION: Arrow crate version