Crate arrow
A native Rust implementation of Apache Arrow, a cross-language development platform for in-memory data.
DataType
Every Array in this crate has an associated DataType that specifies how its data is laid out in memory and represented.
Thus, a central enum of this crate is DataType, which contains the set of valid
DataTypes in the specification. For example, DataType::Utf8.
Array
The central trait of this package is the dynamically-typed Array, which
represents a fixed-sized, immutable, Send + Sync array of nullable elements. An example of such an array is UInt32Array.
One way to think about an arrow Array is as an Arc<[Option<T>; len]>,
where T can be anything ranging from an integer to a string, or even
another Array.
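The mental model above can be sketched with the standard library alone. This is only an illustration of the analogy, not how the crate stores data; the names `NullableArray`, `make_array` and `null_count` are hypothetical and do not exist in arrow.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for a nullable array: a shared, immutable slice
/// of optional values. This models the analogy only, not the crate's layout.
type NullableArray<T> = Arc<[Option<T>]>;

/// Builds the model array from a vector of optional values.
fn make_array<T>(values: Vec<Option<T>>) -> NullableArray<T> {
    Arc::from(values)
}

/// Counts the null (None) slots of the model array.
fn null_count<T>(array: &NullableArray<T>) -> usize {
    array.iter().filter(|v| v.is_none()).count()
}
```

Cloning such a value only bumps a reference count, so both handles point at the same allocation; this is the property that makes sharing immutable arrays across threads cheap in this model.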
Arrays have len() and data_type(), and the nullability of each element
can be obtained via is_null(index). A concrete implementation such as UInt32Array
also offers typed accessors such as value(index):
    use arrow::array::{Array, UInt32Array};

    let array = UInt32Array::from(vec![Some(1), None, Some(3)]);
    assert_eq!(array.len(), 3);
    assert_eq!(array.value(0), 1);
    assert_eq!(array.is_null(1), true);
To make the array dynamically typed, we wrap it in an Arc:
    use std::sync::Arc;
    use arrow::datatypes::DataType;
    use arrow::array::{UInt32Array, ArrayRef};

    // `array` is the UInt32Array built in the previous example
    let array: ArrayRef = Arc::new(array);
    assert_eq!(array.len(), 3);
    // array.value() is not available in the dynamically-typed version
    assert_eq!(array.is_null(1), true);
    assert_eq!(array.data_type(), &DataType::UInt32);
To downcast, use as_any():
    let array = array.as_any().downcast_ref::<UInt32Array>().unwrap();
    assert_eq!(array.value(0), 1);
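For readers unfamiliar with the as_any() pattern, the following is a minimal std-only sketch of how a dynamically-typed trait object can be downcast through std::any::Any. The trait `DynArray`, the struct `U32Array` and the helper `first_u32` are hypothetical names made up for this illustration; they are simplified stand-ins, not the crate's types.

```rust
use std::any::Any;
use std::sync::Arc;

/// Hypothetical, simplified version of a dynamically-typed array trait.
trait DynArray: Send + Sync {
    fn len(&self) -> usize;
    /// Exposes the concrete value as `Any` so callers can downcast.
    fn as_any(&self) -> &dyn Any;
}

/// Hypothetical concrete array of nullable u32 values.
struct U32Array {
    values: Vec<Option<u32>>,
}

impl U32Array {
    /// Typed accessor, only available on the concrete type.
    fn value(&self, i: usize) -> Option<u32> {
        self.values[i]
    }
}

impl DynArray for U32Array {
    fn len(&self) -> usize {
        self.values.len()
    }

    fn as_any(&self) -> &dyn Any {
        self
    }
}

/// Returns the first value if `array` is a `U32Array`. Yields `None` when
/// the concrete type differs, the array is empty, or the first slot is null.
fn first_u32(array: &Arc<dyn DynArray>) -> Option<u32> {
    let typed = array.as_any().downcast_ref::<U32Array>()?;
    if typed.len() == 0 {
        return None;
    }
    typed.value(0)
}
```

Unlike the unwrap() in the example above, this sketch propagates a failed downcast as `None`, which may be preferable when the concrete type is not known in advance.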
Memory and Buffers
Data in Array is stored in ArrayData, which in turn
is a collection of other ArrayData and Buffers.
Buffer is the central struct that array implementations use to keep allocated memory and pointers.
MutableBuffer is the mutable counterpart of Buffer.
These are the lowest abstractions of this crate, and are used throughout the crate to
efficiently allocate, write, read and deallocate memory.
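The split between a mutable buffer used while writing and an immutable buffer used for sharing can be illustrated with a small std-only sketch. The types `GrowableBytes` and `SharedBytes` are hypothetical and only model the idea; the little-endian encoding is a choice made for this example, and the crate's own structs manage alignment and allocation in ways not shown here.

```rust
use std::sync::Arc;

/// Hypothetical mutable byte buffer used while building data.
struct GrowableBytes {
    bytes: Vec<u8>,
}

/// Hypothetical immutable buffer: cheap to clone and safe to share.
#[derive(Clone)]
struct SharedBytes {
    bytes: Arc<[u8]>,
}

impl GrowableBytes {
    fn new() -> Self {
        GrowableBytes { bytes: Vec::new() }
    }

    /// Appends a u32 in little-endian byte order (a choice of this sketch).
    fn push_u32(&mut self, v: u32) {
        self.bytes.extend_from_slice(&v.to_le_bytes());
    }

    /// Consumes the mutable buffer and freezes it into a shared one.
    fn freeze(self) -> SharedBytes {
        SharedBytes { bytes: Arc::from(self.bytes) }
    }
}

impl SharedBytes {
    /// Reads the `i`-th u32 back out of the raw bytes.
    /// Panics if the buffer holds fewer than `i + 1` values.
    fn read_u32(&self, i: usize) -> u32 {
        let start = i * 4;
        let mut raw = [0u8; 4];
        raw.copy_from_slice(&self.bytes[start..start + 4]);
        u32::from_le_bytes(raw)
    }
}
```

Because `freeze` takes `self` by value, the mutable handle cannot be used after the data has been shared, which is one way to keep the immutable side free of aliasing writes.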
Field, Schema and RecordBatch
Field is a struct that contains an array’s metadata (datatype and whether its values
can be null), and a name. Schema is a vector of fields with optional metadata.
Together, they form the basis of a schematic representation of a group of Arrays.
In fact, RecordBatch is a struct with a Schema and a vector of
Arrays, all with the same len. A record batch is the highest-order struct that this crate currently offers
and is broadly used to represent a table where each column is an Array.
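The relationship between fields, schema and batch can be sketched as a small std-only model. `ColumnField` and `Batch` are hypothetical names for this illustration, columns are restricted to nullable i64 values for brevity, and the nullability check is a choice of this sketch rather than a statement about what the crate validates.

```rust
/// Hypothetical column metadata: a name, a type label and nullability.
#[derive(Debug, Clone, PartialEq)]
struct ColumnField {
    name: String,
    type_name: String,
    nullable: bool,
}

/// Hypothetical batch: one field per column, all columns of equal length.
#[derive(Debug)]
struct Batch {
    fields: Vec<ColumnField>,
    columns: Vec<Vec<Option<i64>>>,
}

impl Batch {
    /// Validates the invariants of this model before building the batch.
    fn try_new(
        fields: Vec<ColumnField>,
        columns: Vec<Vec<Option<i64>>>,
    ) -> Result<Batch, String> {
        if fields.len() != columns.len() {
            return Err(format!(
                "{} fields but {} columns",
                fields.len(),
                columns.len()
            ));
        }
        if let Some(first) = columns.first() {
            if columns.iter().any(|c| c.len() != first.len()) {
                return Err("all columns must have the same length".to_string());
            }
        }
        // Sketch-only rule: a non-nullable field may not hold nulls.
        for (field, column) in fields.iter().zip(&columns) {
            if !field.nullable && column.iter().any(|v| v.is_none()) {
                return Err(format!(
                    "column '{}' is not nullable but contains nulls",
                    field.name
                ));
            }
        }
        Ok(Batch { fields, columns })
    }

    /// Number of rows, taken from the first column (0 if there are none).
    fn num_rows(&self) -> usize {
        self.columns.first().map_or(0, |c| c.len())
    }
}
```

Validating once at construction means every later consumer of the batch can rely on equal column lengths without re-checking them.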
Compute
This crate offers many operations (called kernels) to operate on Arrays, which you can find at compute::kernels.
It has both vertical and horizontal operations, and some of them have a SIMD implementation.
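As a rough illustration of the two kinds of operation, the sketch below assumes that "vertical" means an operation that produces one output slot per input slot, and that "horizontal" means one that reduces an array to a single value. That reading is an assumption, and the functions `add` and `sum` here are std-only stand-ins over slices, not the crate's kernels.

```rust
/// Element-wise ("vertical" in this sketch) addition: the output has one
/// slot per input slot, and a null on either side yields a null.
fn add(
    left: &[Option<i32>],
    right: &[Option<i32>],
) -> Result<Vec<Option<i32>>, String> {
    if left.len() != right.len() {
        return Err("inputs must have the same length".to_string());
    }
    Ok(left
        .iter()
        .zip(right)
        .map(|(l, r)| match (l, r) {
            (Some(a), Some(b)) => Some(a + b),
            _ => None,
        })
        .collect())
}

/// Aggregating ("horizontal" in this sketch) sum: nulls are skipped, and
/// the result is None when there is no non-null value to add up.
fn sum(values: &[Option<i32>]) -> Option<i32> {
    values.iter().flatten().copied().reduce(|a, b| a + b)
}
```

Both functions treat nulls explicitly rather than substituting a default such as zero, so a missing value stays distinguishable from a real one in the output.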
Status
This crate implements most of the Arrow specification. Specifically, it supports the following types:
- All arrow primitive types, such as Int32Array, BooleanArray and Float64Array
- All arrow variable length types, such as StringArray and BinaryArray
- All composite types such as StructArray and ListArray
- Dictionary types (DictionaryArray)
This crate also implements many common vertical operations:
- all mathematical binary operators, such as subtract
- all boolean binary operators such as equality
- cast
- filter
- take and limit
- sort
- some string operators such as substring and length
as well as some horizontal operations.
Finally, this crate implements some readers and writers to different formats (see the csv, json and ipc modules listed under Modules).
The parquet implementation is on a separate crate.
Modules
alloc | Defines memory-related functions, such as allocate/deallocate/reallocate memory regions, cache and allocation alignments. |
array | The central type in Apache Arrow are arrays, represented by the |
bitmap | Defines a bitmap, which is used to track which values in an Arrow array are null. This is called a “validity bitmap” in the Arrow documentation. |
buffer | This module contains two main structs: Buffer and MutableBuffer. A buffer represents a contiguous memory region that can be shared via |
compute | Computation kernels on Arrow Arrays |
csv | Transfer data between the Arrow memory format and CSV (comma-separated values). |
datatypes | Defines the logical data types of Arrow arrays. |
error | Defines |
ffi | Contains declarations to bind to the C Data Interface. |
ipc | |
json | Transfer data between the Arrow memory format and JSON line-delimited records. See the module level documentation for the |
record_batch | A two-dimensional batch of column-oriented data with a defined schema. |
temporal_conversions | Conversion methods for dates and times. |
tensor | Arrow Tensor Type, defined in |
util |