//! # `serde_arrow` - convert sequences of Rust objects to / from Arrow arrays
//!
//! The Arrow in-memory format is a powerful way to work with data-frame-like structures. However,
//! the API of the underlying Rust crates can at times be cumbersome to use due to the statically
//! typed nature of Rust. `serde_arrow` offers a simple way to convert Rust objects into Arrow
//! arrays and back. `serde_arrow` relies on [Serde](https://serde.rs) to interpret Rust objects.
//! Therefore, adding support for `serde_arrow` to custom types is as easy as using Serde's derive
//! macros.
//!
//! `serde_arrow` mainly targets the [`arrow`](https://github.com/apache/arrow-rs) crate, but also
//! supports the deprecated [`arrow2`](https://github.com/jorgecarleitao/arrow2) crate. The arrow
//! implementations can be selected via [features](#features).
//!
//! `serde_arrow` relies on a schema to translate between Rust and Arrow, as their type systems do
//! not directly match. The schema is expressed as a collection of Arrow fields with additional
//! metadata describing the arrays. E.g., to convert a vector of Rust strings representing
//! timestamps to an Arrow `Timestamp` array, the schema should contain a field with data type
//! `Timestamp`. `serde_arrow` can derive the schema from the data or from the Rust types
//! themselves via schema tracing, but does not require it. It is always possible to specify the
//! schema manually. See the [`schema` module][schema] and [`SchemaLike`][schema::SchemaLike] for
//! further details.
//!
#![cfg_attr(
    all(has_arrow, has_arrow2),
    doc = r#"
## Overview

| Operation        | [`arrow-*`](#features)                                            | [`arrow2-*`](#features)                             | `marrow`                                            |
|:-----------------|:------------------------------------------------------------------|:----------------------------------------------------|:----------------------------------------------------|
| Rust to Arrow    | [`to_record_batch`], [`to_arrow`]                                 | [`to_arrow2`]                                       | [`to_marrow`]                                       |
| Arrow to Rust    | [`from_record_batch`], [`from_arrow`]                             | [`from_arrow2`]                                     | [`from_marrow`]                                     |
| [`ArrayBuilder`] | [`ArrayBuilder::from_arrow`]                                      | [`ArrayBuilder::from_arrow2`]                       | [`ArrayBuilder::from_marrow`]                       |
| [`Serializer`]   | [`ArrayBuilder::from_arrow`] + [`Serializer::new`]                | [`ArrayBuilder::from_arrow2`] + [`Serializer::new`] | [`ArrayBuilder::from_marrow`] + [`Serializer::new`] |
| [`Deserializer`] | [`Deserializer::from_record_batch`], [`Deserializer::from_arrow`] | [`Deserializer::from_arrow2`]                       | [`Deserializer::from_marrow`]                       |
"#
)]
//!
//! See also:
//!
//! - the [quickstart guide][_impl::docs::quickstart] for more examples of how to use this package
//! - the [status summary][_impl::docs::status] for an overview of the supported Arrow and Rust
//!   constructs
//!
//! ## Example
//!
//! ```rust
//! # use serde::{Deserialize, Serialize};
//! # #[cfg(has_arrow)]
//! # fn main() -> serde_arrow::Result<()> {
//! # use serde_arrow::_impl::arrow;
//! use arrow::datatypes::FieldRef;
//! use serde_arrow::schema::{SchemaLike, TracingOptions};
//!
//! ##[derive(Serialize, Deserialize)]
//! struct Record {
//!     a: f32,
//!     b: i32,
//! }
//!
//! let records = vec![
//!     Record { a: 1.0, b: 1 },
//!     Record { a: 2.0, b: 2 },
//!     Record { a: 3.0, b: 3 },
//! ];
//!
//! // Determine Arrow schema
//! let fields = Vec::<FieldRef>::from_type::<Record>(TracingOptions::default())?;
//!
//! // Build the record batch
//! let batch = serde_arrow::to_record_batch(&fields, &records)?;
//! # Ok(())
//! # }
//! # #[cfg(not(has_arrow))]
//! # fn main() { }
//! ```
//!
//! The `RecordBatch` can then be written to disk, e.g., as parquet using the [`ArrowWriter`] from
//! the [`parquet`] crate.
//!
//! [`ArrowWriter`]:
//!     https://docs.rs/parquet/latest/parquet/arrow/arrow_writer/struct.ArrowWriter.html
//! [`parquet`]: https://docs.rs/parquet/latest/parquet/
//!
//! # Features:
//!
//! The version of `arrow` or `arrow2` used can be selected via features. By default, no arrow
//! implementation is used. In that case only the base features of `serde_arrow` are available.
//!
//! The `arrow-*` and `arrow2-*` feature groups are compatible with each other, i.e., it is possible
//! to use `arrow` and `arrow2` together. Within each group, the highest version is selected if
//! multiple features are activated. E.g., when selecting `arrow2-0-16` and `arrow2-0-17`,
//! `arrow2=0.17` will be used.
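//!
//! As a rough illustration of this rule, the sketch below picks the highest version from a list
//! of enabled `arrow-*` feature names. The `select_arrow_version` helper is hypothetical and only
//! models the selection logic; it is not part of the `serde_arrow` API.
//!
//! ```rust
//! /// Return the highest version among the enabled `arrow-{version}` features.
//! /// Hypothetical helper for illustration only.
//! fn select_arrow_version(features: &[&str]) -> Option<u32> {
//!     features
//!         .iter()
//!         // Keep only names of the form `arrow-{version}`; `arrow2-*` names do not match.
//!         .filter_map(|feature| feature.strip_prefix("arrow-"))
//!         .filter_map(|version| version.parse::<u32>().ok())
//!         .max()
//! }
//! ```
//!
//! With `arrow-53` and `arrow-54` enabled, the helper returns `Some(54)`: a single version wins,
//! which is why enabling an additional feature can change the selected `arrow` version.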
//!
//! Note that because the highest version is selected, the features are not additive. In particular,
//! it is not possible to use `serde_arrow::to_arrow` for multiple different `arrow` versions at the
//! same time. See the next section for how to use `serde_arrow` in library code.
//!
//! Available features:
//!
//! | Arrow Feature | Arrow Version |
//! |---------------|---------------|
// arrow-version:insert: //! | `arrow-{version}`    | `arrow={version}`    |
//! | `arrow-59`    | `arrow=59`    |
//! | `arrow-58`    | `arrow=58`    |
//! | `arrow-57`    | `arrow=57`    |
//! | `arrow-56`    | `arrow=56`    |
//! | `arrow-55`    | `arrow=55`    |
//! | `arrow-54`    | `arrow=54`    |
//! | `arrow-53`    | `arrow=53`    |
//! | `arrow-52`    | `arrow=52`    |
//! | `arrow-51`    | `arrow=51`    |
//! | `arrow-50`    | `arrow=50`    |
//! | `arrow-49`    | `arrow=49`    |
//! | `arrow-48`    | `arrow=48`    |
//! | `arrow-47`    | `arrow=47`    |
//! | `arrow-46`    | `arrow=46`    |
//! | `arrow-45`    | `arrow=45`    |
//! | `arrow-44`    | `arrow=44`    |
//! | `arrow-43`    | `arrow=43`    |
//! | `arrow-42`    | `arrow=42`    |
//! | `arrow-41`    | `arrow=41`    |
//! | `arrow-40`    | `arrow=40`    |
//! | `arrow-39`    | `arrow=39`    |
//! | `arrow-38`    | `arrow=38`    |
//! | `arrow-37`    | `arrow=37`    |
//! | `arrow2-0-17` | `arrow2=0.17` |
//! | `arrow2-0-16` | `arrow2=0.16` |
//!
//! # Usage in libraries
//!
//! In libraries, it is not recommended to use the `arrow` and `arrow2` functions directly. Rather,
//! it is recommended to rely on the [`marrow`]-based functionality, as the features of [`marrow`]
//! are designed to be strictly additive.
//!
//! For example, to build a record batch, first build the corresponding marrow types and then use
//! them to build the record batch:
//!
//! ```rust
//! # use serde::{Deserialize, Serialize};
//! # fn main() -> serde_arrow::Result<()> {
//! # #[cfg(has_arrow)] {
//! # use serde_arrow::_impl::arrow;
//! # use std::sync::Arc;
//! # use serde_arrow::schema::{SchemaLike, TracingOptions};
//! #
//! # #[derive(Serialize, Deserialize)]
//! # struct Record {
//! #     a: f32,
//! #     b: i32,
//! # }
//! #
//! # let records = vec![
//! #     Record { a: 1.0, b: 1 },
//! #     Record { a: 2.0, b: 2 },
//! #     Record { a: 3.0, b: 3 },
//! # ];
//! #
//! // Determine Arrow schema
//! let fields = Vec::<marrow::datatypes::Field>::from_type::<Record>(TracingOptions::default())?;
//!
//! // Build the marrow arrays
//! let arrays = serde_arrow::to_marrow(&fields, &records)?;
//!
//! // Build the record batch
//! let arrow_fields = fields.iter()
//!     .map(arrow::datatypes::Field::try_from)
//!     .collect::<Result<Vec<_>, _>>()?;
//!
//! let arrow_arrays = arrays.into_iter()
//!     .map(arrow::array::ArrayRef::try_from)
//!     .collect::<Result<Vec<_>, _>>()?;
//!
//! let record_batch = arrow::array::RecordBatch::try_new(
//!     Arc::new(arrow::datatypes::Schema::new(arrow_fields)),
//!     arrow_arrays,
//! );
//! # }
//! # Ok(())
//! # }
//! ```

// be more forgiving without any active implementation
#[cfg_attr(not(any(has_arrow, has_arrow2)), allow(unused))]
mod internal;

/// *Internal. Do not use*
///
/// This module is an internal implementation detail and not subject to any
/// compatibility promises. It re-exports the arrow impls selected via features
/// to allow usage in doc tests or benchmarks.
///
#[rustfmt::skip]
pub mod _impl {

    #[cfg(has_arrow2_0_17)]
    #[doc(hidden)]
    pub use arrow2_0_17 as arrow2;

    #[cfg(has_arrow2_0_16)]
    pub use arrow2_0_16 as arrow2;

    #[allow(unused)]
    macro_rules! build_arrow_crate {
        ($arrow_array:ident, $arrow_schema:ident) => {
            /// A "fake" arrow crate re-exporting the relevant definitions of the
            /// used arrow-* subcrates
            #[doc(hidden)]
            pub mod arrow {
                /// The raw arrow packages
                pub mod _raw {
                    pub use {$arrow_array as array, $arrow_schema as schema};
                }
                pub mod array {
                    pub use $arrow_array::{RecordBatch, array::{Array, ArrayRef}};
                }
                pub mod datatypes {
                    pub use $arrow_schema::{DataType, Field, FieldRef, Schema, TimeUnit};
                }
                pub mod error {
                    pub use $arrow_schema::ArrowError;
                }
            }
        };
    }

    // arrow-version:insert:     #[cfg(has_arrow_{version})] build_arrow_crate!(arrow_array_{version}, arrow_schema_{version});
    #[cfg(has_arrow_59)] build_arrow_crate!(arrow_array_59, arrow_schema_59);
    #[cfg(has_arrow_58)] build_arrow_crate!(arrow_array_58, arrow_schema_58);
    #[cfg(has_arrow_57)] build_arrow_crate!(arrow_array_57, arrow_schema_57);
    #[cfg(has_arrow_56)] build_arrow_crate!(arrow_array_56, arrow_schema_56);
    #[cfg(has_arrow_55)] build_arrow_crate!(arrow_array_55, arrow_schema_55);
    #[cfg(has_arrow_54)] build_arrow_crate!(arrow_array_54, arrow_schema_54);
    #[cfg(has_arrow_53)] build_arrow_crate!(arrow_array_53, arrow_schema_53);
    #[cfg(has_arrow_52)] build_arrow_crate!(arrow_array_52, arrow_schema_52);
    #[cfg(has_arrow_51)] build_arrow_crate!(arrow_array_51, arrow_schema_51);
    #[cfg(has_arrow_50)] build_arrow_crate!(arrow_array_50, arrow_schema_50);
    #[cfg(has_arrow_49)] build_arrow_crate!(arrow_array_49, arrow_schema_49);
    #[cfg(has_arrow_48)] build_arrow_crate!(arrow_array_48, arrow_schema_48);
    #[cfg(has_arrow_47)] build_arrow_crate!(arrow_array_47, arrow_schema_47);
    #[cfg(has_arrow_46)] build_arrow_crate!(arrow_array_46, arrow_schema_46);
    #[cfg(has_arrow_45)] build_arrow_crate!(arrow_array_45, arrow_schema_45);
    #[cfg(has_arrow_44)] build_arrow_crate!(arrow_array_44, arrow_schema_44);
    #[cfg(has_arrow_43)] build_arrow_crate!(arrow_array_43, arrow_schema_43);
    #[cfg(has_arrow_42)] build_arrow_crate!(arrow_array_42, arrow_schema_42);
    #[cfg(has_arrow_41)] build_arrow_crate!(arrow_array_41, arrow_schema_41);
    #[cfg(has_arrow_40)] build_arrow_crate!(arrow_array_40, arrow_schema_40);
    #[cfg(has_arrow_39)] build_arrow_crate!(arrow_array_39, arrow_schema_39);
    #[cfg(has_arrow_38)] build_arrow_crate!(arrow_array_38, arrow_schema_38);
    #[cfg(has_arrow_37)] build_arrow_crate!(arrow_array_37, arrow_schema_37);

    /// Documentation
    pub mod docs {
        #[doc(hidden)]
        pub mod defs;

        pub mod quickstart;

        #[doc = include_str!("../Status.md")]
        #[cfg(not(doctest))]
        pub mod status {}
    }

    // Reexport for tests
    #[doc(hidden)]
    pub use crate::internal::{
        error::{PanicOnError, PanicOnErrorError},
        serialization::array_builder::ArrayBuilder,
    };
}

#[cfg(all(test, has_arrow, has_arrow2))]
mod test_with_arrow;

#[cfg(test)]
mod test;

pub use crate::internal::error::{Error, ErrorKind, Result};

pub use crate::internal::deserializer::Deserializer;
pub use crate::internal::serializer::Serializer;

pub use crate::internal::array_builder::ArrayBuilder;

#[cfg(has_arrow)]
mod arrow_impl;

#[cfg(has_arrow)]
pub use arrow_impl::{from_arrow, from_record_batch, to_arrow, to_record_batch};

#[cfg(has_arrow2)]
mod arrow2_impl;

#[cfg(has_arrow2)]
pub use arrow2_impl::{from_arrow2, to_arrow2};

#[deny(missing_docs)]
mod marrow_impl;

pub use marrow_impl::{from_marrow, to_marrow};

#[deny(missing_docs)]
/// Helpers that may be useful when using `serde_arrow`
pub mod utils {
    pub use crate::internal::utils::{Item, Items};
}

#[deny(missing_docs)]
/// Deserialization of items
pub mod deserializer {
    pub use crate::internal::deserializer::{DeserializerItem, DeserializerIterator};
}

/// Type mapping between Rust, Serde, and Arrow
///
/// `serde_arrow` bridges three distinct type systems:
///
/// - Rust types: the actual types in your Rust code (`Vec<T>`, structs, enums, etc.)
/// - the [Serde data model][serde-model]: the abstract representation Serde uses during
///   serialization
/// - [Arrow types][arrow-model]: the columnar data types defined by Apache Arrow
///
/// To convert between these type systems, `serde_arrow` requires schema information as a list
/// of Arrow fields with additional metadata. See [`SchemaLike`][crate::schema::SchemaLike] for
/// details on how to specify the schema.
///
/// In most cases, `serde_arrow` expects data as a sequence of records:
///
/// ```rust
/// # struct Record { f0: i32, f1: i32 }
/// # let (v0, v1, v2, v3) = (0_i32, 1_i32, 2_i32, 3_i32);
/// vec![
///     Record { f0: v0, f1: v1 },
///     Record { f0: v2, f1: v3 },
///     // ..
/// ]
/// # ;
/// ```
///
/// The outer container must be one of these [Serde data types][serde-model]:
///
/// | Serde data type | Example Rust types | Comment |
/// |---|---|---|
/// | `seq` | [`Vec<T>`][std::vec::Vec], `&[T]` | variable-sized sequences |
/// | `tuple`, `tuple_struct`, `tuple_variant` | `(T0, T1)`, `[T; N]`, `struct S(T0, T1)` | fixed-sized sequences |
/// | `newtype_struct`, `newtype_variant` | `struct S(T)`, `enum E { V(T) }` | wrappers around the preceding types |
///
/// Each record must be one of these Serde data types:
///
/// | Serde data type | Example Rust types | Comment |
/// |---|---|---|
/// | `struct`, `struct_variant` | `struct S { f0: T0, f1: T1 }` | named fields |
/// | `map` | [`HashMap<K, V>`][std::collections::HashMap], [`BTreeMap<K, V>`][std::collections::BTreeMap] | key-value pairs |
/// | `seq`, `tuple`, `tuple_struct`, `tuple_variant` | `(T0, T1)`, `[T; N]` | ordered fields |
/// | `newtype_struct`, `newtype_variant` | `struct S(T)` | wrappers around the preceding types |
///
/// Schema fields and struct fields do not have to be specified in the same
/// order, but matching order improves lookup performance. Schema fields that
/// are missing from a record are serialized as null. Struct fields that are not
/// part of the schema are ignored. Maps follow the same semantics.
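///
/// These rules can be pictured with a small conceptual model that turns records into one
/// column per schema field. This is only a sketch under simplifying assumptions (integer
/// values, records modeled as maps); the `Record` alias and the `to_columns` helper are
/// hypothetical and do not reflect how `serde_arrow` is implemented.
///
/// ```rust
/// use std::collections::HashMap;
///
/// /// One record with named integer fields (stand-in for a Serde `struct` or `map`).
/// type Record = HashMap<&'static str, i64>;
///
/// /// Collect one column per schema field, in schema order.
/// ///
/// /// A schema field that a record lacks becomes `None` (null); record fields that
/// /// are not listed in the schema are never looked up and are thereby ignored.
/// fn to_columns(schema: &[&str], records: &[Record]) -> Vec<Vec<Option<i64>>> {
///     schema
///         .iter()
///         .map(|name| {
///             records
///                 .iter()
///                 .map(|record| record.get(*name).copied())
///                 .collect()
///         })
///         .collect()
/// }
/// ```
///
/// For the schema `["a", "b"]`, a record without `b` contributes a null to column `b`,
/// while an additional field such as `extra` never shows up in the result.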
///
/// The following table shows how [Serde data types][serde-model], Rust types,
/// and [Arrow types][arrow-model] map to each other:
///
/// | Serde data type | Example Rust types | Default Arrow type |
/// |------------------|-------------------|------------|
/// | `unit` | `()` | `Null` |
/// | `bool` | `bool` | `Boolean` |
/// | `i8`, `i16`, `i32`, `i64` | `i8`, `i16`, `i32`, `i64` | `Int8`, `Int16`, `Int32`, `Int64` |
/// | `u8`, `u16`, `u32`, `u64` | `u8`, `u16`, `u32`, `u64` | `UInt8`, `UInt16`, `UInt32`, `UInt64` |
/// | `char` | `char` | `UInt32` |
/// | `bytes` | | `LargeBinary` |
/// | `f32`, `f64` | `f32`, `f64` | `Float32`, `Float64` |
/// | `str` | `str`, `String`, `&str` | `LargeUtf8` |
/// | `seq` | `Vec<T>`, `&[T]` | `LargeList` |
/// | `struct`, `tuple`, `tuple_struct` | `struct S { .. }`, `(T0, T1)` | `Struct` |
/// | `map` | [`HashMap<K, V>`][std::collections::HashMap], [`BTreeMap<K, V>`][std::collections::BTreeMap] | `Map` |
/// | `unit_variant`, `struct_variant`, `tuple_variant`, `newtype_variant` | `enum E { .. }` | Dense `Union` |
///
/// Enums are mapped to dense Arrow `Union` types, with each variant becoming a separate field:
///
/// - Unit variants (`V`) map to the `Null` Arrow type, but can also be serialized as Arrow string types
/// - Newtype variants (`V(T)`) map to the inner type `T`
/// - Tuple variants or struct variants (`V(T0, T1)`, `V { f0: T0 }`) map to the Arrow `Struct` type
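///
/// To make the dense union layout more concrete, the sketch below lays out a sequence of enum
/// values the way a dense union stores them: one type id per item plus an offset into the child
/// array of the matching variant. The `Item` enum, the `DenseUnion` struct, and the use of the
/// variant index as type id are assumptions made for illustration; they are not types or
/// guarantees of `serde_arrow` or `arrow`.
///
/// ```rust
/// /// Example enum with a unit variant and a newtype variant.
/// enum Item {
///     Empty,
///     Number(i32),
/// }
///
/// /// Simplified stand-in for a dense union array.
/// #[derive(Debug, Default, PartialEq)]
/// struct DenseUnion {
///     /// Which variant each item uses (one entry per item).
///     type_ids: Vec<i8>,
///     /// Position of each item inside the child array of its variant.
///     offsets: Vec<i32>,
///     /// Child for `Empty`: a `Null` array carries no values, only a length.
///     empty_len: i32,
///     /// Child for `Number`: the inner `i32` values.
///     numbers: Vec<i32>,
/// }
///
/// fn build_dense_union(items: &[Item]) -> DenseUnion {
///     let mut result = DenseUnion::default();
///     for item in items {
///         match item {
///             Item::Empty => {
///                 result.type_ids.push(0);
///                 result.offsets.push(result.empty_len);
///                 result.empty_len += 1;
///             }
///             Item::Number(value) => {
///                 result.type_ids.push(1);
///                 result.offsets.push(result.numbers.len() as i32);
///                 result.numbers.push(*value);
///             }
///         }
///     }
///     result
/// }
/// ```
///
/// For `[Number(10), Empty, Number(20)]` the sketch yields the type ids `[1, 0, 1]` and the
/// offsets `[0, 0, 1]`: each child array only grows when its own variant occurs.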
///
/// [serde-model]: https://serde.rs/data-model.html
/// [arrow-model]: https://arrow.apache.org/docs/format/Columnar.html
#[deny(missing_docs)]
pub mod schema {
    pub use crate::internal::schema::{
        Overwrites, SchemaLike, SerdeArrowSchema, Strategy, TracingOptions, STRATEGY_KEY,
    };

    /// Support for [canonical extension types][ext-docs]. This module is experimental without semver guarantees.
    ///
    /// [ext-docs]: https://arrow.apache.org/docs/format/CanonicalExtensions.html
    pub mod ext {
        pub use crate::internal::schema::extensions::{
            Bool8Field, FixedShapeTensorField, VariableShapeTensorField,
        };
    }
}

/// Re-export of the used marrow version
pub use marrow;
401    }
402}
403
404/// Re-export of the used marrow version
405pub use marrow;