// liquid_ml/dataframe/mod.rs

//! A module for creating and manipulating data frames. A data frame can be
//! created from a [`SoR`] file, or by adding [`Column`]s or [`Row`]s
//! programmatically.
//!
//! A data frame in `liquid_ml` is lightly inspired by those found in `R` or
//! `pandas`, and supports optionally named columns. You may analyze the data
//! in a data frame by implementing the [`Rower`] trait to perform `map` or
//! `filter` operations. These operations can be easily performed on either
//! [`LocalDataFrame`]s for data that fits in memory or
//! [`DistributedDataFrame`]s for data that is too large to fit on one machine.
//!
//! **Note**: If you need a [`DistributedDataFrame`], it is highly recommended
//! that you check out the [`LiquidML`] struct, since it provides many
//! convenient helper functions for working with [`DistributedDataFrame`]s.
//! Using a [`DistributedDataFrame`] directly is only recommended if you really
//! know what you are doing. There are also helpful examples of `map` and
//! `filter` in the [`LiquidML`] documentation.
//!
//! This `dataframe` module provides two implementations of a data frame:
//!
//! # [`LocalDataFrame`]
//!
//! A [`LocalDataFrame`] can be used to analyze data locally on a single node
//! when the data fits in memory. It is easy to work with and quick to get up
//! and running during initial development. We recommend that you test and
//! develop your [`Rower`] with a [`LocalDataFrame`].
//!
//! # [`DistributedDataFrame`]
//!
//! A [`DistributedDataFrame`] is an abstraction over a distributed system of
//! nodes that run [`KVStore`]s which contain chunks of [`LocalDataFrame`]s.
//! Therefore each [`DistributedDataFrame`] simply holds a pointer to a
//! [`KVStore`] and a map of ranges of row indices to the [`Key`]s for the
//! chunks of data with that range of row indices. A [`DistributedDataFrame`]
//! is immutable to make it trivial for the global state of the data frame to
//! be consistent.
//!
//! Because of this, the [`DistributedDataFrame`] implementation is mainly
//! concerned with networking and with getting chunks from and putting chunks
//! into different [`KVStore`]s. One of the main concerns is that creating a
//! new [`DistributedDataFrame`] means distributing the [`Key`]s of all the
//! chunks to all nodes and the chunks to their respective owners.
//!
//! Upon creation, node 1 of a [`DistributedDataFrame`] will distribute chunks
//! of data across multiple nodes from [`SoR`] files, iterators, and other
//! convenient ways of adding data. Note that our experimental testing found
//! that using the largest chunks that fit on each node increased performance
//! by over `2x`. Our [`from_sor`] constructor optimizes for large chunks, but
//! we have no control over the iterators passed in to [`from_iter`], so if
//! you are using this function yourself and care about the performance of
//! `map` and `filter`, then you should also optimize your iterators this way.
//!
//! Data frames use these supplementary data structures, which can be useful
//! in understanding data frames:
//!  - [`Row`] : A single row of [`Data`] from the data frame, which provides a
//!     useful API to help implement the [`Rower`] trait
//!  - [`Schema`] : This can be especially useful when a [`SoR`] file is read and
//!     different things need to be done based on the inferred schema
//!
//! The `dataframe` module also declares the [`Rower`] and [`Fielder`] visitor
//! traits that can be used to build visitors that iterate over the elements of
//! a row or data frame.
//!
//! NOTE: We are likely to add iterators to replace the current visitors, since
//! iterators are more idiomatic to write in Rust.
//!
//! [`Column`]: struct.Column.html
//! [`Row`]: struct.Row.html
//! [`Rower`]: trait.Rower.html
//! [`Fielder`]: trait.Fielder.html
//! [`Schema`]: struct.Schema.html
//! [`Data`]: struct.Data.html
//! [`LocalDataFrame`]: struct.LocalDataFrame.html
//! [`DistributedDataFrame`]: struct.DistributedDataFrame.html
//! [`LiquidML`]: ../struct.LiquidML.html
//! [`KVStore`]: ../kv/struct.KVStore.html
//! [`Key`]: ../kv/struct.Key.html
//! [`SoR`]: https://docs.rs/sorer
//! [`from_sor`]: struct.DistributedDataFrame.html#method.from_sor
//! [`from_iter`]: struct.DistributedDataFrame.html#method.from_iter
pub use sorer::{
    dataframe::{Column, Data},
    schema::DataType,
};

mod distributed_dataframe;
pub use distributed_dataframe::DistributedDataFrame;

mod local_dataframe;
pub use local_dataframe::LocalDataFrame;

mod row;
pub use row::Row;

mod schema;
pub use schema::Schema;

/// A field visitor that may be implemented to iterate and visit all the
/// elements of a [`Row`].
///
/// [`Row`]: struct.Row.html
pub trait Fielder {
    /// Called for fields of type `bool` with the value of the field
    fn visit_bool(&mut self, b: bool);

    /// Called for fields of type `float` with the value of the field
    fn visit_float(&mut self, f: f64);

    /// Called for fields of type `int` with the value of the field
    fn visit_int(&mut self, i: i64);

    /// Called for fields of type `String` with the value of the field
    fn visit_string(&mut self, s: &str);

    /// Called for fields where the value of the field is missing. This method
    /// may be as simple as doing nothing but there are use cases where
    /// some operations are required.
    fn visit_null(&mut self);
}
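
/*
A minimal sketch of a `Fielder` implementation, kept in a plain comment so it
does not affect this module. It is not code from this crate: the trait is
restated so the snippet compiles on its own, and `Field`, `accept`, and
`IntSummer` are hypothetical names invented for the illustration. In
particular, `accept` only stands in for however a `Row` actually hands its
fields to a visitor.

```rust
// Stand-in copy of the `Fielder` trait above, restated so the snippet is
// self-contained.
trait Fielder {
    fn visit_bool(&mut self, b: bool);
    fn visit_float(&mut self, f: f64);
    fn visit_int(&mut self, i: i64);
    fn visit_string(&mut self, s: &str);
    fn visit_null(&mut self);
}

// Hypothetical stand-in for one field of a row; not the crate's `Data` type.
enum Field {
    Bool(bool),
    Float(f64),
    Int(i64),
    Text(String),
    Missing,
}

// Hypothetical driver: hands each field to the matching visit method.
fn accept(fields: &[Field], fielder: &mut impl Fielder) {
    for field in fields {
        match field {
            Field::Bool(b) => fielder.visit_bool(*b),
            Field::Float(f) => fielder.visit_float(*f),
            Field::Int(i) => fielder.visit_int(*i),
            Field::Text(s) => fielder.visit_string(s),
            Field::Missing => fielder.visit_null(),
        }
    }
}

// Adds up the integer fields and counts the missing ones; every other field
// type is ignored.
#[derive(Default)]
struct IntSummer {
    sum: i64,
    nulls: usize,
}

impl Fielder for IntSummer {
    fn visit_bool(&mut self, _b: bool) {}
    fn visit_float(&mut self, _f: f64) {}
    fn visit_int(&mut self, i: i64) {
        self.sum += i;
    }
    fn visit_string(&mut self, _s: &str) {}
    fn visit_null(&mut self) {
        self.nulls += 1;
    }
}
```

Passing `[Field::Int(3), Field::Missing, Field::Int(4)]` and a default
`IntSummer` to `accept` leaves `sum == 7` and `nulls == 1`. The `visit_null`
case is where a visitor decides whether a missing value is skipped, counted,
or treated as an error.
*/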

/// A trait for visitors that iterate through and process each row of a
/// data frame.
pub trait Rower {
    /// This function is called once per row of a data frame. The return value
    /// is used in `filter` methods to indicate whether a row should be kept,
    /// and is meaningless when using `map`.
    ///
    /// # Data Frame Mutability
    /// Since the `row` that is visited is only an immutable reference, it is
    /// impossible to mutate a data frame via `map`/`filter` since you can't
    /// mutate each [`Row`] when visiting them. If you wish to get around this
    /// (purposeful) limitation, you may define a [`Rower`] that has a
    /// [`LocalDataFrame`] for one of its fields. Then, in your `visit`
    /// implementation, you may clone each [`Row`] as you visit it, mutate the
    /// clone, then add it to your [`Rower`]'s copy of the [`LocalDataFrame`].
    /// This way you will have the original and the mutated copy after
    /// `map`/`filter`.
    ///
    /// [`Row`]: struct.Row.html
    /// [`Rower`]: trait.Rower.html
    /// [`LocalDataFrame`]: struct.LocalDataFrame.html
    fn visit(&mut self, row: &Row) -> bool;

    /// In all cases, except when using single-threaded `map` with a
    /// [`LocalDataFrame`], the [`Rower`]s being executed in separate threads
    /// or machines will need to be joined and combined to obtain the final
    /// result. This may be as simple as adding up each [`Rower`]'s sum to get
    /// a total sum or may be much more complicated. In most cases, it is
    /// trivial. The returned [`Rower`] will contain the final results.
    ///
    /// [`Rower`]: trait.Rower.html
    /// [`LocalDataFrame`]: struct.LocalDataFrame.html
    fn join(self, other: Self) -> Self;
}
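
/*
A minimal sketch of how `visit` and `join` fit together, kept in a plain
comment so it does not affect this module. It is not code from this crate:
`Row` here is a hypothetical stand-in (a list of optional integers), the
`Rower` trait is restated against that stand-in so the snippet compiles on its
own, and `ColumnSum` and `map_in_two_chunks` are invented names. The driver
only imitates a `map` that runs on two workers; the real `map` and `filter`
implementations may differ.

```rust
// Hypothetical stand-in for the crate's `Row`: one optional integer per
// column, where `None` means the value is missing.
struct Row {
    fields: Vec<Option<i64>>,
}

// The `Rower` trait as declared above, restated against the stand-in `Row`.
trait Rower {
    fn visit(&mut self, row: &Row) -> bool;
    fn join(self, other: Self) -> Self;
}

// Adds up column `col`. Used as a filter, it keeps exactly the rows that
// have a value in that column.
struct ColumnSum {
    col: usize,
    sum: i64,
}

impl Rower for ColumnSum {
    fn visit(&mut self, row: &Row) -> bool {
        // A column index past the end of the row is treated like a missing
        // value.
        match row.fields.get(self.col).copied().flatten() {
            Some(value) => {
                self.sum += value;
                true
            }
            None => false,
        }
    }

    // Combining two partial results is just adding their sums.
    fn join(mut self, other: Self) -> Self {
        self.sum += other.sum;
        self
    }
}

// Hypothetical driver imitating a two-worker `map`: each half of the rows
// gets its own rower, and the partial results are combined with `join`.
fn map_in_two_chunks(rows: &[Row], col: usize) -> ColumnSum {
    let (left, right) = rows.split_at(rows.len() / 2);
    let mut first = ColumnSum { col, sum: 0 };
    let mut second = ColumnSum { col, sum: 0 };
    for row in left {
        first.visit(row);
    }
    for row in right {
        second.visit(row);
    }
    first.join(second)
}
```

For rows holding `Some(1)`, `None`, `Some(5)`, and `Some(-2)` in column 0,
`map_in_two_chunks(&rows, 0).sum` is `4`, the same total a single rower would
reach by visiting every row. That property is what a `join` implementation
needs to preserve.
*/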