//! A module for creating and manipulating data frames. A data frame can be
//! created from a [`SoR`] file, or by adding [`Column`]s or [`Row`]s
//! programmatically.
//!
//! A data frame in `liquid_ml` is lightly inspired by those found in `R` or
//! `pandas`, and supports optionally named columns. You may analyze the data
//! in a data frame by implementing the [`Rower`] trait to perform `map` or
//! `filter` operations. These operations can be performed on either
//! [`LocalDataFrame`]s, for data that fits in memory, or
//! [`DistributedDataFrame`]s, for data that is too large to fit on one
//! machine.
//!
//! **Note**: If you need a [`DistributedDataFrame`], we highly recommend
//! using the [`LiquidML`] struct, which provides many convenient helper
//! functions for working with [`DistributedDataFrame`]s. Using a
//! [`DistributedDataFrame`] directly is only recommended if you really know
//! what you are doing. The [`LiquidML`] documentation also has helpful
//! examples of `map` and `filter`.
//!
//! This `dataframe` module provides two implementations of a data frame:
//!
//! # [`LocalDataFrame`]
//!
//! A [`LocalDataFrame`] can be used to analyze data that fits in memory on a
//! single node. It is easy to work with and quick to get up and running
//! during development. We recommend testing and developing your [`Rower`]
//! with a [`LocalDataFrame`].
//!
//! # [`DistributedDataFrame`]
//!
//! A [`DistributedDataFrame`] is an abstraction over a distributed system of
//! nodes that run [`KVStore`]s which contain chunks of [`LocalDataFrame`]s.
//! Therefore each [`DistributedDataFrame`] simply holds a pointer to a
//! [`KVStore`] and a map of ranges of row indices to the [`Key`]s for the
//! chunks of data with that range of row indices. A [`DistributedDataFrame`]
//! is immutable to make it trivial for the global state of the data frame to
//! be consistent.
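//!
//! The range-to-key map can be pictured with the minimal sketch below. It is
//! an illustration only, not this crate's implementation: `ChunkKey`,
//! `ChunkIndex`, and `key_for_row` are hypothetical names, and a `BTreeMap`
//! keyed by each range's first row is just one way to store such a map.
//!
//! ```rust
//! use std::collections::BTreeMap;
//! use std::ops::Range;
//!
//! /// Hypothetical stand-in for a key naming one chunk in a key-value store.
//! #[derive(Debug, Clone, PartialEq)]
//! struct ChunkKey {
//!     name: String,
//!     home: usize, // id of the node assumed to own the chunk
//! }
//!
//! /// Maps the first row of each chunk to the chunk's row range and key.
//! struct ChunkIndex {
//!     chunks: BTreeMap<usize, (Range<usize>, ChunkKey)>,
//! }
//!
//! impl ChunkIndex {
//!     fn new(entries: Vec<(Range<usize>, ChunkKey)>) -> Self {
//!         let chunks = entries
//!             .into_iter()
//!             .map(|(range, key)| (range.start, (range, key)))
//!             .collect();
//!         ChunkIndex { chunks }
//!     }
//!
//!     /// Returns the key of the chunk holding `row`, if a chunk covers it.
//!     fn key_for_row(&self, row: usize) -> Option<&ChunkKey> {
//!         // The only candidate is the chunk with the greatest start <= row.
//!         self.chunks
//!             .range(..=row)
//!             .next_back()
//!             .filter(|(_, (range, _))| range.contains(&row))
//!             .map(|(_, (_, key))| key)
//!     }
//! }
//! ```
//!
//! Because the data frame is immutable, an index like this never needs to be
//! updated after it is built.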
//!
//! Because the data lives in chunks spread across nodes, the
//! [`DistributedDataFrame`] implementation is mainly concerned with
//! networking and with getting and putting chunks in different [`KVStore`]s.
//! One of the main concerns is that creating a new [`DistributedDataFrame`]
//! means distributing the [`Key`]s of all the chunks to all nodes and the
//! chunks to their respective owners.
//!
//! Upon creation, node 1 of a [`DistributedDataFrame`] will distribute chunks
//! of data across multiple nodes from [`SoR`] files, iterators, and other
//! convenient ways of adding data. Our experimental testing found that using
//! the largest chunks that fit on each node increased performance by over
//! `2x`. Our [`from_sor`] constructor optimizes for large chunks, but we have
//! no control over the iterators passed in to [`from_iter`]. If you call
//! [`from_iter`] yourself and care about the performance of `map` and
//! `filter`, you should optimize your iterators for large chunks as well.
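//!
//! One way to act on this advice is to decide chunk boundaries up front. The
//! sketch below is hypothetical: `plan_chunks` and its parameters are not
//! part of this crate, and it assumes you know the total row count and a
//! per-node row limit in advance. It aims for one chunk per node and only
//! splits further when a chunk would exceed the limit.
//!
//! ```rust
//! use std::ops::Range;
//!
//! /// Splits `total_rows` into contiguous row ranges that are as large as
//! /// possible without exceeding `max_rows_per_chunk` (an assumed limit).
//! fn plan_chunks(
//!     total_rows: usize,
//!     num_nodes: usize,
//!     max_rows_per_chunk: usize,
//! ) -> Vec<Range<usize>> {
//!     assert!(num_nodes > 0 && max_rows_per_chunk > 0);
//!     // An even share per node (ceiling division), capped by the limit.
//!     let even_share = (total_rows + num_nodes - 1) / num_nodes;
//!     let chunk_rows = even_share.min(max_rows_per_chunk).max(1);
//!     (0..total_rows)
//!         .step_by(chunk_rows)
//!         .map(|start| start..(start + chunk_rows).min(total_rows))
//!         .collect()
//! }
//! ```
//!
//! An iterator built from such a plan would hand out a few large chunks
//! instead of many small ones.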
//!
//! Data frames use these supplementary data structures, which can be useful
//! in understanding data frames:
//! - [`Row`]: A single row of [`Data`] from the data frame. It provides a
//!   useful API to help implement the [`Rower`] trait.
//! - [`Schema`]: This can be especially useful when a [`SoR`] file is read
//!   and different things need to be done based on the inferred schema.
//!
//! The `dataframe` module also declares the [`Rower`] and [`Fielder`] visitor
//! traits that can be used to build visitors that iterate over the elements of
//! a row or data frame.
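//!
//! The visitor pattern behind these traits can be sketched as follows. This
//! is a simplified, hypothetical stand-in and not the definitions in this
//! module: `SketchRow`, `RowVisitor`, `PositiveSum`, `map_rows`, and
//! `filter_rows` are names made up for the example, and rows are plain
//! vectors of integers.
//!
//! ```rust
//! /// Hypothetical, simplified row: one integer per column.
//! type SketchRow = Vec<i64>;
//!
//! /// Hypothetical row visitor; not the trait defined in this module.
//! trait RowVisitor {
//!     /// Visits one row and reports whether a filter should keep it.
//!     fn visit(&mut self, row: &SketchRow) -> bool;
//! }
//!
//! /// Sums the positive values in one column and accepts only those rows.
//! struct PositiveSum {
//!     column: usize,
//!     sum: i64,
//! }
//!
//! impl RowVisitor for PositiveSum {
//!     fn visit(&mut self, row: &SketchRow) -> bool {
//!         match row.get(self.column) {
//!             Some(&value) if value > 0 => {
//!                 self.sum += value;
//!                 true
//!             }
//!             // Missing or non-positive values are skipped.
//!             _ => false,
//!         }
//!     }
//! }
//!
//! /// Applies the visitor to every row for its side effects (a `map`).
//! fn map_rows<V: RowVisitor>(rows: &[SketchRow], visitor: &mut V) {
//!     for row in rows {
//!         visitor.visit(row);
//!     }
//! }
//!
//! /// Returns copies of the rows the visitor accepts (a `filter`).
//! fn filter_rows<V: RowVisitor>(
//!     rows: &[SketchRow],
//!     visitor: &mut V,
//! ) -> Vec<SketchRow> {
//!     rows.iter().filter(|row| visitor.visit(row)).cloned().collect()
//! }
//! ```
//!
//! The visitor carries its own state, so the same type serves both `map`,
//! where the accumulated state is the result, and `filter`, where the
//! returned `bool` selects rows.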
//!
//! **Note**: We are likely to add iterators to replace the current visitors,
//! since iterators are more idiomatic in Rust.
//!
//! [`Column`]: struct.Column.html
//! [`Row`]: struct.Row.html
//! [`Rower`]: trait.Rower.html
//! [`Fielder`]: trait.Fielder.html
//! [`Schema`]: struct.Schema.html
//! [`Data`]: struct.Data.html
//! [`LocalDataFrame`]: struct.LocalDataFrame.html
//! [`DistributedDataFrame`]: struct.DistributedDataFrame.html
//! [`LiquidML`]: ../struct.LiquidML.html
//! [`KVStore`]: ../kv/struct.KVStore.html
//! [`Key`]: ../kv/struct.Key.html
//! [`SoR`]: https://docs.rs/sorer
//! [`from_sor`]: struct.DistributedDataFrame.html#method.from_sor
//! [`from_iter`]: struct.DistributedDataFrame.html#method.from_iter
pub use ;
pub use DistributedDataFrame;
pub use LocalDataFrame;
pub use Row;
pub use Schema;
/// A field visitor that may be implemented to iterate and visit all the
/// elements of a [`Row`].
///
/// [`Row`]: struct.Row.html
/// A trait for visitors who iterate through and process each row of a
/// data frame.