A module for creating and manipulating data frames. A data frame can be
created from a SoR file, or by adding Columns or Rows
programmatically.
A data frame in liquid_ml is lightly inspired by those found in R or
pandas, and supports optionally named columns. You may analyze the data
in a data frame by implementing the Rower trait to perform map or
filter operations. These operations can be easily performed on either
LocalDataFrames for data that fits in memory or
DistributedDataFrames for data that is too large to fit on one machine.
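
To illustrate the idea of analyzing data through a Rower, here is a minimal, self-contained sketch. The `Data`, `Row`, `Rower`, `SumColumn`, `map`, and `filter` definitions below are simplified, hypothetical stand-ins written for this example; they are not the actual liquid_ml types or signatures, which may differ.

```rust
// Hypothetical, simplified stand-ins for illustration only; these are NOT
// the liquid_ml definitions.
#[derive(Debug, Clone, PartialEq)]
enum Data {
    Int(i64),
    String(String),
    Null,
}

type Row = Vec<Data>;

/// A visitor that is shown every row once.
trait Rower {
    /// Returns `true` if a filter should keep the row.
    fn visit(&mut self, row: &Row) -> bool;
}

/// Sums the integer values found in one column.
struct SumColumn {
    col: usize,
    sum: i64,
}

impl Rower for SumColumn {
    fn visit(&mut self, row: &Row) -> bool {
        match row.get(self.col) {
            Some(Data::Int(n)) => {
                self.sum += *n;
                true
            }
            // Non-integer or missing values are skipped and rejected.
            _ => false,
        }
    }
}

/// `map`: visit every row and hand back the rower with its accumulated state.
fn map<R: Rower>(rows: &[Row], mut rower: R) -> R {
    for row in rows {
        rower.visit(row);
    }
    rower
}

/// `filter`: keep a copy of each row the rower accepts.
fn filter<R: Rower>(rows: &[Row], rower: &mut R) -> Vec<Row> {
    rows.iter().filter(|row| rower.visit(row)).cloned().collect()
}
```

In this sketch, `map(&rows, SumColumn { col: 0, sum: 0 })` returns the rower holding the column total, while `filter` returns only the rows whose first column is an integer.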
Note: If you need a DistributedDataFrame, it is highly recommended
that you check out the LiquidML struct since that provides many
convenient helper functions for working with DistributedDataFrames.
Using a DistributedDataFrame directly is only recommended if you really
know what you are doing. There are also helpful examples of map and
filter in the LiquidML documentation.
This dataframe module provides two implementations of a data frame:
§LocalDataFrame
A LocalDataFrame can be used to analyze data locally on a single node when
the data fits in memory. It is easy to work with and quick to get running
during development. We recommend testing and developing your Rower with a
LocalDataFrame.
§DistributedDataFrame
A DistributedDataFrame is an abstraction over a distributed system of
nodes that run KVStores which contain chunks of LocalDataFrames.
Therefore each DistributedDataFrame simply holds a pointer to a
KVStore and a map of ranges of row indices to the Keys for the
chunks of data with that range of row indices. A DistributedDataFrame
is immutable to make it trivial for the global state of the data frame to
be consistent.
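
The mapping from row-index ranges to chunk Keys described above can be sketched as follows. This is only an illustration under stated assumptions: `Key`, `ChunkIndex`, and `locate` are hypothetical names invented here, the ranges are assumed to be sorted and non-overlapping, and the real DistributedDataFrame may store and look up this information differently.

```rust
use std::ops::Range;

// Hypothetical stand-ins for illustration; not the liquid_ml types.
#[derive(Debug, Clone, PartialEq)]
struct Key {
    name: String,
    /// Id of the node whose KVStore owns the chunk (an assumption).
    home: usize,
}

/// Maps sorted, non-overlapping row ranges to the key of the chunk
/// that holds those rows.
struct ChunkIndex {
    chunks: Vec<(Range<usize>, Key)>,
}

impl ChunkIndex {
    /// Finds the key of the chunk containing `row`, together with the
    /// row's offset inside that chunk.
    fn locate(&self, row: usize) -> Option<(&Key, usize)> {
        // Binary search for the first range that ends after `row`.
        let idx = self.chunks.partition_point(|(range, _)| range.end <= row);
        let (range, key) = self.chunks.get(idx)?;
        if range.contains(&row) {
            Some((key, row - range.start))
        } else {
            None
        }
    }
}
```

Because the index never changes after creation in this sketch, every node can hold an identical copy and resolve any row to its owning chunk without coordination, which mirrors the consistency argument for immutability.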
Because of this, the DistributedDataFrame implementation is mainly
concerned with networking and with getting and putting chunks in different
KVStores. One of the main concerns is that creating a new
DistributedDataFrame means distributing the Keys of all the chunks
to all nodes and the chunks to their respective owners.
Upon creation, node 1 of a DistributedDataFrame will distribute chunks
of data across multiple nodes from SoR files, iterators, and other
convenient ways of adding data. Note that our experimental testing found
that using the largest chunks possible to fit on each node increased
performance by over 2x. Our from_sor constructor optimizes for large
chunks, but we have no control over the iterators passed in to
from_iter, so if you are using this function yourself and care about
the performance of map and filter, then you should also optimize your
iterators this way.
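
One way to prepare large chunks before handing data to an iterator-based constructor is sketched below. The helper `into_large_chunks` is a hypothetical function written for this example and is not part of liquid_ml; it assumes all rows are already in memory and that one chunk per node fits on that node. How the resulting chunks should be passed to from_iter depends on its actual signature, which is not shown here.

```rust
/// Splits `rows` into at most `num_nodes` chunks of near-equal size, so
/// that each node receives one large chunk rather than many small ones.
fn into_large_chunks<T>(rows: Vec<T>, num_nodes: usize) -> Vec<Vec<T>> {
    assert!(num_nodes > 0, "need at least one node");
    // Ceiling division: one chunk per node, sized so all rows are covered.
    let chunk_size = ((rows.len() + num_nodes - 1) / num_nodes).max(1);
    let mut chunks = Vec::new();
    let mut iter = rows.into_iter().peekable();
    // Keep taking `chunk_size` rows until the input is exhausted.
    while iter.peek().is_some() {
        chunks.push(iter.by_ref().take(chunk_size).collect());
    }
    chunks
}
```

For example, ten rows split across four nodes yield chunks of three, three, three, and one row.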
Data frames use these supplementary data structures, which can be useful in understanding DataFrames:
- Row: A single row of Data from the data frame; it provides a useful API to help implement the Rower trait
- Schema: This can be especially useful when a SoR file is read and different things need to be done based on the inferred schema
The dataframe module also declares the Rower and Fielder visitor
traits that can be used to build visitors that iterate over the elements of
a row or data frame.
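
As a rough illustration of a field-level visitor, the sketch below defines a simplified Fielder-style trait and a visitor that counts missing values. All names and signatures here (`Data`, `Fielder`, `accept`, `NullCounter`) are hypothetical stand-ins for this example; the real Fielder trait in liquid_ml may expose different methods.

```rust
// Hypothetical stand-ins for illustration; not the liquid_ml definitions.
enum Data {
    Int(i64),
    String(String),
    Null,
}

/// A visitor over the individual fields of a row.
trait Fielder {
    fn visit_int(&mut self, value: i64);
    fn visit_string(&mut self, value: &str);
    fn visit_null(&mut self);
}

/// Dispatches every field of `row` to the matching `Fielder` method.
fn accept<F: Fielder>(row: &[Data], fielder: &mut F) {
    for field in row {
        match field {
            Data::Int(n) => fielder.visit_int(*n),
            Data::String(s) => fielder.visit_string(s),
            Data::Null => fielder.visit_null(),
        }
    }
}

/// Counts the missing values in the rows it is shown.
#[derive(Default)]
struct NullCounter {
    nulls: usize,
}

impl Fielder for NullCounter {
    fn visit_int(&mut self, _value: i64) {}
    fn visit_string(&mut self, _value: &str) {}
    fn visit_null(&mut self) {
        self.nulls += 1;
    }
}
```

Calling `accept(&row, &mut counter)` on each row leaves the total number of missing fields in `counter.nulls`.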
NOTE: We are likely to add iterators to replace the current visitors, since iterators are more idiomatic in Rust.
Structs§
- DistributedDataFrame
- Represents a distributed, immutable data frame which contains data stored in a columnar format and a well-defined Schema. Provides convenient map and filter methods that operate on the entire distributed data frame (i.e., across different machines) with a given Rower.
- LocalDataFrame
- Represents a local data frame which contains data stored in a columnar format and a well-defined Schema. Is useful for data sets that fit into memory or for testing/debugging purposes.
- Row
- Represents a single row in a data frame.
- Schema
- Represents a Schema of a data frame.
Enums§
- Column
- Represents a column of parsed data from a SoR file.
- Data
- An enumeration of the possible SoR data types, that also contains the data itself.
- DataType
- A plain enumeration of the possible data types used in SoR, this one without its accompanying value.