Module dataframe


A module for creating and manipulating data frames. A data frame can be created from a SoR file, or by adding Columns or Rows programmatically.

A data frame in liquid_ml is lightly inspired by those found in R or pandas, and supports optionally named columns. You may analyze the data in a data frame by implementing the Rower trait to perform map or filter operations. These operations can be performed either on LocalDataFrames, for data that fits in memory, or on DistributedDataFrames, for data that is too large to fit on one machine.

Note: If you need a DistributedDataFrame, it is highly recommended that you check out the LiquidML struct, since it provides many convenient helper functions for working with DistributedDataFrames. Using a DistributedDataFrame directly is only recommended if you really know what you are doing. There are also helpful examples of map and filter in the LiquidML documentation.

This dataframe module provides two implementations of a data frame:

§LocalDataFrame

A LocalDataFrame can be used to analyze data locally on a single node when the data fits in memory. It is very easy to work with and to get up and running when you first start developing. We recommend that you test and develop your Rower with a LocalDataFrame.
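To make the Rower workflow concrete, here is a minimal sketch of a row visitor driving map and filter over in-memory rows. Everything in it is a simplified stand-in defined for illustration: the `Data`, `Row`, and `Rower` definitions, the free functions `map` and `filter`, and the `SumColumn` and `sample_rows` helpers are assumptions, not the liquid_ml signatures, so check the Rower and LocalDataFrame documentation for the real API.

```rust
// Simplified stand-ins for illustration only. These are NOT the liquid_ml
// types; the real `Row`, `Rower`, and `LocalDataFrame` have richer APIs.
#[derive(Clone, Debug, PartialEq)]
enum Data {
    Int(i64),
    Float(f64),
    Bool(bool),
    String(String),
    Null,
}

type Row = Vec<Data>;

/// Hypothetical visitor trait: `visit` sees one row at a time and returns
/// whether the row should be kept (used by `filter`, ignored by `map`).
trait Rower {
    fn visit(&mut self, row: &Row) -> bool;
}

/// Example rower: sums the integer values found in one column.
struct SumColumn {
    col: usize,
    sum: i64,
}

impl Rower for SumColumn {
    fn visit(&mut self, row: &Row) -> bool {
        if let Some(Data::Int(v)) = row.get(self.col) {
            self.sum += *v;
            true
        } else {
            // Missing, null, or non-integer value: reject the row.
            false
        }
    }
}

/// Runs the rower over every row and returns it with its accumulated state.
fn map<R: Rower>(rows: &[Row], mut rower: R) -> R {
    for row in rows {
        rower.visit(row);
    }
    rower
}

/// Returns copies of the rows that the rower accepts.
fn filter<R: Rower>(rows: &[Row], rower: &mut R) -> Vec<Row> {
    rows.iter().filter(|row| rower.visit(row)).cloned().collect()
}

/// Small hand-written data set for trying out a rower locally.
fn sample_rows() -> Vec<Row> {
    vec![
        vec![Data::Int(1), Data::String("a".to_string())],
        vec![Data::Int(2), Data::Bool(true)],
        vec![Data::Null, Data::Float(1.5)],
        vec![Data::Int(3), Data::Null],
    ]
}
```

Calling `map(&sample_rows(), SumColumn { col: 0, sum: 0 })` returns the rower with its sum filled in, while `filter` returns only the rows whose first column holds an integer. Developing against a small local data set like this keeps the feedback loop short before moving to a distributed run.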

§DistributedDataFrame

A DistributedDataFrame is an abstraction over a distributed system of nodes that run KVStores which contain chunks of LocalDataFrames. Therefore each DistributedDataFrame simply holds a pointer to a KVStore and a map of ranges of row indices to the Keys for the chunks of data with that range of row indices. A DistributedDataFrame is immutable to make it trivial for the global state of the data frame to be consistent.
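The mapping from row ranges to chunk Keys can be pictured with the sketch below. It is only an illustration of the idea: the `Key` fields, the `ChunkIndex` type, the `build_index` helper, the `chunk-N` naming, and the assumption that chunk `i` lives on node `i + 1` are all hypothetical and do not come from liquid_ml.

```rust
use std::ops::Range;

/// Hypothetical key: names a chunk and records the node that owns it.
#[derive(Clone, Debug, PartialEq)]
struct Key {
    name: String,
    home: usize,
}

/// Hypothetical index: non-overlapping row ranges paired with chunk keys.
/// Nothing mutates it after construction, mirroring the immutable design.
struct ChunkIndex {
    chunks: Vec<(Range<usize>, Key)>,
}

impl ChunkIndex {
    /// Finds the key of the chunk that holds the given row, if any.
    fn key_for_row(&self, row: usize) -> Option<&Key> {
        self.chunks
            .iter()
            .find(|(range, _)| range.contains(&row))
            .map(|(_, key)| key)
    }
}

/// Builds an index from chunk lengths, assuming (for illustration only)
/// that chunk `i` is owned by node `i + 1`.
fn build_index(chunk_lens: &[usize]) -> ChunkIndex {
    let mut start = 0;
    let mut chunks = Vec::new();
    for (i, &len) in chunk_lens.iter().enumerate() {
        let key = Key {
            name: format!("chunk-{}", i),
            home: i + 1,
        };
        chunks.push((start..start + len, key));
        start += len;
    }
    ChunkIndex { chunks }
}
```

With an index like this, answering "which node holds row 150?" is a lookup followed by a fetch from that node's KVStore, and no row data needs to be held by the data frame itself.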

Because of this, the DistributedDataFrame implementation is mainly concerned with networking and with getting and putting chunks in different KVStores. One of the main concerns is that creating a new DistributedDataFrame means distributing the Keys of all the chunks to all nodes and the chunks themselves to their respective owners.

Upon creation, node 1 of a DistributedDataFrame distributes chunks of data across multiple nodes. The data can come from SoR files, iterators, or other convenient sources. Note that our experimental testing found that using the largest chunks that fit on each node increased performance by over 2x. Our from_sor constructor optimizes for large chunks, but we have no control over the iterators passed in to from_iter, so if you call that function yourself and care about the performance of map and filter, you should make your iterators yield large chunks as well.
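One way to follow this advice is to regroup a fine-grained iterator into large chunks before handing it on. The sketch below is a generic helper written with the standard library only. The name `into_chunks` and its `chunk_size` parameter are hypothetical, and the item type that from_iter actually expects is not shown here, so adapt the element type to the real signature.

```rust
/// Regroups the items of `iter` into `Vec`s of up to `chunk_size` items.
/// Hypothetical helper for illustration; not part of liquid_ml.
fn into_chunks<T>(
    iter: impl Iterator<Item = T>,
    chunk_size: usize,
) -> impl Iterator<Item = Vec<T>> {
    assert!(chunk_size > 0, "chunk_size must be positive");
    // `fuse` guarantees `None` keeps being returned once the source ends.
    let mut iter = iter.fuse();
    std::iter::from_fn(move || {
        let chunk: Vec<T> = iter.by_ref().take(chunk_size).collect();
        if chunk.is_empty() {
            None
        } else {
            Some(chunk)
        }
    })
}
```

Every chunk except possibly the last holds exactly `chunk_size` items, so picking a size close to what one node can hold yields the large chunks described above.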

Data frames use the following supplementary data structures, which can be useful for understanding DataFrames:

  • Row : A single row of Data from the data frame. It provides a useful API to help implement the Rower trait
  • Schema : The column types of a data frame. This can be especially useful when a SoR file is read and different things need to be done based on the inferred schema
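As an illustration of acting on an inferred schema, the sketch below picks out the numeric columns and produces a label for each column. The `DataType` variants, the `Schema` fields, and both helper functions are simplified assumptions made for this example and may differ from the actual liquid_ml definitions.

```rust
/// Simplified stand-in for a type tag without a value (assumed variants).
#[derive(Clone, Copy, Debug, PartialEq)]
enum DataType {
    Int,
    Float,
    Bool,
    String,
}

/// Hypothetical schema: one type per column plus an optional column name.
struct Schema {
    types: Vec<DataType>,
    names: Vec<Option<String>>,
}

/// Returns the indices of the columns that hold numeric data.
fn numeric_columns(schema: &Schema) -> Vec<usize> {
    schema
        .types
        .iter()
        .enumerate()
        .filter(|(_, t)| matches!(t, DataType::Int | DataType::Float))
        .map(|(i, _)| i)
        .collect()
}

/// Returns the column's name if it has one, otherwise its index as text.
fn column_label(schema: &Schema, i: usize) -> String {
    schema
        .names
        .get(i)
        .cloned()
        .flatten()
        .unwrap_or_else(|| i.to_string())
}
```

A rower that computes statistics could use a helper like `numeric_columns` to decide up front which columns to visit, instead of checking the type of every value in every row.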

The dataframe module also declares the Rower and Fielder visitor traits that can be used to build visitors that iterate over the elements of a row or data frame.

NOTE: We are likely to add iterators to replace the current visitors, since iterators are more idiomatic in Rust.
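To show the difference between the two styles, the sketch below counts the null fields of a row twice: once with a field visitor and once with an iterator. The `Data` enum, the `Fielder` method names, and the `accept` function are hypothetical stand-ins chosen for this example, not the liquid_ml definitions.

```rust
/// Simplified stand-in for a typed value (assumed variants).
#[derive(Clone, Debug, PartialEq)]
enum Data {
    Int(i64),
    Bool(bool),
    Null,
}

/// Hypothetical field visitor with one callback per kind of value.
trait Fielder {
    fn visit_int(&mut self, value: i64);
    fn visit_bool(&mut self, value: bool);
    fn visit_null(&mut self);
}

/// Visitor that counts how many fields are null.
struct NullCounter {
    nulls: usize,
}

impl Fielder for NullCounter {
    fn visit_int(&mut self, _value: i64) {}
    fn visit_bool(&mut self, _value: bool) {}
    fn visit_null(&mut self) {
        self.nulls += 1;
    }
}

/// Walks the row and dispatches each field to the matching callback.
fn accept<F: Fielder>(row: &[Data], fielder: &mut F) {
    for field in row {
        match field {
            Data::Int(v) => fielder.visit_int(*v),
            Data::Bool(b) => fielder.visit_bool(*b),
            Data::Null => fielder.visit_null(),
        }
    }
}

/// The same count written with iterator adapters.
fn count_nulls_iter(row: &[Data]) -> usize {
    row.iter().filter(|field| matches!(field, Data::Null)).count()
}
```

The visitor needs a trait, a struct, and a driver function, while the iterator version is a single expression that composes with the rest of the standard library adapters.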

Structs§

DistributedDataFrame
Represents a distributed, immutable data frame which contains data stored in a columnar format and a well-defined Schema. Provides convenient map and filter methods that operate on the entire distributed data frame (i.e., across different machines) with a given Rower.
LocalDataFrame
Represents a local data frame which contains data stored in a columnar format and a well-defined Schema. It is useful for data sets that fit into memory or for testing and debugging purposes.
Row
Represents a single row in a data frame.
Schema
Represents the Schema of a data frame.

Enums§

Column
Represents a column of parsed data from a SoR file.
Data
An enumeration of the possible SoR data types that also contains the data itself.
DataType
A plain enumeration of the possible data types used in SoR, without the accompanying values.

Traits§

Fielder
A field visitor that may be implemented to iterate and visit all the elements of a Row.
Rower
A trait for visitors that iterate through and process each row of a data frame.