pub struct DistributedDataFrame {
    pub schema: Schema,
    pub df_name: String,
    pub df_chunk_map: HashMap<Range<usize>, Key>,
    pub num_rows: usize,
    pub node_id: usize,
    pub num_nodes: usize,
    pub server_addr: String,
    pub my_ip: String,
    /* private fields */
}
Fields
schema: Schema
The Schema of this DistributedDataFrame
df_name: String
The name of this DistributedDataFrame. Must be unique in a LiquidML
instance
df_chunk_map: HashMap<Range<usize>, Key>
A map from ranges of row indices to the Keys that point to the chunks
of data holding those rows. Not all Keys in this map belong to this node
of the DistributedDataFrame; some may belong to other nodes
num_rows: usize
The number of rows in this entire DistributedDataFrame
node_id: usize
The id of the node this DistributedDataFrame is running on
num_nodes: usize
How many nodes are there in this DistributedDataFrame?
server_addr: String
What’s the address of the Server?
my_ip: String
What’s my IP address?
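The df_chunk_map field maps row ranges to chunk keys, so finding a row means finding the range that contains its index. The following is a minimal sketch of such a lookup, not the library's implementation: the `Key` struct here is a hypothetical stand-in (the real type's fields are not shown in this section), and `locate_row` is an illustrative helper name.

```rust
use std::collections::HashMap;
use std::ops::Range;

/// Hypothetical stand-in for the library's `Key` type; the real
/// definition is not shown in this section.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct Key {
    pub name: String,
    /// Assumed to be the id of the node that owns the chunk.
    pub home: usize,
}

/// Finds the chunk whose row range contains `row_idx`. Returns the
/// chunk's key and the row's offset inside that chunk, or `None` if
/// no range covers the index.
pub fn locate_row(
    chunk_map: &HashMap<Range<usize>, Key>,
    row_idx: usize,
) -> Option<(&Key, usize)> {
    chunk_map
        .iter()
        .find(|(range, _)| range.contains(&row_idx))
        .map(|(range, key)| (key, row_idx - range.start))
}
```

With ranges 0..100 and 100..250 in the map, `locate_row(&map, 120)` yields the key for the second chunk and an offset of 20. The scan is linear in the number of chunks; a map sorted by range start would allow a binary search, at the cost of a different field type than the one documented here.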
Implementations
impl DistributedDataFrame
pub fn get_schema(&self) -> &Schema
Obtains a reference to this DistributedDataFrame's schema.
pub async fn get(
    &self,
    col_idx: usize,
    row_idx: usize,
) -> Result<Data, LiquidError>
Get the data at the given col_idx, row_idx offsets as a boxed value
pub async fn get_row(&self, index: usize) -> Result<Row, LiquidError>
Returns a clone of the row at the requested index
pub fn get_col_idx(&self, col_name: &str) -> Option<usize>
Get the index of the Column with the given col_name. Returns Some(index)
if a Column with the given name exists, or None otherwise.
pub async fn map<T: Rower + Clone + Send + Serialize + DeserializeOwned>(
    &self,
    rower: T,
) -> Result<Option<T>, LiquidError>
Perform a distributed map operation on this DistributedDataFrame with
the given rower. Returns Some(rower) (of the joined results) if the
node_id of this DistributedDataFrame is 1, and None otherwise.
A local pmap is used on each node to map over that node's chunk.
By default, each node will use the number of threads available on that
machine.
NOTE: There is an important design decision here that comes with a distinct trade-off. The two options are:
- Join the last node with the next one until you get to the end. This reduces memory requirements but costs performance because the network calls are synchronous.
- Join all nodes with one node by sending network messages concurrently to the final node. This increases memory requirements and complexity but performs better because all nodes can send to one node asynchronously at the same time.
This implementation uses the first option for simplicity.
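The node-by-node join described in the note can be sketched in a single process as follows. This is an illustration under stated assumptions, not the library's code: `SketchRower` is a simplified stand-in for the `Rower` trait (its method names and signatures are assumptions), rows are reduced to plain `i64` values, each inner `Vec` stands in for one node's rows, and no network calls are made.

```rust
/// Simplified stand-in for the library's `Rower` trait; the method
/// names and signatures are assumptions made for this sketch.
pub trait SketchRower: Clone {
    /// Visits one row (reduced to a single `i64` here).
    fn visit(&mut self, row: i64);
    /// Merges the state of `other` into `self` and returns the result.
    fn join(self, other: Self) -> Self;
}

/// Example rower that sums every value it visits.
#[derive(Clone, Debug, PartialEq)]
pub struct SumRower {
    pub total: i64,
}

impl SketchRower for SumRower {
    fn visit(&mut self, row: i64) {
        self.total += row;
    }

    fn join(mut self, other: Self) -> Self {
        self.total += other.total;
        self
    }
}

/// Maps a clone of `rower` over each node's rows, then joins the
/// partial results one hop at a time, from the last node back to the
/// first. `nodes[0]` plays the role of node 1. Returns `None` when
/// there are no nodes.
pub fn chained_map<R: SketchRower>(rower: &R, nodes: &[Vec<i64>]) -> Option<R> {
    // Local phase: every node maps over its own rows.
    let mut partials: Vec<R> = nodes
        .iter()
        .map(|rows| {
            let mut local = rower.clone();
            for &row in rows {
                local.visit(row);
            }
            local
        })
        .collect();

    // Join phase: the last node hands its result to its predecessor,
    // which joins it with its own result and passes it on.
    let mut acc = partials.pop()?;
    while let Some(prev) = partials.pop() {
        acc = prev.join(acc);
    }
    Some(acc)
}
```

For three nodes holding [1, 2, 3], [10, 20] and [100], `chained_map(&SumRower { total: 0 }, &nodes)` returns a rower whose total is 136. In this sketch all partial results sit in one `Vec` because everything runs in one process; in a chain across machines each hop would only need its own result plus the one it receives, which is where the lower memory requirement comes from.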
pub async fn filter<T: Rower + Clone + Send + Serialize + DeserializeOwned>(
    &self,
    rower: T,
) -> Result<Arc<Self>, LiquidError>
Perform a distributed filter operation on this DistributedDataFrame.
This function does not mutate the DistributedDataFrame in any way;
instead, it creates a new DistributedDataFrame of the results. This
DistributedDataFrame is returned to every node so that the results
are consistent everywhere.
A local pfilter is used on each node to filter over that node's
chunks. By default, each node will use the number of threads available
on that machine.
It is possible to rewrite this to use a bitmap of the rows that should remain in the filtered result, but currently this just clones the rows.
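The difference between the two approaches can be sketched on a plain slice. This is an illustration only: the function names are hypothetical, the predicate closure stands in for a rower, and the bitmap is a simple `Vec<bool>` rather than a packed bit set.

```rust
/// Builds a bitmap with one flag per row: `true` means the row stays
/// in the filtered result.
pub fn filter_bitmap<T>(rows: &[T], keep: impl Fn(&T) -> bool) -> Vec<bool> {
    rows.iter().map(|row| keep(row)).collect()
}

/// Iterates over the surviving rows by reference, without cloning them.
pub fn filtered_view<'a, T>(
    rows: &'a [T],
    bitmap: &'a [bool],
) -> impl Iterator<Item = &'a T> + 'a {
    rows.iter()
        .zip(bitmap.iter())
        .filter(|(_, kept)| **kept)
        .map(|(row, _)| row)
}

/// The cloning approach, for comparison: every surviving row is copied
/// into a new vector.
pub fn filter_cloned<T: Clone>(rows: &[T], keep: impl Fn(&T) -> bool) -> Vec<T> {
    rows.iter().filter(|row| keep(row)).cloned().collect()
}
```

For rows [3, 8, 5, 12] and the predicate "greater than 4", the bitmap is [false, true, true, true] and both approaches yield 8, 5, 12. The bitmap costs one flag per original row regardless of how many rows survive, while cloning costs a full copy of each surviving row; the view also stays tied to the lifetime of the original rows.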
Auto Trait Implementations
impl !Freeze for DistributedDataFrame
impl !RefUnwindSafe for DistributedDataFrame
impl Send for DistributedDataFrame
impl Sync for DistributedDataFrame
impl Unpin for DistributedDataFrame
impl !UnwindSafe for DistributedDataFrame
Blanket Implementations
impl<T> BorrowMut<T> for T
where
    T: ?Sized,
fn borrow_mut(&mut self) -> &mut T
impl<T> IntoEither for T
fn into_either(self, into_left: bool) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self>
if into_left is true.
Converts self into a Right variant of Either<Self, Self>
otherwise.
fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self>
if into_left(&self) returns true.
Converts self into a Right variant of Either<Self, Self>
otherwise.