DistributedDataFrame

Struct DistributedDataFrame 

Source
pub struct DistributedDataFrame {
    pub schema: Schema,
    pub df_name: String,
    pub df_chunk_map: HashMap<Range<usize>, Key>,
    pub num_rows: usize,
    pub node_id: usize,
    pub num_nodes: usize,
    pub server_addr: String,
    pub my_ip: String,
    /* private fields */
}
Expand description

Represents a distributed, immutable data frame which contains data stored in a columnar format and a well defined Schema. Provides convenient map and filter methods that operate on the entire distributed data frame (ie, across different machines) with a given Rower

Fields§

§schema: Schema

The Schema of this DistributedDataFrame

§df_name: String

The name of this DistributedDataFrame. Must be unique in a LiquidML instance

§df_chunk_map: HashMap<Range<usize>, Key>

A map of the range of row indices to the Keys that point to the chunk of data with those rows. Not all Keys in this map belong to this node of the DistributedDataFrame, some may belong to other nodes

§num_rows: usize

The number of rows in this entire DistributedDataFrame

§node_id: usize

The id of the node this DistributedDataFrame is running on

§num_nodes: usize

How many nodes are there in this DistributedDataFrame?

§server_addr: String

What’s the address of the Server?

§my_ip: String

What’s my IP address?

Implementations§

Source§

impl DistributedDataFrame

Source

pub fn get_schema(&self) -> &Schema

Obtains a reference to this DistributedDataFrames schema.

Source

pub async fn get( &self, col_idx: usize, row_idx: usize, ) -> Result<Data, LiquidError>

Get the data at the given col_idx, row_idx offsets as a boxed value

Source

pub async fn get_row(&self, index: usize) -> Result<Row, LiquidError>

Returns a clone of the row at the requested index

Source

pub fn get_col_idx(&self, col_name: &str) -> Option<usize>

Get the index of the Column with the given col_name. Returns Some if a Column with the given name exists, or None otherwise.

Source

pub async fn map<T: Rower + Clone + Send + Serialize + DeserializeOwned>( &self, rower: T, ) -> Result<Option<T>, LiquidError>

Perform a distributed map operation on this DistributedDataFrame with the given rower. Returns Some(rower) (of the joined results) if the node_id of this DistributedDataFrame is 1, and None otherwise.

A local pmap is used on each node to map over that nodes’ chunk. By default, each node will use the number of threads available on that machine.

NOTE: There is an important design decision that comes with a distinct trade off here. The trade off is:

  1. Join the last node with the next one until you get to the end. This has reduced memory requirements but a performance impact because of the synchronous network calls
  2. Join all nodes with one node by sending network messages concurrently to the final node. This has increased memory requirements and greater complexity but greater performance because all nodes can asynchronously send to one node at the same time.

This implementation went with option 1 for simplicity reasons

Source

pub async fn filter<T: Rower + Clone + Send + Serialize + DeserializeOwned>( &self, rower: T, ) -> Result<Arc<Self>, LiquidError>

Perform a distributed filter operation on this DistributedDataFrame. This function does not mutate the DistributedDataFrame in anyway, instead, it creates a new DistributedDataFrame of the results. This DistributedDataFrame is returned to every node so that the results are consistent everywhere.

A local pfilter is used on each node to filter over that nodes’ chunks. By default, each node will use the number of threads available on that machine.

It is possible to re-write this to use a bit map of the rows that should remain in the filtered result, but currently this just clones the rows.

Source

pub fn n_rows(&self) -> usize

Return the (total) number of rows across all nodes for this DistributedDataFrame

Source

pub fn n_cols(&self) -> usize

Return the number of columns in this DistributedDataFrame.

Trait Implementations§

Source§

impl Debug for DistributedDataFrame

Source§

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Formats the value using the given formatter. Read more

Auto Trait Implementations§

Blanket Implementations§

Source§

impl<T> Any for T
where T: 'static + ?Sized,

Source§

fn type_id(&self) -> TypeId

Gets the TypeId of self. Read more
Source§

impl<T> Borrow<T> for T
where T: ?Sized,

Source§

fn borrow(&self) -> &T

Immutably borrows from an owned value. Read more
Source§

impl<T> BorrowMut<T> for T
where T: ?Sized,

Source§

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value. Read more
Source§

impl<T> From<T> for T

Source§

fn from(t: T) -> T

Returns the argument unchanged.

Source§

impl<T, U> Into<U> for T
where U: From<T>,

Source§

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

Source§

impl<T> IntoEither for T

Source§

fn into_either(self, into_left: bool) -> Either<Self, Self>

Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise. Read more
Source§

fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
where F: FnOnce(&Self) -> bool,

Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise. Read more
Source§

impl<T> Pointable for T

Source§

const ALIGN: usize

The alignment of pointer.
Source§

type Init = T

The type for initializers.
Source§

unsafe fn init(init: <T as Pointable>::Init) -> usize

Initializes a with the given initializer. Read more
Source§

unsafe fn deref<'a>(ptr: usize) -> &'a T

Dereferences the given pointer. Read more
Source§

unsafe fn deref_mut<'a>(ptr: usize) -> &'a mut T

Mutably dereferences the given pointer. Read more
Source§

unsafe fn drop(ptr: usize)

Drops the object pointed to by the given pointer. Read more
Source§

impl<T, U> TryFrom<U> for T
where U: Into<T>,

Source§

type Error = Infallible

The type returned in the event of a conversion error.
Source§

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.
Source§

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

Source§

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.
Source§

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.
Source§

impl<V, T> VZip<V> for T
where V: MultiLane<T>,

Source§

fn vzip(self) -> V