Struct DataFrame

Source
pub struct DataFrame { /* private fields */ }

A DataFrame is composed of a SparkSession referencing a Spark Connect-enabled cluster, and a LogicalPlanBuilder representing the unresolved spark::Plan to be submitted to the cluster when an action is called.

The LogicalPlanBuilder is a series of unresolved logical plans; each additional transformation takes the prior spark::Plan and builds onto it. The final unresolved logical plan is submitted to the Spark Connect server.

§create_dataframe & range

A DataFrame can be created with an arrow::array::RecordBatch, or with spark.range(...)

use std::sync::Arc;

use arrow::array::{ArrayRef, Int64Array, RecordBatch, StringArray};

let name: ArrayRef = Arc::new(StringArray::from(vec!["Tom", "Alice", "Bob"]));
let age: ArrayRef = Arc::new(Int64Array::from(vec![14, 23, 16]));

let data = RecordBatch::try_from_iter(vec![("name", name), ("age", age)])?;

let df = spark.create_dataframe(&data).await?;

§sql

A DataFrame is created from a spark.sql() statement

let df = spark.sql("SELECT * FROM json.`/opt/spark/work-dir/datasets/employees.json`").await?;

§read & readStream

A DataFrame is also created from a spark.read() and spark.read_stream() statement.

let df = spark
    .read()
    .format("csv")
    .option("header", "True")
    .option("delimiter", ";")
    .load(paths)?;

Implementations§

Source§

impl DataFrame

Source

pub fn new(spark_session: SparkSession, plan: LogicalPlanBuilder) -> DataFrame

Creates a new DataFrame based on a SparkSession and an initial LogicalPlanBuilder

Source

pub fn agg<T: ToVecExpr>(self, exprs: T) -> DataFrame

Aggregate on the entire DataFrame without groups (shorthand for df.groupBy().agg())
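
A minimal sketch, assuming the crate's functions module exposes min and max alongside col (column names are illustrative):

§Example:
use spark_connect_rs::functions::{col, max, min};

async {
    // aggregate over the whole DataFrame, with no grouping
    df.agg(vec![min(col("age")), max(col("age"))]).collect().await?;
}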

Source

pub fn alias(self, alias: &str) -> DataFrame

Returns a new DataFrame with an alias set.

Source

pub async fn cache(self) -> DataFrame

Persists the DataFrame with the default storage::StorageLevel::MemoryAndDiskDeser (MEMORY_AND_DISK_DESER).
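
A minimal sketch; note that cache is async but, unlike most actions, does not return a Result:

§Example:
async {
    let df = df.cache().await;
    df.collect().await?;
}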

Source

pub fn coalesce(self, num_partitions: u32) -> DataFrame

Returns a new DataFrame that has exactly num_partitions partitions.

Source

pub async fn count(self) -> Result<i64, SparkError>

Returns the number of rows in this DataFrame
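
A minimal sketch:

§Example:
async {
    let total: i64 = df.count().await?;
}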

Source

pub fn col_regex(self, col_name: &str) -> Column

Selects a column based on the column name specified as a regex and returns it as a Column.

Source

pub async fn collect(self) -> Result<RecordBatch, SparkError>

Returns all records as a RecordBatch

§Example:
async {
    df.collect().await?;
}
Source

pub async fn columns(self) -> Result<Vec<String>, SparkError>

Retrieves the names of all columns in the DataFrame as a Vec<String>. The order of the column names in the list reflects their order in the DataFrame.
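
A minimal sketch:

§Example:
async {
    let cols: Vec<String> = df.columns().await?;
    println!("{cols:?}");
}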

Source

pub async fn corr(self, col1: &str, col2: &str) -> Result<f64, SparkError>

Calculates the correlation of two columns of a DataFrame as an f64. Currently only supports the Pearson Correlation Coefficient.
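
A minimal sketch (column names are illustrative):

§Example:
async {
    let pearson: f64 = df.corr("age", "salary").await?;
}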

Source

pub async fn cov(self, col1: &str, col2: &str) -> Result<f64, SparkError>

Calculates the sample covariance for the given columns, specified by their names, as an f64

Source

pub async fn create_temp_view(self, name: &str) -> Result<(), SparkError>

Creates a local temporary view with this DataFrame.
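
A minimal sketch; the view name and query are illustrative:

§Example:
async {
    df.create_temp_view("people").await?;
    // the view is now queryable through SQL on the same session
    let adults = spark.sql("SELECT * FROM people WHERE age > 18").await?;
}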

Source

pub async fn create_global_temp_view(self, name: &str) -> Result<(), SparkError>

Source

pub async fn create_or_replace_global_temp_view( self, name: &str, ) -> Result<(), SparkError>

Source

pub async fn create_or_replace_temp_view( self, name: &str, ) -> Result<(), SparkError>

Creates or replaces a local temporary view with this DataFrame

Source

pub fn cross_join(self, other: DataFrame) -> DataFrame

Returns the Cartesian product with another DataFrame.

Source

pub fn crosstab(self, col1: &str, col2: &str) -> DataFrame

Computes a pair-wise frequency table of the given columns. Also known as a contingency table.
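
A minimal sketch (column names are illustrative):

§Example:
async {
    // one row per distinct `age`, one column per distinct `name`
    df.crosstab("age", "name").collect().await?;
}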

Source

pub fn cube<T: ToVecExpr>(self, cols: T) -> GroupedData

Create a multi-dimensional cube for the current DataFrame using the specified columns, so we can run aggregations on them.

Source

pub fn describe<'a, I>(self, cols: Option<I>) -> DataFrame
where I: IntoIterator<Item = &'a str> + Default,

Source

pub fn distinct(self) -> DataFrame

Returns a new DataFrame containing the distinct rows in this DataFrame.

Source

pub fn drop<T: ToVecExpr>(self, cols: T) -> DataFrame

Returns a new DataFrame without the specified columns

Source

pub fn drop_duplicates(self, cols: Option<Vec<&str>>) -> DataFrame

Return a new DataFrame with duplicate rows removed, optionally considering only a subset of columns given as a Vec<&str>

If no columns are supplied, all columns are used
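
A minimal sketch (the column name is illustrative):

§Example:
async {
    // None deduplicates across all columns; Some restricts the comparison
    df.drop_duplicates(Some(vec!["name"])).collect().await?;
}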

Source

pub fn dropna( self, how: &str, threshold: Option<i32>, subset: Option<Vec<&str>>, ) -> DataFrame

Returns a new DataFrame omitting rows with null values.
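
A minimal sketch, assuming `how` follows the usual Spark semantics of "any" or "all":

§Example:
async {
    // drop rows containing at least one null in any column
    df.dropna("any", None, None).collect().await?;
}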

Source

pub async fn dtypes(self) -> Result<Vec<(String, Kind)>, SparkError>

Returns all column names and their data types as a Vec containing the field name as a String and the spark::data_type::Kind enum
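
A minimal sketch, assuming Kind implements Debug:

§Example:
async {
    for (name, kind) in df.dtypes().await? {
        println!("{name}: {kind:?}");
    }
}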

Source

pub fn except_all(self, other: DataFrame) -> DataFrame

Return a new DataFrame containing rows in this DataFrame but not in another DataFrame while preserving duplicates.

Source

pub async fn explain( self, mode: Option<ExplainMode>, ) -> Result<String, SparkError>

Prints the spark::Plan to the console

§Arguments:
  • mode: ExplainMode, defaults to unspecified. One of:
    • simple
    • extended
    • codegen
    • cost
    • formatted
    • unspecified
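
A minimal sketch using the default mode:

§Example:
async {
    let plan = df.explain(None).await?;
    println!("{plan}");
}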
Source

pub fn filter<T: ToFilterExpr>(self, condition: T) -> DataFrame

Filters rows using the given condition and returns a new DataFrame

§Example:
async {
    df.filter("salary > 4000").collect().await?;
}
Source

pub async fn first(self) -> Result<RecordBatch, SparkError>

Returns the first row as a RecordBatch.

Source

pub fn freq_items<'a, I>(self, cols: I, support: Option<f64>) -> DataFrame
where I: IntoIterator<Item = &'a str>,

Finding frequent items for columns, possibly with false positives.

Source

pub fn group_by<T: ToVecExpr>(self, cols: Option<T>) -> GroupedData

Groups the DataFrame using the specified columns, and returns a GroupedData object
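
A minimal sketch, assuming GroupedData exposes an agg method (as referenced by agg above) and that functions::max exists; column names are illustrative:

§Example:
use spark_connect_rs::functions::{col, max};

async {
    df.group_by(Some(vec![col("name")]))
        .agg(max(col("age")))
        .collect()
        .await?;
}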

Source

pub async fn head(self, n: Option<i32>) -> Result<RecordBatch, SparkError>

Returns the first n rows.
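
A minimal sketch:

§Example:
async {
    let first_five = df.head(Some(5)).await?;
}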

Source

pub fn hint<T: ToVecExpr>(self, name: &str, parameters: Option<T>) -> DataFrame

Specifies some hint on the current DataFrame.

Source

pub async fn input_files(self) -> Result<Vec<String>, SparkError>

Returns a best-effort snapshot of the files that compose this DataFrame

Source

pub fn intersect(self, other: DataFrame) -> DataFrame

Return a new DataFrame containing rows only in both this DataFrame and another DataFrame.

Source

pub fn intersect_all(self, other: DataFrame) -> DataFrame

Source

pub async fn is_empty(self) -> Result<bool, SparkError>

Checks if the DataFrame is empty and returns a boolean value.

Source

pub async fn is_streaming(self) -> Result<bool, SparkError>

Returns true if this DataFrame contains one or more sources that continuously return data as it arrives.

Source

pub fn join<T: ToExpr>( self, other: DataFrame, on: Option<T>, how: JoinType, ) -> DataFrame

Joins with another DataFrame, using the given join expression.

§Example:
use spark_connect_rs::functions::col;
use spark_connect_rs::dataframe::JoinType;

async {
    // join two dataframes where `id` == `name`
    let condition = Some(col("id").eq(col("name")));
    let df = df.join(df2, condition, JoinType::Inner);
}
Source

pub fn limit(self, limit: i32) -> DataFrame

Limits the result count to the number specified and returns a new DataFrame

§Example:
async {
    df.limit(10).collect().await?;
}
Source

pub fn melt<I, K>( self, ids: I, values: Option<K>, variable_column_name: &str, value_column_name: &str, ) -> DataFrame
where I: ToVecExpr, K: ToVecExpr,

Source

pub fn offset(self, num: i32) -> DataFrame

Returns a new DataFrame by skipping the first n rows

Source

pub fn order_by<I>(self, cols: I) -> DataFrame
where I: IntoIterator<Item = Column>,

Source

pub async fn persist(self, storage_level: StorageLevel) -> DataFrame

Source

pub async fn print_schema( self, level: Option<i32>, ) -> Result<String, SparkError>

Prints out the schema in the tree format to a specific level number.
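
A minimal sketch, assuming None means no depth limit:

§Example:
async {
    let tree = df.print_schema(None).await?;
    println!("{tree}");
}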

Source

pub fn repartition( self, num_partitions: u32, shuffle: Option<bool>, ) -> DataFrame

Returns a new DataFrame partitioned by the given partition number and shuffle option

§Arguments
  • num_partitions: the target number of partitions
  • (optional) shuffle: to induce a shuffle. Default is false
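
A minimal sketch:

§Example:
async {
    // force a shuffle into 8 partitions
    df.repartition(8, Some(true)).collect().await?;
}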
Source

pub fn rollup<T: ToVecExpr>(self, cols: T) -> GroupedData

Create a multi-dimensional rollup for the current DataFrame using the specified columns, and returns a GroupedData object

Source

pub async fn same_semantics(self, other: DataFrame) -> Result<bool, SparkError>

Returns true when the logical query plans inside both DataFrames are equal and therefore return the same results.

Source

pub fn sample( self, lower_bound: f64, upper_bound: f64, with_replacement: Option<bool>, seed: Option<i64>, ) -> DataFrame

Returns a sampled subset of this DataFrame
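
A minimal sketch, assuming the bounds behave like Spark's fraction-based sampling:

§Example:
async {
    // roughly half the rows, without replacement, reproducible via the seed
    df.sample(0.0, 0.5, None, Some(42)).collect().await?;
}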

Source

pub async fn schema(self) -> Result<DataType, SparkError>

Returns the schema of this DataFrame as a spark::DataType which contains the schema of a DataFrame
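
A minimal sketch:

§Example:
async {
    let schema = df.schema().await?;
}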

Source

pub fn select<T: ToVecExpr>(self, cols: T) -> DataFrame

Projects a set of expressions and returns a new DataFrame

§Example:
async {
    df.select(vec![col("age"), col("name")]).collect().await?;
}
Source

pub fn select_expr<'a, I>(self, cols: I) -> DataFrame
where I: IntoIterator<Item = &'a str>,

Project a set of SQL expressions and returns a new DataFrame

This is a variant of select that accepts SQL Expressions

§Example:
async {
    df.select_expr(vec!["id * 2", "abs(id)"]).collect().await?;
}
Source

pub async fn semantic_hash(self) -> Result<i32, SparkError>

Source

pub async fn show( self, num_rows: Option<i32>, truncate: Option<i32>, vertical: Option<bool>, ) -> Result<(), SparkError>

Prints the first n rows to the console

§Arguments:
  • num_rows: (int, optional) number of rows to show (default 10)
  • truncate: (int, optional) if set to 0, strings are truncated; any other number leaves the strings untruncated
  • vertical: (bool, optional) If set to true, prints output rows vertically (one line per column value).
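
A minimal sketch:

§Example:
async {
    // print 5 rows vertically, one line per column value
    df.show(Some(5), None, Some(true)).await?;
}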
Source

pub fn sort<I>(self, cols: I) -> DataFrame
where I: IntoIterator<Item = Column>,

Source

pub fn spark_session(self) -> Box<SparkSession>

Source

pub async fn storage_level(self) -> Result<StorageLevel, SparkError>

Source

pub fn subtract(self, other: DataFrame) -> DataFrame

Source

pub async fn tail(self, limit: i32) -> Result<RecordBatch, SparkError>

Returns the last n rows as a RecordBatch

Running tail requires moving the data and results in an action
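
A minimal sketch:

§Example:
async {
    let last_three = df.tail(3).await?;
}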

Source

pub async fn take(self, n: i32) -> Result<RecordBatch, SparkError>

Source

pub fn to_df<'a, I>(self, cols: I) -> DataFrame
where I: IntoIterator<Item = &'a str>,

Source

pub async fn to_json(self) -> Result<String, SparkError>

Converts a DataFrame into String representation of JSON

Each row is turned into a JSON document
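
A minimal sketch:

§Example:
async {
    let json: String = df.to_json().await?;
    println!("{json}");
}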

Source

pub fn transform<F>(self, func: F) -> DataFrame
where F: FnMut(DataFrame) -> DataFrame,

Returns a new DataFrame based on a provided closure.

§Example:
// the closure will capture this variable from the current scope
let val = 100;

let add_new_col =
    |df: DataFrame| -> DataFrame { df.with_column("new_col", lit(val)).select("new_col") };

df = df.transform(add_new_col);
Source

pub fn union(self, other: DataFrame) -> DataFrame

Source

pub fn union_all(self, other: DataFrame) -> DataFrame

Source

pub fn union_by_name( self, other: DataFrame, allow_missing_columns: Option<bool>, ) -> DataFrame

Source

pub async fn unpersist(self, blocking: Option<bool>) -> DataFrame

Source

pub fn unpivot<I, K>( self, ids: I, values: Option<K>, variable_column_name: &str, value_column_name: &str, ) -> DataFrame
where I: ToVecExpr, K: ToVecExpr,

Unpivot a DataFrame from wide format to long format, optionally leaving identifier columns set. This is the reverse of groupBy(…).pivot(…).agg(…), except for the aggregation, which cannot be reversed.
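
A minimal sketch (column names are illustrative):

§Example:
use spark_connect_rs::functions::col;

async {
    // wide -> long: keep `id`, fold `age` and `salary` into variable/value pairs
    df.unpivot(
        vec![col("id")],
        Some(vec![col("age"), col("salary")]),
        "variable",
        "value",
    )
    .collect()
    .await?;
}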

Source

pub fn with_column(self, col_name: &str, col: Column) -> DataFrame

Source

pub fn with_columns<I, K>(self, col_map: I) -> DataFrame
where I: IntoIterator<Item = (K, Column)>, K: ToString,

Source

pub fn with_columns_renamed<I, K, V>(self, cols: I) -> DataFrame
where I: IntoIterator<Item = (K, V)>, K: AsRef<str>, V: AsRef<str>,

Returns a new DataFrame by renaming multiple columns, given an iterator of key/value pairs where the key is the existing column name and the value is the new column name.
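
A minimal sketch (column names are illustrative):

§Example:
async {
    df.with_columns_renamed(vec![("age", "age_years"), ("name", "full_name")])
        .collect()
        .await?;
}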

Source

pub fn write(self) -> DataFrameWriter

Returns a DataFrameWriter struct based on the current DataFrame
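
A minimal sketch, assuming DataFrameWriter follows the same builder style as the read() path with format and save; the output path is illustrative:

§Example:
async {
    df.write()
        .format("parquet")
        .save("/tmp/example_output")
        .await?;
}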

Source

pub fn write_to(self, table: &str) -> DataFrameWriterV2

Source

pub fn write_stream(self) -> DataStreamWriter

Interface for DataStreamWriter to save the content of the streaming DataFrame out into external storage.

Trait Implementations§

Source§

impl Clone for DataFrame

Source§

fn clone(&self) -> DataFrame

Returns a copy of the value. Read more
1.0.0 · Source§

fn clone_from(&mut self, source: &Self)

Performs copy-assignment from source. Read more
Source§

impl Debug for DataFrame

Source§

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Formats the value using the given formatter. Read more

Auto Trait Implementations§

Blanket Implementations§

Source§

impl<T> Any for T
where T: 'static + ?Sized,

Source§

fn type_id(&self) -> TypeId

Gets the TypeId of self. Read more
Source§

impl<T> Borrow<T> for T
where T: ?Sized,

Source§

fn borrow(&self) -> &T

Immutably borrows from an owned value. Read more
Source§

impl<T> BorrowMut<T> for T
where T: ?Sized,

Source§

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value. Read more
Source§

impl<T> CloneToUninit for T
where T: Clone,

Source§

unsafe fn clone_to_uninit(&self, dst: *mut u8)

🔬This is a nightly-only experimental API. (clone_to_uninit)
Performs copy-assignment from self to dst. Read more
Source§

impl<T> From<T> for T

Source§

fn from(t: T) -> T

Returns the argument unchanged.

Source§

impl<T> FromRef<T> for T
where T: Clone,

Source§

fn from_ref(input: &T) -> T

Converts to this type from a reference to the input type.
Source§

impl<T> Instrument for T

Source§

fn instrument(self, span: Span) -> Instrumented<Self>

Instruments this type with the provided Span, returning an Instrumented wrapper. Read more
Source§

fn in_current_span(self) -> Instrumented<Self>

Instruments this type with the current Span, returning an Instrumented wrapper. Read more
Source§

impl<T, U> Into<U> for T
where U: From<T>,

Source§

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

Source§

impl<T> IntoRequest<T> for T

Source§

fn into_request(self) -> Request<T>

Wrap the input message T in a tonic::Request
Source§

impl<T> ToOwned for T
where T: Clone,

Source§

type Owned = T

The resulting type after obtaining ownership.
Source§

fn to_owned(&self) -> T

Creates owned data from borrowed data, usually by cloning. Read more
Source§

fn clone_into(&self, target: &mut T)

Uses borrowed data to replace owned data, usually by cloning. Read more
Source§

impl<T, U> TryFrom<U> for T
where U: Into<T>,

Source§

type Error = Infallible

The type returned in the event of a conversion error.
Source§

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.
Source§

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

Source§

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.
Source§

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.
Source§

impl<V, T> VZip<V> for T
where V: MultiLane<T>,

Source§

fn vzip(self) -> V

Source§

impl<T> WithSubscriber for T

Source§

fn with_subscriber<S>(self, subscriber: S) -> WithDispatch<Self>
where S: Into<Dispatch>,

Attaches the provided Subscriber to this type, returning a WithDispatch wrapper. Read more
Source§

fn with_current_subscriber(self) -> WithDispatch<Self>

Attaches the current default Subscriber to this type, returning a WithDispatch wrapper. Read more
Source§

impl<T> ErasedDestructor for T
where T: 'static,

Source§

impl<T> MaybeSendSync for T