Struct DataFrame

pub struct DataFrame { /* private fields */ }

DataFrame is composed of a SparkSession referencing a Spark Connect-enabled cluster, and a LogicalPlanBuilder which represents the unresolved spark::Plan to be submitted to the cluster when an action is called.

The LogicalPlanBuilder is a series of unresolved logical plans; every additional transformation takes the prior spark::Plan and builds onto it. The final unresolved logical plan is submitted to the Spark Connect server.

§create_dataframe & range

A DataFrame can be created with an arrow::array::RecordBatch, or with spark.range(...).

use std::sync::Arc;

use arrow::array::{ArrayRef, Int64Array, RecordBatch, StringArray};

let name: ArrayRef = Arc::new(StringArray::from(vec!["Tom", "Alice", "Bob"]));
let age: ArrayRef = Arc::new(Int64Array::from(vec![14, 23, 16]));

let data = RecordBatch::try_from_iter(vec![("name", name), ("age", age)])?;

let df = spark.create_dataframe(&data).await?;

§sql

A DataFrame can also be created from a spark.sql() statement.

let df = spark.sql("SELECT * FROM json.`/opt/spark/work-dir/datasets/employees.json`").await?;

§read & read_stream

A DataFrame can also be created from a spark.read() or spark.read_stream() statement.

let df = spark
    .read()
    .format("csv")
    .option("header", "True")
    .option("delimiter", ";")
    .load(paths)?;

Implementations§

impl DataFrame

pub fn agg<I>(self, exprs: I) -> DataFrame
where I: IntoIterator<Item: Into<Column>>,

Aggregate on the entire DataFrame without groups (shorthand for df.group_by().agg())
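
A minimal sketch, assuming a df in scope and that spark_connect_rs::functions exposes a max aggregate helper (an assumption; not shown on this page):

§Example:
use spark_connect_rs::functions::{col, max};

async {
    // aggregate across all rows, with no grouping
    df.agg([max(col("age"))]).collect().await?;
}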

pub fn alias(self, alias: &str) -> DataFrame

Returns a new DataFrame with an alias set.

pub async fn approx_quantile<I, P>( self, cols: I, probabilities: P, relative_error: f64, ) -> Result<RecordBatch, SparkError>
where I: IntoIterator<Item: AsRef<str>>, P: IntoIterator<Item = f64>,

Calculates the approximate quantiles of numerical columns of a DataFrame.

pub async fn cache(self) -> DataFrame

Persists the DataFrame with the default storage::StorageLevel::MemoryAndDiskDeser (MEMORY_AND_DISK_DESER).
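
A small sketch, assuming a df in scope; note that cache is async and returns the now-cached DataFrame:

§Example:
async {
    let df = df.cache().await;
    df.count().await?;
}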

pub fn coalesce(self, num_partitions: u32) -> DataFrame

Returns a new DataFrame that has exactly num_partitions partitions.

pub fn col_regex(self, col_name: &str) -> Column

Selects column based on the column name specified as a regex and returns it as Column.

pub async fn collect(self) -> Result<RecordBatch, SparkError>

Returns all records as a RecordBatch

§Example:
async {
    df.collect().await?;
}

pub async fn columns(self) -> Result<Vec<String>, SparkError>

Retrieves the names of all columns in the DataFrame as a Vec<String>. The order of the column names in the list reflects their order in the DataFrame.

pub async fn corr(self, col1: &str, col2: &str) -> Result<f64, SparkError>

Calculates the correlation of two columns of a DataFrame as a f64. Currently only supports the Pearson Correlation Coefficient.
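
A sketch, assuming df has numeric columns named age and salary (hypothetical names):

§Example:
async {
    let pearson: f64 = df.corr("age", "salary").await?;
}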

pub async fn count(self) -> Result<i64, SparkError>

Returns the number of rows in this DataFrame

pub async fn cov(self, col1: &str, col2: &str) -> Result<f64, SparkError>

Calculate the sample covariance for the given columns, specified by their names, as a f64

pub async fn create_global_temp_view(self, name: &str) -> Result<(), SparkError>

Creates a global temporary view with this DataFrame.

pub async fn create_or_replace_global_temp_view( self, name: &str, ) -> Result<(), SparkError>

Creates or replaces a global temporary view using the given name.

pub async fn create_or_replace_temp_view( self, name: &str, ) -> Result<(), SparkError>

Creates or replaces a local temporary view with this DataFrame

pub async fn create_temp_view(self, name: &str) -> Result<(), SparkError>

Creates a local temporary view with this DataFrame

pub fn cross_join(self, other: DataFrame) -> DataFrame

Returns the cartesian product with another DataFrame.

pub fn crosstab(self, col1: &str, col2: &str) -> DataFrame

Computes a pair-wise frequency table of the given columns. Also known as a contingency table.

pub fn cube<I>(self, cols: I) -> GroupedData
where I: IntoIterator<Item: Into<Column>>,

Create a multi-dimensional cube for the current DataFrame using the specified columns, so we can run aggregations on them.

pub fn describe<I, T>(self, cols: Option<I>) -> DataFrame
where I: IntoIterator<Item = T>, T: AsRef<str>,

Computes basic statistics for numeric and string columns, such as count, mean, stddev, min, and max. If no columns are given, statistics are computed for all numeric and string columns.

pub fn distinct(self) -> DataFrame

Returns a new DataFrame containing the distinct rows in this DataFrame.

pub fn drop<I>(self, cols: I) -> DataFrame
where I: IntoIterator<Item: Into<Column>>,

Returns a new DataFrame without the specified columns
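
A small sketch dropping a single column by name:

§Example:
use spark_connect_rs::functions::col;

async {
    df.drop([col("age")]).collect().await?;
}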

pub fn drop_duplicates<I, T>(self, cols: Option<I>) -> DataFrame
where I: IntoIterator<Item = T>, T: AsRef<str>,

Return a new DataFrame with duplicate rows removed, optionally only considering certain columns from a Vec<String>

If no columns are supplied, all columns are used.
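
A sketch de-duplicating on a single assumed column:

§Example:
async {
    // drop duplicates considering only the name column
    df.drop_duplicates(Some(["name"])).collect().await?;
}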

pub fn drop_duplicates_within_waterwmark<I, T>( self, cols: Option<I>, ) -> DataFrame
where I: IntoIterator<Item = T>, T: AsRef<str>,

Return a new DataFrame with duplicate rows removed, optionally only considering certain columns, within watermark.

This only works with a streaming DataFrame, and the watermark for the input DataFrame must be set via with_watermark().

pub fn dropna( self, how: &str, threshold: Option<i32>, subset: Option<Vec<&str>>, ) -> DataFrame

Returns a new DataFrame omitting rows with null values.
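
A sketch dropping rows that contain a null in either of two assumed columns; the how values "any" and "all" follow Spark's convention:

§Example:
async {
    df.dropna("any", None, Some(vec!["name", "age"])).collect().await?;
}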

pub async fn dtypes(self) -> Result<Vec<(String, Kind)>, SparkError>

Returns all column names and their data types as a Vec containing the field name as a String and the spark::data_type::Kind enum

pub fn except_all(self, other: DataFrame) -> DataFrame

Return a new DataFrame containing rows in this DataFrame but not in another DataFrame while preserving duplicates.

pub async fn explain( self, mode: Option<ExplainMode>, ) -> Result<String, SparkError>

Prints the spark::Plan to the console

§Arguments:
  • mode: ExplainMode. Defaults to unspecified. One of:
    • simple
    • extended
    • codegen
    • cost
    • formatted
    • unspecified
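
A sketch, assuming ExplainMode is in scope (e.g., importable alongside JoinType) and that the variant is spelled Simple:

§Example:
async {
    let plan = df.explain(Some(ExplainMode::Simple)).await?;
    println!("{plan}");
}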

pub fn fillna<I, T, L>(self, cols: Option<I>, values: T) -> DataFrame
where I: IntoIterator<Item: AsRef<str>>, T: IntoIterator<Item = L>, L: Into<Literal>,

Replace null values, alias for df.na().fill().
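
A sketch replacing nulls in one assumed column with the literal 0 (integers are assumed to convert via Into<Literal>, as lit(100) suggests in the transform example below):

§Example:
async {
    df.fillna(Some(["age"]), [0]).collect().await?;
}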

pub fn filter(self, condition: impl ToFilterExpr) -> DataFrame

Filters rows using the given condition and returns a new DataFrame

§Example:
async {
    df.filter("salary > 4000").collect().await?;
}

pub async fn first(self) -> Result<RecordBatch, SparkError>

Returns the first row as a RecordBatch.

pub fn freq_items<I, S>(self, cols: I, support: Option<f64>) -> DataFrame
where I: IntoIterator<Item = S>, S: AsRef<str>,

Finding frequent items for columns, possibly with false positives.

pub fn group_by<I>(self, cols: Option<I>) -> GroupedData
where I: IntoIterator<Item: Into<Column>>,

Groups the DataFrame using the specified columns, and returns a GroupedData object
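
A sketch that groups and then aggregates, reusing the max helper assumed in the agg example above:

§Example:
use spark_connect_rs::functions::{col, max};

async {
    df.group_by(Some([col("name")]))
        .agg([max(col("age"))])
        .collect()
        .await?;
}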

pub async fn head(self, n: Option<i32>) -> Result<RecordBatch, SparkError>

Returns the first n rows.

pub fn hint<I>(self, name: &str, parameters: Option<I>) -> DataFrame
where I: IntoIterator<Item: Into<Column>>,

Specifies some hint on the current DataFrame.

pub async fn input_files(self) -> Result<Vec<String>, SparkError>

Returns a best-effort snapshot of the files that compose this DataFrame

pub fn intersect(self, other: DataFrame) -> DataFrame

Return a new DataFrame containing rows only in both this DataFrame and another DataFrame.

pub fn intersect_all(self, other: DataFrame) -> DataFrame

Return a new DataFrame containing rows in both this DataFrame and another DataFrame while preserving duplicates.

pub async fn is_empty(self) -> Result<bool, SparkError>

Checks if the DataFrame is empty and returns a boolean value.

pub async fn is_local(self) -> Result<bool, SparkError>

Returns true if the collect() and take() methods can be run locally (without any Spark executors).

pub async fn is_streaming(self) -> Result<bool, SparkError>

Returns true if this DataFrame contains one or more sources that continuously return data as it arrives.

pub fn join<T: Into<Expression>>( self, other: DataFrame, on: Option<T>, how: JoinType, ) -> DataFrame

Joins with another DataFrame, using the given join expression.

§Example:
use spark_connect_rs::functions::col;
use spark_connect_rs::dataframe::JoinType;

async {
    // join two dataframes where `id` == `name`
    let condition = Some(col("id").eq(col("name")));
    let df = df.join(df2, condition, JoinType::Inner);
}

pub fn limit(self, limit: i32) -> DataFrame

Limits the result count to the number specified and returns a new DataFrame

§Example:
async {
    df.limit(10).collect().await?;
}

pub fn melt<I>( self, ids: I, values: Option<I>, variable_column_name: &str, value_column_name: &str, ) -> DataFrame
where I: IntoIterator<Item: Into<Column>>,

Unpivots a DataFrame from wide format to long format; alias for unpivot (see below).

pub fn na(self) -> DataFrameNaFunctions

Returns a DataFrameNaFunctions for handling missing values.

pub fn offset(self, num: i32) -> DataFrame

Returns a new DataFrame by skipping the first num rows

pub fn order_by<I>(self, cols: I) -> DataFrame
where I: IntoIterator<Item: Into<Column>>,

Returns a new DataFrame sorted by the specified column(s).
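
A small sketch sorting by one assumed column:

§Example:
use spark_connect_rs::functions::col;

async {
    df.order_by([col("age")]).collect().await?;
}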

pub async fn persist(self, storage_level: StorageLevel) -> DataFrame

Sets the storage level to persist the contents of the DataFrame across operations after the first time it is computed.

pub async fn print_schema( self, level: Option<i32>, ) -> Result<String, SparkError>

Prints out the schema in the tree format to a specific level number.

pub fn random_split<I>(self, weights: I, seed: Option<i64>) -> Vec<DataFrame>
where I: IntoIterator<Item = f64> + Clone,

Randomly splits this DataFrame with the provided weights.

pub fn repartition( self, num_partitions: u32, shuffle: Option<bool>, ) -> DataFrame

Returns a new DataFrame partitioned by the given partition number and shuffle option

§Arguments
  • num_partitions: the target number of partitions
  • (optional) shuffle: whether to induce a shuffle. Default is false
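
A sketch targeting 8 partitions without forcing a shuffle:

§Example:
async {
    df.repartition(8, Some(false)).collect().await?;
}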

pub fn repartition_by_range<I>( self, num_partitions: Option<i32>, cols: I, ) -> DataFrame
where I: IntoIterator<Item: Into<Column>>,

Returns a new DataFrame partitioned by the given partitioning expressions.

pub fn replace<I, T>( self, to_replace: T, value: T, subset: Option<I>, ) -> DataFrame
where I: IntoIterator<Item: AsRef<str>>, T: IntoIterator<Item: Into<Literal>>,

Returns a new DataFrame replacing a value with another value.

pub fn rollup<I>(self, cols: I) -> GroupedData
where I: IntoIterator<Item: Into<Column>>,

Create a multi-dimensional rollup for the current DataFrame using the specified columns, and returns a GroupedData object

pub async fn same_semantics(self, other: DataFrame) -> Result<bool, SparkError>

Returns true when the logical query plans inside both DataFrames are equal and therefore return the same results.

pub fn sample( self, lower_bound: f64, upper_bound: f64, with_replacement: Option<bool>, seed: Option<i64>, ) -> DataFrame

Returns a sampled subset of this DataFrame

pub fn sample_by<K, I>( self, col: Column, fractions: I, seed: Option<i64>, ) -> DataFrame
where K: Into<Literal>, I: IntoIterator<Item = (K, f64)>,

Returns a stratified sample without replacement based on the fraction given on each stratum.

pub async fn schema(self) -> Result<DataType, SparkError>

Returns the schema of this DataFrame as a spark::DataType

pub fn select<I>(self, cols: I) -> DataFrame
where I: IntoIterator<Item: Into<Column>>,

Projects a set of expressions and returns a new DataFrame

§Arguments:
  • cols - An iterable of values that can be Columns
§Example:
async {
    df.select(vec![col("age"), col("name")]).collect().await?;
}

pub fn select_expr<I>(self, cols: I) -> DataFrame
where I: IntoIterator<Item: AsRef<str>>,

Project a set of SQL expressions and returns a new DataFrame

This is a variant of select that accepts SQL Expressions

§Example:
async {
    df.select_expr(vec!["id * 2", "abs(id)"]).collect().await?;
}

pub async fn semantic_hash(self) -> Result<i32, SparkError>

Returns a hash code of the logical query plan against this DataFrame.

pub async fn show( self, num_rows: Option<i32>, truncate: Option<i32>, vertical: Option<bool>, ) -> Result<(), SparkError>

Prints the first n rows to the console

§Arguments:
  • num_rows: (int, optional) number of rows to show (default 10)
  • truncate: (int, optional) If set to 0, it truncates the string. Any other number will not truncate the strings
  • vertical: (bool, optional) If set to true, prints output rows vertically (one line per column value).
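
A sketch printing the first 5 rows horizontally with the default truncation behavior:

§Example:
async {
    df.show(Some(5), None, None).await?;
}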

pub fn sort<I>(self, cols: I) -> DataFrame
where I: IntoIterator<Item: Into<Column>>,

Returns a new DataFrame sorted by the specified column(s).

pub fn sort_within_partitions<I>(self, cols: I) -> DataFrame
where I: IntoIterator<Item: Into<Column>>,

Returns a new DataFrame with each partition sorted by the specified column(s).

pub fn spark_session(self) -> Box<SparkSession>

Returns the Spark session that created this DataFrame.

pub fn stat(self) -> DataFrameStatFunctions

Returns a DataFrameStatFunctions for statistic functions.

pub async fn storage_level(self) -> Result<StorageLevel, SparkError>

Get the DataFrame’s current storage level.

pub fn subtract(self, other: DataFrame) -> DataFrame

Return a new DataFrame containing rows in this DataFrame but not in another DataFrame.

pub fn summary<I>(self, statistics: Option<I>) -> DataFrame
where I: IntoIterator<Item: AsRef<str>>,

Computes specified statistics for numeric and string columns. Available statistics are:
  • count
  • mean
  • stddev
  • min
  • max
  • arbitrary approximate percentiles specified as a percentage (e.g., 75%)

If no statistics are given, this function computes count, mean, stddev, min, approximate quartiles (percentiles at 25%, 50%, and 75%), and max
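
A sketch requesting only a subset of the available statistics:

§Example:
async {
    df.summary(Some(["count", "min", "max"])).collect().await?;
}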

pub async fn tail(self, limit: i32) -> Result<RecordBatch, SparkError>

Returns the last n rows as a RecordBatch

Running tail requires moving the data and results in an action

pub async fn take(self, n: i32) -> Result<RecordBatch, SparkError>

Returns the first n rows as a RecordBatch.

pub fn to_df<I>(self, cols: I) -> DataFrame
where I: IntoIterator<Item: AsRef<str>>,

Returns a new DataFrame with the specified column names

pub async fn to_json(self) -> Result<String, SparkError>

Converts a DataFrame into a String representation of JSON

Each row is turned into a JSON document

pub fn transform<F>(self, func: F) -> DataFrame
where F: FnMut(DataFrame) -> DataFrame,

Returns a new DataFrame based on a provided closure.

§Example:
use spark_connect_rs::functions::{col, lit};

// the closure will capture this variable from the current scope
let val = 100;

let add_new_col = |df: DataFrame| -> DataFrame {
    df.with_column("new_col", lit(val)).select([col("new_col")])
};

df = df.transform(add_new_col);

pub fn union(self, other: DataFrame) -> DataFrame

Return a new DataFrame containing the union of rows in this and another DataFrame.
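
A sketch, assuming a second DataFrame df2 with the same schema (as in the join example above):

§Example:
async {
    df.union(df2).collect().await?;
}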

pub fn union_all(self, other: DataFrame) -> DataFrame

Return a new DataFrame containing the union of rows in this and another DataFrame.

pub fn union_by_name( self, other: DataFrame, allow_missing_columns: Option<bool>, ) -> DataFrame

Returns a new DataFrame containing the union of rows in this and another DataFrame.

pub async fn unpersist(self, blocking: Option<bool>) -> DataFrame

Marks the DataFrame as non-persistent, and removes all blocks for it from memory and disk.

pub fn unpivot<I, T>( self, ids: I, values: Option<T>, variable_column_name: &str, value_column_name: &str, ) -> DataFrame
where T: IntoIterator<Item: Into<Column>>, I: IntoIterator<Item: Into<Column>>,

Unpivot a DataFrame from wide format to long format, optionally leaving identifier columns set. This is the reverse to groupBy(…).pivot(…).agg(…), except for the aggregation, which cannot be reversed.

pub fn with_column(self, col_name: &str, col: Column) -> DataFrame

Returns a new DataFrame by adding a column or replacing the existing column that has the same name.
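
A small sketch adding a constant column built with lit:

§Example:
use spark_connect_rs::functions::lit;

async {
    df.with_column("constant", lit(1)).collect().await?;
}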

pub fn with_columns<I, K>(self, col_map: I) -> DataFrame
where I: IntoIterator<Item = (K, Column)>, K: AsRef<str>,

Returns a new DataFrame by adding multiple columns or replacing the existing columns that have the same names.

pub fn with_column_renamed<K, V>(self, existing: K, new: V) -> DataFrame
where K: AsRef<str>, V: AsRef<str>,

Returns a new DataFrame by renaming an existing column.

pub fn with_columns_renamed<I, K, V>(self, cols: I) -> DataFrame
where I: IntoIterator<Item = (K, V)>, K: AsRef<str>, V: AsRef<str>,

Returns a new DataFrame by renaming multiple columns, given an iterator of key/value pairs with the key as the existing column name and the value as the new column name.
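
A small sketch renaming two assumed columns in one call:

§Example:
async {
    df.with_columns_renamed([("age", "years"), ("name", "first_name")])
        .collect()
        .await?;
}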

pub fn with_metadata(self, col: &str, metadata: &str) -> DataFrame

Returns a new DataFrame by updating an existing column with metadata.

pub fn with_watermark( self, event_time: &str, delay_threshold: &str, ) -> DataFrame

Defines an event time watermark for this DataFrame.
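
A sketch for a streaming DataFrame, assuming an event-time column named timestamp:

§Example:
let df = df.with_watermark("timestamp", "10 minutes");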

pub fn write(self) -> DataFrameWriter

Returns a DataFrameWriter struct based on the current DataFrame
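
A sketch, assuming DataFrameWriter mirrors the reader builder shown above with a format method and an async save method (these names are assumptions, not confirmed on this page):

§Example:
async {
    df.write().format("parquet").save("/tmp/output").await?;
}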

pub fn write_stream(self) -> DataStreamWriter

Interface for DataStreamWriter to save the content of the streaming DataFrame out into external storage.

pub fn write_to(self, table: &str) -> DataFrameWriterV2

Create a write configuration builder for v2 sources with DataFrameWriterV2.

Trait Implementations§

impl Clone for DataFrame

fn clone(&self) -> DataFrame

Returns a copy of the value.

fn clone_from(&mut self, source: &Self)

Performs copy-assignment from source.

impl Debug for DataFrame

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Formats the value using the given formatter.

Auto Trait Implementations§

Blanket Implementations§

impl<T> Any for T
where T: 'static + ?Sized,

fn type_id(&self) -> TypeId

Gets the TypeId of self.

impl<T> Borrow<T> for T
where T: ?Sized,

fn borrow(&self) -> &T

Immutably borrows from an owned value.

impl<T> BorrowMut<T> for T
where T: ?Sized,

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value.

impl<T> CloneToUninit for T
where T: Clone,

unsafe fn clone_to_uninit(&self, dst: *mut u8)

🔬This is a nightly-only experimental API. (clone_to_uninit)
Performs copy-assignment from self to dst.

impl<T> From<T> for T

fn from(t: T) -> T

Returns the argument unchanged.

impl<T> FromRef<T> for T
where T: Clone,

fn from_ref(input: &T) -> T

Converts to this type from a reference to the input type.

impl<T> Instrument for T

fn instrument(self, span: Span) -> Instrumented<Self>

Instruments this type with the provided Span, returning an Instrumented wrapper.

fn in_current_span(self) -> Instrumented<Self>

Instruments this type with the current Span, returning an Instrumented wrapper.

impl<T, U> Into<U> for T
where U: From<T>,

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

impl<T> IntoRequest<T> for T

fn into_request(self) -> Request<T>

Wraps the input message T in a tonic::Request

impl<T> ToOwned for T
where T: Clone,

type Owned = T

The resulting type after obtaining ownership.

fn to_owned(&self) -> T

Creates owned data from borrowed data, usually by cloning.

fn clone_into(&self, target: &mut T)

Uses borrowed data to replace owned data, usually by cloning.

impl<T, U> TryFrom<U> for T
where U: Into<T>,

type Error = Infallible

The type returned in the event of a conversion error.

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.

impl<V, T> VZip<V> for T
where V: MultiLane<T>,

fn vzip(self) -> V

impl<T> WithSubscriber for T

fn with_subscriber<S>(self, subscriber: S) -> WithDispatch<Self>
where S: Into<Dispatch>,

Attaches the provided Subscriber to this type, returning a WithDispatch wrapper.

fn with_current_subscriber(self) -> WithDispatch<Self>

Attaches the current default Subscriber to this type, returning a WithDispatch wrapper.

impl<T> ErasedDestructor for T
where T: 'static,

impl<T> MaybeSendSync for T