pub struct DataFrame { /* private fields */ }
DataFrame is composed of a SparkSession referencing a Spark Connect-enabled cluster, and a LogicalPlanBuilder which represents the unresolved spark::Plan to be submitted to the cluster when an action is called.
The LogicalPlanBuilder is a series of unresolved logical plans; every additional transformation takes the prior spark::Plan and builds onto it. The final unresolved logical plan is submitted to the Spark Connect server.
§create_dataframe & range
A DataFrame can be created with an arrow::array::RecordBatch, or with spark.range(...).
use std::sync::Arc;
use arrow::array::{ArrayRef, Int64Array, StringArray};
use arrow::record_batch::RecordBatch;

let name: ArrayRef = Arc::new(StringArray::from(vec!["Tom", "Alice", "Bob"]));
let age: ArrayRef = Arc::new(Int64Array::from(vec![14, 23, 16]));
let data = RecordBatch::try_from_iter(vec![("name", name), ("age", age)])?;
let df = spark.create_dataframe(&data).await?;
§sql
A DataFrame can be created from a spark.sql() statement.
let df = spark.sql("SELECT * FROM json.`/opt/spark/work-dir/datasets/employees.json`").await?;
§read & readStream
A DataFrame can also be created from spark.read() and spark.read_stream() statements.
// any path(s) readable by the cluster; this dataset path mirrors the sql example above
let paths = ["/opt/spark/work-dir/datasets/people.csv"];

let df = spark
    .read()
    .format("csv")
    .option("header", "True")
    .option("delimiter", ";")
    .load(paths)?;
Implementations§
impl DataFrame
pub fn new(spark_session: SparkSession, plan: LogicalPlanBuilder) -> DataFrame
Creates a new DataFrame from a SparkSession and an initial logical plan.
pub fn agg<T: ToVecExpr>(self, exprs: T) -> DataFrame
Aggregate on the entire DataFrame without groups (shorthand for df.groupBy().agg()).
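A minimal sketch of a global aggregation; the min and max helpers are assumed to exist in spark_connect_rs::functions, and exact argument types may differ:

use spark_connect_rs::functions::{col, max, min};

// aggregate over all rows; no grouping keys are involved
let result = df
    .agg(vec![min(col("age")), max(col("age"))])
    .collect()
    .await?;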
pub async fn cache(self) -> DataFrame
Persists the DataFrame with the default storage::StorageLevel::MemoryAndDiskDeser (MEMORY_AND_DISK_DESER).
pub fn coalesce(self, num_partitions: u32) -> DataFrame
Returns a new DataFrame that has exactly num_partitions partitions.
pub async fn count(self) -> Result<i64, SparkError>
Returns the number of rows in this DataFrame.
pub fn col_regex(self, col_name: &str) -> Column
Selects a column based on the column name specified as a regex and returns it as a Column.
pub async fn collect(self) -> Result<RecordBatch, SparkError>
Executes the logical plan and returns the results as a RecordBatch.
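collect is an action: it submits the accumulated plan to the cluster and returns the full result set to the client. A minimal sketch:

let batch = df.collect().await?;
println!("rows fetched: {}", batch.num_rows());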
pub async fn corr(self, col1: &str, col2: &str) -> Result<f64, SparkError>
Calculates the correlation of two columns of a DataFrame as an f64. Currently only supports the Pearson Correlation Coefficient.
pub async fn cov(self, col1: &str, col2: &str) -> Result<f64, SparkError>
Calculates the sample covariance for the given columns, specified by their names, as an f64.
pub async fn create_temp_view(self, name: &str) -> Result<(), SparkError>
Creates a local temporary view with this DataFrame.
pub async fn create_global_temp_view(self, name: &str) -> Result<(), SparkError>
pub async fn create_or_replace_global_temp_view(self, name: &str) -> Result<(), SparkError>
pub async fn create_or_replace_temp_view(self, name: &str) -> Result<(), SparkError>
Creates or replaces a local temporary view with this DataFrame.
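A sketch of registering a view and querying it back through SQL; the view name people is illustrative, and df.clone() relies on DataFrame being Clone:

df.clone().create_or_replace_temp_view("people").await?;
let df_people = spark.sql("SELECT name, age FROM people").await?;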
pub fn cross_join(self, other: DataFrame) -> DataFrame
Returns the cartesian product with another DataFrame.
pub fn crosstab(self, col1: &str, col2: &str) -> DataFrame
Computes a pair-wise frequency table of the given columns. Also known as a contingency table.
pub fn cube<T: ToVecExpr>(self, cols: T) -> GroupedData
Creates a multi-dimensional cube for the current DataFrame using the specified columns, so we can run aggregations on them.
pub fn describe<'a, I>(self, cols: Option<I>) -> DataFrame
pub fn drop<T: ToVecExpr>(self, cols: T) -> DataFrame
Returns a new DataFrame without the specified columns.
pub fn drop_duplicates(self, cols: Option<Vec<&str>>) -> DataFrame
Returns a new DataFrame with duplicate rows removed, optionally considering only a subset of columns. If no columns are supplied, all columns are used.
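For example, de-duplicating on a subset of columns:

// consider only `name` when identifying duplicate rows
let deduped = df.drop_duplicates(Some(vec!["name"]));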
pub fn dropna(self, how: &str, threshold: Option<i32>, subset: Option<Vec<&str>>) -> DataFrame
Returns a new DataFrame omitting rows with null values.
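A usage sketch; following the Spark API, how is expected to be "any" or "all":

// drop rows that contain a null in any column
let df = df.dropna("any", None, None);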
pub async fn dtypes(self) -> Result<Vec<(String, Kind)>, SparkError>
Returns all column names and their data types as a Vec of tuples, each containing the field name as a String and the spark::data_type::Kind enum.
pub fn except_all(self, other: DataFrame) -> DataFrame
Return a new DataFrame containing rows in this DataFrame but not in another DataFrame while preserving duplicates.
pub async fn explain(self, mode: Option<ExplainMode>) -> Result<String, SparkError>
Prints the spark::Plan to the console.
§Arguments:
mode: ExplainMode, defaults to unspecified. One of simple, extended, codegen, cost, formatted, or unspecified.
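A minimal sketch of requesting an extended plan; the ExplainMode variant name and its import path are assumptions based on the modes listed above:

// assumes `ExplainMode::Extended` corresponds to the `extended` mode, and that the enum is in scope
let plan = df.explain(Some(ExplainMode::Extended)).await?;
println!("{plan}");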
pub fn filter<T: ToFilterExpr>(self, condition: T) -> DataFrame
Filters rows using the given condition.
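A minimal usage sketch; ToFilterExpr is implemented for SQL-style string predicates (Column expressions may also be accepted, depending on the trait's implementors):

// keep only rows where `age` is greater than 20
let df = df.filter("age > 20");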
pub async fn first(self) -> Result<RecordBatch, SparkError>
Returns the first row as a RecordBatch.
pub fn freq_items<'a, I>(self, cols: I, support: Option<f64>) -> DataFrame
where I: IntoIterator<Item = &'a str>,
Finding frequent items for columns, possibly with false positives.
pub fn group_by<T: ToVecExpr>(self, cols: Option<T>) -> GroupedData
Groups the DataFrame using the specified columns, and returns a GroupedData object.
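A sketch of grouping then aggregating; the count helper is assumed to exist in spark_connect_rs::functions, and GroupedData is assumed to expose agg per the df.groupBy().agg() shorthand noted above:

use spark_connect_rs::functions::{col, count};

// number of age values per name
let counts = df
    .group_by(Some(vec![col("name")]))
    .agg(vec![count(col("age"))]);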
pub async fn head(self, n: Option<i32>) -> Result<RecordBatch, SparkError>
Returns the first n rows.
pub fn hint<T: ToVecExpr>(self, name: &str, parameters: Option<T>) -> DataFrame
Specifies some hint on the current DataFrame.
pub async fn input_files(self) -> Result<Vec<String>, SparkError>
Returns a best-effort snapshot of the files that compose this DataFrame.
pub fn intersect(self, other: DataFrame) -> DataFrame
Return a new DataFrame containing rows only in both this DataFrame and another DataFrame.
pub fn intersect_all(self, other: DataFrame) -> DataFrame
pub async fn is_empty(self) -> Result<bool, SparkError>
Checks if the DataFrame is empty and returns a boolean value.
pub async fn is_streaming(self) -> Result<bool, SparkError>
Returns true if this DataFrame contains one or more sources that continuously return data as it arrives.
pub fn join<T: ToExpr>(self, other: DataFrame, on: Option<T>, how: JoinType) -> DataFrame
Joins with another DataFrame, using the given join expression.
§Example:
use spark_connect_rs::functions::col;
use spark_connect_rs::dataframe::JoinType;

// join two dataframes where `id` == `name`
let condition = Some(col("id").eq(col("name")));
let df = df.join(df2, condition, JoinType::Inner);
pub fn melt<I, K>(self, ids: I, values: Option<K>, variable_column_name: &str, value_column_name: &str) -> DataFrame
Alias for DataFrame::unpivot.
pub fn order_by<I>(self, cols: I) -> DataFrame
where I: IntoIterator<Item = Column>,
pub async fn persist(self, storage_level: StorageLevel) -> DataFrame
pub async fn print_schema(self, level: Option<i32>) -> Result<String, SparkError>
Prints out the schema in the tree format to a specific level number.
pub fn rollup<T: ToVecExpr>(self, cols: T) -> GroupedData
Creates a multi-dimensional rollup for the current DataFrame using the specified columns, and returns a GroupedData object.
pub async fn same_semantics(self, other: DataFrame) -> Result<bool, SparkError>
Returns true when the logical query plans inside both DataFrames are equal and therefore return the same results.
pub fn sample(self, lower_bound: f64, upper_bound: f64, with_replacement: Option<bool>, seed: Option<i64>) -> DataFrame
Returns a sampled subset of this DataFrame.
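For example, sampling roughly a quarter of the rows without replacement, with a fixed seed for reproducibility:

let sampled = df.sample(0.0, 0.25, Some(false), Some(42));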
pub async fn schema(self) -> Result<DataType, SparkError>
Returns the schema of this DataFrame as a spark::DataType.
pub fn select_expr<'a, I>(self, cols: I) -> DataFrame
where I: IntoIterator<Item = &'a str>,
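Judging from the signature, select_expr projects a set of SQL expression strings into a new DataFrame; a sketch:

// project computed columns using SQL expressions
let df = df.select_expr(["age * 2", "abs(age)"]);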
pub async fn semantic_hash(self) -> Result<i32, SparkError>
pub async fn show(self, num_rows: Option<i32>, truncate: Option<i32>, vertical: Option<bool>) -> Result<(), SparkError>
Prints the first n rows to the console.
§Arguments:
num_rows: (int, optional) number of rows to show (default 10)
truncate: (int, optional) controls string truncation; per the Spark Connect protocol, a value greater than 0 truncates strings to that length, and 0 disables truncation
vertical: (bool, optional) if set to true, prints output rows vertically (one line per column value)
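For example, displaying the first five rows with the default truncation and horizontal layout:

df.show(Some(5), None, None).await?;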
pub fn sort<I>(self, cols: I) -> DataFrame
where I: IntoIterator<Item = Column>,
pub fn spark_session(self) -> Box<SparkSession>
pub async fn storage_level(self) -> Result<StorageLevel, SparkError>
pub fn subtract(self, other: DataFrame) -> DataFrame
pub async fn tail(self, limit: i32) -> Result<RecordBatch, SparkError>
Returns the last n rows as a RecordBatch. Running tail requires moving the data and results in an action.
pub async fn take(self, n: i32) -> Result<RecordBatch, SparkError>
pub fn to_df<'a, I>(self, cols: I) -> DataFrame
where I: IntoIterator<Item = &'a str>,
pub async fn to_json(self) -> Result<String, SparkError>
Converts a DataFrame into a String representation of JSON. Each row is turned into a JSON document.
pub fn union(self, other: DataFrame) -> DataFrame
pub fn union_all(self, other: DataFrame) -> DataFrame
pub fn union_by_name(self, other: DataFrame, allow_missing_columns: Option<bool>) -> DataFrame
pub async fn unpersist(self, blocking: Option<bool>) -> DataFrame
pub fn unpivot<I, K>(self, ids: I, values: Option<K>, variable_column_name: &str, value_column_name: &str) -> DataFrame
Unpivots a DataFrame from wide format to long format, optionally leaving identifier columns set. This is the reverse of groupBy(…).pivot(…).agg(…), except for the aggregation, which cannot be reversed.
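A sketch of unpivoting two value columns into variable/value pairs; the column names are illustrative, and the ids/values arguments are assumed to accept column expressions:

use spark_connect_rs::functions::col;

// keep `id` fixed and fold `int` and `double` into (var, val) rows
let long_df = df.unpivot(
    vec![col("id")],
    Some(vec![col("int"), col("double")]),
    "var",
    "val",
);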
pub fn with_column(self, col_name: &str, col: Column) -> DataFrame
pub fn with_columns<I, K>(self, col_map: I) -> DataFrame
pub fn with_columns_renamed<I, K, V>(self, cols: I) -> DataFrame
Returns a new DataFrame by renaming multiple columns from an iterator of key/value pairs, where the key is the existing column name and the value is the new column name.
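For example, renaming two columns at once, assuming the iterator items can be (&str, &str) pairs:

let df = df.with_columns_renamed([("age", "years"), ("name", "full_name")]);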
pub fn write(self) -> DataFrameWriter
Returns a DataFrameWriter struct based on the current DataFrame.
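A minimal save sketch; the mode, format, and save builder methods and the SaveMode import path are modeled on the crate's examples and should be treated as assumptions:

use spark_connect_rs::dataframe::SaveMode;

df.write()
    .mode(SaveMode::Overwrite)
    .format("parquet")
    .save("/opt/spark/work-dir/output/people") // output path is illustrative
    .await?;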
pub fn write_to(self, table: &str) -> DataFrameWriterV2
pub fn write_stream(self) -> DataStreamWriter
Interface for DataStreamWriter to save the content of the streaming DataFrame out into external storage.
Trait Implementations§
Auto Trait Implementations§
impl Freeze for DataFrame
impl !RefUnwindSafe for DataFrame
impl Send for DataFrame
impl Sync for DataFrame
impl Unpin for DataFrame
impl !UnwindSafe for DataFrame
Blanket Implementations§
impl<T> BorrowMut<T> for T
where T: ?Sized,
fn borrow_mut(&mut self) -> &mut T
impl<T> CloneToUninit for T
where T: Clone,
impl<T> Instrument for T
fn instrument(self, span: Span) -> Instrumented<Self>
fn in_current_span(self) -> Instrumented<Self>
impl<T> IntoRequest<T> for T
fn into_request(self) -> Request<T>
Wrap the input message T in a tonic::Request.