pub struct DataFrame { /* private fields */ }
The main tabular data structure.
A thin wrapper around an eager Polars DataFrame.
Implementations
impl DataFrame
pub fn from_polars(df: PlDataFrame) -> Self
Create a new DataFrame from a Polars DataFrame (case-insensitive column matching by default).
pub fn from_polars_with_options(df: PlDataFrame, case_sensitive: bool) -> Self
Create a new DataFrame from a Polars DataFrame with explicit case sensitivity.
When case_sensitive is false, column resolution is case-insensitive (PySpark default).
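A minimal sketch of the case-insensitive default; the df! macro, PolarsError, and the function wrapper come from polars and are assumptions of this example, not part of this API:

use polars::df;
use polars::prelude::PolarsError;

fn demo() -> Result<(), PolarsError> {
    // df! builds the Polars DataFrame that PlDataFrame aliases.
    let pl = df!("Name" => &["alice", "bob"], "Age" => &[30i64, 40])?;
    let frame = DataFrame::from_polars_with_options(pl, false);
    // case_sensitive = false: "age" resolves to the "Age" column.
    let ages = frame.select(vec!["age"])?;
    assert_eq!(ages.count()?, 2);
    Ok(())
}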
pub fn resolve_expr_column_names(&self, expr: Expr) -> Result<Expr, PolarsError>
Resolve column names in a Polars expression against this DataFrame's schema. When case_sensitive is false, column references (e.g. col("name")) are resolved case-insensitively (PySpark default). Call this before filter, select_with_exprs, or order_by_exprs. Names that appear only as alias outputs (e.g. in expr.alias("partial")) are not resolved as input columns, so select(col("x").substr(1, 3).alias("partial")), when().then().otherwise().alias("result"), and col("x").rank().over([]).alias("rank") all work (issues #200, #212).
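A sketch, assuming the underlying column is named "age" and that polars' col/lit helpers are in scope:

use polars::prelude::{col, lit};

let expr = col("AGE").gt(lit(21)); // case differs from the actual "age" column
let expr = frame.resolve_expr_column_names(expr)?;
let adults = frame.filter(expr)?;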
pub fn coerce_string_numeric_comparisons(
    &self,
    expr: Expr,
) -> Result<Expr, PolarsError>
Rewrite comparison expressions to apply PySpark-style type coercion.
This walks the expression tree and, for comparison operators where one side is
a column and the other is a numeric literal, delegates to
coerce_for_pyspark_comparison so that string–numeric comparisons behave like
PySpark (string values parsed to numbers where possible, invalid strings treated
as null/non-matching).
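For example, comparing a string column against a number (a sketch; the "amount" column is a placeholder):

use polars::prelude::{col, lit};

// "amount" is a string column holding values like "10", "7", "n/a".
let raw = col("amount").gt(lit(5));
let coerced = frame.coerce_string_numeric_comparisons(raw)?;
// "n/a" is treated as null and simply does not match, instead of erroring.
let big = frame.filter(coerced)?;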
pub fn resolve_column_name(&self, name: &str) -> Result<String, PolarsError>
Resolve a logical column name to the actual column name in the schema. When case_sensitive is false, matches case-insensitively.
pub fn schema(&self) -> Result<StructType, PolarsError>
Get the schema of the DataFrame.
pub fn count(&self) -> Result<usize, PolarsError>
Count the number of rows (an action: triggers execution).
pub fn collect(&self) -> Result<Arc<PlDataFrame>, PolarsError>
Collect the DataFrame (an action: triggers execution).
pub fn collect_as_json_rows(
    &self,
) -> Result<Vec<HashMap<String, JsonValue>>, PolarsError>
Collect as rows of column-name -> JSON value. For use by language bindings (Node, etc.).
pub fn select_exprs(&self, exprs: Vec<Expr>) -> Result<DataFrame, PolarsError>
Select columns (returns a new DataFrame). Accepts either column names (strings) or Column expressions (e.g. from regexp_extract_all(…).alias("m")). Column names are resolved according to case sensitivity.
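A sketch mixing a bare column with an aliased expression (placeholder names):

use polars::prelude::col;

let projected = frame.select_exprs(vec![
    col("name"),
    col("age").alias("years"),
])?;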
pub fn select(&self, cols: Vec<&str>) -> Result<DataFrame, PolarsError>
Select columns by name (returns a new DataFrame). Column names are resolved according to case sensitivity.
pub fn filter(&self, condition: Expr) -> Result<DataFrame, PolarsError>
Filter rows using a Polars expression.
pub fn column(&self, name: &str) -> Result<Column, PolarsError>
Get a column reference by name (for building expressions). Respects case sensitivity: when false, "Age" resolves to column "age" if present.
pub fn with_column(
    &self,
    column_name: &str,
    col: &Column,
) -> Result<DataFrame, PolarsError>
Add or replace a column. Use a Column (e.g. from col("x"), rand(42), randn(42)).
For rand/randn, generates one distinct value per row (PySpark-like).
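For instance, duplicating an existing column under a new name (a sketch; "salary" is a placeholder):

let salary = frame.column("salary")?;
let frame2 = frame.with_column("salary_copy", &salary)?;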
pub fn with_column_expr(
    &self,
    column_name: &str,
    expr: Expr,
) -> Result<DataFrame, PolarsError>
Add or replace a column using an expression. Prefer with_column with a Column for rand/randn (per-row values).
pub fn group_by(
    &self,
    column_names: Vec<&str>,
) -> Result<GroupedData, PolarsError>
Group by columns (returns GroupedData for aggregation). Column names are resolved according to case sensitivity.
pub fn cube(
    &self,
    column_names: Vec<&str>,
) -> Result<CubeRollupData, PolarsError>
Cube: multiple grouping sets (all subsets of columns), then union (PySpark cube).
pub fn rollup(
    &self,
    column_names: Vec<&str>,
) -> Result<CubeRollupData, PolarsError>
Rollup: grouping sets (prefixes of columns), then union (PySpark rollup).
pub fn join(
    &self,
    other: &DataFrame,
    on: Vec<&str>,
    how: JoinType,
) -> Result<DataFrame, PolarsError>
Join with another DataFrame on the given columns. Join column names are resolved against the left side; the right side must have matching names.
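A sketch of an inner join on a shared "id" column; that JoinType exposes an Inner variant (as Polars' JoinType does) is an assumption here:

// `people` and `orders` are both this crate's DataFrame.
let joined = people.join(&orders, vec!["id"], JoinType::Inner)?;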
pub fn order_by(
    &self,
    column_names: Vec<&str>,
    ascending: Vec<bool>,
) -> Result<DataFrame, PolarsError>
Order by columns (sort). Column names are resolved according to case sensitivity.
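For example, age descending with name as an ascending tiebreaker (placeholder column names):

let sorted = frame.order_by(vec!["age", "name"], vec![false, true])?;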
pub fn order_by_exprs(
    &self,
    sort_orders: Vec<SortOrder>,
) -> Result<DataFrame, PolarsError>
Order by sort expressions (asc/desc with nulls_first/last).
pub fn union(&self, other: &DataFrame) -> Result<DataFrame, PolarsError>
Union (unionAll): stack another DataFrame vertically. Schemas must match (same columns, same order).
pub fn union_by_name(&self, other: &DataFrame) -> Result<DataFrame, PolarsError>
Union by name: stack vertically, aligning columns by name.
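A sketch: the two frames below share columns in different orders, which union would reject but union_by_name aligns (df! is polars' macro, used here in an error-propagating context):

use polars::df;

let a = DataFrame::from_polars(df!("x" => &[1i64], "y" => &["a"])?);
let b = DataFrame::from_polars(df!("y" => &["b"], "x" => &[2i64])?);
let both = a.union_by_name(&b)?;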
pub fn distinct(
    &self,
    subset: Option<Vec<&str>>,
) -> Result<DataFrame, PolarsError>
Distinct: drop duplicate rows (all columns or optional subset).
pub fn drop(&self, columns: Vec<&str>) -> Result<DataFrame, PolarsError>
Drop one or more columns.
pub fn dropna(
    &self,
    subset: Option<Vec<&str>>,
) -> Result<DataFrame, PolarsError>
Drop rows with nulls (all columns or optional subset).
pub fn fillna(&self, value: Expr) -> Result<DataFrame, PolarsError>
Fill nulls with a literal expression (applied to all columns).
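For example (a sketch; assumes the frame's columns are numeric so a 0 literal applies cleanly):

use polars::prelude::lit;

let filled = frame.fillna(lit(0))?;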
pub fn with_column_renamed(
    &self,
    old_name: &str,
    new_name: &str,
) -> Result<DataFrame, PolarsError>
Rename a column (old_name -> new_name).
pub fn replace(
    &self,
    column_name: &str,
    old_value: Expr,
    new_value: Expr,
) -> Result<DataFrame, PolarsError>
Replace values in a column (old_value -> new_value). PySpark replace.
pub fn cross_join(&self, other: &DataFrame) -> Result<DataFrame, PolarsError>
Cross join with another DataFrame (cartesian product). PySpark crossJoin.
pub fn describe(&self) -> Result<DataFrame, PolarsError>
Summary statistics. PySpark describe.
pub fn cache(&self) -> Result<DataFrame, PolarsError>
No-op: execution is eager by default. PySpark cache.
pub fn persist(&self) -> Result<DataFrame, PolarsError>
No-op: execution is eager by default. PySpark persist.
pub fn unpersist(&self) -> Result<DataFrame, PolarsError>
No-op. PySpark unpersist.
pub fn subtract(&self, other: &DataFrame) -> Result<DataFrame, PolarsError>
Set difference: rows in self not in other. PySpark subtract / except.
pub fn intersect(&self, other: &DataFrame) -> Result<DataFrame, PolarsError>
Set intersection: rows in both self and other. PySpark intersect.
pub fn sample(
    &self,
    with_replacement: bool,
    fraction: f64,
    seed: Option<u64>,
) -> Result<DataFrame, PolarsError>
Sample a fraction of rows. PySpark sample(withReplacement, fraction, seed).
pub fn random_split(
    &self,
    weights: &[f64],
    seed: Option<u64>,
) -> Result<Vec<DataFrame>, PolarsError>
Split into multiple DataFrames by weights. PySpark randomSplit(weights, seed).
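For example, a seeded 80/20 train/test split (a sketch):

let parts = frame.random_split(&[0.8, 0.2], Some(42))?;
let (train, test) = (&parts[0], &parts[1]);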
pub fn sample_by(
    &self,
    col_name: &str,
    fractions: &[(Expr, f64)],
    seed: Option<u64>,
) -> Result<DataFrame, PolarsError>
Stratified sample by column value. PySpark sampleBy(col, fractions, seed). fractions: list of (value as Expr, fraction) for that stratum.
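A sketch, assuming a string "dept" column with strata "a" and "b" (lit comes from polars):

use polars::prelude::lit;

// Keep roughly 50% of dept == "a" rows and 25% of dept == "b" rows.
let sampled = frame.sample_by(
    "dept",
    &[(lit("a"), 0.5), (lit("b"), 0.25)],
    Some(7),
)?;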
pub fn first(&self) -> Result<DataFrame, PolarsError>
First row as a one-row DataFrame. PySpark first().
pub fn take(&self, n: usize) -> Result<DataFrame, PolarsError>
Take first n rows. PySpark take(n).
pub fn to_df(&self, names: Vec<&str>) -> Result<DataFrame, PolarsError>
Rename columns. PySpark toDF(*colNames).
pub fn stat(&self) -> DataFrameStat<'_>
Statistical helper. PySpark df.stat().cov / .corr.
pub fn corr(&self) -> Result<DataFrame, PolarsError>
Correlation matrix of all numeric columns. PySpark df.corr() returns a DataFrame of pairwise correlations.
pub fn corr_cols(&self, col1: &str, col2: &str) -> Result<f64, PolarsError>
Pearson correlation between two columns (scalar). PySpark df.corr(col1, col2).
pub fn cov_cols(&self, col1: &str, col2: &str) -> Result<f64, PolarsError>
Sample covariance between two columns (scalar). PySpark df.cov(col1, col2).
pub fn summary(&self) -> Result<DataFrame, PolarsError>
Summary statistics (alias for describe). PySpark summary.
pub fn to_json(&self) -> Result<Vec<String>, PolarsError>
Collect rows as JSON strings (one per row). PySpark toJSON.
pub fn print_schema(&self) -> Result<String, PolarsError>
Return the schema as a tree-formatted string. PySpark printSchema (this returns the string; print it to stdout if needed).
pub fn checkpoint(&self) -> Result<DataFrame, PolarsError>
No-op: Polars backend is eager. PySpark checkpoint.
pub fn local_checkpoint(&self) -> Result<DataFrame, PolarsError>
No-op: Polars backend is eager. PySpark localCheckpoint.
pub fn repartition(
    &self,
    _num_partitions: usize,
) -> Result<DataFrame, PolarsError>
No-op: single partition in Polars. PySpark repartition(n).
pub fn repartition_by_range(
    &self,
    _num_partitions: usize,
    _cols: Vec<&str>,
) -> Result<DataFrame, PolarsError>
No-op: Polars has no range partitioning. PySpark repartitionByRange(n, cols).
pub fn dtypes(&self) -> Result<Vec<(String, String)>, PolarsError>
Column names and dtype strings. PySpark dtypes. Returns (name, dtype_string) per column.
pub fn sort_within_partitions(
    &self,
    _cols: &[SortOrder],
) -> Result<DataFrame, PolarsError>
No-op: partitions are not modeled. PySpark sortWithinPartitions; accepts the same sort orders as orderBy for compatibility.
pub fn coalesce(&self, _num_partitions: usize) -> Result<DataFrame, PolarsError>
No-op: single partition in Polars. PySpark coalesce(n).
pub fn hint(
    &self,
    _name: &str,
    _params: &[i32],
) -> Result<DataFrame, PolarsError>
No-op. PySpark hint (query planner hint).
pub fn input_files(&self) -> Vec<String>
Returns empty vec (no file sources). PySpark inputFiles.
pub fn same_semantics(&self, _other: &DataFrame) -> bool
Stub: always returns false. PySpark sameSemantics.
pub fn semantic_hash(&self) -> u64
Stub: always returns 0. PySpark semanticHash.
pub fn observe(
    &self,
    _name: &str,
    _expr: Expr,
) -> Result<DataFrame, PolarsError>
No-op. PySpark observe (metrics).
pub fn with_watermark(
    &self,
    _event_time: &str,
    _delay: &str,
) -> Result<DataFrame, PolarsError>
No-op. PySpark withWatermark (streaming).
pub fn select_expr(&self, exprs: &[String]) -> Result<DataFrame, PolarsError>
Select by expression strings (minimal: column names, optionally "col as alias"). PySpark selectExpr.
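For example, using only the minimal grammar described above (a sketch):

let out = frame.select_expr(&[
    "name".to_string(),
    "age as years".to_string(),
])?;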
pub fn col_regex(&self, pattern: &str) -> Result<DataFrame, PolarsError>
Select columns whose names match the regex. PySpark colRegex.
pub fn with_columns(
    &self,
    exprs: &[(String, Column)],
) -> Result<DataFrame, PolarsError>
Add or replace multiple columns. PySpark withColumns. Accepts Column so rand/randn get per-row values.
pub fn with_columns_renamed(
    &self,
    renames: &[(String, String)],
) -> Result<DataFrame, PolarsError>
Rename multiple columns. PySpark withColumnsRenamed.
pub fn na(&self) -> DataFrameNa<'_>
NA sub-API. PySpark df.na().
pub fn offset(&self, n: usize) -> Result<DataFrame, PolarsError>
Skip first n rows. PySpark offset(n).
pub fn transform<F>(&self, f: F) -> Result<DataFrame, PolarsError>
Transform by a function. PySpark transform(func).
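A sketch; the bound on F is not shown in this signature, so the closure shape below (DataFrame in, Result<DataFrame, PolarsError> out) is an assumption:

// Reusable pipeline step: keep only the first 10 rows.
let top10 = frame.transform(|d| d.take(10))?;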
pub fn freq_items(
    &self,
    columns: &[&str],
    support: f64,
) -> Result<DataFrame, PolarsError>
Frequent items. PySpark freqItems (stub).
pub fn approx_quantile(
    &self,
    column: &str,
    probabilities: &[f64],
) -> Result<DataFrame, PolarsError>
Approximate quantiles. PySpark approxQuantile (stub).
pub fn crosstab(&self, col1: &str, col2: &str) -> Result<DataFrame, PolarsError>
Cross-tabulation. PySpark crosstab (stub).
pub fn melt(
    &self,
    id_vars: &[&str],
    value_vars: &[&str],
) -> Result<DataFrame, PolarsError>
Unpivot (melt). PySpark melt (stub).
pub fn pivot(
    &self,
    _pivot_col: &str,
    _values: Option<Vec<&str>>,
) -> Result<DataFrame, PolarsError>
Pivot (wide format). PySpark pivot. Stub: not yet implemented; use crosstab for two-column count.
pub fn except_all(&self, other: &DataFrame) -> Result<DataFrame, PolarsError>
Set difference keeping duplicates. PySpark exceptAll.
pub fn intersect_all(&self, other: &DataFrame) -> Result<DataFrame, PolarsError>
Set intersection keeping duplicates. PySpark intersectAll.
pub fn write_delta(
    &self,
    _path: impl AsRef<Path>,
    _overwrite: bool,
) -> Result<(), PolarsError>
Stub when delta feature is disabled.
pub fn save_as_delta_table(&self, session: &SparkSession, name: &str)
Register this DataFrame as an in-memory "delta table" by name (same namespace as saveAsTable). Readable via read_delta(name) or table(name).
pub fn write(&self) -> DataFrameWriter<'_>
Return a writer for generic format (parquet, csv, json). PySpark-style write API.
Trait Implementations
Auto Trait Implementations
impl Freeze for DataFrame
impl !RefUnwindSafe for DataFrame
impl Send for DataFrame
impl Sync for DataFrame
impl Unpin for DataFrame
impl !UnwindSafe for DataFrame
Blanket Implementations
impl<T> BorrowMut<T> for T
where
    T: ?Sized,
fn borrow_mut(&mut self) -> &mut T
impl<T> CloneToUninit for T
where
    T: Clone,
impl<T> IntoEither for T
fn into_either(self, into_left: bool) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.
fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.