Struct DataFrame 

Source
pub struct DataFrame { /* private fields */ }

DataFrame is the main tabular data structure: a thin wrapper around an eager Polars DataFrame.

Implementations§

Source§

impl DataFrame

Source

pub fn from_polars(df: PlDataFrame) -> Self

Create a new DataFrame from a Polars DataFrame (case-insensitive column matching by default).

Source

pub fn from_polars_with_options(df: PlDataFrame, case_sensitive: bool) -> Self

Create a new DataFrame from a Polars DataFrame with explicit case sensitivity. When case_sensitive is false, column resolution is case-insensitive (PySpark default).
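A minimal sketch of both constructors, assuming the df! macro from Polars is in scope and the wrapper type is imported from this crate (error handling elided; the sample data is illustrative):

use polars::df;

// Build an eager Polars frame, then wrap it.
let pl = df!("name" => &["Alice", "Bob"], "age" => &[30i64, 25]).unwrap();

// Case-insensitive column resolution (PySpark default).
let df = DataFrame::from_polars(pl.clone());

// Explicitly case-sensitive: "Age" would no longer match the "age" column.
let df_cs = DataFrame::from_polars_with_options(pl, true);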

Source

pub fn empty() -> Self

Create an empty DataFrame

Source

pub fn resolve_expr_column_names(&self, expr: Expr) -> Result<Expr, PolarsError>

Resolve column names in a Polars expression against this DataFrame's schema. When case_sensitive is false, column references (e.g. col("name")) are resolved case-insensitively (PySpark default). Use before filter/select_with_exprs/order_by_exprs. Names that appear only as alias outputs (e.g. in expr.alias("partial")) are not resolved as input columns, so select(col("x").substr(1, 3).alias("partial")), when().then().otherwise().alias("result"), and col("x").rank().over([]).alias("rank") all work (issues #200, #212).
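As an illustration (a sketch using Polars expressions; the column and alias names are made up for the example):

use polars::prelude::{col, lit};

// "AGE" resolves to the physical "age" column when case_sensitive is false;
// "age_next" only names the alias output, so it is left untouched.
let expr = (col("AGE") + lit(1)).alias("age_next");
let resolved = df.resolve_expr_column_names(expr)?;
let out = df.select_exprs(vec![resolved])?;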

Source

pub fn coerce_string_numeric_comparisons(&self, expr: Expr) -> Result<Expr, PolarsError>

Rewrite comparison expressions to apply PySpark-style type coercion.

This walks the expression tree and, for comparison operators where one side is a column and the other is a numeric literal, delegates to coerce_for_pyspark_comparison so that string–numeric comparisons behave like PySpark (string values parsed to numbers where possible, invalid strings treated as null/non-matching).
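For example (a sketch; the "price" column holding numeric strings is hypothetical):

use polars::prelude::{col, lit};

// Suppose "price" is a string column with values like "10", "2.5", "n/a".
// After coercion the strings are parsed as numbers and "n/a" becomes null,
// so it never matches, as in PySpark.
let expr = col("price").gt(lit(5));
let coerced = df.coerce_string_numeric_comparisons(expr)?;
let expensive = df.filter(coerced)?;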

Source

pub fn resolve_column_name(&self, name: &str) -> Result<String, PolarsError>

Resolve a logical column name to the actual column name in the schema. When case_sensitive is false, matches case-insensitively.

Source

pub fn schema(&self) -> Result<StructType, PolarsError>

Get the schema of the DataFrame

Source

pub fn columns(&self) -> Result<Vec<String>, PolarsError>

Get column names

Source

pub fn count(&self) -> Result<usize, PolarsError>

Count the number of rows (action - triggers execution)

Source

pub fn show(&self, n: Option<usize>) -> Result<(), PolarsError>

Show the first n rows

Source

pub fn collect(&self) -> Result<Arc<PlDataFrame>, PolarsError>

Collect the DataFrame (action - triggers execution)

Source

pub fn collect_as_json_rows(&self) -> Result<Vec<HashMap<String, JsonValue>>, PolarsError>

Collect as rows of column-name -> JSON value. For use by language bindings (Node, etc.).

Source

pub fn select_exprs(&self, exprs: Vec<Expr>) -> Result<DataFrame, PolarsError>

Select columns (returns a new DataFrame). Accepts either column names (strings) or Column expressions (e.g. from regexp_extract_all(…).alias("m")). Column names are resolved according to case sensitivity.
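A small sketch mixing a plain column reference with a computed, aliased expression (column names are illustrative):

use polars::prelude::{col, lit};

// "name" is selected as-is; the second expression is computed and aliased.
let out = df.select_exprs(vec![
    col("name"),
    (col("age") + lit(1)).alias("age_next_year"),
])?;
out.show(Some(5))?;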

Source

pub fn select(&self, cols: Vec<&str>) -> Result<DataFrame, PolarsError>

Select columns by name (returns a new DataFrame). Column names are resolved according to case sensitivity.

Source

pub fn filter(&self, condition: Expr) -> Result<DataFrame, PolarsError>

Filter rows using a Polars expression.

Source

pub fn column(&self, name: &str) -> Result<Column, PolarsError>

Get a column reference by name (for building expressions). Respects case sensitivity: when false, "Age" resolves to column "age" if present.
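For example, with the default case-insensitive resolution (column names are illustrative):

// "Age" finds the physical "age" column.
let age = df.column("Age")?;
// Reuse the handle to add a copy of the column under another name.
let df2 = df.with_column("age_copy", &age)?;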

Source

pub fn with_column(&self, column_name: &str, col: &Column) -> Result<DataFrame, PolarsError>

Add or replace a column. Use a Column (e.g. from col("x"), rand(42), randn(42)). For rand/randn, generates one distinct value per row (PySpark-like).
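A sketch of the rand/randn case, assuming rand and randn are the seeded helpers from this crate that build a Column (as the description implies):

// rand(seed) / randn(seed) are assumed to produce one distinct draw per row.
let with_noise = df.with_column("noise", &randn(42))?;
let with_uniform = with_noise.with_column("u", &rand(42))?;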

Source

pub fn with_column_expr(&self, column_name: &str, expr: Expr) -> Result<DataFrame, PolarsError>

Add or replace a column using an expression. Prefer with_column with a Column for rand/randn (per-row values).

Source

pub fn group_by(&self, column_names: Vec<&str>) -> Result<GroupedData, PolarsError>

Group by columns (returns GroupedData for aggregation). Column names are resolved according to case sensitivity.

Source

pub fn cube(&self, column_names: Vec<&str>) -> Result<CubeRollupData, PolarsError>

Cube: multiple grouping sets (all subsets of columns), then union (PySpark cube).

Source

pub fn rollup(&self, column_names: Vec<&str>) -> Result<CubeRollupData, PolarsError>

Rollup: grouping sets (prefixes of columns), then union (PySpark rollup).

Source

pub fn join(&self, other: &DataFrame, on: Vec<&str>, how: JoinType) -> Result<DataFrame, PolarsError>

Join with another DataFrame on the given columns. Join column names are resolved on the left (and right must have matching names).
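A sketch of an inner join on a shared key; the two frames are placeholders and the Inner variant of JoinType is an assumption for the example:

// Join customers to orders on the "id" column (hypothetical DataFrames).
let joined = customers.join(&orders, vec!["id"], JoinType::Inner)?;
joined.show(Some(10))?;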

Source

pub fn order_by(&self, column_names: Vec<&str>, ascending: Vec<bool>) -> Result<DataFrame, PolarsError>

Order by columns (sort). Column names are resolved according to case sensitivity.

Source

pub fn order_by_exprs(&self, sort_orders: Vec<SortOrder>) -> Result<DataFrame, PolarsError>

Order by sort expressions (asc/desc with nulls_first/last).

Source

pub fn union(&self, other: &DataFrame) -> Result<DataFrame, PolarsError>

Union (unionAll): stack another DataFrame vertically. Schemas must match (same columns, same order).

Source

pub fn union_by_name(&self, other: &DataFrame) -> Result<DataFrame, PolarsError>

Union by name: stack vertically, aligning columns by name.

Source

pub fn distinct(&self, subset: Option<Vec<&str>>) -> Result<DataFrame, PolarsError>

Distinct: drop duplicate rows (all columns or optional subset).

Source

pub fn drop(&self, columns: Vec<&str>) -> Result<DataFrame, PolarsError>

Drop one or more columns.

Source

pub fn dropna(&self, subset: Option<Vec<&str>>) -> Result<DataFrame, PolarsError>

Drop rows with nulls (all columns or optional subset).

Source

pub fn fillna(&self, value: Expr) -> Result<DataFrame, PolarsError>

Fill nulls with a literal expression (applied to all columns).

Source

pub fn limit(&self, n: usize) -> Result<DataFrame, PolarsError>

Limit: return first n rows.

Source

pub fn with_column_renamed(&self, old_name: &str, new_name: &str) -> Result<DataFrame, PolarsError>

Rename a column (old_name -> new_name).

Source

pub fn replace(&self, column_name: &str, old_value: Expr, new_value: Expr) -> Result<DataFrame, PolarsError>

Replace values in a column (old_value -> new_value). PySpark replace.

Source

pub fn cross_join(&self, other: &DataFrame) -> Result<DataFrame, PolarsError>

Cross join with another DataFrame (cartesian product). PySpark crossJoin.

Source

pub fn describe(&self) -> Result<DataFrame, PolarsError>

Summary statistics. PySpark describe.

Source

pub fn cache(&self) -> Result<DataFrame, PolarsError>

No-op: execution is eager by default. PySpark cache.

Source

pub fn persist(&self) -> Result<DataFrame, PolarsError>

No-op: execution is eager by default. PySpark persist.

Source

pub fn unpersist(&self) -> Result<DataFrame, PolarsError>

No-op. PySpark unpersist.

Source

pub fn subtract(&self, other: &DataFrame) -> Result<DataFrame, PolarsError>

Set difference: rows in self not in other. PySpark subtract / except.

Source

pub fn intersect(&self, other: &DataFrame) -> Result<DataFrame, PolarsError>

Set intersection: rows in both self and other. PySpark intersect.

Source

pub fn sample(&self, with_replacement: bool, fraction: f64, seed: Option<u64>) -> Result<DataFrame, PolarsError>

Sample a fraction of rows. PySpark sample(withReplacement, fraction, seed).

Source

pub fn random_split(&self, weights: &[f64], seed: Option<u64>) -> Result<Vec<DataFrame>, PolarsError>

Split into multiple DataFrames by weights. PySpark randomSplit(weights, seed).

Source

pub fn sample_by(&self, col_name: &str, fractions: &[(Expr, f64)], seed: Option<u64>) -> Result<DataFrame, PolarsError>

Stratified sample by column value. PySpark sampleBy(col, fractions, seed). fractions is a list of (value as Expr, sampling fraction) pairs, one per stratum.
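For example (a sketch; the "dept" column and its values are made up):

use polars::prelude::lit;

// Keep 50% of "sales" rows and 10% of "eng" rows; other values are not sampled.
let sampled = df.sample_by(
    "dept",
    &[(lit("sales"), 0.5), (lit("eng"), 0.1)],
    Some(42),
)?;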

Source

pub fn first(&self) -> Result<DataFrame, PolarsError>

First row as a one-row DataFrame. PySpark first().

Source

pub fn head(&self, n: usize) -> Result<DataFrame, PolarsError>

First n rows. PySpark head(n).

Source

pub fn take(&self, n: usize) -> Result<DataFrame, PolarsError>

Take first n rows. PySpark take(n).

Source

pub fn tail(&self, n: usize) -> Result<DataFrame, PolarsError>

Last n rows. PySpark tail(n).

Source

pub fn is_empty(&self) -> bool

True if the DataFrame has zero rows. PySpark isEmpty.

Source

pub fn to_df(&self, names: Vec<&str>) -> Result<DataFrame, PolarsError>

Rename columns. PySpark toDF(*colNames).

Source

pub fn stat(&self) -> DataFrameStat<'_>

Statistical helper. PySpark df.stat().cov / .corr.

Source

pub fn corr(&self) -> Result<DataFrame, PolarsError>

Correlation matrix of all numeric columns. PySpark df.corr() returns a DataFrame of pairwise correlations.

Source

pub fn corr_cols(&self, col1: &str, col2: &str) -> Result<f64, PolarsError>

Pearson correlation between two columns (scalar). PySpark df.corr(col1, col2).

Source

pub fn cov_cols(&self, col1: &str, col2: &str) -> Result<f64, PolarsError>

Sample covariance between two columns (scalar). PySpark df.cov(col1, col2).

Source

pub fn summary(&self) -> Result<DataFrame, PolarsError>

Summary statistics (alias for describe). PySpark summary.

Source

pub fn to_json(&self) -> Result<Vec<String>, PolarsError>

Collect rows as JSON strings (one per row). PySpark toJSON.

Source

pub fn explain(&self) -> String

Return execution plan description. PySpark explain.

Source

pub fn print_schema(&self) -> Result<String, PolarsError>

Return schema as tree string. PySpark printSchema (returns string; print to stdout if needed).

Source

pub fn checkpoint(&self) -> Result<DataFrame, PolarsError>

No-op: Polars backend is eager. PySpark checkpoint.

Source

pub fn local_checkpoint(&self) -> Result<DataFrame, PolarsError>

No-op: Polars backend is eager. PySpark localCheckpoint.

Source

pub fn repartition(&self, _num_partitions: usize) -> Result<DataFrame, PolarsError>

No-op: single partition in Polars. PySpark repartition(n).

Source

pub fn repartition_by_range(&self, _num_partitions: usize, _cols: Vec<&str>) -> Result<DataFrame, PolarsError>

No-op: Polars has no range partitioning. PySpark repartitionByRange(n, cols).

Source

pub fn dtypes(&self) -> Result<Vec<(String, String)>, PolarsError>

Column names and dtype strings. PySpark dtypes. Returns (name, dtype_string) per column.

Source

pub fn sort_within_partitions(&self, _cols: &[SortOrder]) -> Result<DataFrame, PolarsError>

No-op: we don’t model partitions. PySpark sortWithinPartitions. Same as orderBy for compatibility.

Source

pub fn coalesce(&self, _num_partitions: usize) -> Result<DataFrame, PolarsError>

No-op: single partition in Polars. PySpark coalesce(n).

Source

pub fn hint(&self, _name: &str, _params: &[i32]) -> Result<DataFrame, PolarsError>

No-op. PySpark hint (query planner hint).

Source

pub fn is_local(&self) -> bool

Returns true (eager single-node). PySpark isLocal.

Source

pub fn input_files(&self) -> Vec<String>

Returns empty vec (no file sources). PySpark inputFiles.

Source

pub fn same_semantics(&self, _other: &DataFrame) -> bool

No-op; returns false. PySpark sameSemantics.

Source

pub fn semantic_hash(&self) -> u64

No-op; returns 0. PySpark semanticHash.

Source

pub fn observe(&self, _name: &str, _expr: Expr) -> Result<DataFrame, PolarsError>

No-op. PySpark observe (metrics).

Source

pub fn with_watermark(&self, _event_time: &str, _delay: &str) -> Result<DataFrame, PolarsError>

No-op. PySpark withWatermark (streaming).

Source

pub fn select_expr(&self, exprs: &[String]) -> Result<DataFrame, PolarsError>

Select by expression strings (minimal: column names, optionally "col as alias"). PySpark selectExpr.
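For example (a minimal sketch; column names are illustrative):

// Bare column names and a "col as alias" rename.
let out = df.select_expr(&[
    "name".to_string(),
    "age as years".to_string(),
])?;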

Source

pub fn col_regex(&self, pattern: &str) -> Result<DataFrame, PolarsError>

Select columns whose names match the regex. PySpark colRegex.

Source

pub fn with_columns(&self, exprs: &[(String, Column)]) -> Result<DataFrame, PolarsError>

Add or replace multiple columns. PySpark withColumns. Accepts Column so rand/randn get per-row values.
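A sketch adding two generated columns in one pass, again assuming rand and randn are this crate's seeded Column helpers:

// Each generated Column gets a distinct draw per row (PySpark-like).
let df2 = df.with_columns(&[
    ("u".to_string(), rand(42)),
    ("z".to_string(), randn(7)),
])?;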

Source

pub fn with_columns_renamed(&self, renames: &[(String, String)]) -> Result<DataFrame, PolarsError>

Rename multiple columns. PySpark withColumnsRenamed.

Source

pub fn na(&self) -> DataFrameNa<'_>

NA sub-API. PySpark df.na().

Source

pub fn offset(&self, n: usize) -> Result<DataFrame, PolarsError>

Skip first n rows. PySpark offset(n).

Source

pub fn transform<F>(&self, f: F) -> Result<DataFrame, PolarsError>

Transform by a function. PySpark transform(func).

Source

pub fn freq_items(&self, columns: &[&str], support: f64) -> Result<DataFrame, PolarsError>

Frequent items. PySpark freqItems (stub).

Source

pub fn approx_quantile(&self, column: &str, probabilities: &[f64]) -> Result<DataFrame, PolarsError>

Approximate quantiles. PySpark approxQuantile (stub).

Source

pub fn crosstab(&self, col1: &str, col2: &str) -> Result<DataFrame, PolarsError>

Cross-tabulation. PySpark crosstab (stub).

Source

pub fn melt(&self, id_vars: &[&str], value_vars: &[&str]) -> Result<DataFrame, PolarsError>

Unpivot (melt). PySpark melt (stub).

Source

pub fn pivot(&self, _pivot_col: &str, _values: Option<Vec<&str>>) -> Result<DataFrame, PolarsError>

Pivot (wide format). PySpark pivot. Stub: not yet implemented; use crosstab for two-column count.

Source

pub fn except_all(&self, other: &DataFrame) -> Result<DataFrame, PolarsError>

Set difference keeping duplicates. PySpark exceptAll.

Source

pub fn intersect_all(&self, other: &DataFrame) -> Result<DataFrame, PolarsError>

Set intersection keeping duplicates. PySpark intersectAll.

Source

pub fn write_delta(&self, _path: impl AsRef<Path>, _overwrite: bool) -> Result<(), PolarsError>

Stub when delta feature is disabled.

Source

pub fn save_as_delta_table(&self, session: &SparkSession, name: &str)

Register this DataFrame as an in-memory “delta table” by name (same namespace as saveAsTable). Readable via read_delta(name) or table(name).
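A sketch of the round trip, assuming table (and read_delta) are methods on SparkSession as the description implies; exact return types follow whatever the session exposes:

// Register the DataFrame under a name, then read it back through the session.
df.save_as_delta_table(&spark, "people");
// spark.table / spark.read_delta are assumed session methods per the description.
let again = spark.table("people");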

Source

pub fn write(&self) -> DataFrameWriter<'_>

Return a writer for generic format (parquet, csv, json). PySpark-style write API.

Trait Implementations§

Source§

impl Clone for DataFrame

Source§

fn clone(&self) -> Self

Returns a duplicate of the value. Read more
1.0.0 · Source§

fn clone_from(&mut self, source: &Self)

Performs copy-assignment from source. Read more

Auto Trait Implementations§

Blanket Implementations§

Source§

impl<T> Any for T
where T: 'static + ?Sized,

Source§

fn type_id(&self) -> TypeId

Gets the TypeId of self. Read more
Source§

impl<T> Borrow<T> for T
where T: ?Sized,

Source§

fn borrow(&self) -> &T

Immutably borrows from an owned value. Read more
Source§

impl<T> BorrowMut<T> for T
where T: ?Sized,

Source§

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value. Read more
Source§

impl<T> CloneToUninit for T
where T: Clone,

Source§

unsafe fn clone_to_uninit(&self, dest: *mut u8)

🔬 This is a nightly-only experimental API. (clone_to_uninit)
Performs copy-assignment from self to dest. Read more
Source§

impl<T> DynClone for T
where T: Clone,

Source§

fn __clone_box(&self, _: Private) -> *mut ()

Source§

impl<T> From<T> for T

Source§

fn from(t: T) -> T

Returns the argument unchanged.

Source§

impl<T, U> Into<U> for T
where U: From<T>,

Source§

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

Source§

impl<T> IntoEither for T

Source§

fn into_either(self, into_left: bool) -> Either<Self, Self>

Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise. Read more
Source§

fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
where F: FnOnce(&Self) -> bool,

Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise. Read more
Source§

impl<T> Pointable for T

Source§

const ALIGN: usize

The alignment of the pointer.
Source§

type Init = T

The type for initializers.
Source§

unsafe fn init(init: <T as Pointable>::Init) -> usize

Initializes a pointer with the given initializer. Read more
Source§

unsafe fn deref<'a>(ptr: usize) -> &'a T

Dereferences the given pointer. Read more
Source§

unsafe fn deref_mut<'a>(ptr: usize) -> &'a mut T

Mutably dereferences the given pointer. Read more
Source§

unsafe fn drop(ptr: usize)

Drops the object pointed to by the given pointer. Read more
Source§

impl<T> Same for T

Source§

type Output = T

Should always be Self
Source§

impl<T> ToOwned for T
where T: Clone,

Source§

type Owned = T

The resulting type after obtaining ownership.
Source§

fn to_owned(&self) -> T

Creates owned data from borrowed data, usually by cloning. Read more
Source§

fn clone_into(&self, target: &mut T)

Uses borrowed data to replace owned data, usually by cloning. Read more
Source§

impl<T, U> TryFrom<U> for T
where U: Into<T>,

Source§

type Error = Infallible

The type returned in the event of a conversion error.
Source§

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.
Source§

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

Source§

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.
Source§

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.
Source§

impl<V, T> VZip<V> for T
where V: MultiLane<T>,

Source§

fn vzip(self) -> V