Robin Sparkless - A Rust DataFrame library with PySpark-like API
This library provides a PySpark-compatible API. The root crate is engine-agnostic: it depends on robin-sparkless-core (types, expression IR, config) and one backend—currently robin-sparkless-polars, which uses Polars for execution. The public API exposes engine-agnostic expression types where possible.
§Expression APIs
- ExprIr (engine-agnostic): Use col, lit_i64, lit_str, when, gt, eq, etc. from the crate root (re-exported from robin_sparkless_core). These build an ExprIr tree. Use DataFrame::filter_expr_ir, DataFrame::select_expr_ir, DataFrame::with_column_expr_ir, DataFrame::collect_rows, and GroupedData::agg_expr_ir with &ExprIr / &[ExprIr]. Collect returns CollectedRows (JSON-like rows). Prefer this for new code and embeddings; see the sketch below.
- Column / Expr (Polars-backed): Use prelude or robin_sparkless::functions::{col, lit_i64, ...} for the full PySpark-like API that returns Column and uses Polars Expr. Use DataFrame::filter, DataFrame::with_column, DataFrame::select_exprs, etc. with those types. Still supported for compatibility and advanced use.
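A minimal sketch of the ExprIr path. The free functions and method names are the ones listed on this page; whether filter_expr_ir and collect_rows return Result (and with which error type) is an assumption, so check the method docs for the exact signatures.

    use robin_sparkless::{col, gt, lit_i64, CollectedRows, DataFrame, EngineError};

    // Filter rows where age > 21 with the engine-agnostic ExprIr API,
    // then collect JSON-like rows (column name -> value).
    fn adults(df: DataFrame) -> Result<CollectedRows, EngineError> {
        // col/lit_i64/gt build an ExprIr tree; no Polars types appear here.
        let predicate = gt(col("age"), lit_i64(21));
        let filtered = df.filter_expr_ir(&predicate)?;
        filtered.collect_rows()
    }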
§Getting started and embedding
For application code and embedding, use the prelude: use robin_sparkless::prelude::*.
For a minimal FFI surface, use prelude::embed. For engine-agnostic expressions, use the
root re-exports (col, lit_i64, gt, etc.) and the *_expr_ir / collect_rows methods.
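For comparison, the same filter through the Polars-backed Column API via the prelude. This is a sketch only: the exact prelude contents, whether gt here is a free function over Columns (as in the functions module), and the return type of DataFrame::filter are assumptions.

    use robin_sparkless::prelude::*;
    use robin_sparkless::{DataFrame, EngineError};

    // Polars-backed path: col returns a Column wrapping a Polars Expr.
    fn adults_polars(df: DataFrame) -> Result<DataFrame, EngineError> {
        df.filter(gt(col("age"), lit_i64(21)))
    }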
§Panics and errors
Some functions panic when used with invalid or empty inputs (e.g. calling
when(cond).otherwise(val) without .then(), or passing no columns to
format_string, elt, concat, coalesce, or named_struct in Rust).
In Rust, create_map and array return Result for empty input instead of
panicking. From Python, empty columns for coalesce, format_string,
printf, and named_struct raise ValueError. See the documentation for
each function for details.
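When calling the fallible constructors from Rust, handle the Result explicitly rather than relying on a panic. A sketch, assuming create_map takes a slice of Columns and that its error type implements std::error::Error:

    use robin_sparkless::functions::{col, create_map};
    use robin_sparkless::Column;

    // create_map returns a Result, so empty or otherwise invalid input surfaces
    // as an error value instead of a panic.
    fn key_value_map() -> Result<Column, Box<dyn std::error::Error>> {
        let m = create_map(&[col("key"), col("value")])?;
        Ok(m)
    }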
§API stability
While the crate is in the 0.x series, we follow semver but may introduce breaking changes in minor releases (e.g. 0.1 → 0.2) until 1.0. For behavioral caveats and intentional differences from PySpark, see the repository documentation.
Re-exports§
pub use dataframe::CubeRollupData;
pub use dataframe::DataFrame;
pub use dataframe::DataFrameNa;
pub use dataframe::DataFrameStat;
pub use dataframe::DataFrameWriter;
pub use dataframe::GroupedData;
pub use dataframe::PivotedGroupedData;
pub use session::DataFrameReader;
pub use session::SparkSession;
pub use session::SparkSessionBuilder;
pub use traits::FromRobinDf;
pub use traits::IntoRobinDf;
Modules§
- column
- config
- Configuration for robin-sparkless sessions.
- dataframe
- Root-owned DataFrame API; delegates to robin-sparkless-polars for execution.
- error
- Engine error type for embedders (Polars conversion in this crate).
- expression
- functions
- Expression builders: sort, types, column refs/literals, agg, when, string, datetime, struct/map/array, cast, hash, misc. Re-exports all public items so use crate::functions::* and pub use functions::* keep the same API.
- plan
- Plan execution; use execute_plan for root session/DataFrame. Re-exports plan error types.
- prelude
- One-stop prelude for application code and embedding.
- schema
- Schema types and Polars conversion. Types from core; conversion from robin-sparkless-polars.
- session
- Root-owned Session API; delegates to robin-sparkless-polars for execution.
- sql
- SQL parsing and execution; returns root-owned DataFrame.
- traits
- Traits for conversion to and from root-owned DataFrames.
- type_coercion
Structs§
- ChainedWhenBuilder
- Builder for an additional when-then clause (returned by ThenBuilder::when).
- Column
- Column - represents a column in a DataFrame, used for building expressions. Thin wrapper around Polars Expr. May carry a DeferredRandom for rand/randn so with_column can produce one value per row. May carry UdfCall for Python UDFs (eager execution at with_column).
- PlanExprError
- Error from parsing or interpreting a plan expression.
- SortOrder
- Sort order specification for use in orderBy/sort. Holds expr + direction + null placement.
- SparklessConfig
- Configuration for building a SparkSession.
- StructField
- StructType
- ThenBuilder
- Builder for chaining when-then clauses before finalizing with otherwise.
- UdfRegistry
- Session-scoped UDF registry. Rust UDFs run lazily via Polars Expr::map.
- WhenBuilder
- When-then-otherwise builder.
- WhenThenBuilder
Enums§
- DataType
- EngineError
- Unified error type for robin-sparkless operations.
- ExprIr
- Expression IR: a single, serializable tree that backends convert to their native Expr.
- JoinType
- Re-export for API compatibility. Join type for DataFrame joins (PySpark-compatible).
- LiteralValue
- Literal value in an expression (engine-agnostic).
- PlanError
- Errors from plan execution.
- PolarsError
- SaveMode
- Re-export for API compatibility. Save mode for saveAsTable (PySpark default is ErrorIfExists).
- SelectItem
- Re-export for API compatibility. Select item: either a column name (str) or an expression (PySpark parity: select("a", col("b").alias("x"))). Fixes #645: select expects Column or str.
- WriteFormat
- Re-export for API compatibility. Output format for generic write (PySpark DataFrameWriter.format).
- WriteMode
- Re-export for API compatibility. Write mode: overwrite or append (PySpark DataFrameWriter.mode for path-based save).
Traits§
- RustUdf
- Rust UDF: takes columns as Series, returns one Series. Used via Expr::map / map_many.
- StructTypePolarsExt
- Extension trait for Polars schema conversion. Implemented for StructType from core. Bring this trait into scope to use StructType::from_polars_schema and to_polars_schema.
Functions§
- abs
- Absolute value (PySpark abs)
- acos
- Arc cosine (PySpark acos)
- acosh
- Inverse hyperbolic cosine (PySpark acosh).
- add_months
- Add n months to date column (PySpark add_months).
- aes_decrypt
- AES decrypt (PySpark aes_decrypt). Input hex(nonce||ciphertext).
- aes_encrypt
- AES encrypt (PySpark aes_encrypt). Key as string; AES-128-GCM.
- aggregate
- Array fold/aggregate (PySpark aggregate). Simplified: zero + sum(list elements).
- alias
- Alias an expression with a new output name.
- and_
- any_value
- Any value from the group (PySpark any_value). Use in groupBy.agg(). ignorenulls reserved for API compatibility.
- approx_count_distinct
- Approximate count distinct (PySpark approx_count_distinct). Use in groupBy.agg(). rsd reserved for API compatibility; Polars uses exact n_unique.
- approx_percentile
- Approximate percentile (PySpark approx_percentile). Maps to quantile; percentage in 0.0..=1.0. accuracy reserved for API compatibility.
- array
- Create an array column from multiple columns (PySpark array). With no arguments, returns a column of empty arrays (one per row); PySpark parity.
- array_agg
- Collect to array (PySpark array_agg).
- array_append
- Append element to end of list (PySpark array_append).
- array_compact
- Remove null elements from list (PySpark array_compact).
- array_contains
- Check if list contains value (PySpark array_contains).
- array_distinct
- Distinct elements in list (PySpark array_distinct).
- array_except
- Elements in first array not in second (PySpark array_except).
- array_exists
- True if any list element satisfies the predicate (PySpark exists).
- array_filter
- Filter list elements by predicate (PySpark filter).
- array_flatten
- Flatten list of lists to one list (PySpark flatten). Not implemented.
- array_forall
- True if all list elements satisfy the predicate (PySpark forall).
- array_insert
- Insert element at 1-based position (PySpark array_insert).
- array_intersect
- Elements in both arrays (PySpark array_intersect).
- array_join
- Join list of strings with separator (PySpark array_join).
- array_max
- Maximum element in list (PySpark array_max).
- array_mean
- Mean of list elements (PySpark aggregate avg).
- array_min
- Minimum element in list (PySpark array_min).
- array_position
- 1-based index of first occurrence of value in list, or 0 if not found (PySpark array_position). Implemented via Polars list.eval with col("") as element.
- array_prepend
- Prepend element to start of list (PySpark array_prepend).
- array_remove
- New list with all elements equal to value removed (PySpark array_remove). Implemented via Polars list.eval + list.drop_nulls.
- array_repeat
- Repeat each element n times (PySpark array_repeat). Not implemented: would require list.eval with dynamic repeat.
- array_size
- Number of elements in list (PySpark size / array_size). Returns Int32.
- array_slice
- Slice list from 1-based start with optional length (PySpark slice).
- array_sort
- Sort list elements (PySpark array_sort).
- array_sum
- Sum of list elements (PySpark aggregate sum).
- array_transform
- Transform list elements by expression (PySpark transform).
- array_union
- Distinct elements from both arrays (PySpark array_union).
- arrays_overlap
- True if two arrays have any element in common (PySpark arrays_overlap).
- arrays_zip
- Zip arrays into array of structs (PySpark arrays_zip).
- asc
- Ascending sort, nulls first (Spark default for ASC).
- asc_nulls_first
- Ascending sort, nulls first.
- asc_nulls_last
- Ascending sort, nulls last.
- ascii
- ASCII value of first character (PySpark ascii). Returns Int32.
- asin
- Arc sine (PySpark asin)
- asinh
- Inverse hyperbolic sine (PySpark asinh).
- assert_true
- Assert that all boolean values are true; errors otherwise (PySpark assert_true). When err_msg is Some, it is used in the error message when assertion fails.
- atan
- Arc tangent (PySpark atan)
- atan2
- Two-argument arc tangent atan2(y, x) in radians (PySpark atan2)
- atanh
- Inverse hyperbolic tangent (PySpark atanh).
- avg
- Average aggregation
- base64
- Base64 encode string bytes (PySpark base64).
- between
- bin
- Convert integer to binary string (PySpark bin).
- bit_and
- Bitwise AND of two integer/boolean columns (PySpark bit_and).
- bit_count
- Count of set bits in the integer representation (PySpark bit_count).
- bit_get
- Alias for getbit (PySpark bit_get).
- bit_length
- Bit length of string (bytes * 8) (PySpark bit_length).
- bit_or
- Bitwise OR of two integer/boolean columns (PySpark bit_or).
- bit_xor
- Bitwise XOR of two integer/boolean columns (PySpark bit_xor).
- bitmap_bit_position
- Map integral value (0–32767) to bit position for bitmap aggregates (PySpark bitmap_bit_position).
- bitmap_bucket_number
- Bucket number for distributed bitmap (PySpark bitmap_bucket_number). value / 32768.
- bitmap_construct_agg
- Aggregate: bitwise OR of bit positions into one bitmap binary (PySpark bitmap_construct_agg). Use in group_by(…).agg([bitmap_construct_agg(col)]).
- bitmap_count
- Count set bits in a bitmap binary column (PySpark bitmap_count).
- bitmap_or_agg
- Aggregate: bitwise OR of bitmap binary column (PySpark bitmap_or_agg).
- bitwise_not
- Bitwise NOT of an integer/boolean column (PySpark bitwise_not / bitwiseNOT).
- bool_and
- Boolean AND across group (PySpark bool_and). Use in groupBy.agg(); column should be boolean.
- broadcast
- Broadcast hint - no-op that returns the same DataFrame (PySpark broadcast).
- bround
- Banker’s rounding - round half to even (PySpark bround).
- btrim
- Trim leading and trailing chars (PySpark btrim). trim_str defaults to whitespace.
- call
- Generic function call (for the long tail of functions).
- call_udf
- Call a registered UDF by name. PySpark: F.call_udf(udfName, *cols). Requires a session (set by get_or_create). Raises if UDF not found.
- cardinality
- Cardinality: number of elements in array (PySpark cardinality). Alias for size/array_size.
- cast
- Cast column to the given type (PySpark cast). Fails on invalid conversion. String-to-boolean uses custom parsing (“true”/“false”/“1”/“0”) since Polars does not support Utf8->Boolean. String-to-date accepts date and datetime strings (e.g. “2025-01-01 10:30:00” truncates to date) for Spark parity.
- cbrt
- Cube root (PySpark cbrt).
- ceil
- Ceiling (PySpark ceil)
- ceiling
- Alias for ceil. PySpark ceiling.
- char
- Int to single-character string (PySpark char). Valid codepoint only.
- char_length
- Length of string in characters (PySpark char_length). Alias of length().
- character_length
- Length of string in characters (PySpark character_length). Alias of length().
- chr
- Alias for char (PySpark chr).
- coalesce
- Coalesce - returns the first non-null value from multiple columns.
- col
- Column reference.
- collect_list
- Collect column values into list per group (PySpark collect_list). Use in groupBy.agg().
- collect_set
- Collect distinct column values into list per group (PySpark collect_set). Use in groupBy.agg().
- concat
- Concatenate string columns without separator (PySpark concat). Panics if columns is empty.
- concat_ws
- Concatenate string columns with separator (PySpark concat_ws). Panics if columns is empty.
- contains
- True if string contains substring (literal). PySpark contains.
- conv
- Base conversion (PySpark conv). num from from_base to to_base.
- convert_timezone
- Convert timestamp between timezones (PySpark convert_timezone).
- corr
- Pearson correlation aggregation (PySpark corr). Module-level; use in groupBy.agg() with two columns.
- corr_expr
- Pearson correlation aggregation (PySpark corr). Returns Expr for use in groupBy.agg().
- cos
- Cosine in radians (PySpark cos)
- cosh
- Hyperbolic cosine (PySpark cosh).
- cot
- Cotangent: 1/tan (PySpark cot).
- count
- count_distinct
- Count distinct aggregation (PySpark countDistinct)
- count_if
- Count rows where condition is true (PySpark count_if). Use in groupBy.agg(); column should be boolean (true=1, false=0).
- covar_pop
- Population covariance aggregation (PySpark covar_pop). Module-level; use in groupBy.agg() with two columns.
- covar_pop_expr
- Population covariance aggregation (PySpark covar_pop). Returns Expr for use in groupBy.agg().
- covar_samp_expr
- Sample covariance aggregation (PySpark covar_samp). Returns Expr for use in groupBy.agg().
- crc32
- CRC32 of string bytes (PySpark crc32). Not implemented: requires element-wise UDF.
- create_map
- Build a map column from alternating key/value expressions (PySpark create_map). Returns List(Struct{key, value}) using Polars as_struct and concat_list. With no args (or empty slice), returns a column of empty maps per row (PySpark parity #275).
- csc
- Cosecant: 1/sin (PySpark csc).
- cume_dist
- Cumulative distribution in partition: row_number / count. Window is applied.
- curdate
- Alias for current_date (PySpark curdate).
- current_catalog
- Current catalog name stub (PySpark current_catalog).
- current_database
- Current database/schema name stub (PySpark current_database).
- current_date
- Current date (evaluation time). PySpark current_date.
- current_schema
- Current schema name stub (PySpark current_schema).
- current_timestamp
- Current timestamp (evaluation time). PySpark current_timestamp.
- current_timezone
- Current session timezone (PySpark current_timezone). Default "UTC". Returns literal column.
- current_user
- Current user stub (PySpark current_user).
- date_add
- Add n days to date column (PySpark date_add).
- date_diff
- Alias for datediff (PySpark date_diff). date_diff(end, start).
- date_format
- Format date/datetime as string (PySpark date_format). Accepts PySpark/Java SimpleDateFormat style (e.g. "yyyy-MM") and converts to chrono strftime internally.
- date_from_unix_date
- Days since epoch to date (PySpark date_from_unix_date).
- date_part
- Alias for extract (PySpark date_part).
- date_sub
- Subtract n days from date column (PySpark date_sub).
- date_trunc
- Alias for trunc (PySpark date_trunc).
- dateadd
- Alias for date_add (PySpark dateadd).
- datediff
- Number of days between two date columns (PySpark datediff).
- datepart
- Alias for extract (PySpark datepart).
- day
- Extract day of month from datetime column (PySpark day)
- dayname
- Weekday name “Mon”,“Tue”,… (PySpark dayname).
- dayofmonth
- Alias for day. PySpark dayofmonth.
- dayofweek
- Extract day of week: 1=Sunday..7=Saturday (PySpark dayofweek).
- dayofyear
- Extract day of year (1-366) (PySpark dayofyear).
- days
- Interval of n days (PySpark days). For use in date_add, timestampadd, etc.
- decode
- Decode binary (hex string) to string (PySpark decode). Charset: UTF-8.
- degrees
- Convert radians to degrees (PySpark degrees)
- dense_rank
- Dense rank window function (no gaps). Use with .over(partition_by).
- desc
- Descending sort, nulls last (Spark default for DESC).
- desc_nulls_first
- Descending sort, nulls first.
- desc_nulls_last
- Descending sort, nulls last.
- e
- Constant e = 2.718… (PySpark e).
- element_at
- Get element at 1-based index (PySpark element_at).
- elt
- Return column at 1-based index (PySpark elt). elt(2, a, b, c) returns b. Panics if columns is empty.
- encode
- Encode string to binary (PySpark encode). Charset: UTF-8. Returns hex string.
- endswith
- True if string ends with suffix (PySpark endswith).
- eq
- equal_null
- Null-safe equality: true if both null or both equal (PySpark equal_null). Alias for eq_null_safe.
- every
- Alias for bool_and (PySpark every). Use in groupBy.agg().
- execute_plan
- Execute a logical plan; returns root-owned DataFrame.
- exp
- Exponential (PySpark exp)
- explode
- Explode list into one row per element (PySpark explode).
- explode_outer
- Explode; null/empty yields one row with null (PySpark explode_outer).
- expm1
- exp(x) - 1 (PySpark expm1).
- extract
- Extract field from date/datetime (PySpark extract). field: year, month, day, hour, minute, second, quarter, week, dayofweek, dayofyear.
- factorial
- Factorial n! (PySpark factorial). n in 0..=20; null for negative or overflow.
- find_in_set
- 1-based index of str in comma-delimited set (PySpark find_in_set). 0 if not found or str contains comma.
- first
- First value in group (PySpark first). Use in groupBy.agg(). ignorenulls: when true, first non-null; Polars 0.45 uses .first() only (ignorenulls reserved for API compatibility).
- first_value
- First value in partition (PySpark first_value). Use with .over(partition_by).
- floor
- Floor (PySpark floor)
- format_number
- Format numeric as string with fixed decimal places (PySpark format_number).
- format_string
- Printf-style format (PySpark format_string). Supports %s, %d, %i, %f, %g, %%. Panics if columns is empty.
- from_csv
- Parse CSV string to struct (PySpark from_csv). Minimal implementation.
- from_json
- Parse string column as JSON into struct (PySpark from_json).
- from_unixtime
- Convert seconds since epoch to formatted string (PySpark from_unixtime).
- from_utc_timestamp
- Interpret timestamp as UTC, convert to tz (PySpark from_utc_timestamp).
- ge
- get
- Get value for key from map, or null (PySpark get).
- get_json_object
- Extract JSON path from string column (PySpark get_json_object).
- getbit
- Get bit at 0-based position (PySpark getbit).
- greatest
- Greatest of the given columns per row (PySpark greatest). Uses element-wise UDF.
- grouping
- Grouping set marker (PySpark grouping). Stub: returns 0 (no GROUPING SETS in robin-sparkless).
- grouping_id
- Grouping set id (PySpark grouping_id). Stub: returns 0.
- gt
- hash
- Hash of column values (PySpark hash). Uses Murmur3 32-bit for parity with PySpark.
- hex
- Convert to hex string (PySpark hex).
- hour
- Extract hour from datetime column (PySpark hour).
- hours
- Interval of n hours (PySpark hours).
- hypot
- sqrt(x*x + y*y) (PySpark hypot).
- ifnull
- Alias for nvl. PySpark ifnull.
- ilike
- Case-insensitive LIKE. PySpark ilike. When escape_char is Some(esc), esc + char treats that char as literal.
- initcap
- Title case (PySpark initcap)
- inline
- Explode list of structs into rows; struct fields become columns after unnest (PySpark inline). Returns the exploded struct column; use unnest to expand struct fields to columns.
- inline_outer
- Like inline but null/empty yields one row of nulls (PySpark inline_outer).
- input_file_name
- Stub input file name - empty string (PySpark input_file_name).
- instr
- Find substring position 1-based; 0 if not found (PySpark instr).
- is_in
- is_null
- isin
- Check if column values are in the given list (PySpark isin). Uses Polars is_in.
- isin_i64
- Check if column values are in the given i64 slice (PySpark isin with literal list).
- isin_str
- Check if column values are in the given string slice (PySpark isin with literal list).
- isnan
- True where the float value is NaN (PySpark isnan).
- isnotnull
- True if column is not null. PySpark isnotnull.
- isnull
- True if column is null. PySpark isnull.
- json_array_length
- Length of JSON array at path (PySpark json_array_length).
- json_object_keys
- Keys of JSON object (PySpark json_object_keys). Returns list of strings.
- json_tuple
- Extract keys from JSON as struct (PySpark json_tuple). keys: e.g. ["a", "b"].
- kurtosis
- Kurtosis aggregation (PySpark kurtosis). Fisher definition, bias=true. Use in groupBy.agg().
- lag
- Lag: value from n rows before in partition. Use with .over(partition_by).
- last_day
- Last day of month for date column (PySpark last_day).
- last_value
- Last value in partition (PySpark last_value). Use with .over(partition_by).
- lcase
- Alias for lower. PySpark lcase.
- le
- lead
- Lead: value from n rows after in partition. Use with .over(partition_by).
- least
- Least of the given columns per row (PySpark least). Uses element-wise UDF.
- left
- Leftmost n characters (PySpark left).
- length
- String length in characters (PySpark length)
- levenshtein
- Levenshtein distance (PySpark levenshtein). Not implemented: requires element-wise UDF.
- like
- SQL LIKE pattern (% any, _ one char). PySpark like. When escape_char is Some(esc), esc + char treats that char as literal.
- lit_bool
- lit_f64
- lit_i32
- lit_i64
- lit_null
- lit_str
- ln
- Alias for log (natural log). PySpark ln.
- localtimestamp
- Alias for current_timestamp (PySpark localtimestamp).
- locate
- Find substring position 1-based, starting at pos (PySpark locate). 0 if not found.
- log
- Natural logarithm (PySpark log with one arg)
- log2
- Base-2 log (PySpark log2).
- log1p
- log(1 + x) (PySpark log1p).
- log10
- Base-10 log (PySpark log10).
- log_with_base
- Logarithm with given base (PySpark log(col, base)). base must be positive and not 1.
- lower
- Convert string column to lowercase (PySpark lower)
- lpad
- Left-pad string to length with pad char (PySpark lpad).
- lt
- ltrim
- Trim leading whitespace (PySpark ltrim)
- make_date
- Build date from year, month, day columns (PySpark make_date).
- make_dt_interval
- Day-time interval: days, hours, minutes, seconds (PySpark make_dt_interval). All optional; 0 for omitted.
- make_interval
- Create interval duration (PySpark make_interval). Optional args; 0 for omitted.
- make_timestamp
- make_timestamp(year, month, day, hour, min, sec, timezone?) - six columns to timestamp (PySpark make_timestamp). When timezone is Some(tz), components are interpreted as local time in that zone, then converted to UTC.
- make_timestamp_ntz
- Alias for make_timestamp (PySpark make_timestamp_ntz - no timezone).
- make_ym_interval
- Year-month interval (PySpark make_ym_interval). Polars has no native YM type; return months as Int32 (years*12 + months).
- map_concat
- Merge two map columns (PySpark map_concat). Last value wins for duplicate keys.
- map_contains_key
- True if map contains key (PySpark map_contains_key).
- map_entries
- Return map as list of structs {key, value} (PySpark map_entries).
- map_filter
- Filter map entries by predicate (PySpark map_filter).
- map_filter_value_gt
- Convenience: map_filter with value > threshold predicate.
- map_from_arrays
- Build map from two array columns keys and values (PySpark map_from_arrays). Implemented via UDF.
- map_from_entries
- Array of structs {key, value} to map (PySpark map_from_entries).
- map_keys
- Extract keys from a map column (PySpark map_keys). Map is List(Struct{key, value}).
- map_values
- Extract values from a map column (PySpark map_values).
- map_zip_with
- Merge two maps by key with merge function (PySpark map_zip_with).
- map_zip_with_coalesce
- Convenience: map_zip_with with coalesce(value1, value2) merge.
- mask
- Mask string: replace upper/lower/digit/other with given chars (PySpark mask).
- max
- max_by
- Value of value_col in the row where ord_col is maximum (PySpark max_by). Use in groupBy.agg().
- md5
- MD5 hash of string bytes, return hex string (PySpark md5).
- mean
- median
- Median aggregation. PySpark median.
- min
- min_by
- Value of value_col in the row where ord_col is minimum (PySpark min_by). Use in groupBy.agg().
- minute
- Extract minute from datetime column (PySpark minute).
- minutes
- Interval of n minutes (PySpark minutes).
- mode
- Mode aggregation - most frequent value. PySpark mode.
- monotonically_increasing_id
- Stub monotonically_increasing_id - constant 0 (PySpark monotonically_increasing_id). Note: differs from PySpark which is unique per-row; see PYSPARK_DIFFERENCES.md.
- month
- Extract month from datetime column (PySpark month)
- months
- Interval of n months (PySpark months). Approximated as 30*n days.
- months_between
- Months between end and start dates as fractional (PySpark months_between). When round_off is true, rounds to 8 decimal places (PySpark default).
- named_struct
- Create struct with explicit field names (PySpark named_struct). Pairs of (name, column). Panics if pairs is empty.
- nanvl
- Replace NaN with value. PySpark nanvl.
- ne
- negate
- Unary minus / negate (PySpark negate, negative).
- negative
- Alias for negate. PySpark negative.
- next_day
- Next date that is the given weekday (e.g. "Mon") (PySpark next_day).
- not_
- now
- Alias for current_timestamp (PySpark now).
- nth_value
- Nth value in partition by order (1-based n). Window is applied; do not call .over() again.
- ntile
- Ntile: bucket 1..n by rank within partition. Window is applied.
- nullif
- Return null if column equals value, else column. PySpark nullif.
- nvl
- Alias for coalesce(col, value). PySpark nvl / ifnull.
- nvl2
- Three-arg null replacement: if col1 is not null then col2 else col3. PySpark nvl2.
- octet_length
- Length of string in bytes (PySpark octet_length).
- or_
- overlay
- Replace substring at 1-based position (PySpark overlay). replace is literal.
- parse_type_name
- Parse PySpark-like type name to Polars DataType. Decimal(precision, scale) is mapped to Float64 for schema parity (Polars dtype-decimal not enabled).
- parse_url
- Parse URL and extract part: PROTOCOL, HOST, PATH, etc. (PySpark parse_url). When key is Some(k) and part is QUERY/QUERYSTRING, returns the value for that query parameter only.
- percent_rank
- Percent rank in partition: (rank - 1) / (count - 1). Window is applied.
- percentile_approx
- Approximate percentile (PySpark percentile_approx). Alias for approx_percentile.
- pi
- Constant pi = 3.14159… (PySpark pi).
- pmod
- Positive modulus (PySpark pmod).
- posexplode
- Explode list with position (PySpark posexplode). Returns (pos_column, value_column). pos is 1-based; implemented via list.eval(cum_count()).explode() and explode().
- posexplode_outer
- Posexplode with null preservation (PySpark posexplode_outer).
- position
- Position of substring in column (PySpark position). Same as instr; (substr, col) argument order.
- positive
- Unary plus - no-op, returns column as-is (PySpark positive).
- pow
- Power (PySpark pow)
- power
- Alias for pow. PySpark power.
- printf
- Alias for format_string (PySpark printf).
- quarter
- Extract quarter (1-4) from date/datetime (PySpark quarter).
- radians
- Convert degrees to radians (PySpark radians)
- raise_error
- Raise an error when evaluated (PySpark raise_error). Always fails with the given message.
- rand
- Random uniform [0, 1) per row, with optional seed (PySpark rand). When added via with_column, generates one distinct value per row (PySpark-like).
- randn
- Random standard normal per row, with optional seed (PySpark randn). When added via with_column, generates one distinct value per row (PySpark-like).
- rank
- Rank window function (ties same rank, gaps). Use with .over(partition_by).
- regexp
- Alias for rlike (PySpark regexp).
- regexp_count
- Count of non-overlapping regex matches (PySpark regexp_count).
- regexp_extract
- Extract first match of regex (PySpark regexp_extract). group_index 0 = full match.
- regexp_extract_all
- Extract all matches of regex (PySpark regexp_extract_all).
- regexp_instr
- 1-based position of first regex match (PySpark regexp_instr).
- regexp_like
- Check if string matches regex (PySpark regexp_like / rlike).
- regexp_replace
- Replace first match of regex (PySpark regexp_replace)
- regexp_substr
- First substring matching regex (PySpark regexp_substr). Null if no match.
- regr_avgx_expr
- Regression: average of x (PySpark regr_avgx).
- regr_avgy_expr
- Regression: average of y (PySpark regr_avgy).
- regr_count_expr
- Regression: count of (y, x) pairs where both non-null (PySpark regr_count).
- regr_intercept_expr
- Regression intercept: avg_y - slope*avg_x (PySpark regr_intercept).
- regr_r2_expr
- Regression R-squared (PySpark regr_r2).
- regr_slope_expr
- Regression slope: cov_samp(y,x)/var_samp(x) (PySpark regr_slope).
- regr_sxx_expr
- Regression: sum((x - avg_x)^2) (PySpark regr_sxx).
- regr_sxy_expr
- Regression: sum((x - avg_x)(y - avg_y)) (PySpark regr_sxy).
- regr_syy_expr
- Regression: sum((y - avg_y)^2) (PySpark regr_syy).
- repeat
- Repeat string n times (PySpark repeat).
- replace
- Replace all occurrences of search with replacement (literal). PySpark replace.
- reverse
- Reverse string (PySpark reverse).
- right
- Rightmost n characters (PySpark right).
- rint
- Round to nearest integer (PySpark rint).
- rlike
- Alias for regexp_like. PySpark rlike / regexp.
- round
- Round (PySpark round)
- row_number
- Row number window function (1, 2, 3 by order within partition). Use with .over(partition_by) after ranking by an order column.
- rpad
- Right-pad string to length with pad char (PySpark rpad).
- rtrim
- Trim trailing whitespace (PySpark rtrim)
- schema_from_json
- Parse a schema from a JSON string (e.g. from a host binding).
- schema_of_csv
- Schema of CSV string (PySpark schema_of_csv). Returns literal schema string; minimal stub.
- schema_of_json
- Schema of JSON string (PySpark schema_of_json). Returns literal schema string; minimal stub.
- sec
- Secant: 1/cos (PySpark sec).
- second
- Extract second from datetime column (PySpark second).
- sequence
- Generate array of numbers from start to stop (inclusive) with optional step (PySpark sequence). step defaults to 1.
- sha1
- SHA1 hash of string bytes, return hex string (PySpark sha1).
- sha2
- SHA2 hash; bit_length 256, 384, or 512 (PySpark sha2).
- shift_left
- Bitwise left shift (PySpark shiftLeft). col << n.
- shift_right
- Bitwise signed right shift (PySpark shiftRight). col >> n.
- shift_right_unsigned
- Bitwise unsigned right shift (PySpark shiftRightUnsigned). Logical shift for Long.
- shuffle
- Random permutation of list elements (PySpark shuffle).
- sign
- Alias for signum (PySpark sign).
- signum
- Sign of the number: -1, 0, or 1 (PySpark signum)
- sin
- Sine in radians (PySpark sin)
- sinh
- Hyperbolic sine (PySpark sinh).
- size
- Alias for array_size (PySpark size).
- skewness
- Skewness aggregation (PySpark skewness). bias=true. Use in groupBy.agg().
- soundex
- Soundex code (PySpark soundex). Not implemented: requires element-wise UDF.
- spark_partition_id
- Stub partition id - always 0 (PySpark spark_partition_id).
- split
- Split string by delimiter (PySpark split). Optional limit: at most that many parts (remainder in last).
- split_part
- Split by delimiter and return 1-based part (PySpark split_part).
- sqrt
- Square root (PySpark sqrt)
- stack
- Stack columns into struct (PySpark stack). Alias for struct_.
- startswith
- True if string starts with prefix (PySpark startswith).
- std
- Alias for stddev (PySpark std).
- stddev
- Standard deviation (sample) aggregation (PySpark stddev / stddev_samp)
- stddev_pop
- Population standard deviation (ddof=0). PySpark stddev_pop.
- stddev_samp
- Sample standard deviation (ddof=1). Alias for stddev. PySpark stddev_samp.
- str_to_map
- Parse string to map (PySpark str_to_map). Default delims: "," and ":".
- struct_
- Create struct from columns using column names as field names (PySpark struct). Panics if columns is empty.
- substr
- Alias for substring. PySpark substr.
- substring
- Substring with 1-based start (PySpark substring semantics)
- substring_index
- Substring before/after nth delimiter (PySpark substring_index).
- sum
- tan
- Tangent in radians (PySpark tan)
- tanh
- Hyperbolic tangent (PySpark tanh).
- timestamp_micros
- Convert microseconds since epoch to timestamp (PySpark timestamp_micros).
- timestamp_millis
- Convert milliseconds since epoch to timestamp (PySpark timestamp_millis).
- timestamp_seconds
- Convert seconds since epoch to timestamp (PySpark timestamp_seconds).
- timestampadd
- Add amount of unit to timestamp (PySpark timestampadd).
- timestampdiff
- Difference between timestamps in unit (PySpark timestampdiff).
- to_binary
- Convert to binary (PySpark to_binary). fmt: 'utf-8', 'hex'.
- to_char
- Cast to string, optionally with format for datetime (PySpark to_char, to_varchar). When format is Some, uses date_format for datetime columns (PySpark format → chrono strftime); otherwise cast to string. Returns Err if the cast to string fails (invalid type name or unsupported column type).
- to_csv
- Format struct as CSV string (PySpark to_csv). Minimal implementation.
- to_date
- Cast or parse to date (PySpark to_date). When format is None: cast date/datetime to date, parse string with default formats. When format is Some: parse string with given format.
- to_degrees
- Alias for degrees. PySpark toDegrees.
- to_engine_error
- Convert PolarsError to EngineError (for APIs that still return PolarsError).
- to_json
- Serialize struct column to JSON string (PySpark to_json).
- to_number
- Cast to numeric (PySpark to_number). Uses Double. Format parameter reserved for future use. Returns Err if the cast to double fails (invalid type name or unsupported column type).
- to_radians
- Alias for radians. PySpark toRadians.
- to_timestamp
- Cast to timestamp, or parse with format when provided (PySpark to_timestamp). When format is None, parses string columns with default format "%Y-%m-%d %H:%M:%S" (PySpark parity #273).
- to_timestamp_ltz
- Parse as timestamp in local timezone, return UTC (PySpark to_timestamp_ltz).
- to_timestamp_ntz
- Parse as timestamp without timezone (PySpark to_timestamp_ntz). Returns Datetime(_, None).
- to_unix_timestamp
- Alias for unix_timestamp.
- to_utc_timestamp
- Interpret timestamp as in tz, convert to UTC (PySpark to_utc_timestamp).
- to_varchar
- Alias for to_char (PySpark to_varchar).
- transform_keys
- Transform map keys by expr (PySpark transform_keys).
- transform_values
- Transform map values by expr (PySpark transform_values).
- translate
- Character-by-character translation (PySpark translate).
- trim
- Trim leading and trailing whitespace (PySpark trim)
- trunc
- Truncate date/datetime to unit (PySpark trunc).
- try_add
- Add that returns null on overflow (PySpark try_add). Uses checked arithmetic.
- try_aes_decrypt
- Try AES decrypt (PySpark try_aes_decrypt). Returns null on failure.
- try_avg
- Average aggregation; null on invalid (PySpark try_avg). Use in groupBy.agg(). Maps to mean; reserved for API.
- try_cast
- Cast column to the given type, returning null on invalid conversion (PySpark try_cast). String-to-boolean uses custom parsing ("true"/"false"/"1"/"0") since Polars does not support Utf8->Boolean. String-to-date accepts date and datetime strings; invalid strings become null.
- try_divide
- Division that returns null on divide-by-zero (PySpark try_divide).
- try_element_at
- Element at index, null if out of bounds (PySpark try_element_at). Same as element_at for lists.
- try_multiply
- Multiply that returns null on overflow (PySpark try_multiply).
- try_subtract
- Subtract that returns null on overflow (PySpark try_subtract).
- try_sum
- Sum aggregation; null on overflow (PySpark try_sum). Use in groupBy.agg(). Polars sum does not overflow; reserved for API.
- try_to_binary
- Try convert to binary; null on failure (PySpark try_to_binary).
- try_to_number
- Cast to numeric, null on invalid (PySpark try_to_number). Format parameter reserved for future use. Returns Err if the try_cast setup fails (invalid type name); column values that cannot be parsed become null.
- try_to_timestamp
- Cast to timestamp, null on invalid, or parse with format when provided (PySpark try_to_timestamp). When format is None, parses string columns with default format (null on invalid). #273
- typeof_
- Data type of column as string (PySpark typeof). Constant per column from schema.
- ucase
- Alias for upper. PySpark ucase.
- unbase64
- Base64 decode to string (PySpark unbase64). Invalid decode → null.
- unhex
- Convert hex string to binary/string (PySpark unhex).
- unix_date
- Date to days since 1970-01-01 (PySpark unix_date).
- unix_micros
- Timestamp to microseconds since epoch (PySpark unix_micros).
- unix_millis
- Timestamp to milliseconds since epoch (PySpark unix_millis).
- unix_seconds
- Timestamp to seconds since epoch (PySpark unix_seconds).
- unix_timestamp
- Parse string timestamp to seconds since epoch (PySpark unix_timestamp). format defaults to yyyy-MM-dd HH:mm:ss.
- unix_timestamp_now
- Current Unix timestamp in seconds (PySpark unix_timestamp with no args).
- upper
- Convert string column to uppercase (PySpark upper)
- url_decode
- Percent-decode URL-encoded string (PySpark url_decode).
- url_encode
- Percent-encode string for URL (PySpark url_encode).
- user
- User stub (PySpark user).
- var_pop
- Population variance (ddof=0). PySpark var_pop.
- var_samp
- Sample variance (ddof=1). Alias for variance. PySpark var_samp.
- variance
- Variance (sample) aggregation (PySpark variance / var_samp)
- version
- Session/library version string (PySpark version).
- weekday
- Weekday 0=Mon, 6=Sun (PySpark weekday).
- weekofyear
- Extract ISO week of year (1-53) (PySpark weekofyear).
- when
- Start a when(condition).then(…).otherwise(…) chain.
- when_then_otherwise_null
- Two-arg when(condition, value): returns value where condition is true, null otherwise (PySpark when(cond, val)).
- width_bucket
- Assign value to histogram bucket (PySpark width_bucket). Returns 0 if v < min_val, num_bucket+1 if v >= max_val.
- xxhash64
- XXH64 hash (PySpark xxhash64). Not implemented: requires element-wise UDF.
- year
- Extract year from datetime column (PySpark year)
- years
- Interval of n years (PySpark years). Approximated as 365*n days.
- zip_with
- Zip two arrays element-wise with merge function (PySpark zip_with).
- zip_with_coalesce
- Convenience: zip_with with coalesce(left, right) merge.
Type Aliases§
- CollectedRows
- Rows as list of maps: column name -> JSON value. Used by bindings and collect.
- Expr