Robin Sparkless - A Rust DataFrame library with PySpark-like API
This library provides a PySpark-compatible API. The root crate is engine-agnostic: it depends on robin-sparkless-core (types, expression IR, config) and one backend—currently robin-sparkless-polars, which uses Polars for execution. The public API exposes engine-agnostic expression types where possible.
§Expression APIs
- ExprIr (engine-agnostic): Use col, lit_i64, lit_str, when, gt, eq, etc. from the crate root (re-exported from robin_sparkless_core). These build an ExprIr tree. Use DataFrame::filter_expr_ir, DataFrame::select_expr_ir, DataFrame::with_column_expr_ir, DataFrame::collect_rows, and GroupedData::agg_expr_ir with &ExprIr / &[ExprIr]. Collect returns CollectedRows (JSON-like rows). Prefer this for new code and embeddings; see the sketch below.
- Column / Expr (Polars-backed): Use prelude or robin_sparkless::functions::{col, lit_i64, ...} for the full PySpark-like API that returns Column and uses Polars Expr. Use DataFrame::filter, DataFrame::with_column, DataFrame::select_exprs, etc. with those types. Still supported for compatibility and advanced use.
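A minimal sketch of the ExprIr path. The free functions and method names are the ones listed on this page; whether filter_expr_ir and collect_rows return Result (and with which error type) is an assumption, so check the method docs for the exact signatures.

    use robin_sparkless::{col, gt, lit_i64, CollectedRows, DataFrame, EngineError};

    // Filter rows where age > 21 with the engine-agnostic ExprIr API,
    // then collect JSON-like rows (column name -> value).
    fn adults(df: DataFrame) -> Result<CollectedRows, EngineError> {
        // col/lit_i64/gt build an ExprIr tree; no Polars types appear here.
        let predicate = gt(col("age"), lit_i64(21));
        let filtered = df.filter_expr_ir(&predicate)?;
        filtered.collect_rows()
    }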
§Getting started and embedding
For application code and embedding, use the prelude: use robin_sparkless::prelude::*.
For a minimal FFI surface, use prelude::embed. For engine-agnostic expressions, use the
root re-exports (col, lit_i64, gt, etc.) and the *_expr_ir / collect_rows methods.
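For comparison, the same filter through the Polars-backed Column API via the prelude. This is a sketch only: the exact prelude contents, whether gt here is a free function over Columns (as in the functions module), and the return type of DataFrame::filter are assumptions.

    use robin_sparkless::prelude::*;
    use robin_sparkless::{DataFrame, EngineError};

    // Polars-backed path: col returns a Column wrapping a Polars Expr.
    fn adults_polars(df: DataFrame) -> Result<DataFrame, EngineError> {
        df.filter(gt(col("age"), lit_i64(21)))
    }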
§Panics and errors
Some functions panic when used with invalid or empty inputs (e.g. calling
when(cond).otherwise(val) without .then(), or passing no columns to
format_string, elt, concat, coalesce, or named_struct in Rust).
In Rust, create_map and array return Result for empty input instead of
panicking. From Python, empty columns for coalesce, format_string,
printf, and named_struct raise ValueError. See the documentation for
each function for details.
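When calling the fallible constructors from Rust, handle the Result explicitly rather than relying on a panic. A sketch, assuming create_map takes a slice of Columns and that its error type implements std::error::Error:

    use robin_sparkless::functions::{col, create_map};
    use robin_sparkless::Column;

    // create_map returns a Result, so empty or otherwise invalid input surfaces
    // as an error value instead of a panic.
    fn key_value_map() -> Result<Column, Box<dyn std::error::Error>> {
        let m = create_map(&[col("key"), col("value")])?;
        Ok(m)
    }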
§API stability
While the crate is in the 0.x series, we follow semver but may introduce breaking changes in minor releases (e.g. 0.1 → 0.2) until 1.0. For behavioral caveats and intentional differences from PySpark, see the repository documentation.
Re-exports§
pub use dataframe::CubeRollupData;
pub use dataframe::DataFrame;
pub use dataframe::DataFrameNa;
pub use dataframe::DataFrameStat;
pub use dataframe::DataFrameWriter;
pub use dataframe::GroupedData;
pub use dataframe::PivotedGroupedData;
pub use session::DataFrameReader;
pub use session::SparkSession;
pub use session::SparkSessionBuilder;
pub use traits::FromRobinDf;
pub use traits::IntoRobinDf;
Modules§
- column
- config
- Configuration for robin-sparkless sessions.
- dataframe
- Root-owned DataFrame API; delegates to robin-sparkless-polars for execution.
- error
- Engine error type for embedders (Polars conversion in this crate).
- expression
- functions
- Expression builders: sort, types, column refs/literals, agg, when, string, datetime, struct/map/array, cast, hash, misc. Re-exports all public items so use crate::functions::* and pub use functions::* keep the same API.
- plan
- Plan execution; use execute_plan for root session/DataFrame. Re-exports plan error types.
- prelude
- One-stop prelude for application code and embedding.
- schema
- Schema types and Polars conversion. Types from core; conversion from robin-sparkless-polars.
- session
- Root-owned Session API; delegates to robin-sparkless-polars for execution.
- sql
- SQL parsing and execution; returns root-owned DataFrame.
- traits
- Traits for conversion to and from root-owned DataFrames.
- type_coercion
Structs§
- ChainedWhenBuilder
- Builder for an additional when-then clause (returned by ThenBuilder::when).
- Column
- Column - represents a column in a DataFrame, used for building expressions. Thin wrapper around Polars Expr. May carry a DeferredRandom for rand/randn so with_column can produce one value per row. May carry UdfCall for Python UDFs (eager execution at with_column).
- PlanExprError
- Error from parsing or interpreting a plan expression.
- SortOrder
- Sort order specification for use in orderBy/sort. Holds expr + direction + null placement.
- SparklessConfig
- Configuration for building a SparkSession.
- StructField
- StructType
- ThenBuilder
- Builder for chaining when-then clauses before finalizing with otherwise.
- UdfRegistry
- Session-scoped UDF registry. Rust UDFs run lazily via Polars Expr::map.
- WhenBuilder
- When-then-otherwise builder.
- WhenThenBuilder
Enums§
- DataType
- EngineError
- Unified error type for robin-sparkless operations.
- ExprIr
- Expression IR: a single, serializable tree that backends convert to their native Expr.
- JoinType
- Re-export for API compatibility. Join type for DataFrame joins (PySpark-compatible).
- LiteralValue
- Literal value in an expression (engine-agnostic).
- PlanError
- Errors from plan execution.
- PolarsError
- SaveMode
- Re-export for API compatibility. Save mode for saveAsTable (PySpark default is ErrorIfExists).
- SelectItem
- Re-export for API compatibility. Select item: either a column name (str) or an expression (PySpark parity: select("a", col("b").alias("x"))). Fixes #645: select expects Column or str.
- WriteFormat
- Re-export for API compatibility. Output format for generic write (PySpark DataFrameWriter.format).
- WriteMode
- Re-export for API compatibility. Write mode: overwrite or append (PySpark DataFrameWriter.mode for path-based save).
Traits§
- RustUdf
- Rust UDF: takes columns as Series, returns one Series. Used via Expr::map / map_many.
- StructTypePolarsExt
- Extension trait for Polars schema conversion. Implemented for StructType from core. Bring this trait into scope to use StructType::from_polars_schema and to_polars_schema.
Functions§
- abs
- Absolute value (PySpark abs)
- acos
- Arc cosine (PySpark acos)
- acosh
- Inverse hyperbolic cosine (PySpark acosh).
- add_months
- Add n months to date column (PySpark add_months).
- aes_decrypt
- AES decrypt (PySpark aes_decrypt). Input hex(nonce||ciphertext).
- aes_encrypt
- AES encrypt (PySpark aes_encrypt). Key as string; AES-128-GCM.
- aggregate
- Array fold/aggregate (PySpark aggregate). Simplified: zero + sum(list elements).
- alias
- Alias an expression with a new output name.
- and_
- any_value
- Any value from the group (PySpark any_value). Use in groupBy.agg(). ignorenulls reserved for API compatibility.
- approx_count_distinct
- Approximate count distinct (PySpark approx_count_distinct). Use in groupBy.agg(). rsd reserved for API compatibility; Polars uses exact n_unique.
- approx_percentile
- Approximate percentile (PySpark approx_percentile). Maps to quantile; percentage in 0.0..=1.0. accuracy reserved for API compatibility.
- array
- Create an array column from multiple columns (PySpark array). With no arguments, returns a column of empty arrays (one per row); PySpark parity.
- array_agg
- Collect to array (PySpark array_agg).
- array_append
- Append element to end of list (PySpark array_append).
- array_compact
- Remove null elements from list (PySpark array_compact).
- array_contains
- Check if list contains value (PySpark array_contains).
- array_distinct
- Distinct elements in list (PySpark array_distinct).
- array_except
- Elements in first array not in second (PySpark array_except).
- array_exists
- True if any list element satisfies the predicate (PySpark exists).
- array_filter
- Filter list elements by predicate (PySpark filter).
- array_flatten
- Flatten list of lists to one list (PySpark flatten). Not implemented.
- array_forall
- True if all list elements satisfy the predicate (PySpark forall).
- array_insert
- Insert element at 1-based position (PySpark array_insert).
- array_intersect
- Elements in both arrays (PySpark array_intersect).
- array_join
- Join list of strings with separator (PySpark array_join).
- array_max
- Maximum element in list (PySpark array_max).
- array_mean
- Mean of list elements (PySpark aggregate avg).
- array_min
- Minimum element in list (PySpark array_min).
- array_position
- 1-based index of first occurrence of value in list, or 0 if not found (PySpark array_position). Implemented via Polars list.eval with col("") as element.
- array_prepend
- Prepend element to start of list (PySpark array_prepend).
- array_remove
- New list with all elements equal to value removed (PySpark array_remove). Implemented via Polars list.eval + list.drop_nulls.
- array_repeat
- Repeat each element n times (PySpark array_repeat). Not implemented: would require list.eval with dynamic repeat.
- array_size
- Number of elements in list (PySpark size / array_size). Returns Int32.
- array_slice
- Slice list from 1-based start with optional length (PySpark slice).
- array_sort
- Sort list elements (PySpark array_sort).
- array_sum
- Sum of list elements (PySpark aggregate sum).
- array_transform
- Transform list elements by expression (PySpark transform).
- array_union
- Distinct elements from both arrays (PySpark array_union).
- arrays_overlap
- True if two arrays have any element in common (PySpark arrays_overlap).
- arrays_zip
- Zip arrays into array of structs (PySpark arrays_zip).
- asc
- Ascending sort, nulls first (Spark default for ASC).
- asc_nulls_first
- Ascending sort, nulls first.
- asc_nulls_last
- Ascending sort, nulls last.
- ascii
- ASCII value of first character (PySpark ascii). Returns Int32.
- asin
- Arc sine (PySpark asin)
- asinh
- Inverse hyperbolic sine (PySpark asinh).
- assert_true
- Assert that all boolean values are true; errors otherwise (PySpark assert_true). When err_msg is Some, it is used in the error message when assertion fails.
- atan
- Arc tangent (PySpark atan)
- atan2
- Two-argument arc tangent atan2(y, x) in radians (PySpark atan2)
- atanh
- Inverse hyperbolic tangent (PySpark atanh).
- avg
- Average aggregation
- base64
- Base64 encode string bytes (PySpark base64).
- between
- bin
- Convert integer to binary string (PySpark bin).
- bit_and
- Bitwise AND of two integer/boolean columns (PySpark bit_and).
- bit_count
- Count of set bits in the integer representation (PySpark bit_count).
- bit_get
- Alias for getbit (PySpark bit_get).
- bit_length
- Bit length of string (bytes * 8) (PySpark bit_length).
- bit_or
- Bitwise OR of two integer/boolean columns (PySpark bit_or).
- bit_xor
- Bitwise XOR of two integer/boolean columns (PySpark bit_xor).
- bitmap_bit_position
- Map integral value (0–32767) to bit position for bitmap aggregates (PySpark bitmap_bit_position).
- bitmap_bucket_number
- Bucket number for distributed bitmap (PySpark bitmap_bucket_number). value / 32768.
- bitmap_construct_agg
- Aggregate: bitwise OR of bit positions into one bitmap binary (PySpark bitmap_construct_agg). Use in group_by(…).agg([bitmap_construct_agg(col)]).
- bitmap_count
- Count set bits in a bitmap binary column (PySpark bitmap_count).
- bitmap_or_agg
- Aggregate: bitwise OR of bitmap binary column (PySpark bitmap_or_agg).
- bitwise_not
- Bitwise NOT of an integer/boolean column (PySpark bitwise_not / bitwiseNOT).
- bool_and
- Boolean AND across group (PySpark bool_and). Use in groupBy.agg(); column should be boolean.
- broadcast
- Broadcast hint - no-op that returns the same DataFrame (PySpark broadcast).
- bround
- Banker’s rounding - round half to even (PySpark bround).
- btrim
- Trim leading and trailing chars (PySpark btrim). trim_str defaults to whitespace.
- call
- Generic function call (for the long tail of functions).
- call_udf
- Call a registered UDF by name. PySpark: F.call_udf(udfName, *cols). Requires a session (set by get_or_create). Raises if UDF not found.
- cardinality
- Cardinality: number of elements in array (PySpark cardinality). Alias for size/array_size.
- cast
- Cast column to the given type (PySpark cast). Fails on invalid conversion. String-to-boolean uses custom parsing (“true”/“false”/“1”/“0”) since Polars does not support Utf8->Boolean. String-to-date accepts date and datetime strings (e.g. “2025-01-01 10:30:00” truncates to date) for Spark parity.
- cbrt
- Cube root (PySpark cbrt).
- ceil
- Ceiling (PySpark ceil)
- ceiling
- Alias for ceil. PySpark ceiling.
- char
- Int to single-character string (PySpark char). Valid codepoint only.
- char_length
- Length of string in characters (PySpark char_length). Alias of length().
- character_length
- Length of string in characters (PySpark character_length). Alias of length().
- chr
- Alias for char (PySpark chr).
- coalesce
- Coalesce - returns the first non-null value from multiple columns.
- col
- Column reference.
- collect_list
- Collect column values into list per group (PySpark collect_list). Use in groupBy.agg().
- collect_set
- Collect distinct column values into list per group (PySpark collect_set). Use in groupBy.agg().
- concat
- Concatenate string columns without separator (PySpark concat). Panics if columns is empty.
- concat_ws
- Concatenate string columns with separator (PySpark concat_ws). Panics if columns is empty.
- contains
- True if string contains substring (literal). PySpark contains.
- conv
- Base conversion (PySpark conv). num from from_base to to_base.
- convert_timezone
- Convert timestamp between timezones (PySpark convert_timezone).
- corr
- Pearson correlation aggregation (PySpark corr). Module-level; use in groupBy.agg() with two columns.
- corr_expr
- Pearson correlation aggregation (PySpark corr). Returns Expr for use in groupBy.agg().
- cos
- Cosine in radians (PySpark cos)
- cosh
- Hyperbolic cosine (PySpark cosh).
- cot
- Cotangent: 1/tan (PySpark cot).
- count
- count_distinct
- Count distinct aggregation (PySpark countDistinct)
- count_if
- Count rows where condition is true (PySpark count_if). Use in groupBy.agg(); column should be boolean (true=1, false=0).
- covar_pop
- Population covariance aggregation (PySpark covar_pop). Module-level; use in groupBy.agg() with two columns.
- covar_pop_expr
- Population covariance aggregation (PySpark covar_pop). Returns Expr for use in groupBy.agg().
- covar_samp_expr
- Sample covariance aggregation (PySpark covar_samp). Returns Expr for use in groupBy.agg().
- crc32
- CRC32 of string bytes (PySpark crc32). Not implemented: requires element-wise UDF.
- create_map
- Build a map column from alternating key/value expressions (PySpark create_map). Returns List(Struct{key, value}) using Polars as_struct and concat_list. With no args (or empty slice), returns a column of empty maps per row (PySpark parity #275).
- csc
- Cosecant: 1/sin (PySpark csc).
- cume_dist
- Cumulative distribution in partition: row_number / count. Window is applied.
- curdate
- Alias for current_date (PySpark curdate).
- current_catalog
- Current catalog name stub (PySpark current_catalog).
- current_database
- Current database/schema name stub (PySpark current_database).
- current_date
- Current date (evaluation time). PySpark current_date.
- current_schema
- Current schema name stub (PySpark current_schema).
- current_timestamp
- Current timestamp (evaluation time). PySpark current_timestamp.
- current_timezone
- Current session timezone (PySpark current_timezone). Default "UTC". Returns literal column.
- current_user
- Current user stub (PySpark current_user).
- date_add
- Add n days to date column (PySpark date_add).
- date_diff
- Alias for datediff (PySpark date_diff). date_diff(end, start).
- date_format
- Format date/datetime as string (PySpark date_format). Accepts PySpark/Java SimpleDateFormat style (e.g. "yyyy-MM") and converts to chrono strftime internally.
- date_from_unix_date
- Days since epoch to date (PySpark date_from_unix_date).
- date_part
- Alias for extract (PySpark date_part).
- date_sub
- Subtract n days from date column (PySpark date_sub).
- date_trunc
- Alias for trunc (PySpark date_trunc).
- dateadd
- Alias for date_add (PySpark dateadd).
- datediff
- Number of days between two date columns (PySpark datediff).
- datepart
- Alias for extract (PySpark datepart).
- day
- Extract day of month from datetime column (PySpark day)
- dayname
- Weekday name “Mon”,“Tue”,… (PySpark dayname).
- dayofmonth
- Alias for day. PySpark dayofmonth.
- dayofweek
- Extract day of week: 1=Sunday..7=Saturday (PySpark dayofweek).
- dayofyear
- Extract day of year (1-366) (PySpark dayofyear).
- days
- Interval of n days (PySpark days). For use in date_add, timestampadd, etc.
- decode
- Decode binary (hex string) to string (PySpark decode). Charset: UTF-8.
- degrees
- Convert radians to degrees (PySpark degrees)
- dense_rank
- Dense rank window function (no gaps). Use with .over(partition_by).
- desc
- Descending sort, nulls last (Spark default for DESC).
- desc_nulls_first
- Descending sort, nulls first.
- desc_nulls_last
- Descending sort, nulls last.
- e
- Constant e = 2.718… (PySpark e).
- element_at
- Get element at 1-based index (PySpark element_at).
- elt
- Return column at 1-based index (PySpark elt). elt(2, a, b, c) returns b. Panics if columns is empty.
- encode
- Encode string to binary (PySpark encode). Charset: UTF-8. Returns hex string.
- endswith
- True if string ends with suffix (PySpark endswith).
- eq
- equal_null
- Null-safe equality: true if both null or both equal (PySpark equal_null). Alias for eq_null_safe.
- every
- Alias for bool_and (PySpark every). Use in groupBy.agg().
- execute_plan
- Execute a logical plan; returns root-owned DataFrame.
- exp
- Exponential (PySpark exp)
- explode
- Explode list into one row per element (PySpark explode).
- explode_outer
- Explode; null/empty yields one row with null (PySpark explode_outer).
- expm1
- exp(x) - 1 (PySpark expm1).
- extract
- Extract field from date/datetime (PySpark extract). field: year, month, day, hour, minute, second, quarter, week, dayofweek, dayofyear.
- factorial
- Factorial n! (PySpark factorial). n in 0..=20; null for negative or overflow.
- find_in_set
- 1-based index of str in comma-delimited set (PySpark find_in_set). 0 if not found or str contains comma.
- first
- First value in group (PySpark first). Use in groupBy.agg(). ignorenulls: when true, first non-null; Polars 0.45 uses .first() only (ignorenulls reserved for API compatibility).
- first_value
- First value in partition (PySpark first_value). Use with .over(partition_by).
- floor
- Floor (PySpark floor)
- format_number
- Format numeric as string with fixed decimal places (PySpark format_number).
- format_string
- Printf-style format (PySpark format_string). Supports %s, %d, %i, %f, %g, %%. Panics if columns is empty.
- from_csv
- Parse CSV string to struct (PySpark from_csv). Minimal implementation.
- from_json
- Parse string column as JSON into struct (PySpark from_json).
- from_unixtime
- Convert seconds since epoch to formatted string (PySpark from_unixtime).
- from_utc_timestamp
- Interpret timestamp as UTC, convert to tz (PySpark from_utc_timestamp).
- ge
- get
- Get value for key from map, or null (PySpark get).
- get_json_object
- Extract JSON path from string column (PySpark get_json_object).
- getbit
- Get bit at 0-based position (PySpark getbit).
- greatest
- Greatest of the given columns per row (PySpark greatest). Uses element-wise UDF.
- grouping
- Grouping set marker (PySpark grouping). Stub: returns 0 (no GROUPING SETS in robin-sparkless).
- grouping_id
- Grouping set id (PySpark grouping_id). Stub: returns 0.
- gt
- hash
- Hash of column values (PySpark hash). Uses Murmur3 32-bit for parity with PySpark.
- hex
- Convert to hex string (PySpark hex).
- hour
- Extract hour from datetime column (PySpark hour).
- hours
- Interval of n hours (PySpark hours).
- hypot
- sqrt(x*x + y*y) (PySpark hypot).
- ifnull
- Alias for nvl. PySpark ifnull.
- ilike
- Case-insensitive LIKE. PySpark ilike. When escape_char is Some(esc), esc + char treats that char as literal.
- initcap
- Title case (PySpark initcap)
- inline
- Explode list of structs into rows; struct fields become columns after unnest (PySpark inline). Returns the exploded struct column; use unnest to expand struct fields to columns.
- inline_outer
- Like inline but null/empty yields one row of nulls (PySpark inline_outer).
- input_file_name
- Stub input file name - empty string (PySpark input_file_name).
- instr
- Find substring position 1-based; 0 if not found (PySpark instr).
- is_in
- is_null
- isin
- Check if column values are in the given list (PySpark isin). Uses Polars is_in.
- isin_i64
- Check if column values are in the given i64 slice (PySpark isin with literal list).
- isin_str
- Check if column values are in the given string slice (PySpark isin with literal list).
- isnan
- True where the float value is NaN (PySpark isnan).
- isnotnull
- True if column is not null. PySpark isnotnull.
- isnull
- True if column is null. PySpark isnull.
- json_array_length
- Length of JSON array at path (PySpark json_array_length).
- json_object_keys
- Keys of JSON object (PySpark json_object_keys). Returns list of strings.
- json_tuple
- Extract keys from JSON as struct (PySpark json_tuple). keys: e.g. ["a", "b"].
- kurtosis
- Kurtosis aggregation (PySpark kurtosis). Fisher definition, bias=true. Use in groupBy.agg().
- lag
- Lag: value from n rows before in partition. Use with .over(partition_by).
- last_day
- Last day of month for date column (PySpark last_day).
- last_value
- Last value in partition (PySpark last_value). Use with .over(partition_by).
- lcase
- Alias for lower. PySpark lcase.
- le
- lead
- Lead: value from n rows after in partition. Use with .over(partition_by).
- least
- Least of the given columns per row (PySpark least). Uses element-wise UDF.
- left
- Leftmost n characters (PySpark left).
- length
- String length in characters (PySpark length)
- levenshtein
- Levenshtein distance (PySpark levenshtein). Not implemented: requires element-wise UDF.
- like
- SQL LIKE pattern (% any, _ one char). PySpark like. When escape_char is Some(esc), esc + char treats that char as literal.
- lit_bool
- lit_f64
- lit_i32
- lit_i64
- lit_null
- lit_str
- ln
- Alias for log (natural log). PySpark ln.
- localtimestamp
- Alias for current_timestamp (PySpark localtimestamp).
- locate
- Find substring position 1-based, starting at pos (PySpark locate). 0 if not found.
- log
- Natural logarithm (PySpark log with one arg)
- log2
- Base-2 log (PySpark log2).
- log1p
- log(1 + x) (PySpark log1p).
- log10
- Base-10 log (PySpark log10).
- log_with_base
- Logarithm with given base (PySpark log(col, base)). base must be positive and not 1.
- lower
- Convert string column to lowercase (PySpark lower)
- lpad
- Left-pad string to length with pad char (PySpark lpad).
- lt
- ltrim
- Trim leading whitespace (PySpark ltrim)
- make_date
- Build date from year, month, day columns (PySpark make_date).
- make_dt_interval
- Day-time interval: days, hours, minutes, seconds (PySpark make_dt_interval). All optional; 0 for omitted.
- make_interval
- Create interval duration (PySpark make_interval). Optional args; 0 for omitted.
- make_timestamp
- make_timestamp(year, month, day, hour, min, sec, timezone?) - six columns to timestamp (PySpark make_timestamp). When timezone is Some(tz), components are interpreted as local time in that zone, then converted to UTC.
- make_timestamp_ntz
- Alias for make_timestamp (PySpark make_timestamp_ntz - no timezone).
- make_ym_interval
- Year-month interval (PySpark make_ym_interval). Polars has no native YM type; return months as Int32 (years*12 + months).
- map_concat
- Merge two map columns (PySpark map_concat). Last value wins for duplicate keys.
- map_contains_key
- True if map contains key (PySpark map_contains_key).
- map_entries
- Return map as list of structs {key, value} (PySpark map_entries).
- map_filter
- Filter map entries by predicate (PySpark map_filter).
- map_filter_value_gt
- Convenience: map_filter with value > threshold predicate.
- map_from_arrays
- Build map from two array columns keys and values (PySpark map_from_arrays). Implemented via UDF.
- map_from_entries
- Array of structs {key, value} to map (PySpark map_from_entries).
- map_keys
- Extract keys from a map column (PySpark map_keys). Map is List(Struct{key, value}).
- map_values
- Extract values from a map column (PySpark map_values).
- map_zip_with
- Merge two maps by key with merge function (PySpark map_zip_with).
- map_zip_with_coalesce
- Convenience: map_zip_with with coalesce(value1, value2) merge.
- mask
- Mask string: replace upper/lower/digit/other with given chars (PySpark mask).
- max
- max_by
- Value of value_col in the row where ord_col is maximum (PySpark max_by). Use in groupBy.agg().
- md5
- MD5 hash of string bytes, return hex string (PySpark md5).
- mean
- median
- Median aggregation. PySpark median.
- min
- min_by
- Value of value_col in the row where ord_col is minimum (PySpark min_by). Use in groupBy.agg().
- minute
- Extract minute from datetime column (PySpark minute).
- minutes
- Interval of n minutes (PySpark minutes).
- mode
- Mode aggregation - most frequent value. PySpark mode.
- monotonically_increasing_id
- Stub monotonically_increasing_id - constant 0 (PySpark monotonically_increasing_id). Note: differs from PySpark which is unique per-row; see PYSPARK_DIFFERENCES.md.
- month
- Extract month from datetime column (PySpark month)
- months
- Interval of n months (PySpark months). Approximated as 30*n days.
- months_between
- Months between end and start dates as fractional (PySpark months_between). When round_off is true, rounds to 8 decimal places (PySpark default).
- named_struct
- Create struct with explicit field names (PySpark named_struct). Pairs of (name, column). Panics if pairs is empty.
- nanvl
- Replace NaN with value. PySpark nanvl.
- ne
- negate
- Unary minus / negate (PySpark negate, negative).
- negative
- Alias for negate. PySpark negative.
- next_day
- Next date that is the given weekday (e.g. "Mon") (PySpark next_day).
- not_
- now
- Alias for current_timestamp (PySpark now).
- nth_value
- Nth value in partition by order (1-based n). Window is applied; do not call .over() again.
- ntile
- Ntile: bucket 1..n by rank within partition. Window is applied.
- nullif
- Return null if column equals value, else column. PySpark nullif.
- nvl
- Alias for coalesce(col, value). PySpark nvl / ifnull.
- nvl2
- Three-arg null replacement: if col1 is not null then col2 else col3. PySpark nvl2.
- octet_length
- Length of string in bytes (PySpark octet_length).
- or_
- overlay
- Replace substring at 1-based position (PySpark overlay). replace is literal.
- parse_type_name
- Parse PySpark-like type name to Polars DataType. Decimal(precision, scale) is mapped to Float64 for schema parity (Polars dtype-decimal not enabled).
- parse_url
- Parse URL and extract part: PROTOCOL, HOST, PATH, etc. (PySpark parse_url). When key is Some(k) and part is QUERY/QUERYSTRING, returns the value for that query parameter only.
- percent_rank
- Percent rank in partition: (rank - 1) / (count - 1). Window is applied.
- percentile_approx
- Approximate percentile (PySpark percentile_approx). Alias for approx_percentile.
- pi
- Constant pi = 3.14159… (PySpark pi).
- pmod
- Positive modulus (PySpark pmod).
- posexplode
- Explode list with position (PySpark posexplode). Returns (pos_column, value_column). pos is 1-based; implemented via list.eval(cum_count()).explode() and explode().
- posexplode_outer
- Posexplode with null preservation (PySpark posexplode_outer).
- position
- Position of substring in column (PySpark position). Same as instr; (substr, col) argument order.
- positive
- Unary plus - no-op, returns column as-is (PySpark positive).
- pow
- Power (PySpark pow)
- power
- Alias for pow. PySpark power.
- printf
- Alias for format_string (PySpark printf).
- quarter
- Extract quarter (1-4) from date/datetime (PySpark quarter).
- radians
- Convert degrees to radians (PySpark radians)
- raise_error
- Raise an error when evaluated (PySpark raise_error). Always fails with the given message.
- rand
- Random uniform [0, 1) per row, with optional seed (PySpark rand). When added via with_column, generates one distinct value per row (PySpark-like).
- randn
- Random standard normal per row, with optional seed (PySpark randn). When added via with_column, generates one distinct value per row (PySpark-like).
- rank
- Rank window function (ties same rank, gaps). Use with .over(partition_by).
- regexp
- Alias for rlike (PySpark regexp).
- regexp_count
- Count of non-overlapping regex matches (PySpark regexp_count).
- regexp_extract
- Extract first match of regex (PySpark regexp_extract). group_index 0 = full match.
- regexp_extract_all
- Extract all matches of regex (PySpark regexp_extract_all).
- regexp_instr
- 1-based position of first regex match (PySpark regexp_instr).
- regexp_like
- Check if string matches regex (PySpark regexp_like / rlike).
- regexp_replace
- Replace first match of regex (PySpark regexp_replace)
- regexp_substr
- First substring matching regex (PySpark regexp_substr). Null if no match.
- regr_avgx_expr
- Regression: average of x (PySpark regr_avgx).
- regr_avgy_expr
- Regression: average of y (PySpark regr_avgy).
- regr_count_expr
- Regression: count of (y, x) pairs where both non-null (PySpark regr_count).
- regr_intercept_expr
- Regression intercept: avg_y - slope*avg_x (PySpark regr_intercept).
- regr_r2_expr
- Regression R-squared (PySpark regr_r2).
- regr_slope_expr
- Regression slope: cov_samp(y,x)/var_samp(x) (PySpark regr_slope).
- regr_sxx_expr
- Regression: sum((x - avg_x)^2) (PySpark regr_sxx).
- regr_sxy_expr
- Regression: sum((x - avg_x)(y - avg_y)) (PySpark regr_sxy).
- regr_syy_expr
- Regression: sum((y - avg_y)^2) (PySpark regr_syy).
- repeat
- Repeat string n times (PySpark repeat).
- replace
- Replace all occurrences of search with replacement (literal). PySpark replace.
- reverse
- Reverse string (PySpark reverse).
- right
- Rightmost n characters (PySpark right).
- rint
- Round to nearest integer (PySpark rint).
- rlike
- Alias for regexp_like. PySpark rlike / regexp.
- round
- Round (PySpark round)
- row_number
- Row number window function (1, 2, 3 by order within partition). Use with .over(partition_by) after ranking by an order column.
- rpad
- Right-pad string to length with pad char (PySpark rpad).
- rtrim
- Trim trailing whitespace (PySpark rtrim)
- schema_from_json
- Parse a schema from a JSON string (e.g. from a host binding).
- schema_of_csv
- Schema of CSV string (PySpark schema_of_csv). Returns literal schema string; minimal stub.
- schema_of_json
- Schema of JSON string (PySpark schema_of_json). Returns literal schema string; minimal stub.
- sec
- Secant: 1/cos (PySpark sec).
- second
- Extract second from datetime column (PySpark second).
- sequence
- Generate array of numbers from start to stop (inclusive) with optional step (PySpark sequence). step defaults to 1.
- sha1
- SHA1 hash of string bytes, return hex string (PySpark sha1).
- sha2
- SHA2 hash; bit_length 256, 384, or 512 (PySpark sha2).
- shift_left
- Bitwise left shift (PySpark shiftLeft). col << n.
- shift_right
- Bitwise signed right shift (PySpark shiftRight). col >> n.
- shift_right_unsigned
- Bitwise unsigned right shift (PySpark shiftRightUnsigned). Logical shift for Long.
- shuffle
- Random permutation of list elements (PySpark shuffle).
- sign
- Alias for signum (PySpark sign).
- signum
- Sign of the number: -1, 0, or 1 (PySpark signum)
- sin
- Sine in radians (PySpark sin)
- sinh
- Hyperbolic sine (PySpark sinh).
- size
- Alias for array_size (PySpark size).
- skewness
- Skewness aggregation (PySpark skewness). bias=true. Use in groupBy.agg().
- soundex
- Soundex code (PySpark soundex). Not implemented: requires element-wise UDF.
- spark_partition_id
- Stub partition id - always 0 (PySpark spark_partition_id).
- split
- Split string by delimiter (PySpark split). Optional limit: at most that many parts (remainder in last).
- split_part
- Split by delimiter and return 1-based part (PySpark split_part).
- sqrt
- Square root (PySpark sqrt)
- stack
- Stack columns into struct (PySpark stack). Alias for struct_.
- startswith
- True if string starts with prefix (PySpark startswith).
- std
- Alias for stddev (PySpark std).
- stddev
- Standard deviation (sample) aggregation (PySpark stddev / stddev_samp)
- stddev_pop
- Population standard deviation (ddof=0). PySpark stddev_pop.
- stddev_samp
- Sample standard deviation (ddof=1). Alias for stddev. PySpark stddev_samp.
- str_to_map
- Parse string to map (PySpark str_to_map). Default delims: "," and ":".
- struct_
- Create struct from columns using column names as field names (PySpark struct). Panics if columns is empty.
- substr
- Alias for substring. PySpark substr.
- substring
- Substring with 1-based start (PySpark substring semantics)
- substring_index
- Substring before/after nth delimiter (PySpark substring_index).
- sum
- tan
- Tangent in radians (PySpark tan)
- tanh
- Hyperbolic tangent (PySpark tanh).
- timestamp_micros
- Convert microseconds since epoch to timestamp (PySpark timestamp_micros).
- timestamp_millis
- Convert milliseconds since epoch to timestamp (PySpark timestamp_millis).
- timestamp_seconds
- Convert seconds since epoch to timestamp (PySpark timestamp_seconds).
- timestampadd
- Add amount of unit to timestamp (PySpark timestampadd).
- timestampdiff
- Difference between timestamps in unit (PySpark timestampdiff).
- to_binary
- Convert to binary (PySpark to_binary). fmt: 'utf-8', 'hex'.
- to_char
- Cast to string, optionally with format for datetime (PySpark to_char, to_varchar). When format is Some, uses date_format for datetime columns (PySpark format → chrono strftime); otherwise cast to string. Returns Err if the cast to string fails (invalid type name or unsupported column type).
- to_csv
- Format struct as CSV string (PySpark to_csv). Minimal implementation.
- to_date
- Cast or parse to date (PySpark to_date). When format is None: cast date/datetime to date, parse string with default formats. When format is Some: parse string with given format.
- to_degrees
- Alias for degrees. PySpark toDegrees.
- to_engine_error
- Convert PolarsError to EngineError (for APIs that still return PolarsError).
- to_json
- Serialize struct column to JSON string (PySpark to_json).
- to_number
- Cast to numeric (PySpark to_number). Uses Double. Format parameter reserved for future use. Returns Err if the cast to double fails (invalid type name or unsupported column type).
- to_radians
- Alias for radians. PySpark toRadians.
- to_timestamp
- Cast to timestamp, or parse with format when provided (PySpark to_timestamp). When format is None, parses string columns with default format "%Y-%m-%d %H:%M:%S" (PySpark parity #273).
- to_timestamp_ltz
- Parse as timestamp in local timezone, return UTC (PySpark to_timestamp_ltz).
- to_timestamp_ntz
- Parse as timestamp without timezone (PySpark to_timestamp_ntz). Returns Datetime(_, None).
- to_unix_timestamp
- Alias for unix_timestamp.
- to_utc_timestamp
- Interpret timestamp as in tz, convert to UTC (PySpark to_utc_timestamp).
- to_varchar
- Alias for to_char (PySpark to_varchar).
- transform_keys
- Transform map keys by expr (PySpark transform_keys).
- transform_values
- Transform map values by expr (PySpark transform_values).
- translate
- Character-by-character translation (PySpark translate).
- trim
- Trim leading and trailing whitespace (PySpark trim)
- trunc
- Truncate date/datetime to unit (PySpark trunc).
- try_add
- Add that returns null on overflow (PySpark try_add). Uses checked arithmetic.
- try_aes_decrypt
- Try AES decrypt (PySpark try_aes_decrypt). Returns null on failure.
- try_avg
- Average aggregation; null on invalid (PySpark try_avg). Use in groupBy.agg(). Maps to mean; reserved for API.
- try_cast
- Cast column to the given type, returning null on invalid conversion (PySpark try_cast). String-to-boolean uses custom parsing ("true"/"false"/"1"/"0") since Polars does not support Utf8->Boolean. String-to-date accepts date and datetime strings; invalid strings become null.
- try_divide
- Division that returns null on divide-by-zero (PySpark try_divide).
- try_element_at
- Element at index, null if out of bounds (PySpark try_element_at). Same as element_at for lists.
- try_multiply
- Multiply that returns null on overflow (PySpark try_multiply).
- try_subtract
- Subtract that returns null on overflow (PySpark try_subtract).
- try_sum
- Sum aggregation; null on overflow (PySpark try_sum). Use in groupBy.agg(). Polars sum does not overflow; reserved for API.
- try_to_binary
- Try convert to binary; null on failure (PySpark try_to_binary).
- try_to_number
- Cast to numeric, null on invalid (PySpark try_to_number). Format parameter reserved for future use. Returns Err if the try_cast setup fails (invalid type name); column values that cannot be parsed become null.
- try_to_timestamp
- Cast to timestamp, null on invalid, or parse with format when provided (PySpark try_to_timestamp). When format is None, parses string columns with default format (null on invalid). #273
- typeof_
- Data type of column as string (PySpark typeof). Constant per column from schema.
- ucase
- Alias for upper. PySpark ucase.
- unbase64
- Base64 decode to string (PySpark unbase64). Invalid decode → null.
- unhex
- Convert hex string to binary/string (PySpark unhex).
- unix_date
- Date to days since 1970-01-01 (PySpark unix_date).
- unix_micros
- Timestamp to microseconds since epoch (PySpark unix_micros).
- unix_millis
- Timestamp to milliseconds since epoch (PySpark unix_millis).
- unix_seconds
- Timestamp to seconds since epoch (PySpark unix_seconds).
- unix_timestamp
- Parse string timestamp to seconds since epoch (PySpark unix_timestamp). format defaults to yyyy-MM-dd HH:mm:ss.
- unix_timestamp_now
- Current Unix timestamp in seconds (PySpark unix_timestamp with no args).
- upper
- Convert string column to uppercase (PySpark upper)
- url_decode
- Percent-decode URL-encoded string (PySpark url_decode).
- url_encode
- Percent-encode string for URL (PySpark url_encode).
- user
- User stub (PySpark user).
- var_pop
- Population variance (ddof=0). PySpark var_pop.
- var_samp
- Sample variance (ddof=1). Alias for variance. PySpark var_samp.
- variance
- Variance (sample) aggregation (PySpark variance / var_samp)
- version
- Session/library version string (PySpark version).
- weekday
- Weekday 0=Mon, 6=Sun (PySpark weekday).
- weekofyear
- Extract ISO week of year (1-53) (PySpark weekofyear).
- when
- Start a when(condition).then(…).otherwise(…) chain.
- when_then_otherwise_null
- Two-arg when(condition, value): returns value where condition is true, null otherwise (PySpark when(cond, val)).
- width_bucket
- Assign value to histogram bucket (PySpark width_bucket). Returns 0 if v < min_val, num_bucket+1 if v >= max_val.
- xxhash64
- XXH64 hash (PySpark xxhash64). Not implemented: requires element-wise UDF.
- year
- Extract year from datetime column (PySpark year)
- years
- Interval of n years (PySpark years). Approximated as 365*n days.
- zip_with
- Zip two arrays element-wise with merge function (PySpark zip_with).
- zip_with_coalesce
- Convenience: zip_with with coalesce(left, right) merge.
Type Aliases§
- CollectedRows
- Rows as list of maps: column name -> JSON value. Used by bindings and collect.
- Expr