Module udfs

Helpers for element-wise UDFs used by map() expressions (e.g. soundex, levenshtein, crc32, xxhash64, array_flatten, array_repeat). These run at plan execution time, when Polars invokes the closure.

Functions

apply_acos
Apply acos to a float column.
apply_acosh
Apply acosh to a float column.
apply_add_months
add_months(date_column, n) - add n months to each date.
apply_aes_decrypt
AES decrypt (PySpark aes_decrypt). Input hex(nonce||ciphertext). Returns null on failure.
apply_aes_encrypt
AES encrypt (PySpark aes_encrypt). Key as string; uses AES-128-GCM. Output hex.
apply_array_append
Append element to end of each list (PySpark array_append).
apply_array_distinct_first_order
Distinct elements in list preserving first-occurrence order (PySpark array_distinct parity).
apply_array_except
Elements in first array not in second (PySpark array_except).
apply_array_flatten
Flatten list-of-lists to a single list per row (PySpark flatten).
apply_array_insert
Insert element at 1-based position (PySpark array_insert). Negative pos = from end.
apply_array_intersect
Elements in both arrays (PySpark array_intersect). Distinct.
apply_array_prepend
Prepend element to start of each list (PySpark array_prepend).
apply_array_repeat
Repeat each element n times (PySpark array_repeat). Supports two modes: (1) a scalar column (string, int, etc.), producing an array of n copies per row; (2) a List column, repeating each element within the list.
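A minimal sketch of the two modes on plain Rust vectors (illustrative only; the real helper operates on Polars columns, and these function names are not part of the crate):

```rust
fn repeat_scalar<T: Clone>(value: &T, n: usize) -> Vec<T> {
    // scalar mode: one array holding n copies of the value
    vec![value.clone(); n]
}

fn repeat_within_list<T: Clone>(items: &[T], n: usize) -> Vec<T> {
    // list mode: each element repeated n times, order preserved
    items
        .iter()
        .flat_map(|v| std::iter::repeat(v.clone()).take(n))
        .collect()
}

fn main() {
    assert_eq!(repeat_scalar(&"ab", 3), vec!["ab", "ab", "ab"]);
    assert_eq!(repeat_within_list(&[1, 2], 2), vec![1, 1, 2, 2]);
}
```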
apply_array_union
Distinct elements from both arrays (PySpark array_union).
apply_arrays_overlap
True if two arrays have any element in common (PySpark arrays_overlap).
apply_arrays_zip
Zip two arrays into array of structs (PySpark arrays_zip).
apply_ascii
ASCII value of first character (PySpark ascii). Returns Int32.
apply_asin
Apply asin to a float column.
apply_asinh
Apply asinh to a float column.
apply_assert_true
Assert that all boolean values are true (PySpark assert_true). PySpark: returns null when input is true; throws when input is false or null. When err_msg is Some, it is used in the error message when the assertion fails.
apply_atan
Apply atan to a float column.
apply_atan2
Apply atan2(y, x) to two float columns.
apply_atanh
Apply atanh to a float column.
apply_base64
Base64 encode string bytes (PySpark base64). Input string UTF-8, output base64 string.
apply_bin
Apply bin: integer to binary string (PySpark bin).
apply_bit_and
Apply bitwise AND for two integer columns (PySpark bit_and).
apply_bit_count
Apply bit_count: count set bits in integer (PySpark bit_count).
apply_bit_or
Apply bitwise OR for two integer columns (PySpark bit_or).
apply_bit_xor
Apply bitwise XOR for two integer columns (PySpark bit_xor).
apply_bitmap_construct_agg
Build one bitmap from a list of bit positions (0..32767). Used after implode for bitmap_construct_agg.
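A sketch of the bitmap layout this implies, assuming LSB-first bit order within each byte (the actual byte/bit order is an implementation detail of the crate):

```rust
fn bitmap_from_positions(positions: &[u16]) -> Vec<u8> {
    // 32768 bits = 4096 bytes, all clear to start
    let mut bitmap = vec![0u8; 4096];
    for &pos in positions {
        debug_assert!(pos < 32768);
        bitmap[(pos / 8) as usize] |= 1 << (pos % 8);
    }
    bitmap
}

fn bitmap_count_bits(bitmap: &[u8]) -> u32 {
    // counterpart of apply_bitmap_count: total set bits
    bitmap.iter().map(|b| b.count_ones()).sum()
}

fn main() {
    let bm = bitmap_from_positions(&[1, 2, 3, 1]);
    assert_eq!(bitmap_count_bits(&bm), 3); // duplicates set the same bit
}
```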
apply_bitmap_count
Count set bits in a bitmap (binary column). PySpark bitmap_count.
apply_bitmap_or_agg
Bitwise OR of a list of bitmaps (binary). Used after implode for bitmap_or_agg.
apply_bround
Apply bround (banker’s rounding) to a float column.
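Banker's rounding at d decimal places can be sketched with the usual scale-and-round trick (f64::round_ties_even needs Rust 1.77+; the scaling step can drift for very large magnitudes):

```rust
fn bround(x: f64, d: i32) -> f64 {
    let scale = 10f64.powi(d);
    (x * scale).round_ties_even() / scale
}

fn main() {
    assert_eq!(bround(2.5, 0), 2.0); // tie goes to the even neighbor
    assert_eq!(bround(3.5, 0), 4.0);
    assert_eq!(bround(0.125, 2), 0.12);
}
```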
apply_cbrt
Apply cbrt (cube root) to a float column.
apply_char
Int column to single-character string (PySpark char / chr). Valid codepoint only.
apply_conv
Apply conv (base conversion). String: parse from from_base, format in to_base. Int: format value in to_base.
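A hedged sketch of the string case (parse in from_base, format in to_base; digits above 9 come out uppercase, as in Spark's conv):

```rust
fn conv(s: &str, from_base: u32, to_base: u32) -> Option<String> {
    // bases must be in 2..=36 for from_str_radix; negative values are
    // reinterpreted as unsigned, roughly matching Spark's 64-bit behavior
    let n = i64::from_str_radix(s.trim(), from_base).ok()?;
    Some(to_base_string(n as u64, to_base))
}

fn to_base_string(mut n: u64, base: u32) -> String {
    if n == 0 {
        return "0".to_string();
    }
    let digits = b"0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    let mut out = Vec::new();
    while n > 0 {
        out.push(digits[(n % base as u64) as usize]);
        n /= base as u64;
    }
    out.reverse();
    String::from_utf8(out).unwrap()
}

fn main() {
    assert_eq!(conv("ff", 16, 10).as_deref(), Some("255"));
    assert_eq!(conv("255", 10, 16).as_deref(), Some("FF"));
}
```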
apply_convert_timezone
convert_timezone(source_tz, target_tz, ts_col) - convert between timezones. Same instant.
apply_cos
Apply cos (radians) to a float column.
apply_cosh
Hyperbolic and inverse hyperbolic / extra math: apply cosh to a float column.
apply_cot
Apply cot (1/tan) to a float column.
apply_crc32
Apply CRC32 to string bytes (PySpark crc32).
apply_csc
Apply csc (1/sin) to a float column.
apply_date_from_unix_date
date_from_unix_date(column) - days since epoch to date.
apply_dayname
dayname(date_col) - weekday name “Mon”, “Tue”, … (PySpark dayname).
apply_decode
Decode binary (hex string) to string (PySpark decode). Charset: UTF-8.
apply_degrees
Apply degrees (radians -> degrees) to a float column.
apply_encode
Encode string to binary (PySpark encode). Charset: UTF-8, hex. Returns hex string representation of bytes.
apply_expm1
Apply expm1 (e^x - 1) to a float column.
apply_factorial
factorial(column) - element-wise factorial.
apply_find_in_set
Find 1-based index of str in comma-delimited set (PySpark find_in_set). Returns 0 if not found or if str contains a comma. map_many: columns[0]=str, columns[1]=set.
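The per-value rule is small enough to show directly (standalone sketch; the helper applies it element-wise):

```rust
fn find_in_set(needle: &str, set: &str) -> i32 {
    if needle.contains(',') {
        return 0; // a needle containing a comma can never match one element
    }
    set.split(',')
        .position(|item| item == needle)
        .map(|i| i as i32 + 1) // 1-based
        .unwrap_or(0)
}

fn main() {
    assert_eq!(find_in_set("ab", "abc,b,ab,c,def"), 3);
    assert_eq!(find_in_set("x", "abc,b,ab"), 0);
}
```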
apply_format_number
Format numeric column as string with fixed decimal places (PySpark format_number).
apply_format_string
Format columns with printf-style format string (PySpark format_string / printf). Supports %s, %d, %i, %f, %g, %%. Null in any column yields null result.
apply_from_csv
from_csv(str_col, schema) - parse CSV string to struct (PySpark from_csv). Minimal: split by comma, up to 32 columns.
apply_from_unixtime
from_unixtime(column, format?) - seconds since epoch to formatted string.
apply_from_utc_timestamp
from_utc_timestamp(ts_col, tz) - interpret ts as UTC, convert to tz. Timestamps stored as UTC micros; instant unchanged.
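A sketch of the per-value conversion using chrono + chrono-tz (assumed dependencies; the crate's internal representation may differ):

```rust
use chrono::DateTime;
use chrono_tz::Tz;

fn from_utc_timestamp(utc_micros: i64, tz: &str) -> Option<chrono::NaiveDateTime> {
    let tz: Tz = tz.parse().ok()?;
    let instant = DateTime::from_timestamp_micros(utc_micros)?;
    // same instant, re-expressed as wall-clock time in the target zone
    Some(instant.with_timezone(&tz).naive_local())
}

fn main() {
    let ndt = from_utc_timestamp(0, "Asia/Tokyo").unwrap();
    assert_eq!(ndt.to_string(), "1970-01-01 09:00:00"); // epoch is 09:00 JST
}
```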
apply_get
Get value for key from map, or null (PySpark get).
apply_getbit
Apply getbit: get bit at 0-based position (PySpark getbit).
apply_greatest2
Element-wise max of two columns (for greatest). Supports Float64, Int64, String.
apply_hash_one
Hash one column (PySpark hash); uses 32-bit Murmur3 for PySpark parity.
apply_hash_struct
Hash a struct (multiple columns combined); PySpark hash (Murmur3).
apply_hex
Apply hex: integer or string to hex string (PySpark hex).
apply_hour
hour(column) - extract hour (0-23). Accepts string timestamp column (#403).
apply_json_array_length
json_array_length(json_str, path) - length of JSON array at path (PySpark json_array_length).
apply_json_object_keys
json_object_keys(json_str) - return list of keys of JSON object (PySpark json_object_keys).
apply_json_tuple
json_tuple(json_str, key1, key2, …) - extract keys from JSON; returns struct with one field per key (PySpark json_tuple).
apply_least2
Element-wise min of two columns (for least).
apply_levenshtein
Levenshtein distance between two string columns (element-wise).
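The measure itself is the textbook two-row dynamic program, applied per pair of values:

```rust
fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, ca) in a.iter().enumerate() {
        let mut cur = vec![i + 1];
        for (j, cb) in b.iter().enumerate() {
            let cost = if ca == cb { 0 } else { 1 };
            // substitution, insertion, deletion
            cur.push((prev[j] + cost).min(prev[j + 1] + 1).min(cur[j] + 1));
        }
        prev = cur;
    }
    prev[b.len()]
}

fn main() {
    assert_eq!(levenshtein("kitten", "sitting"), 3);
}
```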
apply_log10
Apply log10 to a float column.
apply_log1p
Apply log1p (ln(1 + x)) to a float column.
apply_log2
Apply log2 to a float column.
apply_make_date
make_date(year, month, day) - three columns to date.
apply_make_timestamp
make_timestamp(year, month, day, hour, min, sec, timezone?) - six columns to timestamp (micros). When timezone is Some(tz_str), components are interpreted as local time in that zone, then converted to UTC.
apply_map_concat
Merge two map columns (PySpark map_concat). Last value wins for duplicate keys.
apply_map_contains_key
True if map contains key (PySpark map_contains_key).
apply_map_from_arrays
Build map (list of structs {key, value}) from two list columns. PySpark map_from_arrays.
apply_map_zip_to_struct
Merge two maps into List(Struct{key, value1, value2}) for map_zip_with. Union of keys.
apply_md5
MD5 hash of string bytes, return hex string (PySpark md5).
apply_minute
minute(column) - extract minute. Accepts string timestamp column (#403).
apply_months_between
months_between(end, start, round_off) - returns fractional number of months. When round_off is true, rounds to 8 decimal places (PySpark default).
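A simplified sketch of the month-fraction rule (assumes chrono; ignores time-of-day and the both-last-day-of-month case that Spark also treats as a whole month count):

```rust
use chrono::{Datelike, NaiveDate};

fn months_between(end: NaiveDate, start: NaiveDate) -> f64 {
    let months =
        (end.year() - start.year()) * 12 + (end.month() as i32 - start.month() as i32);
    if end.day() == start.day() {
        months as f64 // same day of month: whole months
    } else {
        months as f64 + (end.day() as f64 - start.day() as f64) / 31.0
    }
}

fn main() {
    let end = NaiveDate::from_ymd_opt(2024, 3, 15).unwrap();
    let start = NaiveDate::from_ymd_opt(2024, 1, 15).unwrap();
    assert_eq!(months_between(end, start), 2.0);
}
```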
apply_next_day
next_day(date_col, day_of_week) - first date later than the input that falls on the given weekday (PySpark next_day).
apply_parse_url
parse_url(url_str, part, key) - extract URL component (PySpark parse_url). When part is QUERY/QUERYSTRING and key is Some(k), returns the value for that query parameter only.
apply_pmod
pmod(dividend, divisor) - positive modulus.
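The usual formula, shown per value (a sketch; division by zero would surface as null upstream):

```rust
fn pmod(a: i64, b: i64) -> Option<i64> {
    if b == 0 {
        return None;
    }
    // for a positive divisor the result lands in [0, b)
    Some(((a % b) + b) % b)
}

fn main() {
    assert_eq!(pmod(-7, 3), Some(2)); // plain % would give -1
    assert_eq!(pmod(7, 3), Some(1));
}
```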
apply_pyspark_add
PySpark-style addition with string/number coercion for Python Column operators.
apply_pyspark_divide
PySpark-style true division with string/number coercion for Python Column operators. Division by zero yields null (Spark/PySpark parity; issue #218).
apply_pyspark_mod
PySpark-style modulo with string/number coercion for Python Column operators.
apply_pyspark_multiply
PySpark-style multiplication with string/number coercion for Python Column operators.
apply_pyspark_subtract
PySpark-style subtraction with string/number coercion for Python Column operators.
apply_radians
Apply radians (degrees -> radians) to a float column.
apply_rand_with_seed
Apply rand: uniform [0, 1) per row, with optional seed (PySpark rand).
apply_randn_with_seed
Apply randn: standard normal per row, with optional seed (PySpark randn).
apply_regexp_extract_lookaround
regexp_extract using fancy-regex when the pattern has lookahead/lookbehind (PySpark parity). Polars str().extract() uses the regex crate, which does not support lookaround.
apply_regexp_instr
Regexp instr: 1-based position of first regex match (PySpark regexp_instr). group_idx: 0 = full match, 1+ = capture group. Returns null if no match.
apply_rint
Apply rint: round to the nearest integer, ties to even, returned as double (PySpark rint).
apply_round
Apply round to given decimal places. Supports numeric and string columns (PySpark parity: string columns containing numeric values are implicitly cast to double then rounded).
apply_sec
Apply sec (1/cos) to a float column.
apply_second
second(column) - extract second. Accepts string timestamp column (#403).
apply_sequence
Build array [start, start+step, …] up to but not past stop (PySpark sequence). Input column is a struct with fields “0”=start, “1”=stop, “2”=step (step optional, default 1).
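The walk itself is straightforward; a per-row sketch for integer inputs:

```rust
fn sequence(start: i64, stop: i64, step: i64) -> Option<Vec<i64>> {
    if step == 0 {
        return None; // a zero step can never reach stop
    }
    let mut out = Vec::new();
    let mut v = start;
    while (step > 0 && v <= stop) || (step < 0 && v >= stop) {
        out.push(v);
        v += step;
    }
    Some(out)
}

fn main() {
    assert_eq!(sequence(1, 7, 2), Some(vec![1, 3, 5, 7]));
    assert_eq!(sequence(5, 1, -2), Some(vec![5, 3, 1]));
}
```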
apply_sha1
SHA1 hash of string bytes, return hex string (PySpark sha1).
apply_sha2
SHA2 hash of string bytes, return hex string (PySpark sha2). bit_length 256 or 384 or 512.
apply_shift_right_unsigned
shiftRightUnsigned - logical right shift for i64 (PySpark shiftRightUnsigned).
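Logical right shift falls out of a round-trip through u64; masking the shift amount mirrors Java's behavior, which is an assumption about the helper:

```rust
fn shift_right_unsigned(x: i64, n: u32) -> i64 {
    // reinterpret as unsigned so the shift fills with zero bits, not the sign
    ((x as u64) >> (n % 64)) as i64
}

fn main() {
    assert_eq!(shift_right_unsigned(8, 1), 4);
    assert_eq!(shift_right_unsigned(-8, 1), i64::MAX - 3); // sign bit cleared
}
```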
apply_shuffle
Random permutation of list elements (PySpark shuffle). Uses rand::seq::SliceRandom.
apply_signum
Apply signum (-1, 0, or 1) to a numeric column.
apply_sin
Apply sin (radians) to a float column.
apply_sinh
Apply sinh to a float column.
apply_soundex
Apply soundex to a string column; returns a new Column (Series).
apply_split_part_regex
Split string by regex and return 1-based part (for split_part with regex delimiter).
apply_split_with_limit
Split string by delimiter with at most limit parts; remainder in last part (PySpark split with limit). Returns List(String). When limit <= 0, splits without limit.
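std's splitn already has the remainder-in-last-part behavior, so a per-value sketch is short:

```rust
fn split_with_limit(s: &str, delim: &str, limit: i32) -> Vec<String> {
    if limit <= 0 {
        // no limit: split on every occurrence
        s.split(delim).map(str::to_string).collect()
    } else {
        // at most `limit` parts; the remainder stays in the last part
        s.splitn(limit as usize, delim).map(str::to_string).collect()
    }
}

fn main() {
    assert_eq!(split_with_limit("a,b,c,d", ",", 2), vec!["a", "b,c,d"]);
    assert_eq!(split_with_limit("a,b,c", ",", 0), vec!["a", "b", "c"]);
}
```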
apply_str_to_map
Parse string to map: “k1:v1,k2:v2” -> List(Struct{key, value}) (PySpark str_to_map).
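A per-value sketch with the default `,` and `:` delimiters, using key/value pairs in place of the struct layout (treating a pair without `:` as a null value is an assumption about edge-case handling):

```rust
fn str_to_map(s: &str) -> Vec<(String, Option<String>)> {
    s.split(',')
        .map(|pair| match pair.split_once(':') {
            Some((k, v)) => (k.to_string(), Some(v.to_string())),
            None => (pair.to_string(), None), // no ':' -> null value
        })
        .collect()
}

fn main() {
    assert_eq!(
        str_to_map("a:1,b:2"),
        vec![("a".into(), Some("1".into())), ("b".into(), Some("2".into()))]
    );
}
```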
apply_string_to_boolean
Apply string-to-boolean cast. Handles string columns; passes through boolean; numeric types (0/0.0 -> false, non-zero -> true for PySpark parity #399); null for others (try_cast) or error (cast).
apply_string_to_date
Apply string-to-date cast. Handles string columns (accepts date and datetime strings, Spark parity); passes through date; casts datetime to date; others error (cast) or null (try_cast).
apply_string_to_date_format
Apply string-to-date with optional format (PySpark to_date(col, format)). When format is None uses default parsing; when Some parses with given format.
apply_string_to_double
Apply string-to-double cast. Handles string columns: empty/invalid -> null (Spark parity); passes through numeric columns; others error (strict) or null.
apply_string_to_int
Apply string-to-int cast. Handles string columns: empty/invalid -> null (Spark parity); passes through int columns; others error (strict) or null.
apply_struct_with_field
Replace or add a struct field (PySpark withField). Used because Polars 0.53+ no longer accepts “*” in with_fields.
apply_tan
Apply tan (radians) to a float column.
apply_tanh
Apply tanh to a float column.
apply_to_binary
to_binary(expr, fmt): PySpark to_binary. fmt ‘utf-8’ => hex(utf8 bytes), ‘hex’ => validate and return hex. Returns hex string.
apply_to_csv
to_csv(struct_col) - format struct as CSV string (PySpark to_csv). Minimal: uses struct cast to string.
apply_to_timestamp_format
to_timestamp(column, format?) / try_to_timestamp(column, format?) - string to timestamp. When format is Some, parse with that format (PySpark-style mapped to chrono); when None, use default. Strips whitespace from string values before parsing (PySpark parity #273). strict: true for to_timestamp (error on invalid), false for try_to_timestamp (null on invalid).
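The strict/try split can be sketched with chrono (assumed dependency; `parse_ts` is illustrative and the PySpark-to-chrono pattern mapping is elided):

```rust
use chrono::NaiveDateTime;

fn parse_ts(s: &str, fmt: &str, strict: bool) -> Result<Option<NaiveDateTime>, String> {
    // whitespace is stripped before parsing, per the #273 parity note
    match NaiveDateTime::parse_from_str(s.trim(), fmt) {
        Ok(ts) => Ok(Some(ts)),
        Err(e) if strict => Err(format!("to_timestamp: {e}: {s:?}")), // to_timestamp
        Err(_) => Ok(None), // try_to_timestamp: null on invalid input
    }
}

fn main() {
    assert!(parse_ts(" 2024-05-01 12:00:00 ", "%Y-%m-%d %H:%M:%S", true).is_ok());
    assert!(parse_ts("oops", "%Y-%m-%d %H:%M:%S", false).unwrap().is_none());
}
```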
apply_to_timestamp_ltz_format
Parse string as timestamp in local timezone, return UTC micros (PySpark to_timestamp_ltz).
apply_to_timestamp_ntz_format
Parse string as timestamp without timezone (PySpark to_timestamp_ntz). Returns Datetime(_, None).
apply_to_utc_timestamp
to_utc_timestamp(ts_col, tz) - interpret ts as in tz, convert to UTC. For UTC-stored timestamps, instant unchanged.
apply_try_add
try_add: returns null on overflow.
apply_try_aes_decrypt
try_aes_decrypt: same as aes_decrypt, returns null on failure (PySpark try_aes_decrypt).
apply_try_multiply
try_multiply: returns null on overflow.
apply_try_subtract
try_subtract: returns null on overflow.
apply_try_to_binary
try_to_binary: like to_binary but returns null on failure.
apply_typeof
typeof: return dtype as string (PySpark typeof).
apply_unbase64
Base64 decode to string (PySpark unbase64). Output UTF-8 string; invalid decode -> null.
apply_unhex
Apply unhex: hex string to binary/string (PySpark unhex).
apply_unix_date
unix_date(column) - date to days since 1970-01-01.
apply_unix_timestamp
unix_timestamp(column, format?) - parse string to seconds since epoch.
apply_url_decode
url_decode(column) - percent-decode URL-encoded string (PySpark url_decode).
apply_url_encode
url_encode(column) - percent-encode string for URL (PySpark url_encode).
apply_weekday
weekday(date_col) - 0=Mon, 6=Sun (PySpark weekday).
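chrono's Datelike already exposes this numbering, so the mapping is direct (chrono assumed):

```rust
use chrono::{Datelike, NaiveDate, Weekday};

fn weekday(d: NaiveDate) -> u32 {
    // num_days_from_monday matches the 0=Mon..6=Sun convention exactly
    d.weekday().num_days_from_monday()
}

fn main() {
    let d = NaiveDate::from_ymd_opt(2024, 1, 1).unwrap();
    assert_eq!(d.weekday(), Weekday::Mon);
    assert_eq!(weekday(d), 0);
}
```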
apply_xxhash64
Apply XXH64 hash (PySpark xxhash64).
apply_zip_arrays_to_struct
Zip two array columns into List(Struct{left, right}) for zip_with. Shorter padded with null.
series_rand_n
Build a Series of n uniform [0, 1) values with optional seed (for with_column PySpark-like rand).
series_randn_n
Build a Series of n standard normal values with optional seed (for with_column PySpark-like randn).