Module udfs

Helpers for element-wise UDFs used by map() expressions (e.g. soundex, levenshtein, crc32, xxhash64, array_flatten, array_repeat). These run at plan execution time, when Polars invokes the closure.

Functions

apply_acos
Apply acos to a float column.
apply_acosh
Apply acosh to a float column.
apply_add_months
add_months(date_column, n) - add n months to each date.
apply_aes_decrypt
AES decrypt (PySpark aes_decrypt). Input hex(nonce||ciphertext). Returns null on failure.
apply_aes_encrypt
AES encrypt (PySpark aes_encrypt). Key as string; uses AES-128-GCM. Output hex.
apply_array_append
Append element to end of each list (PySpark array_append).
apply_array_distinct_first_order
Distinct elements in list preserving first-occurrence order (PySpark array_distinct parity).
apply_array_except
Elements in first array not in second (PySpark array_except).
apply_array_flatten
Flatten list-of-lists to a single list per row (PySpark flatten).
apply_array_insert
Insert element at 1-based position (PySpark array_insert). Negative pos = from end.
apply_array_intersect
Elements in both arrays (PySpark array_intersect). Distinct.
apply_array_prepend
Prepend element to start of each list (PySpark array_prepend).
apply_array_repeat
Repeat each element n times (PySpark array_repeat). Supports two modes: (1) a scalar column (string, int, etc.), producing an array of n copies per row; (2) a List column, repeating each element within the list.
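A minimal sketch of the two modes on plain Rust vectors (illustrative only; the real helper operates on Polars columns, and these function names are not part of the crate):

```rust
fn repeat_scalar<T: Clone>(value: &T, n: usize) -> Vec<T> {
    // scalar mode: one array holding n copies of the value
    vec![value.clone(); n]
}

fn repeat_within_list<T: Clone>(items: &[T], n: usize) -> Vec<T> {
    // list mode: each element repeated n times, order preserved
    items
        .iter()
        .flat_map(|v| std::iter::repeat(v.clone()).take(n))
        .collect()
}

fn main() {
    assert_eq!(repeat_scalar(&"ab", 3), vec!["ab", "ab", "ab"]);
    assert_eq!(repeat_within_list(&[1, 2], 2), vec![1, 1, 2, 2]);
}
```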
apply_array_union
Distinct elements from both arrays (PySpark array_union).
apply_arrays_overlap
True if two arrays have any element in common (PySpark arrays_overlap).
apply_arrays_zip
Zip two arrays into array of structs (PySpark arrays_zip).
apply_ascii
ASCII value of first character (PySpark ascii). Returns Int32.
apply_asin
Apply asin to a float column.
apply_asinh
Apply asinh to a float column.
apply_assert_true
Assert that all boolean values are true (PySpark assert_true). PySpark: returns null when input is true; throws when input is false or null. When err_msg is Some, it is used in the error message when the assertion fails.
apply_atan
Apply atan to a float column.
apply_atan2
Apply atan2(y, x) to two float columns.
apply_atanh
Apply atanh to a float column.
apply_base64
Base64 encode string bytes (PySpark base64). Input string UTF-8, output base64 string.
apply_bin
Apply bin: integer to binary string (PySpark bin).
apply_bit_and
Apply bitwise AND for two integer columns (PySpark bit_and).
apply_bit_count
Apply bit_count: count set bits in integer (PySpark bit_count).
apply_bit_or
Apply bitwise OR for two integer columns (PySpark bit_or).
apply_bit_xor
Apply bitwise XOR for two integer columns (PySpark bit_xor).
apply_bitmap_construct_agg
Build one bitmap from a list of bit positions (0..32767). Used after implode for bitmap_construct_agg.
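A sketch of the bitmap layout this implies, assuming LSB-first bit order within each byte (the actual byte/bit order is an implementation detail of the crate):

```rust
fn bitmap_from_positions(positions: &[u16]) -> Vec<u8> {
    // 32768 bits = 4096 bytes, all clear to start
    let mut bitmap = vec![0u8; 4096];
    for &pos in positions {
        debug_assert!(pos < 32768);
        bitmap[(pos / 8) as usize] |= 1 << (pos % 8);
    }
    bitmap
}

fn bitmap_count_bits(bitmap: &[u8]) -> u32 {
    // counterpart of apply_bitmap_count: total set bits
    bitmap.iter().map(|b| b.count_ones()).sum()
}

fn main() {
    let bm = bitmap_from_positions(&[1, 2, 3, 1]);
    assert_eq!(bitmap_count_bits(&bm), 3); // duplicates set the same bit
}
```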
apply_bitmap_count
Count set bits in a bitmap (binary column). PySpark bitmap_count.
apply_bitmap_or_agg
Bitwise OR of a list of bitmaps (binary). Used after implode for bitmap_or_agg.
apply_bround
Apply bround (banker’s rounding) to a float column.
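Banker's rounding at d decimal places can be sketched with the usual scale-and-round trick (f64::round_ties_even needs Rust 1.77+; the scaling step can drift for very large magnitudes):

```rust
fn bround(x: f64, d: i32) -> f64 {
    let scale = 10f64.powi(d);
    (x * scale).round_ties_even() / scale
}

fn main() {
    assert_eq!(bround(2.5, 0), 2.0); // tie goes to the even neighbor
    assert_eq!(bround(3.5, 0), 4.0);
    assert_eq!(bround(0.125, 2), 0.12);
}
```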
apply_cbrt
Apply cbrt (cube root) to a float column.
apply_char
Int column to single-character string (PySpark char / chr). Valid codepoint only.
apply_conv
Apply conv (base conversion). String: parse from from_base, format in to_base. Int: format value in to_base.
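A hedged sketch of the string case (parse in from_base, format in to_base; digits above 9 come out uppercase, as in Spark's conv):

```rust
fn conv(s: &str, from_base: u32, to_base: u32) -> Option<String> {
    // bases must be in 2..=36 for from_str_radix; negative values are
    // reinterpreted as unsigned, roughly matching Spark's 64-bit behavior
    let n = i64::from_str_radix(s.trim(), from_base).ok()?;
    Some(to_base_string(n as u64, to_base))
}

fn to_base_string(mut n: u64, base: u32) -> String {
    if n == 0 {
        return "0".to_string();
    }
    let digits = b"0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    let mut out = Vec::new();
    while n > 0 {
        out.push(digits[(n % base as u64) as usize]);
        n /= base as u64;
    }
    out.reverse();
    String::from_utf8(out).unwrap()
}

fn main() {
    assert_eq!(conv("ff", 16, 10).as_deref(), Some("255"));
    assert_eq!(conv("255", 10, 16).as_deref(), Some("FF"));
}
```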
apply_convert_timezone
convert_timezone(source_tz, target_tz, ts_col) - convert between timezones. Same instant.
apply_cos
Apply cos (radians) to a float column.
apply_cosh
Hyperbolic and inverse hyperbolic / extra math: apply cosh to a float column.
apply_cot
Apply cot (1/tan) to a float column.
apply_crc32
Apply CRC32 to string bytes (PySpark crc32).
apply_csc
Apply csc (1/sin) to a float column.
apply_date_from_unix_date
date_from_unix_date(column) - days since epoch to date.
apply_dayname
dayname(date_col) - weekday name “Mon”, “Tue”, … (PySpark dayname).
apply_decode
Decode binary (hex string) to string (PySpark decode). Charset: UTF-8.
apply_degrees
Apply degrees (radians -> degrees) to a float column.
apply_encode
Encode string to binary (PySpark encode). Charset: UTF-8, hex. Returns hex string representation of bytes.
apply_expm1
Apply expm1 (e^x - 1) to a float column.
apply_factorial
factorial(column) - element-wise factorial.
apply_find_in_set
Find 1-based index of str in comma-delimited set (PySpark find_in_set). Returns 0 if not found or if str contains a comma. map_many: columns[0]=str, columns[1]=set.
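The per-value rule is small enough to show directly (standalone sketch; the helper applies it element-wise):

```rust
fn find_in_set(needle: &str, set: &str) -> i32 {
    if needle.contains(',') {
        return 0; // a needle containing a comma can never match one element
    }
    set.split(',')
        .position(|item| item == needle)
        .map(|i| i as i32 + 1) // 1-based
        .unwrap_or(0)
}

fn main() {
    assert_eq!(find_in_set("ab", "abc,b,ab,c,def"), 3);
    assert_eq!(find_in_set("x", "abc,b,ab"), 0);
}
```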
apply_format_number
Format numeric column as string with fixed decimal places (PySpark format_number).
apply_format_string
Format columns with printf-style format string (PySpark format_string / printf). Supports %s, %d, %i, %f, %g, %%. Null in any column yields null result.
apply_from_csv
from_csv(str_col, schema) - parse CSV string to struct (PySpark from_csv). Minimal: split by comma, up to 32 columns.
apply_from_unixtime
from_unixtime(column, format?) - seconds since epoch to formatted string.
apply_from_utc_timestamp
from_utc_timestamp(ts_col, tz) - interpret ts as UTC, convert to tz. Timestamps stored as UTC micros; instant unchanged.
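A sketch of the per-value conversion using chrono + chrono-tz (assumed dependencies; the crate's internal representation may differ):

```rust
use chrono::DateTime;
use chrono_tz::Tz;

fn from_utc_timestamp(utc_micros: i64, tz: &str) -> Option<chrono::NaiveDateTime> {
    let tz: Tz = tz.parse().ok()?;
    let instant = DateTime::from_timestamp_micros(utc_micros)?;
    // same instant, re-expressed as wall-clock time in the target zone
    Some(instant.with_timezone(&tz).naive_local())
}

fn main() {
    let ndt = from_utc_timestamp(0, "Asia/Tokyo").unwrap();
    assert_eq!(ndt.to_string(), "1970-01-01 09:00:00"); // epoch is 09:00 JST
}
```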
apply_get
Get value for key from map, or null (PySpark get).
apply_getbit
Apply getbit: get bit at 0-based position (PySpark getbit).
apply_greatest2
Element-wise max of two columns (for greatest). Supports Float64, Int64, String.
apply_hash_one
Hash one column (PySpark hash); uses 32-bit Murmur3 for PySpark parity.
apply_hash_struct
Hash a struct (multiple columns combined); PySpark hash (Murmur3).
apply_hex
Apply hex: integer or string to hex string (PySpark hex).
apply_hour
hour(column) - extract hour (0-23). Accepts string timestamp column (#403).
apply_json_array_length
json_array_length(json_str, path) - length of JSON array at path (PySpark json_array_length).
apply_json_object_keys
json_object_keys(json_str) - return list of keys of JSON object (PySpark json_object_keys).
apply_json_tuple
json_tuple(json_str, key1, key2, …) - extract keys from JSON; returns struct with one field per key (PySpark json_tuple).
apply_least2
Element-wise min of two columns (for least).
apply_levenshtein
Levenshtein distance between two string columns (element-wise).
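The measure itself is the textbook two-row dynamic program, applied per pair of values:

```rust
fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, ca) in a.iter().enumerate() {
        let mut cur = vec![i + 1];
        for (j, cb) in b.iter().enumerate() {
            let cost = if ca == cb { 0 } else { 1 };
            // substitution, insertion, deletion
            cur.push((prev[j] + cost).min(prev[j + 1] + 1).min(cur[j] + 1));
        }
        prev = cur;
    }
    prev[b.len()]
}

fn main() {
    assert_eq!(levenshtein("kitten", "sitting"), 3);
}
```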
apply_log10
Apply log10 to a float column.
apply_log1p
Apply log1p (ln(1 + x)) to a float column.
apply_log2
Apply log2 to a float column.
apply_make_date
make_date(year, month, day) - three columns to date.
apply_make_timestamp
make_timestamp(year, month, day, hour, min, sec, timezone?) - six columns to timestamp (micros). When timezone is Some(tz_str), components are interpreted as local time in that zone, then converted to UTC.
apply_map_concat
Merge two map columns (PySpark map_concat). Last value wins for duplicate keys.
apply_map_contains_key
True if map contains key (PySpark map_contains_key).
apply_map_from_arrays
Build map (list of structs {key, value}) from two list columns. PySpark map_from_arrays.
apply_map_zip_to_struct
Merge two maps into List(Struct{key, value1, value2}) for map_zip_with. Union of keys.
apply_md5
MD5 hash of string bytes, return hex string (PySpark md5).
apply_minute
minute(column) - extract minute. Accepts string timestamp column (#403).
apply_months_between
months_between(end, start, round_off) - returns fractional number of months. When round_off is true, rounds to 8 decimal places (PySpark default).
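A simplified sketch of the month-fraction rule (assumes chrono; ignores time-of-day and the both-last-day-of-month case that Spark also treats as a whole month count):

```rust
use chrono::{Datelike, NaiveDate};

fn months_between(end: NaiveDate, start: NaiveDate) -> f64 {
    let months =
        (end.year() - start.year()) * 12 + (end.month() as i32 - start.month() as i32);
    if end.day() == start.day() {
        months as f64 // same day of month: whole months
    } else {
        months as f64 + (end.day() as f64 - start.day() as f64) / 31.0
    }
}

fn main() {
    let end = NaiveDate::from_ymd_opt(2024, 3, 15).unwrap();
    let start = NaiveDate::from_ymd_opt(2024, 1, 15).unwrap();
    assert_eq!(months_between(end, start), 2.0);
}
```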
apply_next_day
next_day(date_col, day_of_week) - first date later than the input that falls on the given weekday (PySpark next_day).
apply_parse_url
parse_url(url_str, part, key) - extract URL component (PySpark parse_url). When part is QUERY/QUERYSTRING and key is Some(k), returns the value for that query parameter only.
apply_pmod
pmod(dividend, divisor) - positive modulus.
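The usual formula, shown per value (a sketch; division by zero would surface as null upstream):

```rust
fn pmod(a: i64, b: i64) -> Option<i64> {
    if b == 0 {
        return None;
    }
    // for a positive divisor the result lands in [0, b)
    Some(((a % b) + b) % b)
}

fn main() {
    assert_eq!(pmod(-7, 3), Some(2)); // plain % would give -1
    assert_eq!(pmod(7, 3), Some(1));
}
```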
apply_pyspark_add
PySpark-style addition with string/number coercion for Python Column operators.
apply_pyspark_divide
PySpark-style true division with string/number coercion for Python Column operators. Division by zero yields null (Spark/PySpark parity; issue #218).
apply_pyspark_mod
PySpark-style modulo with string/number coercion for Python Column operators.
apply_pyspark_multiply
PySpark-style multiplication with string/number coercion for Python Column operators.
apply_pyspark_subtract
PySpark-style subtraction with string/number coercion for Python Column operators.
apply_radians
Apply radians (degrees -> radians) to a float column.
apply_rand_with_seed
Apply rand: uniform [0, 1) per row, with optional seed (PySpark rand).
apply_randn_with_seed
Apply randn: standard normal per row, with optional seed (PySpark randn).
apply_regexp_extract_lookaround
regexp_extract using fancy-regex when the pattern has lookahead/lookbehind (PySpark parity). Polars str().extract() uses the regex crate, which does not support lookaround.
apply_regexp_instr
Regexp instr: 1-based position of first regex match (PySpark regexp_instr). group_idx: 0 = full match, 1+ = capture group. Returns null if no match.
apply_rint
Apply rint: round to the nearest integer, ties to even, returned as double (PySpark rint).
apply_round
Apply round to given decimal places. Supports numeric and string columns (PySpark parity: string columns containing numeric values are implicitly cast to double then rounded).
apply_sec
Apply sec (1/cos) to a float column.
apply_second
second(column) - extract second. Accepts string timestamp column (#403).
apply_sequence
Build array [start, start+step, …] up to but not past stop (PySpark sequence). Input column is a struct with fields “0”=start, “1”=stop, “2”=step (step optional, default 1).
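The walk itself is straightforward; a per-row sketch for integer inputs:

```rust
fn sequence(start: i64, stop: i64, step: i64) -> Option<Vec<i64>> {
    if step == 0 {
        return None; // a zero step can never reach stop
    }
    let mut out = Vec::new();
    let mut v = start;
    while (step > 0 && v <= stop) || (step < 0 && v >= stop) {
        out.push(v);
        v += step;
    }
    Some(out)
}

fn main() {
    assert_eq!(sequence(1, 7, 2), Some(vec![1, 3, 5, 7]));
    assert_eq!(sequence(5, 1, -2), Some(vec![5, 3, 1]));
}
```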
apply_sha1
SHA1 hash of string bytes, return hex string (PySpark sha1).
apply_sha2
SHA2 hash of string bytes, return hex string (PySpark sha2). bit_length 256 or 384 or 512.
apply_shift_right_unsigned
shiftRightUnsigned - logical right shift for i64 (PySpark shiftRightUnsigned).
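Logical right shift falls out of a round-trip through u64; masking the shift amount mirrors Java's behavior, which is an assumption about the helper:

```rust
fn shift_right_unsigned(x: i64, n: u32) -> i64 {
    // reinterpret as unsigned so the shift fills with zero bits, not the sign
    ((x as u64) >> (n % 64)) as i64
}

fn main() {
    assert_eq!(shift_right_unsigned(8, 1), 4);
    assert_eq!(shift_right_unsigned(-8, 1), i64::MAX - 3); // sign bit cleared
}
```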
apply_shuffle
Random permutation of list elements (PySpark shuffle). Uses rand::seq::SliceRandom.
apply_signum
Apply signum (-1, 0, or 1) to a numeric column.
apply_sin
Apply sin (radians) to a float column.
apply_sinh
Apply sinh to a float column.
apply_soundex
Apply soundex to a string column; returns a new Column (Series).
apply_split_part_regex
Split string by regex and return 1-based part (for split_part with regex delimiter).
apply_split_with_limit
Split string by delimiter with at most limit parts; remainder in last part (PySpark split with limit). Returns List(String). When limit <= 0, splits without limit.
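std's splitn already has the remainder-in-last-part behavior, so a per-value sketch is short:

```rust
fn split_with_limit(s: &str, delim: &str, limit: i32) -> Vec<String> {
    if limit <= 0 {
        // no limit: split on every occurrence
        s.split(delim).map(str::to_string).collect()
    } else {
        // at most `limit` parts; the remainder stays in the last part
        s.splitn(limit as usize, delim).map(str::to_string).collect()
    }
}

fn main() {
    assert_eq!(split_with_limit("a,b,c,d", ",", 2), vec!["a", "b,c,d"]);
    assert_eq!(split_with_limit("a,b,c", ",", 0), vec!["a", "b", "c"]);
}
```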
apply_str_to_map
Parse string to map: “k1:v1,k2:v2” -> List(Struct{key, value}) (PySpark str_to_map).
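A per-value sketch with the default `,` and `:` delimiters, using key/value pairs in place of the struct layout (treating a pair without `:` as a null value is an assumption about edge-case handling):

```rust
fn str_to_map(s: &str) -> Vec<(String, Option<String>)> {
    s.split(',')
        .map(|pair| match pair.split_once(':') {
            Some((k, v)) => (k.to_string(), Some(v.to_string())),
            None => (pair.to_string(), None), // no ':' -> null value
        })
        .collect()
}

fn main() {
    assert_eq!(
        str_to_map("a:1,b:2"),
        vec![("a".into(), Some("1".into())), ("b".into(), Some("2".into()))]
    );
}
```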
apply_string_to_boolean
Apply string-to-boolean cast. Handles string columns; passes through boolean; numeric types (0/0.0 -> false, non-zero -> true for PySpark parity #399); null for others (try_cast) or error (cast).
apply_string_to_date
Apply string-to-date cast. Handles string columns (accepts date and datetime strings, Spark parity); passes through date; casts datetime to date; others error (cast) or null (try_cast).
apply_string_to_date_format
Apply string-to-date with optional format (PySpark to_date(col, format)). When format is None uses default parsing; when Some parses with given format.
apply_string_to_double
Apply string-to-double cast. Handles string columns: empty/invalid -> null (Spark parity); passes through numeric columns; others error (strict) or null.
apply_string_to_int
Apply string-to-int cast. Handles string columns: empty/invalid -> null (Spark parity); passes through int columns; others error (strict) or null.
apply_struct_with_field
Replace or add a struct field (PySpark withField). Used because Polars 0.53+ no longer accepts “*” in with_fields.
apply_tan
Apply tan (radians) to a float column.
apply_tanh
Apply tanh to a float column.
apply_to_binary
to_binary(expr, fmt): PySpark to_binary. fmt ‘utf-8’ => hex(utf8 bytes), ‘hex’ => validate and return hex. Returns hex string.
apply_to_csv
to_csv(struct_col) - format struct as CSV string (PySpark to_csv). Minimal: uses struct cast to string.
apply_to_timestamp_format
to_timestamp(column, format?) / try_to_timestamp(column, format?) - string to timestamp. When format is Some, parse with that format (PySpark-style mapped to chrono); when None, use default. Strips whitespace from string values before parsing (PySpark parity #273). strict: true for to_timestamp (error on invalid), false for try_to_timestamp (null on invalid).
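The strict/try split can be sketched with chrono (assumed dependency; `parse_ts` is illustrative and the PySpark-to-chrono pattern mapping is elided):

```rust
use chrono::NaiveDateTime;

fn parse_ts(s: &str, fmt: &str, strict: bool) -> Result<Option<NaiveDateTime>, String> {
    // whitespace is stripped before parsing, per the #273 parity note
    match NaiveDateTime::parse_from_str(s.trim(), fmt) {
        Ok(ts) => Ok(Some(ts)),
        Err(e) if strict => Err(format!("to_timestamp: {e}: {s:?}")), // to_timestamp
        Err(_) => Ok(None), // try_to_timestamp: null on invalid input
    }
}

fn main() {
    assert!(parse_ts(" 2024-05-01 12:00:00 ", "%Y-%m-%d %H:%M:%S", true).is_ok());
    assert!(parse_ts("oops", "%Y-%m-%d %H:%M:%S", false).unwrap().is_none());
}
```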
apply_to_timestamp_ltz_format
Parse string as timestamp in local timezone, return UTC micros (PySpark to_timestamp_ltz).
apply_to_timestamp_ntz_format
Parse string as timestamp without timezone (PySpark to_timestamp_ntz). Returns Datetime(_, None).
apply_to_utc_timestamp
to_utc_timestamp(ts_col, tz) - interpret ts as in tz, convert to UTC. For UTC-stored timestamps, instant unchanged.
apply_try_add
try_add: returns null on overflow.
apply_try_aes_decrypt
try_aes_decrypt: same as aes_decrypt, returns null on failure (PySpark try_aes_decrypt).
apply_try_multiply
try_multiply: returns null on overflow.
apply_try_subtract
try_subtract: returns null on overflow.
apply_try_to_binary
try_to_binary: like to_binary but returns null on failure.
apply_typeof
typeof: return dtype as string (PySpark typeof).
apply_unbase64
Base64 decode to string (PySpark unbase64). Output UTF-8 string; invalid decode -> null.
apply_unhex
Apply unhex: hex string to binary/string (PySpark unhex).
apply_unix_date
unix_date(column) - date to days since 1970-01-01.
apply_unix_timestamp
unix_timestamp(column, format?) - parse string to seconds since epoch.
apply_url_decode
url_decode(column) - percent-decode URL-encoded string (PySpark url_decode).
apply_url_encode
url_encode(column) - percent-encode string for URL (PySpark url_encode).
apply_weekday
weekday(date_col) - 0=Mon, 6=Sun (PySpark weekday).
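chrono's Datelike already exposes this numbering, so the mapping is direct (chrono assumed):

```rust
use chrono::{Datelike, NaiveDate, Weekday};

fn weekday(d: NaiveDate) -> u32 {
    // num_days_from_monday matches the 0=Mon..6=Sun convention exactly
    d.weekday().num_days_from_monday()
}

fn main() {
    let d = NaiveDate::from_ymd_opt(2024, 1, 1).unwrap();
    assert_eq!(d.weekday(), Weekday::Mon);
    assert_eq!(weekday(d), 0);
}
```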
apply_xxhash64
Apply XXH64 hash (PySpark xxhash64).
apply_zip_arrays_to_struct
Zip two array columns into List(Struct{left, right}) for zip_with. Shorter padded with null.
series_rand_n
Build a Series of n uniform [0, 1) values with optional seed (for with_column PySpark-like rand).
series_randn_n
Build a Series of n standard normal values with optional seed (for with_column PySpark-like randn).