Module functions

Source
Expand description

A re-implementation of Spark functions

Functions§

abs
Computes the absolute value.
acos
Computes inverse cosine of the input column.
acosh
Computes inverse hyperbolic cosine of the input column.
add_months
Returns the date that is months months after start.
aes_decrypt
Returns a decrypted value of input using AES in mode with padding
aes_encrypt
Returns a encrypted value of input using AES in mode with the specified padding
any_value
Returns some value of col for a group of rows.
approx_count_distinct
Returns a new Column for approximate distinct count of column col.
array
Creates a new array column.
array_agg
Returns a list of objects with duplicates.
array_append
Returns an array of the elements in col1 along with the added element in col2 at the last of the array.
array_compact
Removes null values from the array.
array_contains
Returns null if the array is null, true if the array contains the given value, and false otherwise.
array_distinct
Removes duplicate values from the array.
array_except
Returns an array of the elements in col1 but not in col2, without duplicates.
array_insert
adds an item into a given array at a specified array index.
array_intersect
Returns an array of the elements in the intersection of col1 and col2, without duplicates.
array_join
Concatenates the elements of column using the delimiter.
array_max
Returns the maximum value of the array.
array_min
Returns the minimum value of the array.
array_position
Locates the position of the first occurrence of the given value in the given array.
array_prepend
Returns an array containing element as well as all elements from array.
array_remove
Remove all elements that equal to element from the given array.
array_repeat
Creates an array containing a column repeated count times.
array_size
Returns the total number of elements in the array.
array_sort
array_union
Returns an array of the elements in the union of col1 and col2, without duplicates.
arrays_overlap
Returns true if the arrays contain any common non-null element; if not, returns null if both the arrays are non-empty and any of them contains a null element; returns false otherwise.
arrays_zip
Returns a merged array of structs in which the N-th struct contains all N-th values of input arrays.
asc
Returns a sort expression based on the ascending order of the given column name.
asc_nulls_first
Returns a sort expression based on the ascending order of the given column name, and null values return before non-null values.
asc_nulls_last
Returns a sort expression based on the ascending order of the given column name, and null values appear after non-null values.
ascii
Computes the numeric value of the first character of the string column.
asin
Computes inverse sine of the input column.
asinh
Computes inverse hyperbolic sine of the input column.
assert_true
Returns null if the input column is true; throws an exception with the provided error message otherwise
atan
Compute inverse tangent of the input column.
atan2
Computes inverse hyperbolic tangent of the input columns.
atanh
Computes inverse hyperbolic tangent of the input column.
avg
Returns the average of the values in a group.
base64
Computes the BASE64 encoding of a binary column and returns it as a string column.
bin
Returns the string representation of the binary value of the given column.
bit_and
Returns the bitwise AND of all non-null input values, or null if none.
bit_count
Returns the number of bits that are set in the argument expr as an unsigned 64-bit integer, or NULL if the argument is NULL.
bit_get
Returns the value of the bit (0 or 1) at the specified position.
bit_length
Calculates the bit length for the specified string column.
bit_or
Returns the bitwise OR of all non-null input values, or null if none.
bit_xor
Returns the bitwise XOR of all non-null input values, or null if none.
bitmap_bit_position
Returns the bit position for the given input column.
bitmap_bucket_number
Returns the bucket number for the given input column.
bitmap_construct_agg
Returns a bitmap with the positions of the bits set from all the values from the input column.
bitmap_count
Returns the number of set bits in the input bitmap.
bitmap_or_agg
Returns a bitmap that is the bitwise OR of all of the bitmaps from the input column.
bitwise_not
Computes bitwise not.
bool_and
Returns true if all values of col are true.
bool_or
Returns true if at least one value of col is true.
broadcast
Marks a DataFrame as small enough for use in broadcast joins.
bround
Round the given value to scale decimal places using HALF_EVEN rounding mode if scale >= 0 or at integral part when scale < 0.
btrim
bucket
A transform for any type that partitions by a hash of the input column.
cardinality
Returns the length of the array or map stored in the column.
cbrt
Computes the cube-root of the given value.
ceil
Computes the ceiling of the given value.
ceiling
Computes the ceiling of the given value.
char
Returns the ASCII character having the binary equivalent to col.
char_length
Returns the character length of string data or number of bytes of binary data.
character_length
Returns the character length of string data or number of bytes of binary data.
coalesce
Returns the first column that is not null.
col
Returns a Column based on the given column name.
collect_list
Returns a list of objects with duplicates.
collect_set
Returns a set of objects with duplicate elements eliminated.
column
Returns a Column based on the given column name.
concat
Concatenates multiple input columns together into a single column.
concat_ws
contains
Returns a boolean.
conv
Convert a number in a string column from one base to another.
convert_timezone
Converts the timestamp without time zone sourceTs from the sourceTz time zone to targetTz.
corr
Returns a new Column for the Pearson Correlation Coefficient for col1 and col2.
cos
Computes cosine of the input column.
cosh
Computes hyperbolic cosine of the input column.
cot
Computes cotangent of the input column.
count
Returns the number of items in a group.
count_distinct
count_if
Returns the number of TRUE values for the col.
count_min_sketch
Returns a count-min sketch of a column with the given esp, confidence and seed.
covar_pop
Returns a new Column for the population covariance of col1 and col2.
covar_samp
Returns a new Column for the sample covariance of col1 and col2.
crc32
Calculates the cyclic redundancy check value (CRC32) of a binary column and returns the value as a bigint.
create_map
Create a new map column.
csc
Computes cosecant of the input column.
cume_dist
Returns the cumulative distribution of values within a window partition, i.e.
curdate
Returns the current date at the start of query evaluation as a DateType column.
current_catalog
Returns the current catalog.
current_database
Returns the current database.
current_date
Returns the current date at the start of query evaluation as a DateType column.
current_schema
Returns the current database.
current_timestamp
Returns the current timestamp at the start of query evaluation as a TimestampType column.
current_timezone
Returns the current session local timezone.
current_user
Returns the current user.
date_add
Returns the date that is days days after start.
date_diff
Returns the number of days from start to end.
date_format
Converts a date/timestamp/string to a value of string in the format specified by the date format given by the second argument.
date_from_unix_date
Create date from the number of days since 1970-01-01.
date_part
Extracts a part of the date/timestamp or interval source.
date_sub
Returns the date that is days days before start.
date_trunc
Returns timestamp truncated to the unit specified by the format.
dateadd
Returns the date that is days days after start.
datediff
Returns the number of days from start to end.
day
Extract the day of the month of a given date/timestamp as integer.
dayofmonth
Extract the day of the month of a given date/timestamp as integer.
dayofweek
Extract the day of the week of a given date/timestamp as integer.
dayofyear
Extract the day of the year of a given date/timestamp as integer.
days
A transform for timestamps and dates to partition data into days.
decode
Computes the first argument into a string from a binary using the provided character set (one of ‘US-ASCII’, ‘ISO-8859-1’, ‘UTF-8’, ‘UTF-16BE’, ‘UTF-16LE’, ‘UTF-16’).
degrees
Converts an angle measured in radians to an approximately equivalent angle measured in degrees.
dense_rank
Returns the rank of rows within a window partition, without any gaps
desc
Returns a sort expression based on the descending order of the given column name.
desc_nulls_first
Returns a sort expression based on the descending order of the given column name, and null values appear before non-null values.
desc_nulls_last
Returns a sort expression based on the descending order of the given column name, and null values appear after non-null values.
e
Returns Euler’s number.
element_at
Returns element of array at given index in extraction if col is array.
elt
Returns the n-th input, e.g., returns input2 when n is 2.
encode
Computes the first argument into a string from a binary using the provided character set (one of ‘US-ASCII’, ‘ISO-8859-1’, ‘UTF-8’, ‘UTF-16BE’, ‘UTF-16LE’, ‘UTF-16’).
endswith
Returns a boolean.
equal_null
Returns same result as the EQUAL(=) operator for non-null operands, but returns true if both are null, false if one of the them is null.
every
Returns true if all values of col are true.
exp
Computes the exponential of the given value.
explode
Returns a new row for each element in the given array or map.
explode_outer
Returns a new row for each element in the given array or map.
expm1
Computes the exponential of the given value minus one.
expr
Parses the expression string into the column that it represents
extract
Extracts a part of the date/timestamp or interval source.
factorial
Computes the factorial of the given value.
find_in_set
Returns the index (1-based) of the given string (str) in the comma-delimited list (strArray).
first
Returns the first value in a group.
first_value
Returns the first value of col for a group of rows.
flatten
Creates a single array from an array of arrays.
floor
Computes the floor of the given value.
format_number
Formats the number X to a format like ‘#,–#,–#.–’, rounded to d decimal places with HALF_EVEN round mode, and returns the result as a string.
format_string
from_csv
Parses a column containing a CSV string to a row with the specific schema. Returns null in the case of an unparseable string
from_json
Parses a column containing a JSON string to a row with the specific schema. Returns null in the case of an unparseable string
from_unixtime
Converts the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone in the given format.
from_utc_timestamp
This is a common function for databases supporting TIMESTAMP WITHOUT TIMEZONE.
get
Returns element of array at given (0-based) index.
get_json_object
Extracts json object from a json string based on json path specified, and returns json string of the extracted json object.
getbit
Returns the value of the bit (0 or 1) at the specified position.
greatest
Returns the greatest value of the list of column names, skipping null values.
grouping
Indicates whether a specified column in a GROUP BY list is aggregated or not, returns 1 for aggregated or 0 for not aggregated in the result set.
grouping_id
Returns the level of grouping, equals to
hash
Calculates the hash code of given columns, and returns the result as an int column.
hex
Computes hex value of the given column
histogram_numeric
Computes a histogram on numeric ‘col’ using nb bins.
hll_sketch_agg
Returns the updatable binary representation of the Datasketches HllSketch configured with lgConfigK arg.
hll_sketch_estimate
Returns the estimated number of unique values given the binary representation of a Datasketches HllSketch.
hll_union
hll_union_agg
Returns the updatable binary representation of the Datasketches HllSketch, generated by merging previously created Datasketches HllSketch instances via a Datasketches Union instance.
hour
Extract the hours of a given timestamp as integer.
hours
A transform for timestamps to partition data into hours.
hypot
Computes sqrt(a^2 + b^2) without intermediate overflow or underflow.
ifnull
Returns col2 if col1 is null, or col1 otherwise.
ilike
initcap
Translate the first letter of each word to upper case in the sentence.
inline
Explodes an array of structs into a table.
inline_outer
Explodes an array of structs into a table.
input_file_block_length
Returns the length of the block being read, or -1 if not available.
input_file_block_start
Returns the start offset of the block being read, or -1 if not available.
input_file_name
Creates a string column for the file name of the current Spark task.
instr
Locate the position of the first occurrence of substr column in the given string.
isnan
An expression that returns true if the column is NaN.
isnotnull
Returns true if col is not null, or false otherwise.
isnull
An expression that returns true if the column is null
java_method
Calls a method with reflection.
json_array_length
Returns the number of elements in the outermost JSON array.
json_object_keys
Returns all the keys of the outermost JSON object as an array.
json_tuple
Creates a new row for a json column according to the given field names.
kurtosis
Returns the kurtosis of the values in a group.
lag
Returns the value that os offset rows before the current row, and default is there is less than offset rows before the current row
last
Returns the last value in a group.
last_day
Returns the last day of the month which the given date belongs to.
last_value
Returns the last value of col for a group of rows.
lcase
Returns str with all characters changed to lowercase.
lead
Returns the value that os offset rows after the current row, and default is there is less than offset rows after the current row
least
Returns the least value of the list of column names, skipping null values.
left
Returns the leftmost len(len can be string type) characters from the string str, if len is less or equal than 0 the result is an empty string.
length
Computes the character length of string data or number of bytes of binary data.
levenshtein
like
lit
Creates a Column of spark::expression::Literal value.
ln
Returns the natural logarithm of the argument.
localtimestamp
Returns the current timestamp without time zone at the start of query evaluation as a timestamp without time zone column.
locate
log
Returns the first argument-based logarithm of the second argument.
log2
Returns the base-2 logarithm of the argument.
log1p
Computes the natural logarithm of the “given value plus one”.
log10
Computes the logarithm of the given value in Base 10.
lower
Converts a string expression to lower case.
lpad
Left-pad the string column to width len with pad.
ltrim
Trim the spaces from left end for the specified string value.
make_date
Returns a column with a date built from the year, month and day columns.
make_dt_interval
Make DayTimeIntervalType duration from days, hours, mins and secs.
make_interval
Make interval from years, months, weeks, days, hours, mins and secs.
make_timestamp
Create timestamp from years, months, days, hours, mins, secs and timezone fields.
make_timestamp_ltz
Create the current timestamp with local time zone from years, months, days, hours, mins, secs and timezone fields.
make_timestamp_ntz
Create local date-time from years, months, days, hours, mins, secs fields.
make_ym_interval
Make year-month interval from years, months.
map_concat
Returns the union of all the given maps.
map_contains_key
Returns true if the map contains the key.
map_entries
Returns an unordered array of all entries in the given map.
map_from_arrays
Creates a new map from two arrays.
map_from_entries
Converts an array of entries (key value struct types) to a map of values.
map_keys
Returns an unordered array containing the keys of the map.
map_values
Returns an unordered array containing the values of the map.
max
Returns the maximum value of the expression in a group.
max_by
Returns the value associated with the maximum value of ord.
md5
Calculates the MD5 digest and returns the value as a 32 character hex string.
mean
returns the average of the values in a group.
median
Returns the median of the values in a group
min
Returns the minimum value of the expression in a group.
min_by
Returns the value associated with the minimum value of ord.
minute
Extract the minutes of a given timestamp as integer.
mode
Returns the most frequent value in a group.
monotonically_increasing_id
A column that generates monotonically increasing 64-bit integers.
month
Extract the month of a given date/timestamp as integer.
months
A transform for timestamps and dates to partition data into months.
months_between
Returns number of months between dates date1 and date2.
named_struct
Creates a struct with the given field names and values.
nanvl
Returns col1 if it is not NaN, or col2 if col1 is NaN.
negate
Returns the negative value.
negative
Returns the negative value.
next_day
Returns the first date which is later than the value of the date column based on second week day argument.
now
Returns the current timestamp at the start of query evaluation.
nth_value
ntile
Returns the ntile group id (from 1 to n inclusive) in an ordered window partition.
nullif
Returns null if col1 equals to col2, or col1 otherwise.
nvl
Returns col2 if col1 is null, or col1 otherwise.
nvl2
Returns col2 if col1 is not null, or col3 otherwise.
octet_length
Calculates the byte length for the specified string column.
overlay
parse_url
percent_rank
Returns the relative rank
percentile
Returns the exact percentile(s) of numeric column expr at the given percentage(s) with value range in [0.0, 1.0].
percentile_approx
Returns the approximate percentile of the numeric column col which is the smallest value in the ordered col values (sorted from least to greatest) such that no more than percentage of col values is less than the value or equal to that value.
pi
Returns Pi.
pmod
Returns the positive value of dividend mod divisor.
posexplode
Returns a new row for each element with position in the given array or map.
posexplode_outer
Returns a new row for each element with position in the given array or map.
position
positive
Returns the value.
pow
Returns the value of the first argument raised to the power of the second argument.
power
Returns the value of the first argument raised to the power of the second argument.
printf
product
Returns the product of the values in a group.
quarter
Extract the quarter of a given date/timestamp as integer.
radians
Converts an angle measured in degrees to an approximately equivalent angle measured in radians.
raise_error
Throws an exception with the provided error message.
rand
Generates a random column with independent and identically distributed (i.i.d.) samples uniformly distributed in [0.0, 1.0).
randn
Generates a column with independent and identically distributed (i.i.d.) samples from the standard normal distribution.
rank
Returns the rank of rows within a window partition.
reflect
Calls a method with reflection.
regexp
Returns true if str matches the Java regex regexp, or false otherwise.
regexp_count
Returns a count of the number of times that the Java regex pattern regexp is matched in the string str.
regexp_extract
regexp_extract_all
regexp_instr
regexp_like
Returns true if str matches the Java regex regexp, or false otherwise.
regexp_replace
Replace all substrings of the specified string value that match regexp with replacement
regexp_substr
Returns the substring that matches the Java regex regexp within the string str.
regr_avgx
Returns the average of the independent variable for non-null pairs in a group, where y is the dependent variable and x is the independent variable.
regr_avgy
Returns the average of the dependent variable for non-null pairs in a group, where y is the dependent variable and x is the independent variable.
regr_count
Returns the number of non-null number pairs in a group, where y is the dependent variable and x is the independent variable.
regr_intercept
Returns the intercept of the univariate linear regression line for non-null pairs in a group, where y is the dependent variable and x is the independent variable.
regr_r2
Returns the coefficient of determination for non-null pairs in a group, where y is the dependent variable and x is the independent variable.
regr_slope
Returns the slope of the linear regression line for non-null pairs in a group, where y is the dependent variable and x is the independent variable.
regr_sxx
Returns REGR_COUNT(y, x) * VAR_POP(x) for non-null pairs in a group, where y is the dependent variable and x is the independent variable.
regr_sxy
Returns REGR_COUNT(y, x) * COVAR_POP(y, x) for non-null pairs in a group, where y is the dependent variable and x is the independent variable.
regr_syy
Returns REGR_COUNT(y, x) * VAR_POP(y) for non-null pairs in a group, where y is the dependent variable and x is the independent variable.
repeat
Repeats a string column n times, and returns it as a new string column.
replace
reverse
Returns a reversed string or an array with reverse order of elements.
right
Returns the rightmost len(len can be string type) characters from the string str, if len is less or equal than 0 the result is an empty string.
rint
Returns the double value that is closest in value to the argument and is equal to a mathematical integer.
rlike
Returns true if str matches the Java regex regexp, or false otherwise.
round
Round the given value to scale decimal places using HALF_UP rounding mode if scale >= 0 or at integral part when scale < 0.
row_number
Returns a sequential number starting at 1 within a window partition.
rpad
Right-pad the string column to width len with pad.
rtrim
Trim the spaces from right end for the specified string value.
schema_of_csv
Parses a CSV string and infers its schema in DDL format
schema_of_json
Parses a JSON string and infers its schema in DDL format
sec
Computes secant of the input column.
second
Extract the seconds of a given date as integer.
sentences
Splits a string into arrays of sentences, where each sentence is an array of words
sequence
Generate a sequence of integers from start to stop, incrementing by step.
session_window
Generates session window given a timestamp specifying column.
sha
Returns a sha1 hash value as a hex string of the col.
sha1
Returns the hex string result of SHA-1.
sha2
Returns the hex string result of SHA-2 family of hash functions (SHA-224, SHA-256, SHA-384, and SHA-512).
shiftleft
Shift the given value numBits left.
shiftright
(Signed) shift the given value numBits right.
shiftrightunsigned
(Signed) shift the given value numBits right.
shuffle
Generates a random permutation of the given array.
sign
Computes the signum of the given value.
signum
Computes the signum of the given value.
sin
Computes sine of the input column.
sinh
Computes hyperbolic sine of the input column.
size
Returns the length of the array or map stored in the column.
skewness
Returns the skewness of the values in a group.
slice
Returns an array containing all the elements in x from index start (array indices start at 1, or from the end if start is negative) with the specified length.
some
Returns true if at least one value of col is true.
sort_array
Sorts the input array in ascending or descending order according to the natural ordering of the array elements.
soundex
Returns the SoundEx encoding for a string
spark_partition_id
A column for partition ID.
split
Splits str around matches of the given pattern.
split_part
Splits str by delimiter and return requested part of the split (1-based).
sqrt
Computes the square root of the specified float value.
stack
Separates col1, …, colk into n rows
startswith
Returns a boolean.
std
Alias for stddev_samp.
stddev
Alias for stddev_samp.
stddev_pop
Returns population standard deviation of the expression in a group.
stddev_samp
Returns the unbiased sample standard deviation of the expression in a group.
str_to_map
Create a map after splitting the text into key/value pairs using delimiters
struct_col
Creates a new struct column.
substr
substring
Substring starts at pos and is of length len when str is String type or returns the slice of byte array that starts at pos in byte and is of length len when str is Binary type.
substring_index
Returns the substring from string str before count occurrences of the delimiter delim.
sum
Returns the sum of all values in the expression.
sum_distinct
Returns the sum of distinct values in the expression.
tan
Computes tangent of the input column.
tanh
Computes hyperbolic tangent of the input column.
timestamp_micros
Creates timestamp from the number of microseconds since UTC epoch.
timestamp_millis
Creates timestamp from the number of milliseconds since UTC epoch.
timestamp_seconds
Converts the number of seconds from the Unix epoch (1970-01-01T00:00:00Z) to a timestamp.
to_binary
to_char
Convert col to a string based on the format.
to_csv
Converts a column containing a StructType into a CSV string
to_date
Parses the timestamp with the format to a timestamp without time zone.
to_json
Converts a column containing a StructType into a JSON string
to_number
Convert string ‘col’ to a number based on the string format ‘format’
to_timestamp
Converts a Column into pyspark.sql.types.TimestampType using the optionally specified format.
to_timestamp_ltz
Parses the timestamp with the format to a timestamp without time zone.
to_timestamp_ntz
Parses the timestamp with the format to a timestamp without time zone.
to_unix_timestamp
Returns the UNIX timestamp of the given time.
to_utc_timestamp
This is a common function for databases supporting TIMESTAMP WITHOUT TIMEZONE.
to_varchar
Convert col to a string based on the format.
translate
A function translate any character in the srcCol by a character in matching.
trim
Trim the spaces from both ends for the specified string column.
trunc
Returns date truncated to the unit specified by the format.
try_add
Returns the sum of left and right and the result is null on overflow.
try_aes_decrypt
try_avg
Returns the mean calculated from values of a group and the result is null on overflow.
try_divide
Returns dividend/divisor.
try_element_at
Returns element of array at given (1-based) index.
try_multiply
Returns left*right and the result is null on overflow.
try_subtract
Returns left-right and the result is null on overflow.
try_sum
Returns sum calculated from values of a group and the result is null on overflow.
try_to_binary
try_to_number
Convert string ‘col’ to a number based on the string format ‘format’
try_to_timestamp
ucase
Returns str with all characters changed to uppercase.
unbase64
Decodes a BASE64 encoded string column and returns it as a binary column.
unhex
Inverse of hex.
unix_date
Returns the number of days since 1970-01-01.
unix_micros
Returns the number of microseconds since 1970-01-01 00:00:00 UTC.
unix_millis
Returns the number of milliseconds since 1970-01-01 00:00:00 UTC.
unix_seconds
Returns the number of seconds since 1970-01-01 00:00:00 UTC.
unix_timestamp
Convert time string with given pattern (‘yyyy-MM-dd HH:mm:ss’, by default) to Unix time stamp (in seconds), using the default timezone and the default locale, returns null if failed.
upper
Converts a string expression to upper case.
url_decode
Decodes a str in ‘application/x-www-form-urlencoded’ format using a specific encoding scheme.
url_encode
Translates a string into ‘application/x-www-form-urlencoded’ format using a specific encoding scheme.
user
Returns the current database.
var_pop
Returns the population variance of the values in a group.
var_samp
Returns the unbiased sample variance of the values in a group.
variance
Alias for var_samp
version
Returns the Spark version.
weekday
Returns the day of the week for date/timestamp (0 = Monday, 1 = Tuesday, …, 6 = Sunday).
weekofyear
Extract the week number of a given date as integer.
width_bucket
Returns the bucket number into which the value of this expression would fall after being evaluated.
window
Bucketize rows into one or moe time windows given a timestamp specifying column.
window_time
Computes the event time from a window column.
xpath
Returns a string array of values within the nodes of xml that match the XPath expression.
xpath_boolean
Returns true if the XPath expression evaluates to true, or if a matching node is found.
xpath_double
Returns a double value, the value zero if no match is found, or NaN if a match is found but the value is non-numeric.
xpath_float
Returns a float value, the value zero if no match is found, or NaN if a match is found but the value is non-numeric.
xpath_int
Returns an integer value, or the value zero if no match is found, or a match is found but the value is non-numeric.
xpath_long
Returns a long integer value, or the value zero if no match is found, or a match is found but the value is non-numeric.
xpath_number
Returns a double value, the value zero if no match is found, or NaN if a match is found but the value is non-numeric.
xpath_short
Returns a short integer value, or the value zero if no match is found, or a match is found but the value is non-numeric.
xpath_string
Returns the text contents of the first xml node that matches the XPath expression.
xxhash64
Calculates the hash code of given columns using the 64-bit variant of the xxHash algorithm, and returns the result as a long column.
year
Extract the year of a given date/timestamp as integer.
years
A transform for timestamps and dates to partition data into years.