A re-implementation of Spark functions
Functions
- Computes the absolute value.
- Computes inverse cosine of the input column.
- Computes inverse hyperbolic cosine of the input column.
- Returns the date that is months months after start.
- Returns some value of col for a group of rows.
- Returns a new Column for approximate distinct count of column col.
- Creates a new array column.
- Returns a list of objects with duplicates.
- Returns an array of the elements in col1 with the element in col2 appended at the end of the array.
- Removes null values from the array.
- Returns null if the array is null, true if the array contains the given value, and false otherwise.
- Removes duplicate values from the array.
- Returns an array of the elements in col1 but not in col2, without duplicates.
- Adds an item into a given array at a specified array index.
- Returns an array of the elements in the intersection of col1 and col2, without duplicates.
- Concatenates the elements of column using the delimiter.
- Returns the maximum value of the array.
- Returns the minimum value of the array.
- Locates the position of the first occurrence of the given value in the given array.
- Returns an array containing element as well as all elements from array.
- Removes all elements equal to element from the given array.
- Creates an array containing a column repeated count times.
- Returns the total number of elements in the array.
- Returns an array of the elements in the union of col1 and col2, without duplicates.
- Returns true if the arrays contain any common non-null element; if not, returns null if both the arrays are non-empty and any of them contains a null element; returns false otherwise.
- Returns a merged array of structs in which the N-th struct contains all N-th values of input arrays.
- Returns a sort expression based on the ascending order of the given column name.
- Returns a sort expression based on the ascending order of the given column name, and null values appear before non-null values.
- Returns a sort expression based on the ascending order of the given column name, and null values appear after non-null values.
- Computes the numeric value of the first character of the string column.
- Computes inverse sine of the input column.
- Computes inverse hyperbolic sine of the input column.
- Computes inverse tangent of the input column.
- Computes inverse hyperbolic tangent of the input columns.
- Computes inverse hyperbolic tangent of the input column.
- Returns the average of the values in a group.
- Computes the BASE64 encoding of a binary column and returns it as a string column.
- Returns the string representation of the binary value of the given column.
- Returns the bitwise AND of all non-null input values, or null if none.
- Returns the number of bits that are set in the argument expr as an unsigned 64-bit integer, or NULL if the argument is NULL.
- Returns the value of the bit (0 or 1) at the specified position.
- Calculates the bit length for the specified string column.
- Returns the bitwise OR of all non-null input values, or null if none.
- Returns the bitwise XOR of all non-null input values, or null if none.
- Returns the bit position for the given input column.
- Returns the bucket number for the given input column.
- Returns a bitmap with the positions of the bits set from all the values from the input column.
- Returns the number of set bits in the input bitmap.
- Returns a bitmap that is the bitwise OR of all of the bitmaps from the input column.
- Computes bitwise not.
- Returns true if all values of col are true.
- Returns true if at least one value of col is true.
- Marks a DataFrame as small enough for use in broadcast joins.
- Round the given value to scale decimal places using HALF_EVEN rounding mode if scale >= 0 or at integral part when scale < 0.
- A transform for any type that partitions by a hash of the input column.
- Returns the length of the array or map stored in the column.
- Computes the cube-root of the given value.
- Computes the ceiling of the given value.
- Computes the ceiling of the given value.
- Returns the ASCII character having the binary equivalent to col.
- Returns the character length of string data or number of bytes of binary data.
- Returns the character length of string data or number of bytes of binary data.
- Returns the first column that is not null.
- Returns a Column based on the given column name.
- Returns a list of objects with duplicates.
- Returns a set of objects with duplicate elements eliminated.
- Returns a Column based on the given column name.
- Concatenates multiple input columns together into a single column.
- Returns a boolean.
- Convert a number in a string column from one base to another.
- Converts the timestamp without time zone sourceTs from the sourceTz time zone to targetTz.
- Returns a new Column for the Pearson Correlation Coefficient for col1 and col2.
- Computes cosine of the input column.
- Computes hyperbolic cosine of the input column.
- Computes cotangent of the input column.
- Returns the number of TRUE values for the col.
- Returns a count-min sketch of a column with the given eps, confidence and seed.
- Returns a new Column for the population covariance of col1 and col2.
- Returns a new Column for the sample covariance of col1 and col2.
- Calculates the cyclic redundancy check value (CRC32) of a binary column and returns the value as a bigint.
- Creates a new map column.
- Computes cosecant of the input column.
- Returns the cumulative distribution of values within a window partition.
- Returns the current date at the start of query evaluation as a DateType column.
- Returns the current catalog.
- Returns the current database.
- Returns the current date at the start of query evaluation as a DateType column.
- Returns the current database.
- Returns the current timestamp at the start of query evaluation as a TimestampType column.
- Returns the current session local timezone.
- Returns the current user.
- Returns the date that is days days after start.
- Returns the number of days from start to end.
- Converts a date/timestamp/string to a value of string in the format specified by the date format given by the second argument.
- Create date from the number of days since 1970-01-01.
- Extracts a part of the date/timestamp or interval source.
- Returns the date that is days days before start.
- Returns timestamp truncated to the unit specified by the format.
- Returns the date that is days days after start.
- Returns the number of days from start to end.
- Extract the day of the month of a given date/timestamp as integer.
- Extract the day of the month of a given date/timestamp as integer.
- Extract the day of the week of a given date/timestamp as integer.
- Extract the day of the year of a given date/timestamp as integer.
- A transform for timestamps and dates to partition data into days.
- Computes the first argument into a string from a binary using the provided character set (one of ‘US-ASCII’, ‘ISO-8859-1’, ‘UTF-8’, ‘UTF-16BE’, ‘UTF-16LE’, ‘UTF-16’).
- Converts an angle measured in radians to an approximately equivalent angle measured in degrees.
- Returns the rank of rows within a window partition, without any gaps.
- Returns a sort expression based on the descending order of the given column name.
- Returns a sort expression based on the descending order of the given column name, and null values appear before non-null values.
- Returns a sort expression based on the descending order of the given column name, and null values appear after non-null values.
- Returns Euler’s number.
- Returns element of array at given index in extraction if col is array.
- Returns the n-th input, e.g., returns input2 when n is 2.
- Computes the first argument into a string from a binary using the provided character set (one of ‘US-ASCII’, ‘ISO-8859-1’, ‘UTF-8’, ‘UTF-16BE’, ‘UTF-16LE’, ‘UTF-16’).
- Returns a boolean.
- Returns the same result as the EQUAL(=) operator for non-null operands, but returns true if both are null, false if one of them is null.
- Returns true if all values of col are true.
- Computes the exponential of the given value.
- Returns a new row for each element in the given array or map.
- Returns a new row for each element in the given array or map.
- Computes the exponential of the given value minus one.
- Parses the expression string into the column that it represents.
- Extracts a part of the date/timestamp or interval source.
- Computes the factorial of the given value.
- Returns the index (1-based) of the given string (str) in the comma-delimited list (strArray).
- Returns the first value in a group.
- Returns the first value of col for a group of rows.
- Creates a single array from an array of arrays.
- Computes the floor of the given value.
- Formats the number X to a format like ‘#,###,###.##’, rounded to d decimal places with HALF_EVEN round mode, and returns the result as a string.
- Converts the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone in the given format.
- This is a common function for databases supporting TIMESTAMP WITHOUT TIMEZONE.
- Returns element of array at given (0-based) index.
- Extracts json object from a json string based on json path specified, and returns json string of the extracted json object.
- Returns the value of the bit (0 or 1) at the specified position.
- Returns the greatest value of the list of column names, skipping null values.
- Indicates whether a specified column in a GROUP BY list is aggregated or not, returns 1 for aggregated or 0 for not aggregated in the result set.
- Returns the level of grouping.
- Calculates the hash code of given columns, and returns the result as an int column.
- Computes hex value of the given column.
- Computes a histogram on numeric ‘col’ using nb bins.
- Returns the updatable binary representation of the Datasketches HllSketch configured with lgConfigK arg.
- Returns the estimated number of unique values given the binary representation of a Datasketches HllSketch.
- Returns the updatable binary representation of the Datasketches HllSketch, generated by merging previously created Datasketches HllSketch instances via a Datasketches Union instance.
- Extract the hours of a given timestamp as integer.
- A transform for timestamps to partition data into hours.
- Computes sqrt(a^2 + b^2) without intermediate overflow or underflow.
- Returns col2 if col1 is null, or col1 otherwise.
- Translate the first letter of each word to upper case in the sentence.
- Explodes an array of structs into a table.
- Explodes an array of structs into a table.
- Returns the length of the block being read, or -1 if not available.
- Returns the start offset of the block being read, or -1 if not available.
- Creates a string column for the file name of the current Spark task.
- Locate the position of the first occurrence of substr column in the given string.
- An expression that returns true if the column is NaN.
- Returns true if col is not null, or false otherwise.
- An expression that returns true if the column is null.
- Calls a method with reflection.
- Returns the number of elements in the outermost JSON array.
- Returns all the keys of the outermost JSON object as an array.
- Creates a new row for a json column according to the given field names.
- Returns the kurtosis of the values in a group.
- Returns the last value in a group.
- Returns the last day of the month which the given date belongs to.
- Returns the last value of col for a group of rows.
- Returns str with all characters changed to lowercase.
- Returns the least value of the list of column names, skipping null values.
- Returns the leftmost len (len can be string type) characters from the string str; if len is less than or equal to 0, the result is an empty string.
- Computes the character length of string data or number of bytes of binary data.
- Creates a Column of spark::expression::Literal value.
- Returns the natural logarithm of the argument.
- Returns the current timestamp without time zone at the start of query evaluation as a timestamp without time zone column.
- Returns the first argument-based logarithm of the second argument.
- Returns the base-2 logarithm of the argument.
- Computes the natural logarithm of the “given value plus one”.
- Computes the logarithm of the given value in Base 10.
- Converts a string expression to lower case.
- Left-pad the string column to width len with pad.
- Trim the spaces from left end for the specified string value.
- Returns a column with a date built from the year, month and day columns.
- Make DayTimeIntervalType duration from days, hours, mins and secs.
- Make interval from years, months, weeks, days, hours, mins and secs.
- Create timestamp from years, months, days, hours, mins, secs and timezone fields.
- Create the current timestamp with local time zone from years, months, days, hours, mins, secs and timezone fields.
- Create local date-time from years, months, days, hours, mins, secs fields.
- Make year-month interval from years, months.
- Returns the union of all the given maps.
- Returns true if the map contains the key.
- Returns an unordered array of all entries in the given map.
- Creates a new map from two arrays.
- Converts an array of entries (key value struct types) to a map of values.
- Returns an unordered array containing the keys of the map.
- Returns an unordered array containing the values of the map.
- Returns the maximum value of the expression in a group.
- Returns the value associated with the maximum value of ord.
- Calculates the MD5 digest and returns the value as a 32 character hex string.
- Returns the average of the values in a group.
- Returns the median of the values in a group.
- Returns the minimum value of the expression in a group.
- Returns the value associated with the minimum value of ord.
- Extract the minutes of a given timestamp as integer.
- Returns the most frequent value in a group.
- A column that generates monotonically increasing 64-bit integers.
- Extract the month of a given date/timestamp as integer.
- A transform for timestamps and dates to partition data into months.
- Returns number of months between dates date1 and date2.
- Creates a struct with the given field names and values.
- Returns col1 if it is not NaN, or col2 if col1 is NaN.
- Returns the negative value.
- Returns the negative value.
- Returns the first date which is later than the value of the date column based on second week day argument.
- Returns the current timestamp at the start of query evaluation.
- Returns the ntile group id (from 1 to n inclusive) in an ordered window partition.
- Returns null if col1 equals to col2, or col1 otherwise.
- Returns col2 if col1 is null, or col1 otherwise.
- Returns col2 if col1 is not null, or col3 otherwise.
- Calculates the byte length for the specified string column.
- Returns the relative rank
- Returns the exact percentile(s) of numeric column expr at the given percentage(s) with value range in [0.0, 1.0].
- Returns the approximate percentile of the numeric column col which is the smallest value in the ordered col values (sorted from least to greatest) such that no more than percentage of col values is less than the value or equal to that value.
- Returns Pi.
- Returns the positive value of dividend mod divisor.
- Returns a new row for each element with position in the given array or map.
- Returns a new row for each element with position in the given array or map.
- Returns the value of the first argument raised to the power of the second argument.
- Returns the value of the first argument raised to the power of the second argument.
- Returns the product of the values in a group.
- Extract the quarter of a given date/timestamp as integer.
- Converts an angle measured in degrees to an approximately equivalent angle measured in radians.
- Throws an exception with the provided error message.
- Generates a random column with independent and identically distributed (i.i.d.) samples uniformly distributed in [0.0, 1.0).
- Generates a column with independent and identically distributed (i.i.d.) samples from the standard normal distribution.
- Returns the rank of rows within a window partition.
- Calls a method with reflection.
- Returns true if str matches the Java regex regexp, or false otherwise.
- Returns a count of the number of times that the Java regex pattern regexp is matched in the string str.
- Returns true if str matches the Java regex regexp, or false otherwise.
- Returns the substring that matches the Java regex regexp within the string str.
- Returns the average of the independent variable for non-null pairs in a group, where y is the dependent variable and x is the independent variable.
- Returns the average of the dependent variable for non-null pairs in a group, where y is the dependent variable and x is the independent variable.
- Returns the number of non-null number pairs in a group, where y is the dependent variable and x is the independent variable.
- Returns the intercept of the univariate linear regression line for non-null pairs in a group, where y is the dependent variable and x is the independent variable.
- Returns the coefficient of determination for non-null pairs in a group, where y is the dependent variable and x is the independent variable.
- Returns the slope of the linear regression line for non-null pairs in a group, where y is the dependent variable and x is the independent variable.
- Returns REGR_COUNT(y, x) * VAR_POP(x) for non-null pairs in a group, where y is the dependent variable and x is the independent variable.
- Returns REGR_COUNT(y, x) * COVAR_POP(y, x) for non-null pairs in a group, where y is the dependent variable and x is the independent variable.
- Returns REGR_COUNT(y, x) * VAR_POP(y) for non-null pairs in a group, where y is the dependent variable and x is the independent variable.
- Repeats a string column n times, and returns it as a new string column.
- Returns a reversed string or an array with reverse order of elements.
- Returns the rightmost len (len can be string type) characters from the string str; if len is less than or equal to 0, the result is an empty string.
- Returns the double value that is closest in value to the argument and is equal to a mathematical integer.
- Returns true if str matches the Java regex regexp, or false otherwise.
- Round the given value to scale decimal places using HALF_UP rounding mode if scale >= 0 or at integral part when scale < 0.
- Returns a sequential number starting at 1 within a window partition.
- Right-pad the string column to width len with pad.
- Trim the spaces from right end for the specified string value.
- Computes secant of the input column.
- Extract the seconds of a given date as integer.
- Generate a sequence of integers from start to stop, incrementing by step.
- Returns a sha1 hash value as a hex string of the col.
- Returns the hex string result of SHA-1.
- Returns the hex string result of SHA-2 family of hash functions (SHA-224, SHA-256, SHA-384, and SHA-512).
- Shift the given value numBits left.
- (Signed) shift the given value numBits right.
- (Signed) shift the given value numBits right.
- Generates a random permutation of the given array.
- Computes the signum of the given value.
- Computes the signum of the given value.
- Computes sine of the input column.
- Computes hyperbolic sine of the input column.
- Returns the length of the array or map stored in the column.
- Returns the skewness of the values in a group.
- Returns an array containing all the elements in x from index start (array indices start at 1, or from the end if start is negative) with the specified length.
- Returns true if at least one value of col is true.
- Sorts the input array in ascending or descending order according to the natural ordering of the array elements.
- Returns the SoundEx encoding for a string.
- A column for partition ID.
- Splits str by delimiter and return requested part of the split (1-based).
- Computes the square root of the specified float value.
- Separates col1, …, colk into n rows.
- Returns a boolean.
- Alias for stddev_samp.
- Alias for stddev_samp.
- Returns population standard deviation of the expression in a group.
- Returns the unbiased sample standard deviation of the expression in a group.
- Creates a new struct column.
- Substring starts at pos and is of length len when str is String type or returns the slice of byte array that starts at pos in byte and is of length len when str is Binary type.
- Returns the substring from string str before count occurrences of the delimiter delim.
- Returns the sum of all values in the expression.
- Returns the sum of distinct values in the expression.
- Computes tangent of the input column.
- Computes hyperbolic tangent of the input column.
- Creates timestamp from the number of microseconds since UTC epoch.
- Creates timestamp from the number of milliseconds since UTC epoch.
- Converts the number of seconds from the Unix epoch (1970-01-01T00:00:00Z) to a timestamp.
- Convert col to a string based on the format.
- Parses the timestamp with the format to a timestamp without time zone.
- Convert string ‘col’ to a number based on the string format ‘format’.
- Converts a Column into pyspark.sql.types.TimestampType using the optionally specified format.
- Parses the timestamp with the format to a timestamp without time zone.
- Parses the timestamp with the format to a timestamp without time zone.
- Returns the UNIX timestamp of the given time.
- This is a common function for databases supporting TIMESTAMP WITHOUT TIMEZONE.
- Convert col to a string based on the format.
- Translates any character in srcCol to the corresponding character in matching.
- Trim the spaces from both ends for the specified string column.
- Returns date truncated to the unit specified by the format.
- Returns element of array at given (1-based) index.
- Returns str with all characters changed to uppercase.
- Decodes a BASE64 encoded string column and returns it as a binary column.
- Inverse of hex.
- Returns the number of days since 1970-01-01.
- Returns the number of microseconds since 1970-01-01 00:00:00 UTC.
- Returns the number of milliseconds since 1970-01-01 00:00:00 UTC.
- Returns the number of seconds since 1970-01-01 00:00:00 UTC.
- Convert time string with given pattern (‘yyyy-MM-dd HH:mm:ss’, by default) to Unix time stamp (in seconds), using the default timezone and the default locale, returns null if failed.
- Converts a string expression to upper case.
- Decodes a str in ‘application/x-www-form-urlencoded’ format using a specific encoding scheme.
- Translates a string into ‘application/x-www-form-urlencoded’ format using a specific encoding scheme.
- Returns the current database.
- Returns the population variance of the values in a group.
- Returns the unbiased sample variance of the values in a group.
- Alias for var_samp.
- Returns the Spark version.
- Returns the day of the week for date/timestamp (0 = Monday, 1 = Tuesday, …, 6 = Sunday).
- Extract the week number of a given date as integer.
- Returns the bucket number into which the value of this expression would fall after being evaluated.
- Computes the event time from a window column.
- Returns a string array of values within the nodes of xml that match the XPath expression.
- Returns true if the XPath expression evaluates to true, or if a matching node is found.
- Returns a double value, the value zero if no match is found, or NaN if a match is found but the value is non-numeric.
- Returns a float value, the value zero if no match is found, or NaN if a match is found but the value is non-numeric.
- Returns an integer value, or the value zero if no match is found, or a match is found but the value is non-numeric.
- Returns a long integer value, or the value zero if no match is found, or a match is found but the value is non-numeric.
- Returns a double value, the value zero if no match is found, or NaN if a match is found but the value is non-numeric.
- Returns a short integer value, or the value zero if no match is found, or a match is found but the value is non-numeric.
- Returns the text contents of the first xml node that matches the XPath expression.
- Calculates the hash code of given columns using the 64-bit variant of the xxHash algorithm, and returns the result as a long column.
- Extract the year of a given date/timestamp as integer.
- A transform for timestamps and dates to partition data into years.
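
Several descriptions above are easy to misread, in particular the HALF_EVEN versus HALF_UP rounding modes, the "positive value of dividend mod divisor", and null-safe equality. The sketch below illustrates those semantics in plain Rust. It is not this module's API: the helper names (`round_half_up`, `round_half_even`, `pmod`, `eq_null_safe`) are hypothetical, `Option` stands in for a nullable column value, and `f64` scaling is only approximate for values that are not exactly representable.

```rust
/// Applies `round_fn` at `scale` decimal places (scale >= 0) or at the
/// integral part (scale < 0). Hypothetical helper, for illustration only.
fn round_with(value: f64, scale: i32, round_fn: fn(f64) -> f64) -> f64 {
    let factor = 10f64.powi(scale.abs());
    if scale >= 0 {
        round_fn(value * factor) / factor
    } else {
        round_fn(value / factor) * factor
    }
}

/// Ties go to the nearest even integer (HALF_EVEN).
fn ties_even(x: f64) -> f64 {
    let floor = x.floor();
    let diff = x - floor;
    if diff > 0.5 {
        floor + 1.0
    } else if diff < 0.5 {
        floor
    } else if floor % 2.0 == 0.0 {
        floor
    } else {
        floor + 1.0
    }
}

/// Ties go away from zero (HALF_UP); `f64::round` already behaves this way.
fn ties_away(x: f64) -> f64 {
    x.round()
}

fn round_half_up(value: f64, scale: i32) -> f64 {
    round_with(value, scale, ties_away)
}

fn round_half_even(value: f64, scale: i32) -> f64 {
    round_with(value, scale, ties_even)
}

/// Non-negative remainder. Assumes a positive, non-zero divisor
/// (`rem_euclid` panics when the divisor is zero).
fn pmod(dividend: i64, divisor: i64) -> i64 {
    dividend.rem_euclid(divisor)
}

/// Null-safe equality: two nulls compare equal, and a null never equals
/// a non-null value.
fn eq_null_safe<T: PartialEq>(a: Option<T>, b: Option<T>) -> bool {
    match (a, b) {
        (None, None) => true,
        (Some(x), Some(y)) => x == y,
        _ => false,
    }
}
```

Under these assumptions, `round_half_up(2.5, 0)` gives `3.0` while `round_half_even(2.5, 0)` gives `2.0`, `pmod(-7, 3)` gives `2` (where Rust's `-7 % 3` is `-1`), and `eq_null_safe::<i32>(None, None)` is `true`.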