Attribute Macro arrow_udf_macros::function

#[function]

Defining a function on Arrow arrays.

The following example demonstrates a simple usage:

#[function("add(int, int) -> int")]
fn add(x: i32, y: i32) -> i32 {
    x + y
}

§SQL Function Signature

Each function must have a signature, specified in the function("...") part of the macro invocation. The signature follows this pattern:

name ( [arg_types],* [...] ) [ -> [setof] return_type ]

Here, name is the function name.

arg_types is a comma-separated list of argument types. The allowed data types are the names and aliases listed in the type matrix in the appendix. Wildcards or auto can also be used. If the function is variadic, the last argument can be denoted by three dots (...).

When setof appears before the return type, this indicates that the function is a set-returning function (table function), meaning it can return multiple values instead of just one. For more details, see the section on table functions.

If no return type is specified, the function returns null.
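To make the grammar above concrete, here is a minimal sketch of how such a signature string could be split into its parts. It is not the macro's actual parser: the `Signature` struct and `parse_signature` function are hypothetical names introduced only for illustration, and the naive comma split would not handle argument types that themselves contain commas.

```rust
/// Parsed pieces of a signature string (hypothetical type, for illustration only).
#[derive(Debug, PartialEq)]
struct Signature {
    name: String,
    args: Vec<String>,
    variadic: bool,
    setof: bool,
    ret: Option<String>,
}

/// Split `name(args) [-> [setof] type]` into its parts.
/// Returns `None` when the parentheses are missing or the tail is malformed.
fn parse_signature(sig: &str) -> Option<Signature> {
    let open = sig.find('(')?;
    let close = sig.rfind(')')?;
    if close < open {
        return None;
    }
    let name = sig[..open].trim().to_string();

    // Naive split on commas; empty entries (e.g. from `f()`) are dropped.
    let mut args: Vec<String> = sig[open + 1..close]
        .split(',')
        .map(|a| a.trim().to_string())
        .filter(|a| !a.is_empty())
        .collect();

    // A trailing `...` marks the function as variadic.
    let variadic = args.last().map_or(false, |a| a.as_str() == "...");
    if variadic {
        args.pop();
    }

    // Anything after the closing parenthesis must be the `->` return clause.
    let rest = sig[close + 1..].trim();
    let (setof, ret) = if rest.is_empty() {
        (false, None)
    } else {
        let r = rest.strip_prefix("->")?.trim();
        match r.strip_prefix("setof ") {
            Some(t) => (true, Some(t.trim().to_string())),
            None => (false, Some(r.to_string())),
        }
    };

    Some(Signature { name, args, variadic, setof, ret })
}
```

Under these assumptions, `parse_signature("add(int, int) -> int")` yields the name `add`, two `int` arguments, and the return type `int`, while a `setof` prefix sets the `setof` flag and leaves only the element type in `ret`.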

§Multiple Function Definitions

Multiple #[function] macros can be applied to a single generic Rust function to define multiple SQL functions of different types. For example:

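use std::ops::Add;

#[function("add(int16, int16) -> int16")]
#[function("add(int32, int32) -> int32")]
#[function("add(int64, int64) -> int64")]
// `Output = T` is required so that `x + y` has the declared return type.
fn add<T: Add<Output = T>>(x: T, y: T) -> T {
    x + y
}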

§Rust Function Signature

The #[function] macro can handle various types of Rust functions. Each argument corresponds to the Rust type T in the type matrix. The return value type can be any type that implements AsRef<T>.

§Nullable Arguments

The functions above are called only when every argument is non-null. To handle null arguments yourself, declare the corresponding parameter as an Option:

#[function("add(int, int) -> int")]
fn add(x: Option<i32>, y: i32) -> i32 {...}
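The difference between the two argument styles can be sketched in plain Rust, without the macro. The wrappers `eval_strict` and `eval_nullable` below are hypothetical stand-ins for whatever per-row code the macro generates, and they assume that a null in a non-`Option` argument produces a null result without calling the function; the source only states that the function is not called in that case.

```rust
/// Strict version: only ever sees non-null values.
fn add(x: i32, y: i32) -> i32 {
    x + y
}

/// Null-aware version: receives `None` for a null first argument
/// and, in this example, treats it as 0.
fn add_nullable(x: Option<i32>, y: i32) -> i32 {
    x.unwrap_or(0) + y
}

/// Hypothetical wrapper for the strict function.
/// Assumption: a null in any argument yields a null row and skips the call.
fn eval_strict(xs: &[Option<i32>], ys: &[Option<i32>]) -> Vec<Option<i32>> {
    xs.iter()
        .zip(ys)
        .map(|(&x, &y)| match (x, y) {
            (Some(x), Some(y)) => Some(add(x, y)),
            _ => None,
        })
        .collect()
}

/// Hypothetical wrapper when the first parameter is `Option<i32>`:
/// only a null in the second (non-Option) argument skips the call.
fn eval_nullable(xs: &[Option<i32>], ys: &[Option<i32>]) -> Vec<Option<i32>> {
    xs.iter()
        .zip(ys)
        .map(|(&x, &y)| y.map(|y| add_nullable(x, y)))
        .collect()
}
```

With `xs = [Some(1), None, Some(3)]` and `ys = [Some(10), Some(20), None]`, the strict wrapper calls `add` only for the first row, while the null-aware wrapper also calls `add_nullable` for the second row.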

§Return Value

Similarly, the return value type can be one of the following:

  • T: Indicates that a non-null value is always returned, and errors will not occur.
  • Option<T>: Indicates that a null value may be returned, but errors will not occur.
  • Result<T>: Indicates that an error may occur, but a null value will not be returned.
  • Result<Option<T>>: Indicates that a null value may be returned, and an error may also occur.
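The four return shapes listed above can be illustrated with ordinary Rust functions. This is only a sketch: the function names are made up for the example, the `#[function]` attribute is omitted so the block compiles on its own, and `String` is used as the error type purely as an assumption, since the source writes `Result<T>` without naming the error type.

```rust
/// `T`: never null, never fails.
fn double(x: i32) -> i32 {
    x.wrapping_mul(2)
}

/// `Option<T>`: returns null for non-positive input, never fails.
fn positive(x: i32) -> Option<i32> {
    if x > 0 {
        Some(x)
    } else {
        None
    }
}

/// `Result<T>`: fails on overflow, never returns null.
fn checked_add(x: i32, y: i32) -> Result<i32, String> {
    x.checked_add(y).ok_or_else(|| "overflow".to_string())
}

/// `Result<Option<T>>`: null when the divisor is zero, error on overflow.
fn div(x: i32, y: i32) -> Result<Option<i32>, String> {
    if y == 0 {
        return Ok(None);
    }
    // After the zero check, `checked_div` returns `None` only on overflow.
    x.checked_div(y)
        .map(Some)
        .ok_or_else(|| "overflow".to_string())
}
```

Choosing the narrowest shape that fits keeps the function's contract explicit: a caller of `double` never has to consider nulls or errors, whereas a caller of `div` has to handle both.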

§Optimization

When all input and output types of the function are primitive types (int2, int4, int8, float4, float8, that is, int16, int32, int64, float32, float64 in the type matrix) and neither the arguments nor the return value use Option or Result, the #[function] macro automatically generates SIMD-vectorized execution code.

Therefore, try to avoid returning Option and Result whenever possible.
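As a rough illustration of why plain primitive signatures are easier to vectorize, compare the two batch loops below. This is not the code the macro generates; `add_batch` and `checked_add_batch` are hypothetical hand-written equivalents, and whether the first loop is actually vectorized depends on the compiler and target.

```rust
/// Primitive in, primitive out: a branch-free loop over two slices,
/// which the compiler is free to auto-vectorize.
fn add_batch(xs: &[i32], ys: &[i32]) -> Vec<i32> {
    xs.iter()
        .zip(ys)
        .map(|(x, y)| x.wrapping_add(*y))
        .collect()
}

/// With `Option` in the result, every row carries a branch and a wider
/// output element, which makes vectorization harder.
fn checked_add_batch(xs: &[i32], ys: &[i32]) -> Vec<Option<i32>> {
    xs.iter()
        .zip(ys)
        .map(|(x, y)| x.checked_add(*y))
        .collect()
}
```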

§Functions Returning Strings

For functions that return string types, you can also use the writer style function signature to avoid memory copying and dynamic memory allocation:

#[function("trim(string) -> string")]
fn trim(s: &str, writer: &mut impl Write) {
    writer.write_str(s.trim()).unwrap();
}

If errors may be returned, then the return value should be Result<()>:

#[function("trim(string) -> string")]
fn trim(s: &str, writer: &mut impl Write) -> Result<()> {
    writer.write_str(s.trim()).unwrap();
    Ok(())
}

If null values may be returned, then the return value should be Option<()>:

#[function("trim(string) -> string")]
fn trim(s: &str, writer: &mut impl Write) -> Option<()> {
    if s.is_empty() {
        None
    } else {
        writer.write_str(s.trim()).unwrap();
        Some(())
    }
}
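The writer style can be tried out with the standard library alone, since `String` implements `std::fmt::Write`. The sketch below omits the `#[function]` attribute and uses `std::fmt::Result` as the return type so that the write error is propagated with `?` instead of unwrapped; whether the macro accepts that exact return type is an assumption. The `trim_all` driver is hypothetical and only shows how several results can be appended to one shared buffer without a per-value allocation.

```rust
use std::fmt::Write;

/// Writer-style function: appends the trimmed input to `writer`
/// and forwards any write error to the caller.
fn trim(s: &str, writer: &mut impl Write) -> std::fmt::Result {
    writer.write_str(s.trim())?;
    Ok(())
}

/// Hypothetical driver: writes every result into one shared buffer and
/// records the end offset of each value.
fn trim_all(inputs: &[&str]) -> Result<(String, Vec<usize>), std::fmt::Error> {
    let mut buf = String::new();
    let mut ends = Vec::with_capacity(inputs.len());
    for s in inputs {
        trim(s, &mut buf)?;
        ends.push(buf.len());
    }
    Ok((buf, ends))
}
```

For the inputs `"  a "`, `"bc  "`, and `" "`, the shared buffer ends up as `abc` with end offsets 1, 3, and 3; the last value is an empty string rather than a null.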

§Table Function

A table function is a special kind of function that can return multiple values instead of just one. Its function signature must include the setof keyword, and the Rust function should return an iterator of the form impl Iterator<Item = T> or its derived types.

For example:

#[function("generate_series(int32, int32) -> setof int32")]
fn generate_series(start: i32, stop: i32) -> impl Iterator<Item = i32> {
    start..=stop
}

Likewise, Option and Result can appear inside the iterator (as its item type), outside it (wrapping the iterator), or both. For instance:

  • impl Iterator<Item = Result<T>>
  • Result<impl Iterator<Item = T>>
  • Result<impl Iterator<Item = Result<Option<T>>>>
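The placement of `Result` and `Option` decides when a failure or null is reported. The following sketch shows both positions in plain Rust, without the macro attribute. The three-argument `generate_series` with a `step` parameter and the `parse_all` function are hypothetical examples, and `String` as the error type is an assumption.

```rust
/// Outer `Result`: a zero step is rejected before any row is produced.
fn generate_series(
    start: i32,
    stop: i32,
    step: i32,
) -> Result<impl Iterator<Item = i32>, String> {
    if step == 0 {
        return Err("step must not be zero".to_string());
    }
    let mut next = Some(start);
    Ok(std::iter::from_fn(move || {
        let cur = next?;
        // Stop once the current value has moved past `stop`.
        if (step > 0 && cur > stop) || (step < 0 && cur < stop) {
            return None;
        }
        // `checked_add` ends the series instead of overflowing.
        next = cur.checked_add(step);
        Some(cur)
    }))
}

/// Inner `Result<Option<_>>`: each row is a value, a null, or an error.
fn parse_all<'a>(
    items: &'a [&'a str],
) -> impl Iterator<Item = Result<Option<i32>, String>> + 'a {
    items.iter().map(|s| {
        if s.is_empty() {
            Ok(None) // empty input becomes a null row
        } else {
            s.parse::<i32>().map(Some).map_err(|e| e.to_string())
        }
    })
}
```

An outer `Result` suits validation that applies to the whole call, while an inner `Result` lets rows that precede a failing row still be produced.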

§Registration and Invocation

Every function defined by #[function] is automatically registered in the global function registry.

You can look up a function by its name, argument types, and return type:

use arrow_udf::sig::REGISTRY;
use arrow_schema::DataType::Int32;

let sig = REGISTRY.get("add", &[Int32, Int32], &Int32).unwrap();

§Appendix: Type Matrix

§Base Types

| Arrow data type | Aliases | Rust type as argument | Rust type as return value |
| --- | --- | --- | --- |
| boolean | bool | bool | bool |
| int8 | | i8 | i8 |
| int16 | smallint | i16 | i16 |
| int32 | int | i32 | i32 |
| int64 | bigint | i64 | i64 |
| float32 | real | f32 | f32 |
| float64 | double precision | f64 | f64 |
| date32 | date | chrono::NaiveDate | chrono::NaiveDate |
| time64 | time | chrono::NaiveTime | chrono::NaiveTime |
| timestamp | | chrono::NaiveDateTime | chrono::NaiveDateTime |
| timestamptz | | not supported yet | not supported yet |
| interval | | arrow_udf::types::Interval | arrow_udf::types::Interval |
| string | varchar | &str | impl AsRef<str>, e.g. String, Box<str>, &str |
| binary | bytea | &[u8] | impl AsRef<[u8]>, e.g. Vec<u8>, Box<[u8]>, &[u8] |

§Extension Types

We also support the following extension types that are not part of the Arrow data types:

| Data type | Metadata | Rust type as argument | Rust type as return value |
| --- | --- | --- | --- |
| decimal | arrowudf.decimal | rust_decimal::Decimal | rust_decimal::Decimal |
| json | arrowudf.json | serde_json::Value | serde_json::Value |

§Array Types

| SQL type | Rust type as argument | Rust type as return value |
| --- | --- | --- |
| int8[] | &[i8] | impl Iterator<Item = i8> |
| int16[] | &[i16] | impl Iterator<Item = i16> |
| int32[] | &[i32] | impl Iterator<Item = i32> |
| int64[] | &[i64] | impl Iterator<Item = i64> |
| float32[] | &[f32] | impl Iterator<Item = f32> |
| float64[] | &[f64] | impl Iterator<Item = f64> |
| string[] | &StringArray | impl Iterator<Item = &str> |
| binary[] | &BinaryArray | impl Iterator<Item = &[u8]> |
| largestring[] | &LargeStringArray | impl Iterator<Item = &str> |
| largebinary[] | &LargeBinaryArray | impl Iterator<Item = &[u8]> |
| others[] | not supported yet | not supported yet |

§Composite Types

| SQL type | Rust type as argument | Rust type as return value |
| --- | --- | --- |
| struct<..> | UserDefinedStruct | UserDefinedStruct |