Crate datafusion_spark

Spark Expression packages for DataFusion.

This crate contains a collection of various Spark function packages for DataFusion, implemented using the extension API.

§Available Function Packages

See the list of modules in this crate for available packages.

§Example: using all function packages

You can register all the functions in all packages using the register_all function as shown below. If a function with the same name is already registered, it is overwritten; the Spark implementation takes priority.

use datafusion::prelude::SessionContext;

// Create a new session context
let mut ctx = SessionContext::new();
// Register all Spark functions with the context
datafusion_spark::register_all(&mut ctx)?;
// Run a query using the `sha2` function, which is now available with Spark semantics
let df = ctx.sql("SELECT sha2('The input String', 256)").await?;

§Example: calling a specific function in Rust

Each package also exports an expr_fn submodule with functions that create Exprs for invoking functions from Rust in a fluent style. For example, to invoke the sha2 function, you can use the following code:

use datafusion::prelude::{col, lit};
use datafusion_spark::expr_fn::sha2;

// Create the expression `sha2(my_data, 256)`
let expr = sha2(col("my_data"), lit(256));

§Example: using the Spark expression planner

The planner::SparkFunctionPlanner provides Spark-compatible expression planning, such as mapping SQL EXTRACT expressions to Spark’s date_part function. To use it, register it with your session context:

use std::sync::Arc;
use datafusion::prelude::SessionContext;
use datafusion_spark::planner::SparkFunctionPlanner;

let mut ctx = SessionContext::new();
// Register the Spark expression planner
ctx.register_expr_planner(Arc::new(SparkFunctionPlanner))?;
// Now EXTRACT expressions will use Spark semantics
let df = ctx.sql("SELECT EXTRACT(YEAR FROM timestamp_col) FROM my_table").await?;

§Example: enabling Apache Spark features with SessionStateBuilder

The recommended way to enable Apache Spark compatibility is to use the SessionStateBuilderSpark extension trait. This registers all Apache Spark functions (scalar, aggregate, window, and table) as well as the Apache Spark expression planner.

Enable the core feature in your Cargo.toml:

datafusion-spark = { version = "X", features = ["core"] }

Then use the extension trait; see SessionStateBuilderSpark::with_spark_features for an example.
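Putting this together, the builder-based setup might look like the following sketch. It assumes SessionStateBuilder is re-exported from datafusion's execution module and that with_spark_features chains like the builder's other methods; exact paths may vary between DataFusion versions.

```rust
use datafusion::execution::SessionStateBuilder;
use datafusion::prelude::SessionContext;
// Bring the extension trait into scope so `with_spark_features` is available
use datafusion_spark::SessionStateBuilderSpark;

// Build a session state with DataFusion's defaults plus the Spark
// functions (scalar, aggregate, window, table) and expression planner
let state = SessionStateBuilder::new()
    .with_default_features()
    .with_spark_features()
    .build();
// Queries on this context now resolve functions with Spark semantics
let ctx = SessionContext::new_with_state(state);
```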

Modules§

expr_fn
Fluent-style API for creating Exprs
function
planner

Traits§

SessionStateBuilderSpark (requires the core feature)
Extension trait for adding Apache Spark features to SessionStateBuilder.

Functions§

all_default_aggregate_functions
Returns all default aggregate functions
all_default_scalar_functions
Returns all default scalar functions
all_default_table_functions
Returns all default table functions
all_default_window_functions
Returns all default window functions
register_all
Registers all enabled packages with a FunctionRegistry, overriding any existing functions if there is a name clash.