Crate datafusion

DataFusion is an extensible query execution framework that uses Apache Arrow as its in-memory format.

DataFusion provides both SQL and DataFrame APIs for building logical query plans, together with a query optimizer and an execution engine capable of parallel, multi-threaded execution against partitioned data sources (CSV and Parquet).

Below is an example of how to execute a query against data stored in a CSV file using a DataFrame:


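use datafusion::arrow::record_batch::RecordBatch;
use datafusion::arrow::util::pretty::pretty_format_batches;
use datafusion::error::Result;
use datafusion::prelude::*;

// Assumes the tokio runtime is available to drive the async calls.
#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();

    // create the dataframe
    let df = ctx.read_csv("tests/data/example.csv", CsvReadOptions::new()).await?;

    // create a plan: keep rows where a <= b, group by a, take MIN(b), skip 0 rows and fetch at most 100
    let df = df.filter(col("a").lt_eq(col("b")))?
        .aggregate(vec![col("a")], vec![min(col("b"))])?
        .limit(0, Some(100))?;

    // execute the plan
    let results: Vec<RecordBatch> = df.collect().await?;

    // format the results
    let pretty_results = pretty_format_batches(&results)?.to_string();

    let expected = vec![
        "+---+----------------+",
        "| a | MIN(?table?.b) |",
        "+---+----------------+",
        "| 1 | 2              |",
        "+---+----------------+"
    ];

    assert_eq!(pretty_results.trim().lines().collect::<Vec<_>>(), expected);
    Ok(())
}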

The same query can be executed against a CSV file using SQL:


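use datafusion::arrow::record_batch::RecordBatch;
use datafusion::arrow::util::pretty::pretty_format_batches;
use datafusion::error::Result;
use datafusion::prelude::*;

// Assumes the tokio runtime is available to drive the async calls.
#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();

    // register the CSV file as a table named "example"
    ctx.register_csv("example", "tests/data/example.csv", CsvReadOptions::new()).await?;

    // create a plan
    let df = ctx.sql("SELECT a, MIN(b) FROM example GROUP BY a LIMIT 100").await?;

    // execute the plan
    let results: Vec<RecordBatch> = df.collect().await?;

    // format the results
    let pretty_results = pretty_format_batches(&results)?.to_string();

    let expected = vec![
        "+---+----------------+",
        "| a | MIN(example.b) |",
        "+---+----------------+",
        "| 1 | 2              |",
        "+---+----------------+"
    ];

    assert_eq!(pretty_results.trim().lines().collect::<Vec<_>>(), expected);
    Ok(())
}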

Parse, Plan, Optimize, Execute

DataFusion is a fully fledged query engine capable of performing complex operations. When it receives an SQL query, the query passes through the following steps before a result is produced:

1. The string is parsed into an abstract syntax tree (AST) using sqlparser.
2. The planner SqlToRel converts logical expressions in the AST to logical expressions (Exprs).
3. The planner SqlToRel converts logical nodes in the AST to a LogicalPlan.
4. OptimizerRules are applied to the LogicalPlan to optimize it.
5. The LogicalPlan is converted to an ExecutionPlan by a PhysicalPlanner.
6. The ExecutionPlan is executed against data through the SessionContext.

With the DataFrame API, steps 1-3 are not used because the DataFrame builds the LogicalPlan directly.

Steps 1-5 are typically cheap compared to step 6, so DataFusion puts a lot of effort into ensuring that step 6 runs efficiently and without errors.
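
The order of these stages can be pictured with a deliberately tiny, standard-library-only model. Every name in it (Ast, Logical, Physical, parse, plan, optimize, to_physical, run) is hypothetical and is not part of DataFusion; the toy query language "<= MAX limit N" is also an assumption made for illustration. The sketch only mirrors the flow: parse text, build a logical plan, rewrite it, choose a concrete implementation, and only then touch data.

```rust
// Toy model only: none of these names are DataFusion APIs.

/// Step 1 output: the AST for the toy query language "<= MAX limit N".
#[derive(Debug, PartialEq)]
struct Ast {
    max: i64,
    limit: usize,
}

/// Steps 2-3 output: a logical plan that says *what* to compute.
#[derive(Debug, PartialEq)]
enum Logical {
    Empty,
    Scan,
    Filter { max: i64, input: Box<Logical> },
    Limit { n: usize, input: Box<Logical> },
}

/// Step 5 output: something that can actually run against rows.
type Physical = Box<dyn Fn(&[i64]) -> Vec<i64>>;

fn parse(sql: &str) -> Result<Ast, String> {
    let tokens: Vec<&str> = sql.split_whitespace().collect();
    match tokens.as_slice() {
        ["<=", max, "limit", n] => Ok(Ast {
            max: max.parse::<i64>().map_err(|e| format!("bad bound: {}", e))?,
            limit: n.parse::<usize>().map_err(|e| format!("bad limit: {}", e))?,
        }),
        _ => Err(format!("cannot parse: {}", sql)),
    }
}

fn plan(ast: &Ast) -> Logical {
    let filter = Logical::Filter { max: ast.max, input: Box::new(Logical::Scan) };
    Logical::Limit { n: ast.limit, input: Box::new(filter) }
}

/// Step 4: one rewrite rule. A limit of zero can never produce rows,
/// so the whole subtree is replaced and no data needs to be scanned.
fn optimize(plan: Logical) -> Logical {
    match plan {
        Logical::Limit { n: 0, .. } => Logical::Empty,
        other => other,
    }
}

fn to_physical(plan: &Logical) -> Physical {
    match plan {
        Logical::Empty => Box::new(|_: &[i64]| Vec::<i64>::new()),
        Logical::Scan => Box::new(|rows: &[i64]| rows.to_vec()),
        Logical::Filter { max, input } => {
            let (max, child) = (*max, to_physical(input));
            Box::new(move |rows: &[i64]| {
                child(rows).into_iter().filter(|v| *v <= max).collect::<Vec<i64>>()
            })
        }
        Logical::Limit { n, input } => {
            let (n, child) = (*n, to_physical(input));
            Box::new(move |rows: &[i64]| {
                child(rows).into_iter().take(n).collect::<Vec<i64>>()
            })
        }
    }
}

fn run(sql: &str, data: &[i64]) -> Result<Vec<i64>, String> {
    let ast = parse(sql)?;                  // step 1
    let logical = plan(&ast);               // steps 2-3
    let optimized = optimize(logical);      // step 4
    let physical = to_physical(&optimized); // step 5
    Ok(physical(data))                      // step 6
}
```

In this model only the last line of run looks at the rows, which is why the earlier stages stay cheap regardless of data size.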

DataFusion’s planning is divided into two main parts: logical planning and physical planning.

Logical plan

Logical planning yields logical plans and logical expressions. These are schema-aware types that represent statements whose result is independent of how they are physically executed.

A LogicalPlan is a Directed Acyclic Graph (DAG) of other LogicalPlans and each node contains logical expressions (Exprs). All of these are located in datafusion_expr.
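
As a rough illustration of what "schema-aware" and "independent of execution" mean, the sketch below models a plan as a tree (the simplest kind of DAG) whose nodes hold expressions. The Expr and LogicalPlan enums here are simplified stand-ins defined for this example; they borrow the real names but are not the types found in datafusion_expr, and their variants and methods (name, schema, describe) are assumptions.

```rust
// Simplified stand-ins, not the datafusion_expr types.

#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    Literal(i64),
    LtEq(Box<Expr>, Box<Expr>),
}

#[derive(Debug, Clone, PartialEq)]
enum LogicalPlan {
    TableScan { table: String, columns: Vec<String> },
    Filter { predicate: Expr, input: Box<LogicalPlan> },
    Projection { exprs: Vec<Expr>, input: Box<LogicalPlan> },
}

impl Expr {
    /// Display name of the expression, used as its output column name.
    fn name(&self) -> String {
        match self {
            Expr::Column(c) => c.clone(),
            Expr::Literal(v) => v.to_string(),
            Expr::LtEq(l, r) => format!("{} <= {}", l.name(), r.name()),
        }
    }
}

impl LogicalPlan {
    /// Output column names, derived purely from the plan: no data is read.
    fn schema(&self) -> Vec<String> {
        match self {
            LogicalPlan::TableScan { columns, .. } => columns.clone(),
            LogicalPlan::Filter { input, .. } => input.schema(),
            LogicalPlan::Projection { exprs, .. } => exprs.iter().map(Expr::name).collect(),
        }
    }

    /// One-line rendering of the plan, outermost node first.
    fn describe(&self) -> String {
        match self {
            LogicalPlan::TableScan { table, .. } => format!("TableScan: {}", table),
            LogicalPlan::Filter { predicate, input } => {
                format!("Filter: {} <- {}", predicate.name(), input.describe())
            }
            LogicalPlan::Projection { exprs, input } => {
                let names: Vec<String> = exprs.iter().map(Expr::name).collect();
                format!("Projection: {} <- {}", names.join(", "), input.describe())
            }
        }
    }
}

/// Builds the plan for "SELECT a FROM example WHERE a <= b" by hand.
fn example_plan() -> LogicalPlan {
    let scan = LogicalPlan::TableScan {
        table: "example".to_string(),
        columns: vec!["a".to_string(), "b".to_string()],
    };
    let predicate = Expr::LtEq(
        Box::new(Expr::Column("a".to_string())),
        Box::new(Expr::Column("b".to_string())),
    );
    let filter = LogicalPlan::Filter { predicate, input: Box::new(scan) };
    LogicalPlan::Projection {
        exprs: vec![Expr::Column("a".to_string())],
        input: Box::new(filter),
    }
}
```

Nothing in this model says how the filter is evaluated or how the table is loaded; those decisions belong to the physical plan.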

Physical plan

A physical plan (ExecutionPlan) is a plan that can be executed against data. In contrast to a logical plan, the physical plan has concrete information about how the calculation should be performed (e.g. what Rust functions are used) and how data should be loaded into memory.

ExecutionPlan uses the Arrow format as its in-memory representation of data, through the arrow crate. We recommend going through its documentation for details on how the data is physically represented.

An ExecutionPlan is composed of nodes (each implementing the ExecutionPlan trait), and each node is composed of physical expressions (PhysicalExpr) or aggregate expressions (AggregateExpr). All of these are located in the physical_plan module.

Broadly speaking,

an ExecutionPlan receives a partition number and asynchronously returns an iterator over RecordBatch (a node-specific struct that implements RecordBatchReader)
a PhysicalExpr receives a RecordBatch and returns an Array
an AggregateExpr receives RecordBatches and returns a RecordBatch of a single row(*)

(*) Technically, it aggregates the results on each partition and then merges the results into a single partition.
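
The three roles above, including the two-phase aggregation from the footnote, can be sketched with plain vectors and threads. This is an assumption-laden toy: Batch stands in for RecordBatch, ColumnExpr for PhysicalExpr, and partial_min_b and merge_min for the two halves of an aggregate. None of these names exist in DataFusion, and the real traits work on Arrow arrays and asynchronous streams rather than Vec.

```rust
use std::thread;

/// Hypothetical stand-in for a RecordBatch: two equal-length integer columns.
#[derive(Debug, Clone)]
struct Batch {
    a: Vec<i64>,
    b: Vec<i64>,
}

/// Hypothetical stand-in for PhysicalExpr: maps a batch to one output column.
trait ColumnExpr {
    fn evaluate(&self, batch: &Batch) -> Vec<bool>;
}

/// The predicate `a <= b`, evaluated for every row of the batch.
struct ALtEqB;

impl ColumnExpr for ALtEqB {
    fn evaluate(&self, batch: &Batch) -> Vec<bool> {
        batch.a.iter().zip(&batch.b).map(|(a, b)| a <= b).collect()
    }
}

/// Phase 1: compute a partial MIN(b) per partition, one thread per partition.
/// An empty partition contributes `None`.
fn partial_min_b(partitions: Vec<Batch>) -> Vec<Option<i64>> {
    let handles: Vec<thread::JoinHandle<Option<i64>>> = partitions
        .into_iter()
        .map(|batch| thread::spawn(move || batch.b.iter().copied().min()))
        .collect();
    handles
        .into_iter()
        .map(|h| h.join().expect("worker thread panicked"))
        .collect()
}

/// Phase 2: merge the per-partition results into a single value.
fn merge_min(partials: &[Option<i64>]) -> Option<i64> {
    partials.iter().flatten().copied().min()
}
```

Splitting the aggregate this way lets each partition be reduced independently, so only one small value per partition has to cross into the final single-partition step.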

The following physical nodes are currently implemented:

Projection: ProjectionExec
Filter: FilterExec
Grouped and non-grouped aggregations: AggregateExec
Hash Join: HashJoinExec
Cross Join: CrossJoinExec
Sort Merge Join: SortMergeJoinExec
Union: UnionExec
Sort: SortExec
Coalesce partitions: CoalescePartitionsExec
Limit: LocalLimitExec and GlobalLimitExec
Scan CSV: CsvExec
Scan Parquet: ParquetExec
Scan Avro: AvroExec
Scan newline-delimited JSON: NdJsonExec
Scan from memory: MemoryExec
Explain the plan: ExplainExec

Customize

DataFusion allows users to

extend the planner to use user-defined logical and physical nodes (QueryPlanner)
declare and use user-defined scalar functions (ScalarUDF)
declare and use user-defined aggregate functions (AggregateUDF)

You can find examples of each of them in the Examples section.
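
To give a feel for what declaring and using a user-defined scalar function involves, here is a minimal name-to-function registry built on the standard library. It is not the ScalarUDF API: ScalarFn, FunctionRegistry, register, call, and example_registry are all hypothetical names, and the real API works on Arrow arrays with declared input and return types.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Hypothetical scalar function shape: one input column in, one output column out.
type ScalarFn = Arc<dyn Fn(&[f64]) -> Vec<f64> + Send + Sync>;

/// Hypothetical registry mapping function names to implementations.
#[derive(Default)]
struct FunctionRegistry {
    scalars: HashMap<String, ScalarFn>,
}

impl FunctionRegistry {
    /// Declare a function under `name`, replacing any previous definition.
    fn register(&mut self, name: &str, f: ScalarFn) {
        self.scalars.insert(name.to_string(), f);
    }

    /// Look up `name` and apply it to a whole column at once.
    fn call(&self, name: &str, input: &[f64]) -> Result<Vec<f64>, String> {
        let f = self
            .scalars
            .get(name)
            .ok_or_else(|| format!("unknown function: {}", name))?;
        Ok(f(input))
    }
}

/// Builds a registry that knows a single user-defined function, `square`.
fn example_registry() -> FunctionRegistry {
    let mut registry = FunctionRegistry::default();
    registry.register(
        "square",
        Arc::new(|xs: &[f64]| -> Vec<f64> { xs.iter().map(|x| x * x).collect() }),
    );
    registry
}
```

The function receives a whole column rather than a single value, which matches the batch-at-a-time style described for physical expressions above.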

Examples

Examples are located in the datafusion-examples directory.

Here’s how to run them:

git clone https://github.com/apache/arrow-datafusion
cd arrow-datafusion
# Download test data
git submodule update --init

cargo run --example csv_sql

cargo run --example parquet_sql

cargo run --example dataframe

cargo run --example dataframe_in_memory

cargo run --example simple_udaf

cargo run --example simple_udf

Re-exports

pub use arrow;

pub use parquet;

pub use datafusion_common as common;

pub use datafusion_expr as logical_expr;

pub use datafusion_optimizer as optimizer;

pub use datafusion_physical_expr as physical_expr;

pub use datafusion_row as row;

pub use datafusion_sql as sql;

Modules

avro_to_arrow

This module contains utilities to manipulate Avro metadata.

catalog

This module contains interfaces and default implementations of table namespacing concepts, including catalogs and schemas.

config

DataFusion Configuration Options

dataframe

DataFrame API for building and executing query plans.

datasource

DataFusion data sources

error

DataFusion error types

execution

This module contains the shared state available at different parts of query planning and execution

from_slice

A trait to define from_slice functions for arrow types

physical_optimizer

This module contains a query optimizer that applies rules, such as “Repartition”, to a physical plan.

physical_plan

Traits for physical query plan, supporting parallel execution for partitioned relations.

prelude

A “prelude” for users of the datafusion crate.

scalar

ScalarValue re-exported from datafusion-common to ease migration after DataFusion was split into several different crates

test_util

Utility functions to make testing DataFusion based crates easier

variable

Variable provider

Macros

assert_batches_eq

Compares the pretty-formatted output of record batches with an expected vector of strings. This is a macro so that errors appear on the correct line

assert_batches_sorted_eq

Compares the pretty-formatted output of record batches with an expected vector of strings in a way that row order does not matter. This is a macro so that errors appear on the correct line
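
An order-insensitive comparison of pretty-printed tables can be sketched as follows. This is an illustration of the idea rather than the macro's actual implementation; sorted_table and tables_equal_ignoring_row_order are hypothetical helper names, and the sketch assumes the table layout shown in the examples above (top border, header, separator, data rows, bottom border).

```rust
/// Return the table lines with only the data rows sorted.
fn sorted_table(lines: &[&str]) -> Vec<String> {
    let mut out: Vec<String> = lines.iter().map(|l| l.to_string()).collect();
    let n = out.len();
    // Lines 0..3 are the top border, header, and separator;
    // the last line is the bottom border. Everything between is data.
    if n > 4 {
        out[3..n - 1].sort_unstable();
    }
    out
}

/// Compare two pretty-printed tables, ignoring the order of their data rows.
fn tables_equal_ignoring_row_order(expected: &[&str], actual: &[&str]) -> bool {
    sorted_table(expected) == sorted_table(actual)
}
```

Sorting both sides before comparing keeps the check deterministic when a query without ORDER BY returns rows in a partition-dependent order.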

Constants

DATAFUSION_VERSION

DataFusion crate version