Ballista: Distributed Scheduler for Apache DataFusion
Ballista is a distributed compute platform primarily implemented in Rust, and powered by Apache Arrow and DataFusion. It is built on an architecture that allows other programming languages (such as Python, C++, and Java) to be supported as first-class citizens without paying a penalty for serialization costs.

Ballista is a distributed query execution engine that enhances Apache DataFusion by enabling the parallelized execution of workloads across multiple nodes in a distributed environment.
Existing DataFusion application:
use *;
async
can be distributed with few lines changed:
use *;
use *;
async
Starting a cluster
A simple way to start a local cluster for testing purposes is to use cargo to install the scheduler and executor crates.
With these crates installed, it is now possible to start a scheduler process.
RUST_LOG=info
The scheduler will bind to port 50050 by default.
Next, start an executor processes in a new terminal session with the specified concurrency level.
RUST_LOG=info
The executor will bind to port 50051 by default. Additional executors can be started by manually specifying a bind port.
For full documentation, refer to the deployment section of the Ballista User Guide
Executing a Query
Ballista provides a custom SessionContext as a starting point for creating queries. DataFrames can be created by invoking the read_csv, read_parquet, and sql methods.
To build a simple ballista example, run the following command to add the dependencies to your Cargo.toml file:
use *;
use Result;
use ;
use ;
async
The output should look similar to the following table.
+-----------------+--------------------------+--------------------------+--------------------------+--------------------------+
| passenger_count | MIN(?table?.fare_amount) | MAX(?table?.fare_amount) | AVG(?table?.fare_amount) | SUM(?table?.fare_amount) |
+-----------------+--------------------------+--------------------------+--------------------------+--------------------------+
| | -159.5 | 285.2 | 17.60577640099004 | 1258865.829999991 |
| 0 | -115 | 500 | 11.794859107585335 | 614052.1600000001 |
| 1 | -480 | 401092.32 | 12.61028389876563 | 22623542.879999973 |
| 2 | -250 | 640.5 | 13.79501011585127 | 4732047.139999998 |
| 3 | -130 | 480 | 13.473184817311106 | 1139427.2400000002 |
| 4 | -250 | 464 | 14.232650547832726 | 502711.4499999997 |
+-----------------+--------------------------+--------------------------+--------------------------+--------------------------+
More examples can be found in the datafusion-ballista repository.
Performance
We run some simple benchmarks comparing Ballista with Apache Spark to track progress with performance optimizations.
These are benchmarks derived from TPC-H and not official TPC-H benchmarks. These results are from running individual queries at scale factor 100 (100 GB) on a single node with a single executor and 8 concurrent tasks.
Overall Speedup
The overall speedup is 2.9x

Per Query Comparison

Relative Speedup

Absolute Speedup
