datafusion 0.1.0

DataFusion is a datasource-agnostic distributed query planning and execution framework for Rust

Distributed Query Processing in Rust


This project is a proof-of-concept of a distributed data processing platform in Rust, with features somewhat similar to Apache Spark, but it is not intended to be a clone of Apache Spark.

Why am I building this?

Primarily, this is just a fun side-project for me to use to become a better Rust developer, since it involves solving some non-trivial problems. I'm also generally interested in researching distributed systems and query optimizers, since I've been working with these concepts professionally for quite a few years now.

Apart from using this as a way to learn, I do think that it could result in a useful product.

I have a hypothesis that even a naive implementation in Rust will have performance roughly comparable to that of Apache Spark for simple use cases but, more importantly, that the performance will be predictable and reliable because there is no garbage collector involved.

What will be similar to Apache Spark?

  • There will be a DataFrame API as well as SQL support; both approaches are ultimately just different ways of defining a query plan (see the sketch after this list)
  • A master node will create a distributed execution plan and coordinate its execution across the worker nodes
  • There will be a query optimizer, with some basic optimizations
  • HDFS will be supported
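
To make that first point concrete, here is a minimal sketch of the idea that SQL and the DataFrame API are just two ways of building the same logical plan. Everything here (Expr, LogicalPlan, DataFrame, col, lit) is made up for illustration and is not the project's actual API:

```rust
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    LiteralInt(i64),
    Gt(Box<Expr>, Box<Expr>),
}

#[derive(Debug, Clone, PartialEq)]
enum LogicalPlan {
    Scan { path: String },
    Filter { input: Box<LogicalPlan>, predicate: Expr },
    Projection { input: Box<LogicalPlan>, exprs: Vec<Expr> },
}

/// DataFrame-style builder: each method wraps the current plan in a new operator.
struct DataFrame {
    plan: LogicalPlan,
}

impl DataFrame {
    fn scan(path: &str) -> Self {
        DataFrame { plan: LogicalPlan::Scan { path: path.to_string() } }
    }
    fn filter(self, predicate: Expr) -> Self {
        DataFrame { plan: LogicalPlan::Filter { input: Box::new(self.plan), predicate } }
    }
    fn select(self, exprs: Vec<Expr>) -> Self {
        DataFrame { plan: LogicalPlan::Projection { input: Box::new(self.plan), exprs } }
    }
}

fn col(name: &str) -> Expr { Expr::Column(name.to_string()) }
fn lit(n: i64) -> Expr { Expr::LiteralInt(n) }

fn main() {
    // DataFrame API equivalent of: SELECT city FROM 'people.csv' WHERE age > 21
    let df = DataFrame::scan("people.csv")
        .filter(Expr::Gt(Box::new(col("age")), Box::new(lit(21))))
        .select(vec![col("city")]);

    // A SQL front end would parse the statement above and produce this same plan
    // before optimization and execution.
    println!("{:?}", df.plan);
}
```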

What will be different to Apache Spark?

Due to the statically compiled nature of Rust, this platform will be less interactive:

  • No support is planned for applying idiomatic Rust lambda functions to a DataFrame; instead, pre-registered UDFs will be used to perform transformations on DataFrames (sketched after this list)
  • UDTs and UDFs will need to be statically compiled into the worker nodes, at least initially (it would be possible to do some dynamic loading eventually)
  • No interactive REPL is planned, although a SQL console would be possible
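
As a rough sketch of what pre-registered, statically compiled UDFs could look like: the Value type, the Udf signature, and the registry below are assumptions for illustration, not the project's API. The key idea is that query plans refer to functions by name, and the compiled function bodies already live in the worker binary.

```rust
use std::collections::HashMap;

// UDFs operate on a shared value type rather than arbitrary Rust closures,
// so plans can be serialized and shipped to workers that already contain
// the compiled function bodies.
#[derive(Debug, Clone)]
enum Value {
    Int(i64),
    Str(String),
}

type Udf = fn(&[Value]) -> Value;

fn to_uppercase(args: &[Value]) -> Value {
    match &args[0] {
        Value::Str(s) => Value::Str(s.to_uppercase()),
        other => other.clone(),
    }
}

fn main() {
    // Registered at worker start-up; statically compiled into the binary.
    let mut registry: HashMap<&str, Udf> = HashMap::new();
    registry.insert("to_uppercase", to_uppercase);

    // A plan node referring to "to_uppercase" by name would be resolved
    // against this registry at execution time.
    let f = registry["to_uppercase"];
    println!("{:?}", f(&[Value::Str("hello".to_string())]));
}
```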

Current Status

There are two working examples. Both run a trivial query against a trivial CSV file using a single thread.
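
To give a sense of what these examples do, a single-threaded query over a CSV file boils down to something like the following sketch (the file name and schema are made up; this is not the project's actual example code):

```rust
use std::fs::File;
use std::io::{BufRead, BufReader};

fn main() -> std::io::Result<()> {
    // Assume a trivial file `people.csv` with rows like: name,age
    let reader = BufReader::new(File::open("people.csv")?);

    for line in reader.lines() {
        let line = line?;
        let fields: Vec<&str> = line.split(',').collect();
        // WHERE age > 21: parse the second column and filter on it.
        if let Ok(age) = fields[1].trim().parse::<i64>() {
            if age > 21 {
                // SELECT name
                println!("{}", fields[0]);
            }
        }
    }
    Ok(())
}
```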

Roadmap

Phase 1 - Benchmark simple use case against Apache Spark

I'd like to be able to run a job that reads a partitioned CSV file from HDFS and performs some computationally intensive processing on that data on a cluster, and then see how the performance compares to Apache Spark.

Features needed:

  • Partitioning
  • Shuffle (see the sketch after this list)
  • Scalar functions
  • HDFS support
  • Worker nodes (dockerized)
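
The shuffle is the heart of distributed execution: rows are hash-partitioned by key so that all rows with the same key land on the same worker for subsequent processing. Below is a minimal sketch of that partitioning step, with made-up row and key types:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Assign a row to one of `num_partitions` output partitions based on the
/// hash of its key, so equal keys always map to the same partition.
fn partition_for_key<K: Hash>(key: &K, num_partitions: u64) -> u64 {
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    hasher.finish() % num_partitions
}

fn main() {
    let rows = vec![("alice", 1), ("bob", 2), ("alice", 3)];
    let num_partitions = 4;

    // Bucket rows by key; in a real shuffle each bucket would be written
    // out and fetched by the worker that owns that partition.
    let mut partitions: Vec<Vec<(&str, i32)>> = vec![Vec::new(); num_partitions as usize];
    for row in rows {
        let p = partition_for_key(&row.0, num_partitions);
        partitions[p as usize].push(row);
    }

    for (i, p) in partitions.iter().enumerate() {
        println!("partition {}: {:?}", i, p);
    }
}
```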

Phase 2 - Usability and Stability

  • Deployment model (most likely Kubernetes, with etcd for service discovery)
  • UI
  • Documentation
  • Automated integration tests
  • Automated performance tests
  • Complete data type support
  • Make SQL / DataFrame functionality more complete

Phase 3 - Make it usable for real-world problems

  • ORDER BY
  • GROUP BY with basic aggregation functions (MIN, MAX, SUM, AVG)
  • JOIN
  • UDTs
  • UDFs
  • UDAFs
  • Query optimizer (starting with simple optimizations like predicate push-down and pushing predicates through joins; see the sketch after this list)
  • Integrations with other data sources
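
To illustrate the simplest of those optimizations, here is a sketch of predicate push-down through a projection: the filter is moved below the projection so rows are discarded earlier. The Plan type and the string predicate are stand-ins for illustration; a real rule would also verify that the predicate only references columns available below the projection.

```rust
#[derive(Debug, Clone)]
enum Plan {
    Scan { table: String },
    Projection { input: Box<Plan>, columns: Vec<String> },
    Filter { input: Box<Plan>, predicate: String },
}

/// Rewrite Filter(Projection(x)) as Projection(Filter(x)), recursing so the
/// predicate sinks as far down the plan as this simple rule allows.
fn push_down_predicate(plan: Plan) -> Plan {
    match plan {
        Plan::Filter { input, predicate } => match *input {
            Plan::Projection { input, columns } => Plan::Projection {
                input: Box::new(push_down_predicate(Plan::Filter { input, predicate })),
                columns,
            },
            other => Plan::Filter {
                input: Box::new(push_down_predicate(other)),
                predicate,
            },
        },
        Plan::Projection { input, columns } => Plan::Projection {
            input: Box::new(push_down_predicate(*input)),
            columns,
        },
        other => other,
    }
}

fn main() {
    // SELECT name, age FROM people ... with the filter initially on top.
    let plan = Plan::Filter {
        input: Box::new(Plan::Projection {
            input: Box::new(Plan::Scan { table: "people".into() }),
            columns: vec!["name".into(), "age".into()],
        }),
        predicate: "age > 21".into(),
    };
    // After the rewrite, the filter sits directly above the scan.
    println!("{:#?}", push_down_predicate(plan));
}
```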