Crate datafusion_materialized_views

§datafusion-materialized-views

datafusion-materialized-views provides algorithms and core functionality for maintaining and querying materialized views in DataFusion.

§Key Features

  • Incremental View Maintenance: Efficiently tracks dependencies between Hive-partitioned tables and their materialized views, allowing users to determine which partitions need to be refreshed when source data changes. This is exposed through user-defined table functions (UDTFs) such as mv_dependencies and stale_files.
  • Query Rewriting: Implements a view matching optimizer that rewrites queries to automatically leverage materialized views when beneficial, based on the techniques described in the paper.
  • Pluggable Metadata Sources: Supports custom metadata sources for incremental view maintenance, with default support for object store metadata via the FileMetadata and RowMetadataRegistry components.
  • Extensible Table Abstractions: Defines traits such as ListingTableLike and Materialized to abstract over Hive-partitioned tables and materialized views, enabling custom implementations and easy registration for use in the maintenance and rewriting logic.
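
The idea behind partition-level staleness can be illustrated with a small, self-contained sketch. It does not use this crate's API: `FileMeta` and `stale_partitions` are hypothetical names, and the rule used here (a view partition is stale if it is missing or older than the newest source file in the same partition) is an assumed simplification of what stale_files reports.

```rust
use std::collections::BTreeMap;

/// Hypothetical file record (not a type from this crate): the Hive
/// partition a file belongs to and its last-modified time in seconds.
#[derive(Debug, Clone)]
struct FileMeta {
    partition: String,
    last_modified: u64,
}

/// Returns, in sorted order, the partitions whose materialized output is
/// missing or older than the newest source file in the same partition.
fn stale_partitions(sources: &[FileMeta], targets: &[FileMeta]) -> Vec<String> {
    // Newest source timestamp per partition.
    let mut newest_source: BTreeMap<&str, u64> = BTreeMap::new();
    for f in sources {
        let entry = newest_source.entry(f.partition.as_str()).or_insert(0);
        *entry = (*entry).max(f.last_modified);
    }

    // Oldest target timestamp per partition: a partition is only as
    // fresh as its oldest materialized file.
    let mut oldest_target: BTreeMap<&str, u64> = BTreeMap::new();
    for f in targets {
        oldest_target
            .entry(f.partition.as_str())
            .and_modify(|t| *t = (*t).min(f.last_modified))
            .or_insert(f.last_modified);
    }

    let mut stale = Vec::new();
    for (partition, src_time) in &newest_source {
        let is_stale = match oldest_target.get(partition) {
            Some(tgt_time) => tgt_time < src_time,
            None => true, // never materialized
        };
        if is_stale {
            stale.push((*partition).to_string());
        }
    }
    stale
}
```

Comparing timestamps per partition rather than per row mirrors the file-level granularity described under Limitations: a single changed source file marks its whole partition for recomputation.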

§Typical Workflow

  1. Define and Register Views: Implement a custom table type that implements the Materialized trait, and register it using register_materialized.
  2. Metadata Initialization: Set up FileMetadata and RowMetadataRegistry to track file-level and row-level metadata.
  3. Dependency Tracking: Use the mv_dependencies UDTF to generate build graphs for materialized views, and stale_files to identify partitions that require recomputation.
  4. Query Optimization: Enable the query rewriting optimizer to transparently rewrite queries to use materialized views where possible.
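
To make the "build graph" in step 3 concrete, the following sketch groups dependency edges into a map from each view target to the source files it is built from. It is a conceptual model only: `Dependency` and `build_graph` are hypothetical names, and the edge shape (one target path, one source path) is an assumption rather than the actual output schema of mv_dependencies.

```rust
use std::collections::BTreeMap;

/// Hypothetical dependency edge (not a type from this crate): one
/// materialized target depends on one source file.
struct Dependency {
    target: String,
    source: String,
}

/// Groups edges by target, producing a sorted, de-duplicated list of
/// source files for every target in the build graph.
fn build_graph(edges: &[Dependency]) -> BTreeMap<String, Vec<String>> {
    let mut graph: BTreeMap<String, Vec<String>> = BTreeMap::new();
    for edge in edges {
        graph
            .entry(edge.target.clone())
            .or_default()
            .push(edge.source.clone());
    }
    for sources in graph.values_mut() {
        sources.sort();
        sources.dedup();
    }
    graph
}
```

With such a graph in hand, refreshing a target amounts to recomputing it from exactly the sources listed under its key.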

§Example

See the README and integration tests for a full walkthrough of setting up and maintaining a materialized view, including dependency tracking and query rewriting.

§Limitations

  • Currently supports only Hive-partitioned tables in object storage, with the smallest update unit being a file.
  • Future work may generalize to other storage backends and partitioning schemes.
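
For readers unfamiliar with Hive-style partitioning, the sketch below shows how partition columns are encoded as `key=value` directory segments in an object path. The function `parse_hive_path` is a hypothetical helper written for illustration, not part of this crate, and it assumes the last path segment is always the file name.

```rust
/// Splits a Hive-style object path into its partition columns and file
/// name. Directory segments without `=` (such as a table prefix) are
/// skipped. Returns `None` if the path has no segments at all.
fn parse_hive_path(path: &str) -> Option<(Vec<(String, String)>, String)> {
    let mut segments: Vec<&str> = path.split('/').filter(|s| !s.is_empty()).collect();

    // Assumption: the final segment is the file, the smallest update unit.
    let file_name = segments.pop()?.to_string();

    let partitions = segments
        .iter()
        .filter_map(|seg| seg.split_once('='))
        .map(|(key, value)| (key.to_string(), value.to_string()))
        .collect();

    Some((partitions, file_name))
}
```

For a path such as `sales/year=2024/month=01/part-0.parquet`, this yields the partition columns `year` and `month` plus the file name, which is the granularity at which changes are detected.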

§References

Modules§

materialized
Code for incremental view maintenance against Hive-partitioned tables.
rewrite
An implementation of Query Rewriting, an optimization that rewrites queries to make use of materialized views.

Structs§

MaterializedConfig
Configuration options for materialized view related features.