Azof
Query tables in object storage as of event time.
Azof is a lakehouse format with time-travel capabilities that allows you to query data as it existed at any point in time, based on when events actually occurred rather than when they were recorded.
What Problem Does Azof Solve?
Traditional data lakehouse formats allow time travel based on when data was written (processing time). Azof instead focuses on event time - the time when events actually occurred in the real world. This distinction is crucial for:
- Late-arriving data: Process data that arrives out of order without rewriting history
- Consistent historical views: Get consistent snapshots of data as it existed at specific points in time
- High-cardinality datasets with frequent updates: Efficiently handle use cases involving business processes (sales, support, project management, financial data) or slowly changing dimensions
- Point-in-time analysis: Analyze the state of your data exactly as it was at any moment
Key Features
- Event time-based time travel: Query data based on when events occurred, not when they were recorded
- Efficient storage of updates: Preserves compacted snapshots of state to minimize storage and query overhead
- Hierarchical organization: Uses segments and delta files to efficiently organize temporal data
- Tunable compaction policy: Adjust based on your data distribution patterns
- SQL integration: Query using DataFusion with familiar SQL syntax
- Integration with object storage: Works with any object store (local, S3, etc.)
Project Structure
The Azof project is organized as a Rust workspace with multiple crates:
- azof: The core library providing the lakehouse format functionality
- azof-cli: A CLI utility demonstrating how to use the library
- azof-datafusion: DataFusion integration for SQL queries
Getting Started
To build all projects in the workspace:
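```bash
# From the repository root, build every crate in the workspace
# (azof, azof-cli, azof-datafusion).
cargo build --workspace
```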
Using the CLI
The azof-cli crate provides a command-line interface for interacting with azof tables. It supports, for example:
- Scanning a table (current version)
- Scanning a table as of a specific event time
- Generating a test Parquet file from CSV
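The invocations below are a sketch: the subcommand and flag names (scan, gen, --path, --table, --as-of) and the paths are assumptions for illustration, not the confirmed azof-cli interface.

```bash
# Scan a table (current version); flag names are assumed.
cargo run -p azof-cli -- scan --path /tmp/azof-example --table table0

# Scan a table as of a specific event time (assumed timestamp flag).
cargo run -p azof-cli -- scan --path /tmp/azof-example --table table0 --as-of "2024-03-15T00:00:00Z"

# Generate a test Parquet file from CSV (assumed subcommand).
cargo run -p azof-cli -- gen --path /tmp/azof-example --table table0
```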
DataFusion Integration
The azof-datafusion crate provides integration with Apache DataFusion, allowing you to:
- Register Azof tables in a DataFusion context
- Run SQL queries against Azof tables
- Perform time-travel queries using the AsOf functionality
Example
A minimal sketch of registering an Azof table with DataFusion and querying it is shown below. The AzofTableProvider name and its constructor are assumptions for illustration rather than the confirmed azof-datafusion API, and DataFusion's current SessionContext is used in place of the older ExecutionContext.
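```rust
// Sketch only: `AzofTableProvider` and `try_new` are assumed names, not the
// confirmed azof-datafusion exports; adjust to the actual provider type.
use std::sync::Arc;

use azof_datafusion::AzofTableProvider; // assumed export
use datafusion::error::Result;
use datafusion::prelude::SessionContext;

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();

    // Register an Azof table backed by a local store path (path and table
    // name are placeholders).
    let provider = AzofTableProvider::try_new("/tmp/azof-example", "table0")?;
    ctx.register_table("table0", Arc::new(provider))?;

    // Plain SQL over the current state of the table; event-time travel goes
    // through the AsOf functionality mentioned above (syntax not shown here).
    let df = ctx.sql("SELECT * FROM table0").await?;
    df.show().await?;

    Ok(())
}
```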
Run the example:
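Assuming the example ships with the azof-datafusion crate, the example name below is a placeholder:

```bash
cargo run -p azof-datafusion --example <example-name>
```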
If you install the CLI with cargo install --path crates/azof-cli, you can run it directly with:
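For example, using the same assumed subcommand and flags as the CLI sketch above:

```bash
azof-cli scan --path /tmp/azof-example --table table0
```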
Project Roadmap
Azof is under development. The goal is to implement a data lakehouse with the following capabilities:
- Atomic, non-concurrent writes (single writer)
- Consistent reads
- Schema evolution
- Event time travel queries
- Handling late-arriving data
- Integration with an execution engine
Milestone 0
- Script/tool for generating sample kv data set
- Key-value reader
- DataFusion table provider
Milestone 1
- Multiple columns support
- Support for more column data types
- Projection pushdown
- Projection pushdown in DataFusion table provider
- DataFusion table provider with AS OF operator
- Single row, key-value writer
- Document spec
- Delta -> snapshot compaction
- Metadata validity checks
Milestone 2
- Streaming in scan
- Schema definition and evolution
- Late-arriving data support