The Hudi-rs project aims to standardize the core Apache Hudi APIs and broaden Hudi integration across the data ecosystem for a diverse range of users and projects.
| Source | Installation Command |
|---|---|
| PyPI.org | `pip install hudi` |
| Crates.io | `cargo add hudi` |
## Usage Examples
> [!NOTE]
> These examples expect a Hudi table to exist at `/tmp/trips_table`, created using the quick start guide.
### Snapshot Query
Snapshot query reads the latest version of the data from the table. The table API also accepts partition filters.
**Python**
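A minimal sketch of a snapshot read, assuming the `HudiTableBuilder` API from the `hudi` package and the `/tmp/trips_table` table from the quick start guide; the filter and column names are illustrative:

```python
from hudi import HudiTableBuilder
import pyarrow as pa

hudi_table = HudiTableBuilder.from_base_uri("/tmp/trips_table").build()
# the partition filter shown is illustrative
records = hudi_table.read_snapshot(filters=[("city", "=", "san_francisco")])

# convert to PyArrow table
arrow_table = pa.Table.from_batches(records)
print(arrow_table.select(["rider", "city", "ts", "fare"]))
```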
**Rust**
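A Rust equivalent can be sketched as follows; the full module paths (`hudi::error::Result`, `hudi::table::builder::TableBuilder`, `arrow::compute::concat_batches`) are assumptions reconstructed from the imports above, and the filter is illustrative:

```rust
use hudi::error::Result;
use hudi::table::builder::TableBuilder as HudiTableBuilder;
use arrow::compute::concat_batches;

#[tokio::main]
async fn main() -> Result<()> {
    let hudi_table = HudiTableBuilder::from_base_uri("/tmp/trips_table")
        .build()
        .await?;
    // the partition filter shown is illustrative
    let batches = hudi_table
        .read_snapshot(&[("city", "=", "san_francisco")])
        .await?;
    // combine the returned record batches into one for inspection
    let batch = concat_batches(&batches[0].schema(), &batches)?;
    println!("{} rows", batch.num_rows());
    Ok(())
}
```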
To run a read-optimized (RO) query on Merge-on-Read (MOR) tables, set `hoodie.read.use.read_optimized.mode` when creating the table.
**Python**
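For example (a sketch; the builder methods follow the table-creation API above):

```python
hudi_table = (
    HudiTableBuilder
    .from_base_uri("/tmp/trips_table")
    .with_option("hoodie.read.use.read_optimized.mode", "true")
    .build()
)
```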
**Rust**

```rust
let hudi_table =
    HudiTableBuilder::from_base_uri("/tmp/trips_table")
        .with_option("hoodie.read.use.read_optimized.mode", "true")
        .build()
        .await?;
```
### Time-Travel Query
Time-travel query reads the data at a specific timestamp from the table. The table API also accepts partition filters.
**Python**
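A sketch of a time-travel read; the timestamp value and filter are illustrative:

```python
# the timestamp and filter shown are illustrative
records = hudi_table.read_snapshot_as_of(
    "20241231123456789", filters=[("city", "=", "san_francisco")]
)
```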
**Rust**

```rust
// the timestamp and filter shown are illustrative
let batches = hudi_table
    .read_snapshot_as_of("20241231123456789", &[("city", "=", "san_francisco")])
    .await?;
```
The supported formats for the timestamp argument are:

- Hudi Timeline format (highest matching precedence): `yyyyMMddHHmmssSSS` or `yyyyMMddHHmmss`.
- Unix epoch time in seconds, milliseconds, microseconds, or nanoseconds.
- ISO 8601 format, including but not limited to:
  - `yyyy-MM-dd'T'HH:mm:ss.SSS+00:00`
  - `yyyy-MM-dd'T'HH:mm:ss.SSSZ`
  - `yyyy-MM-dd'T'HH:mm:ss.SSS`
  - `yyyy-MM-dd'T'HH:mm:ss+00:00`
  - `yyyy-MM-dd'T'HH:mm:ssZ`
  - `yyyy-MM-dd'T'HH:mm:ss`
  - `yyyy-MM-dd`
### Incremental Query
Incremental query reads the changed data from the table for a given time range.
**Python**

```python
# t1 and t2 are timestamps in any supported format

# read the records between t1 (exclusive) and t2 (inclusive)
records = hudi_table.read_incremental_records(t1, t2)

# read the records after t1
records = hudi_table.read_incremental_records(t1)
```
**Rust**

```rust
// read the records between t1 (exclusive) and t2 (inclusive)
let batches = hudi_table.read_incremental_records(t1, Some(t2)).await?;

// read the records after t1 (the end timestamp is optional)
let batches = hudi_table.read_incremental_records(t1, None).await?;
```
Incremental queries support the same timestamp formats as time-travel queries.
### File Group Reading (Experimental)
File group reading allows you to read data from a specific file slice. This is useful when integrating with query engines, where the plan provides file paths.
**Python**
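A sketch, assuming a `HudiFileGroupReader` class in the `hudi` package; `base_file_path` is a hypothetical relative path supplied by the query plan:

```python
from hudi import HudiFileGroupReader

reader = HudiFileGroupReader("/tmp/trips_table")

# base_file_path: a relative base file path taken from the query plan
# Returns a PyArrow RecordBatch
batch = reader.read_file_slice_by_base_file_path(base_file_path)
```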
**Rust**

```rust
// module path assumed
use hudi::file_group::reader::FileGroupReader;

let reader = FileGroupReader::new_with_options(base_uri, options)?;

// Returns an Arrow RecordBatch
let record_batch = reader
    .read_file_slice_by_base_file_path(base_file_path)
    .await?;
```
**C++**

```cpp
// The reader construction was elided in the original; the API mirrors
// the Rust FileGroupReader.
auto reader = /* create a file group reader for the table */;

// Returns an ArrowArrayStream pointer
ArrowArrayStream* stream_ptr = reader->read_file_slice_by_base_file_path(base_file_path);
```
## Query Engine Integration
Hudi-rs provides APIs to support integration with query engines. The sections below highlight some commonly used APIs.
### Table API

Create a Hudi table instance using its constructor or the `TableBuilder` API.
| Stage | API | Description |
|---|---|---|
| Query planning | `get_file_slices()` | For snapshot query, get a list of file slices. |
| | `get_file_slices_splits()` | For snapshot query, get a list of file slices in splits. |
| | `get_file_slices_as_of()` | For time-travel query, get a list of file slices at a given time. |
| | `get_file_slices_splits_as_of()` | For time-travel query, get a list of file slices in splits at a given time. |
| | `get_file_slices_between()` | For incremental query, get a list of changed file slices between a time range. |
| Query execution | `create_file_group_reader_with_options()` | Create a file group reader instance with the table instance's configs. |
### File Group API

Create a Hudi file group reader instance using its constructor or the Hudi table API `create_file_group_reader_with_options()`.
| Stage | API | Description |
|---|---|---|
| Query execution | `read_file_slice()` | Read records from a given file slice; based on the configs, read records from only the base file, or from the base file and log files, and merge records based on the configured strategy. |
| | `read_file_slice_by_base_file_path()` | Read records from a given base file path; log files will be ignored. |
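Putting the two APIs together, an engine integration could be sketched as below; only the method names come from the tables above, while the argument shapes (`&[]` for no filters, passing a file slice reference to the reader) are assumptions:

```rust
// Plan with the table API, then execute with the file group reader.
let file_slices = hudi_table.get_file_slices(&[]).await?; // no filters (assumed shape)
let reader = hudi_table.create_file_group_reader_with_options()?;
for slice in &file_slices {
    // each call yields a RecordBatch to hand to the engine's operators
    let batch = reader.read_file_slice(slice).await?;
}
```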
### Apache DataFusion

Enabling the `hudi` crate with the `datafusion` feature provides a DataFusion extension to query Hudi tables.
```shell
cargo new my_project --bin && cd my_project
cargo add tokio@1 datafusion@45
cargo add hudi --features datafusion
```
Update `src/main.rs` with the code snippet below, then `cargo run`.
```rust
use std::sync::Arc;

use datafusion::error::Result;
use datafusion::prelude::{DataFrame, SessionContext};
use hudi::HudiDataSource;

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();
    let hudi = HudiDataSource::new_with_options(
        "/tmp/trips_table",
        [("hoodie.read.input.partitions", "5")],
    )
    .await?;
    ctx.register_table("trips_table", Arc::new(hudi))?;
    let df: DataFrame = ctx
        .sql("SELECT * FROM trips_table WHERE city = 'san_francisco'")
        .await?;
    df.show().await?;
    Ok(())
}
```
### Other Integrations
Hudi is also integrated with other engines and tools across the ecosystem.
## Work with cloud storage
Ensure cloud storage credentials are set properly as environment variables, e.g., `AWS_*`, `AZURE_*`, or `GOOGLE_*`. Relevant storage environment variables will then be picked up, and the target table's base URI with a scheme such as `s3://`, `az://`, or `gs://` will be processed accordingly.
Alternatively, you can pass the storage configuration as options via Table APIs.
**Python**
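For instance (a sketch; the bucket path and the `aws_region` option value are illustrative):

```python
from hudi import HudiTableBuilder

hudi_table = (
    HudiTableBuilder
    .from_base_uri("s3://bucket/trips_table")
    .with_option("aws_region", "us-west-2")
    .build()
)
```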
**Rust**
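The Rust version can be sketched as follows; the module paths are assumptions reconstructed from the import above, and the bucket path and option value are illustrative:

```rust
use hudi::error::Result;
use hudi::table::builder::TableBuilder as HudiTableBuilder;

#[tokio::main]
async fn main() -> Result<()> {
    // bucket path and option value are illustrative
    let hudi_table = HudiTableBuilder::from_base_uri("s3://bucket/trips_table")
        .with_option("aws_region", "us-west-2")
        .build()
        .await?;
    Ok(())
}
```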
## Contributing
Check out the contributing guide for all the details about making contributions to the project.