DataFusion ORC Datasource
A DataFusion extension providing ORC (Optimized Row Columnar) file format support for Apache DataFusion.
Overview
datafusion-datasource-orc adds ORC file format support to Apache DataFusion, enabling efficient queries over ORC data through predicate pushdown, column projection, and async I/O. Some ORC types are still pending; see Type Support below.
Built on orc-rust, it implements DataFusion's file format traits (FileFormat, FileSource, FileOpener), so ORC files can be queried in much the same way as Parquet files are through DataFusion's built-in support.
Features
- Schema Inference - Automatically infer table schema from ORC files
- Statistics Extraction - Extract file-level statistics for query optimization
- Predicate Pushdown - Filter data at stripe level using ORC row indexes
- Column Projection - Push down column selection to read only needed data
- Async I/O - Non-blocking reads with support for S3, GCS, Azure, and local filesystems
- Multi-file Schema Merging - Seamlessly query across multiple ORC files
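To make the predicate pushdown feature more concrete, here is a minimal, self-contained sketch of the general idea: skipping whole stripes whose min/max statistics cannot satisfy a filter. The `StripeStats` and `Predicate` types and both functions are hypothetical names invented for illustration; they are not this crate's or orc-rust's API, and the real implementation may prune at a different granularity.

```rust
/// Min/max statistics for one integer column within one stripe.
/// Hypothetical type, for illustration only.
#[derive(Debug, Clone, Copy)]
struct StripeStats {
    min: i64,
    max: i64,
}

/// A simple comparison predicate on a single integer column.
/// Hypothetical type, for illustration only.
#[derive(Debug, Clone, Copy)]
enum Predicate {
    Eq(i64),
    Gt(i64),
    Lt(i64),
}

/// Returns true when the stripe might contain matching rows and so must be read.
/// A false result means the stripe can be skipped without decoding it.
fn stripe_may_match(stats: StripeStats, pred: Predicate) -> bool {
    match pred {
        Predicate::Eq(v) => stats.min <= v && v <= stats.max,
        Predicate::Gt(v) => stats.max > v,
        Predicate::Lt(v) => stats.min < v,
    }
}

/// Returns the indexes of the stripes that survive pruning.
fn prune_stripes(stripes: &[StripeStats], pred: Predicate) -> Vec<usize> {
    stripes
        .iter()
        .enumerate()
        .filter(|(_, s)| stripe_may_match(**s, pred))
        .map(|(i, _)| i)
        .collect()
}
```

With three stripes covering 0..=99, 100..=199, and 200..=299, a filter such as `Predicate::Gt(150)` keeps only the second and third stripes. Pruning is conservative: a kept stripe may still contain no matching rows, so the filter must still be applied to the rows that are read.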
Installation
Add to your Cargo.toml:
[dependencies]
datafusion-datasource-orc = "0.0.1"
datafusion = "51"
Quick Start
use *;
use ;
use OrcFormat;
use Arc;
async
Configuration
Configure ORC reading behavior via format options:
use ;
let read_options = default
.with_batch_size // Rows per batch
.with_pushdown_predicate // Enable predicate pushdown
.with_metadata_size_hint; // Metadata buffer hint
let format_options = OrcFormatOptions ;
let orc_factory = new_with_options;
let ctx = new;
ctx.register_file_format?;
Format Options
| Option | Type | Default | Description |
|---|---|---|---|
| orc.batch_size | usize | 1024 | Number of rows per RecordBatch |
| orc.pushdown_predicate | bool | true | Enable/disable predicate pushdown |
| orc.metadata_size_hint | usize | 32768 | Metadata allocation hint in bytes |
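As a rough illustration of how these options and their defaults fit together, the sketch below models them with a standalone builder. `ReadOptionsSketch` and its `set` method are hypothetical stand-ins written for this example, not the crate's actual types; only the option keys, the defaults, and the `with_*` method names are taken from this README.

```rust
/// Standalone model of the three documented options.
/// Hypothetical type; the crate's real options struct may differ.
#[derive(Debug, Clone, PartialEq)]
struct ReadOptionsSketch {
    batch_size: usize,
    pushdown_predicate: bool,
    metadata_size_hint: usize,
}

impl Default for ReadOptionsSketch {
    /// Defaults copied from the Format Options table.
    fn default() -> Self {
        Self {
            batch_size: 1024,
            pushdown_predicate: true,
            metadata_size_hint: 32768,
        }
    }
}

impl ReadOptionsSketch {
    fn with_batch_size(mut self, rows: usize) -> Self {
        self.batch_size = rows;
        self
    }

    fn with_pushdown_predicate(mut self, enabled: bool) -> Self {
        self.pushdown_predicate = enabled;
        self
    }

    fn with_metadata_size_hint(mut self, bytes: usize) -> Self {
        self.metadata_size_hint = bytes;
        self
    }

    /// Applies one string key/value pair, rejecting unknown keys and
    /// values that do not parse as the option's type.
    fn set(&mut self, key: &str, value: &str) -> Result<(), String> {
        match key {
            "orc.batch_size" => {
                self.batch_size = value.parse::<usize>().map_err(|e| e.to_string())?;
            }
            "orc.pushdown_predicate" => {
                self.pushdown_predicate = value.parse::<bool>().map_err(|e| e.to_string())?;
            }
            "orc.metadata_size_hint" => {
                self.metadata_size_hint = value.parse::<usize>().map_err(|e| e.to_string())?;
            }
            other => return Err(format!("unknown option: {other}")),
        }
        Ok(())
    }
}
```

For example, `ReadOptionsSketch::default().with_batch_size(8192)` changes only the batch size and leaves the other two options at their documented defaults.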
Type Support
| ORC Type | Arrow Type | Status |
|---|---|---|
| BOOLEAN | Boolean | ✅ |
| TINYINT | Int8 | ✅ |
| SMALLINT | Int16 | ✅ |
| INT | Int32 | ✅ |
| BIGINT | Int64 | ✅ |
| FLOAT | Float32 | ✅ |
| DOUBLE | Float64 | ✅ |
| STRING | String | ✅ |
| BINARY | Binary | ✅ |
| TIMESTAMP | Timestamp | ✅ |
| LIST | List | ✅ |
| MAP | Map | ✅ |
| STRUCT | Struct | ⏳ |
| DECIMAL | Decimal128 | ⏳ |
| DATE | Date32 | ⏳ |
| VARCHAR | String | ⏳ |
| CHAR | String | ⏳ |
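The table can also be read as a simple lookup. The sketch below encodes it as a function from an ORC type name to the Arrow type name and status listed above. `Status` and `arrow_type_for` are hypothetical names used only for this example; the crate itself performs the conversion through orc-rust rather than through a string table like this one.

```rust
/// Support status as shown in the Type Support table.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Status {
    Supported,
    Pending,
}

/// Looks up the Arrow type name and status for an ORC type name.
/// The match is case-insensitive; types not in the table return None.
fn arrow_type_for(orc_type: &str) -> Option<(&'static str, Status)> {
    let entry = match orc_type.to_ascii_uppercase().as_str() {
        "BOOLEAN" => ("Boolean", Status::Supported),
        "TINYINT" => ("Int8", Status::Supported),
        "SMALLINT" => ("Int16", Status::Supported),
        "INT" => ("Int32", Status::Supported),
        "BIGINT" => ("Int64", Status::Supported),
        "FLOAT" => ("Float32", Status::Supported),
        "DOUBLE" => ("Float64", Status::Supported),
        "STRING" => ("String", Status::Supported),
        "BINARY" => ("Binary", Status::Supported),
        "TIMESTAMP" => ("Timestamp", Status::Supported),
        "LIST" => ("List", Status::Supported),
        "MAP" => ("Map", Status::Supported),
        "STRUCT" => ("Struct", Status::Pending),
        "DECIMAL" => ("Decimal128", Status::Pending),
        "DATE" => ("Date32", Status::Pending),
        "VARCHAR" => ("String", Status::Pending),
        "CHAR" => ("String", Status::Pending),
        _ => return None,
    };
    Some(entry)
}
```

A caller could use a lookup like this to fail early with a clear message when a file's schema contains a type that is still marked pending.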
Architecture
SQL Query
↓
DataFusion Logical Plan
↓
DataFusion Physical Plan
↓
OrcFormat.create_physical_plan()
↓
DataSourceExec (using OrcSource)
↓
OrcOpener.open()
↓
orc-rust ArrowReader
↓
Arrow RecordBatch Stream
Core Components
- OrcFormat - Implements the FileFormat trait; provides schema inference and statistics
- OrcSource - Implements the FileSource trait; handles predicate pushdown
- OrcOpener - Implements the FileOpener trait; manages file opening and stream creation
- ObjectStoreChunkReader - Bridges DataFusion's object_store to orc-rust's reader
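The bridging role of a chunk reader, and the purpose of a metadata size hint, can be sketched without any dependencies. The example below assumes a store that serves byte ranges and fetches the end of a file in a single request, since ORC keeps its metadata at the tail and stores the postscript length in the final byte. `RangeStore`, `InMemoryStore`, `read_tail`, and `postscript_len` are hypothetical names for this sketch; they are not the object_store or orc-rust APIs.

```rust
/// Minimal stand-in for a store that serves byte ranges.
/// Hypothetical trait, for illustration only.
trait RangeStore {
    /// Total size of the object in bytes.
    fn size(&self) -> u64;
    /// Returns the bytes in the half-open range [start, end).
    fn get_range(&self, start: u64, end: u64) -> Vec<u8>;
}

/// In-memory implementation so the sketch runs without any I/O.
struct InMemoryStore(Vec<u8>);

impl RangeStore for InMemoryStore {
    fn size(&self) -> u64 {
        self.0.len() as u64
    }

    fn get_range(&self, start: u64, end: u64) -> Vec<u8> {
        self.0[start as usize..end as usize].to_vec()
    }
}

/// Fetches at most `size_hint` bytes from the end of the object in one request.
/// A larger hint makes it more likely that all metadata arrives in this read.
fn read_tail(store: &dyn RangeStore, size_hint: u64) -> Vec<u8> {
    let size = store.size();
    let start = size.saturating_sub(size_hint);
    store.get_range(start, size)
}

/// The final byte of an ORC file holds the length of the postscript.
/// Returns None for an empty tail.
fn postscript_len(tail: &[u8]) -> Option<u8> {
    tail.last().copied()
}
```

The hint is a trade-off: too small and the reader needs a second round trip to fetch the rest of the metadata, too large and each file open transfers bytes that are never used. That cost matters most on remote stores, where every request adds latency.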
Development
Building
Testing
# Run all tests
# Run specific test module
Benchmarks
# Run all benchmarks
# Run specific benchmark
TODO
- Additional ORC type support (STRUCT, DECIMAL, DATE, VARCHAR, CHAR)
- Enhanced error handling and edge case coverage
- Write support (query results to ORC format)
- Compression options configuration
- Performance optimizations
- Comprehensive integration tests
Related Projects
- Apache DataFusion - Query engine core
- orc-rust - ORC file format Rust implementation
- Apache Arrow - Columnar in-memory format
License
Licensed under the Apache License, Version 2.0. See LICENSE for details.
Acknowledgments
Built on top of the excellent orc-rust library and inspired by DataFusion's Parquet implementation.