DataFusion Index Provider
A library that extends Apache DataFusion with index-based query acceleration for TableProvider implementations.
Overview
This crate implements a two-phase execution model for index-accelerated queries:
- Index Phase: Scan one or more secondary indexes to identify primary key values matching filter predicates
- Fetch Phase: Use primary key values to fetch complete records from underlying storage
This approach reduces I/O by limiting data retrieval to rows that satisfy query predicates, particularly effective for selective queries (typically < 10% of rows).
Installation
Add to your Cargo.toml:
[]
= "0.1.0"
Core Concepts
Index Trait
Implement the Index trait to define how your index scans for matching primary key values:
use Index;
RecordFetcher Trait
Implement RecordFetcher to define how complete records are retrieved using primary key values:
use RecordFetcher;
IndexedTableProvider Trait
Implement IndexedTableProvider on your TableProvider to expose available indexes:
use IndexedTableProvider;
Integration with TableProvider
Use the provided analysis and execution plan methods in your TableProvider::scan:
Query Optimization
Filter Analysis
The analyze_and_optimize_filters method recursively analyzes filter expressions to build an IndexFilter tree:
- Single predicates: Matched to indexes via
Index::supports_predicate() - AND expressions: All sub-expressions must be indexable; creates
IndexFilter::And - OR expressions: All sub-expressions must be indexable; creates
IndexFilter::Or - Nested combinations: Supports arbitrary nesting depth
All-or-nothing policy: If any part of an AND/OR expression cannot be indexed, the entire expression is treated as non-indexable.
Supported Query Patterns
-- Single index scan
SELECT * FROM employees WHERE age = 30;
-- Multi-index intersection (AND)
SELECT * FROM employees WHERE age > 25 AND department = 'Engineering';
-- Multi-index union (OR) with automatic deduplication
SELECT * FROM employees WHERE age < 25 OR department = 'Sales';
-- Complex nested expressions
SELECT * FROM employees
WHERE (age > 30 AND department = 'Engineering')
OR (age < 25 AND department = 'Sales');
Execution Plans
The library generates optimized physical plans based on filter structure:
Single Index
RecordFetchExec
└── IndexScanExec(age_index, [age = 30])
AND - Index Intersection
Uses Hash or SortMerge joins to intersect primary key values:
RecordFetchExec
└── Projection(PK columns)
└── HashJoin(on: PK columns)
├── IndexScanExec(age_index, [age > 25])
└── IndexScanExec(dept_index, [department = 'Engineering'])
Join Selection:
- SortMergeJoin: When both indexes return ordered primary keys (
Index::is_ordered()) - HashJoin: When inputs are unordered (builds hash table from left side)
OR - Union with Deduplication
Uses UnionExec + AggregateExec to deduplicate overlapping primary key values:
RecordFetchExec
└── AggregateExec(GROUP BY PK columns)
└── UnionExec
├── IndexScanExec(age_index, [age < 25])
└── IndexScanExec(dept_index, [department = 'Sales'])
This prevents duplicate record fetches when primary key values appear in multiple index results.
Reference Implementation
See tests/common/ for complete working examples:
Single-column primary key:
age_index.rs: BTreeMap-based index supporting range queries (Eq, Lt, Gt, LtEq, GtEq)department_index.rs: HashMap-based index for equality queriesemployee_provider.rs: Complete TableProvider with IndexedTableProvider integrationrecord_fetcher.rs: RecordFetcher implementation using in-memory data
Composite primary key:
composite_pk_age_index.rs: Index with composite PK (tenant_id, employee_id)composite_pk_department_index.rs: Department index with composite PKcomposite_pk_provider.rs: Multi-tenant TableProvider with composite PKcomposite_pk_fetcher.rs: RecordFetcher for composite PK tables
Integration tests in tests/integration_tests.rs and tests/composite_pk_tests.rs demonstrate query scenarios including deeply nested AND/OR combinations with both single and composite primary keys.
Limitations
- Projection pushdown: Not currently supported for indexed scans (fetches all columns)
- Remaining filters: Non-indexed filters applied after record fetch, not during index scan
- Partitioning: Index scans produce single partition output
Testing
Run the test suite:
# Run all tests
# Run with logging
RUST_LOG=debug
# Run integration tests only
The test suite includes:
- 27 unit tests covering execution plan generation and streaming
- 36 integration tests covering query scenarios from simple to deeply nested, with both single and composite primary keys
Compatibility
- DataFusion: 52.x
- Rust: 1.75+ (MSRV)
- Arrow: Compatible with DataFusion's Arrow version
Architecture Details
For in-depth technical documentation, see:
- Crate documentation - Generated rustdoc
src/lib.rs- Architecture overview and query capabilitiessrc/provider.rs- Filter analysis implementationsrc/physical_plan/exec/fetch.rs- Execution plan generation
License
Licensed under the Apache License, Version 2.0. See LICENSE for details.