# Aisle
<p align="left">
<a href="https://crates.io/crates/aisle/"><img src="https://img.shields.io/crates/v/aisle.svg"></a>
<a href="https://docs.rs/aisle"><img src="https://img.shields.io/docsrs/aisle"></a>
<a href="https://github.com/your-org/aisle/blob/main/LICENSE"><img src="https://img.shields.io/crates/l/aisle"></a>
</p>
**Metadata-driven Parquet pruning for Rust: Skip irrelevant data before reading**
Aisle evaluates DataFusion predicates against Parquet metadata (row-group statistics, page indexes, bloom filters) to determine which data to skip, dramatically reducing I/O for selective queries without modifying the upstream `parquet` crate.
📖 **[Read the full documentation on docs.rs](https://docs.rs/aisle)**
## Why Aisle?
**The Problem:** Parquet readers typically apply filters *after* reading data, wasting I/O on irrelevant row groups and pages.
**The Solution:** Aisle evaluates your predicates against metadata *before* reading:
- **Row-group pruning** using min/max statistics
- **Page-level pruning** using column/offset indexes
- **Bloom filter checks** for definite absence (high-cardinality columns)
**The Result:** 70-99% I/O reduction for selective queries with zero changes to the Parquet format.
## Quick Start
```rust
use aisle::PruneRequest;
use datafusion_expr::{col, lit};
use parquet::arrow::ParquetRecordBatchReaderBuilder;
use parquet::file::metadata::ParquetMetaDataReader;

// 1. Load metadata (without reading the data pages).
//    `parquet_bytes` is the file's contents (e.g. `bytes::Bytes`).
let metadata = ParquetMetaDataReader::new()
    .parse_and_finish(&parquet_bytes)?;

// 2. Define your filter using DataFusion expressions.
let predicate = col("user_id").gt_eq(lit(1000i64))
    .and(col("age").lt(lit(30i64)));

// 3. Prune row groups. `schema` is the file's Arrow schema
//    (e.g. via `parquet::arrow::parquet_to_arrow_schema`).
let result = PruneRequest::new(&metadata, &schema)
    .with_predicate(&predicate)
    .enable_page_index(false)   // Row-group level only
    .enable_bloom_filter(false) // No bloom filters
    .prune();

println!(
    "Pruned {} of {} row groups ({}% I/O reduction)",
    metadata.num_row_groups() - result.row_groups().len(),
    metadata.num_row_groups(),
    (metadata.num_row_groups() - result.row_groups().len()) * 100
        / metadata.num_row_groups()
);

// 4. Apply the pruning to the Parquet reader.
let reader = ParquetRecordBatchReaderBuilder::try_new(parquet_bytes)?
    .with_row_groups(result.row_groups().to_vec()) // Skip irrelevant row groups!
    .build()?;

// Read only the relevant data; each item is a `Result<RecordBatch>`.
for batch in reader {
    let batch = batch?;
    // Process matching rows...
}
```
Add to your `Cargo.toml`:
```toml
[dependencies]
aisle = "0.1"
datafusion-expr = "43"
parquet = "57"
arrow-schema = "57"
```
## Key Features
- **Row-group pruning**: Skip entire row groups using min/max statistics
- **Page-level pruning**: Skip individual pages within row groups
- **Bloom filter support**: Definite absence checks for point queries (`=`, `IN`)
- **DataFusion expressions**: Use familiar `col("x").eq(lit(42))` syntax
- **Conservative evaluation**: Never skips data that might match (safety first)
- **Async-first API**: Optimized for remote storage (S3, GCS, Azure)
- **Non-invasive**: Works with upstream `parquet` crate, no format changes
- **Best-effort compilation**: Uses supported predicates even if some fail
## How It Works
```
┌─────────────────────────────────────────────────────┐
│                     Your Query                      │
│         WHERE user_id >= 1000 AND age < 30          │
└─────────────────────┬───────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────────────┐
│                   Aisle Compiler                    │
│       Converts DataFusion Expr -> Pruning IR        │
│     (supports =, !=, <, >, <=, >=, BETWEEN, IN,     │
│      IS NULL, LIKE 'prefix%', AND, OR, NOT, CAST)   │
└─────────────────────┬───────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────────────┐
│                 Metadata Evaluation                 │
│  • Row-group statistics (min/max, null_count)       │
│  • Page indexes (page-level min/max)                │
│  • Bloom filters (definite absence checks)          │
│  • Tri-state logic (True/False/Unknown)             │
└─────────────────────┬───────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────────────┐
│                   Pruning Result                    │
│  row_groups: [2, 5, 7]    <- Only these needed!     │
│  row_selection: Some(...) <- Page-level selection   │
│  compile_result: Unsupported predicates logged      │
└─────────────────────┬───────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────────────┐
│                   Parquet Reader                    │
│  .with_row_groups([2, 5, 7])                        │
│  .with_row_selection(...)                           │
│  I/O reduced by 70%! ⚡                             │
└─────────────────────────────────────────────────────┘
```
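To make the tri-state logic concrete, here is a minimal, self-contained sketch of the idea (not Aisle's internal types): a predicate like `user_id >= 1000` is decided per row group from min/max statistics, and only a definite `False` allows skipping.

```rust
/// Outcome of evaluating a predicate against metadata alone.
#[derive(Debug, PartialEq)]
enum Ternary {
    True,    // every row matches
    False,   // no row can match -> safe to skip this row group
    Unknown, // some rows might match -> must read
}

/// Decide `col >= value` for one row group from its min/max statistics.
/// Missing statistics force `Unknown`, the conservative default.
fn eval_gt_eq(min: Option<i64>, max: Option<i64>, value: i64) -> Ternary {
    match (min, max) {
        (_, Some(max)) if max < value => Ternary::False,
        (Some(min), _) if min >= value => Ternary::True,
        _ => Ternary::Unknown,
    }
}

fn main() {
    assert_eq!(eval_gt_eq(Some(0), Some(999), 1000), Ternary::False);   // skip
    assert_eq!(eval_gt_eq(Some(2000), Some(5000), 1000), Ternary::True); // read
    assert_eq!(eval_gt_eq(None, None, 1000), Ternary::Unknown);          // no stats: read
}
```

Only `False` triggers a skip; `True` and `Unknown` both fall through to reading, which is what keeps evaluation safe when statistics are missing or truncated.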
## What's Supported
### Predicates
| Predicate | Example | Row Group | Page Level | Bloom Filter |
|---|---|---|---|---|
| **Equality** | `col("x").eq(lit(42))` | ✓ | ✓ | ✓ |
| **Inequality** | `col("x").not_eq(lit(42))` | ✓ | ✓ | ✗ |
| **Comparisons** | `col("x").lt(lit(100))` | ✓ | ✓ | ✗ |
| **Range** | `col("x").between(lit(10), lit(20))` | ✓ | ✓ | ✗ |
| **Set membership** | `col("x").in_list(vec![lit(1), lit(2)], false)` | ✓ | ✓ | ✓ |
| **Null checks** | `col("x").is_null()` | ✓ | ✓ | ✗ |
| **String prefix** | `col("name").like(lit("prefix%"))` | ✓ | ✓ | ✗ |
| **Logical AND** | `col("x").gt(lit(10)).and(col("y").lt(lit(5)))` | ✓ | ✓ (best-effort) | ✓ |
| **Logical OR** | `col("x").eq(lit(1)).or(col("x").eq(lit(2)))` | ✓ | ✓ (all-or-nothing) | ✓ |
| **Logical NOT** | `col("x").gt(lit(50)).not()` | ✓ | ✓ (exact only) | ✗ |
| **Type casting** | `cast(col("x"), DataType::Int64).eq(lit(100))` | ✓ (no-op casts only) | ✓ | ✓ |
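All of these examples use the `datafusion-expr` builder API; note that `Expr::in_list` takes a `negated` flag as its second argument, as reflected in the table above:

```rust
use datafusion_expr::{col, lit, Expr};

// x BETWEEN 10 AND 20, OR x IN (1, 2)
let range: Expr = col("x").between(lit(10i64), lit(20i64));
let membership: Expr = col("x").in_list(vec![lit(1i64), lit(2i64)], false);
let predicate: Expr = range.or(membership);
```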
### Data Types
Currently supported **leaf** types for statistics-based pruning:
- **Integers**: Int8/16/32/64, UInt8/16/32/64
- **Floats**: Float32/Float64
- **Boolean**
- **Strings**: Utf8, LargeUtf8, Utf8View
- **Binary**: Binary, LargeBinary, BinaryView, FixedSizeBinary
Not yet supported (treated conservatively as "unknown"):
- Temporal logical types (Date32/Date64, Timestamp)
- Decimals (Decimal128/Decimal256)
- Interval/Duration and other complex logical types
### Metadata Sources
| Source | Row-Group Level | Page Level | Best For |
|---|---|---|---|
| **Statistics** (min/max) | ✓ Always | ✓ Via page index | Range queries |
| **Null count** | ✓ Always | ✓ Via page index | `IS NULL` checks |
| **Bloom filters** | ✓ Optional | ✗ Not applicable | `=` and `IN` |
## Known Limitations
- **Type coverage is partial**: Only the leaf types listed above are supported for stats-based pruning; temporal/logical types and decimals are currently conservative.
- **Byte array ordering requires column metadata**: For ordering predicates (`<`, `>`, `<=`, `>=`) on Binary/Utf8 columns:
- **Default (conservative)**: Requires `TYPE_DEFINED_ORDER(UNSIGNED)` column order AND exact (non-truncated) min/max statistics
- **Opt-in (aggressive)**: Use `.allow_truncated_byte_array_ordering(true)` to allow truncated statistics, but be aware this may cause false negatives if truncation changes ordering semantics
- **Equality predicates** (`=`, `!=`, `IN`) always work regardless of truncation
- **No non-trivial column casts**: Only no-op column casts are allowed; literal casts happen at compile time.
- **Page-level NOT is conservative**: NOT is only inverted when the inner page selection is exact; otherwise it falls back to row-group evaluation.
- **OR requires full support**: If any OR branch is unsupported at page level, page pruning is disabled for the whole OR.
- **LIKE support is limited**: Only prefix patterns (`'prefix%'`) are pushed down; see the example below.
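A prefix pattern can be turned into a range check against min/max statistics, while a pattern with a leading wildcard cannot be bounded and is evaluated conservatively:

```rust
use datafusion_expr::{col, lit};

// Pushed down: rows matching "abc%" must sort within ["abc", "abd"),
// so min/max statistics can exclude row groups outside that range.
let pruned = col("name").like(lit("abc%"));

// Not pushed down: a leading wildcard gives no usable bound, so Aisle
// keeps every row group for this predicate (conservative fallback).
let not_pruned = col("name").like(lit("%abc"));
```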
## Usage Examples
**Page-level pruning:**
```rust
let result = PruneRequest::new(&metadata, &schema)
    .with_predicate(&predicate)
    .enable_page_index(true)
    .prune();
```
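The page-level result can then be handed to the reader along with the row-group list. A minimal sketch, assuming the result's `row_selection()` accessor (shown in the diagram above) yields the `RowSelection` that the `parquet` crate's reader builder expects:

```rust
use parquet::arrow::ParquetRecordBatchReaderBuilder;

let mut builder = ParquetRecordBatchReaderBuilder::try_new(parquet_bytes)?
    .with_row_groups(result.row_groups().to_vec());

// Apply the page-level selection when one was produced.
if let Some(selection) = result.row_selection() {
    builder = builder.with_row_selection(selection);
}

let reader = builder.build()?;
```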
**Async with bloom filters:**
```rust
// `builder` is an async Parquet reader builder that can fetch bloom
// filters on demand (e.g. `ParquetRecordBatchStreamBuilder`).
let metadata = builder.metadata().clone();
let schema = builder.schema().clone();

let result = PruneRequest::new(&metadata, &schema)
    .with_predicate(&predicate)
    .enable_bloom_filter(true)
    .prune_async(&mut builder).await;
```
**Custom bloom provider:**
```rust
impl AsyncBloomFilterProvider for MyProvider {
    async fn bloom_filter(&mut self, rg: usize, col: usize) -> Option<Sbbf> {
        // Your optimized loading logic (caching, batched fetches, ...);
        // return None when no filter is available for this column chunk.
        None
    }
}

let result = PruneRequest::new(&metadata, &schema)
    .with_predicate(&predicate)
    .prune_async(&mut my_provider).await;
```
**Byte array ordering (advanced):**
```rust
// Conservative (default): requires exact min/max for ordering predicates
let result = PruneRequest::new(&metadata, &schema)
    .with_predicate(&col("name").gt(lit("prefix")))
    .prune();

// Aggressive: allow truncated byte array statistics (may cause false negatives)
let result = PruneRequest::new(&metadata, &schema)
    .with_predicate(&col("name").gt(lit("prefix")))
    .allow_truncated_byte_array_ordering(true)
    .prune();
```
## Performance
Reasonable expectations (actual results depend on file layout and metadata quality):
| Query Pattern | Expected I/O Reduction | Notes |
|---|---|---|
| Point query (`id = 12345`) | High (often substantial) | Best with bloom filters + accurate stats |
| Range query (`date BETWEEN ...`) | Moderate to high | Depends on row-group size and data distribution |
| Multi-column filter | Moderate | AND helps; OR can reduce page-level pruning |
| High-cardinality IN (`sku IN (...)`) | Moderate to high | Bloom filters help when present |
These are **guidance only** until benchmarks land.
**Performance Factors:**
- **Row group size**: Smaller row groups → finer-grained statistics and more skipping opportunities (at the cost of more metadata)
- **Predicate selectivity**: More selective predicates (fewer matching rows) → more pruning opportunities
- **Column cardinality**: Bloom filters shine for high-cardinality columns
- **Page index availability**: Enables page-level pruning (Parquet 1.12+)
**Overhead:** Metadata evaluation is typically small relative to I/O, but varies with predicate complexity and metadata availability.
## Examples
Run the included examples to see end-to-end usage:
- **`basic_usage`**: Row-group pruning with metadata and predicates
- **`bloom_filter`**: Async API with bloom filter support
```bash
# Row-group pruning example
cargo run --example basic_usage
# Async + bloom filters
cargo run --example bloom_filter
```
## When to Use Aisle
**Good fit:**
- Selective queries (reading <20% of data)
- Large Parquet files (>100MB, multiple row groups)
- Remote storage (S3, GCS) where I/O is expensive
- High-cardinality point queries (user IDs, transaction IDs)
- Time-series with range queries
**Not needed:**
- Full table scans (no pruning benefit)
- Small files (<10MB, single row group)
- Already using a query engine with built-in pruning (DataFusion, DuckDB)
**Tips:** Combine Aisle with proper Parquet writer configuration (a sketch follows this list):
- Sort data by frequently-filtered columns
- Use reasonable row group sizes (64-256MB)
- Enable bloom filters for high-cardinality columns
- Write page indexes (Parquet 1.12+)
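A minimal sketch of such writer settings using the `parquet` crate's `WriterProperties` (the column name `user_id` and the row count are illustrative; note the Rust writer bounds row groups by row count, not bytes):

```rust
use parquet::file::properties::{EnabledStatistics, WriterProperties};
use parquet::schema::types::ColumnPath;

let props = WriterProperties::builder()
    // Page-level statistics populate the column index used for page pruning.
    .set_statistics_enabled(EnabledStatistics::Page)
    // The Rust writer limits row groups by row count; pick a value that
    // yields your target on-disk row-group size.
    .set_max_row_group_size(1024 * 1024)
    // Bloom filter for the high-cardinality point-lookup column.
    .set_column_bloom_filter_enabled(ColumnPath::from("user_id"), true)
    .build();
```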
## Architecture
See detailed documentation:
- **[Architecture](docs/architecture.md)**: Internal design and IR compilation
- **[CLAUDE.md](CLAUDE.md)**: Project overview and design philosophy
- **[Development Plan](docs/development_plan.md)**: Implementation roadmap
## Testing
Aisle has comprehensive test coverage (111 tests):
```bash
# Run all tests
cargo test
# Run specific test suites
cargo test --test best_effort_pruning # NOT pushdown edge cases
cargo test --test null_count_edge_cases # Null handling
cargo test --test async_bloom # Bloom filter integration
```
## License
Licensed under the MIT License. See [LICENSE](LICENSE) for details.
---
**Built with:** [arrow-rs](https://github.com/apache/arrow-rs) • [DataFusion](https://github.com/apache/datafusion) • [Parquet](https://parquet.apache.org/)