Crate aisle

Metadata-driven Parquet pruning for Rust.

Aisle evaluates DataFusion predicates against Parquet metadata (row-group statistics, page indexes, bloom filters) to determine which data to skip before reading, dramatically reducing I/O for selective queries.

§Why Aisle?

The Problem: Parquet readers typically apply filters after reading data, wasting I/O on irrelevant row groups and pages.

The Solution: Aisle evaluates predicates against metadata before reading:

  • Row-group pruning using min/max statistics
  • Page-level pruning using column/offset indexes
  • Bloom filter checks for definite absence (high-cardinality columns)

The Result: 70-99% I/O reduction for selective queries without modifying the Parquet format.

§Quick Start

use std::sync::Arc;

use aisle::PruneRequest;
use arrow_schema::{DataType, Field, Schema};
use bytes::Bytes;
use datafusion_expr::{col, lit};
use parquet::{
    arrow::arrow_reader::ParquetRecordBatchReaderBuilder, file::metadata::ParquetMetaDataReader,
};

// 1. Load metadata (without reading data)
// `parquet_bytes` is a bytes::Bytes value holding the raw Parquet file,
// e.g. read from disk or fetched from object storage
let metadata = ParquetMetaDataReader::new().parse_and_finish(&parquet_bytes)?;

// 2. Define schema and filter predicate
let schema = Arc::new(Schema::new(vec![
    Field::new("user_id", DataType::Int64, false),
    Field::new("age", DataType::Int64, false),
]));

let predicate = col("user_id")
    .gt_eq(lit(1000i64))
    .and(col("age").lt(lit(30i64)));

// 3. Prune row groups using metadata
let result = PruneRequest::new(&metadata, &schema)
    .with_predicate(&predicate)
    .enable_page_index(false) // Row-group level only
    .enable_bloom_filter(false) // No bloom filters
    .prune();

println!(
    "Kept {} of {} row groups",
    result.row_groups().len(),
    metadata.num_row_groups()
);

// 4. Apply pruning to Parquet reader
let reader = ParquetRecordBatchReaderBuilder::try_new(parquet_bytes.clone())?
    .with_row_groups(result.row_groups().to_vec()) // Skip irrelevant row groups!
    .build()?;

// Read only the relevant data (70-99% I/O reduction!)
for batch in reader {
    let batch = batch?; // each item is a Result<RecordBatch>
    // Process matching rows...
}

§Key Features

  • Row-group pruning: Skip entire row groups using min/max statistics
  • Page-level pruning: Skip individual pages within row groups
  • Bloom filter support: Definite absence checks for point queries (=, IN)
  • DataFusion expressions: Use familiar col("x").eq(lit(42)) syntax
  • Conservative evaluation: Never skips data that might match (safety first)
  • Async-first API: Optimized for remote storage (S3, GCS, Azure)
  • Non-invasive: Works with upstream parquet crate, no format changes
  • Best-effort compilation: Uses supported predicates even if some fail

§Main API Entry Points

§Synchronous API

Use PruneRequest for the builder-style API:

let result = PruneRequest::new(metadata, schema)
    .with_predicate(&col("id").gt(lit(100i64)))
    .enable_page_index(true)
    .prune();

let kept_row_groups = result.row_groups();
let page_selection = result.row_selection();

§Async API with Bloom Filters

Use PruneRequest::prune_async() for async pruning with bloom filter support:

use aisle::PruneRequest;
use datafusion_expr::{col, lit};
use parquet::arrow::async_reader::ParquetRecordBatchStreamBuilder;
use tokio::fs::File;

let file = File::open("data.parquet").await?;
let mut builder = ParquetRecordBatchStreamBuilder::new(file).await?;

let predicate = col("user_id").eq(lit(12345i64));

let result = PruneRequest::new(builder.metadata(), builder.schema())
    .with_predicate(&predicate)
    .enable_bloom_filter(true)  // Check bloom filters
    .enable_page_index(true)
    .prune_async(&mut builder).await;

println!("Kept {} row groups", result.row_groups().len());

§Custom Bloom Filter Provider

Implement AsyncBloomFilterProvider for optimized bloom filter loading:

use aisle::AsyncBloomFilterProvider;
use parquet::bloom_filter::Sbbf;

struct CachedBloomProvider {
    // Your cache/storage implementation
}

impl AsyncBloomFilterProvider for CachedBloomProvider {
    async fn bloom_filter(&mut self, row_group: usize, column: usize) -> Option<Sbbf> {
        // Load from cache or fetch from storage; return None if unavailable
        todo!()
    }
}
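As a concrete illustration, a provider backed by an in-memory map could hand each cached filter out once per prune pass. The MemoizedBloomProvider type and its cache layout below are hypothetical, not part of the crate:

use std::collections::HashMap;

use aisle::AsyncBloomFilterProvider;
use parquet::bloom_filter::Sbbf;

// Hypothetical cache keyed by (row-group index, column index)
struct MemoizedBloomProvider {
    cache: HashMap<(usize, usize), Sbbf>,
}

impl AsyncBloomFilterProvider for MemoizedBloomProvider {
    async fn bloom_filter(&mut self, row_group: usize, column: usize) -> Option<Sbbf> {
        // Hand the filter out by value; a real provider would fall back to
        // fetching the filter bytes from storage on a cache miss
        self.cache.remove(&(row_group, column))
    }
}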

§Supported Predicates

Aisle supports a conservative subset of DataFusion expressions:

Type            Example                              Notes
Equality        col("x").eq(lit(42))
Inequality      col("x").not_eq(lit(42))
Comparisons     col("x").lt(lit(100))
Range           col("x").between(lit(10), lit(20))
Set membership  col("x").in_list(vec![...])
Null checks     col("x").is_null()
String prefix   col("name").like(lit("prefix%"))
Logical AND     a.and(b)                             best-effort
Logical OR      a.or(b)                              all-or-nothing
Logical NOT     a.not()                              exact only
Type casting    cast(col("x"), DataType::Int64)      no-op casts only

As noted above, bloom filter checks apply to the point-lookup forms (equality and set membership); the other forms are evaluated against min/max statistics.

Unsupported predicates are logged in CompileResult::errors() but don’t prevent pruning with the parts that did compile.
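Several supported forms compose into a single predicate. A small sketch with illustrative column names (note that datafusion_expr’s in_list also takes a negated flag):

use datafusion_expr::{col, lit};

// Range + set membership + NULL check, combined with AND
let predicate = col("age")
    .between(lit(18i64), lit(65i64))
    .and(col("country").in_list(vec![lit("US"), lit("CA")], false))
    .and(col("deleted_at").is_null());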

§Page-Level Pruning

Enable page indexes for finer-grained pruning within row groups:

let result = PruneRequest::new(metadata, schema)
    .with_predicate(&col("id").gt(lit(100i64)))
    .enable_page_index(true)  // Enable page-level pruning
    .prune();

// Apply both row-group and page-level selections
if let Some(row_selection) = result.row_selection() {
    // Use with ParquetRecordBatchReaderBuilder
    // reader.with_row_groups(...).with_row_selection(row_selection)
}
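Putting the two levels together, both selections feed the same reader builder. A sketch reusing parquet_bytes from the Quick Start; the .clone() covers the case where row_selection() hands out a reference:

use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;

let mut builder = ParquetRecordBatchReaderBuilder::try_new(parquet_bytes.clone())?
    .with_row_groups(result.row_groups().to_vec());

// Narrow the kept row groups further to the selected rows
if let Some(row_selection) = result.row_selection() {
    builder = builder.with_row_selection(row_selection.clone());
}

let reader = builder.build()?;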

§Error Handling

Aisle uses best-effort compilation. Unsupported predicates are logged but don’t block pruning:

let result = PruneRequest::new(metadata, schema)
    .with_predicate(&col("complex_expr").gt(lit(100i64)))
    .prune();

// Check compilation results
let compile_result = result.compile_result();
if compile_result.error_count() > 0 {
    eprintln!(
        "Warning: {} unsupported predicates",
        compile_result.error_count()
    );
    for error in compile_result.errors() {
        eprintln!("  - {}", error);
    }
}

// Still prune using supported predicates!
println!(
    "Successfully compiled {} predicates",
    compile_result.prunable_count()
);

§Performance

Aisle can dramatically reduce I/O for selective queries:

Query Type                       Selectivity    I/O Reduction
Point query (id = 12345)         0.001%         ~99.9%
Range query (date BETWEEN ...)   2%             ~98%
Multi-column filter              10%            ~90%

Performance Factors:

  • Row-group size (smaller row groups → finer-grained statistics)
  • Predicate selectivity (lower → more pruning)
  • Column cardinality (bloom filters help high-cardinality columns)
  • Page index availability (Parquet 1.12+)

Overhead: Metadata evaluation is typically <1ms per row group.

§When to Use Aisle

Good fit:

  • Selective queries (reading <20% of data)
  • Large Parquet files (>100MB, multiple row groups)
  • Remote storage (S3, GCS) where I/O is expensive
  • High-cardinality point queries

Not needed:

  • Full table scans (no pruning benefit)
  • Small files (<10MB, single row group)
  • Already using a query engine with built-in pruning (DataFusion, DuckDB)

§Examples

See the repository examples:

  • basic_usage.rs: Row-group pruning with metadata
  • async_usage.rs: Async API with bloom filters

Structs§

CompileResult
Result of compiling a DataFusion expression into pruning IR.
PruneOptions
Options for controlling metadata pruning behavior.
PruneOptionsBuilder
Builder for PruneOptions.
PruneRequest
Builder for one-shot metadata pruning operations.
PruneResult
Result of metadata pruning.
Pruner
Reusable pruning façade for a fixed schema.

Enums§

CompileError
Errors that can occur during predicate compilation.

Traits§

AsyncBloomFilterProvider
Async bloom filter provider trait for custom I/O strategies.

Functions§

roaring_to_row_selection
Convert a RoaringBitmap back to a RowSelection.
row_selection_to_roaring
Convert a RowSelection to a RoaringBitmap.