forge-filter 0.1.0

GPU filter+compact for Apple Silicon — 10x+ over Polars on numeric WHERE clauses

forge-filter

GPU filter+compact for Apple Silicon, built on Metal compute shaders. On numeric WHERE clauses it measures 6.8x (ordered) to 10.1x (unordered) faster than Polars at 16M u32 elements; see Benchmarks.

use forge_filter::{GpuFilter, Predicate};

let mut filter = GpuFilter::new()?;
let data: Vec<u32> = (0..16_000_000).collect();
let result = filter.filter_u32(&data, &Predicate::Gt(8_000_000))?;

Benchmarks

Measured on Apple M4 Pro (20-core GPU, 48GB unified memory). Polars baseline: 5.8ms @ 16M u32.

filter_u32 @ 16M elements

Mode        Time (50% selectivity)   Speedup vs Polars
Ordered     848 us                   6.8x
Unordered   574 us                   10.1x

Selectivity sweep (ordered, 16M u32)

Selectivity   Time     Throughput (Mrows/s)
1%            695 us   23,022
10%           735 us   21,769
50%           848 us   18,868
90%           935 us   17,112
99%           960 us   16,667
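The throughput and speedup figures follow directly from the timings, and a short sketch makes the arithmetic explicit. The helper names below are hypothetical and are not part of forge-filter; the inputs are the numbers reported in this section.

```rust
/// Throughput in millions of rows per second.
/// Rows divided by microseconds is numerically equal to Mrows/s.
fn mrows_per_sec(rows: u64, micros: f64) -> f64 {
    rows as f64 / micros
}

/// Speedup of a run that took `micros` microseconds against a
/// baseline measured in milliseconds (5.8 ms for Polars here).
fn speedup_vs_baseline(baseline_ms: f64, micros: f64) -> f64 {
    baseline_ms * 1000.0 / micros
}
```

For example, `mrows_per_sec(16_000_000, 848.0)` rounds to 18,868, and `speedup_vs_baseline(5.8, 574.0)` is roughly 10.1.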

Features

  • 6 numeric types: u32, i32, f32, u64, i64, f64
  • 7 predicates: >, <, >=, <=, ==, !=, BETWEEN
  • Compound predicates: AND/OR with automatic BETWEEN optimization
  • Index output: get matching row indices for multi-column gather
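  • Unordered mode: about 1.5x the throughput of ordered mode (574 us vs 848 us at 16M u32, 50% selectivity) via atomic scatter; output order is arbitrary, which suits aggregation queries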
  • Zero-copy: FilterBuffer<T> API for GPU-resident data pipelines
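To illustrate what a BETWEEN optimization on compound predicates could look like, here is a minimal CPU-side sketch. The `Pred` enum and the `fold_between` and `eval` functions are hypothetical and do not reflect forge-filter's actual types or internals. The sketch also assumes BETWEEN is inclusive on both ends, as in SQL.

```rust
// Hypothetical predicate type for illustration only.
#[derive(Debug, Clone, PartialEq)]
enum Pred {
    Ge(u32),
    Le(u32),
    Between(u32, u32), // assumed inclusive on both ends
    And(Box<Pred>, Box<Pred>),
}

/// Rewrite `x >= lo AND x <= hi` (in either operand order) into a
/// single range test; leave every other predicate unchanged.
fn fold_between(p: Pred) -> Pred {
    match p {
        Pred::And(a, b) => match (*a, *b) {
            (Pred::Ge(lo), Pred::Le(hi)) | (Pred::Le(hi), Pred::Ge(lo)) => {
                Pred::Between(lo, hi)
            }
            (a, b) => Pred::And(Box::new(a), Box::new(b)),
        },
        other => other,
    }
}

/// Evaluate a predicate against one value.
fn eval(p: &Pred, x: u32) -> bool {
    match p {
        Pred::Ge(v) => x >= *v,
        Pred::Le(v) => x <= *v,
        Pred::Between(lo, hi) => *lo <= x && x <= *hi,
        Pred::And(a, b) => eval(a, x) && eval(b, x),
    }
}
```

The folded form selects exactly the same rows as the compound form, but needs one predicate evaluation per element instead of two.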

Requirements

  • macOS with Apple Silicon (M1 or later)
  • Metal 3.2 support
  • Rust 1.70+
  • Xcode Command Line Tools (for xcrun metal shader compiler)

Usage

[dependencies]
forge-filter = "0.1"

Simple (slice in, Vec out)

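use forge_filter::{GpuFilter, Predicate};

// `data` holds the u32 values to filter; `threshold` is a u32.
let mut filter = GpuFilter::new()?;
// Keep every element strictly greater than `threshold`.
let result = filter.filter_u32(&data, &Predicate::Gt(threshold))?;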

Zero-copy (FilterBuffer)

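let mut filter = GpuFilter::new()?;
// Allocate a buffer for 16M u32 values and fill it from host data.
let mut buf = filter.alloc_filter_buffer::<u32>(16_000_000);
buf.copy_from_slice(&data);
// Filter the buffer directly; `lo` and `hi` are the u32 range bounds.
let result = filter.filter(&buf, &Predicate::Between(lo, hi))?;
// Read the matching elements through a slice view of the result.
let filtered = result.as_slice();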

Index output

let indices_result = filter.filter_indices(&buf, &Predicate::Lt(100))?;
let indices: &[u32] = indices_result.indices().unwrap();
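Once the matching row indices are available, the same indices can be applied to any other column of the table. The sketch below shows that gather step on the CPU. The `gather` helper is hypothetical and not part of forge-filter, and it assumes every index is within bounds for the column.

```rust
/// Collect `column[i]` for each row index `i`, preserving index order.
/// Panics if an index is out of bounds for `column`.
fn gather<T: Copy>(column: &[T], indices: &[u32]) -> Vec<T> {
    indices.iter().map(|&i| column[i as usize]).collect()
}
```

Filtering on one column and gathering the others this way keeps rows aligned across columns: with indices `[1, 3]`, a price column `[10.0, 20.0, 30.0, 40.0]` yields `[20.0, 40.0]` and a quantity column `[1, 2, 3, 4]` yields `[2, 4]`.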

Unordered (faster for aggregation)

let result = filter.filter_unordered(&buf, &Predicate::Gt(0))?;
// Same elements as the ordered result, but in arbitrary order; about 1.5x the throughput

Algorithm

Fused 3-dispatch pipeline within a single Metal command encoder:

  1. Predicate + Scan: evaluate the predicate per element, compute a SIMD prefix sum, and write per-threadgroup (TG) totals
  2. Scan Partials: exclusive prefix sum of the TG totals (hierarchical for >16M elements)
  3. Scatter: re-evaluate the predicate, compute global write positions, and scatter matches to the output
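The three steps above can be sketched as a sequential CPU reference. This is not the crate's Metal code: `compact_ordered` is a hypothetical function, `tile` stands in for a GPU threadgroup, and each loop stands in for one dispatch.

```rust
/// CPU reference for ordered filter+compact over u32 values.
/// `tile` plays the role of a threadgroup and must be non-zero.
fn compact_ordered<F: Fn(u32) -> bool>(data: &[u32], tile: usize, pred: F) -> Vec<u32> {
    assert!(tile > 0, "tile size must be non-zero");

    // Step 1: evaluate the predicate and record the match count per tile.
    let totals: Vec<usize> = data
        .chunks(tile)
        .map(|chunk| chunk.iter().filter(|&&x| pred(x)).count())
        .collect();

    // Step 2: exclusive prefix sum of the tile totals gives each tile's
    // starting offset in the output.
    let mut offsets = Vec::with_capacity(totals.len());
    let mut running = 0usize;
    for &t in &totals {
        offsets.push(running);
        running += t;
    }

    // Step 3: re-evaluate the predicate and write each match to its
    // global position (tile offset plus rank within the tile).
    let mut out = vec![0u32; running];
    for (chunk, &base) in data.chunks(tile).zip(&offsets) {
        let mut pos = base;
        for &x in chunk {
            if pred(x) {
                out[pos] = x;
                pos += 1;
            }
        }
    }
    out
}
```

Because every tile writes into a disjoint output range that is known before the scatter, the tiles can run in parallel while the output keeps the input order.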

Unordered mode uses a single dispatch with SIMD-aggregated atomics.
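A rough CPU analogue of aggregated atomics is shown below, using threads in place of SIMD groups. Each worker collects its matches locally, reserves a contiguous output range with one atomic add, and then writes into that range. The function `compact_unordered` and its `workers` parameter are hypothetical and say nothing about how the Metal kernel is actually written.

```rust
use std::sync::atomic::{AtomicU32, AtomicUsize, Ordering};
use std::thread;

/// Unordered filter+compact: same elements as an ordered filter,
/// but the order of the output depends on thread scheduling.
fn compact_unordered(data: &[u32], workers: usize, pred: fn(u32) -> bool) -> Vec<u32> {
    assert!(workers > 0, "need at least one worker");
    // Shared write cursor: the next free slot in the output.
    let cursor = AtomicUsize::new(0);
    // Worst-case output size; atomic slots keep the sketch free of unsafe code.
    let out: Vec<AtomicU32> = (0..data.len()).map(|_| AtomicU32::new(0)).collect();
    let chunk_len = ((data.len() + workers - 1) / workers).max(1);

    thread::scope(|s| {
        for chunk in data.chunks(chunk_len) {
            let (cursor, out) = (&cursor, &out);
            s.spawn(move || {
                // Evaluate the predicate locally, without touching shared state.
                let hits: Vec<u32> = chunk.iter().copied().filter(|&x| pred(x)).collect();
                // One atomic add per worker reserves a range for all its hits.
                let base = cursor.fetch_add(hits.len(), Ordering::Relaxed);
                for (i, v) in hits.into_iter().enumerate() {
                    out[base + i].store(v, Ordering::Relaxed);
                }
            });
        }
    });

    // All workers have been joined, so their writes are visible here.
    let n = cursor.load(Ordering::Relaxed);
    out[..n].iter().map(|slot| slot.load(Ordering::Relaxed)).collect()
}
```

Issuing one atomic per group rather than one per matching element reduces contention on the shared cursor. The cost is that output order depends on which worker reserves its range first, which is acceptable for aggregations such as sums and counts.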

License

Dual-licensed.

  • Open source: AGPL-3.0 — free for open-source projects that comply with AGPL terms.
  • Commercial: Proprietary license available for closed-source / commercial use. Contact kavanagh.patrick@gmail.com for pricing.