# forge-filter
GPU filter+compact (stream compaction) for Apple Silicon. **Up to 10x faster** than Polars on numeric WHERE clauses (6.8x ordered, 10.1x unordered at 16M `u32`), using Metal compute shaders.
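```rust
use forge_filter::{GpuFilter, Predicate};
// Run inside a function that returns a `Result`, since `?` propagates errors.
let mut filter = GpuFilter::new()?;
let data: Vec<u32> = (0..16_000_000).collect();
// Keep every element strictly greater than 8_000_000.
let result = filter.filter_u32(&data, &Predicate::Gt(8_000_000))?;
```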
## Benchmarks
Measured on Apple M4 Pro (20-core GPU, 48 GB unified memory). Polars baseline: 5.8 ms at 16M `u32`.
### filter_u32 @ 16M elements
| Mode | Time | Speedup vs. Polars |
|---|---|---|
| Ordered | 848 us | **6.8x** |
| Unordered | 574 us | **10.1x** |
### Selectivity sweep (ordered, 16M u32)
| Selectivity | Time | Throughput (M elements/s) |
|---|---|---|
| 1% | 695 us | 23,022 |
| 10% | 735 us | 21,769 |
| 50% | 848 us | 18,868 |
| 90% | 935 us | 17,112 |
| 99% | 960 us | 16,667 |
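
The figures above can be cross-checked with simple arithmetic: the speedups are the Polars baseline divided by the measured time, and the last column of the sweep matches elements per microsecond, which is the same number as millions of elements per second. A minimal sketch follows; the function names are illustrative and not part of the crate.

```rust
/// Throughput in millions of elements per second.
/// Elements per microsecond is numerically the same quantity.
fn throughput_melems_per_s(elements: u64, micros: f64) -> f64 {
    elements as f64 / micros
}

/// Speedup of a measured time over a baseline, both in microseconds.
fn speedup(baseline_us: f64, measured_us: f64) -> f64 {
    baseline_us / measured_us
}
```

For example, `speedup(5800.0, 848.0)` is about 6.84 and `throughput_melems_per_s(16_000_000, 695.0)` is about 23,022, in line with the ordered row and the 1% row.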
## Features
- **6 numeric types**: u32, i32, f32, u64, i64, f64
- **7 predicates**: `>`, `<`, `>=`, `<=`, `==`, `!=`, `BETWEEN`
- **Compound predicates**: AND/OR with automatic BETWEEN optimization
- **Index output**: get matching row indices for multi-column gather
- **Unordered mode**: about 1.5x the ordered throughput (574 us vs. 848 us at 16M `u32`) via atomic scatter, for queries such as aggregations where output order does not matter
- **Zero-copy**: `FilterBuffer<T>` API for GPU-resident data pipelines
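
To make the predicate set and the compound-predicate folding concrete, here is a CPU-side sketch. Everything in it is an assumption for illustration: `Pred`, `eval`, and `simplify` are hypothetical names rather than the crate's API, `BETWEEN` is taken to be inclusive on both ends (as in SQL), and only the `>= lo AND <= hi` pattern is folded. The crate's actual rewrite rules may differ.

```rust
/// Hypothetical CPU-side predicate model (not the crate's `Predicate` type).
#[derive(Debug, Clone, PartialEq)]
enum Pred {
    Gt(u32),
    Lt(u32),
    Ge(u32),
    Le(u32),
    Eq(u32),
    Ne(u32),
    /// Assumed inclusive on both ends, like SQL `BETWEEN lo AND hi`.
    Between(u32, u32),
    And(Box<Pred>, Box<Pred>),
    Or(Box<Pred>, Box<Pred>),
}

impl Pred {
    /// Evaluate the predicate for a single value.
    fn eval(&self, x: u32) -> bool {
        match self {
            Pred::Gt(v) => x > *v,
            Pred::Lt(v) => x < *v,
            Pred::Ge(v) => x >= *v,
            Pred::Le(v) => x <= *v,
            Pred::Eq(v) => x == *v,
            Pred::Ne(v) => x != *v,
            Pred::Between(lo, hi) => *lo <= x && x <= *hi,
            Pred::And(a, b) => a.eval(x) && b.eval(x),
            Pred::Or(a, b) => a.eval(x) || b.eval(x),
        }
    }

    /// Fold `x >= lo AND x <= hi` (in either order) into one range test.
    /// Any other predicate is returned unchanged.
    fn simplify(self) -> Pred {
        match self {
            Pred::And(a, b) => match (*a, *b) {
                (Pred::Ge(lo), Pred::Le(hi)) | (Pred::Le(hi), Pred::Ge(lo)) => {
                    Pred::Between(lo, hi)
                }
                (a, b) => Pred::And(Box::new(a), Box::new(b)),
            },
            other => other,
        }
    }
}
```

Folding two comparisons into one range test means each element is checked by a single predicate evaluation instead of a small expression tree.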
## Requirements
- macOS with Apple Silicon (M1 or later)
- Metal 3.2 support
- Rust 1.70+
- Xcode Command Line Tools (for `xcrun metal` shader compiler)
## Usage
```toml
[dependencies]
forge-filter = "0.1"
```
### Simple (slice in, Vec out)
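```rust
use forge_filter::{GpuFilter, Predicate};
let data: Vec<u32> = (0..1_000_000).collect();
let threshold: u32 = 500_000;
let mut filter = GpuFilter::new()?;
// Returns the elements of `data` that are strictly greater than `threshold`.
let result = filter.filter_u32(&data, &Predicate::Gt(threshold))?;
```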
### Zero-copy (FilterBuffer)
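```rust
use forge_filter::{GpuFilter, Predicate};
let data: Vec<u32> = (0..16_000_000).collect();
let (lo, hi) = (1_000_000u32, 2_000_000u32);
let mut filter = GpuFilter::new()?;
// Allocate the buffer once, then fill it from host memory.
let mut buf = filter.alloc_filter_buffer::<u32>(16_000_000);
buf.copy_from_slice(&data);
let result = filter.filter(&buf, &Predicate::Between(lo, hi))?;
// Borrow the filtered output as a slice.
let filtered = result.as_slice();
```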
### Index output
```rust
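// `filter` and `buf` are set up as in the zero-copy example.
let indices_result = filter.filter_indices(&buf, &Predicate::Lt(100))?;
// Row positions of the matching elements.
let indices: &[u32] = indices_result.indices().unwrap();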
```
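
To illustrate what the index output is for, the sketch below shows a multi-column gather on the CPU: filter on one column, then use the resulting row indices to pull the matching rows from a second column. Both helpers are hypothetical stand-ins written for this example; `matching_indices` plays the role of the GPU index filter and is not part of the crate.

```rust
/// CPU stand-in for an index filter: row positions where `key < limit`.
fn matching_indices(keys: &[u32], limit: u32) -> Vec<u32> {
    keys.iter()
        .enumerate()
        .filter(|(_, &k)| k < limit)
        .map(|(i, _)| i as u32)
        .collect()
}

/// Gather `column[i]` for every row index `i`.
/// Panics if an index is out of bounds for `column`.
fn gather<T: Copy>(column: &[T], indices: &[u32]) -> Vec<T> {
    indices.iter().map(|&i| column[i as usize]).collect()
}
```

With `keys = [5, 200, 7, 150, 99]` and a limit of 100, the indices are `[0, 2, 4]`, and gathering from a second column of the same length returns its rows 0, 2, and 4.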
### Unordered (faster for aggregation)
```rust
let result = filter.filter_unordered(&buf, &Predicate::Gt(0))?;
// Same elements as the ordered result, in arbitrary order; roughly 1.5x the throughput
```
## Algorithm
Fused 3-dispatch pipeline within a single Metal command encoder:
1. **Predicate + Scan**: evaluate the predicate per element, run a SIMD prefix sum, and write per-threadgroup (TG) totals
2. **Scan Partials**: exclusive prefix sum of the TG totals (hierarchical for >16M elements)
3. **Scatter**: re-evaluate the predicate, compute global write positions, and scatter to the output
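
The three steps above can be modeled sequentially on the CPU. This is a sketch of the general scan-based compaction technique, not the crate's Metal kernels: `TILE` is a hypothetical stand-in for one threadgroup's share of the input, and `compact_ordered` is an illustrative name.

```rust
/// Hypothetical tile size standing in for one threadgroup's share of the input.
const TILE: usize = 4;

/// Order-preserving compaction in three passes over tiles.
fn compact_ordered(data: &[u32], pred: impl Fn(u32) -> bool) -> Vec<u32> {
    // Pass 1: count matches per tile (the per-threadgroup totals).
    let totals: Vec<usize> = data
        .chunks(TILE)
        .map(|tile| tile.iter().filter(|&&x| pred(x)).count())
        .collect();

    // Pass 2: exclusive prefix sum of the totals gives each tile's output base.
    let mut bases = Vec::with_capacity(totals.len());
    let mut running = 0usize;
    for &t in &totals {
        bases.push(running);
        running += t;
    }

    // Pass 3: re-evaluate the predicate and write each match
    // to (tile base + rank within the tile).
    let mut out = vec![0u32; running];
    for (tile, &base) in data.chunks(TILE).zip(&bases) {
        let mut local = 0usize;
        for &x in tile {
            if pred(x) {
                out[base + local] = x;
                local += 1;
            }
        }
    }
    out
}
```

Because every tile knows its output base before the scatter begins, tiles can write independently and the output still preserves input order. Re-evaluating the predicate in the last pass avoids storing a per-element flag array between passes.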
Unordered mode uses a single dispatch with SIMD-aggregated atomics.
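
A rough CPU analogue of the unordered path, using threads in place of SIMD groups, is sketched below. It is an assumption-laden illustration rather than the crate's kernel: each worker gathers its matches locally and then reserves a contiguous output range with one atomic add, instead of one atomic per element. The name `compact_unordered` and the `workers` parameter are hypothetical.

```rust
use std::sync::atomic::{AtomicU32, AtomicUsize, Ordering};
use std::thread;

/// Compaction that keeps the same elements but not their input order.
fn compact_unordered<P>(data: &[u32], pred: P, workers: usize) -> Vec<u32>
where
    P: Fn(u32) -> bool + Sync,
{
    // Worst-case output size: every element matches.
    let out: Vec<AtomicU32> = (0..data.len()).map(|_| AtomicU32::new(0)).collect();
    // Shared write cursor into `out`.
    let cursor = AtomicUsize::new(0);
    let w = workers.max(1);
    let chunk = ((data.len() + w - 1) / w).max(1);

    thread::scope(|s| {
        for part in data.chunks(chunk) {
            let (out, cursor, pred) = (&out, &cursor, &pred);
            s.spawn(move || {
                // Collect this worker's matches locally first...
                let hits: Vec<u32> = part.iter().copied().filter(|&x| pred(x)).collect();
                // ...then reserve a contiguous range with a single atomic add.
                let base = cursor.fetch_add(hits.len(), Ordering::Relaxed);
                for (i, x) in hits.into_iter().enumerate() {
                    out[base + i].store(x, Ordering::Relaxed);
                }
            });
        }
    });

    // All workers have been joined here, so their writes are visible.
    let n = cursor.load(Ordering::Relaxed);
    out[..n].iter().map(|a| a.load(Ordering::Relaxed)).collect()
}
```

The order in which workers reserve ranges depends on scheduling, so the output order can vary between runs. Order-insensitive aggregations such as a sum or a count give the same answer either way.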
## License
**Dual-licensed.**
- **Open source**: [AGPL-3.0](LICENSE) — free for open-source projects that comply with AGPL terms.
- **Commercial**: Proprietary license available for closed-source / commercial use. Contact [kavanagh.patrick@gmail.com](mailto:kavanagh.patrick@gmail.com) for pricing.