# forge-filter

GPU filter+compact for Apple Silicon, built on Metal compute shaders. Up to **10x faster** than Polars on numeric WHERE clauses (10.1x in unordered mode, 6.8x in ordered mode, at 16M `u32` elements).

```rust
use forge_filter::{GpuFilter, Predicate};

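// Assumes an enclosing function that returns a compatible `Result`, so `?` can propagate errors.
let mut filter = GpuFilter::new()?;
let data: Vec<u32> = (0..16_000_000).collect();
// Keep only the elements greater than 8,000,000 (about 50% selectivity).
let result = filter.filter_u32(&data, &Predicate::Gt(8_000_000))?;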
```

## Benchmarks

Measured on an Apple M4 Pro (20-core GPU, 48 GB unified memory). Polars baseline: 5.8 ms at 16M `u32` elements.

### filter_u32 @ 16M elements

| Mode | Time (50% selectivity) | Speedup vs Polars |
|------|------------------------|-------------------|
| Ordered | 848 us | **6.8x** |
| Unordered | 574 us | **10.1x** |

### Selectivity sweep (ordered, 16M u32)

| Selectivity | Time | Mrows/s |
|-------------|------|---------|
| 1% | 695 us | 23,022 |
| 10% | 735 us | 21,769 |
| 50% | 848 us | 18,868 |
| 90% | 935 us | 17,112 |
| 99% | 960 us | 16,667 |
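
The derived columns in both tables follow from the raw timings. As a small sketch (the helper names `mrows_per_sec` and `speedup` are made up for illustration and are not part of the crate), throughput is the row count divided by the time in microseconds, and the speedup is the Polars baseline divided by the measured time:

```rust
/// Throughput in millions of rows per second.
/// rows / (micros * 1e-6 s) / 1e6 simplifies to rows / micros.
fn mrows_per_sec(rows: u64, micros: f64) -> f64 {
    rows as f64 / micros
}

/// Speedup of a measured time over a baseline, both in microseconds.
fn speedup(baseline_micros: f64, measured_micros: f64) -> f64 {
    baseline_micros / measured_micros
}
```

For example, 16,000,000 rows in 848 us gives about 18,868 Mrows/s, and 5,800 us / 848 us gives about 6.8x.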

## Features

- **6 numeric types**: u32, i32, f32, u64, i64, f64
- **7 predicates**: `>`, `<`, `>=`, `<=`, `==`, `!=`, `BETWEEN`
- **Compound predicates**: AND/OR with automatic BETWEEN optimization
- **Index output**: get matching row indices for multi-column gather
- **Unordered mode**: about 50% faster than ordered mode via atomic scatter (574 us vs 848 us at 16M `u32`); output order is arbitrary, which suits aggregation queries
- **Zero-copy**: `FilterBuffer<T>` API for GPU-resident data pipelines
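
To illustrate what a BETWEEN optimization on compound predicates can look like, here is a minimal sketch. The `Pred` enum, `fold_between`, and `eval` are hypothetical names for this example only and are not the crate's `Predicate` type. It also assumes `BETWEEN` is inclusive on both ends, and it does not claim to cover the patterns the crate actually recognizes.

```rust
/// Hypothetical predicate tree for illustration; not the crate's `Predicate` type.
#[derive(Debug, Clone, PartialEq)]
enum Pred {
    Ge(u32),
    Le(u32),
    Between(u32, u32), // assumed inclusive on both ends in this sketch
    And(Box<Pred>, Box<Pred>),
}

/// Rewrite `x >= lo AND x <= hi` (in either operand order) into one `Between(lo, hi)`,
/// so a single range check replaces two comparisons joined by AND.
fn fold_between(p: Pred) -> Pred {
    match p {
        Pred::And(a, b) => match (*a, *b) {
            (Pred::Ge(lo), Pred::Le(hi)) | (Pred::Le(hi), Pred::Ge(lo)) => {
                Pred::Between(lo, hi)
            }
            // Anything else stays an AND; recurse into both operands.
            (a, b) => Pred::And(Box::new(fold_between(a)), Box::new(fold_between(b))),
        },
        other => other,
    }
}

/// Reference evaluation, used to check that folding preserves meaning.
fn eval(p: &Pred, x: u32) -> bool {
    match p {
        Pred::Ge(v) => x >= *v,
        Pred::Le(v) => x <= *v,
        Pred::Between(lo, hi) => *lo <= x && x <= *hi,
        Pred::And(a, b) => eval(a, x) && eval(b, x),
    }
}
```

Folding `And(Ge(10), Le(20))` yields `Between(10, 20)`, which evaluates identically for every input.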

## Requirements

- macOS with Apple Silicon (M1 or later)
- Metal 3.2 support
- Rust 1.70+
- Xcode Command Line Tools (for `xcrun metal` shader compiler)

## Usage

```toml
[dependencies]
forge-filter = "0.1"
```

### Simple (slice in, Vec out)

```rust
use forge_filter::{GpuFilter, Predicate};

let mut filter = GpuFilter::new()?;
// `data` (a `u32` slice or Vec) and `threshold` are supplied by the caller.
let result = filter.filter_u32(&data, &Predicate::Gt(threshold))?;
```

### Zero-copy (FilterBuffer)

```rust
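let mut filter = GpuFilter::new()?;
// Allocate a filter buffer for 16M elements and fill it from host data.
let mut buf = filter.alloc_filter_buffer::<u32>(16_000_000);
buf.copy_from_slice(&data);
let result = filter.filter(&buf, &Predicate::Between(lo, hi))?;
// Borrow the compacted output from the result.
let filtered = result.as_slice();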
```

### Index output

```rust
let indices_result = filter.filter_indices(&buf, &Predicate::Lt(100))?;
let indices: &[u32] = indices_result.indices().unwrap();
```
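
Once the matching row indices are available, the multi-column gather is ordinary indexing. The sketch below is plain CPU Rust and does not call the crate. `matching_indices` is a hypothetical stand-in for the index output, and `gather` is an illustrative helper.

```rust
/// CPU stand-in for an index-output filter: positions of the rows that match `pred`.
fn matching_indices(column: &[u32], pred: impl Fn(u32) -> bool) -> Vec<u32> {
    column
        .iter()
        .enumerate()
        .filter(|&(_, &v)| pred(v))
        .map(|(i, _)| i as u32)
        .collect()
}

/// Pull the selected rows out of any other column of the same table.
fn gather<T: Copy>(column: &[T], indices: &[u32]) -> Vec<T> {
    indices.iter().map(|&i| column[i as usize]).collect()
}
```

Filtering a `price` column with `p < 100` and then calling `gather` on a `quantity` column with the same indices returns the quantities of exactly the rows that passed the filter.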

### Unordered (faster for aggregation)

```rust
let result = filter.filter_unordered(&buf, &Predicate::Gt(0))?;
// Same elements as the ordered result, in arbitrary order; about 50% faster
```

## Algorithm

Ordered mode runs a fused 3-dispatch pipeline within a single Metal command encoder:

1. **Predicate + Scan**: evaluate the predicate per element, run a SIMD prefix sum, and write per-threadgroup (TG) totals
2. **Scan Partials**: exclusive prefix sum of the TG totals (hierarchical for >16M elements)
3. **Scatter**: re-evaluate the predicate, compute global write positions, and scatter matches to the output
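
The three passes above can be sketched on the CPU as follows. This is a sequential illustration under stated assumptions, not the Metal implementation: `chunk` stands in for a threadgroup, the function name is hypothetical, and the SIMD prefix sum inside each threadgroup is replaced by a running position within the chunk.

```rust
/// CPU sketch of an ordered filter+compact in three passes.
fn filter_compact_ordered(data: &[u32], pred: impl Fn(u32) -> bool, chunk: usize) -> Vec<u32> {
    assert!(chunk > 0);

    // Pass 1: evaluate the predicate and record the match count of each chunk.
    let totals: Vec<usize> = data
        .chunks(chunk)
        .map(|c| c.iter().filter(|&&x| pred(x)).count())
        .collect();

    // Pass 2: exclusive prefix sum of the chunk totals gives each chunk's output offset.
    let mut offsets = Vec::with_capacity(totals.len());
    let mut running = 0usize;
    for &t in &totals {
        offsets.push(running);
        running += t;
    }

    // Pass 3: re-evaluate the predicate and scatter matches to their global positions.
    let mut out = vec![0u32; running];
    for (c, &base) in data.chunks(chunk).zip(&offsets) {
        let mut pos = base;
        for &x in c {
            if pred(x) {
                out[pos] = x;
                pos += 1;
            }
        }
    }
    out
}
```

Because every chunk writes into a range reserved by the prefix sum, the output keeps the input order regardless of the order in which chunks are processed.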

Unordered mode uses a single dispatch with SIMD-aggregated atomics.
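
A rough CPU analogue of the unordered approach, assuming one thread per chunk plays the role of a SIMD group: each thread collects its matches locally, reserves an output range with a single atomic add, then writes into that range. The function name and the threading model are assumptions for illustration only; the crate does this in a Metal kernel.

```rust
use std::sync::atomic::{AtomicU32, AtomicUsize, Ordering};
use std::thread;

/// CPU sketch of an unordered filter+compact using one shared atomic cursor.
/// Spawns one thread per chunk, so use a large `chunk` for large inputs.
fn filter_compact_unordered<P>(data: &[u32], pred: P, chunk: usize) -> Vec<u32>
where
    P: Fn(u32) -> bool + Sync,
{
    assert!(chunk > 0);
    // Size the output for the worst case, where every element matches.
    let out: Vec<AtomicU32> = (0..data.len()).map(|_| AtomicU32::new(0)).collect();
    let cursor = AtomicUsize::new(0);

    thread::scope(|s| {
        for c in data.chunks(chunk) {
            let (out, cursor, pred) = (&out, &cursor, &pred);
            s.spawn(move || {
                // Aggregate locally so the shared cursor is bumped once per chunk,
                // not once per matching element.
                let hits: Vec<u32> = c.iter().copied().filter(|&x| pred(x)).collect();
                let base = cursor.fetch_add(hits.len(), Ordering::Relaxed);
                for (i, x) in hits.into_iter().enumerate() {
                    out[base + i].store(x, Ordering::Relaxed);
                }
            });
        }
    });

    // All threads are joined when the scope ends, so their writes are visible here.
    let n = cursor.load(Ordering::Relaxed);
    out[..n].iter().map(|a| a.load(Ordering::Relaxed)).collect()
}
```

The chunks land in whatever order their threads reach the cursor, which is why the result holds the same elements as the ordered version but in arbitrary order.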

## License

**Dual-licensed.**

- **Open source**: [AGPL-3.0](LICENSE), free for open-source projects that comply with AGPL terms.
- **Commercial**: Proprietary license available for closed-source / commercial use. Contact [kavanagh.patrick@gmail.com](mailto:kavanagh.patrick@gmail.com) for pricing.