# metal-json
JSON parsing on the Apple Silicon GPU. metal-json parses standard JSON
documents to a simdjson-equivalent typed tape — parsed `i64`/`u64`/`f64`
numbers, validated + unescaped strings, full grammar validation — entirely
through a Metal compute pipeline, and beats C++ simdjson on the large
documents in our benchmark sweep.
## Results
Median throughput, parse-to-tape + identical stats walk for both headline
contenders, criterion medians on an Apple M5 Max (macOS 26.5, simdjson
v4.6.4 vendored, `-O3`). Full table, provenance, and methodology:
[`docs/bench-report.md`](docs/bench-report.md).
| twitter | 0.6 MiB | 0.49 | 3.96 | 0.12× |
| citm_catalog | 1.6 MiB | 1.90 | 4.36 | 0.44× |
| canada | 2.1 MiB | 2.18 | 1.54 | **1.42×** |
| twitter_100m | 100 MiB | 6.40 | 3.11 | **2.06×** |
| twitter_256m | 256 MiB | 7.13 | 3.26 | **2.19×** |
| twitter_512m | 512 MiB | 7.51 | 2.77 | **2.71×** |
**The claim this data supports:** metal-json is **2.1–2.7× faster** than
C++ simdjson (DOM tape API) on large documents in our benchmark sweep —
100–512 MiB inputs, twitter-shaped (deterministic expansions of the
twitter template), measured on one Apple M5 Max — with every CPU-side cost
of the hybrid pipeline (syncs, allocations, fixup passes) inside the timed
region. Document shape matters: number-dense canada already favors
metal-json at just 2.1 MiB (1.42×), while citm-style documents favor
simdjson at the small sizes we measured — non-twitter shapes were only
benchmarked at 0.6–2.1 MiB, and broader large-document shape coverage is
future work. Parse-only (without the stats walk both sides run), the
512 MiB parse sustains ~11–12 GB/s of wall throughput (the exact figure
moves ~1 GB/s between sessions; see the breakdown in the report).
**The crossover caveat:** the GPU pipeline pays a roughly fixed ~0.5–0.9 ms
of dispatch/sync overhead per parse. On this machine the measured crossover
on the twitter sweep is **≈ 4–5 MiB**: below that simdjson wins (up to ~8×
faster on the 0.6 MiB twitter.json), above it metal-json wins and the gap
widens with size. If your documents are small, use a CPU parser; this
library is for big ones.
## Quick start
```rust
use metal_json::Parser;
fn main() -> Result<(), metal_json::Error> {
let parser = Parser::new()?; // Metal device + pipelines, reusable
let doc = parser.parse(br#"{"name":"meow","tags":[1,2.5]}"#)?;
let root = doc.root();
assert_eq!(root.get("name").unwrap().as_str(), Some("meow"));
assert_eq!(root.get("tags").unwrap().at(1).unwrap().as_f64(), Some(2.5));
// Big files: read once, straight into a pooled page-aligned buffer.
// let doc = parser.parse_file("big.json")?;
Ok(())
}
```
With the `serde` feature, you can deserialize the parsed document into a
typed data model:
```rust
use metal_json::Parser;
use serde::Deserialize;
#[derive(Deserialize)]
struct Payload {
name: String,
tags: Vec<f64>,
}
fn main() -> Result<(), metal_json::Error> {
let parser = Parser::new()?;
let payload: Payload = parser.parse_deserialize(br#"{"name":"meow","tags":[1,2.5]}"#)?;
assert_eq!(payload.tags[1], 2.5);
Ok(())
}
```
`parse_deserialize` returns an owned value. If your struct borrows strings,
parse to a `Document` first and call `doc.deserialize()` so the document can
outlive the borrowed fields.
`Parser` is reusable (device, pipeline states, and a buffer pool are built
once); `Document` is self-contained and returns its buffers to the pool on
drop. For repeated parsing without input copies, fill a parser-provided
page-aligned buffer (`AlignedInput`) and call `parse_aligned`. Zero-copy
*file* input exists as `unsafe fn parse_file_mmap` (mmap →
`bytesNoCopy`): it is `unsafe` because the file must not be truncated or
modified for the duration of the parse — truncating a mapped file can
`SIGBUS` the process — which is exactly why the safe `parse_file` copies.
## What it parses
- **Full standard JSON** (RFC 8259) — objects, arrays, strings with all
escapes (`\uXXXX` + surrogate pairs), numbers, literals; strict UTF-8
validation; no extensions, no relaxed mode. Error parity with the scalar
reference oracle across JSONTestSuite (318/318 two-way).
- Numbers follow simdjson's type policy (`i64` → `u64` → `f64`) with
bit-exact f64s (Eisel-Lemire on the GPU; rare hard roundings re-parsed on
the CPU from a GPU-built fixup list).
- Nesting depth limit 1024 (simdjson parity), configurable via
`ParserOptions::max_depth`.
- Maximum input size 4 GiB − 65 bytes (`u32` tape/token indices).
## Requirements
- **macOS on Apple Silicon** (unified memory is the point: input bytes map
into the GPU with `MTLBuffer bytesNoCopy`, output tapes live in shared
storage — zero copies either way).
- For broad prebuilt binaries, build with `--features runtime-shaders` so
the shader source is compiled on the user's actual Metal device/runtime.
AOT metallibs are best for controlled deployments where the target GPU
family and macOS/Metal runtime are known.
- Source builds without `runtime-shaders` require the Xcode Metal toolchain
for the AOT shader build (`xcodebuild -downloadComponent MetalToolchain`).
## Feature flags
- `runtime-shaders` — compile MSL at runtime instead of embedding an AOT
metallib. This is the preferred compatibility mode for distributed
prebuilt binaries and also supports `METAL_JSON_SHADER_DIR` hot reload
during shader development.
- `cpu-reference` — scalar CPU oracle backend, mainly for testing and
backend parity work.
- `timing` — per-kernel GPU timing helpers.
- `serde` — deserialize parsed `Document`/`Value` cursors into serde data
models. This can avoid an intermediate `serde_json::Value` and skip manual
tape traversal, but it does not skip JSON validation/parsing itself.
## Architecture
The kernels (K1–K13, plus a K6b offset-scatter pass) run across four
command buffers; each CPU sync reads back exact sizes so every allocation is
exact-fit. Bitmaps are built 64 input bytes per thread; scans are
hierarchical reduce→spine→apply (no decoupled look-back — Apple GPUs give no
forward-progress guarantee).
```text
input bytes ── mmap / page-aligned ──> MTLBuffer (bytesNoCopy, zero copy)
│
CB1 │ K1 classify+escape+UTF-8 → K2 spine scan → K3 in-string mask → K4 spine scan
├── CPU sync 1: token count → exact-size token/scratch allocations
CB2 │ K5 token scatter → K6 local validation + tape footprints → K7 spine scan
├── CPU sync 2: tape/stringbuf/list sizes → exact-fit allocations
CB2b│ K6b scatter per-token tape offsets (after this sync every CB3 size is known)
CB3 │ K8 counting sort by depth → K9 bracket pairing + container context
│ K10 numbers (Eisel-Lemire f64 bit patterns) → K11 string unescape
│ K12 container tape words → K13 root/finalize + error reduce
├── CPU: error verdict; rare fixups (hard float roundings, >16 KiB strings)
▼
Document: typed tape + string buffer (shared storage, read directly by the CPU)
```
Design notes that matter for performance and correctness:
- **Bracket matching without sorting networks**: a stable counting sort by
depth makes matching brackets adjacent — one sort yields pairing *and*
comma/colon container context.
- **No fp64 on Apple GPUs**: the number kernel computes f64 *bit patterns*
with 64-bit integer math (portable `umul128` via 32-bit limbs — measured
fastest in `docs/spikes.md`).
- **Valves for adversarial inputs**: backslash walls and >16 KiB strings
divert to CPU fixup lists instead of serializing a GPU lane; their cost is
included in every benchmark number.
- **Errors are values**: structured `Error` with byte offsets (`Syntax`,
`Utf8`, `DepthLimit`, `TrailingContent`, …); earliest error wins
deterministically via atomic min-reduction. Invalid input never panics.
## Benchmarks
```sh
cargo run -p xtask -- fetch-data # canonical datasets
cargo run -p xtask -- gen-data --template twitter --size 512m # large variants
cargo run -p xtask -- bench-report # criterion + report
```
`bench-report` regenerates [`docs/bench-report.md`](docs/bench-report.md):
GB/s medians for metal-json / C++ simdjson (vendored FFI, DCE-defeating tape
walk on both sides) / Rust simd-json / serde_json, dataset sha256s, the size
sweep with the CPU/GPU crossover, and a per-kernel time breakdown
(`--features timing`).
## Status
All milestones complete:
- **M0** — workspace, AOT `.metal` → metallib build, Metal wrapper layer,
decision spikes (`docs/spikes.md`).
- **M1** — tape format v1 (`docs/tape-format.md`), `Document`/`Value` API,
scalar CPU reference oracle (`cpu-reference` feature).
- **M2** — GPU stage 1: classify/escape/UTF-8 bitmaps, spine scans, token
extraction (K1–K5).
- **M3** — GPU structure: validation, depth sort, bracket pairing, container
tape (K6–K9, K12–K13); JSONTestSuite parity.
- **M4** — GPU scalars: Eisel-Lemire numbers + string unescape (K10–K11);
full GPU tape, differential-tested against serde_json bit-for-bit.
- **M5** — zero-copy input + pooled buffers, vendored simdjson benchmark
harness, per-kernel timing, optimization to the headline numbers above.
Tests run with `MTL_SHADER_VALIDATION=1`; the GPU backend is diffed against
the CPU oracle on JSONTestSuite, proptest-generated documents, and number/
string torture corpora.
## License
MIT OR Apache-2.0