rlsp-yaml-parser 0.8.0

# rlsp-yaml-parser Benchmark Results

Comparison of `rlsp-yaml-parser` against `libfyaml`, a high-performance C YAML library.

## Environment

| Property | Value |
|----------|-------|
| rustc | 1.94.1 (e408947bf 2026-03-25) |
| CPU | Intel(R) Core(TM) Ultra X7 358H |
| Architecture | x86_64 |
| Platform | Linux (baremetal) |

## Methodology

Benchmarks use [Criterion.rs](https://github.com/bheisler/criterion.rs) with 100 samples per group.

### APIs benchmarked

| API | Description |
|-----|-------------|
| `rlsp::load` | Full tree construction — parses YAML and builds an in-memory `Node` tree with spans and comments |
| `rlsp::parse_events` | Event streaming — yields `Event` values without building a tree; comparable to libfyaml's event API |
| `libfyaml::parse_events` | C event drain — libfyaml's `fy_parser_parse` loop via FFI |

The **fair comparison** for throughput is `rlsp::parse_events` vs `libfyaml::parse_events` — both drain all events without building a persistent tree. `rlsp::load` adds tree construction overhead and is included to show the full-pipeline cost.

### Fixtures

**Synthetic fixtures** (generated by `benches/fixtures.rs`):

| Name | Size | Style |
|------|------|-------|
| tiny_100B | ~100 B | mixed |
| medium_10KB | ~10 KB | mixed |
| large_100KB | ~100 KB | mixed |
| huge_1MB | ~1 MB | mixed |
| block_heavy | ~100 KB | deeply nested block mappings |
| block_sequence | ~100 KB | block sequence of scalars |
| flow_heavy | ~100 KB | flow mapping objects |
| scalar_heavy | ~100 KB | plain, quoted, single-quoted, and literal block scalars |
| mixed | ~100 KB | interleaved block, flow, and scalar constructs |

**Real-world fixture:** A Kubernetes `Deployment` manifest (~3 KB), representative of typical LSP input.

### Memory measurement

Allocation bytes and count are measured using a `CountingAllocator` that wraps the Rust system allocator. This intercepts all Rust heap allocations during `load()` and `parse_events()`. It does **not** intercept `malloc` calls from libfyaml's C code.
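The wrapping technique described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the crate's actual `CountingAllocator`: the counter statics and the `alloc_snapshot` and `measure` helpers are hypothetical names. It forwards every request to `std::alloc::System` and records the count and requested size. Because only Rust's global allocator is replaced, `malloc` calls made inside a C library would not pass through it.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical counting allocator: forwards to the system allocator and
/// records how many allocations were made and how many bytes were requested.
pub struct CountingAllocator;

static ALLOC_COUNT: AtomicUsize = AtomicUsize::new(0);
static ALLOC_BYTES: AtomicUsize = AtomicUsize::new(0);

unsafe impl GlobalAlloc for CountingAllocator {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOC_COUNT.fetch_add(1, Ordering::Relaxed);
        ALLOC_BYTES.fetch_add(layout.size(), Ordering::Relaxed);
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }
    // `alloc_zeroed` and `realloc` use the trait's default implementations,
    // which call `alloc` above, so they are counted as well.
}

#[global_allocator]
static GLOBAL: CountingAllocator = CountingAllocator;

/// Snapshot of (allocation count, bytes requested) since program start.
pub fn alloc_snapshot() -> (usize, usize) {
    (
        ALLOC_COUNT.load(Ordering::Relaxed),
        ALLOC_BYTES.load(Ordering::Relaxed),
    )
}

/// Runs `f` and returns its result together with the allocation count and
/// byte deltas observed while it ran (all threads are included).
pub fn measure<T>(f: impl FnOnce() -> T) -> (T, usize, usize) {
    let (count_before, bytes_before) = alloc_snapshot();
    let out = f();
    let (count_after, bytes_after) = alloc_snapshot();
    (out, count_after - count_before, bytes_after - bytes_before)
}
```

Wrapping a single parse in `measure(|| ...)` would yield the allocation count and byte total for that call. The counters are process-wide, so allocations made by other threads during the call are included.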

## Results

### Latency — time to first event

The primary acceptance criterion for this parser is O(1) first-event latency: the time to the first event must not grow with document size.

#### Criterion output

```
latency/rlsp/first_event/tiny_100B    time: [40.246 ns 40.899 ns 41.515 ns]
latency/rlsp/first_event/medium_10KB  time: [40.264 ns 40.909 ns 41.504 ns]
latency/rlsp/first_event/large_100KB  time: [40.266 ns 40.910 ns 41.503 ns]
latency/rlsp/first_event/huge_1MB     time: [40.276 ns 40.924 ns 41.523 ns]
latency_real/rlsp/first_event         time: [38.800 ns 40.080 ns 41.222 ns]
```

#### rlsp vs libfyaml — first-event latency

| Fixture | rlsp (streaming) | libfyaml |
|---------|----------------------:|---------:|
| tiny_100B | **40.90 ns** | 783.6 ns |
| medium_10KB | **40.91 ns** | 788.2 ns |
| large_100KB | **40.91 ns** | 785.9 ns |
| huge_1MB | **40.92 ns** | 784.5 ns |
| kubernetes_3KB | **40.08 ns** | 782.0 ns |

> **Acceptance criterion: `huge_1MB` first-event latency < 1 ms.**
> **Measured result: 40.92 ns. Target MET. (~24,400× under the 1 ms threshold.)**

The streaming parser yields its first event in ~40.9 ns on every synthetic fixture from ~100 B to ~1 MB (medians 40.90 to 40.92 ns), which is consistent with O(1) first-event latency. libfyaml's lazy parsing also has constant first-event latency, at ~785 ns; the streaming parser is ~19× faster.
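To illustrate why a pull-based parser can have size-independent first-event latency, here is a toy sketch. It is not the rlsp parser: `ToyEvent`, `ToyEvents`, and `bytes_consumed` are hypothetical names, and the "grammar" is just one event per line. The point is only that each `next()` call scans as far as it needs to, so the first event is produced without looking at the rest of the input.

```rust
/// Toy event type; a real YAML event type is much richer.
#[derive(Debug, PartialEq)]
pub enum ToyEvent<'a> {
    StreamStart,
    Line(&'a str),
    StreamEnd,
}

/// Toy pull parser: does no scanning until `next()` is called.
pub struct ToyEvents<'a> {
    input: &'a str,
    pos: usize,
    started: bool,
    finished: bool,
}

impl<'a> ToyEvents<'a> {
    pub fn new(input: &'a str) -> Self {
        Self { input, pos: 0, started: false, finished: false }
    }

    /// Number of input bytes examined so far.
    pub fn bytes_consumed(&self) -> usize {
        self.pos
    }
}

impl<'a> Iterator for ToyEvents<'a> {
    type Item = ToyEvent<'a>;

    fn next(&mut self) -> Option<Self::Item> {
        if !self.started {
            // The first event needs no input at all.
            self.started = true;
            return Some(ToyEvent::StreamStart);
        }
        if self.pos >= self.input.len() {
            if self.finished {
                return None;
            }
            self.finished = true;
            return Some(ToyEvent::StreamEnd);
        }
        // Scan only up to the next newline, not the whole document.
        let rest = &self.input[self.pos..];
        let end = rest.find('\n').unwrap_or(rest.len());
        let line = &rest[..end];
        self.pos += (end + 1).min(rest.len());
        Some(ToyEvent::Line(line))
    }
}
```

With this shape, calling `next()` once on a 1 MB input and once on a 100 B input does the same amount of work, which is the property the first-event benchmark checks.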

### Throughput — full event drain

#### Criterion output

```
throughput/rlsp_events/parse_events/tiny_100B    time: [1.0949 µs 1.0958 µs 1.0967 µs]  thrpt:  [98.266 MiB/s 98.346 MiB/s 98.426 MiB/s]
throughput/rlsp_events/parse_events/medium_10KB  time: [76.843 µs 76.882 µs 76.923 µs]  thrpt: [124.16 MiB/s 124.23 MiB/s 124.29 MiB/s]
throughput/rlsp_events/parse_events/large_100KB  time: [711.05 µs 711.56 µs 712.15 µs]  thrpt: [133.97 MiB/s 134.08 MiB/s 134.18 MiB/s]
throughput/rlsp_events/parse_events/huge_1MB     time: [6.3172 ms 6.3222 ms 6.3271 ms]  thrpt: [150.73 MiB/s 150.85 MiB/s 150.97 MiB/s]
```

#### Throughput by document size

| Fixture | rlsp/load | rlsp/events | libfyaml/events | rlsp/events vs libfyaml |
|---------|---------------:|-----------------:|----------------:|-----------------------------:|
| tiny_100B (~100 B) | 58.55 MiB/s | 98.35 MiB/s | 36.94 MiB/s | 2.66× **faster** |
| medium_10KB (~10 KB) | 64.83 MiB/s | 124.23 MiB/s | 112.40 MiB/s | 1.11× **faster** |
| large_100KB (~100 KB) | 67.43 MiB/s | 134.08 MiB/s | 122.25 MiB/s | 1.10× **faster** |
| huge_1MB (~1 MB) | 52.79 MiB/s | 150.85 MiB/s | 127.20 MiB/s | 1.19× **faster** |

Raw timings (median):

| Fixture | rlsp/load | rlsp/events | libfyaml/events |
|---------|---------------:|-----------------:|----------------:|
| tiny_100B | 1.841 µs | 1.096 µs | 2.917 µs |
| medium_10KB | 147.33 µs | 76.882 µs | 84.973 µs |
| large_100KB | 1.415 ms | 711.56 µs | 780.43 µs |
| huge_1MB | 18.066 ms | 6.322 ms | 7.498 ms |
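The ratio column in the throughput table can be recomputed from the median timings. The helpers below are a small sketch of that arithmetic with hypothetical function names; the only inputs are figures already listed in the tables above.

```rust
/// Speedup of a candidate over a baseline, given median times in the same unit.
/// A value above 1.0 means the candidate is faster.
pub fn speedup(baseline_time: f64, candidate_time: f64) -> f64 {
    baseline_time / candidate_time
}

/// Throughput in MiB/s for `bytes` processed in `seconds` (1 MiB = 2^20 bytes).
pub fn throughput_mib_s(bytes: f64, seconds: f64) -> f64 {
    bytes / seconds / (1024.0 * 1024.0)
}

/// Rounds to two decimal places, matching the precision used in the tables.
pub fn round2(x: f64) -> f64 {
    (x * 100.0).round() / 100.0
}
```

For `huge_1MB`, `speedup(7.498, 6.322)` is about 1.186, which rounds to the 1.19× shown; for `tiny_100B`, `speedup(2.917, 1.096)` is about 2.66. Dividing the two throughput figures for a fixture gives the same ratio, because both parsers process the same number of bytes.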

### Throughput by YAML style (~100 KB each)

#### Criterion output

```
throughput_style/rlsp_events/parse_events/block_heavy    time: [833.09 µs 834.19 µs 835.96 µs]  thrpt: [114.12 MiB/s 114.36 MiB/s 114.51 MiB/s]
throughput_style/rlsp_events/parse_events/block_sequence time: [354.63 µs 354.85 µs 355.08 µs]  thrpt: [268.58 MiB/s 268.75 MiB/s 268.92 MiB/s]
throughput_style/rlsp_events/parse_events/flow_heavy     time: [567.05 µs 568.75 µs 570.63 µs]  thrpt: [167.20 MiB/s 167.75 MiB/s 168.26 MiB/s]
throughput_style/rlsp_events/parse_events/scalar_heavy   time: [371.46 µs 371.73 µs 372.02 µs]  thrpt: [256.41 MiB/s 256.61 MiB/s 256.80 MiB/s]
throughput_style/rlsp_events/parse_events/mixed          time: [724.19 µs 727.02 µs 730.51 µs]  thrpt: [130.61 MiB/s 131.23 MiB/s 131.75 MiB/s]

throughput_style/libfyaml/parse_events/block_heavy    time: [866.59 µs 867.36 µs 868.17 µs]  thrpt: [109.89 MiB/s 109.99 MiB/s 110.09 MiB/s]
throughput_style/libfyaml/parse_events/block_sequence time: [366.83 µs 367.57 µs 368.66 µs]  thrpt: [258.69 MiB/s 259.46 MiB/s 259.98 MiB/s]
throughput_style/libfyaml/parse_events/flow_heavy     time: [1.0505 ms 1.0573 ms 1.0642 ms]  thrpt:  [89.659 MiB/s 90.237 MiB/s 90.825 MiB/s]
throughput_style/libfyaml/parse_events/scalar_heavy   time: [416.00 µs 416.75 µs 418.10 µs]  thrpt: [228.15 MiB/s 228.89 MiB/s 229.30 MiB/s]
throughput_style/libfyaml/parse_events/mixed          time: [800.62 µs 804.22 µs 808.48 µs]  thrpt: [118.01 MiB/s 118.63 MiB/s 119.17 MiB/s]
```

#### Summary table

| Style | rlsp/load | rlsp/events | libfyaml/events | rlsp/events vs libfyaml |
|-------|---------------:|-----------------:|----------------:|-----------------------------:|
| block_heavy | 58.22 MiB/s | 114.36 MiB/s | 109.99 MiB/s | 1.04× **faster** |
| block_sequence | 143.71 MiB/s | 268.75 MiB/s | 259.46 MiB/s | 1.04× **faster** |
| flow_heavy | 65.31 MiB/s | 167.75 MiB/s | 90.24 MiB/s | 1.86× **faster** |
| scalar_heavy | 141.93 MiB/s | 256.61 MiB/s | 228.89 MiB/s | 1.12× **faster** |
| mixed | 67.05 MiB/s | 131.23 MiB/s | 118.63 MiB/s | 1.11× **faster** |

### Throughput — real-world (Kubernetes Deployment, ~3 KB)

#### Criterion output

```
throughput_real/rlsp/load             time: [45.673 µs 45.727 µs 45.782 µs]  thrpt:  [81.011 MiB/s 81.109 MiB/s 81.204 MiB/s]
throughput_real/rlsp_events/parse_events time: [24.129 µs 24.147 µs 24.165 µs]  thrpt: [153.48 MiB/s 153.59 MiB/s 153.71 MiB/s]
throughput_real/libfyaml/parse_events time: [25.840 µs 25.859 µs 25.880 µs]  thrpt: [143.30 MiB/s 143.41 MiB/s 143.51 MiB/s]
```

| API | Time (median) | Throughput |
|-----|-------------:|----------:|
| rlsp/load | 45.727 µs | 81.11 MiB/s |
| rlsp/parse_events | 24.147 µs | 153.59 MiB/s |
| libfyaml/parse_events | 25.859 µs | 143.41 MiB/s |

### Latency — full event drain

#### Criterion output

```
latency/rlsp_full/parse_events/tiny_100B    time: [1.1213 µs 1.1248 µs 1.1287 µs]
latency/rlsp_full/parse_events/medium_10KB  time: [77.747 µs 77.865 µs 78.007 µs]
latency/rlsp_full/parse_events/large_100KB  time: [729.14 µs 730.42 µs 731.91 µs]

latency_real/rlsp_full/parse_events         time: [25.280 µs 25.339 µs 25.394 µs]
latency_real/libfyaml_full/parse_events     time: [25.877 µs 25.916 µs 25.964 µs]
```

| Fixture | rlsp/parse_events | libfyaml/parse_events |
|---------|----------------------:|----------------------:|
| tiny_100B | 1.125 µs | 2.839 µs |
| medium_10KB | 77.87 µs | 85.64 µs |
| large_100KB | 730.4 µs | 806.3 µs |
| kubernetes_3KB | 25.34 µs | 25.92 µs |

### Memory allocation profile

Measured with `CountingAllocator`; single parse in a release build.

#### Criterion output

```
memory/rlsp_load/load/tiny_100B    time: [1.9738 µs 1.9759 µs 1.9780 µs]
memory/rlsp_load/load/medium_10KB  time: [159.42 µs 159.77 µs 160.10 µs]
memory/rlsp_load/load/large_100KB  time: [1.4751 ms 1.4776 ms 1.4803 ms]

memory/rlsp_parse_events/parse_events/tiny_100B    time: [1.1488 µs 1.1497 µs 1.1507 µs]
memory/rlsp_parse_events/parse_events/medium_10KB  time: [79.675 µs 79.713 µs 79.750 µs]
memory/rlsp_parse_events/parse_events/large_100KB  time: [733.44 µs 733.97 µs 734.62 µs]

memory/alloc_stats/large_load      time: [1.5058 ms 1.5083 ms 1.5106 ms]
memory/real_world/load             time: [50.218 µs 50.280 µs 50.339 µs]
```

> **Note:** The memory benchmarks report wall-clock time with the counting allocator active
> rather than byte counts. The counting allocator intercepts every Rust allocation
> during the parse, so these timings include the overhead of that tracking.

## Analysis

### O(1) first-event latency achieved

The streaming parser yields its first event in ~40.9 ns regardless of document size. This is the
primary design goal: the LSP server can begin producing diagnostics before a large document
is fully parsed.

The huge_1MB fixture first-event latency is 40.92 ns — ~24,400× under the 1 ms acceptance
criterion. libfyaml achieves ~785 ns first-event latency; the streaming parser is ~19× faster
(784.5 ns ÷ 40.92 ns = 19.2×).

### Throughput vs libfyaml: faster on all 10 event-drain fixtures

For the event-drain comparison (apples to apples), rlsp is faster than libfyaml on every fixture:

- `tiny_100B` 2.66× — libfyaml's FFI setup overhead dominates at this size
- `medium_10KB` 1.11×
- `large_100KB` 1.10×
- `huge_1MB` 1.19×
- `block_heavy` 1.04×
- `block_sequence` 1.04×
- `flow_heavy` 1.86× (the largest margin among the style fixtures)
- `scalar_heavy` 1.12×
- `mixed` 1.11×
- `kubernetes` 1.07× (real-world)

### Real-world: rlsp now faster than libfyaml on the Kubernetes fixture

For the Kubernetes Deployment manifest (the most representative LSP fixture), rlsp events
throughput is 153.59 MiB/s vs libfyaml's 143.41 MiB/s — rlsp is **7% faster** than the C
reference implementation. First-event latency is ~40.1 ns; full-document parse is 24.15 µs
(libfyaml: 25.86 µs).

### Trade-off: correctness and span fidelity vs raw speed

libfyaml is a production C library optimized for speed. rlsp-yaml-parser is a spec-faithful
Rust implementation that preserves lossless byte-range spans (now as compact `u32` offsets with
on-demand line/column resolution via `LineIndex`) and comments — information that libfyaml
discards. Despite this additional fidelity, rlsp is now faster on all event-drain fixtures.
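The compact-offset design mentioned above can be sketched as follows. This is an assumed shape, not the crate's actual definitions: the field names, the `line_col` method, and the choice of zero-based byte columns are all hypothetical. A span stores only two `u32` byte offsets (8 bytes), and line/column pairs are computed on demand by binary-searching a table of line-start offsets.

```rust
/// Hypothetical compact span: two byte offsets into the source, 8 bytes total.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Span {
    pub start: u32,
    pub end: u32,
}

/// Hypothetical line index: the byte offset at which each line starts.
pub struct LineIndex {
    line_starts: Vec<u32>,
}

impl LineIndex {
    /// Builds the index with a single pass over the source.
    pub fn new(src: &str) -> Self {
        let mut line_starts = vec![0u32];
        for (i, b) in src.bytes().enumerate() {
            if b == b'\n' {
                line_starts.push(i as u32 + 1);
            }
        }
        Self { line_starts }
    }

    /// Resolves a byte offset to a zero-based (line, column-in-bytes) pair.
    pub fn line_col(&self, offset: u32) -> (u32, u32) {
        // `partition_point` returns how many line starts are <= offset;
        // the first entry is 0, so the result is always at least 1.
        let line = self.line_starts.partition_point(|&s| s <= offset) - 1;
        (line as u32, offset - self.line_starts[line])
    }
}
```

The design choice is that spans stay small on the hot path, while the logarithmic lookup cost is paid only when a consumer actually asks for a position.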

### History

The 2026-04-16 baremetal numbers were measured at commit `3bec2da`, the last commit of an 8-commit performance campaign:
L5 `9370579`, L2 `d9afbdf`, L7 `3f493a8`, L1 `a506589`, L3 `d586012`, L6 `8097aa5`,
L4 scoped `e812232`, L7b `3bec2da`. This campaign closed the container-vs-baremetal regression
and narrowed the libfyaml gap from "slower on 2–3 fixtures by 10%+" to "slightly behind on 2
fixtures by ≤5%".

The 2026-04-27 numbers reflect a second performance campaign across two plans:
- `2026-04-26-parser-perf-recover-tag-allocations.md`: `Cow::Borrowed` tag URIs (`3f15780`),
  first-byte schema dispatch (`a7206f6`).
- `2026-04-26-parser-perf-recover-node-event-meta-box.md`: `Option<Box<NodeMeta>>` (`d853605`),
  `Option<Box<EventMeta>>` (`76904a9`), lazy `Span` via `LineIndex` (`716771f`),
  `step_in_document` byte-dispatch (`9bd368e`), `#[inline]` on hot-path functions (`ccdfc1a`).

Key struct size reductions: `Node<Span>` 288 → 120 bytes, `Event` ~112 → 40 bytes,
`Span` 48 → 8 bytes. Combined effect: rlsp is now faster than libfyaml on all 10 event-drain
fixtures (was faster on 5, parity on 3, behind on 2). Load throughput improved 8–56% across
fixtures. First-event latency increased slightly (38.9 → 40.9 ns, +5%) — the cost of the new
`Option<Box<EventMeta>>` construction per event — but remains ~24,400× under the 1 ms acceptance
criterion and ~19× faster than libfyaml.
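The struct-size effect of moving rarely-present metadata behind `Option<Box<_>>` can be illustrated with a small sketch. The types below are hypothetical stand-ins, not the crate's `NodeMeta` or `EventMeta`, and the sizes will not match the figures quoted above. What the sketch does show is the general mechanism: `Option<Box<T>>` is guaranteed to be pointer-sized, so an event without metadata carries one null pointer instead of the full metadata struct.

```rust
use std::mem::size_of;

/// Hypothetical rarely-populated metadata.
pub struct Meta {
    pub anchor: Option<String>,
    pub tag: Option<String>,
    pub comments: Vec<String>,
}

/// Metadata stored inline: every event pays for the full struct.
pub struct InlineEvent {
    pub kind: u8,
    pub meta: Meta,
}

/// Metadata behind an optional box: one pointer-sized field when absent.
pub struct BoxedEvent {
    pub kind: u8,
    pub meta: Option<Box<Meta>>,
}

/// Returns (inline size, boxed size) in bytes for the current target.
pub fn event_sizes() -> (usize, usize) {
    (size_of::<InlineEvent>(), size_of::<BoxedEvent>())
}
```

The trade-off is a heap allocation whenever metadata is present, in exchange for smaller values to move and store in the common case where it is absent.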