rlsp-yaml-parser 0.11.0

# rlsp-yaml-parser Benchmark Results

Comparison of `rlsp-yaml-parser` against `libfyaml`, a high-performance C YAML library.

## Environment

| Property | Value |
|----------|-------|
| rustc | 1.94.1 (e408947bf 2026-03-25) |
| CPU | Intel(R) Core(TM) Ultra X7 358H |
| Architecture | x86_64 |
| Platform | Linux (baremetal) |

## Methodology

Benchmarks use [Criterion.rs](https://github.com/bheisler/criterion.rs) with 100 samples per group.

### APIs benchmarked

| API | Description |
|-----|-------------|
| `rlsp::load` | Full tree construction — parses YAML and builds an in-memory `Node` tree with spans and comments |
| `rlsp::parse_events` | Event streaming — yields `Event` values without building a tree; comparable to libfyaml's event API |
| `libfyaml::parse_events` | C event drain — libfyaml's `fy_parser_parse` loop via FFI |

The **fair comparison** for throughput is `rlsp::parse_events` vs `libfyaml::parse_events` — both drain all events without building a persistent tree. `rlsp::load` adds tree construction overhead and is included to show the full-pipeline cost.

### Fixtures

**Synthetic fixtures** (generated by `benches/fixtures.rs`):

| Name | Size | Style |
|------|------|-------|
| tiny_100B | ~100 B | mixed |
| medium_10KB | ~10 KB | mixed |
| large_100KB | ~100 KB | mixed |
| huge_1MB | ~1 MB | mixed |
| block_heavy | ~100 KB | deeply nested block mappings |
| block_sequence | ~100 KB | block sequence of scalars |
| flow_heavy | ~100 KB | flow mapping objects |
| scalar_heavy | ~100 KB | plain, quoted, single-quoted, and literal block scalars |
| mixed | ~100 KB | interleaved block, flow, and scalar constructs |

**Real-world fixture:** A Kubernetes `Deployment` manifest (~3 KB), representative of typical LSP input.

### Memory measurement

Allocation bytes and count are measured using a `CountingAllocator` that wraps the Rust system allocator. This intercepts all Rust heap allocations during `load()` and `parse_events()`. It does **not** intercept `malloc` calls from libfyaml's C code.
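
The idea of a counting allocator can be sketched as follows. This is a minimal illustration that assumes a wrapper around `std::alloc::System` with two global counters; the names `ALLOC_BYTES`, `ALLOC_COUNT`, `AllocStats`, and `measure` are hypothetical and are not taken from the benchmark source.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::hint::black_box;
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical counters; the real benchmark's bookkeeping may differ.
static ALLOC_BYTES: AtomicUsize = AtomicUsize::new(0);
static ALLOC_COUNT: AtomicUsize = AtomicUsize::new(0);

/// Forwards every request to the system allocator and records it first.
pub struct CountingAllocator;

unsafe impl GlobalAlloc for CountingAllocator {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOC_BYTES.fetch_add(layout.size(), Ordering::Relaxed);
        ALLOC_COUNT.fetch_add(1, Ordering::Relaxed);
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }
}

// Only Rust allocations go through this hook; C code that calls
// `malloc` directly bypasses it.
#[global_allocator]
static GLOBAL: CountingAllocator = CountingAllocator;

/// Bytes requested and number of allocations made while a closure ran.
pub struct AllocStats {
    pub bytes: usize,
    pub count: usize,
}

/// Runs `f` and reports the allocations observed during the call.
pub fn measure<T>(f: impl FnOnce() -> T) -> (T, AllocStats) {
    let bytes_before = ALLOC_BYTES.load(Ordering::Relaxed);
    let count_before = ALLOC_COUNT.load(Ordering::Relaxed);
    let out = black_box(f());
    let stats = AllocStats {
        bytes: ALLOC_BYTES.load(Ordering::Relaxed) - bytes_before,
        count: ALLOC_COUNT.load(Ordering::Relaxed) - count_before,
    };
    (out, stats)
}
```

The counters are process-wide, so allocations made by other threads during the call are also included. A single-threaded measurement avoids that noise.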

## Results

### Latency — time to first event

The primary acceptance criterion for this parser is O(1) first-event latency.

#### Criterion output

```
latency/rlsp/first_event/tiny_100B    time: [41.032 ns 42.401 ns 43.618 ns]
latency/rlsp/first_event/medium_10KB  time: [41.015 ns 42.408 ns 43.655 ns]
latency/rlsp/first_event/large_100KB  time: [41.017 ns 42.394 ns 43.624 ns]
latency/rlsp/first_event/huge_1MB     time: [40.963 ns 42.361 ns 43.602 ns]
latency_real/rlsp/first_event         time: [42.303 ns 42.719 ns 43.133 ns]
```

#### rlsp vs libfyaml — first-event latency

| Fixture | rlsp (streaming) | libfyaml |
|---------|----------------------:|---------:|
| tiny_100B | **42.40 ns** | 754.6 ns |
| medium_10KB | **42.41 ns** | 746.7 ns |
| large_100KB | **42.39 ns** | 747.6 ns |
| huge_1MB | **42.36 ns** | 749.0 ns |
| kubernetes_3KB | **42.72 ns** | 744.5 ns |

> **Acceptance criterion: `huge_1MB` first-event latency < 1 ms.**
> **Measured result: 42.36 ns. Target MET. (~23,600× under the 1 ms threshold.)**

The streaming parser yields its first event in ~42.4 ns regardless of document size — true O(1) first-event latency. libfyaml's lazy parsing achieves ~749 ns constant first-event latency; the streaming parser is ~18× faster still.
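
To illustrate why a lazy event iterator can have constant first-event cost, here is a toy sketch. `ToyEvents` is a hypothetical stand-in that yields one "event" per line; it is not the rlsp parser and does not use its API. The point is only that the first `next()` call touches the first line and nothing after it, provided that line has bounded length.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for a streaming parser: one "event" per input line.
pub struct ToyEvents<'a> {
    input: &'a str,
    pos: usize,
}

impl<'a> ToyEvents<'a> {
    pub fn new(input: &'a str) -> Self {
        Self { input, pos: 0 }
    }

    /// Number of input bytes consumed so far.
    pub fn bytes_consumed(&self) -> usize {
        self.pos
    }
}

impl<'a> Iterator for ToyEvents<'a> {
    type Item = &'a str;

    fn next(&mut self) -> Option<&'a str> {
        if self.pos >= self.input.len() {
            return None;
        }
        let rest = &self.input[self.pos..];
        // The scan stops at the first newline, so work is bounded by line length.
        let line_len = rest.find('\n').unwrap_or(rest.len());
        self.pos += (line_len + 1).min(rest.len());
        Some(&rest[..line_len])
    }
}

/// Bytes the toy iterator had to read before yielding its first event.
pub fn bytes_to_first_event(doc: &str) -> usize {
    let mut events = ToyEvents::new(doc);
    let _first = events.next();
    events.bytes_consumed()
}

/// Wall-clock time from iterator construction to the first yielded event.
pub fn time_to_first_event(doc: &str) -> Duration {
    let start = Instant::now();
    let mut events = ToyEvents::new(black_box(doc));
    black_box(events.next());
    start.elapsed()
}
```

A single `Instant` reading is far too coarse for nanosecond-scale results; the figures in this report come from Criterion, which repeats the measurement many times.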

### Throughput — full event drain

#### Criterion output

```
throughput/rlsp_events/parse_events/tiny_100B    time: [1.1895 µs 1.1906 µs 1.1917 µs]  thrpt:  [90.430 MiB/s 90.515 MiB/s 90.600 MiB/s]
throughput/rlsp_events/parse_events/medium_10KB  time: [78.095 µs 78.156 µs 78.219 µs]  thrpt: [122.11 MiB/s 122.20 MiB/s 122.30 MiB/s]
throughput/rlsp_events/parse_events/large_100KB  time: [748.67 µs 749.17 µs 749.68 µs]  thrpt: [127.26 MiB/s 127.35 MiB/s 127.44 MiB/s]
throughput/rlsp_events/parse_events/huge_1MB     time: [7.0747 ms 7.0800 ms 7.0852 ms]  thrpt: [134.60 MiB/s 134.70 MiB/s 134.80 MiB/s]
```

#### Throughput by document size

| Fixture | rlsp/load | rlsp/events | libfyaml/events | rlsp/events vs libfyaml |
|---------|---------------:|-----------------:|----------------:|-----------------------------:|
| tiny_100B (~100 B) | 55.04 MiB/s | 90.52 MiB/s | 38.36 MiB/s | 2.36× **faster** |
| medium_10KB (~10 KB) | 61.10 MiB/s | 122.20 MiB/s | 106.35 MiB/s | 1.15× **faster** |
| large_100KB (~100 KB) | 64.68 MiB/s | 127.35 MiB/s | 113.95 MiB/s | 1.12× **faster** |
| huge_1MB (~1 MB) | 54.92 MiB/s | 134.70 MiB/s | 119.72 MiB/s | 1.13× **faster** |

Raw timings (median):

| Fixture | rlsp/load | rlsp/events | libfyaml/events |
|---------|---------------:|-----------------:|----------------:|
| tiny_100B | 1.958 µs | 1.191 µs | 2.810 µs |
| medium_10KB | 156.33 µs | 78.156 µs | 89.806 µs |
| large_100KB | 1.475 ms | 749.17 µs | 837.28 µs |
| huge_1MB | 17.366 ms | 7.080 ms | 7.966 ms |
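
The speedup column can be reproduced from these medians: for a fixed input, the throughput ratio equals the inverse ratio of the times. A small sketch, where `speedup` and `EVENT_DRAIN_MEDIANS_US` are illustrative names and the numbers are copied from the table above (converted to µs):

```rust
/// How many times faster the candidate is than the baseline,
/// given median times in the same unit.
pub fn speedup(baseline_time: f64, candidate_time: f64) -> f64 {
    baseline_time / candidate_time
}

/// (fixture, libfyaml/events median, rlsp/events median), all in µs.
pub const EVENT_DRAIN_MEDIANS_US: [(&str, f64, f64); 4] = [
    ("tiny_100B", 2.810, 1.191),
    ("medium_10KB", 89.806, 78.156),
    ("large_100KB", 837.28, 749.17),
    ("huge_1MB", 7966.0, 7080.0),
];

/// Formats one line per fixture, with the ratio rounded to two decimals.
pub fn speedup_report() -> Vec<String> {
    EVENT_DRAIN_MEDIANS_US
        .iter()
        .map(|(name, fy, rlsp)| format!("{name}: {:.2}x", speedup(*fy, *rlsp)))
        .collect()
}
```

Rounded to two decimals, the ratios agree with the "rlsp/events vs libfyaml" column of the throughput table.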

### Throughput by YAML style (~100 KB each)

#### Criterion output

```
throughput_style/rlsp_events/parse_events/block_heavy    time: [844.33 µs 845.62 µs 847.17 µs]  thrpt: [112.61 MiB/s 112.82 MiB/s 112.99 MiB/s]
throughput_style/rlsp_events/parse_events/block_sequence time: [410.21 µs 411.21 µs 412.26 µs]  thrpt: [231.33 MiB/s 231.92 MiB/s 232.49 MiB/s]
throughput_style/rlsp_events/parse_events/flow_heavy     time: [628.13 µs 628.80 µs 629.54 µs]  thrpt: [151.56 MiB/s 151.73 MiB/s 151.90 MiB/s]
throughput_style/rlsp_events/parse_events/scalar_heavy   time: [436.39 µs 436.73 µs 437.10 µs]  thrpt: [218.23 MiB/s 218.42 MiB/s 218.59 MiB/s]
throughput_style/rlsp_events/parse_events/mixed          time: [735.43 µs 736.03 µs 736.70 µs]  thrpt: [129.51 MiB/s 129.63 MiB/s 129.73 MiB/s]

throughput_style/libfyaml/parse_events/block_heavy    time: [950.14 µs 950.75 µs 951.37 µs]  thrpt: [100.28 MiB/s 100.34 MiB/s 100.41 MiB/s]
throughput_style/libfyaml/parse_events/block_sequence time: [414.84 µs 415.43 µs 416.17 µs]  thrpt: [229.16 MiB/s 229.57 MiB/s 229.89 MiB/s]
throughput_style/libfyaml/parse_events/flow_heavy     time: [1.1526 ms 1.1547 ms 1.1568 ms]  thrpt: [82.476 MiB/s 82.627 MiB/s 82.777 MiB/s]
throughput_style/libfyaml/parse_events/scalar_heavy   time: [441.76 µs 442.06 µs 442.35 µs]  thrpt: [215.64 MiB/s 215.78 MiB/s 215.93 MiB/s]
throughput_style/libfyaml/parse_events/mixed          time: [836.03 µs 837.37 µs 838.97 µs]  thrpt: [113.72 MiB/s 113.94 MiB/s 114.12 MiB/s]
```

#### Summary table

| Style | rlsp/load | rlsp/events | libfyaml/events | rlsp/events vs libfyaml |
|-------|---------------:|-----------------:|----------------:|-----------------------------:|
| block_heavy | 53.46 MiB/s | 112.82 MiB/s | 100.34 MiB/s | 1.12× **faster** |
| block_sequence | 124.57 MiB/s | 231.92 MiB/s | 229.57 MiB/s | 1.01× **faster** |
| flow_heavy | 61.47 MiB/s | 151.73 MiB/s | 82.63 MiB/s | 1.84× **faster** |
| scalar_heavy | 128.71 MiB/s | 218.42 MiB/s | 215.78 MiB/s | 1.01× **faster** |
| mixed | 64.21 MiB/s | 129.63 MiB/s | 113.94 MiB/s | 1.14× **faster** |

### Throughput — real-world (Kubernetes Deployment, ~3 KB)

#### Criterion output

```
throughput_real/rlsp/load             time: [46.078 µs 46.153 µs 46.233 µs]  thrpt: [80.221 MiB/s 80.359 MiB/s 80.491 MiB/s]
throughput_real/rlsp_events/parse_events time: [26.095 µs 26.139 µs 26.185 µs]  thrpt: [141.64 MiB/s 141.89 MiB/s 142.13 MiB/s]
throughput_real/libfyaml/parse_events time: [27.658 µs 27.696 µs 27.735 µs]  thrpt: [133.72 MiB/s 133.91 MiB/s 134.09 MiB/s]
```

| API | Time (median) | Throughput |
|-----|-------------:|----------:|
| rlsp/load | 46.153 µs | 80.36 MiB/s |
| rlsp/parse_events | 26.139 µs | 141.89 MiB/s |
| libfyaml/parse_events | 27.696 µs | 133.91 MiB/s |

### Latency — full event drain

#### Criterion output

```
latency/rlsp_full/parse_events/tiny_100B    time: [1.1894 µs 1.1921 µs 1.1946 µs]
latency/rlsp_full/parse_events/medium_10KB  time: [80.309 µs 80.465 µs 80.616 µs]
latency/rlsp_full/parse_events/large_100KB  time: [758.78 µs 760.28 µs 761.70 µs]

latency_real/rlsp_full/parse_events         time: [25.841 µs 25.882 µs 25.928 µs]
latency_real/libfyaml_full/parse_events     time: [27.046 µs 27.090 µs 27.139 µs]
```

| Fixture | rlsp/parse_events | libfyaml/parse_events |
|---------|----------------------:|----------------------:|
| tiny_100B | 1.192 µs | 2.672 µs |
| medium_10KB | 80.47 µs | 90.97 µs |
| large_100KB | 760.3 µs | 821.4 µs |
| kubernetes_3KB | 25.88 µs | 27.09 µs |

### Memory allocation profile

Measured with `CountingAllocator`; single parse in a release build.

#### Criterion output

```
memory/rlsp_load/load/tiny_100B    time: [2.0672 µs 2.0730 µs 2.0794 µs]
memory/rlsp_load/load/medium_10KB  time: [166.58 µs 167.03 µs 167.55 µs]
memory/rlsp_load/load/large_100KB  time: [1.5788 ms 1.5843 ms 1.5905 ms]

memory/rlsp_parse_events/parse_events/tiny_100B    time: [1.2015 µs 1.2025 µs 1.2037 µs]
memory/rlsp_parse_events/parse_events/medium_10KB  time: [81.004 µs 81.056 µs 81.109 µs]
memory/rlsp_parse_events/parse_events/large_100KB  time: [749.12 µs 749.58 µs 750.06 µs]

memory/alloc_stats/large_load      time: [1.5786 ms 1.5837 ms 1.5887 ms]
memory/real_world/load             time: [48.787 µs 48.854 µs 48.927 µs]
```

> **Note:** The memory benchmarks above report wall-clock time with the counting allocator
> installed; they do not report byte counts directly. Because the allocator intercepts every
> Rust heap allocation during the parse, the timings include the overhead of that tracking.

## Analysis

### O(1) first-event latency achieved

The streaming parser yields its first event in ~42.4 ns regardless of document size. This is the
primary design goal: the LSP server can begin producing diagnostics before a large document
is fully parsed.

The huge_1MB fixture first-event latency is 42.36 ns — ~23,600× under the 1 ms acceptance
criterion. libfyaml achieves ~749 ns first-event latency; the streaming parser is ~18× faster
(748.97 ns ÷ 42.36 ns = 17.7×).

### Throughput vs libfyaml: faster on all 10 event-drain fixtures

For the event-drain comparison (apples to apples), rlsp is faster than libfyaml on every fixture:

- `tiny_100B` 2.36× — libfyaml's FFI setup overhead dominates at this size
- `medium_10KB` 1.15×
- `large_100KB` 1.12×
- `huge_1MB` 1.13×
- `block_heavy` 1.12×
- `block_sequence` 1.01×
- `flow_heavy` 1.84× — flow collections are rlsp's strongest style
- `scalar_heavy` 1.01×
- `mixed` 1.14×
- `kubernetes_3KB` 1.06× (real-world)

### Real-world: rlsp faster than libfyaml on the Kubernetes fixture

For the Kubernetes Deployment manifest (the most representative LSP fixture), rlsp events
throughput is 141.89 MiB/s vs libfyaml's 133.91 MiB/s — rlsp is **6% faster** than the C
reference implementation. First-event latency is ~42.7 ns; full-document parse is 26.14 µs
(libfyaml: 27.70 µs).

### Allocation dominance in load pipeline

A flame graph of the `block_heavy` load benchmark (`flame-block_heavy-load.svg`) shows that
~87% of wall-clock time is in glibc `malloc`/`free` — ~59% in allocation (`__libc_malloc` +
`__libc_malloc2`) and ~28% in deallocation (`_int_free_chunk` → `free_perturb`). This confirms
that the load pipeline is allocation-bound for deeply nested block structures, and explains why
`block_heavy` has the lowest load throughput (53.46 MiB/s) among style fixtures. The event-drain
API avoids most of this overhead by not building a persistent tree.

### Trade-off: correctness and span fidelity vs raw speed

libfyaml is a production C library optimized for speed. rlsp-yaml-parser is a spec-faithful
Rust implementation that preserves lossless byte-range spans (now as compact `u32` offsets with
on-demand line/column resolution via `LineIndex`) and comments — information that libfyaml
discards. Despite this additional fidelity, rlsp is now faster on all event-drain fixtures.
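
The compact-span idea can be sketched like this. The code assumes a span of two `u32` byte offsets and a line index that stores the byte offset of each line start. The type names mirror the text, but the fields and the `line_col` method are hypothetical and may not match the crate's actual `LineIndex` API.

```rust
/// Byte-range span stored as two 32-bit offsets (8 bytes in total).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Span {
    pub start: u32,
    pub end: u32,
}

/// Hypothetical line index: the byte offset at which each line starts.
pub struct LineIndex {
    line_starts: Vec<u32>,
}

impl LineIndex {
    /// Scans the text once and records the offset following every newline.
    pub fn new(text: &str) -> Self {
        let mut line_starts = vec![0u32];
        for (i, b) in text.bytes().enumerate() {
            if b == b'\n' {
                line_starts.push(i as u32 + 1);
            }
        }
        Self { line_starts }
    }

    /// Resolves a byte offset to a zero-based (line, byte column) pair
    /// by binary search over the line starts.
    pub fn line_col(&self, offset: u32) -> (u32, u32) {
        // `line_starts[0]` is 0, so the partition point is always >= 1.
        let line = self.line_starts.partition_point(|&s| s <= offset) - 1;
        (line as u32, offset - self.line_starts[line])
    }
}
```

With this layout, line and column are computed only when a consumer asks for them, for example when a diagnostic is reported, rather than being stored in every span.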

### History

The 2026-04-16 baremetal numbers were measured on commit `3bec2da`, the last of an 8-commit performance campaign:
L5 `9370579`, L2 `d9afbdf`, L7 `3f493a8`, L1 `a506589`, L3 `d586012`, L6 `8097aa5`,
L4 scoped `e812232`, L7b `3bec2da`. This campaign closed the container-vs-baremetal regression
and narrowed the libfyaml gap from "slower on 2–3 fixtures by 10%+" to "slightly behind on 2
fixtures by ≤5%".

The 2026-04-27 numbers reflect a second performance campaign across two plans:
- `2026-04-26-parser-perf-recover-tag-allocations.md`: `Cow::Borrowed` tag URIs (`3f15780`),
  first-byte schema dispatch (`a7206f6`).
- `2026-04-26-parser-perf-recover-node-event-meta-box.md`: `Option<Box<NodeMeta>>` (`d853605`),
  `Option<Box<EventMeta>>` (`76904a9`), lazy `Span` via `LineIndex` (`716771f`),
  `step_in_document` byte-dispatch (`9bd368e`), `#[inline]` on hot-path functions (`ccdfc1a`).

Key struct size reductions: `Node<Span>` 288 → 120 bytes, `Event` ~112 → 40 bytes,
`Span` 48 → 8 bytes. Combined effect: rlsp is now faster than libfyaml on all 10 event-drain
fixtures (was faster on 5, parity on 3, behind on 2). Load throughput improved 8–56% across
fixtures. First-event latency increased slightly (38.9 → 40.9 ns, +5%) — the cost of the new
`Option<Box<EventMeta>>` construction per event — but remains ~24,400× under the 1 ms acceptance
criterion and ~19× faster than libfyaml.
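
The effect of moving rarely-present metadata behind `Option<Box<_>>` can be sketched with `std::mem::size_of`. The structs below are hypothetical and much simpler than the crate's `Node`, `Event`, `NodeMeta`, and `EventMeta`; they only show the mechanism.

```rust
use std::mem::size_of;

/// Hypothetical metadata that most events do not carry.
pub struct Meta {
    pub anchor: Option<String>,
    pub tag: Option<String>,
    pub comment: Option<String>,
}

/// Metadata stored inline: every event pays for the full `Meta`.
pub struct InlineEvent {
    pub kind: u8,
    pub span: (u32, u32),
    pub meta: Meta,
}

/// Metadata behind a nullable pointer: `None` costs one pointer-sized slot.
pub struct BoxedEvent {
    pub kind: u8,
    pub span: (u32, u32),
    pub meta: Option<Box<Meta>>,
}

/// Returns (inline size, boxed size) in bytes for the current target.
pub fn event_sizes() -> (usize, usize) {
    (size_of::<InlineEvent>(), size_of::<BoxedEvent>())
}
```

`Option<Box<T>>` is pointer-sized because `None` is represented as the null pointer. The trade-off is a heap allocation whenever metadata is actually present.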

The 2026-05-13 numbers are a re-run on the same baremetal hardware with no code changes — a
measurement refresh to track baseline drift. Absolute throughput is ~5–10% lower across the
board (both rlsp and libfyaml), consistent with background system load or microcode/firmware
differences. Relative performance is unchanged: rlsp remains faster than libfyaml on all 10
event-drain fixtures. First-event latency is ~42.4 ns (was ~40.9 ns), still ~23,600× under
the 1 ms threshold. A flame graph of `block_heavy` load confirms allocation dominance (~87%
in malloc/free), identifying the glibc allocator as the primary bottleneck for the load pipeline.