# libgraphql-parser Benchmark Optimizations
Tracker for performance optimization opportunities in the lexer and parser.
Each entry documents the problem, fix, trade-offs, and (once implemented)
benchmark results.
Status legend: **Pending** | **Completed** | **Skipped**
---
## B1: `peek_char()` uses `remaining().chars().nth(0)` on every character [CRITICAL]
**Status:** Completed
**Priority:** 1 (highest bang-for-buck)
**File:** `src/token_source/str_to_graphql_token_source.rs`
**Date:** 2026-02-08
**Problem:** Every character peek constructs a `&str` slice via `remaining()`,
creates a `Chars` iterator, and walks to the nth element. Called millions of
times for large inputs (every `consume()`, `skip_whitespace()`, `lex_name()`,
`lex_comment()`, `lex_block_string()`, `next_token()`).
**Change made:** Replaced both `peek_char()` and `consume()` with ASCII fast
paths. `peek_char()` does direct byte indexing + `is_ascii()` check instead
of creating a `Chars` iterator. `consume()` skips `ch.len_utf8()` and
`ch.len_utf16()` calls for ASCII (known to be 1 byte / 1 code unit). Non-ASCII
falls back to full UTF-8 decoding.
**Trade-offs:** Adds ASCII vs non-ASCII branch; branch prediction strongly
favors ASCII. `peek_char_nth(n)` for n>0 still needs iterator approach.
**Benchmark results (clean run, both before/after on AC power):**
Lexer-only (isolated lexer performance, most reliable signal):
| Fixture | Before | After | Change |
|-----------------------|----------|----------|------------|
| lexer/github_schema | 8.089ms | 7.552ms | **-6.6%** |
| lexer/large_schema | 6.483ms | 6.234ms | **-3.8%** |
| lexer/starwars_schema | 40.60us | 37.20us | **-6.6%** |
| lexer/medium_schema | 1.381ms | 1.335ms | **-3.5%** |
| lexer/small_schema | 28.95us | 28.15us | **-2.9%** |
Full schema parse (lexer + parser combined):
| Fixture | Before | After | Change |
|----------------------|----------|----------|------------|
| schema_parse/github | 23.01ms | 22.33ms | **-3.0%** |
| schema_parse/large | 24.71ms | 24.73ms | ~0% |
| schema_parse/medium | 4.961ms | 4.985ms | ~0% |
| schema_parse/starwars | 87.23us | 91.38us | +4.8% (*) |
| schema_parse/small | 44.34us | 43.83us | -1.3% |
Cross-parser comparison (libgraphql_parser only):
| Fixture | Before | After | Change |
|-------------------------------------|----------|----------|------------|
| compare_schema_parse/.../github | 22.89ms | 22.47ms | **-1.9%** |
| compare_schema_parse/.../large | 24.59ms | 24.61ms | ~0% |
| compare_schema_parse/.../medium | 4.997ms | 4.974ms | -0.5% |
| compare_schema_parse/.../starwars | 77.14us | 85.22us | +10.5% (*) |
| compare_schema_parse/.../small | 83.99us | 81.21us | -2.6% |
(*) The starwars parse regression is anomalous: the lexer for starwars
clearly improved by -6.6%, and control parsers (graphql_parser,
apollo_parser) showed 0-1.5% random drift. This appears to be
measurement noise on the small (~4KB) fixture where variance is high.
**Machine:** Apple M2 Max, 12 cores, 64 GB RAM, macOS (Darwin 23.6.0, arm64)
**Rust:** rustc 1.90.0-nightly (0d9592026 2025-07-19)
**Verdict:** Consistent 3-7% lexer improvement across all fixture sizes.
Full parse shows ~2-3% improvement on the largest real-world input
(github). Keeping.
---
## B2: `consume()` does per-character position tracking [HIGH]
**Status:** Completed (all 4 sub-optimizations kept)
**Priority:** 6
**File:** `src/token_source/str_to_graphql_token_source.rs`
**Date:** 2026-02-09
**Problem:** Every character consumed updates 5-6 fields (peek_char, newline
check, curr_col_utf8, curr_col_utf16, last_char_was_cr, curr_byte_offset).
For a name like `PullRequestReviewCommentConnection` that's 36 chars x 6 ops.
**Change made:** Implemented byte-scanning fast paths for 4 hot lexer
methods. Each scans raw bytes in a tight loop (one branch per byte)
and batch-updates position tracking once at the end. Shared helper
`compute_columns_for_span()` handles ASCII fast path for column
computation (ASCII byte count = char count = UTF-16 unit count).
The approach is safe for multi-byte UTF-8 because the sentinel bytes
(`"`, `\`, `\n`, `\r`) are all ASCII (<0x80) and can never appear as
continuation bytes in multi-byte UTF-8 sequences (which are >=0x80).
Sub-optimizations (each a separate commit):
1. **`lex_name()`** — Byte-scan `[_0-9A-Za-z]` pattern. Names are
ASCII-only by spec, no newlines, so column = byte count.
2. **`skip_whitespace()`** — Byte-scan ` `, `\t`, `\n`, `\r`, BOM.
Tracks newline positions and BOM count for column computation.
3. **`lex_comment()`** — Byte-scan to `\n`/`\r`/EOF. Comments are
single-line, so only column advances. Uses
`compute_columns_for_span()` for potential non-ASCII content.
4. **`lex_block_string()`** — Byte-scan for `"`, `\`, `\n`, `\r`
sentinels, skip everything else with `i += 1`. Tracks newlines
for position reconstruction via `compute_columns_for_span()`.
**Trade-offs:** More complex position tracking logic (batch vs
per-char). `skip_whitespace()` tracks BOM count for correct column
math. `lex_block_string()` uses `compute_columns_for_span()` which
iterates after-last-newline range (but has ASCII fast path).
**Benchmark results (back-to-back, both on AC power):**
Machine: Apple M2 Max, 12 cores, 64 GB RAM, macOS (Darwin 23.6.0, arm64)
Rust: rustc 1.90.0-nightly (0d9592026 2025-07-19)
Controls: graphql_parser ±0-1.6%, apollo_parser ±0-1.9% — clean
measurement, all changes attributable to our code.
### Net results (all 4 sub-optimizations combined vs B5 baseline)
Schema parse:
| Fixture | Before | After | Change |
|----------|-----------|-----------|--------------|
| small | 43.0 µs | 37.8 µs | **-12.1%** |
| medium | 2.07 ms | 1.81 ms | **-12.6%** |
| large | 9.65 ms | 8.40 ms | **-12.9%** |
| starwars | 53.4 µs | 42.8 µs | **-19.5%** |
| github | 12.6 ms | 10.5 ms | **-16.9%** |
Executable parse:
| Fixture | Before | After | Change |
|------------------|-----------|-----------|--------------|
| simple_query | 1.94 µs | 1.73 µs | **-10.9%** |
| complex_query | 35.8 µs | 31.7 µs | **-11.2%** |
| nested_depth_10 | 7.72 µs | 6.2 µs | **-19.8%** |
| nested_depth_30 | 28.1 µs | 18.5 µs | **-34.2%** |
| many_ops_50 | 141 µs | 131 µs | **-7.2%** |
Lexer-only (isolates lexer changes):
| Fixture | Before | After | Change |
|-----------------|-----------|-----------|--------------|
| small_schema | 28.5 µs | 25.8 µs | **-6.3%** |
| medium_schema | 1.35 ms | 1.22 ms | **-9.3%** |
| large_schema | 6.30 ms | 5.69 ms | **-9.6%** |
| starwars_schema | 37.6 µs | 29.9 µs | **-20.2%** |
| github_schema | 7.68 ms | 5.85 ms | **-23.8%** |
Cross-parser comparison (schema parse, after B2):
| Fixture | libgraphql | graphql_parser | apollo_parser |
|----------|--------------|----------------|---------------|
| small | **37.9 µs** | 47.1 µs | 48.8 µs |
| medium | **1.82 ms** | 2.09 ms | 2.24 ms |
| large | **8.41 ms** | 9.63 ms | 10.7 ms |
| starwars | **42.8 µs** | 52.9 µs | 58.4 µs |
| github | 10.5 ms | **9.46 ms** | 14.1 ms |
Cross-parser comparison (executable parse, after B2):
| Fixture | libgraphql | graphql_parser | apollo_parser |
|----------|--------------|----------------|---------------|
| simple | **1.74 µs** | 3.02 µs | 3.17 µs |
| complex | **31.7 µs** | 41.9 µs | 41.0 µs |
### Bisection — marginal contribution of each sub-optimization
Marginal % = this commit's incremental effect (cumulative minus
previous cumulative). Values within ±3% are within measurement
noise and marked with ~.
schema_parse marginals:
| Fixture | lex_name | skip_ws | lex_comment | block_string |
|----------|------------|----------|--------------|--------------|
| small | **-5.9%** | ~ | ~ | **-7.0%** |
| medium | **-7.9%** | ~ | ~ | **-3.1%** |
| large | **-10.5%** | ~ | ~ | **-4.6%** |
| starwars | **-4.1%** | ~ | **-12.9%** | ~ |
| github | ~ | ~ | ~ | **-10.5%** |
executable_parse marginals:
| Fixture | lex_name | skip_ws | lex_comment | block_string |
|------------------|------------|-------------|-------------|--------------|
| simple_query | **-6.8%** | ~ | ~ | ~ |
| complex_query | **-8.9%** | ~ | ~ | ~ |
| nested_depth_10 | **-7.9%** | **-10.1%** | ~ | ~ |
| nested_depth_30 | **-3.2%** | **-29.3%** | ~ | ~ |
| many_ops_50 | **-5.9%** | ~ | ~ | ~ |
lexer marginals:
| Fixture | lex_name | skip_ws | lex_comment | block_string |
|-----------------|------------|----------|--------------|---------------|
| small_schema | ~ | ~ | ~ | ~ |
| medium_schema | **-4.1%** | ~ | ~ | **-3.3%** |
| large_schema | **-3.6%** | ~ | ~ | ~ |
| starwars_schema | ~ | ~ | **-20.8%** | ~ |
| github_schema | **-4.0%** | ~ | ~ | **-18.8%** |
### Per sub-optimization assessment
**lex_name:** Broad, consistent 4-11% improvement across schema and
executable parsing. Names are the most frequent token type — every
identifier, type name, field name, keyword benefits.
**skip_whitespace:** Dramatic 10-29% improvement on deeply-nested
executable parsing (depth_10, depth_30) where whitespace-heavy
indentation dominates. Negligible on other fixtures. The nested
fixtures have proportionally more whitespace due to deep indentation.
**lex_comment:** 13-21% improvement on starwars fixture (comment-
heavy). Negligible on other fixtures which have few `#` comments.
The starwars schema has extensive `#`-style comments throughout.
**lex_block_string:** 3-19% improvement on schema fixtures with
block string descriptions (github has 3,246 descriptions). The
github lexer improvement (-18.8%) is particularly striking. No
effect on executable parsing (queries don't typically contain block
strings).
**Verdict:** All 4 sub-optimizations kept. Each targets a different
token type and shows clear signal above noise on fixtures where that
token type is prevalent. No regressions detected on any fixture.
libgraphql-parser now leads graphql_parser and apollo_parser on all
schema fixtures except github (where graphql_parser still leads by
~10%). On executable parsing, libgraphql-parser leads by ~1.7-1.8x.
---
## B3: Block string parsing allocates heavily [HIGH]
**Status:** Completed
**Priority:** 2
**File:** `src/token/graphql_token_kind.rs`
**Date:** 2026-02-08
**Problem:** `parse_block_string()` is called for every description. For the
GitHub schema (~3,246 block strings), each call does:
1. `content.replace("\\\"\"\"", "\"\"\"")` — always allocates even when no
escaped triple quotes exist (common case)
2. `content.lines().collect::<Vec<&str>>()` — allocates a Vec
3. `Vec::with_capacity(lines.len())` of `String` — allocates Vec of Strings
4. `.to_string()` per line — heap allocation per line
5. `result_lines.remove(0)` — O(n) shift
6. `result_lines.join("\n")` — final allocation
~6 allocations per description x 3,246 = ~19,000+ heap allocations.
**Change made:** Rewrote `parse_block_string()` as a two-pass, low-allocation
algorithm:
- `Cow::Borrowed` fast path skips `replace()` when no `\"""` present
- Pass 1 iterates `str::lines()` lazily to compute common indent and
first/last non-blank line indices (no Vec allocation)
- Pass 2 iterates `str::lines()` again, writing stripped lines directly
into a single pre-allocated `String`
- No `Vec<String>`, no `remove(0)`, no `join()` — just one `String`
allocation for the entire result
**Trade-offs:** More complex two-pass logic (two `str::lines()` iterations
instead of one collect). Must preserve exact spec semantics for edge cases.
**Benchmark results (clean back-to-back runs):**
Schema parse (full parse, lexer + parser):
| Fixture | Before | After | Change |
|-----------------------|----------|----------|-----------|
| schema_parse/github | 22.35ms | 21.55ms | **-3.6%** |
| schema_parse/large | 24.61ms | 23.96ms | **-2.6%** |
| schema_parse/medium | 4.917ms | 4.816ms | **-2.0%** |
| schema_parse/small | 43.70us | 41.84us | **-4.0%** |
| schema_parse/starwars | 90.61us | 93.14us | +2.6% (*) |
Cross-parser comparison (libgraphql_parser only):
| Fixture | Before | After | Change |
|------------------------------------|----------|----------|-----------|
| compare_.../libgraphql_.../github | 22.72ms | 21.35ms | **-3.0%** |
| compare_.../libgraphql_.../large | 24.49ms | 24.28ms | ~0% |
| compare_.../libgraphql_.../medium | 4.916ms | 4.834ms | ~0% |
| compare_.../libgraphql_.../small | 79.67us | 75.22us | **-6.0%** |
Executable parse (B3 benefits string-heavy queries):
| Fixture | Before | After | Change |
|--------------------------|----------|----------|--------------|
| executable_parse/complex | 72.30us | 54.46us | **-25.4%** |
| compare_.../complex | 72.93us | 68.41us | **-4.3%** |
Lexer (expected: no change — B3 is parser-level):
| Fixture | Before | After | Change |
|---------------------|----------|----------|--------|
| lexer/github_schema | 7.550ms | 7.526ms | ~0% |
| lexer/large_schema | 6.212ms | 6.220ms | ~0% |
(*) The starwars regression is anomalous: control parsers showed 0%
drift, and the starwars schema has few descriptions (B3 should be
irrelevant there). This appears to be measurement noise.
**Machine:** Apple M2 Max, 12 cores, 64 GB RAM, macOS (Darwin 23.6.0, arm64)
**Rust:** rustc 1.90.0-nightly (0d9592026 2025-07-19)
**Verdict:** Clear improvement on description-heavy inputs (github
-3%, complex query -25%). Lexer unaffected as expected. Keeping.
---
## B4: `name.into_owned()` forces heap allocation for every identifier [DEFERRED]
**Status:** Pending (deferred — requires significant architectural changes)
**Priority:** — (future work)
**File:** `src/graphql_parser.rs`, `src/ast.rs`
**Problem:** AST types use `String` for identifiers. Every name is converted
from `Cow::Borrowed(&str)` to owned `String` via `into_owned()`. For the
GitHub schema with ~70,000+ identifiers, that's ~70,000 heap allocations.
**Suggested fix (long-term):** Define native AST types with `Cow<'src, str>`.
**Suggested fix (short-term):** String interning / arena allocation.
**Trade-offs:** Major architectural refactor. Lifetime `'src` propagates
through all AST consumers.
**Est. impact:** HIGHEST — but deferred due to scope
---
## B5: Token clone in `expect()` [CRITICAL]
**Status:** Completed
**Priority:** 5
**File:** `src/graphql_parser.rs`, `src/graphql_token_stream.rs`
**Date:** 2026-02-09
**Problem:** `expect()` cloned the peeked token before consuming.
Clone included `GraphQLTokenKind` enum (with `Cow<str>`),
`SmallVec<[GraphQLTriviaToken; 2]>` trivia, `GraphQLSourceSpan`.
Called for every punctuator — tens of thousands of times per schema.
**Change made:** Replaced `Vec`+index buffer in `GraphQLTokenStream`
with `VecDeque` ring buffer. `consume()` now returns
`Option<GraphQLToken>` (owned) via O(1) `pop_front()`. Eliminated:
- Token clone in `expect()` (cloned full GraphQLToken per punctuator)
- `Cow<str>` clone in `expect_name_only()` (now moves from owned token)
- Span clone in `expect_keyword()` success path
- Span clone in `parse_description()` error path
Removed `compact_buffer()` (VecDeque naturally discards consumed
tokens) and `current_token()` (callers use `consume_token()` return
directly). Parser tracks `last_end_position: Option<SourcePosition>`
for EOF error anchoring.
**Trade-offs:** Changed `GraphQLTokenStream` API — removed
`current_token()` and `compact_buffer()`. All ~50
`self.token_stream.consume()` call sites updated to
`self.consume_token()` wrapper that tracks end position.
**Benchmark results (clean back-to-back, both on AC power):**
Machine: Apple M-series arm64, macOS
Rust: rustc 1.90.0-nightly (0d9592026 2025-07-19)
Controls confirm clean measurement: lexer benchmarks all within
±0.3–1.8% (no parser changes expected). Competitor parsers
(graphql_parser, apollo_parser) within ±0–3% noise.
Standalone schema_parse (libgraphql only):
| Fixture | Before | After | Change |
|----------|-----------|----------|--------------|
| small | 42.1 µs | 42.1 µs | ~0% |
| medium | 5.25 ms | 2.04 ms | **-61.1%** |
| large | 20.3 ms | 9.50 ms | **-53.2%** |
| starwars | 92.7 µs | 52.1 µs | **-42.8%** |
| github | 21.7 ms | 12.4 ms | **-42.7%** |
Standalone executable_parse (libgraphql only):
| Fixture | Before | After | Change |
|------------------|-----------|-----------|--------------|
| simple_query | 1.93 µs | 1.91 µs | -1.6% |
| complex_query | 70.8 µs | 34.9 µs | **-51.1%** |
| nested_depth_10 | 8.25 µs | 7.54 µs | **-9.0%** |
| nested_depth_30 | 61.1 µs | 27.6 µs | **-54.8%** |
| many_ops_50 | 198.7 µs | 138.3 µs | **-30.1%** |
Cross-parser comparison (schema parse, after B5):
| Fixture | libgraphql | graphql_parser | apollo_parser |
|----------|-------------|----------------|---------------|
| small | **42.0 µs** | 46.6 µs | 48.3 µs |
| medium | **2.05 ms** | 2.06 ms | 2.20 ms |
| large | **9.49 ms** | 9.48 ms | 10.6 ms |
| starwars | 52.1 µs | **52.6 µs** | 57.6 µs |
| github | 12.5 ms | **9.35 ms** | 13.9 ms |
Cross-parser comparison (executable parse, after B5):
| Fixture | libgraphql | graphql_parser | apollo_parser |
|---------|-------------|----------------|---------------|
| simple | **1.91 µs** | 3.03 µs | 3.13 µs |
| complex | **34.9 µs** | 41.3 µs | 40.6 µs |
**Verdict:** Massive improvement — original "MODERATE" estimate was
dramatically wrong. Token cloning was the dominant parser bottleneck.
libgraphql-parser is now competitive with or faster than both
graphql_parser and apollo_parser on most benchmarks. The original
2–2.5x gap is closed to 1.0–1.3x across all fixtures.
---
## B6: `starts_with()` in block string lexing [MODERATE]
**Status:** Skipped (no measurable improvement — reverted)
**Priority:** 4
**File:** `src/token_source/str_to_graphql_token_source.rs`
**Date:** 2026-02-09
**Problem:** Inside the block string lexer, every character checks:
```rust
self.remaining().starts_with("\\\"\"\"")
self.remaining().starts_with("\"\"\"")
```
`remaining()` creates a new slice each time. Adds up for long block strings.
Also replaced in `lex_string()` for the block string detection check.
**Change attempted:** Added `next_is_triple_quote()` and
`next_is_escaped_triple_quote()` helper methods that use direct byte
indexing into `self.source.as_bytes()`. Also removed unnecessary
`#[inline]` from `peek_char()` (inlining is already handled by LLVM
for crate-local calls).
**Benchmark results (two independent back-to-back runs):**
Lexer-only (B6 targets lexer — most direct signal):
| Fixture | Before (r1) | After (r1) | Before (r2) | After (r2) |
|-----------------------|-------------|------------|-------------|------------|
| lexer/github_schema | 7.650ms | 7.741ms | 7.543ms | 7.562ms |
| lexer/large_schema | 6.287ms | 6.327ms | 6.277ms | 6.230ms |
| lexer/medium_schema | 1.328ms | 1.345ms | 1.321ms | 1.349ms |
| lexer/small_schema | 28.73us | 28.13us | 28.06us | 27.83us |
| lexer/starwars_schema | 37.60us | 37.83us | 37.10us | 37.20us |
Cross-parser comparison (controlled — all parsers in same run):
| Fixture | Before (r1) | After (r1) | Before (r2) | After (r2) |
|-----------------------------------|-------------|------------|-------------|------------|
| compare_.../libgraphql_.../github | 21.48ms | 21.35ms | 21.30ms | 21.33ms |
| compare_.../libgraphql_.../large | 24.50ms | 24.36ms | 24.08ms | 24.11ms |
| compare_.../libgraphql_.../medium | 4.823ms | 4.746ms | 4.813ms | 4.878ms |
| compare_.../graphql_parser/github | 9.600ms | 9.434ms | 9.463ms | 9.392ms |
| compare_.../graphql_parser/large | 9.661ms | 9.587ms | 9.533ms | 9.541ms |
| compare_.../apollo_parser/github | 14.93ms | 14.01ms | 15.01ms | 14.04ms |
| compare_.../apollo_parser/large | 11.05ms | 10.74ms | 10.70ms | 10.74ms |
Control parsers show the same magnitude of drift as libgraphql_parser
across both runs, confirming B6 has no effect above the noise floor.
**Machine:** Apple M2 Max, 12 cores, 64 GB RAM, macOS (Darwin 23.6.0, arm64)
**Rust:** rustc 1.90.0-nightly (0d9592026 2025-07-19)
**Verdict:** No measurable performance change across two independent
back-to-back runs. `starts_with()` for short literal patterns is
already well-optimized by the compiler. Code changes reverted.
---
## B7: `shrink_to_fit()` in `compact_buffer()` [LOW-MODERATE]
**Status:** Completed
**Priority:** 3
**File:** `src/graphql_token_stream.rs`
**Date:** 2026-02-08
**Problem:** After every buffer compaction (once per top-level definition),
`shrink_to_fit()` may trigger a reallocation to shrink Vec capacity, only for
the buffer to grow again for the next definition. For 1000+ definitions,
that's 1000+ potential realloc cycles.
**Change made:** Removed `shrink_to_fit()` call. Buffer retains capacity
between definitions.
**Trade-offs:** Slightly higher peak memory (~few KB retained). Negligible.
**Benchmark results (clean back-to-back runs):**
Schema parse (full parse, lexer + parser):
| Fixture | Before | After | Change |
|-----------------------|----------|----------|-------------|
| schema_parse/github | 21.55ms | 19.43ms | **-9.9%** |
| schema_parse/large | 24.34ms | 21.91ms | **-10.0%** |
| schema_parse/medium | 4.769ms | 4.155ms | **-12.9%** |
| schema_parse/small | 42.29us | 42.41us | ~0% |
| schema_parse/starwars | 91.03us | 89.64us | -1.3% |
Lexer (expected: no change — B7 is parser-level):
| Fixture | Before | After | Change |
|---------------------|----------|----------|--------|
| lexer/github_schema | 7.679ms | 7.646ms | ~0% |
| lexer/large_schema | 6.430ms | 6.318ms | -1.7% |
Impact scales with number of top-level definitions: medium (~200
types) and large (~1000 types) showed the biggest gains. Small
schemas with few definitions showed no change, confirming that the
improvement comes from reduced realloc churn in the compaction loop.
**Machine:** Apple M2 Max, 12 cores, 64 GB RAM, macOS (Darwin 23.6.0, arm64)
**Rust:** rustc 1.90.0-nightly (0d9592026 2025-07-19)
**Verdict:** Unexpectedly large improvement (10-13% on medium/large
schemas). Much bigger than the estimated LOW-MODERATE. Keeping.
---
## B8: No `[profile.bench]` in workspace Cargo.toml [LOW — MEASUREMENT ONLY]
**Status:** Pending
**Priority:** 7
**File:** `Cargo.toml` (workspace root)
**Problem:** No benchmark-specific profile. Adding LTO and single codegen unit
helps cross-crate inlining.
**Suggested fix:**
```toml
[profile.bench]
lto = "thin"
codegen-units = 1
```
**Trade-offs:** Slower benchmark compilation. Affects all parsers equally in
comparative benchmarks.
**Est. impact:** LOW for relative comparisons, potentially MODERATE for absolute
---
## B9: Dual UTF-16 column tracking on every character [LOW]
**Status:** Pending (subsumed by B2 if adopted)
**Priority:** 9
**File:** `src/token_source/str_to_graphql_token_source.rs:202`
**Problem:** Every non-newline char updates `curr_col_utf16 += ch.len_utf16()`.
For ASCII `len_utf16()` always returns 1. Small per-char overhead.
**Suggested fix:** Make UTF-16 tracking opt-in via constructor flag, or defer
to B2's lazy position computation.
**Trade-offs:** API change; consumers needing UTF-16 must opt in.
**Est. impact:** LOW standalone — subsumed by B2
---
## B11: Remove Box from syntax structs (inline them) [REVERTED]
**Status:** Reverted — catastrophic regression
**Date:** 2026-03-04
**Hypothesis:** Eliminating ~48 `Box::new()` heap allocations per AST node
construction would reduce allocation overhead, especially for large schemas
(~100K+ nodes for github).
**Change:** `Option<Box<XyzSyntax<'src>>>` → `Option<XyzSyntax<'src>>` across all
42 AST struct files + parser. Removed all `Box::new()` calls except the recursive
`element_type` in `ListTypeAnnotation`.
**Result:** Massive regression across the board.
| Benchmark | B11 | A.1 Baseline | Change |
|------------------|---------|--------------|--------|
| medium (default) | 5.87 ms | 3.86 ms | +52% |
| large (default) | 31.4 ms | 21.0 ms | +50% |
| github (default) | 23.5 ms | 19.5 ms | +20% |
| medium (lean) | 5.49 ms | ~1.92 ms | +186% |
| github (lean) | 22.0 ms | ~13.0 ms | +70% |
**Root cause:** `Option<Box<T>>` with `None` = 8 bytes (null pointer).
`Option<T>` with `None` = full `size_of::<T>()`. Inlining syntax structs bloated
every AST node — even in lean mode where syntax is always `None`. The larger
structs destroyed cache locality, overwhelming any savings from fewer heap
allocations. Lean mode was hit hardest (+150-400%) because every `None` field
now carries the full struct weight instead of a null pointer.
**Lesson:** `Option<Box<T>>` is the correct pattern for "expensive when present,
free when absent" optional data. The per-allocation cost of `Box::new()` (~25ns)
is far less than the cache miss penalty from bloated structs.
---
## B14: #[inline] on hot parser functions [REVERTED]
**Status:** Reverted — no statistically significant improvement
**Date:** 2026-03-04
**Change:** Added `#[inline]` to `peek_is_keyword()`, `peek_is()`,
`consume_token()`, `make_span()`, `make_span_ref()`, `token_kinds_match()`.
**Result:** No measurable improvement. LLVM already makes good inlining decisions
for intra-crate functions.
---
## B16: Vec::with_capacity() hints [REVERTED]
**Status:** Reverted — regression (7-14% on schema, 13-19% on lean)
**Date:** 2026-03-04
**Change:** Replaced ~38 `Vec::new()` calls with `Vec::with_capacity(N)` using
typical-size hints.
**Root cause:** `Vec::new()` is zero-allocation until first `push()`.
`Vec::with_capacity(N)` allocates immediately. For frequently-empty Vecs (like
`directives` on most nodes), pre-allocation wastes heap allocations that
`Vec::new()` avoids entirely.
---
## B10: SmallVec for commonly-small collection fields [SKIPPED]
**Status:** Skipped — not viable based on struct size analysis
**Date:** 2026-03-04
**Hypothesis:** Replace `Vec<T>` with `SmallVec<[T; N]>` for fields like
`directives`, `arguments`, and `implements` to avoid heap allocation for the
common 0-1 element case.
**Why we skipped it:** Struct size measurements revealed the core problem is
struct bloat, not allocation count. SmallVec inlines elements into the parent
struct, which would increase node sizes for the same cache-locality reasons that
killed B11. Key measurements:
| Type | Size (bytes) |
|------------------------|--------------|
| `Vec<T>` (any T) | 24 |
| `DirectiveAnnotation` | 192 |
| `Argument` | 336 |
| `Name` | 96 |
| `GraphQLToken` | 504 |
| `GraphQLSourceSpan` | 64 |
`SmallVec<[DirectiveAnnotation; 1]>` would add ~168 bytes per node vs `Vec`'s
constant 24 bytes — for a field that is usually empty. Same cache-locality
destruction as B11, just smaller scale.
**Root cause (shared with B11, B16):** The real bottleneck was that
`GraphQLToken` was 504 bytes and syntax structs were 504-1,512 bytes. Allocation
count is not the problem; struct size is. B19 subsequently addressed the token
size issue by boxing the Error variant, shrinking `GraphQLToken` from 504 to
304 bytes. Further gains may come from shrinking spans/tokens (B12).
---
## B19: Box the Error variant of GraphQLTokenKind [COMPLETED]
**Status:** Completed
**Date:** 2026-03-04
**Files:** `src/token/graphql_token_kind.rs`, `src/token/mod.rs`,
`src/graphql_parser.rs`, `src/token_source/str_to_graphql_token_source.rs`
(+ corresponding files in `libgraphql-macros`)
**Problem:** `GraphQLTokenKind` was 232 bytes because its `Error` variant
contained `GraphQLErrorNotes` (`SmallVec<[GraphQLErrorNote; 2]>` = 208 bytes)
plus a `String` (24 bytes). Since Rust enums are sized by their largest
variant, every token — including simple `Name`, `IntValue`, and punctuator
tokens — paid the 232-byte cost of the `Error` variant. This bloated
`GraphQLToken` to 504 bytes (span 64B + kind 232B + preceding_trivia 208B).
**Change made:** Extracted the Error variant's payload into a separate
`GraphQLTokenError` struct and boxed it:
```rust
// Before: Error { message: String, error_notes: GraphQLErrorNotes }
// After:
Error(Box<GraphQLTokenError>)
pub struct GraphQLTokenError {
pub message: String,
pub error_notes: GraphQLErrorNotes,
}
```
The existing `GraphQLTokenKind::error()` constructor abstracts the boxing,
so most call sites use the constructor without needing to know about the
`Box`. Pattern match sites were updated from `Error { message, .. }` to
`Error(err)` with `err.message` access.
**Size reduction:**
| Type | Before (bytes) | After (bytes) | Reduction |
|--------------------|----------------|---------------|-----------|
| `GraphQLTokenKind` | 232 | 32 | **7.25x** |
| `GraphQLToken` | 504 | 304 | **1.66x** |
| `GraphQLTokenError`| 232 | 232 | (no change; now heap-allocated only on error) |
**Why this works:** Errors are rare during parsing — most tokens are names,
punctuators, and keywords. By boxing the error payload, the `Error` variant
shrinks from 232 bytes to a single pointer-sized value, and the enum's overall size
drops to 32 bytes (determined by the next-largest variant, `StringValue`
with `Cow<str>`). The `Box` allocation only occurs when an actual error is
emitted, which is negligible. Every non-error token benefits from the
smaller struct: faster VecDeque moves, better cache locality, smaller AST
nodes (which embed tokens in syntax structs).
**Trade-offs:** Error construction now requires a heap allocation. Since
errors are rare and always terminate parsing soon after, this is
negligible. Pattern matching on `Error` is slightly less ergonomic
(`Error(err)` instead of `Error { message, .. }`).
**Benchmark results (full run, 300 samples, 20s measurement, 0.99 confidence):**
Machine: Apple M2 Max, 12 cores, 64 GB RAM, macOS (Darwin 23.6.0, arm64)
Rust: rustc 1.90.0-nightly (0d9592026 2025-07-19)
All CI widths under 2%, indicating highly reproducible measurements.
### Schema parse (standalone, default config)
Comparison against A.1 baseline (post-AST-regression, pre-B19) where
available. A.1 baseline values from the B11 entry.
| Fixture | A.1 Baseline | After B19 | Change |
|---------------|--------------|-----------|--------------|
| small | — | 37.7 µs | — |
| medium | 3.86 ms | 1.81 ms | **-53.1%** |
| large | 21.0 ms | 8.40 ms | **-60.0%** |
| starwars | — | 59.2 µs | — |
| github | 19.5 ms | 14.9 ms | **-23.6%** |
| shopify_admin | — | 29.6 ms | — |
### Schema parse (standalone, lean mode)
| Fixture | After B19 |
|---------------|-----------|
| starwars | 38.2 µs |
| github | 8.58 ms |
| shopify_admin | 17.4 ms |
### Executable parse (standalone, default config)
| Fixture | After B19 |
|---------------|-----------|
| simple_query | 2.51 µs |
| complex_query | 45.3 µs |
| nested_10 | 9.00 µs |
| nested_30 | 29.0 µs |
### Executable parse (standalone, lean mode)
| Fixture | After B19 |
|---------------|-----------|
| simple_query | 1.39 µs |
| complex_query | 25.3 µs |
| nested_10 | 5.09 µs |
| nested_30 | 16.3 µs |
### Cross-parser comparison (schema parse, after B19)
| Fixture | libgraphql | graphql_parser | apollo_parser |
|---------------|---------------|----------------|---------------|
| small | 47.4 µs | **44.0 µs** | 46.0 µs |
| medium | 2.97 ms | **1.99 ms** | 2.08 ms |
| large | 15.3 ms | **9.15 ms** | 9.95 ms |
| starwars | 59.3 µs | **50.6 µs** | 55.1 µs |
| github | 15.6 ms | **8.96 ms** | 12.9 ms |
| shopify_admin | 29.0 ms | **17.7 ms** | 27.3 ms |
### Cross-parser comparison (executable parse, after B19)
| Fixture | libgraphql | graphql_parser | apollo_parser |
|---------------|---------------|----------------|---------------|
| simple | **2.60 µs** | 2.89 µs | 3.02 µs |
| complex | 45.5 µs | 40.1 µs | **38.9 µs** |
### Lexer throughput (after B19)
| Fixture | Time | Throughput |
|---------------|----------|--------------|
| small | 13.9 µs | ~162 MiB/s |
| medium | 633 µs | ~159 MiB/s |
| large | 2.94 ms | ~162 MiB/s |
| starwars | 20.0 µs | ~199 MiB/s |
| github | 3.66 ms | ~319 MiB/s |
| shopify_admin | 7.76 ms | ~399 MiB/s |
**Verdict:** Massive improvement — the single largest optimization since B5.
Schema parsing improved 24-60% vs the A.1 baseline. `GraphQLToken` shrank
from 504 to 304 bytes (1.66x smaller), dramatically improving cache locality
and VecDeque throughput. libgraphql-parser now **wins on simple executable
queries** (2.60 µs vs 2.89 µs graphql-parser, 3.02 µs apollo-parser). Schema
parsing remains 1.5-1.7x behind graphql-parser on large schemas, but the gap
is significantly narrowed from the post-AST regression. The improvement even
surpasses the pre-AST performance for default-mode schema parsing (medium:
1.81 ms vs pre-AST 2.05 ms), confirming that the Error variant bloat was a
pre-existing bottleneck that was never addressed before.
---
## B20: `lex_string()` byte-scanning + memchr3 [HIGH]
**Status:** Completed
**Priority:** HIGH
**File:** `src/token_source/str_to_graphql_token_source.rs`
**Problem:** `lex_string()` uses `peek_char()`/`consume()` per character to scan single-line string bodies. These calls do bounds checks and ASCII tests on every byte. The function looks for sentinel bytes (`"`, `\`, `\n`, `\r`) but scans byte-by-byte instead of using SIMD-accelerated search.
**Suggested fix:** Replace the `loop { match self.peek_char() { ... } }` body with byte-scanning using `memchr::memchr3(b'"', b'\\', b'\n', &bytes[i..])` to jump directly to the next interesting byte, skipping all regular string content at 16–32 bytes/cycle. Bare `\r` is checked in the gap between matches (extremely rare in practice).
**Trade-offs:** Must handle escape sequences carefully — after `\`, the next byte must also be skipped (it could be `"` or `\` itself). Must preserve the existing error reporting behavior for unterminated strings and unescaped newlines. Used `memchr3` (3 needles) instead of `memchr4` since the crate only supports up to 3.
**Est. impact:** HIGH — `lex_string` is the only lexer scanning function not yet using byte-scanning.
### Benchmark results (B20)
#### Schema parsing
| Fixture | Before | After | Delta |
|---------------|------------|------------|---------|
| small | 27.67 µs | 27.75 µs | ~0% |
| medium | 1.591 ms | 1.546 ms | -2.8% |
| large | 8.233 ms | 7.917 ms | -3.8% |
| starwars | 35.07 µs | 33.89 µs | -3.3% |
| github | 8.614 ms | 8.191 ms | -4.9% |
| shopify_admin | 15.86 ms | 15.30 ms | -3.6% |
#### Executable parsing
| Fixture | Before | After | Delta |
|--------------------|------------|------------|---------|
| simple_query | 1.820 µs | 1.786 µs | -1.7% |
| complex_query | 29.39 µs | 29.21 µs | -0.6% |
| nested_depth_10 | 6.193 µs | 6.080 µs | -2.0% |
| nested_depth_30 | 18.73 µs | 18.53 µs | -1.1% |
| many_operations_50 | 115.5 µs | 112.2 µs | -2.8% |
#### Lexer throughput
| Fixture | Before | After | Throughput |
|---------------|------------|------------|--------------|
| small | 7.168 µs | 7.147 µs | ~315 MiB/s |
| medium | 336.2 µs | 338.8 µs | ~298 MiB/s |
| large | 1.558 ms | 1.590 ms | ~300 MiB/s |
| starwars | 9.526 µs | 9.511 µs | ~417 MiB/s |
| github | 2.216 ms | 2.172 ms | ~537 MiB/s |
| shopify_admin | 3.996 ms | 3.959 ms | ~782 MiB/s |
#### Cross-parser comparison
| Fixture | libgraphql | graphql-parser | apollo-parser |
|---------------|------------|----------------|---------------|
| small | **29.6 µs** | 44.2 µs | 46.1 µs |
| medium | **1.68 ms** | 1.97 ms | 2.09 ms |
| large | **8.75 ms** | 9.11 ms | 9.83 ms |
| starwars | **34.7 µs** | 49.2 µs | 54.3 µs |
| github | **8.51 ms** | 8.56 ms | 12.1 ms |
| shopify_admin | **15.0 ms** | 16.8 ms | 24.8 ms |
| simple query | **1.77 µs** | 2.86 µs | 2.96 µs |
| complex query | **28.3 µs** | 39.0 µs | 38.1 µs |
**Verdict:** Clear improvement. Schema parsing improved 2.8-4.9% on description-heavy schemas (medium through shopify_admin). Executable parsing improved 1.7-2.8%. libgraphql-parser now leads graphql-parser on the github schema (8.51ms vs 8.56ms), closing the last remaining competitive gap.
---
## B21: `is_name_continue_byte()` lookup table [MEDIUM-HIGH]
**Status:** Completed
**Priority:** MEDIUM-HIGH
**File:** `src/token_source/str_to_graphql_token_source.rs`
**Problem:** `is_name_continue_byte(b: u8) -> bool` uses `b == b'_' || b.is_ascii_alphanumeric()` which expands to multiple range checks. Called on every byte of every name in `lex_name()`'s tight loop. Names are the most frequent token type.
**Suggested fix:** Replace with a 256-byte `const` lookup table for O(1) branchless classification.
**Trade-offs:** 256 bytes of static data in the binary. Trivial cost — fits in a single L1 cache line pair and stays hot across the entire parse.
**Est. impact:** MEDIUM-HIGH — `lex_name()` is one of the lexer's hottest paths.
### Benchmark results (B21)
#### Lexer throughput (primary impact)
| Fixture | Before (B20) | After (B21) | Delta |
|---------------|--------------|-------------|---------|
| small | 7.147 µs | 7.042 µs | -1.4% |
| medium | 338.8 µs | 326.8 µs | -3.2% |
| large | 1.590 ms | 1.504 ms | -5.4% |
| starwars | 9.511 µs | 9.185 µs | -3.5% |
| github | 2.172 ms | 2.057 ms | -5.3% |
| shopify_admin | 3.959 ms | 3.727 ms | -5.9% |
Throughput: small ~320 MiB/s, medium ~309 MiB/s, large ~317 MiB/s, starwars ~432 MiB/s, github ~568 MiB/s, shopify_admin ~831 MiB/s.
#### Schema parsing (lean mode — most sensitive to lexer perf)
| Fixture | Before (B20) | After (B21) | Delta |
|---------------|--------------|-------------|---------|
| small | 18.74 µs | 18.40 µs | -1.8% |
| medium | 897.7 µs | 873.0 µs | -2.7% |
| large | 4.209 ms | 4.076 ms | -3.2% |
| starwars | 22.51 µs | 22.07 µs | -2.0% |
| github | 5.592 ms | 5.491 ms | -1.8% |
| shopify_admin | 10.71 ms | 10.22 ms | -4.6% |
#### Cross-parser comparison (after B21)
| Fixture | libgraphql | graphql-parser | apollo-parser |
|---------------|------------|----------------|---------------|
| small | **29.3 µs** | 43.5 µs | 45.4 µs |
| medium | **1.63 ms** | 1.91 ms | 2.01 ms |
| large | **8.38 ms** | 8.76 ms | 9.52 ms |
| starwars | **35.2 µs** | 49.0 µs | 54.0 µs |
| github | **8.50 ms** | 8.60 ms | 12.2 ms |
| shopify_admin | **14.9 ms** | 16.8 ms | 24.9 ms |
| simple query | **1.79 µs** | 2.85 µs | 2.94 µs |
| complex query | **28.8 µs** | 39.0 µs | 38.1 µs |
**Verdict:** Excellent improvement. Lexer throughput improved 3-6% across all benchmarks, with shopify_admin reaching 831 MiB/s (+6.2%). Lean schema parsing improved 2-5%. The lookup table eliminates multiple branch instructions per byte in `lex_name()`'s hot loop, replacing them with a single array-indexed load.
---
## B22: `parse_single_line_string()` fast path for no-escape strings [MEDIUM-HIGH]
**Status:** Reverted — no measurable improvement
**Priority:** MEDIUM-HIGH
**File:** `src/token/graphql_token_kind.rs`
**Problem:** `parse_single_line_string()` iterates every character via `chars().peekable()`, pushing each into a `String`. The vast majority of GraphQL strings contain no escape sequences, so a single `memchr` check + `memcpy` would suffice for the common case.
**Suggested fix:** Before the character loop, use `memchr::memchr(b'\\', content.as_bytes())` to check for backslashes. If none found, return `String::from(content)` directly — one allocation, one memcpy, done. When escapes are present, bulk-copy everything before the first backslash via `push_str(&content[..first_escape])`, then start the char-by-char loop only from the first escape onward.
**Trade-offs:** The `memchr` scan does useful work in both paths: in the fast path it confirms no escapes exist; in the slow path its result drives the bulk prefix copy. No wasted work in either case.
**Est. impact:** MEDIUM-HIGH — affects schema parsing benchmarks for description-heavy schemas.
**Benchmark results (vs post-B21 baseline):**
Two variants tested:
1. **Unconditional memchr** (memchr on all strings): Showed 1.5–4.8% regressions across lean parsers and lexer. The memchr setup cost exceeded savings for typical short GraphQL strings. Lexer regressions (which don't call this function) indicated code layout / I-cache effects from the larger function body.
2. **Length-guarded memchr** (`content.len() > 32`): Extracted long-string path into separate function to reduce code layout impact. Results were mixed/neutral vs baseline — schema_parse showed small improvements on some inputs (-1.0% to -1.8%) but schema_parse_lean showed +1.2% to +3.0% regressions. Deltas were within run-to-run variance (the same benchmark duplicated via compare_schema_parse disagreed with schema_parse on direction of change).
**Verdict:** Reverted. The optimization's theoretical benefit doesn't materialize in practice because: (a) most GraphQL strings in benchmark fixtures are short (<32 bytes), and (b) the compiler already optimizes the simple char-by-char loop effectively. The memchr setup overhead dominates for short strings, and for long strings the improvement is lost in noise.
---
## B23: `skip_whitespace()` lookup table [LOW-MEDIUM]
**Status:** Reverted — regression
**Priority:** LOW-MEDIUM
**File:** `src/token_source/str_to_graphql_token_source.rs`
**Problem:** `skip_whitespace()` uses a 4-way `match` plus a BOM check on every byte. Called at the start of every lexer loop iteration. Already byte-scanning (from B2), but the match could be replaced with a lookup table to reduce branching.
**Suggested fix:** Use a 256-byte `const WHITESPACE_TABLE` for the main whitespace bytes. BOM handling stays as a special case (0xEF leadbyte is rare).
**Trade-offs:** The compiler may already optimize the current match into a similar form. This optimization may show no measurable improvement.
**Est. impact:** LOW-MEDIUM — called very frequently but processes few bytes per call.
**Benchmark results (vs post-B21 baseline, benchmark stopped early — regressions clear):**
| Category | Benchmark | Delta vs B21 |
|----------|-----------|-------------|
| schema_parse | large | -1.3% |
| schema_parse | github | -1.7% |
| executable_parse | simple | **+3.4%** ⚠️ |
| executable_parse | complex | **+1.6%** |
| executable_parse_lean | simple | **+5.0%** ⚠️ |
| executable_parse_lean | complex | **+3.5%** ⚠️ |
| executable_parse_lean | nested_10 | **+3.2%** ⚠️ |
| executable_parse_lean | many_ops | **+2.0%** |
| lexer | github | **+1.5%** |
| lexer | shopify_admin | **+1.5%** |
**Verdict:** Reverted. The compiler already optimizes the 4-way `match` on ASCII byte values into efficient code (likely a comparison chain or small jump table). Replacing it with a 256-byte lookup table added an extra memory indirection that hurt performance. The regression pattern — worse on executable documents (which have more whitespace-delimited tokens relative to their size) and on large lexer inputs — confirms the lookup table is slower than the compiler's native match optimization for this small set of 4 byte values.
---
## B24: `lex_number()` byte-scanning [LOW]
**Status:** Reverted — regression
**Priority:** LOW
**File:** `src/token_source/str_to_graphql_token_source.rs`
**Problem:** `lex_number()` uses `peek_char()`/`consume()` per digit across three scanning phases (integer, decimal, exponent). Numbers are typically short in GraphQL, but the per-character overhead adds up.
**Suggested fix:** Convert the digit-scanning loops to direct byte-scanning using `bytes[i].is_ascii_digit()` and batch `curr_byte_offset` update, consistent with `lex_name()`'s approach.
**Trade-offs:** More complex code for handling the various number phases. Unlike `memchr`-based optimizations, this is a constant-factor improvement (avoids method call overhead), not a SIMD improvement. Numbers are typically short (1–5 digits), limiting the payoff.
**Est. impact:** LOW — numbers are less frequent than names and strings in GraphQL.
**Benchmark results (vs post-B21 baseline):**
Consistent 1–3% regression across all benchmarks, with some cross-parser comparisons up to +5.6%. Numbers are rare in GraphQL schemas, so the optimization had almost no opportunity to show improvement. The changed function size caused I-cache layout perturbations that regressed neighboring hot paths — the same effect observed in B23.
| Benchmark | B21 baseline | B24 | Delta |
|---|---|---|---|
| schema_parse/small | 29.041 µs | 29.024 µs | -0.06% |
| schema_parse/medium | 1.6165 ms | 1.6153 ms | -0.07% |
| schema_parse/large | 8.3346 ms | 8.1871 ms | -1.77% |
| schema_parse/starwars | 35.629 µs | 35.376 µs | -0.71% |
| schema_parse/github | 8.5025 ms | 8.4452 ms | -0.67% |
| schema_parse/shopify_admin | 15.355 ms | 15.428 ms | +0.48% |
| exec_parse/simple_query | 1.8003 µs | 1.8188 µs | +1.03% |
| exec_parse/complex_query | 28.794 µs | 29.038 µs | +0.85% |
| exec_parse/nested_depth_10 | 6.0716 µs | 6.1138 µs | +0.70% |
| exec_parse/nested_depth_30 | 18.381 µs | 18.485 µs | +0.57% |
| exec_parse/many_ops_50 | 113.09 µs | 113.56 µs | +0.42% |
| schema_parse_lean/small | 18.397 µs | 18.521 µs | +0.67% |
| schema_parse_lean/medium | 872.98 µs | 889.98 µs | +1.95% |
| schema_parse_lean/large | 4.0757 ms | 4.1001 ms | +0.60% |
| schema_parse_lean/starwars | 22.067 µs | 22.332 µs | +1.20% |
| schema_parse_lean/github | 5.4914 ms | 5.4672 ms | -0.44% |
| schema_parse_lean/shopify_admin | 10.216 ms | 10.234 ms | +0.18% |
| exec_parse_lean/simple_query | 1.0764 µs | 1.0886 µs | +1.13% |
| exec_parse_lean/complex_query | 15.997 µs | 16.129 µs | +0.83% |
| exec_parse_lean/nested_depth_10 | 3.5065 µs | 3.5452 µs | +1.10% |
| exec_parse_lean/nested_depth_30 | 10.948 µs | 11.046 µs | +0.90% |
| exec_parse_lean/many_ops_50 | 65.439 µs | 65.124 µs | -0.48% |
| lexer/small_schema | 7.0416 µs | 6.9992 µs | -0.60% |
| lexer/medium_schema | 326.75 µs | 327.90 µs | +0.35% |
| lexer/large_schema | 1.5039 ms | 1.5223 ms | +1.22% |
| lexer/starwars_schema | 9.1849 µs | 9.3260 µs | +1.54% |
| lexer/github_schema | 2.0566 ms | 2.0951 ms | +1.87% |
| lexer/shopify_admin_schema | 3.7270 ms | 3.7764 ms | +1.32% |
| compare/libgraphql/small | 29.296 µs | 29.876 µs | +1.98% |
| compare/libgraphql/medium | 1.6263 ms | 1.6634 ms | +2.28% |
| compare/libgraphql/large | 8.3773 ms | 8.5371 ms | +1.91% |
| compare/libgraphql/starwars | 35.244 µs | 35.897 µs | +1.85% |
| compare/libgraphql/github | 8.5024 ms | 8.7623 ms | +3.06% |
| compare/libgraphql/shopify_admin | 14.918 ms | 15.754 ms | +5.60% |
| compare_exec/libgraphql/simple | 1.7905 µs | 1.8180 µs | +1.54% |
| compare_exec/libgraphql/complex | 28.768 µs | 29.342 µs | +1.99% |
**Conclusion:** Reverted. The byte-scanning rewrite changed the function size enough to perturb I-cache alignment of adjacent hot functions (`lex_name`, `skip_whitespace`, `lex_string`). Since numbers are rare in GraphQL, the per-digit scanning speedup was negligible while the code layout effect caused widespread ~1–3% regressions. This is the same phenomenon observed in B23.