# Performance notes — `ktav` Rust crate

This document collects every principle, library choice, and pattern used to
get the Rust implementation of Ktav to its current speed. It is written
against the state as of commit `ceb8184` (phase 6) and draws on the full
commit history from the initial import forward.

The crate was accelerated from a naïve first cut to the current numbers
across seven perf rounds. Target fixture: `parse_to_struct/100_upstreams_typed`
(22 KB of Ktav with 100 nested records). Typical representative moves:

| Round                | Snapshot impact on typed deserialize |
|----------------------|--------------------------------------|
| initial import       | baseline                             |
| phase 1              | indent, validation, borrowed multiline |
| phase 2              | `itoa` / `ryu` (serializer hot path) |
| phase 3              | `CompactString` in Value              |
| phase 4              | capacity hints, `IndexMap::entry`, fast paths |
| phase 5              | **bump-arena for `ThinValue`: −34 %** |
| phase 6              | memchr shortcut on multiline terminator |

The cumulative effect is large: deserialize time dropped from the first
internal measurement (~**1 ms** on an early revision) to ~**207 µs** —
roughly a **5× improvement**, without a single `unsafe` block in `src/`.

The remainder of this document explains *why* each move helped, *how* it
interacts with the rest of the code, and — equally important — **which
seemingly obvious optimizations were measured and rejected**.

---

## 1. Philosophy: measure, don't guess

Every number in this doc comes from Criterion long-sample benchmarks (50
samples × 5 s measurement, 2 s warmup). The failed-experiment list at the
end of this doc is a long one: `scan_line` (unified forward scan), byte-match
classification, `Bump::with_capacity`, custom integer parsers, `lto = "fat"`
for the bench profile. Each of those *looked* like it should help; each of
them measured worse or flat.

The working rules:

1. **Land one change at a time**, bench it against the previous commit as a
   criterion baseline, decide to commit or revert on measured signal — not
   on "it should be faster because…".
2. **A uniform direction across all benchmarks is signal**; a mixed bag is
   almost always noise. Thermal drift and rebuild layout shifts routinely
   move individual benches by ±5 % between runs.
3. **Trust the standard library.** `str::trim`, `str::find`, memchr-backed
   `<[u8]>::contains`, and `str::parse::<u16>` all ship with SIMD/highly
   tuned implementations. Hand-rolled forward scans that do "the same work
   in one pass" routinely lose because LLVM can't vectorize them.
4. **Invariants beat early exits.** If the pipeline guarantees input is
   already `trim_end`'d, downstream code should not call `trim()` again.
   Thread the invariant through types/comments; save the bytes.
5. **No `unsafe`.** A stated project rule (`src/` grep is clean — the only
   `unsafe` tokens are in comments describing code that used to exist).
   Every allocation trick below is in safe Rust.

---

## 2. Memory principles

### 2.1 Arena allocation (bumpalo) — the single biggest win

The `from_str<T>` path used to allocate per-compound: every `ThinValue::Array`
and `ThinValue::Object` held a heap `Vec`, every multi-line string held a
heap `String`, every multi-line collecting buffer was another `Vec<&str>`.
For a 100-record input that worked out to ~500 heap allocations per call,
with matching frees at the end.

**Phase 5** moved the whole zero-copy path into a `bumpalo::Bump`:

```rust
// src/lib.rs
pub fn from_str<T: DeserializeOwned>(s: &str) -> Result<T> {
    let bump = bumpalo::Bump::new();
    let thin = thin::parse_thin(s, &bump)?;
    T::deserialize(thin::ThinDeserializer::new(thin))
}
```

Inside `thin/parser.rs`, `Frame::Object { pairs, .. }` and
`Frame::Array { items }` hold `bumpalo::collections::Vec<'a, _>`. Subvectors
are allocated via `BumpVec::with_capacity_in(4, bump)`. Multi-line content
that has to be re-assembled goes through `Bump::alloc_str`. When `from_str`
returns, one `Drop` on the `Bump` frees every byte in constant time.
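
A minimal self-contained sketch of the pattern (toy example, not crate
code; `Bump::new`, `alloc_str`, and `with_capacity_in` are bumpalo's real
APIs):

```rust
use bumpalo::{collections::Vec as BumpVec, Bump};

fn demo() {
    let bump = Bump::new();
    // Arena-rooted growable vector: its backing storage lives in `bump`.
    let mut items: BumpVec<'_, &str> = BumpVec::with_capacity_in(4, &bump);
    // Normalized content is copied into the arena and comes back as &str…
    items.push(bump.alloc_str("copied into the arena"));
    // …while untouched content stays a plain borrow.
    items.push("still a borrow");
    assert_eq!(items.len(), 2);
} // `items`, then `bump`, drop here: teardown is one constant-time free
```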

Measured effect: **−34 % on `parse_to_struct`, −29 % on `roundtrip`,
−10 % on `multiline_dedent/*`** vs. phase 4.

**Caveat (phase 11 reject):** `Bump::with_capacity(s.len())` — pre-sizing
the arena to the input — *hurt* by 7–27 %. Lazy chunk growth amortizes
cheaply; an up-front large chunk forces the allocator to hand out a
first-class large block before anyone asks for it.

### 2.2 Zero-copy borrowing: `ThinValue<'a>`

`ThinValue` carries slices straight out of the input buffer wherever the
text form is a contiguous substring of that buffer:

```rust
// src/thin/value.rs
pub enum ThinValue<'a> {
    Null,
    Bool(bool),
    Integer(&'a str),
    Float(&'a str),
    Str(&'a str),
    Array(BumpVec<'a, ThinValue<'a>>),
    Object(BumpVec<'a, (&'a str, ThinValue<'a>)>),
}
```

Key design decisions:

* **Object keys are `&'a str`** — no `String` wrapper. Dotted paths
  (`a.b.c`) produce three slices into the same buffer via
  `str::split('.')`, still zero-allocation.
* **No `Cow`.** Before phase 5 `Integer`/`Float`/`Str` were
  `Cow<'a, str>`; phase 5 collapsed them to `&'a str`. Normalization cases
  (leading `+` strip, multi-line dedent) copy into the arena via
  `Bump::alloc_str`, and the returned slice is still `&'a str` (see the
  sketch after this list).
* **The deserializer prefers `visit_borrowed_str`**, so structs declared
  as `String` get a single copy at the serde boundary and arena slices
  declared as `&'de str` skip even that.
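
The `Cow`-less normalization rule, as a minimal sketch (hypothetical
helper; `Bump::alloc_str` is bumpalo's real API):

```rust
use bumpalo::Bump;

// Borrowed-vs-arena is decided in exactly one place: a single line stays
// a borrow of the input; multiple lines are re-assembled once into the
// arena. Either way the caller receives a plain `&'a str`.
fn join_lines<'a>(lines: &[&'a str], bump: &'a Bump) -> &'a str {
    match lines {
        &[only] => only,
        _ => bump.alloc_str(&lines.join("\n")),
    }
}
```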

### 2.3 Inline short strings: `CompactString`

The owned `Value` path (`parse()` → `Value`) keeps scalar text in
`compact_str::CompactString`. It is `Deref<Target = str>`, so callers treat
it as a `String`, but:

* Strings ≤ 24 bytes on 64-bit live *inline* in the 24-byte struct — no
  heap allocation for port numbers, identifiers, short hostnames, or
  booleans-as-strings.
* Longer strings fall back to a heap allocation transparently.

Used pervasively:

```rust
// src/value/value.rs
pub type Scalar = CompactString;

// src/value/object_map.rs
pub type ObjectMap = IndexMap<Scalar, Value, FxBuildHasher>;
```

Phase 3 swapped `String → CompactString` across `Value`, the parser, and
the serializer. The landed rule: never hold scalar text as `String` if
`CompactString` can substitute; pay the heap allocation only when the
scalar is genuinely long.
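
A quick illustration of the inline/heap split (`is_heap_allocated` is
compact_str's real probe):

```rust
use compact_str::CompactString;

fn main() {
    // "8080" fits the 24-byte inline buffer: no heap traffic.
    let port = CompactString::from("8080");
    assert!(!port.is_heap_allocated());

    // A 40-byte scalar spills to the heap transparently.
    let long = CompactString::from("a".repeat(40));
    assert!(long.is_heap_allocated());

    // Deref<Target = str>: callers keep treating it as a &str.
    assert_eq!(port.len(), 4);
}
```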

### 2.4 Capacity hints, everywhere

`Vec::new()` and `String::new()` don't allocate — that's the right default
for conditionally populated collections. But when the eventual size is
known-even-loosely, hinting removes one or more doubling reallocations:

```rust
// Owned parser (src/parser/frame.rs)
Frame::Object { pairs: ObjectMap::with_capacity_and_hasher(4, FxBuildHasher), ... }

// Owned parser Collecting (src/parser/collecting.rs)
Self { mode, lines: Vec::with_capacity(8) }

// Thin parser (src/thin/parser.rs)
pairs: BumpVec::with_capacity_in(4, bump)
```

The serializer opens the output buffer at 2 KB instead of 256 bytes; for a
22 KB target that skips the first three doubling reallocs (256 B → 2 KB):

```rust
// src/ser/text_serializer.rs
pub fn to_string<T: ?Sized + Serialize>(value: &T) -> Result<String> {
    let mut out = String::with_capacity(2048);
    ...
}
```

**Size-hint source:** when serde hands us `len: usize` (struct field count,
tuple len, known-len map), use it:

```rust
// src/ser/value_serializer.rs
fn serialize_struct(self, _name: &'static str, len: usize) -> Result<StructSerializer> {
    Ok(StructSerializer {
        entries: IndexMap::with_capacity_and_hasher(len, FxBuildHasher),
    })
}
```

**Caveat:** the phase-5-arena path uses `with_capacity_in(4, bump)`, not
`new_in(bump)`. Phase 7 tested swapping to `new_in` (lazy) — it helped
owned-parse slightly but *hurt* thin parse by 5 % because typical objects
in the test fixtures have 4–6 keys, so reserving 4 up front skips a
doubling.

---

## 3. Library choices

| Crate | Role | Why |
|---|---|---|
| `bumpalo` (`3`, `collections` feature) | Arena for thin-path temporaries | One drop frees everything; `BumpVec` for arena-rooted growable vectors |
| `compact_str` (`0.9`, `serde` feature) | Inline short-string storage | 24 inline bytes covers typical config scalars with zero heap traffic |
| `indexmap` (`2`, `serde` feature) | Ordered `Scalar → Value` maps | Preserves struct-field order through ser/de; supports `entry()` API |
| `rustc-hash` (`2`) | `FxBuildHasher` for `IndexMap` | Faster than the default SipHash; a config parser doesn't need hash-flood resistance |
| `itoa` (`1`) | Integer → text in serializer | Skips `fmt::Formatter` overhead |
| `ryu` (`1`) | `f32`/`f64` → text | Shortest-round-trip float representation, fast |
| `serde` (`1`, `derive`) | De/serialization | Required contract |

Dev-only: `criterion` for measurement, `serde_json` (`preserve_order` +
`arbitrary_precision`) for conformance oracles.

**Why `IndexMap` and not `HashMap`:** `HashMap` would destroy struct field
order, which breaks round-trip serialization. `IndexMap` gives O(1) lookup
*and* insertion-order iteration for the price of a parallel `Vec` of
indices. Under `FxBuildHasher` it's faster than `HashMap` with the default
SipHash.
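
A sketch of the difference (real `indexmap` / `rustc-hash` APIs):

```rust
use indexmap::IndexMap;
use rustc_hash::FxBuildHasher;

fn main() {
    let mut m: IndexMap<&str, u32, FxBuildHasher> = IndexMap::default();
    m.insert("zeta", 1);
    m.insert("alpha", 2);
    // Iteration replays insertion order, so field order survives a
    // parse → render round trip; HashMap would shuffle it.
    let keys: Vec<&str> = m.keys().copied().collect();
    assert_eq!(keys, ["zeta", "alpha"]);
}
```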

**Why `FxBuildHasher`:** SipHash is hash-flood-resistant at the cost of
throughput. A configuration parser is not exposed to adversarial inputs in
the threat model this crate targets; Fx gives a measurable speedup and
deterministic iteration order.

---

## 4. Code patterns

### 4.1 Fast paths for the common case

The hottest code path picks the dominant input shape and handles it without
going through the full generality. Two examples.

**Non-dotted keys** — in Ktav you *can* write `a.b.c: 10` but you rarely
do. Split the insertion function so the dotted case is a fallback and the
non-dotted case is a flat series of checks:

```rust
// src/thin/parser.rs
fn insert_pair<'a>(
    target: &mut BumpVec<'a, (&'a str, ThinValue<'a>)>,
    path: &'a str,
    value: ThinValue<'a>,
    line_num: usize,
    bump: &'a Bump,
) -> Result<()> {
    // Fast path: no dotted segments — the vast majority of inserts.
    if !path.as_bytes().contains(&b'.') {
        if !is_valid_key(path) { ... }
        if target.iter().any(|(k, _)| *k == path) { ... /* dup */ }
        target.push((path, value));
        return Ok(());
    }
    insert_at_path(target, path, value, line_num, path, bump)
}
```

The owned parser mirrors this in `src/parser/insert.rs`.

**`IndexMap::entry()` over `contains_key` + `insert`** — the old code took
two hash lookups per insertion (check, then insert). The `entry` API
consolidates to one:

```rust
// src/parser/insert.rs
match table.entry(path.into()) {
    Entry::Occupied(_) => Err(/* duplicate */),
    Entry::Vacant(v) => { v.insert(value); Ok(()) }
}
```

For dotted insertion the gain is larger — the naïve code did
`contains_key`, `insert`, `get_mut` (three lookups); `entry().or_insert_with`
does it in one.
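
The dotted-path step, as a sketch (toy `Value` stand-in, not the crate's
own type):

```rust
use indexmap::IndexMap;

// Minimal stand-in for the crate's owned `Value`.
enum Value {
    Leaf(String),
    Object(IndexMap<String, Value>),
}

// One hash lookup both finds an existing intermediate table and creates
// a missing one; the contains_key / insert / get_mut sequence paid three.
fn descend<'m>(
    table: &'m mut IndexMap<String, Value>,
    segment: &str,
) -> &'m mut Value {
    table
        .entry(segment.to_owned())
        .or_insert_with(|| Value::Object(IndexMap::new()))
}
```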

### 4.2 Invariant-tracking to avoid redundant work

The parser used to call `trim()` three or four times per line:
`raw.trim()` at the top of `handle_line`, then `line[..colon].trim()`,
then `after.trim()`, then `classify()` would call `text.trim()` again.
Each trim is two byte scans (start + end).

Phase 4 fixed each redundant call by threading the invariant through:

* `handle_line` does `raw.trim()` once.
* Downstream receives already-trimmed input; comments and types document
  the invariant.
* Functions that need only leading- or trailing-whitespace cleanup use
  `trim_start`/`trim_end` — not the full `trim()`.

Typical doc-comment:

```rust
/// `trimmed` MUST be trim_start'ed (no leading whitespace). Trailing
/// whitespace has been removed earlier in the pipeline (raw.trim() at
/// the top of handle_line). Don't trim again here — we'd pay O(len)
/// for a no-op on the hot path.
fn classify<'a>(trimmed: &'a str, line_num: usize) -> Result<ValueStart<'a>> { ... }
```

### 4.3 `memchr`-backed shortcut for rare-byte checks

Phase 6: inside a multi-line string, every line used to do `raw.trim()` and
compare against `")"` or `"))"`. Most multi-line content has no `)` byte
at all. `<[u8]>::contains(&b')')` is memchr-backed — a single SIMD scan
that rejects a typical payload line in nanoseconds. Guard the
expensive work with it:

```rust
// src/thin/parser.rs (phase 6)
if let Some(ref mut collecting) = self.collecting {
    if raw.as_bytes().contains(&b')') {
        let trimmed = raw.trim();
        let term = match collecting.mode { ... };
        if trimmed == term { /* close */ }
    }
    collecting.lines.push(raw);
    return Ok(());
}
```

All ten benchmarks moved in the same direction (−0.7 % to −7.6 %) — a
small but clean win, precisely because the shortcut is trivially correct
and stdlib does the heavy lifting.

### 4.4 Static-slice pool for constant content

The serializer writes indentation (4 spaces per level) tens of thousands of
times in a large render. The first implementation did `out.extend(iter::repeat(' ').take(n))`,
which LLVM compiles to a per-byte loop. The second made the same writes
through `unsafe { out.as_mut_vec() }.extend_from_slice(...)`. The third —
`unsafe`-free and faster — is a pre-built all-spaces `const` string sliced
into the output via `push_str`:

```rust
// src/render/helpers.rs
pub(super) fn push_indent(out: &mut String, level: usize) {
    const SPACES: &str =
        "                                                                "; // 64
    let mut remaining = level * INDENT.len();
    if remaining == 0 { return; }
    out.reserve(remaining);
    while remaining > 0 {
        let chunk = remaining.min(SPACES.len());
        out.push_str(&SPACES[..chunk]);
        remaining -= chunk;
    }
}
```

`push_str` compiles to `memcpy`, which is SIMD-friendly. The same pattern
lives in `src/ser/text_serializer.rs` (`write_indent`).

### 4.5 `#[cold]` / `#[inline(never)]` for slow paths

When a function has a hot branch and a rarely-taken slow branch, splitting
the slow part out with `#[cold] #[inline(never)]` keeps the hot path free
of unrelated code (better icache and branch prediction):

```rust
// src/render/helpers.rs
pub(super) fn needs_raw_marker(s: &str) -> bool {
    match s.as_bytes().first() {
        None => false,
        Some(&b' ') | Some(&b'\t') => needs_raw_marker_slow(s.trim_start()),
        Some(&b'{') | Some(&b'[') => true,
        Some(_) => matches!(s, "null" | "true" | "false" | "(" | "((" | "()" | "(())"),
    }
}

#[cold]
#[inline(never)]
fn needs_raw_marker_slow(t: &str) -> bool {
    t.starts_with('{') || t.starts_with('[')
        || matches!(t, "null" | "true" | "false" | "(" | "((" | "()" | "(())")
}
```

The hot path — 99 % of scalars that don't start with whitespace — branches
on one byte and returns. The whitespace-prefix case pays the cost of a
full `trim_start` and gets a proper check, but that path is marked cold so
it lives elsewhere in the binary.

### 4.6 Byte-level validation

Key validation rejects ASCII whitespace and a small set of reserved bytes.
UTF-8 decoding is overkill for this — the forbidden set is pure ASCII, so
iterate bytes directly:

```rust
// src/parser/validate.rs
#[inline]
pub(super) fn is_valid_key(k: &str) -> bool {
    !k.is_empty()
        && !k.as_bytes().iter().any(|&b| {
            b.is_ascii_whitespace()
                || matches!(b, b'[' | b']' | b'{' | b'}' | b':' | b'#')
        })
}
```

`b.is_ascii_whitespace()` is a byte test; `matches!` compiles to a bitmask.
The old `char::is_whitespace` over `s.chars()` decoded UTF-8 for every
character, and also *matched* Unicode whitespace (NBSP etc.), which the
Ktav spec explicitly allows in keys.

### 4.7 Write-directly-into-buffer serialization

An earlier serializer did `let s = format!(...); out.push_str(&s);` for
each field. Commit `60d57d9` replaced every such call with
`write!(out, "{...}")` or the direct typed helpers:

```rust
// src/ser/text_serializer.rs
fn push_int_pair<I: itoa::Integer>(out: &mut String, v: I) {
    out.push_str(":i ");
    let mut buf = itoa::Buffer::new();
    out.push_str(buf.format(v));
    out.push('\n');
}
```

`itoa::Buffer` is a fixed-size `[u8; 40]` inside the caller's stack — no
allocation — and `Buffer::format` is the specialized integer-to-ASCII
routine. `ryu::Buffer` plays the same role for floats. This was phase 2.
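
The float twin is the same shape, sketched by analogy (the `:f` marker is
illustrative, not confirmed from the source):

```rust
// Hypothetical float counterpart of push_int_pair. `ryu::Buffer` is a
// stack buffer; `format` emits the shortest round-trip representation.
fn push_float_pair(out: &mut String, v: f64) {
    out.push_str(":f "); // marker assumed by analogy with ":i "
    let mut buf = ryu::Buffer::new();
    out.push_str(buf.format(v));
    out.push('\n');
}
```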

### 4.8 `Cow` at boundaries (before arena)

Pre-phase-5 thin path kept `Cow<'a, str>` in `ThinValue` — borrowed when
the text was a buffer slice, owned when it had to be normalized. Phase 5
removed the `Cow` (arena took over the "owned when normalized" case), but
the pattern remains in `finalize_multiline` conceptually: single-line
cases return a direct buffer borrow; multi-line cases copy into the arena.

The principle: **have one place in the pipeline where `Borrowed` vs
`Owned` is decided, not a boolean sprinkled through the code.**

### 4.9 Single-line fast paths in multi-line finalization

A multi-line string with exactly one content line is common enough to be
worth a special case — both because it saves the `Vec::join` allocation
and because for `stripped` mode the single line's `trim_start` is the
final answer:

```rust
// src/thin/parser.rs
fn finalize_multiline<'a>(c: Collecting<'a>, bump: &'a Bump) -> &'a str {
    match c.mode {
        MultilineMode::Verbatim if c.lines.len() == 1 => c.lines[0],
        MultilineMode::Verbatim => { /* join into arena */ }
        MultilineMode::Stripped if c.lines.len() == 1 => {
            let only = c.lines[0];
            if only.trim().is_empty() { "" } else { only.trim_start() }
        }
        MultilineMode::Stripped => { /* dedent into arena */ }
    }
}
```

---

## 5. Serde integration

### 5.1 Deserializer prefers borrowed visits

The thin deserializer always routes to `visit_borrowed_str` when the
content is arena- or input-rooted:

```rust
// src/thin/deserializer.rs
fn deserialize_str<V: Visitor<'de>>(self, visitor: V) -> Result<V::Value> {
    match self.value {
        ThinValue::Str(s) | ThinValue::Integer(s) | ThinValue::Float(s) => {
            visitor.visit_borrowed_str(s)
        }
        ...
    }
}
```

A caller asking for `&'de str` gets it zero-copy; a caller asking for
`String` still gets one copy at the boundary (serde's `String` visitor
copies a borrowed `&str` once), but no extra copy to reach the visitor.

Same applies to `deserialize_bytes` → `visit_borrowed_bytes`.

### 5.2 Borrowed keys through a mini-deserializer

Struct-field lookup compares the declared field name to the incoming key
string. The thin path exposes keys as `&'de str`, via a purpose-built
deserializer that forwards everything relevant to `visit_borrowed_str`:

```rust
// src/thin/deserializer.rs
struct BorrowedStrDeserializer<'de> {
    value: &'de str,
}

impl<'de> Deserializer<'de> for BorrowedStrDeserializer<'de> {
    fn deserialize_str<V: Visitor<'de>>(self, visitor: V) -> Result<V::Value> {
        visitor.visit_borrowed_str(self.value)
    }
    ...
}
```

Serde's default `IntoDeserializer<'_, _>` for `String` forces an allocation
per key; the borrowed variant skips it entirely.

### 5.3 `visit_str` vs `visit_string` on `CompactString`

Subtle: on the owned-`Value` path, scalar text lives in `CompactString`.
Owned-path `deserialize_str` could call `visitor.visit_string(s.into_string())`,
but that forces an allocation for every inline-short scalar. Instead:

```rust
// src/de/value_deserializer.rs
fn deserialize_str<V: Visitor<'de>>(self, visitor: V) -> Result<V::Value> {
    match &self.value {
        Value::String(s) | Value::Integer(s) | Value::Float(s) => visitor.visit_str(s),
        ...
    }
}
```

`visit_str(&s)` hands the visitor a borrowed `&str` — serde's own `String`
visitor then allocates *once*, and only when the caller actually asked for
an owned string. The old `visit_string(s.into_string())` cost a heap
allocation per scalar regardless.

### 5.4 `deserialize_any` still surfaces numeric types

Typed markers (`:i 123`) should arrive in `serde_json::Value` as an
integer, not a string. `deserialize_any` tries the numeric coercion first:

```rust
// src/thin/deserializer.rs
ThinValue::Integer(s) => {
    if let Ok(i) = s.parse::<i64>() { visitor.visit_i64(i) }
    else if let Ok(u) = s.parse::<u64>() { visitor.visit_u64(u) }
    else { visitor.visit_borrowed_str(s) } // arbitrary-precision fallback
}
```

Fallback preserves arbitrary-precision literals as strings — never silently
drops digits.
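
Usage sketch (assumes the `key:i value` syntax shown earlier, parsed as a
one-key top-level object):

```rust
fn main() {
    // The typed marker surfaces as a JSON number, not a string.
    let v: serde_json::Value = ktav::from_str("count:i 123\n").unwrap();
    assert_eq!(v["count"], serde_json::json!(123));
}
```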

---

## 6. What didn't work, and why (the reject pile)

These were all measured; each has a commit history or a documented reason
they were rolled back. They are *more* useful than the commits that
landed: they draw the line between "natural-looking optimization" and
"actual optimization".

### 6.1 Unified forward line-scanner (phase 6, reverted)

Idea: replace `raw.trim()` + `trimmed.find(':')` + `"}"` / `"]"` equality
checks with one forward pass that tracks start, end, and first colon in
one loop. Looks like a clear win — fewer passes, same information.

**Measured:** `parse_to_value` **+14 % to +58 %** across benches. Reverted.

**Why it lost:** `str::trim` and `str::find` are memchr/SIMD backed in
stdlib. They scan at L1-cache speed. A hand-written forward loop with
*two* conditionals per byte (`== b':'`, `is_ascii_whitespace`) defeats
vectorization — LLVM can't prove the scans are independent, so it emits
scalar code. Two SIMD passes beat one scalar pass by a factor of three.

**Lesson:** when the stdlib already has a tight primitive for your scan,
don't replace it with a "more efficient" loop that fuses two different
checks. Call the primitives back-to-back; the extra pass is nearly free.
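
The shape that won, as a sketch (hypothetical helper; both passes are
stdlib primitives):

```rust
// Two vectorized stdlib passes beat one fused scalar loop.
fn split_key_value(raw: &str) -> Option<(&str, &str)> {
    let trimmed = raw.trim();       // pass 1: SIMD-class whitespace scan
    let colon = trimmed.find(':')?; // pass 2: memchr-backed byte search
    Some((&trimmed[..colon], trimmed[colon + 1..].trim_start()))
}
```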

### 6.2 Byte-match fast-out in `classify` (phase 9, reverted)

Idea: if the first byte of the trimmed value isn't `{`/`[`/`(`, skip the
whole chain of `== "{"`, `starts_with('{')`, …, and return `Scalar` right
away.

**Measured:** +1 % to +5 % across `parse_to_value`, neutral on
`parse_to_struct`. Reverted.

**Why it lost:** LLVM was already constant-folding and reordering the
cascade. The extra explicit match arm adds code and occasionally defeats
the optimizer's reorder, costing more than it saves.

### 6.3 `Bump::with_capacity(text.len())` (phase 11, reverted)

Idea: Pre-size the bump arena so it never has to grow.

**Measured:** +7 % to +27 % across benches. Reverted.

**Why it lost:** `Bump::new()`'s default chunk size and doubling
strategy are tuned. Handing it a single large chunk up-front forces
a big allocator call before there's a user for any of it, and
interacts badly with per-call reuse patterns on warm allocators.

### 6.4 `BumpVec::new_in(bump)` instead of `with_capacity_in(4, bump)` (phase 7, reverted)

Idea: Lazy allocation — don't reserve anything until the first `push`.

**Measured:** `parse_to_value` improved 2–10 %, but
`parse_to_struct` regressed 5 %. Reverted on the typed-path regression.

**Why it lost:** Typical objects in the benchmark have 4–6 keys.
`with_capacity_in(4)` skips one doubling (0 → 4); `new_in` takes the
full 0 → 4 → 8 sequence on the hot path.

### 6.5 In-place dedent in `bumpalo::String` (phase 8, reverted)

Idea: Build the dedented multi-line string directly in a `bumpalo::String`,
skipping the `std::String` → `Bump::alloc_str` copy chain.

**Measured:** `multiline_dedent/*` **+6 % to +15 %**. Reverted.

**Why it lost:** `bumpalo::String` has slightly more bookkeeping per
`push_str` than `std::String` (alignment, chunk boundary checks). Over
many pushes the overhead exceeds the single terminating copy saved.

### 6.6 Custom `parse_u64_fast` / `parse_i64_fast` (phase 13, reverted)

Idea: Hand-rolled byte loops for `u8/u16/u32/u64/i8/i16/i32/i64`, falling
through to `str::parse` on anything unusual.

**Measured:** `parse_to_struct` **±0.0 %**. Reverted.

**Why it lost:** `str::parse::<u16>` is already tight — a few dozen
instructions, no allocation. LLVM inlined it. The hand-rolled version
did exactly the same work.

### 6.7 `lto = "fat"` for the bench profile (not committed)

Idea: Enable fat LTO just for benches.

**Measured:** `parse_to_value` **+40–60 %** (regression), but `render` saw
`−23 %`. Net negative across the suite.

**Why it lost:** Fat LTO aggressively inlines across crate boundaries,
which in a parser swells the hot function past L1 icache capacity. The
render path is smaller and benefits; the parse path gets punished by
instruction-cache pressure. Kept out of `Cargo.toml`; LTO belongs in a
separate, deliberately-scoped experiment.

### 6.8 `scan_line`-based optimizations for array items (phase 10)

Only applicable if `scan_line` from phase 6 had landed — since it didn't,
this never materialized. Its concept (skip colon-search in array context)
is *already* realized in the current code: `handle_array_item` doesn't
call `find(':')` at all.

---

## 7. Benchmarking methodology

### 7.1 Criterion defaults

```bash
cargo bench --bench parse -- \
    --save-baseline <name> \
    --warm-up-time 2 \
    --measurement-time 5 \
    --sample-size 50
```

Five-second measurement with 50 samples and two-second warmup is the
`*_precise` profile used for phase-to-phase comparisons.

### 7.2 Short-sample is for draft signal only

A 30-second run (`--warm-up-time 1 --measurement-time 2 --sample-size 20`)
is fast but carries ±10 % noise. Useful as a "did this completely break?"
sanity check. Never commit-or-revert a marginal change (±5 %) on a
short-sample result — rerun long.

### 7.3 Ablation for layered changes

When a phase bundles several independent patches, revert files one group
at a time and rebench each permutation. The Phase 4 ablation (in the
commit history) showed that one of its three groups gave most of the
signal and another was flat; that's only visible through ablation.

### 7.4 Comparing across commits

Criterion writes named baselines under `target/criterion/<bench>/<name>/`.
The JSON under `estimates.json` has `mean.point_estimate` in nanoseconds;
diffing two baselines is a one-liner:

```bash
for bench in parse_to_value/100_upstreams/22KB parse_to_struct/100_upstreams_typed ...; do
  a=$(grep -o 'point_estimate":[0-9.]*' "target/criterion/$bench/arena_precise/estimates.json" | head -1 | grep -o '[0-9.]*$')
  b=$(grep -o 'point_estimate":[0-9.]*' "target/criterion/$bench/phase6_memchr/estimates.json" | head -1 | grep -o '[0-9.]*$')
  awk "BEGIN{printf \"$bench: %+.1f%%\n\", ($b-$a)/$a*100}"
done
```

### 7.5 Thermal and rebuild noise

Individual bench results move ±5 % between adjacent long-sample runs even
with no code change. Causes:

* CPU thermal state (first run hotter than third on a laptop).
* Minor code changes trigger full crate recompile with different inline
  layout; bench binaries from different cargo runs have different TLB
  behavior.

A rule of thumb: **a single-digit percentage delta on one bench is
always noise. A uniform direction across ≥8 benches is signal.** That's
why phase 6 was committed (all ten went down) and phase 9 was rejected
(net mixed, within noise).

---

## 8. The rules, distilled

1. **Arena over per-node heap** for short-lived object trees.
2. **Borrow before owning** — `&'a str` and `Cow` at boundaries.
3. **Inline short strings** — `CompactString` for the 24-byte fast path.
4. **FxHash + IndexMap** — fast hashing + insertion-order iteration.
5. **Capacity hints** at every collection creation where size is
   loosely knowable.
6. **One lookup per insert** — `entry().or_insert_with`, not
   `contains_key` + `insert`.
7. **Fast paths** for the dominant shape; dotted/complex cases branch
   off first.
8. **memchr shortcuts** to skip expensive `trim`/`compare` chains.
9. **Byte-level scanning** for ASCII-only grammars; never decode UTF-8
   for rejection-class checks.
10. **Static slice pools** for constant writes in hot loops
    (`const SPACES`).
11. **`#[cold]` + `#[inline(never)]`** for rarely-taken slow paths to
    free the hot icache.
12. **`itoa` / `ryu`** for number formatting; never `write!("{}", ...)`
    inside a hot loop.
13. **Invariant-threading** — document that input is already trimmed
    and don't re-trim; match `trim_end`/`trim_start` to the real need.
14. **`visit_borrowed_str` / `visit_borrowed_bytes`** on deserialize
    paths; a borrowed-key mini-deserializer for struct fields.
15. **`visit_str(&s)` over `visit_string(s.into_string())`** for
    inline-short string storage.
16. **Trust the standard library**. `str::trim`, `str::find`, `<[u8]>::contains`,
    `str::parse::<T>` are SIMD/tight-asm tuned. Replacing them with
    "unified" hand-rolled scans almost always regresses.
17. **Measure, don't guess**. Uniform direction across benches = signal;
    mixed bag = noise. Revert on flat results.
18. **No `unsafe`.** Every principle above is implementable in safe
    Rust; if an optimization seems to need `unsafe`, there's almost
    always a safe equivalent at the same speed or faster
    (`write_indent` went from `unsafe { as_mut_vec() }` to the
    `const SPACES` chunk loop and got slightly faster).

---

## 9. What's left (the wall)

Future gains require either:

* **Breaking the public `Value` API** to arena-back the owned parser too
  (est. +10–15 % on `parse()`), or
* **Custom-derive macros** bypassing serde's virtual dispatch for
  hot types (est. +15–30 % on `from_str<T>`), or
* **PGO** (est. +5–15 %, compile-time only).

Each of those is either API-breaking or a structural shift. Inside the
current "prose" of the crate — `#[derive(Serialize, Deserialize)]` over
safe Rust — the measured optimization surface is empirically exhausted.