asmjson 0.2.5

A fast JSON parser using AVX-512/AVX2/SWAR classifiers
# asmjson — development conversation log

This file captures the development history of the `asmjson` project across a
series of coding sessions.  The first session is reconstructed from a
conversation summary; the later sessions (from the JsonWriter work onward) are
recorded in full.

---

## Session 1 — SIMD classifiers, benchmarks, perf profiling

### Resuming from prior state

The project was `asmjson`, a Rust JSON parser at `/home/amy/andy-thomason/asmjson`
using AVX-512BW SIMD classification.  Prior work had added XMM/YMM/ZMM
classifier variants, CPUID dispatch, benchmarks, and a standalone `.s`
assembly file.

### Fixing orphaned code

Removed orphaned `next_state_xmm` body, added `static ZMM_CONSTANTS`, fixed
Rust 2024 `unsafe {}` blocks inside `unsafe fn imp` bodies.  Tests passed
14/14.

### ISA question: broadcast memory operands

Confirmed that AVX-512 broadcast memory operands (`{1to16}`) do not exist for
byte-granularity instructions (`vpcmpeqb` / `vpcmpub`), so 64-byte needle
vectors in `ZMM_CONSTANTS` are the correct approach.

### Benchmark results (criterion)

Compared XMM / YMM / ZMM variants plus simd-json on three workloads:

| workload      | XMM        | YMM        | ZMM        |
|---------------|------------|------------|------------|
| string_array  | 4.16 GiB/s | 5.53 GiB/s | 6.08 GiB/s |
| string_object | n/a        | n/a        | n/a        |
| mixed         | n/a        | n/a        | n/a        |

(The string_object and mixed figures were not preserved in the session
summary.)  The state machine dominates on the mixed workload.

### Standalone GNU assembly file

Created `src/classify_zmm.s`, compiled via `build.rs` + `cc` crate.  Fixed
SysV sret convention for 32-byte `ByteState` return.  Added `#[repr(C)]` to
`ByteState`.  Added to benchmarks and `classifier_agreement` test.  14/14
passing.

### Perf profiling results

Used `perf record` / `perf report` to compare `classify_zmm_s` (`.s` file)
vs inline-asm `classify_zmm`:

- `.s` version: ~16% of runtime (8.16% body + 7.87% Rust wrapper) — sret
  call overhead for 32-byte return value.
- Inline-asm version: ~10% (folded into `parse_json_impl` by the compiler).
- State machine + allocator dominate at 68–74%.

Conclusion: allocator / tree-building is the real bottleneck, motivating a
flat `Tape` output.

### Revert standalone `.s`

Removed `classify_zmm_gnu`, `build.rs`, `cc` build-dep, bench slots.  14/14
tests still passing.

### Final state of session 1

- `cargo fmt` applied.
- Committed as **`dbf274e`**.
- `push_value` / `close_frame` still present; `parse_json_impl` still builds
  `Vec<Frame>` directly.

---

## Session 2 — JsonWriter trait and Tape output

### Motivation

The perf profile showed that allocation for the `Value` tree is the dominant
cost.  A flat `Tape` representation would let callers avoid or defer
allocation.  Rather than duplicating the parser, we abstract output
construction behind a trait.

### Design

**`JsonWriter<'src>` trait** — SAX-style event sink:

```rust
pub trait JsonWriter<'src> {
    type Output;
    fn null(&mut self);
    fn bool_val(&mut self, v: bool);
    fn number(&mut self, s: &'src str);
    fn string(&mut self, s: Cow<'src, str>);
    fn key(&mut self, s: Cow<'src, str>);
    fn start_object(&mut self);
    fn end_object(&mut self);
    fn start_array(&mut self);
    fn end_array(&mut self);
    fn finish(self) -> Option<Self::Output>;
}
```

**`FrameKind` enum** — lightweight parser-internal discriminant replacing
`Vec<Frame>` in the state machine loop:

```rust
enum FrameKind { Object, Array }
```

**`ValueWriter<'a>`** — private struct implementing `JsonWriter<'a>` with
`Output = Value<'a>`.  Delegates to the existing `push_value` helper.

**`TapeEntry<'a>`** — flat token type:

```rust
pub enum TapeEntry<'a> {
    Null,
    Bool(bool),
    Number(&'a str),
    String(Cow<'a, str>),
    Key(Cow<'a, str>),
    StartObject(usize),   // payload = index of matching EndObject
    EndObject,
    StartArray(usize),    // payload = index of matching EndArray
    EndArray,
}
```

`StartObject(n)` / `StartArray(n)` carry the index of their matching closer,
enabling O(1) structural skips:

```rust
if let TapeEntry::StartObject(end) = tape.entries[i] {
    i = end + 1; // jump past the entire object
}
```

**`Tape<'a>`** — output struct:

```rust
pub struct Tape<'a> { pub entries: Vec<TapeEntry<'a>> }
```

**`TapeWriter<'a>`** — private struct implementing `JsonWriter<'a>` with
`Output = Tape<'a>`.  Maintains an `open: Vec<usize>` of unmatched
`StartObject` / `StartArray` indices that are backfilled when the closer
is emitted.
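The backfill idea can be sketched like this (a simplified stand-in with hypothetical `Entry` / `Writer` names, showing only the array case; the real `TapeWriter` handles objects the same way):

```rust
#[derive(Debug, PartialEq)]
enum Entry {
    StartArray(usize), // payload = index of matching EndArray
    EndArray,
    Number(&'static str),
}

struct Writer {
    entries: Vec<Entry>,
    open: Vec<usize>, // indices of not-yet-closed StartArray entries
}

impl Writer {
    fn start_array(&mut self) {
        self.open.push(self.entries.len());
        self.entries.push(Entry::StartArray(usize::MAX)); // placeholder
    }
    fn end_array(&mut self) {
        let start = self.open.pop().expect("writer sees balanced events");
        let end = self.entries.len();
        self.entries.push(Entry::EndArray);
        self.entries[start] = Entry::StartArray(end); // backfill the closer index
    }
}
```

Because openers are patched in place when their closer arrives, the tape is built in a single forward pass with no second fixup sweep.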

**`write_atom`** — helper replacing the old `parse_atom` + `push_value`
callsites:

```rust
fn write_atom<'a, W: JsonWriter<'a>>(s: &'a str, w: &mut W) -> bool { … }
```

### New public API

| Symbol | Description |
|--------|-------------|
| `pub trait JsonWriter<'src>` | SAX-style writer trait |
| `pub enum TapeEntry<'a>` | flat token |
| `pub struct Tape<'a>` | flat token sequence |
| `pub fn parse_to_tape(src, classify) -> Option<Tape>` | flat output path |
| `pub fn parse_with(src, classify, writer) -> Option<W::Output>` | generic entry point |
| `pub fn parse_json(src, classify) -> Option<Value>` | **unchanged** |

### Internal changes

- `parse_json_impl` is now generic:
  `fn parse_json_impl<'a, F, W>(src, classify: F, writer: W) -> Option<W::Output>`
- The parser loop uses `Vec<FrameKind>` instead of `Vec<Frame>`.
- `parse_atom` and `close_frame` removed (dead code).

### Implementation

Inserted all new types between `close_frame` and the old `pub fn parse_json`,
then replaced `parse_json_impl` with the generic version.

### Tests

9 new tape tests added:

- `tape_scalar_values` — null, bool, number, string
- `tape_empty_object` — `StartObject(1)` points to `EndObject` at index 1
- `tape_empty_array`
- `tape_simple_object` — `{"a":1}`
- `tape_simple_array` — `[1,2,3]`
- `tape_nested` — `{"a":[1,2]}`, verifies both skip indices
- `tape_multi_key_object` — `{"x":1,"y":2}`
- `tape_invalid_returns_none` — trailing commas, bad structure
- `tape_skip_object` — exercises the O(1) skip idiom

All classifiers (XMM / YMM / ZMM) are compared for each tape test.

**23/23 tests pass, zero warnings.**

### Commit

```
00c27c4  Add JsonWriter trait + Tape output
```

---

## Session 3 — Tape throughput benchmark

### What was done

Added an `asmjson/zmm/tape` bench slot to the `string_array` criterion group
(`benches/parse.rs`) so that `parse_to_tape` (flat Tape output) can be
directly compared against `parse_json` (Value tree output), both using the ZMM
classifier.

### Results

Workload: ~10 MiB array of 95-character ASCII strings.

| variant | throughput |
|---|---|
| `asmjson/zmm` (`parse_json`, `Value` tree) | 6.25 GiB/s |
| `asmjson/zmm/tape` (`parse_to_tape`, `Tape`) | 8.56 GiB/s |

The flat Tape is **~37% faster** on this workload.  The gain comes almost
entirely from eliminating the per-element heap allocation required to build the
`Vec<Value>` inside `Value::Array` and the `Box<[...]>` at close time.  The
SIMD classifier and state machine costs are identical between the two paths.

### Design decisions

The Tape bench was added only to `string_array` (the allocation-heavy
workload) rather than to all three groups, keeping the benchmark run time
reasonable.  The same pattern can be replicated for `string_object` and
`mixed` when needed.

### Commit

```
3b1f4b2  bench: add asmjson/zmm/tape to string_array group
```

---

## Session 4 — string_object Tape benchmark

### What was done

Added an `asmjson/zmm/tape` slot to the `string_object` criterion group to
compare `parse_json` (Value tree) vs `parse_to_tape` (flat Tape) on the
object-heavy workload.

### Results

Workload: ~10 MiB flat JSON object with string keys (`"keyNNNNN"`) and
85-character ASCII string values.

| variant | throughput |
|---|---|
| `asmjson/zmm` (`parse_json`, `Value` tree) | 5.29 GiB/s |
| `asmjson/zmm/tape` (`parse_to_tape`, `Tape`) | 5.53 GiB/s |

Only **~5% faster**, compared to 37% on the string array.

### Design decisions / analysis

The much smaller gain reflects the structure of the workload.  Each object
member requires a key parse (KeyChars → KeyEnd → AfterColon states) that is
identical in both paths — the Tape still emits a `Key` entry for every member.
On the Value side, the `Vec<(Cow, Value)>` members accumulation is the main
allocation cost; on the Tape side that is replaced by a flat `Vec<TapeEntry>`
push, but the state-machine work per byte is the same.

In contrast, the string array workload allocates a `Box<[Value]>` per
top-level array (containing ~100 k `Value::String` variants), which the Tape
eliminates entirely.

### Commit

```
c1fb9d4  bench: add asmjson/zmm/tape to string_object group
```

---

## Session 5 — mixed Tape benchmark

### What was done

Added an `asmjson/zmm/tape` slot to the `mixed` criterion group to compare
`parse_json` vs `parse_to_tape` on the deeply-nested mixed workload.

### Results

Workload: ~10 MiB array of objects, each with numbers, booleans, nulls, a
nested array, and a nested object (~130 bytes per record).

| variant | throughput |
|---|---|
| `asmjson/zmm` (`parse_json`, `Value` tree) | 254 MiB/s |
| `asmjson/zmm/tape` (`parse_to_tape`, `Tape`) | 392 MiB/s |

**~54% faster** with the Tape on this workload.

### Analysis

The mixed workload allocates at multiple nesting levels: an outer `Box<[Value]>`
for the top-level array, and inside each record a `Box<[...]>` for the `tags`
array and the `meta` object, plus the record object itself.  Every `}` / `]`
triggers a heap allocation to box the collected members.  The Tape avoids all
of this — it is a single flat `Vec<TapeEntry>` grown incrementally with no
per-close allocation.

The absolute throughput (254 / 392 MiB/s) is much lower than on the
string-only workloads (5–8 GiB/s) because the mixed data has short strings and
dense structural characters, so the state machine visits more states per byte.

### Overall Tape speedup summary

| workload | Value tree | Tape | speedup |
|---|---|---|---|
| string_array | 6.25 GiB/s | 8.56 GiB/s | +37% |
| string_object | 5.29 GiB/s | 5.53 GiB/s | +5% |
| mixed | 254 MiB/s | 392 MiB/s | +54% |

### Commit

```
8edf785  bench: add asmjson/zmm/tape to mixed group
```

---

## Session 6 — JsonRef read-only accessor trait

### Motivation

Having both a `Value` tree and a flat `Tape` as parse outputs created an
ergonomics problem: code consuming parsed JSON had to hardcode which
representation to use.  The request was to model `serde_json::Value`'s
accessor API as a trait so that generic functions work with either.

### Design

`pub trait JsonRef<'a>: Sized + Copy` — `'a` is the string-access lifetime;
`&'a str` returned from `as_str` / `as_number_str` is valid for at least `'a`.

Methods mirror `serde_json::Value`:

| method | notes |
|---|---|
| `is_null / is_bool / is_number / is_string` | default impls via `as_*` |
| `is_array / is_object` | required |
| `as_null / as_bool / as_number_str / as_str` | required |
| `as_i64 / as_u64 / as_f64` | default: `as_number_str()?.parse().ok()` |
| `get(key: &str) -> Option<Self>` | object key lookup |
| `index_at(i: usize) -> Option<Self>` | array positional lookup |
| `len() -> Option<usize>` | element / pair count |

#### TapeRef

A new `pub struct TapeRef<'t, 'src: 't>` carries `tape: &'t [TapeEntry<'src>]`
and `pos: usize` using two lifetimes:

- `'t`   = borrow of the tape's `Vec` (typically the caller's stack frame).
- `'src` = the source JSON bytes lifetime (the data borrowed inside entries).

This avoids the self-referential `&'src Tape<'src>` pattern.

`Tape::root<'t>(&'t self) -> Option<TapeRef<'t, 'src>>` is the entry point.

A private `fn tape_skip(entries, pos) -> usize` advances past one entry in O(1)
for `StartObject` / `StartArray` (using the pre-baked end-index payload) and
also O(1) for scalars.
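The skip helper can be sketched against the `TapeEntry` shape from Session 2 (the function body is a reconstruction from the description above, not the actual source):

```rust
use std::borrow::Cow;

#[allow(dead_code)]
pub enum TapeEntry<'a> {
    Null,
    Bool(bool),
    Number(&'a str),
    String(Cow<'a, str>),
    Key(Cow<'a, str>),
    StartObject(usize), // payload = index of matching EndObject
    EndObject,
    StartArray(usize), // payload = index of matching EndArray
    EndArray,
}

/// Index of the entry just after the value that starts at `pos`.
fn tape_skip(entries: &[TapeEntry<'_>], pos: usize) -> usize {
    match &entries[pos] {
        // Containers carry their closer's index, so the jump is O(1).
        TapeEntry::StartObject(end) | TapeEntry::StartArray(end) => end + 1,
        // Every other entry occupies exactly one slot.
        _ => pos + 1,
    }
}
```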

### Implementation

~300-line insertion in `src/lib.rs` between `TapeWriter` impl and `write_atom`:

1. `pub struct TapeRef<'t, 'src: 't>` + `#[derive(Clone, Copy)]`
2. `impl<'src> Tape<'src>` containing `pub fn root<'t>`
3. `fn tape_skip` (private)
4. `pub trait JsonRef<'a>` with full docstrings
5. `impl<'a> JsonRef<'a> for &'a Value<'a>`
6. `impl<'t, 'src: 't> JsonRef<'t> for TapeRef<'t, 'src>`

### Tests

Six new tests added (29 total): `jsonref_scalars_value/tape`, `jsonref_object_get`,
`jsonref_array_index`, `jsonref_nested`, `jsonref_generic_fn` (exercises a
`fn<'a, J: JsonRef<'a>>(J) -> Option<i64>` on both representations).  All pass.

### Commit

```
9b5f27c  feat: add JsonRef trait + TapeRef cursor
```

---

## Session 7 — JsonRef chaining via Option<J>

### Motivation

`x.get("a").get("b")` was broken by the original trait design: `get` returned
`Option<Self>`, so calling `.get("b")` on `Option<J>` would have to return
`Option<Option<J>>`, defeating flat chaining.

### Design decision: associated `type Item`

The fix is to add `type Item: JsonRef<'a>` to the trait and change `get` /
`index_at` to return `Option<Self::Item>` instead of `Option<Self>`.

| impl | `type Item` | effect |
|---|---|---|
| `&'a Value<'a>` | `Self` | no change |
| `TapeRef<'t,'src>` | `Self` | no change |
| `Option<J>` | `J::Item` | chain stays flat |

The key insight: `Option<J>::Item = J::Item` (not `Option<J>`), so chaining
never wraps more deeply.

```rust
root.get("address").get("city").as_str()
root.get("items").index_at(0).get("val").as_i64()
```

### Implementation

- Added `type Item: JsonRef<'a>` to `JsonRef` trait definition.
- Changed `fn get` / `fn index_at` signatures to return `Option<Self::Item>`.
- Added `type Item = Self` to both concrete impls (no change in practice).
- Fixed `Option<J>` impl: `type Item = J::Item`, `get`/`index_at` delegate via
  `self?.get(key)` returning `Option<J::Item>`.
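The delegation can be sketched with a trimmed-down trait (simplified signatures and a toy `Leaf` type for illustration; not the full asmjson trait):

```rust
trait JsonRef<'a>: Sized + Copy {
    type Item: JsonRef<'a>;
    fn get(self, key: &str) -> Option<Self::Item>;
    fn as_i64(self) -> Option<i64>;
}

// The crucial line: Item = J::Item, not Option<J>, so a chain of
// .get() calls stays exactly one Option deep no matter how long it is.
impl<'a, J: JsonRef<'a>> JsonRef<'a> for Option<J> {
    type Item = J::Item;
    fn get(self, key: &str) -> Option<J::Item> {
        self?.get(key)
    }
    fn as_i64(self) -> Option<i64> {
        self?.as_i64()
    }
}

// Toy leaf node so the sketch is self-contained.
#[derive(Clone, Copy)]
struct Leaf(i64);

impl<'a> JsonRef<'a> for Leaf {
    type Item = Leaf;
    fn get(self, _key: &str) -> Option<Leaf> {
        None
    }
    fn as_i64(self) -> Option<i64> {
        Some(self.0)
    }
}
```

A missing key anywhere in the chain short-circuits to `None` through the `self?` in each delegating method.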

### Test

`jsonref_option_chaining`: three-level `.get().get().get()` on both `&Value`
and `TapeRef`, missing-key short-circuit, mixed `.get().index_at().get()`.
30 tests passing.

### Commit

```
413c41f  feat: impl JsonRef for Option<J> with type Item for flat chaining
```

## Session 8 — Submodule split: `value`, `tape`, `json_ref`

### What was done

Split the monolithic `src/lib.rs` into three submodules:

| Module | File | Contents |
|---|---|---|
| `value` | `src/value.rs` | `Value<'a>`, `ValueWriter`, `is_valid_json_number`, `push_value`, `Frame` |
| `tape` | `src/tape.rs` | `TapeEntry`, `Tape`, `TapeWriter`, `TapeRef`, `tape_skip`, `Tape::root` |
| `json_ref` | `src/json_ref.rs` | `JsonRef` trait + impls for `&'a Value`, `TapeRef`, `Option<J>` |

Each module carries its own `#[cfg(test)] mod tests { … }` block with the
tests relevant to that module.  `lib.rs` retains only the parse engine
(classifier functions, `parse_json_impl`, `JsonWriter`, `FrameKind`,
`write_atom`) plus a single `classifier_agreement` test.

Public API is unchanged: `lib.rs` re-exports all moved types via
`pub use value::Value`, `pub use tape::{Tape, TapeEntry, TapeRef}`, and
`pub use json_ref::JsonRef`.

### Design decisions

- `ValueWriter` and `TapeWriter` are `pub(crate)` so `lib.rs` can pass them to
  `parse_with`; their constructors are also `pub(crate)`.
- `is_valid_json_number` is `pub(crate)` so `lib.rs`'s `write_atom` can call it.
- `TapeRef`'s fields (`tape`, `pos`) are `pub(crate)` so `json_ref.rs` can
  implement the `JsonRef` accessor methods without the impl living in `tape.rs`.
- `tape_skip` is `pub(crate)` for the same reason.
- Each submodule's test helpers (`run`, `run_tape`, `run_both`) are duplicated
  locally; they are private and small enough that sharing is unnecessary.

### Results

30/30 tests pass across all four test modules; zero warnings after removing
three unused imports that surfaced during the move.

### Commit

`4781b13` refactor: split into submodules value, tape, json_ref

## Session 9 — portable SWAR classifier

### Add `classify_u64`

#### What was done

Added `pub fn classify_u64(src: &[u8]) -> ByteState`, a pure-Rust classifier
that processes a 64-byte block as eight `u64` words using SIMD-Within-A-Register
(SWAR) tricks, requiring no architecture-specific intrinsics.

`choose_classifier()` was updated so that `classify_u64` is the universal
fallback when not running on x86-64; on x86-64 it continues to return the
AVX-512 / AVX2 / SSE2 path as before.

The `classifier_agreement` integration test was extended to assert that
`classify_u64` produces the same `ByteState` as `classify_zmm` for every
test input.  `classify_u64` was also added as `asmjson/u64` to all three
benchmark groups in `benches/parse.rs`.

#### Design decisions

**Whitespace detection** (`byte <= 0x20`):

```
masked = v & 0x7f7f_7f7f_7f7f_7f7f  // clear bit 7 before add
sum    = masked + 0x5f5f_5f5f_5f5f_5f5f  // overflows into bit 7 iff byte >= 0x21
w      = !(sum | v) & 0x8080_8080_8080_8080
```

Masking bit 7 before the add prevents bytes ≥ 0x80 from aliasing into the
target range.  OR-ing the original `v` then ensures bytes ≥ 0x80 are excluded
from the final result.
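The whitespace formula transcribes directly into runnable Rust (a sketch; the function name is mine, not the asmjson source):

```rust
/// Returns a mask word whose bit 7 is set in each lane whose byte is <= 0x20.
fn whitespace_mask(v: u64) -> u64 {
    let masked = v & 0x7f7f_7f7f_7f7f_7f7f; // clear bit 7 before the add
    // bit 7 of `sum` overflows to 1 iff (byte & 0x7f) >= 0x21
    let sum = masked + 0x5f5f_5f5f_5f5f_5f5f;
    // a lane is flagged iff the add did not overflow AND bit 7 of the
    // original byte was clear (ORing in `v` excludes bytes >= 0x80)
    !(sum | v) & 0x8080_8080_8080_8080
}
```

For example, a block of space / tab / newline lanes yields set flags, while `'!'` and any byte ≥ 0x80 yield clear flags.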

**Byte equality** — XOR with a broadcast constant turns the problem into
"detect a zero byte":

```
has_zero_byte(v) = (v - 0x0101...) & !v & 0x8080...
eq_byte(v, b)    = has_zero_byte(v ^ (b * 0x0101...))
```

**Movemask** — collects the MSB of each byte into a `u8`:

```
((v & 0x8080...) * 0x0002_0408_1020_4081) >> 56
```

The magic multiplier routes bit 7 of byte *k* (at position `8k+7`) into
bit `56+k` of the product; shifting right 56 leaves the eight flags in the
low byte.
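Both formulas also transcribe directly (again a sketch, with names of my choosing):

```rust
const ONES: u64 = 0x0101_0101_0101_0101;
const HI: u64 = 0x8080_8080_8080_8080;

/// Bit 7 set in each lane that is zero (the classic SWAR zero-byte test
/// from the formula above).
fn has_zero_byte(v: u64) -> u64 {
    v.wrapping_sub(ONES) & !v & HI
}

/// Bit 7 set in each lane equal to `b`: XOR turns equality into zero-detection.
fn eq_byte(v: u64, b: u8) -> u64 {
    has_zero_byte(v ^ (b as u64 * ONES))
}

/// Collect the MSB of each byte into a u8 via the magic multiplier.
fn movemask(v: u64) -> u8 {
    ((v & HI).wrapping_mul(0x0002_0408_1020_4081) >> 56) as u8
}
```

Compositions in the style of `movemask(eq_byte(word, b'"'))` then give per-word bitmasks of structural characters, which is the shape of output the classifier needs.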

**Zero-padding** — bytes beyond `src.len()` are zero-filled, which the
whitespace test classifies as whitespace — consistent with the behaviour of
the SIMD classifiers.

#### Results

30/30 tests pass; zero warnings.  `cargo bench --no-run` compiles cleanly.

#### Commit

`54979e0` feat: add classify_u64 portable SWAR classifier

## Session 10 — remove classify_xmm

### What was done

Benchmarked `classify_xmm` (SSE2) against `classify_u64` (SWAR) and found
xmm is slower on every workload:

| Workload      | xmm       | u64       |
|---------------|-----------|-----------|
| string_array  | 3.03 GiB/s | 5.95 GiB/s |
| string_object | 2.72 GiB/s | 4.20 GiB/s |
| mixed         | 229 MiB/s  | 234 MiB/s  |

`classify_xmm` was removed from `src/lib.rs`, `choose_classifier` updated
(AVX-512BW → AVX2 → portable SWAR u64, no SSE2 step), bench entries
removed, and all submodule test helpers (`value.rs`, `tape.rs`,
`json_ref.rs`) updated to cross-check `classify_u64` instead of
`classify_xmm`.

### Design decisions

SSE2 `classify_xmm` processes the 64-byte block as four 16-byte passes,
each incurring a `PMOVMSKB` movemask with cross-lane serialisation
overhead.  The portable SWAR implementation works entirely in GP registers
as eight independent 64-bit word operations, avoiding that bottleneck
entirely.  Since the portable code wins unconditionally there is no reason
to maintain the SSE2 path — any x86-64 chip that lacks AVX2 now falls
straight through to `classify_u64`.

YMM (AVX2) was benchmarked at the same time: u64 leads on string-heavy input
(+12%) while ymm recovers on object-heavy input (+4%).  This mixed result
means ymm still earns its place as the AVX2 hardware path.

### Results

30/30 tests pass; zero warnings.

### Commit

`c6bbb9b` refactor: remove classify_xmm (slower than classify_u64 on all benchmarks)

---

## Session 12 — Fix CI (AVX-512 compile and runtime guards)

### What was done

GitHub Actions CI was never running because all commits were local only (16
commits ahead of `origin`).  After pushing, CI triggered but would have
failed on two related issues in `classify_zmm`:

1. **Compile-time**: The AVX-512BW inline-assembly block inside `classify_zmm`
   lacked a `#[target_feature(enable = "avx512bw")]` attribute.  LLVM's
   integrated assembler rejects AVX-512 mnemonics (`vmovdqu8`, `vpcmpub`,
   `kmovq`, etc.) when the function's target-feature set does not include
   `avx512bw`.  GitHub's `ubuntu-latest` runners compile with the default
   `x86_64-unknown-linux-gnu` target (no AVX-512), so the build would have
   errored out.

2. **Runtime**: The `classifier_agreement` test called `classify_zmm`
   unconditionally.  On hardware without AVX-512 this triggers `SIGILL`.

### Design decisions

Following the same pattern already used by `classify_ymm`, the AVX-512 asm was
moved into a nested `unsafe fn imp` annotated with
`#[target_feature(enable = "avx512bw")]`.  The outer `classify_zmm` delegates
to `imp` via `unsafe { imp(src) }`.  This is safe because the only callers are
`choose_classifier` (guarded by `is_x86_feature_detected!("avx512bw")`) and
the test (now also guarded).

In the test, the zmm comparison block was wrapped in
`#[cfg(any(target_arch = "x86", target_arch = "x86_64"))] if is_x86_feature_detected!("avx512bw")`.
When AVX-512 is absent the test still cross-checks `classify_u64` against
`classify_ymm`, preserving meaningful coverage on all runners.
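The pattern can be sketched as follows (using AVX2 and a stubbed body for illustration; the real code guards AVX-512BW inline assembly, and `classify_guarded`/`portable` are hypothetical names):

```rust
/// Nested-`imp` guard pattern: the `#[target_feature]` body only ever runs
/// after a runtime feature check, and non-x86 targets compile it away.
pub fn classify_guarded(src: &[u8; 64]) -> u64 {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        #[target_feature(enable = "avx2")]
        unsafe fn imp(src: &[u8; 64]) -> u64 {
            // Compiled with AVX2 enabled, so the body may use AVX2
            // intrinsics or asm.  Here it just defers to the portable path.
            portable(src)
        }
        if is_x86_feature_detected!("avx2") {
            // SAFETY: the required target feature was detected at runtime.
            return unsafe { imp(src) };
        }
    }
    portable(src)
}

/// Portable fallback: bitmap of `"` positions.
fn portable(src: &[u8; 64]) -> u64 {
    src.iter()
        .enumerate()
        .fold(0u64, |m, (i, &b)| m | (u64::from(b == b'"') << i))
}
```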

### Results

30/30 tests pass locally; doc-tests pass.  CI will now compile and run
successfully on `ubuntu-latest` (AVX2 available, AVX-512 absent).

### Commit

`b5c7265` fix: guard classify_zmm and test behind avx512bw target-feature

---

## Session 13 — Add sonic-rs to benchmarks

### What was done

Added `sonic-rs = "0.5.7"` as a dev-dependency and added a
`sonic_rs::from_str::<sonic_rs::Value>` bench variant to all three groups
(`string_array`, `string_object`, `mixed`).  Ran the full bench suite and
updated the README table with sonic-rs results and refreshed numbers.

### Design decisions

`sonic_rs::from_str::<sonic_rs::Value>` is the closest analogue to
`parse_json` — it produces a fully-navigable value tree from a `&str`.
`sonic-rs` uses a lazy `Value` representation where string content remains as
raw bytes in the source buffer; escape processing is deferred until the value
is read.  By contrast, asmjson fully decodes `\uXXXX` / `\\` / `\"` escapes
into `Cow<'src, str>` during the initial parse pass, which is safer and more
ergonomic but costs throughput on string-heavy inputs.

### Results

| Parser              | string array | string object | mixed     |
|---------------------|:------------:|:-------------:|:---------:|
| sonic-rs            | 11.0 GiB/s   | 6.17 GiB/s    | 969 MiB/s |
| asmjson zmm (tape)  | 8.36 GiB/s   | 5.72 GiB/s    | 383 MiB/s |
| asmjson zmm         | 6.09 GiB/s   | 5.23 GiB/s    | 262 MiB/s |
| asmjson u64         | 6.08 GiB/s   | 4.20 GiB/s    | 255 MiB/s |
| asmjson ymm         | 5.45 GiB/s   | 4.46 GiB/s    | 258 MiB/s |
| simd-json borrowed  | 2.13 GiB/s   | 1.32 GiB/s    | 189 MiB/s |
| serde_json          | 2.50 GiB/s   | 0.57 GiB/s    |  92 MiB/s |

sonic-rs leads on string-heavy work because of its lazy decode.  On mixed
JSON (numbers, bools, nested objects) sonic-rs also leads, by roughly 2.5×
over asmjson zmm/tape, so its advantage is not limited to the string-heavy
workloads where the lazy trick pays off most.

### Commit

`ee28983` bench: add sonic-rs comparison

---

## Session 14 — TapeRef::object_iter and array_iter

### What was done

Added two new iterator types to `src/tape.rs` and inherent methods on
`TapeRef` to create them:

- **`TapeObjectIter<'t, 'src>`** — yields `(&'t str, TapeRef<'t, 'src>)` pairs
  for every key-value entry in a JSON object, in document order.  Returned by
  `TapeRef::object_iter()`, which returns `None` if the cursor is not on a
  `StartObject` entry.

- **`TapeArrayIter<'t, 'src>`** — yields one `TapeRef<'t, 'src>` per array
  element in document order.  Returned by `TapeRef::array_iter()`, which
  returns `None` if the cursor is not on a `StartArray` entry.

Both types were added to the crate-root re-exports.

### Design decisions

The iterators are inherent methods on `TapeRef` rather than part of the
`JsonRef` trait because `JsonRef` is generic (`type Item`), and returning
an iterator type directly from the trait would require either associated
types for the iterator types (adding trait complexity) or `impl Trait`
return types in traits, which bring their own boilerplate.  Keeping them
as inherent methods is simpler and zero-cost.

Both iterators advance via `tape_skip`, so skipping over nested
objects/arrays inside a value position is O(1) — the `StartObject(end)` and
`StartArray(end)` payloads let the iterator jump directly to the next sibling.
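The end-index trick can be illustrated with a simplified tape (hypothetical `Entry`/`skip` names; the real tape stores richer entries):

```rust
/// Simplified tape entry: containers record the index just past their
/// matching end entry, so a whole subtree is skipped in one jump — the
/// idea behind `tape_skip`.
#[derive(Debug, PartialEq)]
pub enum Entry {
    Num(i64),
    Start { end: usize }, // `end` = index one past the matching `End`
    End,
}

/// Index of the next sibling of the value starting at `i` — O(1) even
/// when the value is a deeply nested container.
pub fn skip(tape: &[Entry], i: usize) -> usize {
    match tape[i] {
        Entry::Start { end } => end, // jump over the entire subtree
        _ => i + 1,
    }
}
```

An array iterator is then just `i = skip(tape, i)` until the closing `End` entry is reached.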

### Results

32 unit tests + 5 doc-tests pass; zero warnings.

### Commit

`eb5de55` feat: add TapeRef::object_iter and array_iter

## Session 17 — Remove Value type

### Remove `Value<'a>` and `parse_json`

**What was done**

The `Value<'a>` tree type, `parse_json` entry point, `ValueWriter`, and all
supporting code in `src/value.rs` were removed.  The `Tape` / `TapeRef` path
is now the sole output format.

Specifically:

- `src/value.rs` deleted (`git rm`).
- `src/lib.rs`: removed `pub mod value`, `pub use value::Value`,
  `use value::{ValueWriter, is_valid_json_number}`, and the `parse_json`
  function + its doc-test.  `is_valid_json_number` (previously in
  `value.rs`) was moved inline into `lib.rs` as a private function, since it
  is still needed by `write_atom`.
- `src/json_ref.rs`: removed `use crate::value::Value`, the
  `impl JsonRef<'a> for &'a Value<'a>` block, and the `&'a Value<'a>` bullet
  from the trait's doc comment.  Test module rewritten: `fn run()` and
  `fn run_both()` helpers deleted; all tests that exercised both `&Value` and
  `TapeRef` paths were updated to use only `run_tape()`.  The
  `jsonref_scalars_value` test was removed entirely.
- `benches/parse.rs`: the `#[cfg(feature = "stats")]` `print_stats` helper
  was updated to alias `parse_to_tape` as `parse_json` so that the
  `#[cfg(feature = "stats")]` gate continues to compile.
- `README.md`: quick-start example updated to use `parse_to_tape`; Output
  formats list trimmed to two entries.

**Design decisions**

`Value` was a convenient heap-allocated tree that mirrored `serde_json::Value`,
but benchmarks showed it was always slower than the tape and the codebase now
focuses on flat-tape output.  Removing it simplifies the public API and
eliminates ~500 lines of code.

`is_valid_json_number` is still needed at parse time (in `write_atom`) so it
was migrated to `lib.rs` rather than deleted; it remains private.

**Results**

18 unit tests + 4 doc-tests pass; zero warnings.  5 files changed,
69 insertions, 590 deletions.

**Commit**

`cbb1e6b` Remove Value type and parse_json; tape is the only output format

## Session 18 — Benchmark refresh (March 2026)

### Results

Re-ran `cargo bench` with `RUSTFLAGS="-C target-cpu=native"`.  asmjson now
leads sonic-rs on all three workloads:

| Parser             | string array | string object | mixed      |
|--------------------|:------------:|:-------------:|:----------:|
| asmjson zmm (tape) | 8.20 GiB/s   | 5.48 GiB/s    | 370 MiB/s  |
| sonic-rs           | 7.37 GiB/s   | 4.21 GiB/s    | 368 MiB/s  |

### Design decisions

README table and accompanying prose updated to reflect the new leader, and
stale references to simd-json, serde_json, and the removed asmjson Value
variants were removed.

### Commit

`63d6957` bench: update README with March 2026 results (asmjson leads sonic-rs)

## Session — TapeEntry: split Cow into borrowed + escaped variants

### What was done

Replaced the two `Cow<'a, str>` payload variants in `TapeEntry`:

| Before | After |
|--------|-------|
| `String(Cow<'a, str>)` | `String(&'a str)` + `EscapedString(Box<str>)` |
| `Key(Cow<'a, str>)` | `Key(&'a str)` + `EscapedKey(Box<str>)` |

`TapeWriter::string` / `TapeWriter::key` now branch on the `Cow` variant from
the parser: `Borrowed` goes into the plain variant; `Owned` (escape-decoded)
is converted to `Box<str>` and stored in the `Escaped*` variant.

`TapeObjectIter`, `json_ref::as_str`, and `json_ref::get` were extended to
match both the plain and escaped variants.

### Design decisions

`Box<str>` (ptr + len = 16 bytes) was chosen over `String` (ptr + len + cap =
24 bytes) because the decoded string is never grown after allocation; dropping
the capacity word is the right trade-off.
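On 64-bit targets the size difference is easy to confirm:

```rust
use std::mem::size_of;

/// `Box<str>` is a fat pointer (ptr + len); `String` additionally stores
/// a capacity word.  Sizes assume a 64-bit target.
pub fn payload_sizes() -> (usize, usize) {
    (size_of::<Box<str>>(), size_of::<String>())
}
```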

The alternative chosen was to keep `Cow` on the `JsonWriter` trait and
change only `TapeEntry`: the trait signature is untouched, keeping the
door open for alternative `JsonWriter` impls that may prefer the `Cow`
abstraction.

### Results

`size_of::<TapeEntry>()` reduced from **32 bytes** to **24 bytes** (25%
reduction).  All 18 unit tests and 4 doc-tests continue to pass.

## Session — JsonWriter: replace Cow methods with string/escaped_string and key/escaped_key

### What was done

Split the two `Cow`-taking methods on the `JsonWriter` trait into four
explicit methods:

| Before | After |
|--------|-------|
| `fn string(&mut self, s: Cow<'src, str>)` | `fn string(&mut self, s: &'src str)` |
| | `fn escaped_string(&mut self, s: Box<str>)` |
| `fn key(&mut self, s: Cow<'src, str>)` | `fn key(&mut self, s: &'src str)` |
| | `fn escaped_key(&mut self, s: Box<str>)` |

`parse_json_impl` now dispatches directly on the `str_escaped` flag and calls
the appropriate method instead of allocating a `Cow`.  The `current_key: Cow`
local was replaced by `current_key_raw: &'a str` + `current_key_escaped: bool`.
The `use std::borrow::Cow` import was removed from `lib.rs`.

`TapeWriter` was simplified to four one-liner push calls.

### Design decisions

Having separate methods at the trait level means `JsonWriter` implementors no
longer need to import or pattern-match `Cow`.  A `Box<str>` is the minimal
allocation for the decoded text (no spare capacity), consistent with the
`TapeEntry` representation.
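A minimal sketch of the split interface (hypothetical `StringSink`/`Collect` names; the real trait also carries `key`/`escaped_key` and the other token methods):

```rust
/// Borrowed vs. decoded strings as two explicit methods — no `Cow` in
/// the trait surface.
pub trait StringSink<'src> {
    fn string(&mut self, s: &'src str);        // borrowed, zero-copy
    fn escaped_string(&mut self, s: Box<str>); // escape-decoded, owned
}

/// Toy implementor that just collects what it is handed.
pub struct Collect<'src> {
    pub borrowed: Vec<&'src str>,
    pub owned: Vec<Box<str>>,
}

impl<'src> StringSink<'src> for Collect<'src> {
    fn string(&mut self, s: &'src str) {
        self.borrowed.push(s);
    }
    fn escaped_string(&mut self, s: Box<str>) {
        self.owned.push(s);
    }
}
```

The parser picks the method from its `str_escaped` flag, so implementors never pattern-match a `Cow`.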

### Results

All 18 unit tests and 4 doc-tests continue to pass.

## Session — Zero-allocation parse_json_impl fast path

### What was done

Eliminated the two remaining heap allocations from the non-escaping path of
`parse_json_impl`:

**Frames stack**: replaced `Vec<FrameKind>` with a caller-supplied
`&mut [FrameKind; 64]` and a `frames_depth: usize` cursor.  `push` / `pop` /
`last` / `is_empty` are now simple array-index operations.  Nesting beyond 64
levels returns `State::Error`.  `FrameKind` gained `#[derive(Copy, Clone,
PartialEq)]` to enable the array semantics.

**Unescape buffer**: replaced the `unescape_str(s) -> String` helper
(which allocated a fresh `String` then a second time for `into_boxed_str`)
with `unescape_str(s, out: &mut String)` that reuses a caller-supplied buffer.
Each escaped value now performs exactly one allocation (`Box::from(buf.as_str())`).

`parse_with` (the public entry point) allocates both resources on its own
stack frame and passes them down, so the public API is unchanged.

`unescape_str` is now `#[unsafe(no_mangle)]` + `#[inline(never)]` and `pub`,
giving it a stable C-linkage symbol for profiling or external calls.

### Design decisions

64 levels of nesting covers all realistic JSON; deeply-nested pathological
inputs are rejected as errors.  The `String` reuse avoids the
`String::with_capacity` allocation on every escape-containing token while
still producing a proper `Box<str>` for the `TapeEntry`.
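The array-backed frame stack can be sketched as free functions (hypothetical `push`/`pop` helpers; in the crate these are inline operations on `frames_depth`):

```rust
/// Caller-supplied, allocation-free frame stack with a fixed depth cap.
#[derive(Copy, Clone, PartialEq, Debug)]
pub enum FrameKind {
    Object,
    Array,
}

pub const MAX_DEPTH: usize = 64;

/// Returns false when nesting exceeds MAX_DEPTH; the parser then reports
/// an error instead of reallocating.
pub fn push(frames: &mut [FrameKind; MAX_DEPTH], depth: &mut usize, k: FrameKind) -> bool {
    if *depth == MAX_DEPTH {
        return false;
    }
    frames[*depth] = k;
    *depth += 1;
    true
}

pub fn pop(frames: &[FrameKind; MAX_DEPTH], depth: &mut usize) -> Option<FrameKind> {
    if *depth == 0 {
        return None;
    }
    *depth -= 1;
    Some(frames[*depth])
}
```

`Copy` on `FrameKind` is what lets the array be pre-filled and indexed without moves.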

### Results

All 18 unit tests and 4 doc-tests pass.  The hot path (no escape sequences)
now allocates zero bytes inside `parse_json_impl` itself.


---

## Session 3 — Hand-written AVX-512BW assembly translation

### What was done

Created `asm/x86_64/parse_json_zmm_dyn.s` — a complete hand-written GNU
assembler translation of the `parse_json_impl` state machine.

Two preparatory changes were also made to `src/lib.rs`:

- `FrameKind` received `#[repr(u8)]` with explicit discriminants
  `Object = 0` and `Array = 1`, giving a stable ABI for the assembly.
- A thin `is_valid_json_number_c` wrapper was added with
  `#[unsafe(no_mangle)] pub extern "C"` linkage so it can be called from
  assembly without name-mangling.

### Design decisions

**Direct threading** — each state ends with an unconditional `jmp` to the
next state label.  No integer state variable is stored anywhere; the
program counter encodes the state.  A pair of registers (`r10` = resume
address, `r11` = EOF-handler address) is loaded just before every
`jmp .Lchunk_fetch`, so the shared fetch block can service every state
with a final `jmp r10`.

**Inlined classify_zmm** — the AVX-512BW classification (six
`vpcmpub`/`vpcmpeqb` instructions + three `korq` merges + four `kmovq`
extracts) is inlined at `.Lchunk_fetch`.  Constants live in `.rodata` as
six 64-byte lanes matching the `ByteStateConstants` layout.

**Register allocation** — five callee-saved GP registers carry persistent
state across the entire function:

| Register | Purpose |
|----------|---------|
| `rbx`    | writer data pointer (fat-ptr data half) |
| `r12`    | `src_base` |
| `r13`    | `src_end` |
| `r14`    | writer vtable pointer (fat-ptr vtable half) |
| `r15`    | `frames_buf` (`&mut [u8; 64]`) |

`rcx` holds `chunk_offset` inside the inner loop and is saved to
`[rbp-168]` (LOC_COFF) across every vtable call.

**Vtable offsets** — the 15-entry `dyn JsonWriter` vtable is documented
at the top of the file with byte offsets +0 through +112, derived from the
Rust fat-pointer convention (drop/size/align first, then methods in
declaration order).

**EOF handling** — each state provides its own `r11` EOF target set just
before the refetch jump.  States where EOF is legal (top-level whitespace,
after a complete value) land at `.Leof_after_value`; all others land at
`.Lerror`.

### Results

All 18 unit tests and 4 doc-tests continue to pass after `cargo fmt &&
cargo test`.  The assembly file is not yet linked into the crate but is
provided for inspection, benchmarking, and future FFI integration.

### Commit

`8cbce74` — asm: add x86_64 AVX-512BW direct-threading JSON parser

### Inline .Lcheck_inner_end — direct jumps to next state

**What was done**: Removed the `.Lcheck_inner_end` trampoline label. The
trampoline was a shared block (`cmp rcx, chunk_len; jae .Lchunk_fetch;
jmp r10`) reached by all 26 state-transition sites after loading
`r10`/`r11`.

Each site was rewritten to:

```asm
    lea     r11, [rip + .Leof_handler]
    cmp     rcx, qword ptr [rbp + LOC_CHUNK_LEN]
    jb      .Lnext_state              ; fast path: direct jump, r10 untouched
    lea     r10, [rip + .Lnext_state] ; slow path: set r10 for refetch
    jmp     .Lchunk_fetch
```

`r10` is now loaded only when the chunk is actually exhausted; the fast
path jumps directly to the target state without touching `r10` at all.
The refetch labels (`.Lrefetch_*`) are unchanged since they always feed
`.Lchunk_fetch` and still set `r10`.

**Design decisions**: The `jb` (jump-if-below) form avoids a negated
comparison.  `r11` is set unconditionally so that if `.Lchunk_fetch`
later hits EOF it always has a valid handler, regardless of which path
was taken.

**Results**: Zero references to `.Lcheck_inner_end` in the file.  File
grew from 1080 to 1124 lines (net +44 from expanding 26 × 3-line blocks
to 5 lines each, minus the deleted 10-line trampoline).

**Commit**: `e0e1993` — asm: inline .Lcheck_inner_end; use direct jb to next state

---

## Session 5 — Rust entrypoint for the zmm-dyn assembly parser

### Adding `parse_to_tape_zmm_dyn`

**What was done**: Added a public Rust function `parse_to_tape_zmm_dyn<'a>(src: &'a str) -> Option<Tape<'a>>` that drives the hand-written AVX-512BW assembly parser (`parse_json_zmm_dyn`) and returns the same `Tape` type as the pure-Rust entrypoints.

The work spanned several sub-problems that had to be solved before the doctest `assert_eq!(tape.root().get("x").as_i64(), Some(1))` passed.

### Build system: compiling the assembly

A `build.rs` was created to compile `asm/x86_64/parse_json_zmm_dyn.S` with the `cc` crate (added to `[build-dependencies]`).  The `.s` file was renamed to `.S` (uppercase) so that the C pre-processor runs first and strips `//` comments before GAS assembles the file — GAS in AT&T mode treats `//` as division.

### Assembly correctness fixes

Three assembly bugs were found and fixed before turning to the Rust side:

1. **Three-register addressing**: x86-64 addressing modes allow only base +
   index·scale + displacement, so `[r12+rax+rcx]` is invalid and GAS rejects
   it.  Nine sites were fixed with `add rax, r12` followed by the
   two-register form.
2. **Wrong shift register**: `shl rax, cl` used `cl` (the chunk-offset byte of `rcx`) instead of the chunk length from `rsi`.  Fixed by inserting `mov ecx, esi` before the shift.
3. **Uninitialised `LOC_CHUNK_LEN`**: The first call to `.Lchunk_fetch` read an uninitialised stack slot.  Fixed by zero-initialising it in the prologue.

### Replacing the raw Rust dyn-vtable with a stable C-ABI vtable

**Design decisions**: The initial approach passed a raw Rust `dyn JsonWriter` fat-pointer vtable to the assembly, which assumed fixed byte offsets (24, 32, 40, …) for each method.  Rust's `dyn` vtable layout is implementation-defined (the header contains size, align, and a destructor before the first method slot), so those offsets are unstable and differed from reality.

The fix replaces the raw vtable with a `#[repr(C)] struct ZmmVtab` whose eleven fields are `unsafe extern "C"` function pointers at predictable 8-byte-aligned offsets (0, 8, 16, …).  Rust fills this struct on the stack with concrete trampoline functions, and the assembly uses matching `.equ VTAB_*` constants starting at 0.

Each trampoline casts `data: *mut ()` to `*mut TapeWriter<'static>` (the `'static` is a white-lie required because `extern "C"` functions cannot carry generic lifetime parameters; safety is upheld because the writer and source JSON both outlive the assembly call).  Trampolines for `escaped_string` and `escaped_key` copy the unescaped bytes into a fresh `Box<str>` to give proper ownership semantics.

All raw-pointer dereferences are wrapped in `unsafe {}` blocks to satisfy the Rust 2024 `unsafe_op_in_unsafe_fn` requirement.
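The shape of the C-ABI vtable can be sketched like this (two slots and illustrative field names; the real `ZmmVtab` has eleven):

```rust
/// `#[repr(C)]` fixes the field offsets (0, 8, ...), unlike a Rust `dyn`
/// vtable whose layout is implementation-defined.
#[repr(C)]
pub struct Vtab {
    pub number: unsafe extern "C" fn(data: *mut (), val: i64),
    pub bool_: unsafe extern "C" fn(data: *mut (), val: bool),
}

pub struct Writer {
    pub nums: Vec<i64>,
    pub bools: Vec<bool>,
}

// Trampolines recover the concrete writer type from the erased pointer.
unsafe extern "C" fn tramp_number(data: *mut (), val: i64) {
    // SAFETY: caller guarantees `data` points at a live Writer.
    unsafe { (*(data as *mut Writer)).nums.push(val) }
}

unsafe extern "C" fn tramp_bool(data: *mut (), val: bool) {
    // SAFETY: as above.
    unsafe { (*(data as *mut Writer)).bools.push(val) }
}

/// A caller holding only (data, &vtab) — e.g. the assembly — drives the
/// writer through fixed offsets.
pub fn demo() -> Writer {
    let vtab = Vtab { number: tramp_number, bool_: tramp_bool };
    let mut w = Writer { nums: vec![], bools: vec![] };
    let data = &mut w as *mut Writer as *mut ();
    unsafe {
        (vtab.number)(data, 42);
        (vtab.bool_)(data, true);
    }
    w
}
```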

### Fixing r8/r9 clobbering in `.Lemit_atom`

**What was done**: In `.Lea_number`, the atom pointer and length were saved into `r8`/`r9` before calling `is_valid_json_number_c`.  Both registers are caller-saved and were clobbered by the call, so the subsequent `mov rsi, r8 / mov rdx, r9` loaded garbage, causing the number vtable method to receive wrong arguments.

Fixed by saving pointer and length to the stack (`LOC_ATOM_START` / `LOC_STR_START`, which are stale at this point) and reloading from those slots after the validation call.

**Results**: All 18 unit tests and 5 doctests pass with zero warnings.  The doctest `assert_eq!(tape.root().get("x").as_i64(), Some(1))` passes correctly.

**Commit**: `944d97f` — feat: add parse_to_tape_zmm_dyn Rust entrypoint with C-ABI vtable

---

## Session 6 — Benchmarking `parse_to_tape_zmm_dyn`

### What was done

Added `asmjson/zmm_dyn` as a Criterion benchmark case in all three existing groups (`string_array`, `string_object`, `mixed`) in [benches/parse.rs](benches/parse.rs), gated on `#[cfg(target_arch = "x86_64")]` so it is silently skipped on other platforms.

### Results (10 MiB inputs, release build, x86_64)

| group         | asmjson/zmm   | asmjson/zmm_dyn | asmjson/u64  | sonic-rs      |
|---------------|---------------|-----------------|--------------|---------------|
| string_array  | 8.48 GiB/s    | 7.95 GiB/s      | 6.88 GiB/s   | 7.15 GiB/s    |
| string_object | 5.77 GiB/s    | 5.47 GiB/s      | 4.68 GiB/s   | 4.08 GiB/s    |
| mixed         | 451 MiB/s     | 445 MiB/s       | 448 MiB/s    | 484 MiB/s     |

`zmm_dyn` is ~6–8 % slower than the pure-Rust `zmm` path on the string-heavy workloads.  The overhead comes from the two extra indirect calls (through `ZmmVtab`) per parsed token compared with the inlined fast paths in the Rust state machine.  On the mixed workload (many small numbers, booleans, and structural tokens) the gap closes to ~1 % because the vtable-call overhead is a smaller fraction of the per-token work.

### Design decisions

No changes were made to the vtable or trampoline design.  The benchmark baseline is the Rust `asmjson/zmm` path rather than a dedicated "no-vtable" comparison, which keeps the measurement directly actionable: the assembly entrypoint needs to match or beat the Rust zmm path to justify its complexity.

**Commit**: `6525c72` — bench: add asmjson/zmm_dyn to all three criterion groups

---

## Session 7 — Replace `TapeEntry` enum with a 16-byte `#[repr(C)]` struct

### What was done

Replaced the `pub enum TapeEntry<'a>` (whose Rust-enum representation was
compiler-determined and varied by variant) with a fixed-size `#[repr(C)]
pub struct TapeEntry<'a>` that is exactly **16 bytes** on all platforms:

| word | offset | meaning |
|------|--------|---------|
| 0 | 0 | bits 63–60: `TapeEntryKind` discriminant (0–10); bits 27–0: string length **or** object/array end-index |
| 1 | 8 | `*const u8` pointer to string bytes; null for non-string kinds |

A companion `#[repr(u8)] pub enum TapeEntryKind` carries the fixed
discriminants (Null=0, Bool=1, … EndArray=10).  These values are part of
the public ABI that `parse_json_zmm_tape.S` will consume.

For `EscapedString` / `EscapedKey` the pointer is the raw `Box<str>` data
pointer whose ownership lives in the entry; `Drop` and `Clone` are
implemented manually to free / deep-copy the heap allocation correctly.

### Design decisions

*Fixed layout*: The primary motivation was to give the forthcoming
`parse_json_zmm_tape.S` assembly a deterministic, ABI-stable representation
to write into directly in 16-byte stores, with no Rust enum layout involved.
The `u64` tag-word encodes the kind in the top four bits and the
length/index in the low 28 bits; the assembly can set an entry in two `mov`
instructions (tag word then pointer).
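The packing can be sketched in Rust (28-bit payload as of this session; a later session widens it to 60 bits — `pack`/`kind`/`payload` are illustrative helpers, not the crate's accessors):

```rust
pub const KIND_SHIFT: u32 = 60;
pub const PAYLOAD_MASK: u64 = (1 << 28) - 1;

/// Fixed 16-byte layout: tag word (kind in bits 63–60, payload below)
/// plus a raw string pointer, null for non-string kinds.
#[repr(C)]
pub struct Entry {
    pub tag: u64,
    pub ptr: *const u8,
}

pub fn pack(kind: u8, payload: u64) -> u64 {
    ((kind as u64) << KIND_SHIFT) | (payload & PAYLOAD_MASK)
}

pub fn kind(tag: u64) -> u8 {
    (tag >> KIND_SHIFT) as u8
}

pub fn payload(tag: u64) -> u64 {
    tag & PAYLOAD_MASK
}
```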

*Backwards-compat shim*: All old enum-variant names (`TapeEntry::Null`,
`TapeEntry::Bool(v)`, `TapeEntry::StartObject(n)`, …) are kept as
`#[allow(non_snake_case)]` constructor methods / associated constants so the
pre-existing test suite compiled unchanged.  Pattern-match arms that
destructured enum payloads were rewritten to use the accessor methods
(`as_start_object()`, `as_bool()`, etc.).

### Results

`size_of::<TapeEntry>() == 16`, `align_of::<TapeEntry>() == 8`.  All 18
unit tests and 5 doctests pass; zero warnings.

**Commit**: `e89f2fc` — refactor: replace TapeEntry enum with 16-byte repr(C) struct



## Session 9 — direct-write assembly tape parser (`parse_json_zmm_tape`)

### What was done

Added `asm/x86_64/parse_json_zmm_tape.S`, a new hand-written x86-64 AVX-512BW
assembly parser that writes `TapeEntry` values directly into a pre-allocated
array, bypassing all virtual dispatch overhead present in the existing `zmm_dyn`
variant.  Supporting changes: `tape_take_box_str` C helper in `lib.rs`;
`parse_json_zmm_tape` extern declaration; `parse_to_tape_zmm_tape` public
function; `build.rs` and `benches/parse.rs` updated.  Nine new unit tests
(27 total) verify correctness against the reference Rust parser.

### Design decisions

**Register map** — `rbx` holds `tape_len` live in a register (not spilled to
memory) to avoid a load/store on every emitted token.  `r14` is `tape_ptr`
(the base of the pre-allocated `TapeEntry` array), replacing the vtable
pointer in `zmm_dyn`.  `r15` is `frames_buf` (frame-kind stack), and `r12`/`r13`
are `src_base`/`src_end` as before.

**Inline writes** — instead of calling 11 vtable slots, each token type is
written inline:
```asm
lea  rax, [rbx + rbx]
lea  rax, [r14 + rax*8]   ; tape_ptr + tape_len*16
; fill tag_payload and ptr fields ...
inc  rbx                   ; tape_len++
```

**`open_buf`** — a separate `[u64; 64]` array holds the tape index of each
pending `StartObject`/`StartArray`.  On the matching `}` or `]`, the start
entry's `payload` field is back-patched with the end index.

**`tape_take_box_str`** — a `#[no_mangle] extern "C"` Rust helper converts
the `unescape_buf` `String` into a leaked `Box<str>`, writing the raw pointer
and length to out-params.  The assembly calls this for every escaped string or
key, then writes an `EscapedString`/`EscapedKey` `TapeEntry` that owns the box.

**Pre-allocation** — `parse_to_tape_zmm_tape` reserves `src.len() + 2`
entries before calling the assembly; this is always sufficient for valid JSON
(at most one token per input byte) so no reallocation occurs during parsing.

### Bug fixes discovered during testing

Two bugs found while adding correctness tests:

1. **String-at-chunk-boundary EOF failure** — when a string's closing `"` fell
   exactly at a 64-byte chunk boundary, the code set `r11 = .Lerror_from_r11`
   and jumped to `chunk_fetch` with `r10 = .Lafter_value`.  On the following
   `chunk_fetch` the source was exhausted, so `r11` was invoked and the parse
   failed even for a valid top-level string.  Fix: set `r11 = .Leof_after_value`
   in the string and escaped-string emit paths before the chunk-boundary
   fallthrough.

2. **Empty input accepted** — `.Leof_after_value` checked only `frames_depth == 0`
   before reporting success, so empty input (`""`) returned `Ok` with an empty
   tape.  Fix: added `test rbx, rbx; jz .Lerror` to reject zero-token output.

### Results

All 27 unit tests pass; all 6 doctests pass (3 ignored).  The implementation
is compiled and linked via `cc::Build` in `build.rs` alongside the existing
`parse_json_zmm_dyn.S`.  Correctness is validated by comparing `TapeEntry`
slices against the reference Rust parser across atoms, plain strings, escaped
strings, long strings (>64 bytes), nested structures, escaped keys, whitespace
variants, and rejection of malformed inputs.

**Commit**: `84bb057` — feat: add parse_to_tape_zmm_tape direct-write assembly parser

## Session 8 — Benchmarks and PAYLOAD_MASK widening

### Benchmarking `parse_to_tape_zmm_tape` vs the field

`cargo bench` was run to compare the three tape parsers: the Rust reference
`zmm`, the dynamic-dispatch assembly `zmm_dyn`, and the new direct-write
`zmm_tape`.

| benchmark     | zmm (Rust) | zmm_dyn   | zmm_tape  | δ tape vs dyn |
|---------------|-----------|-----------|-----------|---------------|
| string_array  | 1.251 ms  | 0.959 ms  | 1.008 ms  | +5% slower    |
| string_object | 1.709 ms  | 1.426 ms  | 1.554 ms  | +9% slower    |
| mixed         | 14.85 ms  | 15.34 ms  | 11.86 ms  | -23% faster   |

On purely string-heavy workloads the vtable-call overhead of `zmm_dyn` is
negligible compared to the SIMD scan time, so the extra indirection costs
nothing and `zmm_dyn` wins.  On `mixed` (twitter-style: many short integer,
boolean, null, and nested-object tokens) the direct tape writes in `zmm_tape`
avoid enough per-token overhead to win by 23%.

### Widening PAYLOAD_MASK from 28 bits to 60 bits

`TapeEntry` stores the kind in bits 63-60 and the payload in bits 59-0, giving
60 bits of payload capacity.  The original constant used only the low 28 bits
(`(1 << 28) - 1`), wasting bits 59-28 and capping string/array lengths
unnecessarily.

**Rust** (`src/tape.rs`): `PAYLOAD_MASK` changed to `u64::MAX >> 4` (bits 59-0).

**Assembly** (`asm/x86_64/parse_json_zmm_tape.S`): the previous
`and r10, 0x0FFFFFFF` could not be widened directly because x86-64 encodes
`and` immediate as a 32-bit sign-extended value (max `0x7FFFFFFF`).  A 60-bit
immediate would require a 64-bit `mov` + `and` pair.  Instead the mask is
applied with a shift pair: `shl r10, 4` / `shr r10, 4`, which clears the top
4 bits without needing a large immediate.  All ten masking sites in the file
were updated.
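In Rust terms the shift pair is simply:

```rust
/// The assembly's `shl r10, 4` / `shr r10, 4` pair: clears the top four
/// (kind) bits of the tag word without needing a 60-bit immediate.
pub fn mask_payload(tag: u64) -> u64 {
    (tag << 4) >> 4
}
```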

All 27 unit tests and 6 doctests pass after the change.

**Commit**: `2c59a28` -- refactor: widen TapeEntry payload from 28 to 60 bits

## Session 9 — Perf profiling of `parse_to_tape_zmm_tape`

### What was done

A tight-loop driver (`examples/perf_zmm_tape.rs`) was created to generate
~10 MiB of mixed JSON (same generator as the criterion `bench_mixed` benchmark)
and call `parse_to_tape_zmm_tape` 400 times.  The binary was built with
`CARGO_PROFILE_RELEASE_DEBUG=true cargo build --release --example perf_zmm_tape`
to preserve symbols, then profiled with
`perf record -g --call-graph dwarf -F 999`.

### Results

Flat profile (top user-space functions):

| % cycles | Function |
|----------|----------|
| 43.35 % | `parse_json_zmm_tape` |
| 8.92 % | `perf_zmm_tape::main` (almost entirely `Tape` drop) |
| 8.20 % | `<TapeEntry as Drop>::drop` |
| 4.03 % | `asmjson::is_valid_json_number` |
| 2.92 % | `is_valid_json_number_c` |

`perf annotate` of `parse_json_zmm_tape` identified the hottest states:

* **`.Lkey_end`** -- writing a `TapeEntry` for a key (`mov %r10,(%rax)` at 1.48 %
  of function samples), plus surrounding bit-manipulation (kind-tag ORing,
  pointer store, counter increments).  Every object key emits one entry, so
  this is the dominant hot path on the twitter-like dataset.
* **`.Lkey_chars`** -- inner scan loop for key bytes: `andn`/`or`/`shr`/`tzcnt`
  bitmap walk plus a byte load and `\` check (0.58-0.78 % per instruction,
  ~6 % of function samples collectively).
* **`.Lafter_colon`** -- next-byte fetch and dispatch after `:` (~5 % of function),
  with several `mov`/`tzcnt`/`add` instructions at 0.59-0.95 %.
* **`.Lstring_chars`** -- tape write for string entries (0.89 %).
* **`.Latom_chars`** -- the `call is_valid_json_number_c` instruction (0.88 %).

Many hot instructions use frame-pointer-relative stack slots (`-0x80(%rbp)`,
`-0x98(%rbp)`, etc.) for locals such as `chunk_len`, `string_bitmask`, and
`colon_bitmask`.  These are spilled because the function uses more live values
than the callee-saved registers can accommodate.

### Design decisions

No optimisations were applied in this session; profiling was observation-only.
The main actionable findings are:

1. **Drop overhead (~16 %)**: `TapeEntry::drop` checks `kind == EscapedString ||
   EscapedKey` for every entry.  On mixed JSON most entries are plain strings or
   scalars, so the check always fails, yet each still pays for one `kind()` decode
   plus a branch.  A future optimisation could skip the drop loop by tracking
   escape counts separately or keeping escaped entries in a side-vector.
2. **Number validation (~7 %)**: `is_valid_json_number` + `is_valid_json_number_c`
   together consume 7 % of cycles.  Inlining or simplifying the validator could
   recover meaningful throughput, especially for the integer-heavy mixed workload.
3. **Stack spills in hot loops**: register pressure forces `chunk_len` and the two
   bitmask locals to memory.  Restructuring locals or reducing live-variable count
   could reduce load/store traffic in `.Lkey_chars` and `.Lafter_colon`.

**Commit**: n/a -- profiling only, no source changes

## Session 10 — Skip TapeEntry drops via Tape::has_escapes

### What was done

Profiling showed ~16 % of cycles spent in `<TapeEntry as Drop>::drop`, which
checks `kind == EscapedString || EscapedKey` for every entry even when the
tape contains none.  The fix: add a `has_escapes: bool` field to `Tape` and
skip per-element destructors when it is `false`.

**Changes:**

* `src/tape.rs` — `Tape` gains `pub(crate) has_escapes: bool`.  A `Drop for
  Tape` impl is added: when `!has_escapes` it calls `unsafe { self.entries.set_len(0) }`
  before the Vec drops, so the backing allocation is freed without invoking
  `TapeEntry::drop` on each element.  `TapeWriter` gains the same field,
  set to `true` inside `escaped_string` and `escaped_key`, then forwarded to
  `Tape` in `finish()`.

* `src/lib.rs` — `parse_json_zmm_tape` extern declaration gains an 8th
  argument `has_escapes_out: *mut bool`.  `parse_to_tape_zmm_tape` initialises
  `let mut has_escapes = false`, passes `&raw mut has_escapes` to the assembly,
  and propagates it into the returned `Tape`.

* `asm/x86_64/parse_json_zmm_tape.S` — documents the new 8th argument
  (`[rbp+24]`, `.equ LOC_HAS_ESC_OUT, +24`).  Both `.Lsc_emit_escaped` and
  `.Lke_emit_escaped` store `1` to `*has_escapes_out` immediately after
  writing the tape entry. No new stack space needed (the argument lives in
  the caller's frame above the saved `rbp`).

### Design decisions

Setting the flag in the assembly at the two emit sites keeps the hot paths
(plain strings, keys, numbers) unchanged.  The alternative of scanning the
tape after parsing would have been O(n) on every call.

`TapeEntry::drop` is kept unchanged for correctness when entries are used
outside a `Tape` (e.g. constructed in tests).
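
The mechanism can be sketched in isolation (hedged: names are simplified stand-ins, and `Entry` here merely counts destructor runs; the real `TapeEntry` is only skipped when no entry owns heap data):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Counts how many element destructors actually run.
static DROPS: AtomicUsize = AtomicUsize::new(0);

struct Entry; // stand-in for TapeEntry
impl Drop for Entry {
    fn drop(&mut self) {
        DROPS.fetch_add(1, Ordering::Relaxed);
    }
}

struct Tape {
    entries: Vec<Entry>,
    has_escapes: bool,
}

impl Drop for Tape {
    fn drop(&mut self) {
        if !self.has_escapes {
            // Safety: when no entry owns heap data, skipping the element
            // destructors cannot leak; Vec's own Drop still frees the
            // backing allocation in one call.
            unsafe { self.entries.set_len(0) };
        }
    }
}
```

Dropping a `Tape` with `has_escapes == false` then costs a single deallocation with zero per-element work.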

### Results

All 27 unit tests and 6 doctests pass.

**Commit**: `3ec8fba` -- perf: skip TapeEntry drops via Tape::has_escapes flag

## Session 11 — SWAR digit fast path for short numbers

### What was done

Profiling showed ~7 % of cycles in `is_valid_json_number` + `is_valid_json_number_c`.
The vast majority of numbers in twitter-like JSON are plain integers up to 8 bytes
(e.g., `"id": 12345678`).  These can be validated without a function call by using
SWAR (SIMD Within A Register) bit tricks inside `.Lemit_atom`.

The fast path is applied in both `parse_json_zmm_tape.S` and `parse_json_zmm_dyn.S`.

### Design: SWAR all-digits check

For each byte b in the loaded qword, the check exploits the layout of ASCII digits
('0' = 0x30 .. '9' = 0x39):

```
t = (b | 0x80) - 0x30

Lower bound (b >= '0'):  bit 7 of t = 1
  Setting the top bit ensures (b|0x80) >= 0x80.  For b >= 0x30 the subtraction
  0xB0..0xBF - 0x30 = 0x80..0x8F leaves bit 7 set.  For b < 0x30 the result
  drops to at most 0x7F (top bit clear) -- the borrow has consumed bit 7.

Upper bound (b <= '9'):  (t + 0x06) & 0x10 == 0
  A digit gives t-byte = 0x80..0x89; adding 0x06 = 0x86..0x8F, bit 4 clear.
  A byte > '9' gives t-byte >= 0x8A; adding 0x06 >= 0x90, bit 4 set.
```

Whole-word check:

```asm
  mov  r10, 0x8080808080808080
  or   r10, rax                 ; set top bit per byte
  mov  r11, 0x3030303030303030
  sub  r10, r11                 ; t = (b|0x80)-0x30 per byte
  ; lower bound
  mov  r11, r10
  not  r11                      ; invert: test checks all bit-7s were set
  mov  rax, 0x8080808080808080
  test r11, rax                 ; ZF=1 => all bytes >= '0'
  ; upper bound
  mov  r11, 0x0606060606060606
  add  r10, r11
  mov  r11, 0x1010101010101010
  test r10, r11                 ; ZF=1 => all bytes <= '9'
```

Note: `sub`/`add`/`test` cannot encode 64-bit immediates on x86-64 (max 32-bit
sign-extended).  Large constants are loaded into a register via `mov r64, imm64`
first.

### Algorithm

1. If `rdx > 8`: always use full validator (atom doesn't fit in a qword).
2. If `rsi + 8 > src_end`: fewer than 8 bytes remain in the source buffer, so an
   unconditional 8-byte load would read past the end; fall back to the validator
   rather than add partial-load logic.
3. Leading-zero guard: if first byte is '0' and `rdx > 1`, fall back to
   validator (would otherwise accept invalid "01", "007", etc.).
4. Load 8 bytes from `rsi`; fill the `8 - rdx` unused high bytes with '0'
   (0x30) using a shift-derived mask, so they vacuously pass the digit check.
5. SWAR check.  If all bytes are digits: write Number entry directly.
6. Otherwise: call `is_valid_json_number_c` (handles '-', '.', 'e', leading
   zeros, etc.).
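
Steps 1 and 3–5 can be expressed in portable Rust (a sketch under simplified assumptions — it takes a slice, so the buffer-end guard of step 2 doesn't apply, and `swar_all_digits` is a hypothetical name, not the crate's code):

```rust
// Portable sketch of the SWAR fast path: pad the unused high bytes with
// '0' so they vacuously pass, then apply the two per-byte bound checks.
fn swar_all_digits(atom: &[u8]) -> bool {
    let n = atom.len();
    if n == 0 || n > 8 {
        return false; // doesn't fit in a qword: use the full validator
    }
    if atom[0] == b'0' && n > 1 {
        return false; // leading-zero guard: "01", "007", ...
    }
    let mut buf = [b'0'; 8]; // '0'-fill the 8 - n unused bytes
    buf[..n].copy_from_slice(atom);
    let b = u64::from_le_bytes(buf);

    // t = (b | 0x80) - 0x30 per byte; the forced top bit keeps every
    // byte >= 0x80 > 0x30, so no borrow crosses a byte boundary.
    let t = (b | 0x8080_8080_8080_8080).wrapping_sub(0x3030_3030_3030_3030);
    // lower bound: every byte >= '0'  <=>  bit 7 set in every t byte
    let lower = (!t & 0x8080_8080_8080_8080) == 0;
    // upper bound: every byte <= '9'  <=>  t + 0x06 keeps bit 4 clear
    let upper = (t.wrapping_add(0x0606_0606_0606_0606) & 0x1010_1010_1010_1010) == 0;
    lower && upper
}
```

A `false` return means "fall back to the full validator", so inputs like `-5` or `1.5` are still handled correctly, just not on the fast path.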

### Results

All 27 unit tests pass, including new boundary tests:
- Pure integers 1--8 bytes long hit the fast path and match the dyn reference.
- A 9-byte integer ("123456789") correctly falls through to the full validator.
- Leading-zero inputs ("01", "00", "007", "01234567") are still rejected.

**Commit**: `bae1632` -- perf: SWAR digit fast path for short numbers in .Lemit_atom

## Session 12 — perf profile and NT-store experiment

### Profiling zmm_tape

Ran `perf record -g --call-graph dwarf` on `perf_zmm_tape` (400 iterations of
~10 MiB mixed JSON).  Flat profile (self %):

| Symbol | Self% |
|---|---|
| `parse_json_zmm_tape` | 59.4 % |
| `asmjson::is_valid_json_number` | 1.3 % |
| `is_valid_json_number_c` | 0.6 % |
| all allocator / drop | ~0 % |

Inside the parser the profile is very flat — no instruction exceeds 2 %.
The three hottest instructions (~1.9 % combined) are the `mov %r10,(%rax)` tape
entry tag stores.  Number validation after the SWAR fast path is now ~2 % total.
Memory/drop overhead is effectively zero thanks to `has_escapes`.

### NT store experiment

Replaced every tape entry write (`mov qword ptr [rax], r10` and
`mov qword ptr [rax + 8], ...`) with `movnti` (non-temporal store), which
bypasses the cache on write.  Added `sfence` before the function `ret`.

**Result: 3–5 % regression on all three bench workloads.**

Reason: the benchmark iterates over ~1 MiB of JSON many times.  The tape fits
in L3 cache.  With regular stores the L3 is warm when `tape_sum_lens` traverses
the tape immediately after parsing; with `movnti` the traversal refetches from
DRAM.  NT stores are appropriate only when the working set exceeds L3 (large
one-shot streams where the tape would be evicted before the consumer reads it).

The commit was reverted (`0673d7d`).

### Design decision

Non-temporal stores are a context-dependent trade-off:

- **Beneficial**: streaming workloads with tapes larger than L3 (e.g., multi-MB
  one-shot document ingestion) where write and read are separated by enough work
  or time to cause natural eviction.
- **Harmful**: small/medium JSON or repeated parsing where the tape stays hot in
  L3 (as in the criterion bench).

No further action taken; existing `mov` stores are optimal for the benchmark
profile.

**Commits**: `e9bf4e1` NT stores (then `0673d7d` revert)

---

## Session 13 — promote hot stack slots to live registers r8/r9

### Motivation

The perf profile from session 12 highlighted two stack-slot loads as the
highest-weight individual instructions:

| LOC slot        | sample weight |
|-----------------|---------------|
| LOC_CHUNK_LEN   | 6.18 %        |
| LOC_POS         | 5.20 %        |

Both are read on every iteration of the inner dispatch loop (chunk_offset
advance, `cmp rcx, chunk_len`, `lea rdx, [r12 + pos]`).

### Design decisions

**Register selection**: After the prologue, `r8` (which carried `frames_buf`
in the calling convention) is moved to `r15`, and `r9` (which carried
`open_buf`) is spilled to `LOC_OPEN_BUF`.  Both `r8` and `r9` are therefore
free as caller-saved scratch registers for the rest of the function.

| Register | Live value   | Stack spill home |
|----------|-------------|-----------------|
| `r8`     | chunk\_len  | `LOC_CHUNK_LEN` |
| `r9`     | pos         | `LOC_POS`       |

**Spill sites**: External calls (`unescape_str`, `tape_take_box_str`,
`is_valid_json_number_c`) are caller-saved clobbers, so r8/r9 must be
saved to their stack homes before each call cluster and restored afterward.
These paths are hit only for escaped strings/keys and numbers that fail the
fast SWAR path — all are rare in typical JSON.

**zmm\_space pointer conflict**: The `.Lclassify_do` block previously used
`r9` as a scratch pointer to the `.Lzmm_space` lookup table.  Moved to `rdi`
(safe because no calls occur in `chunk_fetch`).

**Prologue init**: Changed `mov qword ptr [rbp + LOC_POS], rax` →
`xor r9d, r9d` and `mov qword ptr [rbp + LOC_CHUNK_LEN], rax` →
`xor r8d, r8d`.

**chunk\_fetch advance**: Collapsed the old three-instruction sequence
```
mov rax, [rbp + LOC_CHUNK_LEN]
add [rbp + LOC_POS], rax
mov rax, [rbp + LOC_POS]           ; → r9 after sed
lea rdx, [r12 + rax]
```
into two instructions:
```
add r9, r8                          ; pos += chunk_len
lea rdx, [r12 + r9]                 ; chunk_ptr
```

### Results

```
mixed/asmjson/zmm_tape     −5.3 %  time   (+5.6 % throughput)  vs previous baseline
string_array/asmjson/zmm_tape  −1.7 %  time
string_object/asmjson/zmm_tape  −0.7 %  time  (within noise)
```

All 27 unit tests and 6 doc-tests pass.

**Commit**: `bc7891b` perf: promote LOC_CHUNK_LEN and LOC_POS to live registers r8/r9

## Session 17 — TapeOverflow error code with capacity-doubling retry

### What was done

Changed `parse_json_zmm_tape` from returning a `bool` (`1`=ok, `0`=error) to
returning a `u8` error code:

| Constant            | Value | Meaning                             |
|---------------------|-------|-------------------------------------|
| `RESULT_OK`         | `0`   | Parse succeeded                     |
| `RESULT_PARSE_ERROR`| `1`   | Invalid JSON                        |
| `RESULT_TAPE_OVERFLOW`| `2` | Tape buffer was too small           |

The Rust wrapper `parse_to_tape_zmm_tape` now starts with a conservative tape
capacity of `(src.len() / 4).max(2)` and doubles on every `RESULT_TAPE_OVERFLOW`
response until the parse succeeds.

### Design decisions

**Capacity checks in assembly**: A `cmp rbx, qword ptr [rbp + LOC_TAPE_CAP]`
/ `jae .Ltape_overflow` pair was inserted before every tape write site — 16
sites in total. The tape capacity is passed as a 9th stack argument (`LOC_TAPE_CAP = +32`).

**`.Lemit_atom` strategy**: `.Lemit_atom` uses `al=1/0` as an internal success
flag. Inserting a third return value there would have broken all callers that
use `test al, al; jz .Lerror`. Capacity checks were placed at the two *call
sites* instead (`.Latom_chars` and `.Latom_eof_flush`), leaving `.Lemit_atom`
internals unchanged.

**Memory safety on overflow**: Any `EscapedString`/`EscapedKey` entries already
written to the tape own `Box<str>` data. If the Vec is dropped with `len=0`,
those allocations leak. The `.Ltape_overflow` path first writes the partial
`rbx` (number of valid entries) to `*tape_len_out`, then returns `2`. The Rust
`RESULT_TAPE_OVERFLOW` arm calls `unsafe { tape_data.set_len(tape_len) }` so
the Vec correctly drops those entries before growing and retrying.

**Initial capacity**: `(src.len() / 4).max(2)` is intentionally small so the
retry path is exercised even on moderately sized inputs.
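
The wrapper's retry loop, sketched with a stand-in parser (hedged: `parse` here is a closure that overflows until the capacity suffices, not the real assembly entry point; `parse_with_retry` is an illustrative name):

```rust
const RESULT_OK: u8 = 0;
const RESULT_PARSE_ERROR: u8 = 1;
const RESULT_TAPE_OVERFLOW: u8 = 2;

// Returns the capacity at which the (stand-in) parse succeeded.
fn parse_with_retry(src_len: usize, entries_needed: usize) -> Result<usize, ()> {
    // Stand-in for parse_json_zmm_tape: overflows until cap is enough.
    let parse = |cap: usize| -> u8 {
        if cap >= entries_needed { RESULT_OK } else { RESULT_TAPE_OVERFLOW }
    };

    let mut cap = (src_len / 4).max(2); // conservative initial guess
    loop {
        match parse(cap) {
            RESULT_OK => return Ok(cap),
            RESULT_TAPE_OVERFLOW => cap *= 2, // drop partial entries, grow, retry
            _ => return Err(()), // RESULT_PARSE_ERROR: invalid JSON
        }
    }
}
```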

### Results

28 unit tests and 6 doc-tests pass. The new test `zmm_tape_overflow_retry`
builds a 200-element JSON array (~800+ tape entries), verifying that the
capacity-doubling retry produces the correct result.

**Commit**: `6c87ff4` feat: TapeOverflow error code with capacity-doubling retry

## Session 18 — Optimisation tips in README

### What was done

Added an **Optimisation tips** section to `README.md` (between Quick start and
Output formats) with two executable doc-test examples:

1. **Cache field refs from a one-pass object scan** — shows iterating a root
   object with `object_iter` once and storing the desired `TapeRef` values,
   avoiding the repeated O(n_keys) re-scan that `get(key)` performs on each
   call.

2. **Collect array elements for indexed or multi-pass access** — shows
   collecting `array_iter` results into a `Vec<TapeRef>`, giving O(1) random
   access and free additional passes over the same data.

### Design decisions

`TapeRef` is `Copy` (two `usize` fields), so storing it is cheap and safe for
the lifetime of the tape borrow.  The examples highlight this property
explicitly so users understand that there is no heap cost to caching refs.

The existing `Conformance note` section was added in the prior session; the new
section was inserted between Quick start and Output formats where it is most
visible to new users deciding how to traverse the parsed data.

### Results

28 unit tests + 8 doc-tests (including 2 new README examples) pass.

**Commit**: `e9ce7d8` docs: add optimisation tips — caching TapeRefs from object/array iterators

---

## Session 19 — serde feature: `from_taperef` Deserializer

### What was done

Added a `serde` optional feature that implements `serde::Deserializer<'de>` for
`TapeRef<'de, 'de>` and exposes a `from_taperef` top-level function.  The
feature gate is `--features serde`, satisfied by adding an optional serde 1.x
dependency in `Cargo.toml`.

Files changed:

- `Cargo.toml` — optional `serde = { version = "1", features = ["derive"] }`
  dependency; `serde` feature entry.
- `src/tape.rs` — added `pub(crate) fn source_string(&self) -> Option<&'a str>`
  on `TapeEntry`, which returns the source-JSON-lifetime `&'a str` for plain
  (non-escaped) `String` entries, enabling zero-copy deserialization.
- `src/de.rs` — new file containing the full serde integration:
  - `Error(String)` implementing `serde::de::Error`.
  - `impl<'de> Deserializer<'de> for TapeRef<'de, 'de>` — full dispatch over
    every `deserialize_*` method.
  - `TapeSeqAccess<'de>` (wraps `TapeArrayIter`), `TapeMapAccess<'de>` (wraps
    `TapeObjectIter`), and `KeyDeserializer<'de>` for borrowed object keys.
  - `UnitVariantAccess` / `UnitOnly` for string-valued unit enum variants.
  - `TapeEnumAccess` / `impl VariantAccess for TapeRef` for
    `{"Variant": payload}` style newtype/struct/tuple enum variants.
  - `pub fn from_taperef<'de, T: Deserialize<'de>>(r: TapeRef<'de, 'de>) ->
    Result<T, Error>` as the public entry point.
- `src/lib.rs` — `#[cfg(feature = "serde")] pub mod de;` and re-export of
  `from_taperef`.

### Design decisions

**Lifetime unification** — the `Deserializer` impl uses `TapeRef<'de, 'de>`
(both the tape-borrow and source-JSON lifetimes collapsed to `'de`).  This is
the common case: the tape and its source string both outlive the deserialization
scope.  `from_taperef` enforces this through its own signature.

**Zero-copy strings** — plain `String` tape entries (no escape sequences) are
deserialized via `visit_borrowed_str`, borrowing directly from the source JSON.
Escaped strings (heap-allocated `Box<str>`) go through `visit_str` and are
copied into the target type.

**Enum variants** — two strategies are used: a bare JSON string `"Foo"` maps to
a unit variant (the `TapeRef::deserialize_identifier` path); a single-key
object `{"Foo": value}` maps to any variant kind (newtype, struct, tuple) via
`TapeEnumAccess` + `impl VariantAccess for TapeRef`.

**Key deserializer** — `TapeObjectIter` yields `(&'t str, TapeRef)` pairs.
When both lifetimes are `'de`, the key is already `&'de str` and
`KeyDeserializer` passes it zero-copy to the visitor via `visit_borrowed_str`.

**`forward_to_deserialize_any!`** — used in `KeyDeserializer` to delegate every
unneeded `deserialize_*` method to `deserialize_any`, keeping boilerplate
minimal.

### Results

28 unit tests + 9 doc-tests (including the new `de::from_taperef` doctest) pass.

**Commit**: `9cad231` feat: add serde feature with from_taperef Deserializer for TapeRef

---

## Session 20 — parse_with auto-dispatches to asm dyn for classify_zmm

### What was done

Replaced the eleven `TapeWriter`-specific `extern "C"` trampolines
(`tw_null`, `tw_bool_val`, …) with a generic mechanism that works for any
`JsonWriter`:

- **`WriterForZmm` internal bridge trait** (`pub(crate)`) — exposes every
  `JsonWriter` method via raw `(*const u8, usize)` pairs, hiding the source
  lifetime `'a`. A blanket `impl<'a, W: JsonWriter<'a>> WriterForZmm for W`
  provides the implementation, using `std::mem::transmute` to re-attach the
  correct `'a` lifetime to the string slice before calling the concrete writer
  method.  This is the same unsafety pattern the old single-type trampolines
  used, generalised over `W`.

- **Generic trampolines** `zw_null::<W>`, `zw_string::<W>`, … — `unsafe extern
  "C"` free functions monomorphised per writer type.  `build_zmm_vtab::<W>()
  -> ZmmVtab` assembles them into a `ZmmVtab` on the stack.

- **`parse_with` fast path** — when `classify == classify_zmm`, AVX-512BW is
  present, the source starts with `{` or `[` (object / array), and the source
  contains no backslash, `parse_json_zmm_dyn` is called with a
  `W`-monomorphised vtable.  All other inputs fall back to the Rust path.

- **`parse_to_tape_zmm_dyn` simplified** — now calls `build_zmm_vtab::<
  TapeWriter<'a>>()` instead of inlining the deleted `tw_*` trampolines.

### Design decisions

**Guard conditions for the fast path.**  During testing, two pre-existing
limitations of `parse_json_zmm_dyn` were uncovered by the new routing:

1. The dyn asm crashes (SIGSEGV) on any input containing a backslash — it
   calls the `escaped_string` / `escaped_key` vtab slots but does not
   implement the unescape logic the Rust path provides.
2. The dyn asm returns `false` (parse failure) for top-level JSON strings and
   for bare scalars at the root (`"hello"` → None, but `null` / numbers / `{}`
   / `[]` work).

These limitations were masked before because the test suite used
`parse_to_tape(src, classify_zmm)` as the _reference_ oracle (Rust path), and
the fast path had not yet been wired up.  Rather than fixing the dyn asm (out
of scope), the fast path guards against both conditions:

```rust
&& !src.contains('\\')
&& src.bytes().find(|&b| !b" \t\r\n".contains(&b))
       .map_or(false, |b| b == b'{' || b == b'[')
```
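
Extracted as a standalone predicate for illustration (hypothetical helper name, not part of the crate's API):

```rust
// True when the input is safe for the parse_json_zmm_dyn fast path:
// no backslash anywhere, and the first non-whitespace byte opens an
// object or array.
fn zmm_dyn_fast_path_ok(src: &str) -> bool {
    !src.contains('\\')
        && src
            .bytes()
            .find(|&b| !b" \t\r\n".contains(&b))
            .map_or(false, |b| b == b'{' || b == b'[')
}
```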

**Generic trampolines via `WriterForZmm`.**  The original trampolines cast
`data` to `*mut TapeWriter<'static>`, relying on lifetime erasure.  To
generalise, the bridge trait's methods reconstruct `&'a str` from raw pointers
using `transmute`, where `'a` is the concrete lifetime from the
`impl<'a, W: JsonWriter<'a>> WriterForZmm for W` monomorphisation.  This is
sound for the same reason the original trampolines were sound: the assembly
call is synchronous and `src` outlives it.

### Results

28 unit tests + 9 doc-tests pass (same as before; the new routing is
transparent to all existing tests).

**Commit**: `0c5c260` feat: parse_with auto-dispatches to asm dyn when classify_zmm is used

---

## Session 21 — Drop ClassifyFn; four clean public entry points

### What was done

Removed the `ClassifyFn` type alias, `choose_classifier()`, and all `classify`
parameters from the public API in response to user feedback: "Lets drop
classifyfn and make the rust version always use the SWAR version. Have only
four entrypoints to lib.rs: `parse_with()`, `parse_to_tape()`, `unsafe
parse_with_zmm()`, `unsafe parse_to_tape_zmm()`. It is up to the user to make
sure that the CPU supports avx512bw."

Files changed: `src/lib.rs`, `src/tape.rs`, `src/json_ref.rs`, `src/de.rs`,
`benches/parse.rs`, `examples/perf_zmm_tape.rs`, `README.md`.

### Design decisions

**Four entry points only.** The new surface area is:

| Function | Safety | Classifier |
|---|---|---|
| `parse_to_tape(src)` | safe | SWAR (u64) |
| `parse_with(src, writer)` | safe | SWAR (u64) |
| `unsafe parse_to_tape_zmm(src, cap)` | unsafe | AVX-512BW asm |
| `unsafe parse_with_zmm(src, writer)` | unsafe | AVX-512BW asm vtable |

**Rust path always uses SWAR.** `parse_json_impl` had its `F: Fn(&[u8]) ->
ByteState` generic parameter removed; it now calls `classify_u64` directly.
This eliminates the abstraction cost of a function-pointer generic and the
`choose_classifier` CPUID dance for the common case.

**`unsafe` for AVX-512BW variants.** Rather than asserting at runtime (old
`parse_to_tape_zmm_tape` panicked if the CPU lacked AVX-512BW), the two asm
entry points are `unsafe fn`.  The assertion is removed; callers declare with
`unsafe` that they have verified CPU support.  This aligns with Rust's
philosophy and avoids hidden panics in libraries.

**`parse_to_tape_zmm_dyn` removed.** Its functionality is now exposed through
`parse_with_zmm`, which accepts any `JsonWriter<'a>`.  The vtable-dispatch asm
path is accessible without requiring a public `TapeWriter` type.

**`classify_ymm` / `classify_zmm` retained as `#[cfg(test)]`.** The
`classifier_agreement` unit test exercises all three classifiers against each
other to verify correctness.  Moving these to `#[cfg(test)]` suppresses dead
code warnings while keeping the coverage.

**`parse_with` no longer auto-dispatches to asm.** The previous session added
logic to call `parse_json_zmm_dyn` automatically from `parse_with` when a
`classify_zmm` argument was detected.  That heuristic is now gone: `parse_with`
is purely the Rust SWAR path; the user explicitly calls `parse_with_zmm` for
the asm path.

### Results

28 unit tests + 7 doc-tests pass (2 doc-tests now ignored due to platform/feature
guards, net -2 from the 9 previously; no regressions).

**Commit**: `c9a266b` refactor: drop ClassifyFn; four clean entry points with unsafe zmm variants

---

## Session 22 — SAX writer, bench rename, remove classify_*

### Remove classify_ymm / classify_zmm / ByteStateConstants

`classify_ymm`, `classify_zmm`, `ByteStateConstants`, `ZMM_CONSTANTS`, and the
`classifier_agreement` unit test were removed entirely from `src/lib.rs`.  Even
`#[cfg(test)]` guards were dropped — the classifier logic is embedded inside the
asm trampolines and needs no Rust-level exposure.

27 unit tests pass (the ignored zmm tests that need the `avx512bw` feature are
still present but not compiled on non-AVX-512 CI machines).

**Commit**: `6484322` refactor: remove classify_ymm, classify_zmm, ByteStateConstants, and classifier_agreement test

### LenSumWriter + bench rename (asmjson/sax, asmjson/dom)

Added a `LenSumWriter` struct to `benches/parse.rs` that implements
`JsonWriter<'src, Output = usize>` and accumulates the total byte length of all
string and key values encountered.  This gives a meaningful SAX-style workload
with no tape allocation.

Renamed benchmark slots in all three groups (string_array, string_object, mixed):

- `asmjson/zmm` → **`asmjson/sax`** — calls `parse_with_zmm(&data, LenSumWriter::new())`, single-pass, no heap allocation for the tape.
- `asmjson/zmm_tape` → **`asmjson/dom`** — calls `parse_to_tape_zmm(&data, None)` then traverses the tape with `tape_sum_lens`.
- `asmjson/u64` — unchanged (safe SWAR path, builds tape).

### Design decisions

The SAX path (`parse_with_zmm` + `LenSumWriter`) avoids the tape allocation and
the subsequent linear scan.  The benchmark therefore measures the parser's
throughput in isolation.  The DOM path (`parse_to_tape_zmm`) keeps the old
comparison point.

`LenSumWriter::finish` returns `Some(self.total)` so the result can be passed to
`black_box`, preventing the compiler from eliding the computation.
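
In miniature (hedged: `MiniWriter` is a local stand-in for the crate's `JsonWriter` trait, which has more methods than shown):

```rust
// Minimal stand-in for the JsonWriter trait: only the methods the
// length-sum workload needs.
trait MiniWriter {
    type Output;
    fn string(&mut self, s: &str);
    fn key(&mut self, k: &str);
    fn finish(self) -> Option<Self::Output>;
}

#[derive(Default)]
struct LenSumWriter {
    total: usize,
}

impl MiniWriter for LenSumWriter {
    type Output = usize;
    fn string(&mut self, s: &str) { self.total += s.len(); }
    fn key(&mut self, k: &str) { self.total += k.len(); }
    // Returning Some(total) lets the bench feed the result to black_box.
    fn finish(self) -> Option<usize> { Some(self.total) }
}
```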

### Results

| Parser | string_array | string_object | mixed |
|---|---|---|---|
| asmjson/sax | 10.78 GiB/s | 8.29 GiB/s | 1.17 GiB/s |
| asmjson/dom | 10.93 GiB/s | 6.94 GiB/s | 897 MiB/s |
| asmjson/u64 | 7.02 GiB/s | 4.91 GiB/s | 607 MiB/s |
| sonic-rs | 6.92 GiB/s | 4.06 GiB/s | 478 MiB/s |
| serde_json | 2.41 GiB/s | 534 MiB/s | 78 MiB/s |
| simd-json | 1.91 GiB/s | 1.19 GiB/s | 174 MiB/s |

`asmjson/sax` wins on string_object and mixed (no tape-write overhead), while
`asmjson/dom` edges ahead on string_array (the extra tape scan is cheap relative
to the dominance of string-data throughput).

**Commit**: `855c37c` bench: add LenSumWriter, rename zmm→sax, zmm_tape→dom; update conversation log

---

## Session 23 — Module restructure: dom / sax

### What was done

Reorganised the crate's public module layout:

- **`src/tape.rs` → `src/dom/mod.rs`** — the flat-tape types (`Tape`,
  `TapeEntry`, `TapeEntryKind`, `TapeRef`, `TapeWriter`, iterators) now live
  under the `dom` module, reflecting that they implement a DOM (document object
  model) representation.

- **`src/json_ref.rs` → `src/dom/json_ref.rs`** — the `JsonRef` trait moved
  into `dom` as a submodule.  Its import of `tape::` became `super::` (parent
  module).

- **`src/sax.rs` (new)** — the `JsonWriter` trait was renamed to `Sax` and
  extracted to its own module.  All internal references in `lib.rs` were
  updated from `JsonWriter` to `Sax`.  The `TapeWriter` implementation and
  the `WriterForZmm` blanket impl were updated accordingly.

- **`src/lib.rs`** — module declarations updated (`tape` → `dom`, new `sax`),
  re-exports updated (`pub use dom::…`, `pub use dom::json_ref::JsonRef`,
  `pub use sax::Sax`), `JsonWriter` trait definition removed.

- **`src/de.rs`** — `use crate::tape::` → `use crate::dom::`.

- **`benches/parse.rs`** — `use asmjson::JsonWriter` → `use asmjson::sax::Sax`,
  `impl JsonWriter for LenSumWriter` → `impl Sax for LenSumWriter`.

### Design decisions

Naming `Sax` aligns with the common use of "SAX" (Simple API for XML) to
describe event-driven, streaming parsers — the `Sax` trait is called once
per token with no tree built.  Grouping the DOM types under `dom` makes the
complementary structure explicit: `asmjson::dom::Tape` for tree-shaped access,
`asmjson::sax::Sax` for streaming.

The `WriterForZmm` private bridge trait updates were purely mechanical.

### Results

All tests pass (27 unit + doc-tests).  No regressions.  One pre-existing
`dead_code` warning on `source_string` remains unchanged.

**Commit**: `2640796` refactor: rename tape→dom, json_ref into dom, JsonWriter→Sax in sax module

---

## Session 24 — Rename Tape→Dom, TapeEntry→DomEntry

### What was done

All occurrences of `Tape` (the flat token array struct) were renamed to `Dom`,
and all occurrences of `TapeEntry` (the 16-byte token struct) were renamed to
`DomEntry` across the entire codebase.

Files changed: `src/dom/mod.rs`, `src/lib.rs`, `src/dom/json_ref.rs`,
`src/sax.rs`, `benches/parse.rs`.

Unchanged names: `TapeRef`, `TapeWriter` (private), `TapeArrayIter`,
`TapeObjectIter`, `TapeEntryKind`.

### Results

All 27 unit tests and 7 doc-tests pass.  No regressions.

**Commit**: `f386977` refactor: rename Tape→Dom, TapeEntry→DomEntry

---

## Session 25 — Complete Dom* rename + asm module rename

### What was done

Completed the rename of all remaining `Tape*` identifiers to `Dom*` and renamed
the two x86-64 assembly files to use `_sax` / `_dom` suffixes to match the
module naming established in sessions 23–24.

Files touched: `src/lib.rs`, `src/dom/mod.rs`, `src/dom/json_ref.rs`,
`src/de.rs`, `src/sax.rs`, `benches/parse.rs`, `examples/perf_zmm_tape.rs`
(→ renamed `examples/perf_zmm_dom.rs`), `README.md`, `build.rs`,
`asm/x86_64/parse_json_zmm_dyn.S` (→ `parse_json_zmm_sax.S`),
`asm/x86_64/parse_json_zmm_tape.S` (→ `parse_json_zmm_dom.S`).

Full rename table:

| Old name | New name |
|---|---|
| `TapeRef` | `DomRef` |
| `TapeArrayIter` | `DomArrayIter` |
| `TapeObjectIter` | `DomObjectIter` |
| `TapeEntryKind` | `DomEntryKind` |
| `TapeWriter` | `DomWriter` |
| `tape_skip` | `dom_skip` |
| `tape_take_box_str` | `dom_take_box_str` |
| `parse_json_zmm_dyn` | `parse_json_zmm_sax` |
| `parse_json_zmm_tape` | `parse_json_zmm_dom` |
| `parse_to_tape` | `parse_to_dom` |
| `parse_to_tape_zmm` | `parse_to_dom_zmm` |
| `tape_sum_lens` (bench) | `dom_sum_lens` |
| `parse_json_zmm_dyn.S` | `parse_json_zmm_sax.S` |
| `parse_json_zmm_tape.S` | `parse_json_zmm_dom.S` |
| `examples/perf_zmm_tape.rs` | `examples/perf_zmm_dom.rs` |

### Design decisions

The `_sax` suffix is used for the AVX-512 path that dispatches through a
trait-object vtable (the SAX/event-driven interface), and `_dom` for the path
that writes directly into a flat DOM tape.  This mirrors the Rust-side
`Sax` trait / `Dom` struct split introduced in sessions 23–24.

README code examples are included as doc-tests via `include_str!`, so the
README also needed updating — caught by a second failing `cargo test` pass.

### Results

27 unit tests + 7 doc-tests, 0 failures after all renames.

### Commit

`98870e1` refactor: rename remaining Tape*→Dom*, asm modules to _sax/_dom suffixes

---

## Session — DOM and SAX example files

### What was done

Added two standalone examples that demonstrate both parse modes and both
x86-64 assembly vs portable entry points:

- `examples/dom_example.rs` — builds a [`Dom`] tape and navigates it with
  the [`JsonRef`] cursor API.  Accepts an optional `zmm` argument to switch
  from `parse_to_dom` (SWAR) to `parse_to_dom_zmm` (AVX-512BW assembly).
- `examples/sax_example.rs` — implements the [`Sax`] trait (`Counter`) and
  drives it through `parse_with` or `parse_with_zmm`.  The SAX example notes
  that `parse_with_zmm` does not process backslash escape sequences.

### Design decisions

- A runtime `-- zmm` argument was used instead of a compile-time flag so
  either path can be exercised without rebuilding; the non-x86_64 fallback
  prints an informative error and exits.
- The SAX example uses escape-free JSON (`"rust"`, `"json"`, etc.) so it
  works correctly under both `parse_with` and `parse_with_zmm`.
- Each `inspect` / `run_*` function in the DOM example is duplicated for
  clarity; sharing via a generic closure would obscure which API is in use.

### Results

Both examples compile without errors (`cargo build`).  Output verified by
running `cargo run --example dom_example` and `cargo run --example sax_example`,
and again with `-- zmm` on this AVX-512BW-capable host; counts and field
values matched expectations for both SWAR and assembly paths.

### Commit

`65aff4e` fix: two bugs in SAX assembly escape path

---

## Session N — Fix two bugs in the assembly SAX escape path

### Stale re-save of `rcx` in `.Lsc_emit_escaped`

The assembly SAX string-value emission path (`.Lsc_emit_escaped`) saved
`rcx` (the chunk offset) into `LOC_COFF` before calling `unescape_str`, but
then unconditionally re-saved `rcx` *again* after loading the decoded String
fields:

```asm
    mov     qword ptr [rbp + LOC_COFF], rcx  ; correct save
    ...
    call    unescape_str                      ; clobbers rcx (caller-saved!)
    mov     rsi, qword ptr [r8]
    mov     rdx, qword ptr [r8 + 16]
    mov     qword ptr [rbp + LOC_COFF], rcx  ; BUG: rcx is garbage here
```

`rcx` is caller-saved in the System V AMD64 ABI, so `unescape_str` was free
to clobber it.  The second save overwrote the correct chunk-offset with
whatever value `rcx` held on return from `unescape_str`.  The fix was to
delete the redundant second save.  The corresponding key path
(`.Lke_emit_escaped`) did not have this defect.

### Wrong `String` field offsets — `{cap, ptr, len}` vs `{ptr, cap, len}`

After removing the stale save, the test still crashed with SIGSEGV.  A
runtime layout probe added to `parse_with_zmm` revealed:

```
String layout probe: w0=0x10 (cap=16)  w1=0x79a1d0000ce0 (ptr)  w2=0xb (len=11)
```

The assembly assumed `String = {ptr@0, cap@8, len@16}`, but the Rust
compiler in use lays out `Vec<u8>` as `{cap@0, ptr@8, len@16}`.  The SAX
assembly was reading `[r8]` (cap) instead of `[r8 + 8]` (ptr) for the
decoded string pointer, producing a garbage pointer value (e.g. `0x8`
when cap=8) that caused the segfault.  The DOM assembly was unaffected
because it delegates String field access to the Rust function
`dom_take_box_str`.

The fix updates both `.Lsc_emit_escaped` (string values) and
`.Lke_emit_escaped` (key values) to read the pointer from `[r8 + 8]`:

```asm
    mov     rsi, qword ptr [r8 + 8]    // box_ptr  (ptr at offset 8)
    mov     rdx, qword ptr [r8 + 16]   // box_len  (len at offset 16)
```

A comment in the assembly documents the verified field layout.

### Results

All 29 unit tests pass (`cargo test --lib -- --test-threads=1`).  The
`zmm_sax_escaped_strings` test, which verified `parse_with_zmm` against the
Rust reference parser on inputs with `\n`, `\t`, `\r`, `\"`, `\uXXXX`,
escaped keys, and escape sequences spanning chunk boundaries, now passes
cleanly.

### Commit

`f6bd9f4` fix: two bugs in SAX assembly escape path

---

## Session N+1 — Eliminate String layout assumptions by changing escaped_string/escaped_key to &str

### Motivation

The previous fix for the SAX assembly SIGSEGV revealed the root cause:
the assembly was reading `String` fields at hard-coded offsets assuming
`{ptr@0, cap@8, len@16}`, but the Rust compiler in use laid out `Vec<u8>`
as `{cap@0, ptr@8, len@16}`.  Rather than just updating the offsets and
hoping the layout stays stable, the approach was changed to eliminate the
need to read `String` fields from assembly entirely.

### Design decision

The `Sax` trait methods `escaped_string` and `escaped_key` previously took
`Box<str>` so the writer could own the decoded text.  The assembly
trampolines were responsible for constructing the `Box<str>` by reading
`ptr` and `len` from the `unescape_buf` `String` struct.

Changing the signature to `&str` (matching `string` and `key`) means the
trampolines need only cast the raw `(ptr, len)` they already receive to a
`&str` — the same one-liner pattern used for all other string methods.
No `String` field access in assembly is required at all.

The `&str` is a short-lived borrow of `unescape_buf`'s heap buffer.  It
is valid for the duration of the vtable call.  Writers that only inspect
the value (benchmarks, examples) incur zero allocation.  Writers that
need ownership (e.g. `DomWriter`) copy with `Box::from(s)` internally.
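
The trampoline one-liner amounts to (a sketch; `raw_to_str` is a hypothetical name for the pattern, not a function in the crate):

```rust
// Rebuild a &str from the raw (ptr, len) pair handed over by the
// assembly.  Sound because the vtable call is synchronous: the
// unescape_buf heap buffer outlives the borrow, and the unescaper
// only ever produces valid UTF-8.
unsafe fn raw_to_str<'a>(ptr: *const u8, len: usize) -> &'a str {
    std::str::from_utf8_unchecked(std::slice::from_raw_parts(ptr, len))
}
```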

### Changes

- `src/sax.rs`: `Sax` trait — `Box<str>` → `&str`.
- `src/lib.rs`: `WriterForZmm` trait and blanket impl updated; trampolines
  for `zw_escaped_string` / `zw_escaped_key` reduced to one-liners;
  SWAR `parse_json_impl` call sites changed from `.as_str().into()` to
  `.as_str()`; test `EventLog` signatures updated.
- `src/dom/mod.rs`: `DomWriter::escaped_string` / `escaped_key` now take
  `&str` and do `Box::from(s)` to preserve internal `Box<str>` ownership.
- `examples/sax_example.rs`, `benches/parse.rs`: method signatures updated.
- `asm/x86_64/parse_json_zmm_sax.S`: comment updated (`box_ptr` → `s_ptr`).

### Results

All 29 unit tests pass.  Benchmarks compile.  The assembly no longer
contains any `String` field offsets.

### Commit

`612de06` refactor: change escaped_string/escaped_key to take &str

## Session — Example CPUID auto-dispatch

### What was done

Refactored `examples/dom_example.rs` and `examples/sax_example.rs` to
auto-select the AVX-512BW assembly path at runtime using
`is_x86_feature_detected!("avx512bw")` instead of requiring a `-- zmm`
command-line flag.

- Removed the two-function table and `-- zmm` usage from both doc comments.
- `dom_example`: merged `inspect` / `inspect_zmm` into a single
  `inspect(label: &str, tape: Dom)` function; `main` does the CPUID check
  and calls the appropriate parser, then passes the result to `inspect`.
  Added `Dom` to the `use` imports.
- `sax_example`: replaced `run_portable` and `run_zmm` with a single
  `report(label: &str, counts: Counter)` function; `main` does the CPUID
  check and calls the appropriate parser, then passes counts to `report`.
  Removed the `std::env::args` CLI argument parsing entirely.

### Design decisions

Mirrored the same CPUID-dispatch pattern in both examples for consistency.
The `#[cfg(target_arch = "x86_64")]` guard around the
`is_x86_feature_detected!(...)` check ensures the examples compile and run
correctly on non-x86_64 targets (falling back to the portable path).

### Results

Both examples compile without warnings and produce correct output.  On an
AVX-512BW machine the assembly path is selected automatically.

### Commit

`601c6ee` examples: CPUID auto-dispatch; remove -- zmm CLI flag

## Session — Parallel mmap JSON-lines example

### What was done

Added `examples/mmap_parallel.rs`: a new example that memory-maps a JSON
Lines file, partitions it into ~1 MiB chunks at `\n` boundaries, then
parses every chunk in parallel using Rayon.  CPUID auto-selects the
AVX-512BW assembly path when available.

Also added `memmap2 = "0.9"` and `rayon = "1"` to `[dev-dependencies]` in
`Cargo.toml`.

### Design decisions

**Why iterate lines within each chunk?**  `parse_with` / `parse_with_zmm`
each expect a single well-formed JSON value.  A raw ~1 MiB slice of a JSON
Lines file contains hundreds of individual JSON objects separated by `\n`,
not one big document.  The solution is to partition the mmap into
newline-aligned chunks for Rayon — giving each thread a contiguous region
to work with — and then iterate over the individual lines within each chunk
before calling the parser.

**Chunk boundary alignment.**  The `split_at_newlines` function scans
forward from the nominal chunk end to the next `\n`, ensuring no line
is split across chunks.  Lines whose trailing `\n` falls past the end
of file are still handled correctly.

**`StringCounter` accumulation.**  Each Rayon task returns a
`StringCounter`; `reduce` combines them with simple integer addition,
avoiding any shared state or locking.

### Results

On a 12.7 MB test file (200 k lines) split into 13 chunks:

```
keys   found : 600000   (3 keys/line × 200 000 lines)
strings found: 200000   (1 string value/line × 200 000 lines)
```

### Commit

`6a055ea` examples: add mmap_parallel JSON-lines parallel SAX counter

## Session — mmap_parallel self-generates its test file

### What was done

Reworked `examples/mmap_parallel.rs` so that `main` first creates
`/tmp/file.jsonl` (1 GiB, ~10.7 million lines) before mapping and parsing
it, removing the CLI path argument entirely.

Each generated line is exactly 100 bytes including the trailing `\n`:

```
{"identifier":"user000000000000","description":"item000000000000","subcategory":"type000000000000"}
```

Keys: "identifier" (10), "description" (11), "subcategory" (11) — all ≥ 10 chars.
Values: 16-char strings (4-char prefix + 12-digit line index) — all ≥ 10 chars.
1 GiB ÷ 100 bytes/line = 10 737 418 lines (1 073 741 800 bytes, 24 bytes
short of an exact 2³⁰ = 1 073 741 824).

### Design decisions

Used a `BufWriter` with a 4 MiB buffer for fast sequential writes.  The
format string uses escaped braces (`{{`/`}}`) in a regular string literal
rather than a raw string, avoiding raw-string delimiter collisions.

`Instant` timing is printed separately for file generation and for parsing.

### Results

On this machine:

- File generation: 1.15 s
- Parse (1024 × ~1 MiB chunks, Rayon + AVX-512BW): 36 ms
- keys found: 32 212 254  (10 737 418 × 3)
- strings found: 32 212 254  (10 737 418 × 3)

### Commit

`0f70426` examples: mmap_parallel generates its own 1 GiB test file

## Session — Push unescape responsibility to Sax implementors

### What was done

Changed the contract of `Sax::escaped_string` and `Sax::escaped_key`: they
now receive the **raw** (still-escaped) `&str` slice directly from the
source JSON rather than a pre-decoded string.  Callers that need the decoded
text call `unescape_str` themselves.

Changes across the codebase:

- `src/sax.rs`: docstrings updated to document the raw-string contract.
- `src/lib.rs`:
  - `parse_json_impl` — removed `unescape_buf: &mut String` parameter;
    passes `raw` / `current_key_raw` directly to `writer.escaped_string` /
    `writer.escaped_key`.
  - `parse_with` — no longer creates or passes an `unescape_buf`.
  - `parse_with_zmm` — no longer creates or passes an `unescape_buf`;
    `parse_json_zmm_sax` extern signature loses its last parameter.
- `src/dom/mod.rs`: `DomWriter::escaped_string` and `escaped_key` now
  allocate a local `String`, call `crate::unescape_str`, and convert to
  `Box<str>` internally.  They are the only implementations that need
  decoded text.
- `asm/x86_64/parse_json_zmm_sax.S`:
  - Function signature comment: removed `unescape_buf / r9` argument.
  - Prologue: removed `mov [rbp+LOC_UNESCAPE], r9` save.
  - `.Lsc_emit_escaped`: removed `call unescape_str` and String field
    reads; now calls `VTAB_ESCAPED_STRING(rbx, rsi, rdx)` directly with
    the raw ptr/len that were already in rsi/rdx.
  - `.Lke_emit_escaped`: same — calls `VTAB_ESCAPED_KEY` directly with
    `LOC_KEY_PTR` / `LOC_KEY_LEN`.
  - Removed `LOC_UNESCAPE` .equ constant and its stack-layout comment.
- `asm/x86_64/parse_json_zmm_dom.S`: **unchanged** — the DOM assembly
  still calls `unescape_str` + `dom_take_box_str` internally and never
  goes through the `Sax` trait for escaped entries.

### Design decisions

The previous design coupled the escape-decoding step to the parser
internals: `parse_json_impl` always allocated/cleared a `String` and ran
`unescape_str` before calling the trait method, even for implementations
that only count strings and discard the content.  Moving the call into
`DomWriter` (the only implementation that needs decoded text from the SWAR
path) eliminates that allocation for all other `Sax` implementations and
simplifies the assembly SAX path by ~12 instructions per escape event.

### Results

All 29 unit tests pass.

### Commit

`9c2d164` refactor: escaped_string/escaped_key receive raw source &str; DomWriter unescapes internally

## Session: remove unescape_buf from parse_json_zmm_dom

### What was done

Removed the `unescape_buf: *mut String` parameter from `parse_json_zmm_dom`,
the hand-written AVX-512BW DOM assembly parser.  Previously the caller
(`parse_to_dom_zmm`) had to allocate a `String`, pass its raw pointer in as
the 7th argument, and the assembly would call `unescape_str` (to fill it) and
then `dom_take_box_str` (to box the result) at each escaped-string/key site.

### Design decisions

The two-call sequence (`unescape_str` + `dom_take_box_str`) was collapsed into
a single new function `dom_unescape_to_box_str(raw_ptr, raw_len, out_ptr,
out_len)` that allocates its own `String` internally, decodes the escapes, and
writes the `Box<str>` pointer and length to the caller-supplied output
pointers.  This is the same pattern that stabilised the SAX path in the
previous session — the caller no longer owns an escape buffer.

`dom_take_box_str` was deleted; `unescape_str` is still present and public
(called from `dom_unescape_to_box_str` and directly from
`DomWriter::escaped_string/escaped_key`).

On the assembly side `LOC_UNESCAPE` (stack slot `[rbp-48]`) is removed and
the two external calls at `.Lsc_emit_escaped` and `.Lke_emit_escaped` are each
replaced by a single `call dom_unescape_to_box_str`.  The `has_escapes_out`
and `tape_cap` stack arguments shift from `[rbp+24]/[rbp+32]` to
`[rbp+16]/[rbp+24]` following the removal of the 7th argument.

### Results

29/29 tests green.  No benchmark regression expected (same allocation pattern,
one fewer external call per escaped token).

### Commit

`da9e8aa` refactor: remove unescape_buf from parse_json_zmm_dom; add dom_unescape_to_box_str

## Session: recompute DRAM bandwidth with ZMM loads

### What was done

Added `examples/mem_bw_zmm.rs` — a standalone memory-bandwidth benchmark that
allocates a 2 GiB, 64-byte aligned buffer (via `std::alloc::alloc`), touches
every page to force physical backing, then runs 8 sequential passes with two
AVX-512 strategies and reports best and median GiB/s:

* **zmm**: `vmovdqu64` (temporal) loads via `_mm512_loadu_si512`.
* **zmm-nt**: `vmovntdqa` (non-temporal streaming) loads via
  `_mm512_stream_load_si512`; bypasses the normal read-allocate into cache.

Both strategies OR all loaded vectors into a 512-bit accumulator and store it
to prevent dead-code elimination.

### Design decisions

The previous bandwidth estimate (~45 GiB/s) was measured with scalar reads.
Using ZMM loads gives the prefetcher/memory controller a better chance to
stream at full width, which is more representative of what the AVX-512BW
parser actually exercises.  2 GiB ensures the working set is far larger than
the 64 MB L3 cache so results reflect DRAM, not cache, bandwidth.

Non-temporal loads (`vmovntdqa`) require the address to be aligned to the
vector width — 64 bytes for ZMM; the 64-byte aligned allocation guarantees
this.

### Results

Ryzen 9 9955HX (Zen 5, DDR5 dual-channel):

| Strategy                 | Best      | Median    |
|--------------------------|-----------|-----------|
| zmm temporal (`vmovdqu64`) | 47.5 GiB/s | 47.4 GiB/s |
| zmm-nt (`vmovntdqa`)     | 49.5 GiB/s | 47.9 GiB/s |

The parallel JSON parser (26.6 GiB/s) reaches ~56 % of the ZMM temporal
ceiling, down from the ~59 % figure that was computed against the
now-known-underestimated scalar baseline.

README updated: replaced "45 GiB/s scalar" row with the two ZMM rows and
revised efficiency to ~56 %.

### Commit

`b50d502` example: add mem_bw_zmm — ZMM temporal and NT load bandwidth benchmark

## Session N — serde example timing

### Add timing to serde_example

Added `std::time::Instant` timing around the two main phases in `run()`:

- `parse_to_dom` (or `parse_to_dom_zmm` on AVX-512BW): measures the JSON
  parse + tape-build time.
- `from_taperef`: measures the serde deserialization walk over the tape.

Both durations are printed in milliseconds alongside the existing record
count output.  Example output on Ryzen 9 9955HX with a ~1 MiB input:

```
parse_to_dom_zmm + from_taperef  (AVX-512BW): decoded 8066 records, last id=8065
  parse_to_dom: 1.212 ms  |  from_taperef: 1.755 ms
```

### Design decisions

`Instant::now()` / `elapsed()` is sufficient accuracy for a single-run
example (sub-100 µs jitter on this machine).  No warmup loop or statistics
are needed; users who want repeatable microbenchmarks should use the Criterion
benches.

### Commit

`dd4bbe7` example: add parse_to_dom and from_taperef timing to serde_example

## Session — use dom_parser / sax_parser in examples

### What was done

Updated the three remaining examples (`sax_example.rs`, `dom_example.rs`,
`mmap_parallel.rs`) to use the safe CPUID-dispatching helpers `sax_parser()`
and `dom_parser()` introduced in `3fa487c`.

`SaxParser` was given `#[derive(Copy, Clone)]` so it can be captured by value
in Rayon parallel closures (`mmap_parallel.rs`).

In `mmap_parallel.rs` the two separate `parse_line_into_zmm` /
`parse_line_into_rust` helper functions and the manual `is_x86_feature_detected!`
guard inside `parse_chunk` were removed.  `parse_chunk` now takes a
`SaxParser` argument; `sax_parser()` is called once in `main()` and the
`Copy` value is captured by the Rayon closure.

### Design decisions

Removing the two helper functions reduces the example from ~235 lines to
~175 lines with no loss of functionality.  Passing `SaxParser` by value (copy)
into `parse_chunk` is idiomatic for a trivially-copyable handle; an alternative
would be passing `&SaxParser` and relying on `Sync`, but by-value is cleaner
when `Copy` is available.

`serde_example.rs` was already updated to use `dom_parser()` in `3fa487c` and
required no further changes.

### Results

All 29 library tests pass.  Both `sax_example` and `dom_example` run correctly,
and `mmap_parallel` compiles cleanly with zero warnings.

### Commit

`f79d93d` use dom_parser/sax_parser in examples, add Copy+Clone to SaxParser

## Session — serde_example combined MiB/s + serde_json comparison

### What was done

Reworked `serde_example.rs` to report a single combined MiB/s for the
`parse_to_dom` + `from_taperef` pipeline, and added a `run_serde_json`
function that times `serde_json::from_str::<Vec<Record>>` end-to-end on
the same data for a direct comparison.

The per-step breakdown (parse ms / serde ms) is still printed alongside the
combined figure so both are visible.

### Design decisions

`serde_json` was already a dev-dependency (used in the benchmarks), so no
new dependency was required.  The combined time is computed as
`parse_elapsed + serde_elapsed` using `std::ops::Add` for `Duration`, which
keeps the arithmetic exact and avoids floating-point rounding between the two
measurements.

### Results

On this machine (AVX-512BW):
```
asmjson (AVX-512BW): 8066 records  |  parse: 0.9 ms  serde: 7.5 ms  combined: 8.4 ms  (97 MiB/s)
serde_json         : 8066 records  |  combined: 24.1 ms  (34 MiB/s)
```

### Commit

`16ce530` serde_example: combined MiB/s + serde_json comparison

## Session — Repository moved to atomicincrement org

### Move to atomicincrement GitHub organisation

**What was done** — The repository was transferred from the personal
`andy-thomason` GitHub account to the `atomicincrement` organisation.
All in-repo references to the old URL were updated:

- `Cargo.toml` `repository` field → `https://github.com/atomicincrement/asmjson`
- `README.md` CI badge URL and LICENSE link → `atomicincrement/asmjson`

**Design decisions** — Only canonical GitHub URLs embedded in source were
changed; local filesystem paths (e.g. in this log) were left as-is since they
reflect the machine's directory layout, not the remote.

**Results** — No functional changes; purely metadata/URL updates.

**Commit** — pending

## Session — Fix CI failures (SIGILL + doctest arity)

### Fix AVX-512 SIGILL in tests on GitHub Actions runners

**What was done** — The CI was failing with `SIGILL: illegal instruction`
because the `#[cfg(target_arch = "x86_64")]` zmm test helpers
(`zmm_dom_matches`, `zmm_dom_rejects`, `zmm_sax_matches`) and the
`zmm_dom_overflow_retry` test called AVX-512 assembly unconditionally.
GitHub Actions `ubuntu-latest` runners do not have AVX-512 hardware.
A runtime guard `if !is_x86_feature_detected!("avx512bw") { return; }` was
added at the top of each of the four affected functions so they skip silently
on non-AVX-512 machines instead of crashing with an illegal instruction.

**Design decisions** — Early-return skipping (rather than `#[ignore]`) keeps
the tests in the normal `cargo test` run and makes them self-activating on any
AVX-512-capable machine.  The public API already used `is_x86_feature_detected`
for dispatch; the tests now mirror that pattern.

### Fix doctest arity for parse_to_dom

**What was done** — Two code examples in `README.md` called
`parse_to_dom(src)` with one argument, but the function signature changed to
`parse_to_dom(src, initial_capacity: Option<usize>)` in a prior session.
Both invocations were updated to `parse_to_dom(src, None)`.

**Results** — All 29 unit tests and all 9 (+ 2 ignored) doctests pass locally
and CI is expected to pass on `ubuntu-latest` (no AVX-512, zmm tests skip).

**Commit** — e4e68f4 fix: gate zmm tests on avx512bw; fix doctest arity; bump to 0.2.3

## Session — First draft JOSS paper

### Write paper.md and paper.bib for Journal of Open Source Software

**What was done** — Created a first-draft JOSS submission in
`doc/paper/paper.md` (Pandoc Markdown, JOSS format) and `doc/paper/paper.bib`
(BibLaTeX references), following the JOSS paper guidelines fetched from
joss.readthedocs.io.  The paper (~1200 body words) covers:

- **Summary** — high-level description for a non-specialist audience.
- **Statement of need** — JSON parsing as a data-pipeline bottleneck; gap
  between library performance and hardware capability.
- **State of the field** — comparison with simdjson/Langdale & Lemire,
  simd-json, sonic-rs, serde_json; explains how asmjson differs
  (AVX-512BW, direct threading, DOM-in-assembly).
- **Software design** — two assembly listings (vcmp classify chunk, tzcnt
  whitespace skip), tape DOM design, SWAR fallback, API surface.
- **Research impact statement** — crates.io release; future directions
  (CSV/TSV, compact tape, proc-macro SAX deserialiser).
- **AI usage disclosure** — GitHub Copilot assisted; assembly hand-authored.

**Design decisions** — JOSS requires Pandoc Markdown, not raw LaTeX; the
toolchain converts via ConTeXt to PDF.  Assembly listings are fenced code
blocks (rendered as listings in the PDF).  Word count ~1496 including YAML
front matter, well within the 750–1750 word target.

**Results** — `doc/paper/paper.md` and `doc/paper/paper.bib` committed.

**Commit** — 13e4b59 docs: first draft JOSS paper (paper.md + paper.bib)

## Session — JOSS paper iterations and 0.2.4 release

### Paper pre-flight fixes

**What was done** — Several pre-flight issues in `doc/paper/paper.md` were
identified and corrected:

- ORCID `0000-0001-8240-1614` added to author metadata.
- Typo "instumental" → "instrumental" fixed in AI Usage Disclosure section.
- Missing citations added: `[@serde-json]` on first mention of `serde_json`;
  `[@rayon]` on first mention of Rayon parallel tasks.
- SWAR section heading expanded to "Portable SWAR (SIMD-Within-A-Register) fallback".
- `CONTRIBUTING.md` created with issue reporting, patch submission, assembly
  conventions, and support guidance; a "Contributing" section added to README
  pointing to it (required by JOSS reviewer checklist).
- `doc/paper/JOSS_checklist.md` created capturing the pre-flight status table
  and the official JOSS reviewer checklist items as a reference.

**Design decisions** — The `# References` heading in the paper is a Pandoc
placeholder; no manual entries are needed there. All six `[@key]` citations in
the paper have matching entries in `paper.bib`; `json-rfc` in the bib is
unused but harmless.

**Results** — Prose word count 1501 (JOSS limit 1750). All required sections
and metadata present. All citation keys resolve.

**Commit** — bump to 0.2.4 (b4478be), Cargo.lock (8012cdf)

---

## Session — carry-free eq_byte fix and tokenizer.json tests

### Add tokenizer.json integration tests and fix SWAR false-positive bug

**What was done** — Two new tests were added to `src/lib.rs`:

- `dom_parse_tokenizer_json`: reads the 6 MB
  `Qwen2.5-0.5B/tokenizer.json` file and parses it with `parse_to_dom`,
  asserting that the root value is an object.
- `dom_parse_double_backslash_strings`: uses a helper
  `json_strings_with_double_backslash` that walks the raw JSON source and
  extracts every quoted-string token whose raw bytes contain at least one
  `\\` (double-backslash) escape sequence.  Each extracted token is then
  parsed as a standalone JSON string value; all 790 such tokens must parse
  successfully.

Running the tests revealed that both failed.  The tokenizer.json parse
failed because the `classify_u64` SWAR byte-equality helper had a
false-positive bug; the token-level tests pinpointed the exact inputs
triggering it.

**Design decisions** — The root cause was in the `has_zero_byte` trick
used by `eq_byte`:

```
(v - 0x0101…) & !v & 0x8080…
```

This well-known algorithm has false positives: when byte `i` equals zero
(a genuine match) and byte `i+1` has value `0x01`, the borrow generated
by subtracting `0x01` from `0x00` propagates into position `i+1`, setting
its high bit and causing it to be reported as a match.

Concretely, a `"` byte (0x22) immediately followed by `#` (0x23) triggers
the bug: after XOR with the broadcast 0x22, the quote byte becomes 0x00 (a
genuine match) and the `#` byte becomes 0x01, so the borrow out of the
quote's position flags the `#` as a match as well.  The parser would land
on position 1 (`#`) thinking it was a closing quote, then dispatch the
`_ =>` arm of the StringChars match, which closes the string early and
returns `AfterValue`.  The subsequent backslashes caused an `Error`.

The fix replaces `eq_byte` with a carry-free formula:

```
let x = v ^ broadcast(b);
!(((x & 0x7f7f…) + 0x7f7f…) | x) & 0x8080…
```

For a byte `x_i`:
- If `x_i == 0` (genuine match): `(0 & 0x7f) + 0x7f = 0x7f`; `0x7f | 0 =
  0x7f`; `!0x7f & 0x80 = 0x80`. ✓
- If `x_i == 0x01` (previously problematic): `(1 & 0x7f) + 0x7f = 0x80`;
  `0x80 | 0x01 = 0x81`; `!0x81 & 0x80 = 0`. ✓ No false positive.
- For any other non-zero byte: `(x_i & 0x7f) + 0x7f ≥ 0x80`, so bit 7
  is set; `!(…) & 0x80 = 0`. ✓

The addition `(x & 0x7f7f…) + 0x7f7f…` is carry-free across bytes
because each summand is at most `0x7f` and the maximum sum `0x7f + 0x7f =
0xfe < 0xff`, so no carry ever crosses a byte boundary.

**Results** — All 31 unit tests pass.  Both new integration tests pass.
The full test suite is green.

**Commit** — e77a521 fix: carry-free eq_byte to eliminate false-positive quote detection