succinctly 0.7.0

High-performance succinct data structures for Rust
Documentation
# Parsing Documentation

This directory documents how the succinctly library parses structured data formats using semi-indexing techniques.

## Overview

Traditional parsers materialize data into memory structures (DOM trees, row arrays). Succinctly instead creates **semi-indices**: compact bit vectors that mark structural boundaries, enabling O(1) navigation without full materialization.

| Format  | Document              | Key Technique                         |
|---------|-----------------------|---------------------------------------|
| JSON    | [json.md]json.md    | Interest bits + balanced parentheses  |
| CSV/TSV | [dsv.md]dsv.md      | Marker bits + quote state tracking    |
| YAML    | [yaml.md]yaml.md    | Feasibility analysis (not implemented)|

---

## Semi-Indexing Approach

### Traditional Parsing

```
Input → Full Parse → DOM/Rows → Query
   10-50x memory overhead
```

### Semi-Indexing

```
Input → Bit Index → Lazy Navigation → Query
      ~3-4% memory overhead
```

**Benefits**:
- Memory efficient: 3-4% overhead vs 10-50x
- Fast construction: SIMD-accelerated
- Lazy evaluation: Decode values only when accessed
- Random access: O(1) rank, O(log n) select

---

## Common Techniques

Both parsers share optimization techniques:

| Technique              | JSON        | DSV          | Document                                                      |
|------------------------|-------------|--------------|---------------------------------------------------------------|
| SIMD character detect  | AVX2, NEON  | AVX2, NEON   | [simd.md]../optimizations/simd.md                           |
| Lookup tables          | PFSM tables | -            | [lookup-tables.md]../optimizations/lookup-tables.md         |
| State machine          | 4-state FSM | Quote toggle | [state-machines.md]../optimizations/state-machines.md       |
| Rank/select indices    | BP nav      | Field lookup | [hierarchical-structures.md]../optimizations/hierarchical-structures.md |
| Parallel prefix        | -           | Quote masking| [parallel-prefix.md]../optimizations/parallel-prefix.md     |
| PDEP/PEXT              | -           | toggle64_bmi2| [bit-manipulation.md]../optimizations/bit-manipulation.md   |

---

## Performance Summary

### Parsing Throughput

| Format | Platform    | CPU                       | Throughput   |
|--------|-------------|---------------------------|--------------|
| JSON   | x86_64 PFSM | AMD Ryzen 9 7950X (Zen 4) | ~880 MiB/s   |
| JSON   | ARM NEON    | Apple M1 Max              | ~570 MiB/s   |
| DSV    | x86_64 BMI2 | AMD Ryzen 9 7950X (Zen 4) | 1.3-3.7 GB/s |
| DSV    | ARM NEON    | Apple M1 Max              | ~800 MiB/s   |

### Memory Overhead

| Approach           | Overhead       |
|--------------------|----------------|
| Semi-index         | ~3-4% of input |
| DOM parser (JSON)  | 10-50x input   |
| Materialized (CSV) | 6-40x input    |

---

## Quick Links

- [JSON Parsing]json.md - Semi-index structure, PFSM, SIMD classification
- [DSV Parsing]dsv.md - Quote handling, PDEP carry propagation, lightweight index
- [YAML Analysis]yaml.md - Feasibility study, ambiguity analysis, oracle model

### Optimisation Techniques

- [simd.md]../optimizations/simd.md - AVX2, NEON character detection
- [lookup-tables.md]../optimizations/lookup-tables.md - PFSM state tables
- [state-machines.md]../optimizations/state-machines.md - Parser state machines
- [parallel-prefix.md]../optimizations/parallel-prefix.md - Quote state with prefix XOR
- [bit-manipulation.md]../optimizations/bit-manipulation.md - PDEP for 10x faster quotes
- [hierarchical-structures.md]../optimizations/hierarchical-structures.md - Rank/select indices
- [cache-memory.md]../optimizations/cache-memory.md - Why lightweight index wins

---

## References

- Langdale, G. & Lemire, D. "Parsing Gigabytes of JSON per Second" (2019)
- simdjson: https://github.com/simdjson/simdjson
- hw-json/hw-dsv (Haskell): https://github.com/haskell-works