nested-text 0.1.0

A fully spec-compliant NestedText v3.8 parser and serializer
# CLAUDE.md — nested-text project guide

## What this is

nested-text (crate name `nested-text`, import as `nested_text`) is a Rust crate implementing the NestedText v3.8 data format. It provides parsing (`loads`/`load`), serialization (`dumps`/`dump`), and serde integration (`from_str`/`to_string`). It passes 100% of the official NestedText test suite. Licensed MIT OR Apache-2.0.

## Architecture

The design mirrors the Python reference implementation (https://github.com/KenKundert/nestedtext):

```
src/
  lib.rs       — public API re-exports
  lexer.rs     — line tokenizer (classifies lines into 7 types)
  parser.rs    — recursive descent parser (loads/load)
  inline.rs    — separate recursive descent for [...] and {...} syntax
  dumper.rs    — serializer to NestedText (dumps/dump)
  de.rs        — serde Deserializer (feature-gated)
  ser.rs       — serde Serializer (feature-gated)
  error.rs     — Error type with line/col/message
  value.rs     — Value enum (String, List, Dict)
```

### Data flow

**Parsing:** input string → `Lexer::new()` (classifies all lines upfront) → `Parser::read_value()` (recursive descent consuming lines) → `Value`
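
As a rough illustration of the upfront classification step, the sketch below assigns each line an indentation depth and a kind. The `LineKind` variants, `has_tag`, and `classify` are hypothetical names chosen for this example; the crate's lexer may divide its 7 line types differently and handles more edge cases.

```rust
/// Hypothetical line kinds for this sketch (not the crate's actual set).
#[allow(dead_code)]
#[derive(Debug, PartialEq)]
enum LineKind {
    Blank,
    Comment,
    ListItem,
    StringItem,
    KeyItem,
    Inline,
    DictItem,
    Unrecognized,
}

/// A tag only counts when it is the first character and is followed by
/// a space or the end of the line.
fn has_tag(content: &str, tag: char) -> bool {
    let mut chars = content.chars();
    chars.next() == Some(tag) && matches!(chars.next(), None | Some(' '))
}

/// Returns (indentation depth in ASCII spaces, line kind).
fn classify(line: &str) -> (usize, LineKind) {
    let content = line.trim_start_matches(' ');
    let depth = line.len() - content.len();
    let kind = if content.trim().is_empty() {
        LineKind::Blank
    } else if content.starts_with('#') {
        LineKind::Comment
    } else if has_tag(content, '-') {
        LineKind::ListItem
    } else if has_tag(content, '>') {
        LineKind::StringItem
    } else if has_tag(content, ':') {
        LineKind::KeyItem
    } else if content.starts_with('[') || content.starts_with('{') {
        LineKind::Inline
    } else if content.contains(": ") || content.ends_with(':') {
        LineKind::DictItem
    } else {
        LineKind::Unrecognized
    };
    (depth, kind)
}
```

Because the checks run in a fixed order and each tag must be followed by a space or end of line, a line that starts with `-#` falls through to the dict-item check instead of being treated as a list item.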

**Inline parsing:** when the parser hits a `[` or `{` line, it delegates to `InlineParser::parse()` which is a separate character-by-character recursive descent parser.

**Serialization:** `Value` → `render_value()` (recursive, builds lines) → joined string
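
The recursive "build lines, then join" shape can be sketched as follows. This is a simplified stand-in, not the crate's dumper: the `Value` enum is redefined locally, `render_item`, `is_simple`, and `dumps_sketch` are hypothetical helpers, the 4-space indent is an assumption, and keys are assumed not to need multiline key syntax.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Value {
    String(String),
    List(Vec<Value>),
    Dict(Vec<(String, Value)>),
}

/// A string fits on the tag line if it has no newlines and no
/// leading or trailing whitespace.
fn is_simple(s: &str) -> bool {
    !s.contains('\n') && s.trim() == s
}

/// Render one list or dict entry; `tag` is "-" or "key:".
fn render_item(pad: &str, tag: &str, value: &Value, indent: usize, lines: &mut Vec<String>) {
    match value {
        Value::String(s) if is_simple(s) => {
            if s.is_empty() {
                lines.push(format!("{pad}{tag}"));
            } else {
                lines.push(format!("{pad}{tag} {s}"));
            }
        }
        nested => {
            // Anything else goes on indented lines below the tag.
            lines.push(format!("{pad}{tag}"));
            render_value(nested, indent + 4, lines);
        }
    }
}

/// Recursively append the lines for `value` at the given indentation.
fn render_value(value: &Value, indent: usize, lines: &mut Vec<String>) {
    let pad = " ".repeat(indent);
    match value {
        Value::String(s) => {
            // Multiline string syntax: one "> " line per text line.
            for text in s.split('\n') {
                if text.is_empty() {
                    lines.push(format!("{pad}>"));
                } else {
                    lines.push(format!("{pad}> {text}"));
                }
            }
        }
        Value::List(items) if items.is_empty() => lines.push(format!("{pad}[]")),
        Value::Dict(pairs) if pairs.is_empty() => lines.push(format!("{pad}{{}}")),
        Value::List(items) => {
            for item in items {
                render_item(&pad, "-", item, indent, lines);
            }
        }
        Value::Dict(pairs) => {
            for (key, item) in pairs {
                render_item(&pad, &format!("{key}:"), item, indent, lines);
            }
        }
    }
}

/// Hypothetical entry point: render all lines, then join them.
fn dumps_sketch(value: &Value) -> String {
    let mut lines = Vec::new();
    render_value(value, 0, &mut lines);
    lines.join("\n")
}
```

Collecting lines in a `Vec<String>` keeps indentation handling local to each recursion level; the join happens once at the end.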

**Serde:** parse to `Value` first, then walk the tree to drive the serde `Visitor` (a two-phase approach like the toml crate's; serde_json, by contrast, deserializes directly from the input).

## Key design decisions

- **Dict uses `Vec<(String, Value)>`** not HashMap — NestedText preserves insertion order.
- **Lexer classifies all lines upfront** into a Vec, then parser consumes via peek/next. This is simpler than a streaming approach and NestedText files are typically small (config files).
- **No parser combinator library (no nom)** — the format is simple enough that hand-written recursive descent produces better error messages.
- **serde is feature-gated** (`default = ["serde"]`) — it is enabled by default; users who only want the Value API can skip the serde dependency by setting `default-features = false`.
- **Error messages match the Python reference implementation exactly** — this is validated by the test suite. Don't change error message wording without checking the test suite.
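
To make the first decision concrete, here is a minimal sketch of an order-preserving dict built on `Vec<(String, Value)>`. The enum is redefined locally so the example stands alone, and the `get` and `keys` helpers are hypothetical; the crate's `Value` may expose a different API.

```rust
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum Value {
    String(String),
    List(Vec<Value>),
    Dict(Vec<(String, Value)>),
}

impl Value {
    /// Linear-scan lookup (hypothetical helper). Fine for small
    /// config-sized documents; a HashMap would lose insertion order.
    fn get(&self, key: &str) -> Option<&Value> {
        match self {
            Value::Dict(pairs) => pairs
                .iter()
                .find(|(k, _)| k.as_str() == key)
                .map(|(_, v)| v),
            _ => None,
        }
    }

    /// Keys in the order they were inserted (hypothetical helper).
    fn keys(&self) -> Vec<&str> {
        match self {
            Value::Dict(pairs) => pairs.iter().map(|(k, _)| k.as_str()).collect(),
            _ => Vec::new(),
        }
    }
}
```

With this layout, a document whose keys appear as `zebra` then `apple` iterates in that same order, which is what a faithful roundtrip needs.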

## NestedText gotchas learned during implementation

These are the tricky parts of the spec that caused bugs:

1. **"First tag wins"** — only the first occurrence of a tag character matters. A line like `-#:'>: -#:">:` is a dict item, not a list item, because `-` isn't followed by a space.

2. **Trailing whitespace IS a value** — `key:  ` (two spaces after colon) has value `" "` (one space). Any indented content below it is an "invalid indentation" error because the value is already set.

3. **Partial dedent** — returning to an indentation level that was never established (e.g., indent from 0→4, then dedent to 2) is a specific error: "invalid indentation, partial dedent."

4. **Inline whitespace** — tabs and spaces are both valid whitespace inside inline `[...]` and `{...}`. But ONLY ASCII spaces are valid for indentation.

5. **Inline dict colon** — `{a:0}` is valid (colon without space). `{:}` gives key="" value="". But `{a:0,}` (trailing comma) is "expected value." while `{a:0, }` (trailing comma+space) is "expected ':', found '}'."

6. **Unicode quotes in error messages** — inline parser errors use Unicode curly quotes (U+2018/U+2019: `‘` `’`). Indentation errors use ASCII single quotes with escape sequences (`'\t'`, `'\xa0'`).

7. **BOM handling** — UTF-8 BOM (U+FEFF) must be stripped before parsing.

8. **Empty documents** — with `Top::Any`, an empty document returns `None`; with `Top::Dict`, `Top::List`, or `Top::String`, it returns an empty value of that type.

9. **Key serialization** — keys containing `: `, ending with `:`, with leading/trailing whitespace, or starting with tag characters (`- > : # [ {`) must use multiline key syntax (`: key_text`).

10. **Value serialization** — values with leading/trailing whitespace or newlines must use multiline string syntax (`> text`).
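
Three of the gotchas above (2, 4, and 7) are small enough to sketch in isolation. The helper names `strip_bom`, `measure_indent`, and `split_dict_item` are hypothetical and the logic is deliberately simplified; the crate's lexer reports full errors with line and column details rather than a bare `char`.

```rust
/// Gotcha 7: drop a leading UTF-8 BOM (U+FEFF) before lexing.
fn strip_bom(input: &str) -> &str {
    input.strip_prefix('\u{feff}').unwrap_or(input)
}

/// Gotcha 4: only ASCII spaces count as indentation. Any other
/// whitespace character in the indent is returned as the offender.
fn measure_indent(line: &str) -> Result<usize, char> {
    for (i, ch) in line.chars().enumerate() {
        match ch {
            ' ' => continue,
            c if c.is_whitespace() => return Err(c),
            // Everything before index i was a one-byte space,
            // so i is also the indentation width.
            _ => return Ok(i),
        }
    }
    // Line is empty or all spaces: treat as blank, depth 0.
    Ok(0)
}

/// Gotcha 2: split a dict item at the first ": ". Everything after
/// that separator is the value, including further spaces.
fn split_dict_item(content: &str) -> Option<(&str, &str)> {
    if let Some((key, value)) = content.split_once(": ") {
        return Some((key, value));
    }
    // "key:" with nothing after the colon has an empty value.
    content.strip_suffix(':').map(|key| (key, ""))
}
```

For `key:  ` (two spaces), the split happens at the first `: `, so the remaining single space becomes the value.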

## Testing

```sh
cargo test                                    # all tests
cargo test --test official -- --nocapture      # official suite with counts
cargo test --lib                               # unit tests only
```

The official test suite is a git submodule at `tests/nestedtext_tests/`. The test harness (`tests/official.rs`) validates:
- **Load tests:** parsed output matches `load_out`, or error details (message, lineno, colno, line) match `load_err`
- **Roundtrip dump tests:** load → dump → load produces identical Value

If tests fail after changes, run with `--nocapture` to see which test names failed and what the mismatch was.

## Common maintenance tasks

### Updating the test suite
```sh
cd tests/nestedtext_tests
git pull origin master
cd ../..
cargo test --test official -- --nocapture
git add tests/nestedtext_tests
git commit -m "Update official test suite"
```

### Adding a new error type
1. Add variant to `ErrorKind` in `error.rs`
2. Return `Error::new(ErrorKind::NewVariant, "message matching reference impl.")` with `.with_lineno()`, `.with_colno()`, `.with_line()`
3. Check the Python reference for the exact error message wording
4. The test suite will catch mismatches
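
The builder-style call chain in step 2 could look roughly like the sketch below. Only the names `Error::new`, `ErrorKind`, `with_lineno`, `with_colno`, and `with_line` come from the steps above; the field names, field types, and the `InvalidIndentation` variant are assumptions made for illustration, so check `error.rs` for the real definitions.

```rust
/// Hypothetical variant set; the real enum lives in error.rs.
#[derive(Debug, Clone, PartialEq)]
enum ErrorKind {
    InvalidIndentation,
}

/// Assumed layout: optional location details filled in by builders.
#[derive(Debug, Clone, PartialEq)]
struct Error {
    kind: ErrorKind,
    message: String,
    lineno: Option<usize>,
    colno: Option<usize>,
    line: Option<String>,
}

impl Error {
    fn new(kind: ErrorKind, message: impl Into<String>) -> Self {
        Error {
            kind,
            message: message.into(),
            lineno: None,
            colno: None,
            line: None,
        }
    }

    /// Each builder consumes and returns the error so calls can chain.
    fn with_lineno(mut self, lineno: usize) -> Self {
        self.lineno = Some(lineno);
        self
    }

    fn with_colno(mut self, colno: usize) -> Self {
        self.colno = Some(colno);
        self
    }

    fn with_line(mut self, line: impl Into<String>) -> Self {
        self.line = Some(line.into());
        self
    }
}
```

A call site would then read as `Error::new(ErrorKind::InvalidIndentation, "invalid indentation, partial dedent.").with_lineno(n).with_colno(c).with_line(text)`, with the message string copied from the reference implementation.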

### Modifying the parser
The parser is sensitive to the order of checks. When modifying:
- Run `cargo test --test official -- --nocapture` after every change
- Pay attention to the distinction between "extra content" (leftover lines at depth 0), "invalid indentation" (unexpected deeper indent), and "partial dedent" (depth not in established levels)
- The `indent_stack` tracks current nesting; `all_indent_levels` tracks every level ever used (for partial dedent detection at top level)
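
The distinction between a clean dedent and a partial dedent can be sketched with a stack of established depths. This is a simplified model: `IndentChange` and `classify_indent` are hypothetical names, and it consults only the stack, whereas the parser also uses `all_indent_levels` at the top level. Whether a deeper indent is legal depends on the preceding line, which this sketch does not decide.

```rust
#[derive(Debug, PartialEq)]
enum IndentChange {
    /// Same depth as the current level.
    Same,
    /// Deeper than the current level (valid or not depends on context).
    Deeper,
    /// Returns to an established level; holds how many levels to pop.
    Dedent(usize),
    /// Shallower than the current level but never established.
    PartialDedent,
}

fn classify_indent(indent_stack: &[usize], depth: usize) -> IndentChange {
    let current = indent_stack.last().copied().unwrap_or(0);
    if depth == current {
        IndentChange::Same
    } else if depth > current {
        IndentChange::Deeper
    } else if let Some(pos) = indent_stack.iter().position(|&d| d == depth) {
        IndentChange::Dedent(indent_stack.len() - 1 - pos)
    } else {
        IndentChange::PartialDedent
    }
}
```

With levels `[0, 4]` established, a line at depth 2 matches no entry on the stack, which is the 0→4→2 case that must produce "invalid indentation, partial dedent."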

### Modifying the dumper
The dumper must produce output that roundtrips correctly. After changes:
- `cargo test --test official run_roundtrip_dump_tests -- --nocapture` validates all 80 cases
- Check `key_requires_multiline()` and `value_needs_multiline()` — these are the safety valves that prevent ambiguous output
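
As a reading aid, the two checks might be approximated as below, following the key and value serialization rules listed under the gotchas. The function names match the ones mentioned above, but the bodies are assumptions and not the crate's code; the newline check for keys in particular is an added guess, and the tag-character test is deliberately conservative.

```rust
/// Characters that would be misread as a tag at the start of a key.
const TAG_CHARS: [char; 6] = ['-', '>', ':', '#', '[', '{'];

/// True when a key cannot be written safely as `key: value`.
fn key_requires_multiline(key: &str) -> bool {
    key.contains(": ")
        || key.ends_with(':')
        || key != key.trim()
        || key.starts_with(&TAG_CHARS[..])
        // Assumption: a key containing a newline also needs `: key_text` lines.
        || key.contains('\n')
}

/// True when a value must be written with `> text` lines.
fn value_needs_multiline(value: &str) -> bool {
    value != value.trim() || value.contains('\n')
}
```

Erring on the side of multiline output is safe for roundtripping: the result is more verbose but never ambiguous.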

## Dependencies

- `serde` (optional, default on) — serialization framework
- `serde_json` (dev) — for test suite JSON parsing
- `base64` (dev) — for decoding test suite inputs