badness 0.1.0

An LSP, formatter, and linter for LaTeX
Documentation
# badness --- Roadmap

A LaTeX formatter, linter, and language server on a lossless rowan CST,
mirroring **arity** (`../arity`, the same tool for R). See `AGENTS.md` for
load-bearing design decisions, invariants, and the copy-from-arity strategy.

Single-crate package (not a workspace). Parser and formatter are **intentionally
interleaved**: the formatter is the primary tool for stress-testing the parser.

Files marked **\[copy\]** are lifted \~wholesale from arity; **\[rewrite\]** are
LaTeX-specific; **\[diverge\]** intentionally differs from arity.

Status: `[ ]` todo · `[~]` in progress · `[x]` done

--------------------------------------------------------------------------------

## Where we are

Phases 0--4 are done: a lossless, error-tolerant recursive-descent parser over a
rowan CST; `badness format` (parse → Wadler IR → print) with whitespace
normalization, environment + group/argument indentation, and paragraph reflow
(`--wrap`, default `Reflow`, via the `Ir::Fill` node); salsa incrementality + a
semantic layer (label/ref model, signature DB, project include graph); and a
minimal salsa-backed LSP (`badness lsp`: full-document formatting + pushed
parser diagnostics).

Prose-argument reflow has landed: under `WrapMode::Reflow`, an argument the
signature DB marks `prose` (a `\footnote`/`\caption` body, a sectioning title)
reflows to the line width — joined when short, wrapped when long, via a soft
`Ir::group` around the paragraph-fill engine (`formatter/core.rs`:
`lower_command`/`lower_prose_group`). `ArgSpec` grew a per-position `prose` flag
(object form in `data/signatures.json`); the table is a conservative starter set
(text-formatting commands, footnotes/captions, sectioning titles) and is
incrementally tunable. Non-prose groups (`\newcommand` body, `\label`) are left
exactly as authored.

**Next up --- pick by priority:**

- *Formatter:* `Sentence`/`Semantic` wrap modes (port panache's sentence rules /
  sembr; both fall back to `Preserve` today) — *demoted, much later*. Or: widen
  the prose-argument table (CWL ingest could feed it), and consider gluing a
  prose arg onto its command line when a source break separates them.
- *LSP:* `--wrap`/config over LSP (today `EditorSettings` carries only
  `line_width`/`indent_width`; `wrap` is hardcoded `Reflow`); README
  editor-wiring docs. (The `format_node(tree)` cached-tree reuse is already done
  `lsp.rs` `compute_format`.)
- *Hardening:* the `latexindent` differential formatter oracle (more useful now
  that reflow — including prose args — has landed).

Use formatter ambiguities to drive parser fixes (AGENTS.md tenet 3). The
differential oracles --- `latexindent` (formatter) and texlab/tree-sitter-latex
(parse) --- remain available as hardening tracks.

--------------------------------------------------------------------------------

## Phases

- [x] **Phase 0 --- Foundations.** Module layout, core deps, `syntax.rs`,
  `text/line_index.rs`, `parser/events.rs` + `tree_builder.rs`, lossless
  lexer, round-trip harness, insta scaffolding, `Taskfile.yml`.

- [x] **Phase 1 --- Core parser.** Event-stream recursive descent → green tree;
  side-channel diagnostics; paragraphs, control sequences, groups, comments,
  environments (with mismatch recovery), greedy argument grouping, math
  (`$…$`, `$$…$$`, `\[…\]`, `\(…\)`), `\verb`/verbatim lexer modes,
  `\makeatletter` letter-mode; recovery anchors + progress guarantee;
  losslessness asserted; texlab differential parse oracle.
  Open follow-ups:
  - [ ] Argument-taking verbatim envs (`lstlisting`/`minted`/`Verbatim`) ---
    needs the signature DB to know where the raw body starts.
  - [ ] Structured math model (scripts/delimiters) --- currently flat tokens
    (Phase 5).
  - [ ] Block-vs-inline refinement: a lone block env is wrapped in a
    `PARAGRAPH`; the signature DB can later avoid that.

- [x] **Phase 2 --- CLI + formatter MVP.** `badness format` (parse → Wadler IR →
  print); **\[copy\]** IR + printer engine; whitespace normalization,
  environment + group/argument indentation (printer-owned, idempotent),
  paragraph reflow (`WrapMode`, `Ir::Fill`); protected regions untouched;
  invariants (idempotence, parse-stability, losslessness) asserted.

  Open follow-ups:

  - [~] `build.rs` man/completions/markdown
    (clap_mangen/\_complete/clap-markdown). **\[copy\]** --- the `format`
    subcommand lives in `main.rs`; `build.rs` still deferred.
  - [ ] Directory-walking file discovery for `format` (today: explicit
        paths).

- [x] **Phase 3 --- Salsa + semantic layer.** `incremental.rs` salsa harness;
      `semantic_model` (flat label/ref def-use model, `Eq`-backdating); built-in
      signature DB (`data/signatures.json`); project include graph (`\input`/
      `\include`/`\import`/`\subfile`, salsa firewall + reachability/cycles).
      Open follow-ups:
      - [x] `\newcommand`/`\newenvironment`/`xparse` signature scanning
            (signatures only, no execution) feeding the semantic DB.
            `semantic/define.rs` scans the braced-name forms into a per-document
            `SignatureDb`; `semantic/xparse.rs` parses the full xparse arg-spec
            grammar; `Signatures` overlays scanned over built-in
            (scanned-first), and the formatter's `\begin` arity glue consumes
            it. Remaining: the unbraced form `\newcommand\foo…` (parses with
            `\foo` as a sibling, so skipped — needs scanner-side sibling
            heuristics, not parser changes); a salsa `document_signatures` query
            once an LSP consumer (hover/ completion) wants the scanned command
            sigs.
      - [ ] Cross-file label resolution (`file_labels` firewall → project-level
            `resolved_labels`) + duplicate-label / undefined-ref diagnostics.
            Today's `unreferenced_labels`/`unresolved_refs` are per-file
            *facts*, not lints.
      - [ ] CWL corpus ingest (an import format converted *into* the signature
            schema) once ecosystem breadth (e.g. LSP completion) needs it.

- [x] **Phase 4 --- Minimal LSP.** `src/lsp.rs` + `badness lsp` subcommand:
      single-threaded, salsa-backed `lsp-server` loop **\[diverge\]**;
      lifecycle, full-document sync, `textDocument/formatting`,
      `publishDiagnostics`; stdio smoke test.

- [ ] **Phase 5 --- Math.**
  - [ ] Structured math model over the generic math tree.
  - [ ] Precedence-climbing for `^`/`_` binding and primes (the one Pratt
        site).
  - [ ] `\left … \right` delimiter matching.
  - [ ] Alignment-aware formatting: `align`, `matrix`/`pmatrix`, `&`
        columns, `\\` rows.

- [] **Phase 6 --- Linter.** `badness lint` + `linter/{diagnostic,render}`
     surface parse diagnostics; annotate-snippets render done.

  - [ ] `linter/suppression` (`% badness-ignore` style). **\[copy shape\]**
  - [ ] Lints: unmatched delimiters, undefined/duplicate refs, deprecated
          commands, stylistic checks.
  - [ ] Autofix infra; enforce "autofixes never introduce formatting errors"
          (Tenet 5).

- [ ] **Phase 7 --- Full LSP.** Range formatting; linter diagnostics published
      alongside parse errors; document symbols + folding; hover + completion
      from the signature DB; go-to-definition / rename for labels and refs;
      incremental `didChange` sync.

- [ ] **Phase 8 --- Performance & hardening.**
  - [ ] Extract shared crate(s) from the **\[copy\]** files (IR engine
        first), depended on by both badness and arity.
  - [ ] Intra-file incremental reparse (reuse green subtrees on contained
        edits).
  - [ ] Fuzzing (losslessness must hold on arbitrary input).
  - [ ] Large-doc benchmarks (`hyperfine`, criterion); flamegraph hot paths.
  - [ ] `wasm32` build for a web playground.

- [ ] **Phase 9 --- BibTeX / BibLaTeX.** Parser (likely a `bib.rs` module, maybe
      its own crate); formatter + linter rules; LSP support; salsa incremental
      parsing + semantic model integrated with the LaTeX project graph (resolve
      `\bibliography` references).

--------------------------------------------------------------------------------

## Open questions / decisions to revisit

- [ ] Trivia-attachment policy (leading vs. trailing) --- pick one, document it.
- [ ] How much of `\newcommand` / `xparse` to model for the signature DB.
- [ ] Formatter opinionatedness: which choices are configurable vs. fixed.
- [ ] Whether arity should also migrate tower-lsp-server → lsp-server (separate
      decision; out of scope for badness, but the `AGENTS.md` rationale
      applies).