# AGENTS.md
Guidance for AI agents working with Badness, a formatter, linter, and
language server for LaTeX.
Badness follows **rust-analyzer's** architecture: rowan CST + event-stream parser
+ salsa + a Wadler-style formatter IR. (We were also inspired by
[arity](https://github.com/jolars/arity), the same kind of tool for R.) Extended
rationale for the decisions below lives in TODO.md ("Design notes").
## What this project is
Badness parses LaTeX into a **lossless concrete syntax tree (CST)** and builds a
**formatter** (`badness format`), a **linter** (diagnostics), and a **language
server** (LSP) on top. The architecture follows **rust-analyzer**: a generic,
error-tolerant, hand-written parser producing a lossless tree; semantics layered on
top as a separate concern; incremental recomputation via salsa.
**Single-crate Cargo package** (`badness`, edition 2024), *not* a workspace. Module
folders: `parser/`, `formatter/`, `linter/`, `semantic/`, `project/`, `text/`, plus
`syntax.rs` and `incremental.rs`.
## Tenets
1. **Deterministic, rule-based formatting.** Layout is decided solely by the
formatter's rules and the layout engine—the formatter is the **sole authority on
layout**. Push back against hard-coding special cases. Autofixes are textual edits
that never invoke the formatter: a fix decides *what* to rewrite, never *how to lay
it out*, and owes only correctness (the result still parses and is still lossless),
never line-width. When a fix can't meet that bar for some shape, make it correct by
construction or withhold it for that shape (still report the finding). The pipeline
is fix-then-format; don't run the formatter inside `--fix`.
2. **Incremental parsing is first-class.** Parser/CST work must keep the salsa-based
reparse path (`incremental.rs`) viable.
3. **Parsing is the parser's job.** Never paper over parser mistakes in the formatter,
and never let parsing logic creep into the formatter. If the formatter hits
something the parser got wrong, fix it in the parser.
4. **Losslessness is the parser's job.** `reconstruct(text) == text`, always. The
formatter may assume a lossless CST.
## Core architectural decisions
Load-bearing. If a change pushes against one, raise it explicitly. Extended rationale
for the sanctioned lexer modes is in TODO.md ("Design notes").
1. **The parser treats input as generic TeX surface syntax and always produces a
lossless tree.** It never *requires* resolving macros or catcodes—in full
generality that is equivalent to running a TeX engine. We do **not** implement
macro expansion or a TeX evaluator. Anything we cannot statically resolve degrades
to generic nodes (plus a diagnostic where useful), never a crash or corruption.
We **do** handle a bounded, growing set of *statically recognizable* patterns as
lexer modes or semantic enrichment, all reading static facts only (no macro meaning
resolved):
- **Letter modes.** `\makeatletter`/`\makeatother` (`@` is a letter) and
`\ExplSyntaxOn`/`\ExplSyntaxOff` (expl3: `_` and `:` are letters; also opened by
`\ProvidesExplPackage`/`Class`/`File`). Independent flags that compose.
- **Verbatim.** `\verb`/verbatim-like environments and verbatim-argument commands
capture their opaque body or final argument as a single token, using the
signature DB (`data/signatures.json`) for argument shape. Built-ins are curated;
**user-defined** verbatim commands are discovered by the definition scanner
(`semantic::define`) via a **bounded two-pass parse** (pass 1 fingerprints
catcode-changing definitions, pass 2 re-lexes with those names). Conservative by
construction—a false positive suppresses real diagnostics, so prefer false
negatives.
- **`\left`/`\right` delimiter isolation.** The following delimiter is emitted as
its own token; the `LEFT_RIGHT` pair is then built by the parser.
- **Signatures.** `\newcommand`/xparse *signatures* are extracted into the semantic
DB, never executed.
**expl3 code formatting (formatter-side, sanctioned).** The expl3 letter mode above is
a *lexer* fact. The matching *whitespace* catcodes—inside an expl3 region (`\ExplSyntaxOn`
…`\ExplSyntaxOff`, or `\ProvidesExpl*` to EOF) source spaces/tabs are catcode 9 (ignored)
and `~` is catcode 10 (a literal space)—are a **formatter** concern: since inter-token
whitespace is provably insignificant, the formatter owns the layout of in-region code
(indentation + line breaks), **regardless of `WrapMode`**. This is **idempotent by
construction**: the inserted whitespace is itself catcode-insignificant, so re-lexing the
output yields the same token sequence and the deterministic layout is a fixed point. It is
the property the generic "hanging continuation indent" (TODO.md, the flush-B/TikZ problem)
could not get, supplied here at the catcode level. Region membership is **not** recorded in
the CST: the lexer's expl3 toggle stays transient, and the formatter recomputes in-region
byte ranges in a read-only pre-pass (`formatter::core::expl3_regions`) over the same fixed
toggle set the lexer uses (`parser::lexer::expl_toggle`, shared so the two never drift),
stored as a `Vec<TextRange>` side channel in `LowerCtx`—the same byte-range pattern as
parser diagnostics (decision #4). The CST, lexer, events, and tree_builder are untouched, so
losslessness is unaffected; the reformatted output is a different valid text with the same
meaning. Statement boundaries follow *source newlines* (the expl3 one-call-per-line
convention; a multi-token call like `\cs_new:Npn \foo:n #1 {…}` is several sibling CST
nodes, not one structural unit), and a single inserted space at any preserved token boundary
keeps re-lexing from merging two tokens.
2. **Two layers: syntactic vs. semantic.** The *syntactic* layer is the generic CST
and knows nothing about what a command means. The *semantic* layer is a
**signature database** (built-in table + CWL-style data + `\newcommand`/
`\newenvironment` scanning) assigning arity, verbatim-ness, and sectioning.
**Meaning never leaks into the parser** (the verbatim-body exception in
decision #1 reads static argument-shape data only).
3. **Hand-written recursive descent is the spine; Pratt is local to math.** Use
precedence-climbing *only* for sub/superscript binding (`^`, `_`) and `\left…\right`
matching. The text-level parser has no precedence.
4. **Parser emits an event stream, not a tree directly.** `lexer → flat token stream →
parser emits events (Start/Tok(idx)/Finish) → tree_builder re-attaches trivia and
feeds rowan's GreenNodeBuilder`. Tokens are referenced by index; there is **no
`Error` event**—diagnostics ride a side channel keyed by byte range (the
rust-analyzer event-stream pattern).
5. **Errors travel alongside the tree, never abort it.** A single syntactic error
never fails the whole parse. Recovery anchors: `\end{…}`, `\begin`, blank line, `}`,
`$`, `&`, `\\`. Always make progress; never infinite-loop on unexpected input.
6. **Incrementality is salsa-first.** Cross-file/cross-query incrementality via salsa
is the v1 story. Intra-file incremental reparse (reusing green subtrees) is a
*later optimization*—a whole-file reparse of a typical `.tex` is sub-ms.
7. **Store green nodes in salsa, never red (`SyntaxNode`).** Red trees aren't
`Send`/`Eq`/`salsa::Update`. See `incremental.rs`: `#[salsa::input]
SourceFile { text }`, a `parsed_document` query returning `rowan::GreenNode` +
diagnostics under `no_eq, unsafe(non_update_types)` (sound because the tree is a
pure function of the text), materializing red cursors on demand.
8. **Argument grouping is greedy and generic.** The CST greedily attaches trailing
`{…}`/`[…]` groups as argument nodes (texlab-style). Arity is unknown at parse time;
the semantic layer refines it.
9. **Trivia attachment follows the rust-analyzer rule: comments bind *forward*,
whitespace floats, blank lines break the bind.** Trivia is never dropped, so the
only question is which node owns it:
- **Default: float at the nearest enclosing node**—inter-sibling whitespace and
newlines stay direct children of the tightest containing block/group, owned by
neither neighbor.
- **A contiguous run of own-line `%` comments immediately preceding a `COMMAND` or
`ENVIRONMENT` binds *leading* into it**, grouped as a `DOC_COMMENT` node.
"Documentable" is decided purely on node kind—no signature-DB lookup leaks into
the parser. A same-line trailing comment (`\foo % x`) never binds.
- **A blank line (`≥2` newlines, the `\par` boundary) breaks the bind:** comments
past it stay floating.
Whitespace stays a bare leaf token (never wrapped); the bound leading-comment run
is the one named-node exception. This is a CST-shape convention enforced by tests,
not a hard oracle.
## Invariants (test oracles—enforce them)
- **Losslessness:** `reconstruct(text) == text`, byte-for-byte.
- **Idempotence:** `fmt(fmt(x)) == fmt(x)`.
- **Protected regions** (`verbatim`, `lstlisting`, `\verb`, comments) are never altered
by the formatter.
There is deliberately **no parse-stability invariant**: the formatter may *normalize*
structure (e.g. stripping redundant braces around a single-token math script,
`x^{2}` → `x^2`), changing CST shape on purpose. Such rewrites must preserve *meaning*
(carried by fixtures and the corpus) but are not held to structural equality with the
input. The formatter is intentionally used to stress the parser—any formatter
ambiguity should surface a parser modeling gap.
**Differential oracle:** use **texlab's parser** as a differential *parse* oracle over
a corpus—skeletonize both trees and compare. It is a reference we measure against,
never match.
## Technology choices
- **rowan** (`0.16`) for the CST; **salsa** (`0.26`) for incremental queries;
**smol_str** for interned token text; **insta** for snapshot tests;
**annotate-snippets** for diagnostics rendering.
- **LSP:** `lsp-server` + `lsp-types` (rust-analyzer's stack), **not**
`tower-lsp-server`. salsa cancellation is a synchronous unwind
(`salsa::Cancelled`) that composes with `lsp-server`'s sync main loop + threadpool
and fights tower-lsp's async `&self` model. `text/line_index.rs` uses
`lsp_types::Position`.
- **Formatter engine:** a Wadler/Prettier-style `Doc` IR
(`Group`/`Line`/`SoftLine`/`HardLine`/`EmptyLine`/`Indent`), plus an `Ir::Fill`
node (per-gap greedy break decisions) for paragraph reflow.
- **Paragraph line breaks** are controlled by a `WrapMode` (`Reflow` default,
`Sentence`, `Semantic`/sembr, `Preserve`), modeled on the sibling **panache**
formatter and mechanized through the `Doc` IR (`Fill`), not a separate line-filler.
`Reflow` and `Preserve` are implemented; `Sentence`/`Semantic` fall back to
`Preserve`. The `\\` line break (with a tightly-bound `*`/`[len]`) is grouped by the
*parser* into a `LINE_BREAK` node so the formatter sees `\\[2ex]` as one unit.
- **CLI:** `clap` + `build.rs` generating man pages, completions, and markdown
(`clap_mangen`, `clap_complete`, `clap-markdown`).
## Non-goals
- No general macro expansion, no TeX evaluator, no execution of TeX primitives or
arbitrary `\def` semantics. (Common `\newcommand`/`\newenvironment`/xparse
*signatures* may feed the semantic DB—extracted, never executed.)
- No general `\catcode` handling beyond the bounded patterns in decision #1.
- We never typeset.
## Repo conventions
- Edition 2024; the toolchain is pinned by `rust-toolchain.toml` (single source of
truth), consumed by `devenv.nix` and honored by CI. A `wasm32-unknown-unknown`
target is configured.
- **Run `cargo fmt` before committing**—the rustfmt git hook rewrites unformatted
files and aborts the commit otherwise. `clippy` warnings are errors:
`cargo clippy --all-targets --all-features -- -D warnings`.
- Task runner is `go-task` (`Taskfile.yml`). Performance is
first-class (`perf`, `cargo-flamegraph`, `hyperfine`, `cargo-show-asm`,
`cargo-llvm-cov` are in the dev shell)—benchmark before optimizing, never regress
losslessness for speed.
- New parser features need corpus + snapshot tests **and** a losslessness assertion.
- **Windows CI bites twice:**
- *Line endings.* The formatter emits **LF** and tests compare bytes against
checked-in fixtures. When you add a fixture in a new extension under
`tests/fixtures/**` or `tests/corpus/**`, add a matching `… eol=lf` line to
`.gitattributes` (the `*_crlf_*`/`*_lf_*` line-ending fixtures are the deliberate
`-text` exceptions). Never normalize line endings in code to pass a test—fix the
attribute.
- *URIs.* Decode LSP URIs to filesystem paths only through `uri_to_fs_path`/
`path_to_uri` (`lsp.rs`), which strips the `/` before a Windows drive letter and
keeps the Unix root. Keep `uri_to_fs_path_handles_unix_and_windows` green; tests
and snapshots must not assume `/` vs `\`.
- **Bib field DB** (`data/bib_fields.json`) tracks biblatex's canonical data model
(`blx-dm.def`). `scripts/gen_bib_fields.py` syncs the mechanical facts (entry-type
set, field categories, `required` constraints), preserving the hand-curated
`optional` ordering and classic-BibTeX overlay. `task bib-fields:check`/`:sync`
report/apply drift after a biblatex bump. Don't hand-edit the mechanical
facts—change them via the model and re-sync.
## Working agreements for agents
- Keep the syntactic layer free of semantic knowledge.
- Don't add intra-file incremental reparse, macro expansion, or catcode logic beyond
decision #1 without recording the decision here.
- Update TODO.md as phases progress; update this file when a decision changes.