# badness—Roadmap
A LaTeX formatter, linter, and language server on a lossless rowan CST,
following **rust-analyzer's** architecture. See `AGENTS.md` for load-bearing
design decisions and invariants.
Single-crate package (not a workspace). Parser and formatter are **intentionally interleaved**: the formatter is the primary tool for stress-testing the parser.
Status: `[ ]` todo · `[~]` in progress · `[x]` done
--------------------------------------------------------------------------------
## Parser
## Formatter
- [ ] `Sentence`/`Semantic` (sembr) wrap modes—both fall back to `Preserve`
today. *Demoted, much later.*
- [ ] **Argument content-kind taxonomy.** `prose`/`collapse` are two ad-hoc
bools on `ArgSpec`; the real model is a per-argument *content kind*
(opaque, token-list, prose, document-body) the formatter dispatches
whitespace and break policy on. Generalize once a third case appears. The
non-determinism fix (`spans_multiple_lines` deciding block-vs-inline from
incidental source newlines) is sidestepped for collapse-flagged args but
still governs every *unflagged* multi-line group—revisit when the
taxonomy lands.
- [ ] **Long collapsed cite list overflow.** A `collapse` arg folds to one line
even when the key list exceeds the width; it never breaks *at commas* (one
key per line) as a fallback. Needs the token-list content kind to break on
its own separators rather than the paragraph fill.
- [ ] Widen the prose-argument table (CWL ingest could feed it); consider gluing
a prose arg onto its command line when a source break separates them.
- [x] **Brace-group body reflow (`ReflowKind::Statement`).** A multi-line brace
body (a `\newcommand` definition body) now reflows as code-like *statements*: each
source line stays its own logical line, but an over-long one wraps to the width
(breaking before a trailing `{…}` atom) instead of forcing the printer to detonate
the innermost nested prose group—the only soft break a rigid
`lower_element_stream` body exposed. Continuation is **flush**, not hanging.
- [ ] **Hanging continuation indent for wrapped statements (B', deferred ---
blocked on structure).** A wrapped brace-body line ideally hangs its continuation
one step in (`\node[…] at (2,3)`/`····{…};`) to read as a continuation rather
than a sibling. This **cannot be idempotent** under the generic CST: the wrap
becomes a real source newline, and on re-parse the continuation is just a line at
the body indent (no marker says "continuation"), so the next pass flushes it ---
`fmt(fmt(x)) != fmt(x)`. Flush-B sidesteps this precisely because there is no
indent delta. The real fix needs a node that *owns the whole statement*, so layout
derives from structure (source newlines insignificant). For the motivating case
(`\node[…] at (2,3) {…};`) that node is a **TikZ path statement**: `at` keyword,
`(coord)`, `;` terminator, `{label}`—none of which are TeX-surface facts
(`;`/`at`/`()` carry no special catcode in plain TeX), so grouping them is
package-specific grammar, out of scope for the generic parser (decisions #1, #2;
non-goals). Belongs in a future sanctioned **TikZ-aware mode** (its own grammar,
corpus, and AGENTS.md amendment), not a formatter patch.
- [ ] Join alignment-cell continuation lines (currently triggers the plain-body
fallback).
- [ ] Column-spec-aware L/C/R cell alignment and `\multicolumn` for the table
environments (`tabular`/`array` are now grid-aligned, but every column is
left-aligned regardless of its `{lcr}` spec). Also: `\cmidrule(lr){2-3}`
paren trim specs (the parenthesized part isn't recognized as part of the
rule line, so such a line is treated as a cell and the table falls back),
and the same-line `\\ \hline` form (only own-line rule commands become
passthrough lines today).
## Linter
- [~] Wire the remaining report-only fixes onto the autofix infra:
`deprecated-command`'s `\bf → \bfseries` is **done** (a `Safe` control-word swap,
consumed by `lint --fix` and the new LSP code actions); `obsolete-environment`'s
`eqnarray → align` is still report-only.
- [~] More stylistic lints. `missing-nonbreaking-space` (a tie before a cite/ref
command, broad curated family, `\nocite` excluded; `Unsafe` autofix) is **done**.
Remaining: general typography (quotes, dashes, …). *Follow-up:* the tie lint only
covers a same-line `WORD WHITESPACE \cmd` shape; a *source line break* before the
command (`Figure\n\ref{x}`) is also a breakable space but is left for a later pass
(replacing the newline with `~` reflows the source and overlaps the formatter).
- [ ] `unused-label` (cross-file)—deferred: can false-positive on labels
referenced from outside the analyzed set.
## Semantic layer & signatures
- [ ] How much of `\newcommand`/`xparse` to model for the signature DB. *(open
decision)*
## Language server
### Configuration & sync
- [x] config over LSP—the LSP now discovers `badness.toml` per document
(`GlobalState::resolve_settings`, cached by anchor dir, cleared on
`didChangeConfiguration`). A discovered config wins outright
(file-wins); editor settings are the fallback. Both `[format]` (`line-width`,
`indent-width`, `wrap`) and `[lint]` (`select`/`ignore`, applied via
`RuleSelection` in the analyze/diagnostic/code-action paths) are honored. Two
follow-ups remain:
- Deliberately *not* done: plumbing `wrap` (or other knobs) through
`EditorSettings` itself. A discovered config's `wrap` flows via `FormatConfig`,
so no new editor knob was needed; `EditorSettings` stays `line_width`/`indent_width`.
- [ ] `workspace/diagnostic` (the workspace-wide pull)—deferred: it is a
streaming/long-poll protocol (held-open request, per-uri result ids, partial
results) that fits the one-shot id-bound read-job model poorly. Advertise
`workspace_diagnostics: true` and add it once that plumbing exists; editors
drive interactive diagnostics through `textDocument/diagnostic` meanwhile.
- [x] `workspace/didChangeWatchedFiles` + dynamic `client/registerCapability`
for `**/*.{tex,bib}` and `badness.toml` so on-disk edits to non-open
includes/`.bib` files (the project graph's leaves) and the config reanalyze.
Watchers are registered post-handshake (`register_file_watchers`) when the client
advertises `didChangeWatchedFiles.dynamicRegistration`; no `notify`-crate fallback
otherwise. A `.tex`/`.bib` change re-reads + re-upserts the non-open file
(`Worker::apply_watched_change`, scoped to seeded project dirs) and re-lints via
`RelintAll`; a `badness.toml` change clears the config cache and re-lints
(`relint_all_open`). Open buffers are skipped (the editor overlay is authoritative).
### Formatting
- [x] Range formatting (`textDocument/rangeFormatting`)—expand the selection to
whole top-level blocks (children of `ROOT`), lower only those (a byte-range
emission filter in `format_node_range_with_signatures`, so the formatter stays
the sole layout authority and out-of-range blocks are never laid out), then
diff the fragment against the original slice into minimal edits. LaTeX only;
bib is a no-op for now.
- [ ] On-type formatting (`textDocument/onTypeFormatting`), e.g. re-indent on
`}`/`\end{…}` close. *Lower priority; opt-in trigger characters.*
### Navigation & structure
- [ ] Selection ranges (`textDocument/selectionRange`)—expand-selection from
the CST's node hierarchy (group → argument → command → environment).
- [x] Workspace symbols (`workspace/symbol`)—the per-file outline (sections,
labels, floats, theorems, macros, environments) aggregated across every tracked
project file, filtered by the query string. LaTeX members only.
### Labels & references
### IntelliSense (signature DB)
- [ ] Signature help (`textDocument/signatureHelp`)—show the active argument
while typing a command's `{…}`/`[…]` arguments.
### Code actions
### Infrastructure
- [ ] Client capability negotiation—gate advertised providers and
UTF-8/UTF-16 position encoding on what `initialize` reports.
- [ ] README editor-wiring docs (Neovim/VS Code `initializationOptions`,
`badness lsp` invocation).
## Package & class infrastructure (`.sty`/`.cls`/`.dtx`/`.ins`)
The document-level tools are mature; the next frontier is the **package
ecosystem**—class and package sources, and the literate `.dtx` format they
ship in. This is a large, multi-area subproject (parser + formatter + semantic).
It stays inside the AGENTS.md non-goals: bounded, statically-recognizable
patterns only, signatures *extracted, never executed*, no docstrip run, no TeX
engine. Local project files only—a `texmf`/CTAN/`kpsewhich` search is out of
scope (the same boundary the include graph and CWL ingest keep).
### Parsing
- [x] **expl3 full catcode model.** `~` is a literal space (catcode 10) and spaces/tabs
are ignored (catcode 9) inside expl3 regions, so the formatter owns in-region layout
regardless of `WrapMode`: one statement per source line, brace-group code blocks indent,
inter-token whitespace collapses to a single space, and `~` is a breakable space. The
formatter recomputes region membership read-only (`formatter::core::expl3_regions`, sharing
the lexer's `expl_toggle` set); the CST/lexer are untouched, so losslessness holds and the
layout is idempotent by construction (catcode-9 whitespace is insignificant, so it supersedes
the flush-B deferral for expl3 — see "Hanging continuation indent" above). See the expl3
code-formatting note in `AGENTS.md` decision #1.
- [ ] **expl3 implicit detection in toggle-less `.dtx` (deferred).** Real expl3
package sources (e.g. `ltx-talk-structure.dtx`) carry no in-file `\ExplSyntaxOn`/
`\ProvidesExpl*`; expl3 is declared in the parent `.dtx`/build, and `@@` is a
docstrip module prefix (`%<@@=mod>`). Treat `macrocode` bodies as expl3 when the
file carries a static expl3 signal (a `%<@@=mod>` guard or `\ProvidesExpl*`
anywhere). Needs a file-level scan plus the `macrocode` save/restore interaction
(mirror `at_letter`).
### Formatting
### Semantic and integration
- [ ] **Signature extraction from package sources.** Run the existing
`semantic/define.rs` scanner across loaded `.sty`/`.cls` (and `macrocode`
blocks of a `.dtx` when no generated `.sty` is present), extending it to
`\DeclareRobustCommand`/`\DeclareDocumentCommand` and friends. Prefer a
generated `.sty` over its `.dtx` source when both exist.
- [ ] **Package metadata & options (recognize, never execute).**
`\ProvidesPackage`/`\ProvidesClass` (name/date/version),
`\NeedsTeXFormat`, `\DeclareOption`/`\ProcessOptions`/`\ExecuteOptions` ---
surfaced as signatures/metadata for hover/diagnostics, never run.
- [ ] **Package-aware diagnostics.** Once the load graph exists: unknown-option,
duplicate `\RequirePackage`, missing `\ProvidesPackage`, and resolving
user-macro definitions to their defining package for hover/go-to-definition.
## Performance & hardening
- [ ] Fuzzing (losslessness must hold on arbitrary input).
- [~] Large-doc benchmarks (`hyperfine`, criterion); flamegraph hot paths.
Formatter speed bench vs `tex-fmt`/`latexindent` landed (`benches/compare_format.sh`,
`task bench`, writes `benches/benchmark_results.json`, which feeds the docs
benchmark page `docs/src/reference/benchmarks.md`). In-process parse/format micro-bench +
flamegraph hot paths landed (`benches/formatting.rs`, `task bench:micro`/`bench:profile`;
see the profiling item below). Still pending: bib + lint benchmarks.
- [x] **Profile the formatter (separate startup floor from per-byte cost).** Done:
in-process micro-bench (`benches/formatting.rs`, `task bench:micro`, split
parse/format/full + throughput, with a single-doc flamegraph hook
`task bench:profile`; modeled on panache's `benches/formatting.rs` rather than
criterion so the flamegraph attaches to one hot doc). Findings (full writeup in
`benches/README.md`):
- **Startup floor was the one-time CWL signature-DB init, not binary load—now
fixed.** A bare `--version` is ~0.8 ms, but the format path's floor was ~4.4 ms:
`cwl()` decompressed+parsed the embedded `cwl_signatures.json.gz` once (~4.5 ms,
`LazyLock`) and is on the hot path (`Signatures::command`/`environment` fall
back to it; the lexer uses it for verbatim-env detection). **Fixed:** the CWL
tier is now baked into the binary as a build-time `phf` perfect-hash map
(`build.rs` + `phf_codegen`, values are `const fn` constructor calls;
`CommandSig`/`EnvironmentSig` args became `Cow<'static,[ArgSpec]>`), so init is
~0—no decompress, no parse. CLI `small.tex` dropped ~4.5 ms → ~1.3 ms,
`cv.tex` ~5.1 ms → ~1.4 ms. Trade-off: larger binary (uncompressed statics) and
a build-time codegen step; `flate2` dropped, `phf`/`phf_codegen` added. The
curated `builtin` DB (~0.09 ms) stays a runtime JSON `LazyLock`—negligible.
- **Per-byte cost is mostly architectural.** masters_dissertation in-process:
parse ~25 %, lower+print ~70 %, ~10 MB/s. Flamegraph self-time is dominated by
rowan red-tree cursor traversal (~25–30 %) + allocator churn (~17 %)—inherent
to the lossless CST + `Doc` IR, by design. Printer itself is ~7 %. Minor slack:
`lower_node` runs up to four direct-children predicate scans per `ENVIRONMENT`
(`has_verbatim_body`/`is_margin_framed`/`is_alignment_env`/`is_list_env`) that
could share one pass. *(Speed-only; no correctness implication.)*
- [ ] Intra-file incremental reparse (reuse green subtrees on contained edits).
- [ ] `wasm32` build for a web playground.
## BibTeX/BibLaTeX
- [ ] Cross-file `undefined-string`: a `@string` defined in one `.bib` and used
in another resolves only once a project-level `@string` union exists (today
single-file-sound, same caveat as `unused-string`).
- [x] Bib-aware LSP completion: `@string` macro names in value position, field
names per entry type (type-scoped, hiding fields already present), and entry
types after `@` (`src/bib/completion.rs`); plus `\cite` key completion on the
`.tex` side, resolved cross-file via `ResolvedCitations` (`src/lsp.rs`
`cite_completion_items`).
- [ ] Bib document-symbol outline completeness: `src/bib/outline.rs` surfaces
regular entries only; consider `@string`/`@preamble`/`@comment` blocks (and a
richer `SymbolKind`/detail).
- [ ] `title-capitalization` refinement: the acronym heuristic flags mid-word
capitals, so CamelCase names (`McDonald`, `DeForest`) are false positives ---
a curated name-particle allowlist or a smarter word model would tighten it.
- [ ] Shared component-finder: `ResolvedCitations` duplicates the union-find +
component assignment from `ResolvedLabels` (`project/citations.rs`); factor one
helper when a third consumer appears.
--------------------------------------------------------------------------------
## Open decisions to revisit
- [ ] How much of `\newcommand`/`xparse` to model. *(Semantics)*
- [ ] Formatter opinionatedness: configurable vs. fixed. *(Formatter)*
- [ ] `.dtx` two-layer model: a preprocessor that splits doc/code layers, or a
single lexer mode with margin-aware tokens? *(Package infrastructure)*