texform-transform
Internal implementation crate for texform. Do not depend on this crate directly — its API has no stability guarantees and may change in any release. Use the texform facade crate instead.
A phase-oriented AST rewrite pipeline for TeXForm. It normalizes a parsed Ast into a canonical form so downstream consumers — formula equivalence comparison, MER tokenization, LLM pretraining corpora, polished authoring output — can work against a stable shape without re-implementing LaTeX semantics per use case. This README is the in-depth reference for the transform subsystem: rule authors and contributors should start here.
The crate is a thin wrapper around five ordered phases. Callers choose a build-time [Profile] / [BuildConfig] to compile a rewrite plan, then use per-run [TransformConfig] values to gate phases and set runtime limits.
Quick start
use ;
use ;
let parse_ctx = from_packages;
let mut ast = parse_ctx
.parse_to_ast
.expect;
// Pick a profile. `Faithful` preserves layout while expanding commands; use
// `Corpus` to drop layout hints, `Equiv` for equivalence comparison, and
// `Authoring` for author-facing output.
let context = from_build_config
.expect;
let report = context
.run
.expect;
println!;
println!;
For repeated transforms with the same configuration, build a context once and reuse it:
use ;
let context = from_build_config?;
for mut ast in batch
Public API
The crate's public surface is intentionally small:
| Item | Purpose |
|---|---|
| `BuildConfig::profile(profile)` | Select build-time normalization levels and default runtime config. |
| `TransformContext::from_build_config(config, parse_ctx) -> Result<Self, TransformBuildError>` | Precompile the rewrite plan once for reuse across many ASTs. |
| `TransformContext::run(ast, parse_ctx)` | Execute the precompiled pipeline with the profile default runtime config. |
| `TransformContext::run_with(ast, parse_ctx, config)` | Execute the precompiled pipeline with per-run overrides. |
| `TransformConfig` | Runtime phase gates, FlattenGroups behavior, and max rewrite iterations. |
| `TransformReport` | Per-phase reports aggregated across the run. |
| `TransformError` / `TransformBuildError` | Build-time and run-time error types. |
The rewrite phase additionally re-exports RewriteRule, NormalizationLevel, NormalizationLevelSet, RuleKey, RuleMeta, RuleFidelity, RuleTarget, RuleTargetKey, RuleTargetKind, Plan as RewritePlan and related items for callers that need to introspect rules.
Pipeline
TransformContext::run executes a fixed sequence of phases. Normalization levels are chosen when the context is built; each run may disable rewrite / lower attributes or choose different FlattenGroups and iteration settings through TransformConfig.
- LowerAttributes (pre): canonicalize declarative-scope commands (e.g. `\bf x`) and registered prefix wrappers (e.g. `\mathbf{x}`) into a single normal form.
- Rewrite: apply the precompiled rewrite plan in a fixed-point loop, bounded by `max_iterations`.
- LowerAttributes (post): re-canonicalize attribute markers introduced by rewrite rules (some Standard / Expand rules emit prefix wrappers that need lowering again).
- FinalizeAst: local AST cleanup that does not depend on rewrite metadata, currently merging adjacent `Prime` nodes produced by rewrite rules.
- FlattenGroups: remove redundant explicit and implicit groups after the earlier phases have stabilized.
Phase order is fixed; only the per-phase flags are configurable.
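The fixed ordering with per-phase gates can be pictured with a small toy model. This is a sketch only: `Phase`, `Gates`, and `planned_phases` are hypothetical names invented for illustration and are not part of the crate's API. It assumes that a single gate covers both LowerAttributes passes and that the post pass still runs when Rewrite is disabled; the real behavior may differ.

```rust
/// Hypothetical phase labels mirroring the list above (not the crate's types).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Phase {
    LowerAttributesPre,
    Rewrite,
    LowerAttributesPost,
    FinalizeAst,
    FlattenGroups,
}

/// Hypothetical per-run gates; the real `TransformConfig` fields may differ.
#[derive(Debug, Clone, Copy)]
struct Gates {
    lower_attributes: bool,
    rewrite: bool,
    finalize_ast: bool,
    flatten_groups: bool,
}

/// Returns the phases that would run. Gates only remove phases;
/// they never reorder them.
fn planned_phases(g: Gates) -> Vec<Phase> {
    let all = [
        (Phase::LowerAttributesPre, g.lower_attributes),
        (Phase::Rewrite, g.rewrite),
        (Phase::LowerAttributesPost, g.lower_attributes),
        (Phase::FinalizeAst, g.finalize_ast),
        (Phase::FlattenGroups, g.flatten_groups),
    ];
    all.iter().filter(|(_, on)| *on).map(|(p, _)| *p).collect()
}

fn main() {
    let gates = Gates {
        lower_attributes: true,
        rewrite: false,
        finalize_ast: true,
        flatten_groups: true,
    };
    println!("{:?}", planned_phases(gates));
}
```

Modeling the order as a fixed array and the gates as filters makes it impossible for a configuration to change the relative order of two phases.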
Configuration
TransformConfig
Profiles
Each profile selects build-time normalization levels and supplies a default runtime config.
| Profile | Normalization levels | `flatten_groups` | Target scenario |
|---|---|---|---|
| `Authoring` | Standard | `STRICT` | Polished author-facing formatting; stylistic choices kept. |
| `Faithful` | Standard + Expand | `STRICT` | Preserves the rendered formula while allowing expanded source forms. |
| `Corpus` | Standard + Expand + Drop | `STRUCTURAL_ONLY` | MER training data normalization; layout hints dropped. |
| `Equiv` | Standard + Expand + Drop + Equiv | `STRUCTURAL_ONLY` | Formula equivalence comparison. |
NormalizationLevel
Every rule belongs to exactly one ordered level. A rule's level is the first profile that accepts the rule output as a suitable product; it is not inferred from render fidelity.
| Level | Intent |
|---|---|
| `Standard` | Uncontroversial author-facing standardization: legacy-syntax modernization, typo fixes, alias canonicalization, cross-package anchor unification. Does not collapse stylistic choices that an author may legitimately make. |
| `Expand` | Expands convenience commands, semantic macros, package-specific commands, and spacing primitives into more explicit universal structures while preserving the rendered formula. Output remains readable LaTeX and may be visually, rather than pixel, equivalent. |
| `Drop` | Removes non-ink, metadata, and layout hints a corpus should not learn: linebreak preferences, invisible layout nodes, and similar caller-opt-in deletions. |
| `Equiv` | Output is only suitable as an equivalence-checking intermediate, not as a corpus label. Fidelity does not decide this level: fenced matrix environment expansion is `Full` but still `Equiv`. |
Classify a rule by asking which profile first accepts its output, then declare
the rule's fidelity independently. fidelity may rule out profiles whose floor
it cannot meet, but a high-fidelity rule is not automatically a lower level.
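The cumulative relationship between profiles and levels, and the "first profile that accepts the output" classification question, can be sketched as follows. The enum and function names here are hypothetical stand-ins written for this README, not the crate's `Profile` or `NormalizationLevel` types; only the level sets are taken from the profile table.

```rust
/// Ordered levels as listed in this README; the derive order gives
/// Standard < Expand < Drop < Equiv.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Level {
    Standard,
    Expand,
    Drop,
    Equiv,
}

/// Hypothetical mirror of the four profiles.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Profile {
    Authoring,
    Faithful,
    Corpus,
    Equiv,
}

/// Highest level each profile enables, read off the profile table.
fn highest_level(profile: Profile) -> Level {
    match profile {
        Profile::Authoring => Level::Standard,
        Profile::Faithful => Level::Expand,
        Profile::Corpus => Level::Drop,
        Profile::Equiv => Level::Equiv,
    }
}

/// A profile enables every level up to and including its highest one.
fn profile_enables(profile: Profile, level: Level) -> bool {
    level <= highest_level(profile)
}

/// The first profile, in table order, whose level set contains `level`.
fn first_accepting_profile(level: Level) -> Profile {
    [Profile::Authoring, Profile::Faithful, Profile::Corpus, Profile::Equiv]
        .iter()
        .copied()
        .find(|p| profile_enables(*p, level))
        .expect("Equiv enables every level")
}

fn main() {
    println!("{:?}", first_accepting_profile(Level::Drop));
}
```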
RuleFidelity
fidelity is the worst-case render-fidelity guarantee over the rule's declared
input domain. It is ordered from least to most faithful:
Semantic < Approximate < Full.
| Fidelity | Guarantee |
|---|---|
| `Full` | Pixel-identical rendering before and after the rewrite. |
| `Approximate` | Visually equivalent apart from spacing or placement. |
| `Semantic` | Mathematical meaning is preserved; rendering may change. |
fidelity is a metadata contract only. texform-transform runs no rendering
comparison; how a downstream validator interprets a fidelity level when comparing
rendered output is defined by that consumer, not in this crate.
fidelity must not fall below the rule's level floor:
| Level | Default fidelity | Min fidelity |
|---|---|---|
| `Standard` | `Full` | `Approximate` |
| `Expand` | `Full` | `Approximate` |
| `Drop` | `Semantic` | `Semantic` |
| `Equiv` | `Full` | `Semantic` |
Do not add a second metadata field for ordinary behavior. If a rule has an
important gap between its worst case and usual samples, document that gap in the
rule's top-level comment. For example, displaylines-to-gather-env is
level: Drop and fidelity: Semantic: ordinary \displaylines samples usually
look closer to Approximate, but manual layout commands such as \hfill or
\llap can make MathJax reflow or overlap hand-written equation numbers.
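The floor rule can be checked mechanically once fidelity is treated as an ordered value. The following is a minimal sketch under that assumption; `Fidelity`, `Level`, `min_fidelity`, `default_fidelity`, and `meets_floor` are illustrative names, and the crate's own validation (if it has one) may be structured differently.

```rust
/// Ordered from least to most faithful: Semantic < Approximate < Full.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Fidelity {
    Semantic,
    Approximate,
    Full,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Level {
    Standard,
    Expand,
    Drop,
    Equiv,
}

/// "Min fidelity" column of the table above.
fn min_fidelity(level: Level) -> Fidelity {
    match level {
        Level::Standard | Level::Expand => Fidelity::Approximate,
        Level::Drop | Level::Equiv => Fidelity::Semantic,
    }
}

/// "Default fidelity" column of the table above.
fn default_fidelity(level: Level) -> Fidelity {
    match level {
        Level::Drop => Fidelity::Semantic,
        Level::Standard | Level::Expand | Level::Equiv => Fidelity::Full,
    }
}

/// A rule's declared fidelity must not fall below its level's floor.
fn meets_floor(level: Level, fidelity: Fidelity) -> bool {
    fidelity >= min_fidelity(level)
}

fn main() {
    // The displaylines-to-gather-env example: level Drop, fidelity Semantic.
    println!("{}", meets_floor(Level::Drop, Fidelity::Semantic));
}
```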
FlattenGroupsConfig
FlattenGroups removes structurally redundant Explicit and Implicit groups. The four core actions are:
| Action | Trigger |
|---|---|
| `removed_empty` | Empty `GroupChild` (`{}`) is dropped. |
| `replaced_single_child` | Single-child `GroupChild` is replaced by its child. |
| `inlined_multi_child` | Multi-child `GroupChild` is spliced into its parent's child sequence. |
| `unwrapped_slot` | Single-child group occupying an `Argument` / `ScriptSub` / `ScriptSup` / `Infix*` slot is unwrapped. |
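To make the first three actions concrete, here is a toy flattener over a deliberately simplified tree. It is a sketch with every preserve guard switched off; `Node`, `Actions`, and `flatten` are hypothetical and much simpler than the crate's AST. `unwrapped_slot` is left out because it needs slot information that this toy tree does not model.

```rust
/// Simplified tree: a leaf character or a brace group.
#[derive(Debug, Clone, PartialEq)]
enum Node {
    Char(char),
    Group(Vec<Node>),
}

/// Counters named after three of the four actions above.
#[derive(Debug, Default, PartialEq)]
struct Actions {
    removed_empty: u32,
    replaced_single_child: u32,
    inlined_multi_child: u32,
}

/// Flattens every group in `children`, bottom-up, with no guards applied.
fn flatten(children: Vec<Node>, actions: &mut Actions) -> Vec<Node> {
    let mut out = Vec::new();
    for child in children {
        match child {
            Node::Group(inner) => {
                // Flatten nested groups first, then decide what to do here.
                let inner = flatten(inner, actions);
                match inner.len() {
                    0 => actions.removed_empty += 1,
                    1 => {
                        actions.replaced_single_child += 1;
                        out.extend(inner);
                    }
                    _ => {
                        actions.inlined_multi_child += 1;
                        out.extend(inner);
                    }
                }
            }
            leaf => out.push(leaf),
        }
    }
    out
}

fn main() {
    // a {} {b} {cd}
    let input = vec![
        Node::Char('a'),
        Node::Group(vec![]),
        Node::Group(vec![Node::Char('b')]),
        Node::Group(vec![Node::Char('c'), Node::Char('d')]),
    ];
    let mut actions = Actions::default();
    let out = flatten(input, &mut actions);
    println!("{:?} {:?}", out, actions);
}
```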
These actions are the default behavior. Eleven configuration flags (ten independent preserve guards plus one sub-flag) gate the actions in specific contexts. Each preserve guard belongs to one of two categories:
- Semantic guards — disabling them changes script binding, environment cell boundaries, declarative scope, or infix scope. Both parsed semantics and rendered output change.
- Spacing guards — disabling them only affects atom-spacing and unary/binary classification context. Parsed semantics are unchanged; rendered output may differ by a thin space.
Preserve guards
| # | Field | Category | Example | What the guard preserves |
|---|---|---|---|---|
| 1 | `preserve_group_containing_declarative_command` | Semantic | `{\bf x} y` | Groups whose subtree contains a declarative command (e.g. `\cal`, `\bf`), to avoid leaking declarative scope into following siblings. |
| 2 | `preserve_group_in_script_base_slot` | Semantic | `{ab}^2` | Groups occupying a `ScriptBase` slot, to avoid changing which atom subscripts or superscripts attach to. |
| 3 | `preserve_group_inside_env_body` | Semantic | `\begin{matrix} {a} & b \end{matrix}` | All groups inside an environment body, to preserve cell boundaries and intra-cell spacing. |
| 4 | `preserve_group_containing_infix` | Semantic | `{a \over b}` | `GroupChild`s whose subtree contains an `\over`-style infix, to preserve the infix scope. |
| 5 | `preserve_group_adjacent_to_command_like` | Spacing | `\cos{A}`, `{\int}` | `GroupChild`s whose preceding sibling or first child is command-like. |
| 6 | `preserve_group_as_argument_of_command` | Spacing | `\overline{{\sum}}` | Risky singleton groups directly used as arguments of commands, preserving one spacing boundary while still flattening redundant nesting. |
| 7 | `preserve_empty_group` | Spacing | `{}` | Empty `GroupChild`s, to preserve spacing / kerning effects. |
| 8 | `preserve_group_with_lone_atom_spacing_char` | Spacing | `{+}`, `{,}`, `{*}_N`, `{·}m` | Singleton groups containing only one math atom-spacing character. |
| 9 | `preserve_group_starting_with_atom_spacing_char` | Spacing | `{+x}`, `{,y}` | Multi-child `GroupChild`s whose first child is a math atom-spacing character. |
| 10 | `preserve_group_containing_delimited_pair` | Spacing | `{\left( a \right)}` | `GroupChild`s whose subtree contains a `\left…\right` delimited group. |
Atom-spacing characters: = < > + - , : ; . / * ! ? | ·.
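The character set above and the shapes that guards #8 and #9 look for can be written as simple predicates. This is an illustrative sketch that treats a group's content as a slice of characters; the constant and function names are invented here, and the crate's actual checks operate on AST nodes.

```rust
/// The atom-spacing characters listed above.
const ATOM_SPACING_CHARS: [char; 15] = [
    '=', '<', '>', '+', '-', ',', ':', ';', '.', '/', '*', '!', '?', '|', '·',
];

fn is_atom_spacing_char(c: char) -> bool {
    ATOM_SPACING_CHARS.contains(&c)
}

/// Guard #8 shape: a singleton group holding exactly one atom-spacing
/// character, e.g. `{+}`.
fn is_lone_atom_spacing_group(children: &[char]) -> bool {
    matches!(children, [c] if is_atom_spacing_char(*c))
}

/// Guard #9 shape: a multi-child group whose first child is an
/// atom-spacing character, e.g. `{+x}`.
fn starts_with_atom_spacing_char(children: &[char]) -> bool {
    children.len() > 1 && is_atom_spacing_char(children[0])
}

fn main() {
    println!("{}", is_lone_atom_spacing_group(&['+']));
    println!("{}", starts_with_atom_spacing_char(&['+', 'x']));
}
```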
Sub-flag
| Field | Depends on | Example | Effect |
|---|---|---|---|
| `preserve_group_after_scripted_command_like` | `preserve_group_adjacent_to_command_like` | `\sin^2{x}` | When classifying "command-like" for the adjacency check, recurse through `Scripted` bases. When disabled, `\sin^2` (a `Scripted` node) is no longer treated as command-like and the trailing `{x}` is flattened. |
This sub-flag does not gate any group on its own; it only refines the classification used by guard #5. When preserve_group_adjacent_to_command_like is false, the sub-flag has no effect.
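The interaction between guard #5 and its sub-flag can be sketched as a classification function with a recursion switch. The `Node` type and both functions below are hypothetical simplifications, assuming that "command-like" means a command node or, with the sub-flag on, a `Scripted` node whose base is command-like.

```rust
/// Simplified node kinds; the crate's AST is richer.
#[derive(Debug)]
enum Node {
    Command(&'static str),
    Scripted { base: Box<Node> },
    Char(char),
}

/// Classifies a node as command-like. `recurse_scripted` plays the role of
/// `preserve_group_after_scripted_command_like`.
fn is_command_like(node: &Node, recurse_scripted: bool) -> bool {
    match node {
        Node::Command(_) => true,
        Node::Scripted { base } => recurse_scripted && is_command_like(base, recurse_scripted),
        Node::Char(_) => false,
    }
}

/// Whether a group following `prev` is preserved by guard #5.
/// With the guard off, the sub-flag cannot preserve anything by itself.
fn preserves_following_group(prev: &Node, adjacent_guard: bool, scripted_subflag: bool) -> bool {
    adjacent_guard && is_command_like(prev, scripted_subflag)
}

fn main() {
    // \sin^2 followed by {x}
    let sin_squared = Node::Scripted { base: Box::new(Node::Command("sin")) };
    println!("{}", preserves_following_group(&sin_squared, true, true));
    println!("{}", preserves_following_group(&sin_squared, true, false));
}
```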
Preset values
The preserve guards are wired to presets via two named constants:
- `FlattenGroupsConfig::STRICT`: all guards on. Used by `AUTHORING` and `FAITHFUL`.
- `FlattenGroupsConfig::STRUCTURAL_ONLY`: only semantic guards on; all spacing guards off. Used by `CORPUS` and `EQUIV`.
| Field | Category | `AUTHORING` / `FAITHFUL` (`STRICT`) | `CORPUS` / `EQUIV` (`STRUCTURAL_ONLY`) |
|---|---|---|---|
| `enabled` | – | ✓ | ✓ |
| `preserve_group_containing_declarative_command` | Semantic | ✓ | ✓ |
| `preserve_group_in_script_base_slot` | Semantic | ✓ | ✓ |
| `preserve_group_inside_env_body` | Semantic | ✓ | ✓ |
| `preserve_group_containing_infix` | Semantic | ✓ | ✓ |
| `preserve_group_adjacent_to_command_like` | Spacing | ✓ | – |
| `preserve_group_as_argument_of_command` | Spacing | ✓ | – |
| `preserve_group_after_scripted_command_like` | Spacing | ✓ | – |
| `preserve_empty_group` | Spacing | ✓ | – |
| `preserve_group_with_lone_atom_spacing_char` | Spacing | ✓ | – |
| `preserve_group_starting_with_atom_spacing_char` | Spacing | ✓ | – |
| `preserve_group_containing_delimited_pair` | Spacing | ✓ | – |
Additional constants: ENABLED (alias for STRICT), DISABLED (no flattening at all), DEFAULTS (= STRICT).
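Read as data, the preset table amounts to two constant values of a flag struct. The sketch below mirrors the table with a stand-in type, `FlattenGroupsSketch`, which is not the crate's `FlattenGroupsConfig`: the field names come from the table, but the real struct's layout, visibility, and constructors are assumptions not confirmed here.

```rust
/// Stand-in for the flag set described in the table (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct FlattenGroupsSketch {
    enabled: bool,
    // Semantic guards.
    preserve_group_containing_declarative_command: bool,
    preserve_group_in_script_base_slot: bool,
    preserve_group_inside_env_body: bool,
    preserve_group_containing_infix: bool,
    // Spacing guards, including the sub-flag.
    preserve_group_adjacent_to_command_like: bool,
    preserve_group_as_argument_of_command: bool,
    preserve_group_after_scripted_command_like: bool,
    preserve_empty_group: bool,
    preserve_group_with_lone_atom_spacing_char: bool,
    preserve_group_starting_with_atom_spacing_char: bool,
    preserve_group_containing_delimited_pair: bool,
}

impl FlattenGroupsSketch {
    /// Builds a preset from the two categories used in the table.
    const fn preset(semantic: bool, spacing: bool) -> Self {
        Self {
            enabled: true,
            preserve_group_containing_declarative_command: semantic,
            preserve_group_in_script_base_slot: semantic,
            preserve_group_inside_env_body: semantic,
            preserve_group_containing_infix: semantic,
            preserve_group_adjacent_to_command_like: spacing,
            preserve_group_as_argument_of_command: spacing,
            preserve_group_after_scripted_command_like: spacing,
            preserve_empty_group: spacing,
            preserve_group_with_lone_atom_spacing_char: spacing,
            preserve_group_starting_with_atom_spacing_char: spacing,
            preserve_group_containing_delimited_pair: spacing,
        }
    }

    /// All guards on.
    const STRICT: Self = Self::preset(true, true);
    /// Semantic guards on, spacing guards off.
    const STRUCTURAL_ONLY: Self = Self::preset(true, false);
}

fn main() {
    println!("{:?}", FlattenGroupsSketch::STRUCTURAL_ONLY);
}
```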
Reports
TransformReport aggregates per-phase reports for observability and diagnostics:
- `LowerAttributesReport`: `attributes` (`HashMap<AttributeSet, AttributeStat>`) plus `eliminated_empty_segments`; each attribute stat has `consumed`, `redundant`, and `emitted` counts split into `declaratives` and `prefixes`. The report aggregates all LowerAttributes invocations in one transform run, so the default pipeline combines pre-Rewrite and post-Rewrite counts.
- `RewriteReport`: `iterations` (fixed-point iteration count) and `rules` (`Vec<RewriteRuleStat>` with `key`, `applied_count`, `skipped_count` per rule that was attempted at least once).
- `FinalizeAstReport`: `steps` with one `applied_count` counter per cleanup step (currently `merge_adjacent_primes`).
- `FlattenGroupsReport`: `actions` for the four action counters and `guards` for one hit counter per preserve guard. Hit counters are short-circuit: when several guards would apply to the same group, only the first one that matches in the internal evaluation order is incremented.
The stable facade DTO used by the Python and WebAssembly bindings flattens the same information into a transport-safe shape:
{
iterations,
rules: [{ key, applied_count, skipped_count }],
finalize_ast: {
steps: { merge_adjacent_primes: { applied_count } }
},
flatten_groups: {
actions: { removed_empty, replaced_single_child, inlined_multi_child, unwrapped_slot },
guards: { preserve_group_* }
},
lower_attributes: {
attributes: [{
attr,
value,
consumed: { declaratives, prefixes },
redundant: { declaratives, prefixes },
emitted: { declaratives, prefixes }
}],
eliminated_empty_segments
}
}
Phase internals
LowerAttributes
Two sub-modules drive this phase: lower_attributes/codegen.rs and build-time generated data emitted into OUT_DIR. The phase scans the AST for declarative commands (e.g. \bf, \large, \sf) and registered prefix wrappers (e.g. \mathbf{...}, \textbf{...}), then normalizes both forms into a single canonical representation per attribute.
Attributes are modeled as a structured AttributeSet (Attr × AttrValue) covering math font, math size, math style, text family, text series, text shape, and text size. Inherited state is tracked across container boundaries so that nested declarations, prefix wrappers, and empty trailing segments normalize cleanly.
The phase runs twice in the pipeline (pre and post Rewrite) under a single enabled switch because rewrite rules may emit prefix wrappers as their right-hand side; the post-pass re-canonicalizes those into the same normal form as the pre-pass. LowerAttributesReport uses a single cumulative counter set for both invocations.
Rewrite
Rules live under src/rewrite/rules/{base, ams, braket, physics}/ and are auto-registered through src/rewrite/rules/generated.rs (maintained by build.rs). Each rule is a unit struct implementing RewriteRule with a static RuleMeta descriptor.
RuleMeta declares:
- `key: RuleKey` (`package/name`): the stable identifier used in reports and build-time filters.
- `enabled_by_packages`: packages whose presence in the `ParseContext` enables the rule.
- `level`: see NormalizationLevel.
- `fidelity`: see RuleFidelity.
- `triggers`: `RuleTarget`s the scheduler watches to know when to attempt the rule.
- `consumes`: forms the rule `eliminates` (must not appear in the output) and `touches` (may read or modify).
- `produces`: forms the rule may introduce. The engine verifies every produced form is either accepted by the output contract or eliminated by another rule.
TransformContext::from_build_config builds a Plan by filtering all registered rules through the selected profile levels, build-time disabled rules, and the ParseContext's enabled packages. The plan is then driven by scheduler::drive_fixed_point until either no rule fires in an iteration or max_iterations is exceeded. When Rewrite is enabled, after the full pipeline has run post-Rewrite LowerAttributes and FlattenGroups, the engine uses collect_eliminated_violations to verify that no rule's eliminates set remains in the output AST; a violation is reported as RewriteError::ContractViolation.
See src/rewrite/rules/README.md for rule authoring conventions and the macro-based DSL used to define rules.
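The bounded fixed-point loop described above can be illustrated on a much simpler domain. This sketch rewrites a `String` with toy rules instead of an AST; `Rule`, `drive_toy_fixed_point`, and the three rules are invented for illustration and say nothing about the signature or iteration accounting of the real `scheduler::drive_fixed_point`.

```rust
/// A toy rule: mutates the state and reports whether it changed anything.
type Rule = fn(&mut String) -> bool;

/// Replaces the first occurrence of `from` with `to`, if any.
fn replace_first(s: &mut String, from: &str, to: &str) -> bool {
    match s.find(from) {
        Some(pos) => {
            s.replace_range(pos..pos + from.len(), to);
            true
        }
        None => false,
    }
}

fn a_to_b(s: &mut String) -> bool {
    replace_first(s, "a", "b")
}

fn b_to_c(s: &mut String) -> bool {
    replace_first(s, "b", "c")
}

fn b_to_a(s: &mut String) -> bool {
    replace_first(s, "b", "a")
}

/// Runs all rules repeatedly until a pass fires no rule.
/// Returns the number of passes that changed something, or an error when
/// `max_iterations` passes run without reaching a quiet pass.
fn drive_toy_fixed_point(
    state: &mut String,
    rules: &[Rule],
    max_iterations: usize,
) -> Result<usize, String> {
    for pass in 0..max_iterations {
        let mut fired = false;
        for rule in rules {
            fired |= rule(state);
        }
        if !fired {
            return Ok(pass);
        }
    }
    Err(format!("no fixed point within {} iterations", max_iterations))
}

fn main() {
    let mut state = String::from("a");
    let rules: [Rule; 2] = [b_to_c, a_to_b];
    let result = drive_toy_fixed_point(&mut state, &rules, 8);
    println!("{:?} -> {}", result, state);
}
```

The iteration bound matters because a rule set can oscillate: with `a_to_b` and `b_to_a` together, every pass fires and the loop would never settle without it.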
FinalizeAst
A single pass (src/finalize_ast/) for local AST cleanup that does not depend on rewrite metadata. Its current step merges adjacent Prime nodes produced by rewrite rules, so f^{\prime\prime} normalizes through Prime(1), Prime(1) into the same final shape as f''. The phase is enabled by default in every profile and gated by TransformConfig.finalize_ast.enabled.
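As an illustration of the merge step, the sketch below folds runs of adjacent prime nodes in a flat sibling list. It assumes that `Prime` carries a count and that merging sums the counts, so `Prime(1), Prime(1)` becomes `Prime(2)`; the `Node` type and `merge_adjacent_primes_sketch` are hypothetical, and the crate's node representation may differ.

```rust
/// Simplified sibling nodes (hypothetical).
#[derive(Debug, Clone, PartialEq)]
enum Node {
    Char(char),
    Prime(u32),
}

/// Merges each run of adjacent `Prime` nodes into one node whose count is
/// the sum. Returns the new sibling list and the number of merges applied.
fn merge_adjacent_primes_sketch(nodes: Vec<Node>) -> (Vec<Node>, usize) {
    let mut out: Vec<Node> = Vec::with_capacity(nodes.len());
    let mut applied = 0;
    for node in nodes {
        if let Node::Prime(n) = node {
            // Fold into the previous node when it is also a prime.
            if let Some(Node::Prime(total)) = out.last_mut() {
                *total += n;
                applied += 1;
                continue;
            }
        }
        out.push(node);
    }
    (out, applied)
}

fn main() {
    // f followed by two single primes, as a rewrite rule might emit.
    let input = vec![Node::Char('f'), Node::Prime(1), Node::Prime(1)];
    println!("{:?}", merge_adjacent_primes_sketch(input));
}
```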
FlattenGroups
A single recursive traversal (visit → try_unwrap in src/flatten_groups/mod.rs). For each node the visitor:
- Collects subtree-wide flags (`has_declarative`, `has_infix`, `has_delimited`) on the way down.
- Tracks the `in_env_body` context flag through `Slot::EnvBody` edges.
- On the way back up, calls `try_unwrap` to check whether the current group should be flattened. Each preserve guard short-circuits with an early return that increments its hit counter; the first matching guard wins.
- If no guard fires and the group's content mode matches its parent's context mode, the group is unwrapped via either `unwrap_group_child` (multi-child splice) or `redirect_single_child_slot` (single-child slot replacement).
The slot_can_unwrap helper restricts redirect-style unwrapping to single-child groups in Argument, Script*, and Infix* slots; EnvBody slots are never unwrapped.
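The short-circuit guard evaluation and its effect on hit counters can be sketched as below. Only the four semantic guards are modeled, and their evaluation order here is an assumption, since the internal order is not documented in this README; `GroupFacts`, `GuardHits`, and `toy_try_unwrap` are hypothetical names, not the crate's `try_unwrap`.

```rust
/// Facts the traversal has gathered about one group (hypothetical).
#[derive(Debug, Default, Clone, Copy)]
struct GroupFacts {
    has_declarative: bool,
    in_script_base_slot: bool,
    in_env_body: bool,
    has_infix: bool,
}

/// One hit counter per modeled guard.
#[derive(Debug, Default, PartialEq, Eq)]
struct GuardHits {
    declarative: u32,
    script_base: u32,
    env_body: u32,
    infix: u32,
}

/// Returns true when the group may be unwrapped. The first matching guard
/// increments its counter and returns early, so later guards are never
/// counted for the same group.
fn toy_try_unwrap(facts: GroupFacts, hits: &mut GuardHits) -> bool {
    if facts.has_declarative {
        hits.declarative += 1;
        return false;
    }
    if facts.in_script_base_slot {
        hits.script_base += 1;
        return false;
    }
    if facts.in_env_body {
        hits.env_body += 1;
        return false;
    }
    if facts.has_infix {
        hits.infix += 1;
        return false;
    }
    true
}

fn main() {
    let mut hits = GuardHits::default();
    // Two guards would apply, but only the first one is counted.
    let facts = GroupFacts { has_declarative: true, has_infix: true, ..Default::default() };
    println!("{} {:?}", toy_try_unwrap(facts, &mut hits), hits);
}
```

A consequence of this design is that guard counters measure which guard decided each group, not how many guards would have applied to it.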
Errors
TransformBuildError is raised by TransformContext::from_build_config when the rewrite plan cannot be assembled, for example when a required package is missing. All other errors surface during execution.
Tests
Integration tests cover the phases and their interactions:
- `tests/flatten_groups.rs`: preserve-guard toggles, `STRICT` vs `STRUCTURAL_ONLY`, action and per-guard counters.
- `tests/lower_attributes.rs`: declarative consumption, prefix wrapping, inherited-state absorption.
- `tests/finalize_ast.rs`: adjacent-`Prime` merging and the phase gate.
- `tests/rewrite_rule.rs`: single-rule execution and metadata contracts.
- `tests/rewrite_context.rs`: the `RuleContext` AST view exposed to rules.
- `tests/transform_contract.rs`: eliminated-form contract checking across the pipeline.
- `tests/config_model.rs`: profile and config invariants.
Run with:
See also
- High-level overview: repository README.
- Architecture: `ARCHITECTURE.md` (Transform Engine section).
- Rule authoring guide: `src/rewrite/rules/README.md`.