texform-transform 0.1.0

Profile-based AST transform engine for TeXForm (internal; use the texform crate)

Internal implementation crate for texform. Do not depend on this crate directly — its API has no stability guarantees and may change in any release. Use the texform facade crate instead.

A phase-oriented AST rewrite pipeline for TeXForm. It normalizes a parsed Ast into a canonical form so downstream consumers — formula equivalence comparison, MER tokenization, LLM pretraining corpora, polished authoring output — can work against a stable shape without re-implementing LaTeX semantics per use case. This README is the in-depth reference for the transform subsystem: rule authors and contributors should start here.

The crate is a thin wrapper around five ordered phases. Callers choose a build-time Profile / BuildConfig to compile a rewrite plan, then use per-run TransformConfig values to gate phases and set runtime limits.

Quick start

use texform_core::parse::{ParseConfig, ParseContext};
use texform_transform::{BuildConfig, Profile, TransformContext};

let parse_ctx = ParseContext::from_packages(&["base", "ams"]);
let mut ast = parse_ctx
    .parse_to_ast(r"\frac{a}{b}", &ParseConfig::default())
    .expect("source should parse");

// Pick a profile. `Faithful` preserves layout while expanding commands; use
// `Corpus` to drop layout hints, `Equiv` for equivalence comparison, and
// `Authoring` for author-facing output.
let context = TransformContext::from_build_config(
    BuildConfig::profile(Profile::Faithful),
    &parse_ctx,
)
.expect("transform context should build");

let report = context
    .run(&mut ast, &parse_ctx)
    .expect("transform should succeed");

println!("rewrite iterations: {}", report.rewrite.iterations);
println!(
    "flatten removed_empty: {}",
    report.flatten_groups.actions.removed_empty
);

For repeated transforms with the same configuration, build a context once and reuse it:

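use texform_transform::{BuildConfig, Profile, TransformContext};

// Fragment: assumes `parse_ctx` from the quick start, a `batch` of parsed ASTs,
// and an enclosing function that returns a compatible `Result`.
// Build the context once; the rewrite plan is compiled here.
let context = TransformContext::from_build_config(
    BuildConfig::profile(Profile::Faithful),
    &parse_ctx,
)?;
// Reuse the same context for every AST in the batch.
for mut ast in batch {
    let _report = context.run(&mut ast, &parse_ctx)?;
}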

Public API

The crate's public surface is intentionally small:

| Item | Purpose |
| --- | --- |
| BuildConfig::profile(profile) | Select build-time normalization levels and default runtime config. |
| TransformContext::from_build_config(config, parse_ctx) -> Result<Self, TransformBuildError> | Precompile the rewrite plan once for reuse across many ASTs. |
| TransformContext::run(ast, parse_ctx) | Execute the precompiled pipeline with the profile default runtime config. |
| TransformContext::run_with(ast, parse_ctx, config) | Execute the precompiled pipeline with per-run overrides. |
| TransformConfig | Runtime phase gates, FlattenGroups behavior, and max rewrite iterations. |
| TransformReport | Per-phase reports aggregated across the run. |
| TransformError / TransformBuildError | Run-time and build-time error types, respectively. |

The rewrite phase additionally re-exports RewriteRule, NormalizationLevel, NormalizationLevelSet, RuleKey, RuleMeta, RuleFidelity, RuleTarget, RuleTargetKey, RuleTargetKind, Plan (as RewritePlan), and related items for callers that need to introspect rules.

Pipeline

TransformContext::run executes a fixed sequence of phases. Normalization levels are chosen when the context is built; each run may disable the Rewrite, LowerAttributes, or FinalizeAst phases, or choose different FlattenGroups and iteration settings, through TransformConfig.

  1. LowerAttributes (pre) — canonicalize declarative-scope commands (e.g. \bf x) and registered prefix wrappers (e.g. \mathbf{x}) into a single normal form.
  2. Rewrite — apply the precompiled rewrite plan in a fixed-point loop, bounded by max_iterations.
  3. LowerAttributes (post) — re-canonicalize attribute markers introduced by rewrite rules (some Standard / Expand rules emit prefix wrappers that need lowering again).
  4. FinalizeAst — local AST cleanup that does not depend on rewrite metadata, currently merging adjacent Prime nodes produced by rewrite rules.
  5. FlattenGroups — remove redundant explicit and implicit groups after the earlier phases have stabilized.

Phase order is fixed; only the per-phase flags are configurable.
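
As a rough illustration of "fixed order, configurable gates", the sketch below models the five phases with a hypothetical Phase enum and a hypothetical Gates struct. Neither type is part of the crate, and the assumption that the post-Rewrite LowerAttributes pass still runs when Rewrite is disabled is mine, not something this README states.

```rust
/// Hypothetical mirror of the five pipeline phases listed above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Phase {
    LowerAttributesPre,
    Rewrite,
    LowerAttributesPost,
    FinalizeAst,
    FlattenGroups,
}

/// Hypothetical per-run gates; the real `TransformConfig` nests some of these.
#[derive(Debug, Clone, Copy)]
pub struct Gates {
    pub rewrite: bool,
    pub lower_attributes: bool,
    pub finalize_ast: bool,
    pub flatten_groups: bool,
}

/// Returns the phases that would run. Gates only remove phases;
/// they never reorder them.
pub fn planned_phases(gates: Gates) -> Vec<Phase> {
    let all = [
        // Both LowerAttributes passes share one switch (assumed to be
        // independent of the Rewrite gate).
        (Phase::LowerAttributesPre, gates.lower_attributes),
        (Phase::Rewrite, gates.rewrite),
        (Phase::LowerAttributesPost, gates.lower_attributes),
        (Phase::FinalizeAst, gates.finalize_ast),
        (Phase::FlattenGroups, gates.flatten_groups),
    ];
    all.iter().filter(|(_, on)| *on).map(|(p, _)| *p).collect()
}
```

Turning a gate off drops the matching entries from the list but leaves the relative order of the remaining phases untouched.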

Configuration

TransformConfig

pub struct TransformConfig {
    pub rewrite_enabled: bool,
    pub lower_attributes_enabled: bool,
    pub finalize_ast: FinalizeAstConfig,
    pub flatten_groups: FlattenGroupsConfig,
    pub max_iterations: usize,
}

Profiles

Each profile selects build-time normalization levels and supplies a default runtime config.

| Profile | Normalization levels | flatten_groups | Target scenario |
| --- | --- | --- | --- |
| Authoring | Standard | STRICT | Polished author-facing formatting; stylistic choices kept. |
| Faithful | Standard + Expand | STRICT | Preserves the rendered formula while allowing expanded source forms. |
| Corpus | Standard + Expand + Drop | STRUCTURAL_ONLY | MER training data normalization; layout hints dropped. |
| Equiv | Standard + Expand + Drop + Equiv | STRUCTURAL_ONLY | Formula equivalence comparison. |
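
The profile table can be read as a cumulative mapping from profile to accepted levels. The sketch below restates that mapping with local, hypothetical Profile and Level enums (they only mirror the names used here and are not the crate's types); accepts is an illustrative helper, not a crate function.

```rust
/// Local stand-in for the normalization levels named in this README.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Level {
    Standard,
    Expand,
    Drop,
    Equiv,
}

/// Local stand-in for the four profiles.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Profile {
    Authoring,
    Faithful,
    Corpus,
    Equiv,
}

/// Levels selected by each profile, as listed in the table above.
pub fn levels(profile: Profile) -> &'static [Level] {
    match profile {
        Profile::Authoring => &[Level::Standard],
        Profile::Faithful => &[Level::Standard, Level::Expand],
        Profile::Corpus => &[Level::Standard, Level::Expand, Level::Drop],
        Profile::Equiv => &[Level::Standard, Level::Expand, Level::Drop, Level::Equiv],
    }
}

/// Hypothetical helper: would a rule at `level` be part of `profile`'s plan?
pub fn accepts(profile: Profile, level: Level) -> bool {
    levels(profile).contains(&level)
}
```

Under this reading, a Drop-level rule is part of the Corpus and Equiv plans but never of Authoring or Faithful.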

NormalizationLevel

Every rule belongs to exactly one ordered level. A rule's level is determined by the first profile, in the order Authoring, Faithful, Corpus, Equiv, that accepts the rule's output as a suitable product; it is not inferred from render fidelity.

| Level | Intent |
| --- | --- |
| Standard | Uncontroversial author-facing standardization: legacy-syntax modernization, typo fixes, alias canonicalization, cross-package anchor unification. Does not collapse stylistic choices that an author may legitimately make. |
| Expand | Expands convenience commands, semantic macros, package-specific commands, and spacing primitives into more explicit universal structures while preserving the rendered formula. Output remains readable LaTeX and may be visually, rather than pixel, equivalent. |
| Drop | Removes non-ink, metadata, and layout hints a corpus should not learn: linebreak preferences, invisible layout nodes, and similar caller-opt-in deletions. |
| Equiv | Output is only suitable as an equivalence-checking intermediate, not as a corpus label. Fidelity does not decide this level: fenced matrix environment expansion is Full but still Equiv. |

Classify a rule by asking which profile first accepts its output, then declare the rule's fidelity independently. fidelity may rule out profiles whose floor it cannot meet, but a high-fidelity rule is not automatically a lower level.

RuleFidelity

fidelity is the worst-case render-fidelity guarantee over the rule's declared input domain. It is ordered from least to most faithful: Semantic < Approximate < Full.

| Fidelity | Guarantee |
| --- | --- |
| Full | Pixel-identical rendering before and after the rewrite. |
| Approximate | Visually equivalent apart from spacing or placement. |
| Semantic | Mathematical meaning is preserved; rendering may change. |

fidelity is a metadata contract only. texform-transform runs no rendering comparison; how a downstream validator interprets a fidelity level when comparing rendered output is defined by that consumer, not in this crate.

fidelity must not fall below the rule's level floor:

| Level | Default fidelity | Min fidelity |
| --- | --- | --- |
| Standard | Full | Approximate |
| Expand | Full | Approximate |
| Drop | Semantic | Semantic |
| Equiv | Full | Semantic |
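
The floor rule amounts to one comparison on an ordered enum. The sketch below uses local, hypothetical Fidelity and Level enums that mirror the names and ordering given in this section; meets_floor is an illustrative helper and not a function of the crate.

```rust
/// Ordered from least to most faithful, so derived `Ord` gives
/// Semantic < Approximate < Full.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Fidelity {
    Semantic,
    Approximate,
    Full,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Level {
    Standard,
    Expand,
    Drop,
    Equiv,
}

/// Minimum fidelity per level, copied from the table above.
pub fn min_fidelity(level: Level) -> Fidelity {
    match level {
        Level::Standard | Level::Expand => Fidelity::Approximate,
        Level::Drop | Level::Equiv => Fidelity::Semantic,
    }
}

/// Hypothetical check: a rule's declared fidelity must not fall below
/// the floor of its level.
pub fn meets_floor(level: Level, fidelity: Fidelity) -> bool {
    fidelity >= min_fidelity(level)
}
```

For instance, an Expand rule declared as Semantic would fail this check, while a Drop rule declared as Semantic passes.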

Do not add a second metadata field for ordinary behavior. If a rule has an important gap between its worst case and usual samples, document that gap in the rule's top-level comment. For example, displaylines-to-gather-env is level: Drop and fidelity: Semantic: ordinary \displaylines samples usually look closer to Approximate, but manual layout commands such as \hfill or \llap can make MathJax reflow or overlap hand-written equation numbers.

FlattenGroupsConfig

FlattenGroups removes structurally redundant Explicit and Implicit groups. The four core actions are:

| Action | Trigger |
| --- | --- |
| removed_empty | Empty GroupChild ({}) is dropped. |
| replaced_single_child | Single-child GroupChild is replaced by its child. |
| inlined_multi_child | Multi-child GroupChild is spliced into its parent's child sequence. |
| unwrapped_slot | Single-child group occupying an Argument / ScriptSub / ScriptSup / Infix* slot is unwrapped. |
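
To make the first three actions concrete, here is a minimal sketch on a toy tree with every preserve guard switched off. Node, Actions, and flatten are hypothetical stand-ins, not the crate's AST or API; unwrapped_slot is left out because the toy tree has no slots.

```rust
/// Toy tree: atoms and brace groups only.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Node {
    Atom(char),
    Group(Vec<Node>),
}

/// Counters named after three of the four actions in the table above.
#[derive(Debug, Default, PartialEq, Eq)]
pub struct Actions {
    pub removed_empty: usize,
    pub replaced_single_child: usize,
    pub inlined_multi_child: usize,
}

/// Flattens every group in `children`, bottom-up, with no guards applied.
pub fn flatten(children: Vec<Node>, actions: &mut Actions) -> Vec<Node> {
    let mut out = Vec::new();
    for node in children {
        match node {
            Node::Atom(c) => out.push(Node::Atom(c)),
            Node::Group(inner) => {
                // Flatten nested groups first, then classify this group
                // by how many children it has left.
                let inner = flatten(inner, actions);
                match inner.len() {
                    0 => actions.removed_empty += 1,
                    1 => actions.replaced_single_child += 1,
                    _ => actions.inlined_multi_child += 1,
                }
                // Splicing covers all three cases: nothing, one child, many.
                out.extend(inner);
            }
        }
    }
    out
}
```

Feeding it the toy equivalent of `{} {a} {bc}` yields the three atoms a, b, c and bumps each counter once.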

These actions are the default behavior. Eleven configuration flags (ten independent preserve guards plus one sub-flag) gate the actions in specific contexts. Each preserve guard belongs to one of two categories:

  • Semantic guards — disabling them changes script binding, environment cell boundaries, declarative scope, or infix scope. Both parsed semantics and rendered output change.
  • Spacing guards — disabling them only affects atom-spacing and unary/binary classification context. Parsed semantics are unchanged; rendered output may differ by a thin space.

Preserve guards

| # | Field | Category | Example | What the guard preserves |
| --- | --- | --- | --- | --- |
| 1 | preserve_group_containing_declarative_command | Semantic | {\bf x} y | Groups whose subtree contains a declarative command (e.g. \cal, \bf), to avoid leaking declarative scope into following siblings. |
| 2 | preserve_group_in_script_base_slot | Semantic | {ab}^2 | Groups occupying a ScriptBase slot, to avoid changing which atom subscripts or superscripts attach to. |
| 3 | preserve_group_inside_env_body | Semantic | \begin{matrix} {a} & b \end{matrix} | All groups inside an environment body, to preserve cell boundaries and intra-cell spacing. |
| 4 | preserve_group_containing_infix | Semantic | {a \over b} | GroupChilds whose subtree contains an \over-style infix, to preserve the infix scope. |
| 5 | preserve_group_adjacent_to_command_like | Spacing | \cos{A}, {\int} | GroupChilds whose preceding sibling or first child is command-like. |
| 6 | preserve_group_as_argument_of_command | Spacing | \overline{{\sum}} | Risky singleton groups directly used as arguments of commands, preserving one spacing boundary while still flattening redundant nesting. |
| 7 | preserve_empty_group | Spacing | {} | Empty GroupChilds, to preserve spacing / kerning effects. |
| 8 | preserve_group_with_lone_atom_spacing_char | Spacing | {+}, {,}, {*}_N, {·}m | Singleton groups containing only one math atom-spacing character. |
| 9 | preserve_group_starting_with_atom_spacing_char | Spacing | {+x}, {,y} | Multi-child GroupChilds whose first child is a math atom-spacing character. |
| 10 | preserve_group_containing_delimited_pair | Spacing | {\left( a \right)} | GroupChilds whose subtree contains a \left…\right delimited group. |

Atom-spacing characters: = < > + - , : ; . / * ! ? | ·.
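
The shapes matched by guards #8 and #9 can be approximated on raw group bodies. The sketch below is a simplification that works on plain strings, whereas the real guards inspect AST children; all three function names are hypothetical, and only the character list is taken from the line above.

```rust
/// The atom-spacing characters listed above.
pub const ATOM_SPACING_CHARS: &[char] = &[
    '=', '<', '>', '+', '-', ',', ':', ';', '.', '/', '*', '!', '?', '|', '·',
];

pub fn is_atom_spacing_char(c: char) -> bool {
    ATOM_SPACING_CHARS.contains(&c)
}

/// Shape of guard #8: the body is exactly one atom-spacing character, as in `{+}`.
pub fn is_lone_atom_spacing(body: &str) -> bool {
    let mut chars = body.chars();
    matches!((chars.next(), chars.next()), (Some(c), None) if is_atom_spacing_char(c))
}

/// Shape of guard #9: more than one child, the first being an
/// atom-spacing character, as in `{+x}`.
pub fn starts_with_atom_spacing(body: &str) -> bool {
    let mut chars = body.chars();
    matches!((chars.next(), chars.next()), (Some(c), Some(_)) if is_atom_spacing_char(c))
}
```

Treating each character as one child is only valid for bodies without commands or nested groups, which is enough to illustrate the two shapes.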

Sub-flag

| Field | Depends on | Example | Effect |
| --- | --- | --- | --- |
| preserve_group_after_scripted_command_like | preserve_group_adjacent_to_command_like | \sin^2{x} | When classifying "command-like" for the adjacency check, recurse through Scripted bases. Disabled, \sin^2 (a Scripted node) is no longer treated as command-like and the trailing {x} is flattened. |

This sub-flag does not gate any group on its own; it only refines the classification used by guard #5. When preserve_group_adjacent_to_command_like is false, the sub-flag has no effect.

Preset values

The preserve guards are wired to presets via two named constants:

  • FlattenGroupsConfig::STRICT — all guards on. Used by AUTHORING and FAITHFUL.
  • FlattenGroupsConfig::STRUCTURAL_ONLY — only semantic guards on; all spacing guards off. Used by CORPUS and EQUIV.

| Field | Category | AUTHORING / FAITHFUL (STRICT) | CORPUS / EQUIV (STRUCTURAL_ONLY) |
| --- | --- | --- | --- |
| enabled | | true | true |
| preserve_group_containing_declarative_command | Semantic | true | true |
| preserve_group_in_script_base_slot | Semantic | true | true |
| preserve_group_inside_env_body | Semantic | true | true |
| preserve_group_containing_infix | Semantic | true | true |
| preserve_group_adjacent_to_command_like | Spacing | true | false |
| preserve_group_as_argument_of_command | Spacing | true | false |
| preserve_group_after_scripted_command_like | Spacing | true | false |
| preserve_empty_group | Spacing | true | false |
| preserve_group_with_lone_atom_spacing_char | Spacing | true | false |
| preserve_group_starting_with_atom_spacing_char | Spacing | true | false |
| preserve_group_containing_delimited_pair | Spacing | true | false |

Additional constants: ENABLED (alias for STRICT), DISABLED (no flattening at all), DEFAULTS (= STRICT).
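
A caller who wants a preset with one guard flipped could plausibly use struct update syntax. The sketch below defines a local mirror of FlattenGroupsConfig whose field names come from this section; the constant values follow the two preset descriptions above, with the sub-flag assumed to be off in STRUCTURAL_ONLY. The real type's layout and constructors may differ, and strict_without_empty_groups is a hypothetical helper.

```rust
/// Local mirror of the flags described above (not the crate's type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct FlattenGroupsConfig {
    pub enabled: bool,
    // Semantic guards
    pub preserve_group_containing_declarative_command: bool,
    pub preserve_group_in_script_base_slot: bool,
    pub preserve_group_inside_env_body: bool,
    pub preserve_group_containing_infix: bool,
    // Spacing guards and the sub-flag
    pub preserve_group_adjacent_to_command_like: bool,
    pub preserve_group_as_argument_of_command: bool,
    pub preserve_group_after_scripted_command_like: bool,
    pub preserve_empty_group: bool,
    pub preserve_group_with_lone_atom_spacing_char: bool,
    pub preserve_group_starting_with_atom_spacing_char: bool,
    pub preserve_group_containing_delimited_pair: bool,
}

impl FlattenGroupsConfig {
    /// All guards on.
    pub const STRICT: Self = Self {
        enabled: true,
        preserve_group_containing_declarative_command: true,
        preserve_group_in_script_base_slot: true,
        preserve_group_inside_env_body: true,
        preserve_group_containing_infix: true,
        preserve_group_adjacent_to_command_like: true,
        preserve_group_as_argument_of_command: true,
        preserve_group_after_scripted_command_like: true,
        preserve_empty_group: true,
        preserve_group_with_lone_atom_spacing_char: true,
        preserve_group_starting_with_atom_spacing_char: true,
        preserve_group_containing_delimited_pair: true,
    };

    /// Semantic guards on, spacing guards off (sub-flag assumed off).
    pub const STRUCTURAL_ONLY: Self = Self {
        enabled: true,
        preserve_group_containing_declarative_command: true,
        preserve_group_in_script_base_slot: true,
        preserve_group_inside_env_body: true,
        preserve_group_containing_infix: true,
        preserve_group_adjacent_to_command_like: false,
        preserve_group_as_argument_of_command: false,
        preserve_group_after_scripted_command_like: false,
        preserve_empty_group: false,
        preserve_group_with_lone_atom_spacing_char: false,
        preserve_group_starting_with_atom_spacing_char: false,
        preserve_group_containing_delimited_pair: false,
    };
}

/// Hypothetical override: STRICT, but empty groups may be removed.
pub fn strict_without_empty_groups() -> FlattenGroupsConfig {
    FlattenGroupsConfig {
        preserve_empty_group: false,
        ..FlattenGroupsConfig::STRICT
    }
}
```

Starting from a named preset keeps the override readable: only the flag that deviates is spelled out.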

Reports

TransformReport aggregates per-phase reports for observability and diagnostics:

pub struct TransformReport {
    pub lower_attributes: LowerAttributesReport,
    pub rewrite: RewriteReport,
    pub finalize_ast: FinalizeAstReport,
    pub flatten_groups: FlattenGroupsReport,
}
  • LowerAttributesReport: attributes (HashMap<AttributeSet, AttributeStat>) plus eliminated_empty_segments; each attribute stat has consumed, redundant, and emitted counts split into declaratives and prefixes. The report aggregates all LowerAttributes invocations in one transform run, so the default pipeline combines pre-Rewrite and post-Rewrite counts.
  • RewriteReport: iterations (fixed-point iteration count) and rules (Vec<RewriteRuleStat> with key, applied_count, skipped_count per rule that was attempted at least once).
  • FinalizeAstReport: steps with one applied_count counter per cleanup step (currently merge_adjacent_primes).
  • FlattenGroupsReport: actions for the four action counters and guards for one hit counter per preserve guard. Hit counters are short-circuit: when several guards would apply to the same group, only the first one that matches in the internal evaluation order is incremented.

The stable facade DTO used by the Python and WebAssembly bindings flattens the same information into a transport-safe shape:

{
  iterations,
  rules: [{ key, applied_count, skipped_count }],
  finalize_ast: {
    steps: { merge_adjacent_primes: { applied_count } }
  },
  flatten_groups: {
    actions: { removed_empty, replaced_single_child, inlined_multi_child, unwrapped_slot },
    guards: { preserve_group_* }
  },
  lower_attributes: {
    attributes: [{
      attr,
      value,
      consumed: { declaratives, prefixes },
      redundant: { declaratives, prefixes },
      emitted: { declaratives, prefixes }
    }],
    eliminated_empty_segments
  }
}

Phase internals

LowerAttributes

Two sub-modules drive this phase: lower_attributes/codegen.rs and build-time generated data emitted into OUT_DIR. The phase scans the AST for declarative commands (e.g. \bf, \large, \sf) and registered prefix wrappers (e.g. \mathbf{...}, \textbf{...}), then normalizes both forms into a single canonical representation per attribute.

Attributes are modeled as a structured AttributeSet (Attr × AttrValue) covering math font, math size, math style, text family, text series, text shape, and text size. Inherited state is tracked across container boundaries so that nested declarations, prefix wrappers, and empty trailing segments normalize cleanly.

The phase runs twice in the pipeline (pre and post Rewrite) under a single enabled switch because rewrite rules may emit prefix wrappers as their right-hand side; the post-pass re-canonicalizes those into the same normal form as the pre-pass. LowerAttributesReport uses a single cumulative counter set for both invocations.
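
The idea that a declarative command and a prefix wrapper describe the same attribute can be shown on a toy model. The sketch below computes the effective font of each atom; Font, Item, and effective_fonts are hypothetical and far simpler than the crate's AttributeSet machinery, which rewrites the AST rather than producing a flat list.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Font {
    Bold,
    Cal,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Item {
    Atom(char),
    /// Declarative form, e.g. `\bf`: affects following siblings in its group.
    Decl(Font),
    /// Prefix wrapper, e.g. `\mathbf{...}`: affects only its argument.
    Prefix(Font, Vec<Item>),
    /// Brace group: bounds the scope of declaratives inside it.
    Group(Vec<Item>),
}

/// Pairs each atom with the font in effect at that position.
pub fn effective_fonts(items: &[Item], inherited: Option<Font>) -> Vec<(char, Option<Font>)> {
    let mut current = inherited;
    let mut out = Vec::new();
    for item in items {
        match item {
            Item::Atom(c) => out.push((*c, current)),
            // A declarative changes state for the rest of this group only.
            Item::Decl(font) => current = Some(*font),
            Item::Prefix(font, body) => out.extend(effective_fonts(body, Some(*font))),
            // `current` is passed down but changes inside do not leak out.
            Item::Group(body) => out.extend(effective_fonts(body, current)),
        }
    }
    out
}
```

In this model the toy forms of `{\bf x} y` and `\mathbf{x} y` produce the same result, which is the property a single normal form is meant to capture.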

Rewrite

Rules live under src/rewrite/rules/{base, ams, braket, physics}/ and are auto-registered through src/rewrite/rules/generated.rs (maintained by build.rs). Each rule is a unit struct implementing RewriteRule with a static RuleMeta descriptor.

RuleMeta declares:

  • key: RuleKey (package/name) — the stable identifier used in reports and build-time filters.
  • enabled_by_packages — packages whose presence in the ParseContext enables the rule.
  • level — see NormalizationLevel.
  • fidelity — see RuleFidelity.
  • triggers: the RuleTargets the scheduler watches to know when to attempt the rule.
  • consumes — forms the rule eliminates (must not appear in the output) and touches (may read or modify).
  • produces — forms the rule may introduce. The engine verifies every produced form is either accepted by the output contract or eliminated by another rule.

TransformContext::from_build_config builds a Plan by filtering all registered rules through the selected profile levels, build-time disabled rules, and the ParseContext's enabled packages. The plan is then driven by scheduler::drive_fixed_point until either no rule fires in an iteration or max_iterations is exceeded. When Rewrite is enabled, the contract check runs at the end of the pipeline, after post-Rewrite LowerAttributes and FlattenGroups: the engine calls collect_eliminated_violations to verify that no form in any rule's eliminates set remains in the output AST, and reports a violation as RewriteError::ContractViolation.
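
The fixed-point loop can be sketched independently of the AST. In the toy below a rule is a plain function over a String that reports whether it changed anything; Rule, the two sample rules, and this drive_fixed_point are hypothetical stand-ins, and the exact point at which the real scheduler counts an iteration or raises the error is an assumption.

```rust
#[derive(Debug, PartialEq, Eq)]
pub struct MaxIterationsExceeded {
    pub max_iterations: usize,
}

/// A toy rule: returns true when it changed the input.
pub type Rule = fn(&mut String) -> bool;

/// Replaces the first occurrence of `from`, reporting whether one was found.
fn replace_once(s: &mut String, from: &str, to: &str) -> bool {
    match s.find(from) {
        Some(pos) => {
            s.replace_range(pos..pos + from.len(), to);
            true
        }
        None => false,
    }
}

pub fn a_to_b(s: &mut String) -> bool {
    replace_once(s, "A", "B")
}

pub fn b_to_c(s: &mut String) -> bool {
    replace_once(s, "B", "C")
}

/// Sweeps all rules until one sweep fires nothing, or gives up once
/// `max_iterations` sweeps have all made changes.
pub fn drive_fixed_point(
    input: &mut String,
    rules: &[Rule],
    max_iterations: usize,
) -> Result<usize, MaxIterationsExceeded> {
    let mut iterations = 0;
    loop {
        if iterations == max_iterations {
            return Err(MaxIterationsExceeded { max_iterations });
        }
        iterations += 1;
        let mut fired = false;
        for &rule in rules {
            fired |= rule(input);
        }
        if !fired {
            // The final, quiet sweep is counted as an iteration here.
            return Ok(iterations);
        }
    }
}
```

With the rules ordered as [b_to_c, a_to_b] and the input "A A", the loop needs three changing sweeps plus one quiet sweep, so a bound of 3 fails while a bound of 8 succeeds with "C C".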

See src/rewrite/rules/README.md for rule authoring conventions and the macro-based DSL used to define rules.

FinalizeAst

A single pass (src/finalize_ast/) for local AST cleanup that does not depend on rewrite metadata. Its current step merges adjacent Prime nodes produced by rewrite rules, so f^{\prime\prime} normalizes through Prime(1), Prime(1) into the same final shape as f''. The phase is enabled by default in every profile and gated by TransformConfig.finalize_ast.enabled.
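
The merge step can be pictured as a single left-to-right fold over siblings. The sketch below assumes a Prime node carries a count, so that two Prime(1) nodes collapse into Prime(2); Node and this merge_adjacent_primes are hypothetical and only borrow the step's name.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Node {
    Atom(String),
    /// Assumed shape: a prime mark with a repeat count.
    Prime(u32),
}

/// Merges runs of adjacent `Prime` nodes and returns how many merges happened.
pub fn merge_adjacent_primes(nodes: Vec<Node>) -> (Vec<Node>, usize) {
    let mut out: Vec<Node> = Vec::with_capacity(nodes.len());
    let mut applied = 0;
    for node in nodes {
        if let Node::Prime(n) = &node {
            // Fold this prime into the previous node when that is a prime too.
            if let Some(Node::Prime(total)) = out.last_mut() {
                *total += *n;
                applied += 1;
                continue;
            }
        }
        out.push(node);
    }
    (out, applied)
}
```

Primes separated by any other node stay separate, since only direct neighbours are merged.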

FlattenGroups

A single recursive traversal (visit, which calls try_unwrap, in src/flatten_groups/mod.rs). For each node the visitor:

  1. Collects subtree-wide flags (has_declarative, has_infix, has_delimited) on the way down.
  2. Tracks the in_env_body context flag through Slot::EnvBody edges.
  3. On the way back up, calls try_unwrap to check whether the current group should be flattened. Each preserve guard short-circuits with an early return that increments its hit counter; the first matching guard wins.
  4. If no guard fires and the group's content mode matches its parent's context mode, the group is unwrapped via either unwrap_group_child (multi-child splice) or redirect_single_child_slot (single-child slot replacement).

The slot_can_unwrap helper restricts redirect-style unwrapping to single-child groups in Argument, Script*, and Infix* slots; EnvBody slots are never unwrapped.
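
The "first matching guard wins" behaviour can be sketched as a chain of early returns. The types and the should_unwrap function below are hypothetical, cover only three of the guards, and use an evaluation order chosen for illustration; the real internal order is not documented here.

```rust
/// Facts gathered about one group during the traversal (toy subset).
#[derive(Debug, Clone, Copy, Default)]
pub struct GroupFacts {
    pub has_declarative: bool,
    pub has_infix: bool,
    pub in_env_body: bool,
}

/// Toy subset of the preserve guards, with shortened names.
#[derive(Debug, Clone, Copy)]
pub struct Guards {
    pub declarative: bool,
    pub env_body: bool,
    pub infix: bool,
}

/// One hit counter per guard.
#[derive(Debug, Default, PartialEq, Eq)]
pub struct GuardHits {
    pub declarative: usize,
    pub env_body: usize,
    pub infix: usize,
}

/// Returns true when no enabled guard matches. Only the first matching
/// guard has its counter incremented.
pub fn should_unwrap(facts: GroupFacts, guards: Guards, hits: &mut GuardHits) -> bool {
    if guards.declarative && facts.has_declarative {
        hits.declarative += 1;
        return false;
    }
    if guards.env_body && facts.in_env_body {
        hits.env_body += 1;
        return false;
    }
    if guards.infix && facts.has_infix {
        hits.infix += 1;
        return false;
    }
    true
}
```

A consequence worth keeping in mind when reading reports: a low hit count for a late guard does not mean it rarely applies, only that earlier guards matched first.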

Errors

pub enum TransformError {
    Build(TransformBuildError),
    Rewrite(RewriteError),
}

pub enum TransformBuildError {
    Rewrite(PlanBuildError),
}

pub enum RewriteError {
    Rule { rule: RuleKey, kind: RuleError },
    ContractViolation { target: RuleTargetKey, node_name: Option<String> },
    MaxIterationsExceeded { max_iterations: usize },
}

pub enum RuleError {
    InvalidNodeShape { message: String },
    MissingMetadata { name: String },
}

TransformBuildError is raised by TransformContext::from_build_config when the rewrite plan cannot be assembled, for example when a required package is missing. All other errors surface during execution.
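
A caller that wants human-readable diagnostics might match on the variants along these lines. The enums below are local mirrors in which RuleKey and RuleTargetKey are replaced by String for brevity, and describe is a hypothetical helper; whether the real types implement Display is not stated in this README.

```rust
/// Local mirror of `RuleError`.
#[derive(Debug)]
pub enum RuleError {
    InvalidNodeShape { message: String },
    MissingMetadata { name: String },
}

/// Local mirror of `RewriteError`, with key types simplified to `String`.
#[derive(Debug)]
pub enum RewriteError {
    Rule { rule: String, kind: RuleError },
    ContractViolation { target: String, node_name: Option<String> },
    MaxIterationsExceeded { max_iterations: usize },
}

/// Turns an error into a one-line message for logs.
pub fn describe(err: &RewriteError) -> String {
    match err {
        RewriteError::Rule { rule, kind } => match kind {
            RuleError::InvalidNodeShape { message } => {
                format!("rule {} saw an invalid node shape: {}", rule, message)
            }
            RuleError::MissingMetadata { name } => {
                format!("rule {} is missing metadata: {}", rule, name)
            }
        },
        RewriteError::ContractViolation { target, node_name } => match node_name {
            Some(name) => format!("eliminated form {} still present in node {}", target, name),
            None => format!("eliminated form {} still present", target),
        },
        RewriteError::MaxIterationsExceeded { max_iterations } => {
            format!("rewrite did not converge within {} iterations", max_iterations)
        }
    }
}
```

Of the three variants, MaxIterationsExceeded is the one a caller can plausibly react to at run time, for example by retrying through run_with with a larger max_iterations.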

Tests

Integration tests cover the phases and their interactions:

  • tests/flatten_groups.rs — preserve-guard toggles, STRICT vs STRUCTURAL_ONLY, action and per-guard counters.
  • tests/lower_attributes.rs — declarative consumption, prefix wrapping, inherited-state absorption.
  • tests/finalize_ast.rs — adjacent-Prime merging and the phase gate.
  • tests/rewrite_rule.rs — single-rule execution and metadata contracts.
  • tests/rewrite_context.rs — the RuleContext AST view exposed to rules.
  • tests/transform_contract.rs — eliminated-form contract checking across the pipeline.
  • tests/config_model.rs — profile and config invariants.

Run with:

cargo test -p texform-transform

See also