Module bom

Canonical UTF-8 BOM handling — single source of truth.

§Why this module exists

Earlier iterations spread BOM-aware logic across three layers:

  1. A bom_filter callback on the lexer that decided which BOMs were “leading” (skip) vs. “mid-file” (emit as an error token).
  2. An indent-walker arm in tokenize() that tried to keep mid-file BOMs layout-transparent by adjusting last_newline_end.
  3. A source.starts_with('\u{FEFF}') check in format_source to decide whether to re-prepend a BOM on output.

Each of those three encoded the same concept (“the source had a leading BOM”) with a different predicate. Every round-trip-fidelity bug we hit traced back to two of them disagreeing. The most recent was a pair of regressions in which the lexer accepted a leading-whitespace-then-BOM input but the formatter dropped the BOM on output.

§The architecture

A UTF-8 BOM is a serialization concern, not a parsing concern. It carries no semantic meaning to beancount. So:

  • strip_leading runs ONCE at the parser’s public entry. The lexer, parser, indent walker, and every other internal layer operate on a source that is BOM-free by construction.
  • The parser records whether a BOM was stripped in ParseResult::has_leading_bom. That flag is the only source of truth downstream. No layer inspects the BOM byte directly.
  • restore_leading runs ONCE at the formatter’s public exit, gated on the flag, restoring byte-stable round-trip identity.
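The strip-once, restore-once flow described above can be sketched as follows. This is a minimal illustration and not the module's actual source: the exact signatures of strip_leading and restore_leading are assumptions inferred from their one-line summaries, the constants are redefined locally, and round_trip with its identity "formatter" is a hypothetical stand-in for the real parse-and-format pipeline.

```rust
// Assumed definitions; the real module exports its own BOM_CHAR and BOM_LEN.
const BOM_CHAR: char = '\u{FEFF}';
const BOM_LEN: usize = 3; // U+FEFF encodes to EF BB BF in UTF-8

/// Strip a BOM only when it sits at byte 0.
/// Returns the remaining source and whether a BOM was removed.
fn strip_leading(source: &str) -> (&str, bool) {
    match source.strip_prefix(BOM_CHAR) {
        Some(rest) => (rest, true),
        None => (source, false),
    }
}

/// Re-prepend a BOM when `had_bom` is set.
/// If `formatted` already starts with a BOM, it is returned unchanged.
fn restore_leading(formatted: &str, had_bom: bool) -> String {
    if had_bom && !formatted.starts_with(BOM_CHAR) {
        let mut out = String::with_capacity(BOM_LEN + formatted.len());
        out.push(BOM_CHAR);
        out.push_str(formatted);
        out
    } else {
        formatted.to_string()
    }
}

/// Hypothetical pipeline: strip at entry, work on BOM-free text, restore at exit.
fn round_trip(source: &str) -> String {
    let (stripped, had_bom) = strip_leading(source);
    // Stand-in for the real parser and formatter, which never see the BOM.
    let formatted = stripped.to_string();
    restore_leading(&formatted, had_bom)
}
```

In this sketch a BOM preceded by whitespace is not treated as leading, so strip_leading returns such input untouched with the flag set to false, and only the flag (never a second inspection of the input) decides whether the output gets a BOM.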

§Mid-file BOMs

Because the leading BOM is stripped before lexing, any U+FEFF code point the lexer encounters is by construction mid-file and unrecognized. Logos produces a Token::Error for it, and the parser’s existing error classifier (error_text.contains('\u{FEFF}')) surfaces the dedicated diagnostic.
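A classifier of this shape might look like the sketch below. Only the error_text.contains('\u{FEFF}') predicate comes from the description above; the Diagnostic enum, its variant names, and the classify_error function are hypothetical names introduced for illustration, and the lexer itself is left out.

```rust
/// Hypothetical diagnostic kinds; the real parser's error type may differ.
#[derive(Debug, PartialEq, Eq)]
enum Diagnostic {
    /// A U+FEFF that survived strip_leading, so it cannot be at byte 0.
    MidFileBom { offset: usize },
    /// Any other text the lexer could not tokenize.
    UnrecognizedToken { offset: usize },
}

/// Classify the text covered by a lexer error token.
/// `offset` is the byte position of the error token in the stripped source.
fn classify_error(error_text: &str, offset: usize) -> Diagnostic {
    if error_text.contains('\u{FEFF}') {
        Diagnostic::MidFileBom { offset }
    } else {
        Diagnostic::UnrecognizedToken { offset }
    }
}
```

The point of the design is that no "is this BOM leading?" check is needed here: by the time the classifier runs, a leading BOM no longer exists in the input.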

§Span coordinates

The parser preserves the original-source coordinate frame for all spans it returns: if a directive starts at byte 3 of the original source (because the file began with a 3-byte BOM), its span starts at 3. When a BOM was stripped, the parser shifts every span up by BOM_LEN after running the inner parser on the stripped source; otherwise spans are returned as-is. Callers (LSP, FFI, doctor) see coordinates that index into the source they passed in, with no need to be BOM-aware themselves.
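The span adjustment can be illustrated with a small sketch. The Span struct and the shift_spans helper are hypothetical and exist only for this example; the real parser's span type and the place where it applies the shift are not shown in this page.

```rust
// Assumed to match the module's BOM_LEN constant.
const BOM_LEN: usize = 3;

/// Hypothetical byte-offset span; `end` is exclusive.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Span {
    start: usize,
    end: usize,
}

/// Move spans computed on the stripped source back into the
/// coordinate frame of the original source.
fn shift_spans(spans: &mut [Span], had_bom: bool) {
    if !had_bom {
        return; // stripped and original sources are identical
    }
    for span in spans.iter_mut() {
        span.start += BOM_LEN;
        span.end += BOM_LEN;
    }
}
```

For example, a date token found at bytes 0..10 of the stripped source is reported at 3..13, which is where it sits in the BOM-prefixed text the caller actually passed in.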

Constants§

BOM
The UTF-8 byte-order mark (EF BB BF).
BOM_CHAR
The same BOM as a char.
BOM_LEN
Byte length of BOM in UTF-8 (always 3).

Functions§

restore_leading
Re-prepend a leading BOM if had_bom. Idempotent: a call where formatted already starts with a BOM returns the input unchanged.
strip_leading
Strip a leading BOM only if it sits at exactly byte 0 of the source, returning (stripped, had_bom).