Module bom

Canonical UTF-8 BOM handling — single source of truth.

§Why this module exists

Earlier iterations spread BOM-aware logic across three layers:

A bom_filter callback on the lexer that decided which BOMs were “leading” (skip) vs. “mid-file” (emit as an error token).
An indent-walker arm in tokenize() that tried to keep mid-file BOMs layout-transparent by adjusting last_newline_end.
A source.starts_with('\u{FEFF}') check in format_source to decide whether to re-prepend a BOM on output.

Each of the three encoded the same concept (“the source had a leading BOM”) with a different predicate. Every round-trip-fidelity bug we hit traced back to two of them disagreeing. The most recent case was a pair of regressions where the lexer accepted a leading-whitespace-then-BOM input but the formatter dropped the BOM on output.

§The architecture

A UTF-8 BOM is a serialization concern, not a parsing concern. It carries no semantic meaning to beancount. So:

strip_leading runs ONCE at the parser’s public entry. The lexer, parser, indent walker, and every other internal layer operate on a source that is BOM-free by construction.
The parser records whether a BOM was stripped in ParseResult::has_leading_bom. That flag is the only source of truth downstream. No layer inspects the BOM byte directly.
restore_leading runs ONCE at the formatter’s public exit, gated on the flag, restoring byte-stable round-trip identity.
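
The flow above can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the crate's implementation: the pared-down ParseResult, inner_parse, and format_result are hypothetical stand-ins, and the strip and restore steps are written inline rather than calling the real strip_leading and restore_leading.

```rust
/// U+FEFF as a `char`; stand-in for the module's `BOM_CHAR`.
const BOM_CHAR: char = '\u{FEFF}';

/// Hypothetical, pared-down parse result. The real `ParseResult` carries more.
struct ParseResult {
    /// The only downstream source of truth about the leading BOM.
    has_leading_bom: bool,
    /// Stand-in for the parsed tree: the lines the inner layers saw.
    lines: Vec<String>,
}

/// Public parser entry: the only place where the BOM is stripped.
fn parse(source: &str) -> ParseResult {
    // Inline stand-in for `strip_leading`.
    let (stripped, has_leading_bom) = match source.strip_prefix(BOM_CHAR) {
        Some(rest) => (rest, true),
        None => (source, false),
    };
    ParseResult {
        has_leading_bom,
        lines: inner_parse(stripped),
    }
}

/// Hypothetical inner layer: BOM-free by construction, so it never checks for U+FEFF.
fn inner_parse(stripped: &str) -> Vec<String> {
    stripped.lines().map(str::to_owned).collect()
}

/// Public formatter exit: the only place where the BOM is restored, gated on the flag.
fn format_result(result: &ParseResult) -> String {
    let mut out = String::new();
    if result.has_leading_bom {
        out.push(BOM_CHAR);
    }
    for line in &result.lines {
        out.push_str(line);
        out.push('\n');
    }
    out
}
```

In this sketch, a source that begins with a BOM and ends with a newline comes back from format_result(&parse(source)) byte for byte, and no function between the entry and the exit mentions U+FEFF.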

§Mid-file BOMs

Because the leading BOM is stripped before lexing, any U+FEFF character the lexer encounters (three bytes in UTF-8) is by construction mid-file and unrecognized. Logos produces a Token::Error for it, and the parser’s existing error classifier (error_text.contains('\u{FEFF}')) surfaces the dedicated diagnostic.
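
A minimal sketch of that classification step, assuming the text covered by the error token is available as a &str. The Diagnostic enum and the classify_error function are hypothetical names used for illustration; only the contains('\u{FEFF}') check comes from the description above.

```rust
/// Hypothetical diagnostic kinds; the real parser's diagnostic types are not shown here.
#[derive(Debug, PartialEq)]
enum Diagnostic {
    /// A U+FEFF that survived stripping, so it cannot be at byte 0.
    MidFileBom,
    /// Any other input the lexer could not recognize.
    UnrecognizedInput,
}

/// Classify the source text covered by a lexer error token.
fn classify_error(error_text: &str) -> Diagnostic {
    if error_text.contains('\u{FEFF}') {
        Diagnostic::MidFileBom
    } else {
        Diagnostic::UnrecognizedInput
    }
}
```

The check needs no position information: once the leading BOM is gone, the presence of U+FEFF alone is enough to identify the mid-file case.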

§Span coordinates

The parser preserves the original-source coordinate frame for all spans it returns: if a directive starts at byte 3 of the original source (because the file began with a 3-byte BOM), its span starts at 3. To achieve this, the parser runs the inner parser on the stripped source and then, if a BOM was stripped, shifts every span up by BOM_LEN. Callers (LSP, FFI, doctor) see coordinates that index into the source they passed in, with no need to be BOM-aware themselves.
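
The shift can be sketched as below, assuming spans are plain byte ranges. The shift_span helper and the use of std::ops::Range are assumptions for illustration; the real span type may differ.

```rust
use std::ops::Range;

/// Byte length of the UTF-8 BOM; stand-in for the module's `BOM_LEN`.
const BOM_LEN: usize = 3;

/// Hypothetical helper: map a span measured on the stripped source back into
/// the coordinate frame of the original source.
fn shift_span(span: Range<usize>, had_bom: bool) -> Range<usize> {
    let offset = if had_bom { BOM_LEN } else { 0 };
    (span.start + offset)..(span.end + offset)
}
```

For a source of "\u{FEFF}2024-01-01 open Assets:Cash", the inner parser sees the date at 0..10 in the stripped text. After shifting, the span is 3..13, which slices "2024-01-01" out of the original string the caller passed in.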

Constants§

BOM: The UTF-8 byte-order mark (EF BB BF).
BOM_CHAR: The same BOM as a char (U+FEFF).
BOM_LEN: Byte length of BOM in UTF-8 (always 3).
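
How the three constants relate can be sketched as follows. The types are assumptions (in particular, BOM is taken to be a &str here; it could equally be a byte slice in the real module), so treat this as an illustration rather than the actual definitions.

```rust
/// Assumed to be a `&str`; the real constant's type is not shown above.
const BOM: &str = "\u{FEFF}";

/// The same code point as a `char`.
const BOM_CHAR: char = '\u{FEFF}';

/// Derived from `BOM` so the two cannot drift apart; evaluates to 3.
const BOM_LEN: usize = BOM.len();
```

U+FEFF encodes to the bytes EF BB BF in UTF-8, which is why BOM_LEN is always 3 even though BOM_CHAR is a single char.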

Functions§

restore_leading: Re-prepend a leading BOM if had_bom. Idempotent: a call where formatted already starts with a BOM returns the input unchanged.
strip_leading: Strip a leading BOM only if it sits strictly at byte 0 of the source, returning (stripped, had_bom).
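
A minimal sketch of the two contracts described above. The signatures are assumptions inferred from the summaries (a borrowed &str in, an owned String out); the real functions may differ.

```rust
/// U+FEFF as a `char`; stand-in for the module's `BOM_CHAR`.
const BOM_CHAR: char = '\u{FEFF}';

/// Strip a BOM only when it sits at byte 0. A BOM that follows whitespace or
/// any other content is left in place and reported as "no leading BOM".
fn strip_leading(source: &str) -> (&str, bool) {
    match source.strip_prefix(BOM_CHAR) {
        Some(rest) => (rest, true),
        None => (source, false),
    }
}

/// Re-prepend a BOM when `had_bom` is set. Idempotent: if `formatted` already
/// starts with a BOM, it is returned unchanged.
fn restore_leading(formatted: String, had_bom: bool) -> String {
    if !had_bom || formatted.starts_with(BOM_CHAR) {
        return formatted;
    }
    let mut out = String::with_capacity(BOM_CHAR.len_utf8() + formatted.len());
    out.push(BOM_CHAR);
    out.push_str(&formatted);
    out
}
```

Under these assumptions, strip_leading(" \u{FEFF}x") returns the input untouched with had_bom set to false, so the leading-whitespace-then-BOM case falls through to the mid-file diagnostic instead of being treated as a leading BOM.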
