Canonical UTF-8 BOM handling — single source of truth.
§Why this module exists
Earlier iterations spread BOM-aware logic across three layers:
- A bom_filter callback on the lexer that decided which BOMs were “leading” (skip) vs. “mid-file” (emit as an error token).
- An indent-walker arm in tokenize() that tried to keep mid-file BOMs layout-transparent by adjusting last_newline_end.
- A source.starts_with('\u{FEFF}') check in format_source to decide whether to re-prepend a BOM on output.
Each of those three encoded the same concept (“the source had a leading BOM”) with a different predicate. Every round-trip-fidelity bug we hit traced back to two of them disagreeing, most recently a pair of regressions where the lexer accepted a leading-whitespace-then-BOM input but the formatter dropped the BOM on output.
§The architecture
A UTF-8 BOM is a serialization concern, not a parsing concern. It carries no semantic meaning to beancount. So:
- strip_leading runs ONCE at the parser’s public entry. The lexer, parser, indent walker, and every other internal layer operate on a source that is BOM-free by construction.
- The parser records whether a BOM was stripped in ParseResult::has_leading_bom. That flag is the only source of truth downstream. No layer inspects the BOM byte directly.
- restore_leading runs ONCE at the formatter’s public exit, gated on the flag, restoring byte-stable round-trip identity.
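The strip-once, restore-once flow can be sketched roughly as follows. This is a minimal illustration, not the crate's implementation: the signatures of strip_leading and restore_leading are assumed from their one-line descriptions, ParseResult is reduced to a stand-in with a hypothetical body field, and parse_entry and format_exit are hypothetical names for the public parser entry and formatter exit.

```rust
/// The BOM as a `char`; U+FEFF encodes to EF BB BF in UTF-8.
const BOM_CHAR: char = '\u{FEFF}';

/// Stand-in for the parser's result type. Only `has_leading_bom` is named in
/// the docs; `body` is a hypothetical placeholder for the real parse output.
struct ParseResult {
    has_leading_bom: bool,
    body: String,
}

/// Assumed signature: strip a BOM only if it sits at byte 0.
fn strip_leading(source: &str) -> (&str, bool) {
    match source.strip_prefix(BOM_CHAR) {
        Some(rest) => (rest, true),
        None => (source, false),
    }
}

/// Assumed signature: re-prepend a BOM when flagged. If the text already
/// starts with a BOM, it is returned unchanged (idempotent).
fn restore_leading(formatted: String, had_bom: bool) -> String {
    if had_bom && !formatted.starts_with(BOM_CHAR) {
        let mut out = String::with_capacity(formatted.len() + BOM_CHAR.len_utf8());
        out.push(BOM_CHAR);
        out.push_str(&formatted);
        out
    } else {
        formatted
    }
}

/// Hypothetical public parser entry: the only place the BOM is stripped.
fn parse_entry(source: &str) -> ParseResult {
    let (stripped, had_bom) = strip_leading(source);
    // Every internal layer would see `stripped`, which has no leading BOM.
    ParseResult {
        has_leading_bom: had_bom,
        body: stripped.to_owned(),
    }
}

/// Hypothetical public formatter exit: the only place the BOM is restored.
fn format_exit(result: &ParseResult) -> String {
    // A real formatter would pretty-print here; this sketch passes text through.
    let formatted = result.body.clone();
    restore_leading(formatted, result.has_leading_bom)
}
```

Under these assumptions, feeding format_exit the result of parse_entry reproduces the input byte for byte whether or not it began with a BOM, and nothing between the two calls needs to look for U+FEFF.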
§Mid-file BOMs
Because the leading BOM is stripped before lexing, any U+FEFF code point
the lexer encounters is by construction mid-file and unrecognized.
Logos produces a Token::Error for it, and the parser’s existing
error classifier (error_text.contains('\u{FEFF}')) surfaces the
dedicated diagnostic.
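To illustrate why any remaining U+FEFF must be mid-file, here is a small sketch that does not use Logos at all. The strip_leading signature is assumed, and both mid_file_bom_offsets and is_mid_file_bom_error are hypothetical helpers; the latter only mirrors the contains check quoted above.

```rust
const BOM_CHAR: char = '\u{FEFF}';

/// Assumed signature: strip a BOM only if it sits at byte 0.
fn strip_leading(source: &str) -> (&str, bool) {
    match source.strip_prefix(BOM_CHAR) {
        Some(rest) => (rest, true),
        None => (source, false),
    }
}

/// Hypothetical helper: byte offsets (within the stripped source) of every
/// U+FEFF that survived stripping. Each of these is mid-file by construction.
fn mid_file_bom_offsets(stripped: &str) -> Vec<usize> {
    stripped.match_indices(BOM_CHAR).map(|(i, _)| i).collect()
}

/// Hypothetical classifier mirroring `error_text.contains('\u{FEFF}')`.
fn is_mid_file_bom_error(error_text: &str) -> bool {
    error_text.contains(BOM_CHAR)
}
```

A BOM preceded by whitespace is not at byte 0, so in this sketch it is left in place by strip_leading and then treated like any other mid-file BOM.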
§Span coordinates
The parser preserves the original-source coordinate frame for all
spans it returns: if a directive starts at byte 3 of the original
source (because the file began with a 3-byte BOM), its span starts
at 3. The parser shifts every span up by BOM_LEN after running the
inner parser on the stripped source. Callers (LSP, FFI, doctor) see
coordinates that index into the source they passed in, with no need
to be BOM-aware themselves.
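The span shift might look roughly like this. The Span type and the shift_spans helper are hypothetical; only BOM_LEN and the shift-by-BOM_LEN rule come from the description above.

```rust
/// Byte length of the UTF-8 BOM.
const BOM_LEN: usize = 3;

/// Hypothetical span type: a half-open byte range into the source.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Span {
    start: usize,
    end: usize,
}

/// Hypothetical helper: move spans computed against the stripped source back
/// into the coordinate frame of the original source.
fn shift_spans(spans: &mut [Span], had_bom: bool) {
    if !had_bom {
        // Nothing was stripped, so the two coordinate frames already agree.
        return;
    }
    for span in spans.iter_mut() {
        span.start += BOM_LEN;
        span.end += BOM_LEN;
    }
}
```

For example, a span of 0..10 on the stripped source becomes 3..13 after shifting, so slicing the original BOM-prefixed source with the shifted span yields the same text the inner parser saw.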
Constants§
- BOM: The UTF-8 byte-order mark (EF BB BF).
- BOM_CHAR: The same BOM as a char.
- BOM_LEN: Byte length of BOM in UTF-8 (always 3).
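The relationship between the three constants can be checked with a short sketch. The constant types here are assumptions (in particular, BOM is modeled as a byte slice, which the listing does not confirm), and encode_bom_char is a hypothetical helper.

```rust
/// Assumed type: the BOM as raw UTF-8 bytes.
const BOM: &[u8] = &[0xEF, 0xBB, 0xBF];

/// The same BOM as a `char`.
const BOM_CHAR: char = '\u{FEFF}';

/// Derived from the char so the two cannot drift apart.
const BOM_LEN: usize = BOM_CHAR.len_utf8();

/// Hypothetical helper: encode BOM_CHAR to its UTF-8 bytes.
fn encode_bom_char() -> Vec<u8> {
    let mut buf = [0u8; 4];
    BOM_CHAR.encode_utf8(&mut buf).as_bytes().to_vec()
}
```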
Functions§
- restore_leading: Re-prepend a leading BOM if had_bom. Idempotent: a call where formatted already starts with a BOM returns the input unchanged.
- strip_leading: Strip a strict-byte-0 leading BOM, returning (stripped, had_bom).