Module normalize

§Input Normalization

This module normalizes markdown content before it is parsed, so that invisible control characters and other artifacts that can interfere with markdown parsing are handled consistently.

§Overview

Input text may contain invisible Unicode characters (especially from copy-paste) that interfere with markdown parsing. This module provides functions to:

Strip Unicode bidirectional formatting characters that break delimiter recognition
Orchestrate guillemet preprocessing (<<text>> → «text»)
Apply all normalizations in the correct order
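
The following is a minimal sketch of why the order of the steps matters, not the module's actual implementation. The names `strip_bidi_sketch`, `guillemets_sketch`, and `normalize_sketch` are hypothetical, the set of stripped code points is an assumption, and the guillemet rule (both delimiters on the same line) is a simplification chosen for illustration. It shows that a bidi control sitting inside `<<` would hide the delimiter from the guillemet step unless the stripping runs first.

```rust
/// Hypothetical stand-in for the bidi-stripping step. Assumed set:
/// U+202A..=U+202E (LRE, RLE, PDF, LRO, RLO) and U+2066..=U+2069 (isolates).
fn strip_bidi_sketch(input: &str) -> String {
    input
        .chars()
        .filter(|&c| !matches!(c, '\u{202A}'..='\u{202E}' | '\u{2066}'..='\u{2069}'))
        .collect()
}

/// Hypothetical stand-in for guillemet preprocessing: rewrites `<<text>>`
/// to `«text»` when the closing `>>` appears on the same line (assumed rule).
fn guillemets_sketch(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    let mut rest = input;
    while let Some(start) = rest.find("<<") {
        let after_open = &rest[start + 2..];
        match after_open.find(">>") {
            Some(end) if !after_open[..end].contains('\n') => {
                // Copy everything before `<<`, then the converted span.
                out.push_str(&rest[..start]);
                out.push('«');
                out.push_str(&after_open[..end]);
                out.push('»');
                rest = &after_open[end + 2..];
            }
            _ => {
                // No closing `>>` on this line: keep the `<<` literally.
                out.push_str(&rest[..start + 2]);
                rest = after_open;
            }
        }
    }
    out.push_str(rest);
    out
}

/// Applies the two steps in order: strip bidi controls, then convert guillemets.
fn normalize_sketch(input: &str) -> String {
    guillemets_sketch(&strip_bidi_sketch(input))
}
```

With this sketch, `normalize_sketch("<\u{202D}<name>> and **b**")` yields `«name» and **b**`. Running `guillemets_sketch` alone on the same input leaves it unchanged, because the U+202D between the two `<` characters prevents the `<<` match.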

§Functions

strip_bidi_formatting - Remove Unicode bidi control characters
normalize_markdown - Apply all markdown-specific normalizations
normalize_fields - Normalize document fields (bidi + guillemets)

§Why Normalize?

Unicode bidirectional formatting characters (LRO, RLO, LRE, RLE, etc.) are invisible control characters used for bidirectional text layout. When placed adjacent to markdown delimiters like **, they can prevent parsers from recognizing the delimiters:

**bold** or <U+202D>**(1234**
            ^^^^^^^^ invisible LRO here prevents second ** from being recognized as bold

These characters commonly appear when copying text from:

Web pages with mixed LTR/RTL content
PDF documents
Word processors
Some clipboard managers
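
Because these characters are invisible, it can help to locate them in pasted text before deciding how to handle them. The sketch below is an illustration only: `is_bidi_control` and `find_bidi_controls` are hypothetical helpers, not part of this module, and the set of code points checked is an assumption about which bidi controls are relevant.

```rust
/// Returns true for the explicit bidi formatting controls (assumed set):
/// U+202A..=U+202E (LRE, RLE, PDF, LRO, RLO) and
/// U+2066..=U+2069 (LRI, RLI, FSI, PDI).
fn is_bidi_control(c: char) -> bool {
    matches!(c, '\u{202A}'..='\u{202E}' | '\u{2066}'..='\u{2069}')
}

/// Hypothetical helper: lists the byte offset and code point of every
/// bidi control found in `input`, in order of appearance.
fn find_bidi_controls(input: &str) -> Vec<(usize, char)> {
    input
        .char_indices()
        .filter(|&(_, c)| is_bidi_control(c))
        .collect()
}
```

For the input `"**bold** or \u{202D}**(1234**"`, `find_bidi_controls` returns a single entry, `(12, '\u{202D}')`: the LRO sits at byte offset 12, immediately before the second `**`.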

§Examples

use quillmark_core::normalize::strip_bidi_formatting;

// Input with invisible U+202D (LRO) before second **
let input = "**asdf** or \u{202D}**(1234**";
let cleaned = strip_bidi_formatting(input);
assert_eq!(cleaned, "**asdf** or **(1234**");

Enums§

NormalizationError: Errors that can occur during normalization

Functions§

normalize_fields: Normalizes document fields by applying all preprocessing steps.
normalize_markdown: Normalizes markdown content by applying all preprocessing steps.
strip_bidi_formatting: Strips Unicode bidirectional formatting characters that can interfere with markdown parsing.