Module normalize


§Input Normalization

This module provides input normalization for markdown content before parsing. Normalization ensures that invisible control characters and other artifacts that can interfere with markdown parsing are handled consistently.

§Overview

Input text may contain invisible Unicode characters (especially from copy-paste) that interfere with markdown parsing. This module provides functions to:

Strip Unicode bidirectional formatting characters that break delimiter recognition
Fix HTML comment fences to preserve trailing text
Apply all normalizations in the correct order

Double chevrons (<< and >>) are passed through unchanged.

§Functions

strip_bidi_formatting - Remove Unicode bidi control characters
normalize_markdown - Apply all markdown-specific normalizations
normalize_fields - Normalize document fields (bidi stripping)

§Why Normalize?

Unicode bidirectional formatting characters (LRO, RLO, LRE, RLE, etc.) are invisible control characters used for bidirectional text layout. When placed adjacent to markdown delimiters like **, they can prevent parsers from recognizing the delimiters:

**bold** or <U+202D>**(1234**
            ^^^^^^^^ invisible LRO here prevents second ** from being recognized as bold

These characters commonly appear when copying text from:

Web pages with mixed LTR/RTL content
PDF documents
Word processors
Some clipboard managers
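To make the idea concrete, here is a minimal sketch of how such stripping could work. It is not the crate's implementation: the helper names `is_bidi_control` and `strip_bidi_sketch` are hypothetical, and the exact set of code points removed (the explicit embeddings, overrides, isolates, and the LRM/RLM marks) is an assumption, since this page only names LRO, RLO, LRE, and RLE explicitly.

```rust
/// Hypothetical helper (not part of quillmark_core): returns true for the
/// Unicode bidi formatting characters this sketch assumes should be removed.
fn is_bidi_control(c: char) -> bool {
    matches!(
        c,
        '\u{200E}' | '\u{200F}'       // LRM, RLM (implicit marks; assumed in scope)
        | '\u{202A}'..='\u{202E}'     // LRE, RLE, PDF, LRO, RLO
        | '\u{2066}'..='\u{2069}'     // LRI, RLI, FSI, PDI
    )
}

/// Hypothetical sketch: drop every bidi formatting character and keep
/// all other characters in their original order.
fn strip_bidi_sketch(input: &str) -> String {
    input.chars().filter(|c| !is_bidi_control(*c)).collect()
}
```

With this sketch, `strip_bidi_sketch("**bold** or \u{202D}**(1234**")` yields `"**bold** or **(1234**"`, so the second pair of delimiters is no longer preceded by an invisible character. Filtering character by character keeps the function allocation-light and leaves all visible text, including RTL letters themselves, untouched.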

§Examples

use quillmark_core::normalize::strip_bidi_formatting;

// Input with invisible U+202D (LRO) before second **
let input = "**asdf** or \u{202D}**(1234**";
let cleaned = strip_bidi_formatting(input);
assert_eq!(cleaned, "**asdf** or **(1234**");
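The overview also mentions fixing HTML comment fences so that trailing text is preserved. The sketch below illustrates one plausible reading of that step, under the assumption that the problem is text sharing a line with a closing `-->` (which a CommonMark parser treats as part of the HTML block). The function name `split_after_comment_close` is hypothetical, and the real `fix_html_comment_fences` may behave differently.

```rust
/// Hypothetical sketch (not the crate's implementation): when visible text
/// follows a `-->` on the same line, move that text onto its own line so a
/// markdown parser no longer treats it as part of the HTML block.
fn split_after_comment_close(input: &str) -> String {
    const CLOSE: &str = "-->";
    let mut out = String::with_capacity(input.len() + 8);
    let mut rest = input;
    while let Some(pos) = rest.find(CLOSE) {
        let end = pos + CLOSE.len();
        // Copy everything up to and including the closing fence.
        out.push_str(&rest[..end]);
        rest = &rest[end..];
        // Insert a line break only if non-whitespace text follows on this line.
        let same_line = rest.split('\n').next().unwrap_or("");
        if !same_line.trim().is_empty() {
            out.push('\n');
        }
    }
    out.push_str(rest);
    out
}
```

For example, `split_after_comment_close("<!-- note -->tail")` returns `"<!-- note -->\ntail"`, while input whose `-->` already ends its line is returned unchanged. This naive version does not check whether the `-->` actually closes a comment or sits inside a code block; a production version would need to.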

Enums§

NormalizationError: Errors that can occur during normalization.

Functions§

fix_html_comment_fences: Fixes HTML comment closing fences to prevent content loss.
normalize_document: Normalizes a parsed document by applying all field-level normalizations.
normalize_field_name: Normalizes a field name to Unicode NFC (canonical decomposition followed by canonical composition).
normalize_fields: Normalizes document fields by applying all preprocessing steps.
normalize_markdown: Normalizes markdown content by applying all preprocessing steps.
strip_bidi_formatting: Strips Unicode bidirectional formatting characters that can interfere with markdown parsing.