§Input Normalization
This module provides input normalization for markdown content before parsing. Normalization ensures that invisible control characters and other artifacts that can interfere with markdown parsing are handled consistently.
§Overview
Input text may contain invisible Unicode characters (especially from copy-paste) that interfere with markdown parsing. This module provides functions to:
- Strip Unicode bidirectional formatting characters that break delimiter recognition
- Orchestrate guillemet preprocessing (<<text>> → «text»)
- Apply all normalizations in the correct order
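The guillemet step in the list above can be pictured with a small standalone sketch. It is not the crate's implementation: the function name `guillemets_sketch` is hypothetical, and the rule that a pair must open and close on the same line is an assumption made here for illustration. The real preprocessing may apply further rules (for example, skipping code spans).

```rust
/// Hypothetical illustration only: rewrites `<<text>>` as `«text»`.
/// Assumption: a pair is converted only when `<<` and `>>` sit on the same line.
fn guillemets_sketch(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    let mut rest = input;
    while let Some(open) = rest.find("<<") {
        let after_open = &rest[open + 2..];
        match after_open.find(">>") {
            // A closing `>>` exists and no newline separates it from the opener.
            Some(close) if !after_open[..close].contains('\n') => {
                out.push_str(&rest[..open]);
                out.push('«');
                out.push_str(&after_open[..close]);
                out.push('»');
                rest = &after_open[close + 2..];
            }
            // No usable closer: keep the `<<` literally and continue scanning.
            _ => {
                out.push_str(&rest[..open + 2]);
                rest = after_open;
            }
        }
    }
    out.push_str(rest);
    out
}
```

Under these assumptions, `guillemets_sketch("<<text>>")` yields `«text»`, while an unmatched `a << b` passes through unchanged.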
§Functions
- strip_bidi_formatting - Remove Unicode bidi control characters
- normalize_markdown - Apply all markdown-specific normalizations
- normalize_fields - Normalize document fields (bidi + guillemets)
§Why Normalize?
Unicode bidirectional formatting characters (LRO, RLO, LRE, RLE, etc.) are invisible
control characters used for bidirectional text layout. When placed adjacent to markdown
delimiters like **, they can prevent parsers from recognizing the delimiters:
**bold** or <U+202D>**(1234**
            ^^^^^^^^ invisible LRO here prevents second ** from being recognized as bold
These characters commonly appear when copying text from:
- Web pages with mixed LTR/RTL content
- PDF documents
- Word processors
- Some clipboard managers
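To make the idea concrete, here is a minimal standalone sketch of how such characters could be filtered out. It does not reproduce the crate's code: `strip_bidi_sketch` and `is_bidi_control` are hypothetical names, and the character set below (the Unicode bidi embedding, override, isolate, and mark controls) is an assumption, since this page names only LRO, RLO, LRE, RLE "etc.".

```rust
/// Hypothetical helper: true for the bidi control characters assumed here.
fn is_bidi_control(c: char) -> bool {
    matches!(
        c,
        '\u{061C}'                  // ARABIC LETTER MARK
        | '\u{200E}' | '\u{200F}'   // LRM, RLM
        | '\u{202A}'..='\u{202E}'   // LRE, RLE, PDF, LRO, RLO
        | '\u{2066}'..='\u{2069}'   // LRI, RLI, FSI, PDI
    )
}

/// Hypothetical stand-in for illustration; drops every character matched above
/// and leaves all other text untouched.
fn strip_bidi_sketch(input: &str) -> String {
    input.chars().filter(|c| !is_bidi_control(*c)).collect()
}
```

Filtering by character rather than by position keeps the sketch independent of where the control character lands relative to a delimiter, which matches the motivation above: the character is invisible, so its location cannot be predicted.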
§Examples
use quillmark_core::normalize::strip_bidi_formatting;
// Input with invisible U+202D (LRO) before second **
let input = "**asdf** or \u{202D}**(1234**";
let cleaned = strip_bidi_formatting(input);
assert_eq!(cleaned, "**asdf** or **(1234**");
Enums§
- NormalizationError - Errors that can occur during normalization
Functions§
- fix_html_comment_fences - Fixes HTML comment closing fences to prevent content loss.
- normalize_fields - Normalizes document fields by applying all preprocessing steps.
- normalize_markdown - Normalizes markdown content by applying all preprocessing steps.
- strip_bidi_formatting - Strips Unicode bidirectional formatting characters that can interfere with markdown parsing.