§Input Normalization
This module provides input normalization for markdown content before parsing. Normalization ensures that invisible control characters and other artifacts that can interfere with markdown parsing are handled consistently.
§Overview
Input text may contain invisible Unicode characters (especially from copy-paste) that interfere with markdown parsing. This module provides functions to:
- Strip Unicode bidirectional formatting characters that break delimiter recognition
- Fix HTML comment fences to preserve trailing text
- Apply all normalizations in the correct order
Double chevrons (<< and >>) are passed through unchanged.
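As a rough illustration of the HTML comment fence fix listed above: in CommonMark, an HTML block opened by `<!--` ends on the line that contains `-->`, so text following `-->` on that same line is treated as raw HTML rather than markdown. One way to preserve such trailing text is to insert a line break after the closing fence. The sketch below assumes that approach; `split_after_comment_close` is a hypothetical name, it does not check that a matching `<!--` exists, and it is not the crate's `fix_html_comment_fences` implementation.

```rust
/// Hypothetical sketch (not the crate's implementation): insert a line break
/// after every `-->` that is followed by non-whitespace text on the same line.
fn split_after_comment_close(input: &str) -> String {
    const CLOSE: &str = "-->";
    let mut out = String::with_capacity(input.len() + 8);
    let mut rest = input;
    while let Some(pos) = rest.find(CLOSE) {
        let end = pos + CLOSE.len();
        // Copy everything up to and including the closing fence.
        out.push_str(&rest[..end]);
        rest = &rest[end..];
        // Look at what remains on the current line (up to the next '\n').
        let tail = rest.split('\n').next().unwrap_or("");
        if !tail.trim().is_empty() {
            // Trailing text shares the line with `-->`: move it to its own line.
            out.push('\n');
        }
    }
    out.push_str(rest);
    out
}
```

Under these assumptions, `"<!-- note -->trailing text\n"` becomes `"<!-- note -->\ntrailing text\n"`, while input whose closing fence already ends its line is returned unchanged.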
§Functions
- strip_bidi_formatting - Remove Unicode bidi control characters
- normalize_markdown - Apply all markdown-specific normalizations
- normalize_fields - Normalize document fields (bidi stripping)
§Why Normalize?
Unicode bidirectional formatting characters (LRO, RLO, LRE, RLE, etc.) are invisible
control characters used for bidirectional text layout. When placed adjacent to markdown
delimiters like **, they can prevent parsers from recognizing the delimiters:
**bold** or <U+202D>**(1234**
            ^^^^^^^^ invisible LRO here prevents second ** from being recognized as bold
These characters commonly appear when copying text from:
- Web pages with mixed LTR/RTL content
- PDF documents
- Word processors
- Some clipboard managers
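To make the idea concrete, here is a minimal standard-library sketch of bidi stripping. The helper names `is_bidi_formatting` and `strip_bidi_sketch` are hypothetical, and the character set (directional marks, embeddings, overrides, and isolates) is an assumption for illustration; the set actually removed by `strip_bidi_formatting` may differ.

```rust
/// Hypothetical helper: returns true for Unicode bidi formatting characters.
/// Assumption: this set is illustrative, not the crate's confirmed list.
fn is_bidi_formatting(c: char) -> bool {
    matches!(
        c,
        '\u{200E}' | '\u{200F}'        // LRM, RLM (directional marks)
        | '\u{202A}'..='\u{202E}'      // LRE, RLE, PDF, LRO, RLO
        | '\u{2066}'..='\u{2069}'      // LRI, RLI, FSI, PDI (isolates)
    )
}

/// Hypothetical sketch: drop every bidi formatting character and keep all
/// other characters in their original order.
fn strip_bidi_sketch(input: &str) -> String {
    input.chars().filter(|c| !is_bidi_formatting(*c)).collect()
}
```

Filtering character by character leaves the visible text untouched, so delimiters such as ** end up directly adjacent to the surrounding text again.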
§Examples
use quillmark_core::normalize::strip_bidi_formatting;
// Input with invisible U+202D (LRO) before second **
let input = "**asdf** or \u{202D}**(1234**";
let cleaned = strip_bidi_formatting(input);
assert_eq!(cleaned, "**asdf** or **(1234**");
Enums§
- NormalizationError - Errors that can occur during normalization
Functions§
- fix_html_comment_fences - Fixes HTML comment closing fences to prevent content loss.
- normalize_document - Normalizes a parsed document by applying all field-level normalizations.
- normalize_field_name - Normalize field name to Unicode NFC (Canonical Decomposition, followed by Canonical Composition)
- normalize_fields - Normalizes document fields by applying all preprocessing steps.
- normalize_markdown - Normalizes markdown content by applying all preprocessing steps.
- strip_bidi_formatting - Strips Unicode bidirectional formatting characters that can interfere with markdown parsing.