Module normalize


§Input Normalization

This module provides input normalization for markdown content before parsing. Normalization ensures that invisible control characters and other artifacts that can interfere with markdown parsing are handled consistently.

§Overview

Input text may contain invisible Unicode characters (especially from copy-paste) that interfere with markdown parsing. This module provides functions to:

  • Strip Unicode bidirectional formatting characters that break delimiter recognition
  • Orchestrate guillemet preprocessing (<<text>> → «text»)
  • Apply all normalizations in the correct order
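The guillemet step in the list above can be pictured with a small sketch. This is not the module's implementation: `guillemets_sketch` is a hypothetical name, and the rules it applies (a span must be non-empty and must open and close on the same line) are assumptions made for illustration only.

```rust
/// Hypothetical helper (not part of this module's API): rewrites
/// `<<text>>` spans as `«text»`.
///
/// Assumptions of this sketch: a span must be non-empty and must not
/// cross a line break; anything else is left untouched.
fn guillemets_sketch(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    let mut rest = input;

    while let Some(start) = rest.find("<<") {
        let after_open = &rest[start + 2..];
        match after_open.find(">>") {
            // A closer exists, the span is non-empty, and it stays on one line.
            Some(end) if end > 0 && !after_open[..end].contains('\n') => {
                out.push_str(&rest[..start]);
                out.push('«');
                out.push_str(&after_open[..end]);
                out.push('»');
                rest = &after_open[end + 2..];
            }
            // No usable closer: keep the opener literally and keep scanning.
            _ => {
                out.push_str(&rest[..start + 2]);
                rest = after_open;
            }
        }
    }

    out.push_str(rest);
    out
}
```

Under these assumptions, `guillemets_sketch("see <<note>> here")` yields `see «note» here`, while an unmatched `<<` passes through unchanged. A real preprocessor would likely also need to skip code spans and code blocks, which this sketch does not attempt.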

§Why Normalize?

Unicode bidirectional formatting characters (LRO, RLO, LRE, RLE, etc.) are invisible control characters used for bidirectional text layout. When placed adjacent to markdown delimiters like **, they can prevent parsers from recognizing the delimiters:

**bold** or <U+202D>**(1234**
            ^^^^^^^^ invisible LRO here prevents second ** from being recognized as bold

These characters commonly appear when copying text from:

  • Web pages with mixed LTR/RTL content
  • PDF documents
  • Word processors
  • Some clipboard managers
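To make the idea concrete, here is a minimal sketch of how such characters could be filtered out of a string. The names `is_bidi_formatting` and `strip_bidi_sketch` are hypothetical, and the exact character set is an assumption: it covers the explicit embedding, override, and isolate controls defined by Unicode, which may differ from the set this module actually strips.

```rust
/// Hypothetical predicate (not this module's API): true for the Unicode
/// explicit bidirectional formatting controls.
///
/// Assumption of this sketch: only embeddings, overrides, and isolates
/// are covered; directional marks such as U+200E and U+200F are not.
fn is_bidi_formatting(c: char) -> bool {
    matches!(
        c,
        '\u{202A}'..='\u{202E}'   // LRE, RLE, PDF, LRO, RLO
        | '\u{2066}'..='\u{2069}' // LRI, RLI, FSI, PDI
    )
}

/// Hypothetical helper: returns a copy of `input` with every character
/// matched by `is_bidi_formatting` removed. All other text is preserved.
fn strip_bidi_sketch(input: &str) -> String {
    input.chars().filter(|c| !is_bidi_formatting(*c)).collect()
}
```

Because the filter works on `char` values rather than bytes, surrounding multi-byte text (including legitimate right-to-left content) is left intact; only the invisible controls are dropped.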

§Examples

use quillmark_core::normalize::strip_bidi_formatting;

// Input with invisible U+202D (LRO) before second **
let input = "**asdf** or \u{202D}**(1234**";
let cleaned = strip_bidi_formatting(input);
assert_eq!(cleaned, "**asdf** or **(1234**");

Enums§

NormalizationError
Errors that can occur during normalization

Functions§

normalize_fields
Normalizes document fields by applying all preprocessing steps.
normalize_markdown
Normalizes markdown content by applying all preprocessing steps.
strip_bidi_formatting
Strips Unicode bidirectional formatting characters that can interfere with markdown parsing.