Module text


Text extraction and processing for halldyll-parser

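This module handles text extraction from HTML, cleaning and normalization, word and sentence counting, language detection, and readability scoring.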

Functions

clean_text: Clean text by removing control characters
count_sentences: Count sentences in text
count_words: Count words in text
detect_language: Simple language detection based on common words. Returns an ISO 639-1 language code or None.
extract_text: Extract main text content from HTML document
flesch_kincaid_grade: Calculate the Flesch-Kincaid Grade Level. Returns the US school grade level needed to understand the text.
flesch_reading_ease: Calculate the Flesch Reading Ease score. A higher score means the text is easier to read (0-100+).
is_inline_element: Check if element is inline
normalize_text: Normalize text (collapse whitespace, trim)
strip_html_tags: Strip HTML tags from text (for cases where we have HTML strings)
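
The listing above gives names and one-line summaries only, so the following is a minimal, self-contained sketch of how whitespace normalization and the two readability scores could be computed. It does not call halldyll-parser and is not its implementation: every function name ending in `_sketch`, the `Option<f64>` return type, the sentence-splitting rule, and the vowel-group syllable estimate are assumptions made for illustration. Only the two formulas are the standard published Flesch Reading Ease and Flesch-Kincaid Grade Level definitions.

```rust
/// Collapse runs of whitespace into single spaces and trim both ends.
/// (Hypothetical stand-in for what `normalize_text` is described as doing.)
fn normalize_text_sketch(text: &str) -> String {
    text.split_whitespace().collect::<Vec<_>>().join(" ")
}

/// Count whitespace-separated words.
fn count_words_sketch(text: &str) -> usize {
    text.split_whitespace().count()
}

/// Count sentences as non-empty segments delimited by '.', '!' or '?'.
/// Assumption: abbreviations and decimals are not special-cased.
fn count_sentences_sketch(text: &str) -> usize {
    text.split(|c: char| matches!(c, '.' | '!' | '?'))
        .filter(|s| !s.trim().is_empty())
        .count()
}

/// Naive syllable estimate: number of vowel groups, with a minimum of one.
fn count_syllables_sketch(word: &str) -> usize {
    let mut count: usize = 0;
    let mut prev_vowel = false;
    for c in word.chars() {
        let is_vowel = matches!(c.to_ascii_lowercase(), 'a' | 'e' | 'i' | 'o' | 'u' | 'y');
        if is_vowel && !prev_vowel {
            count += 1;
        }
        prev_vowel = is_vowel;
    }
    count.max(1)
}

/// Return (words per sentence, syllables per word), or None for empty input.
fn ratios_sketch(text: &str) -> Option<(f64, f64)> {
    let words = count_words_sketch(text);
    let sentences = count_sentences_sketch(text);
    if words == 0 || sentences == 0 {
        return None;
    }
    let syllables: usize = text.split_whitespace().map(count_syllables_sketch).sum();
    Some((
        words as f64 / sentences as f64,
        syllables as f64 / words as f64,
    ))
}

/// Flesch Reading Ease: 206.835 - 1.015 * (words/sentence) - 84.6 * (syllables/word).
fn flesch_reading_ease_sketch(text: &str) -> Option<f64> {
    let (wps, spw) = ratios_sketch(text)?;
    Some(206.835 - 1.015 * wps - 84.6 * spw)
}

/// Flesch-Kincaid Grade Level: 0.39 * (words/sentence) + 11.8 * (syllables/word) - 15.59.
fn flesch_kincaid_grade_sketch(text: &str) -> Option<f64> {
    let (wps, spw) = ratios_sketch(text)?;
    Some(0.39 * wps + 11.8 * spw - 15.59)
}
```

Both scores depend on the same two ratios, which is why the sketch factors them into one helper. Very simple text can push the reading-ease score above 100 and the grade level below zero, which matches the "0-100+" range noted in the listing. The crate's own tokenization and syllable counting may differ, so its scores will not necessarily match this sketch.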