Module text


Text extraction and processing for halldyll-parser

This module handles:

  • Text extraction from HTML documents
  • Boilerplate removal (nav, footer, ads, etc.)
  • Text cleaning and normalization
  • Readability scoring
  • Language detection (basic)
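
The first three steps above can be pictured as a small pipeline. The sketch below is only an illustration using the Rust standard library. This page does not show the module's actual signatures, so the function names here (`strip_tags_sketch`, `clean_text_sketch`, `normalize_text_sketch`) are hypothetical stand-ins, and the real implementation may well use a proper HTML parser instead of the naive tag scanner assumed here.

```rust
/// Hypothetical stand-in for tag stripping: drops everything between
/// '<' and '>' and leaves a space where each tag was.
/// Assumption: no '>' inside attribute values, no comments or scripts.
fn strip_tags_sketch(html: &str) -> String {
    let mut out = String::with_capacity(html.len());
    let mut in_tag = false;
    for c in html.chars() {
        match c {
            '<' => in_tag = true,
            '>' if in_tag => {
                in_tag = false;
                out.push(' ');
            }
            _ if !in_tag => out.push(c),
            _ => {} // character inside a tag: skip it
        }
    }
    out
}

/// Hypothetical stand-in for cleaning: removes control characters,
/// keeping newlines and tabs so whitespace handling can see them later.
fn clean_text_sketch(input: &str) -> String {
    input
        .chars()
        .filter(|c| !c.is_control() || *c == '\n' || *c == '\t')
        .collect()
}

/// Hypothetical stand-in for normalization: collapses every run of
/// whitespace into a single space and trims both ends.
fn normalize_text_sketch(input: &str) -> String {
    input.split_whitespace().collect::<Vec<_>>().join(" ")
}

/// Runs the three steps in order.
fn extract_pipeline_sketch(html: &str) -> String {
    normalize_text_sketch(&clean_text_sketch(&strip_tags_sketch(html)))
}
```

Under these assumptions, `extract_pipeline_sketch("<p>Hello <b>world</b></p>")` yields `Hello world`. Boilerplate removal (nav, footer, ads) is not covered by this sketch, since it needs knowledge of the document structure that a character scanner does not have.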

Functions

clean_text
Clean text by removing control characters
count_sentences
Count sentences in text
count_words
Count words in text
detect_language
Simple language detection based on common words. Returns ISO 639-1 language code or None
extract_text
Extract main text content from HTML document
flesch_kincaid_grade
Calculate Flesch-Kincaid Grade Level. Returns US school grade level needed to understand text
flesch_reading_ease
Calculate Flesch Reading Ease score. Higher score = easier to read (0-100+)
is_inline_element
Check if element is inline
normalize_text
Normalize text (collapse whitespace, trim)
strip_html_tags
Strip HTML tags from text (for cases where we have HTML strings)
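
For orientation, the two readability scores listed above are conventionally defined from word, sentence, and syllable counts. The sketch below uses the standard published formulas. The function names, the `Option<f64>` return type, and the simple counting rules are assumptions made for illustration and are not taken from this module, whose signatures are not shown on this page. Syllable counting is left to the caller because any heuristic for it would be a further assumption.

```rust
/// Hypothetical word counter: whitespace-separated tokens.
fn count_words_sketch(text: &str) -> usize {
    text.split_whitespace().count()
}

/// Hypothetical sentence counter: each run of '.', '!' or '?'
/// counts as one sentence end, so "?!" and "..." count once.
fn count_sentences_sketch(text: &str) -> usize {
    let mut count = 0;
    let mut in_terminator = false;
    for c in text.chars() {
        let is_terminator = matches!(c, '.' | '!' | '?');
        if is_terminator && !in_terminator {
            count += 1;
        }
        in_terminator = is_terminator;
    }
    count
}

/// Flesch Reading Ease:
/// 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words).
/// Returns None when there are no words or no sentences.
fn flesch_reading_ease_sketch(words: usize, sentences: usize, syllables: usize) -> Option<f64> {
    if words == 0 || sentences == 0 {
        return None;
    }
    let words_per_sentence = words as f64 / sentences as f64;
    let syllables_per_word = syllables as f64 / words as f64;
    Some(206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word)
}

/// Flesch-Kincaid Grade Level:
/// 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59.
/// Returns None when there are no words or no sentences.
fn flesch_kincaid_grade_sketch(words: usize, sentences: usize, syllables: usize) -> Option<f64> {
    if words == 0 || sentences == 0 {
        return None;
    }
    let words_per_sentence = words as f64 / sentences as f64;
    let syllables_per_word = syllables as f64 / words as f64;
    Some(0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59)
}
```

Both formulas share the same two ratios, which is why short sentences made of short words score as easy on one scale and as a low grade level on the other. Very simple text can push the reading-ease score above 100, which matches the "0-100+" range noted above.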