Text extraction and processing for halldyll-parser
This module handles:
- Text extraction from HTML documents
- Boilerplate removal (nav, footer, ads, etc.)
- Text cleaning and normalization
- Readability scoring
- Language detection (basic)
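As a rough illustration of how the extraction and cleaning steps above might compose, here is a minimal, std-only sketch. The names strip_tags, remove_control_chars, collapse_whitespace and plain_text are hypothetical stand-ins, not this crate's API, and the tag stripper is deliberately naive: it does no boilerplate removal and does not understand comments, scripts, or a '>' inside an attribute value.

```rust
/// Hypothetical helper: drop everything between '<' and '>'.
/// Each closing '>' is replaced by a space so adjacent words stay separated.
fn strip_tags(html: &str) -> String {
    let mut out = String::with_capacity(html.len());
    let mut in_tag = false;
    for ch in html.chars() {
        match ch {
            '<' => in_tag = true,
            '>' if in_tag => {
                in_tag = false;
                out.push(' ');
            }
            _ if !in_tag => out.push(ch),
            _ => {} // character inside a tag: skip it
        }
    }
    out
}

/// Hypothetical helper: remove control characters, but keep whitespace
/// such as '\n' and '\t' so that word boundaries survive.
fn remove_control_chars(text: &str) -> String {
    text.chars()
        .filter(|c| !c.is_control() || c.is_whitespace())
        .collect()
}

/// Hypothetical helper: collapse runs of whitespace to one space and trim.
fn collapse_whitespace(text: &str) -> String {
    text.split_whitespace().collect::<Vec<_>>().join(" ")
}

/// Strip tags, then clean, then normalize.
fn plain_text(html: &str) -> String {
    collapse_whitespace(&remove_control_chars(&strip_tags(html)))
}
```

Normalizing last means the extra spaces introduced by tag stripping are cleaned up in a single pass. A real extractor would work on a parsed DOM rather than on raw characters.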
Functions
- clean_text - Clean text by removing control characters.
- count_sentences - Count sentences in text.
- count_words - Count words in text.
- detect_language - Simple language detection based on common words. Returns an ISO 639-1 language code or None.
- extract_text - Extract the main text content from an HTML document.
- flesch_kincaid_grade - Calculate the Flesch-Kincaid Grade Level. Returns the US school grade level needed to understand the text.
- flesch_reading_ease - Calculate the Flesch Reading Ease score. Higher score = easier to read (0-100+).
- is_inline_element - Check whether an element is inline.
- normalize_text - Normalize text (collapse whitespace, trim).
- strip_html_tags - Strip HTML tags from text (for cases where the input is an HTML string).
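For reference, the two readability scores listed above are conventionally defined from the average sentence length (words per sentence) and the average syllables per word. The sketch below applies the standard published formulas. The helper names (word_count, sentence_count, syllable_estimate, reading_ease, grade_level) and the vowel-group syllable heuristic are assumptions made for illustration and may differ from what this crate does internally.

```rust
/// Hypothetical helper: words are whitespace-separated tokens.
fn word_count(text: &str) -> usize {
    text.split_whitespace().count()
}

/// Hypothetical helper: each run of '.', '!' or '?' ends one sentence.
fn sentence_count(text: &str) -> usize {
    let mut count: usize = 0;
    let mut in_terminator = false;
    for ch in text.chars() {
        let is_term = matches!(ch, '.' | '!' | '?');
        if is_term && !in_terminator {
            count += 1;
        }
        in_terminator = is_term;
    }
    count
}

/// Hypothetical heuristic: one syllable per group of consecutive vowels,
/// with a minimum of one per word. Crude, English-only.
fn syllable_estimate(word: &str) -> usize {
    let mut count: usize = 0;
    let mut prev_vowel = false;
    for ch in word.chars().filter(|c| c.is_alphabetic()) {
        let is_vowel = matches!(ch.to_ascii_lowercase(), 'a' | 'e' | 'i' | 'o' | 'u' | 'y');
        if is_vowel && !prev_vowel {
            count += 1;
        }
        prev_vowel = is_vowel;
    }
    count.max(1)
}

/// Returns (words per sentence, syllables per word), or None when the
/// text has no words or no sentence terminator.
fn ratios(text: &str) -> Option<(f64, f64)> {
    let words = word_count(text);
    let sentences = sentence_count(text);
    if words == 0 || sentences == 0 {
        return None;
    }
    let syllables: usize = text.split_whitespace().map(syllable_estimate).sum();
    Some((
        words as f64 / sentences as f64,
        syllables as f64 / words as f64,
    ))
}

/// Flesch Reading Ease: 206.835 - 1.015 * ASL - 84.6 * ASW.
fn reading_ease(text: &str) -> Option<f64> {
    ratios(text).map(|(asl, asw)| 206.835 - 1.015 * asl - 84.6 * asw)
}

/// Flesch-Kincaid Grade Level: 0.39 * ASL + 11.8 * ASW - 15.59.
fn grade_level(text: &str) -> Option<f64> {
    ratios(text).map(|(asl, asw)| 0.39 * asl + 11.8 * asw - 15.59)
}
```

Under these assumptions, "The cat sat." has three words, one sentence and three syllables, which gives a reading ease of about 119.19 and a grade level of about -2.62. This shows why very simple text can fall outside the nominal 0-100 range.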