pub struct ContentProcessor { /* private fields */ }
Content processor for HTML documents
Provides methods to clean, extract, and normalize HTML content.
Implementations
impl ContentProcessor
pub fn new(config: ContentProcessorConfig) -> Self
Create a new content processor with the given configuration
pub fn with_defaults() -> Self
Create a content processor with default settings
pub fn with_max_length(max_length: usize) -> Self
Create a content processor with a maximum length limit
pub fn process(&self, raw_html: &str) -> ProcessedContent
Process raw HTML and return cleaned content
This is the main entry point for content processing. It:
- Removes script, style, and other non-content elements
- Extracts text from remaining HTML
- Decodes HTML entities
- Normalizes whitespace
- Optionally truncates to max_length
pub fn extract_text(&self, html: &str) -> String
Extract text content from HTML
Uses the scraper crate to parse HTML and extract text nodes, preserving paragraph structure if configured.
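To make the idea of text extraction with optional paragraph preservation concrete, here is a minimal std-only sketch. It is not the scraper-based implementation described above: it swaps in a naive tag scanner, and the function name, the `preserve_paragraphs` flag, and the set of block-level tags are all assumptions made for illustration.

```rust
/// Naive text extraction (hypothetical stand-in for a real HTML parser).
/// Tags are dropped; closing </p>, </div> and <br> become breaks.
fn extract_text_sketch(html: &str, preserve_paragraphs: bool) -> String {
    let mut out = String::new();
    let mut rest = html;
    while let Some(lt) = rest.find('<') {
        // Everything before the tag is text content.
        out.push_str(&rest[..lt]);
        let gt = match rest[lt..].find('>') {
            Some(offset) => lt + offset,
            None => {
                // Unterminated tag: drop the remainder.
                rest = "";
                break;
            }
        };
        // Tag name only: ignore attributes and a trailing slash.
        let name = rest[lt + 1..gt]
            .split_whitespace()
            .next()
            .unwrap_or("")
            .trim_end_matches('/')
            .to_ascii_lowercase();
        if matches!(name.as_str(), "/p" | "/div" | "br") {
            out.push_str(if preserve_paragraphs { "\n\n" } else { " " });
        }
        rest = &rest[gt + 1..];
    }
    out.push_str(rest);
    out.trim().to_string()
}
```

A real parser also handles malformed markup, attribute values containing `>`, and nested structure, which this scanner does not.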
pub fn remove_scripts_styles(&self, html: &str) -> String
Remove script, style, and other non-content elements from HTML
This method performs a comprehensive cleanup of HTML by:
- Removing <script> tags and their content
- Removing <style> tags and their content
- Removing <noscript> tags and their content
- Removing HTML comments
- Removing other configured tags
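The cleanup steps listed above can be sketched with plain string scanning. This is an illustration under stated assumptions, not the crate's implementation: `strip_element`, `strip_comments`, `remove_non_content`, and the `extra_tags` parameter are hypothetical names, and closing tags are matched only in their exact `</tag>` form.

```rust
/// Removes every `<tag ...> ... </tag>` block, ASCII case-insensitively.
/// An unclosed block is dropped through to the end of the input.
fn strip_element(html: &str, tag: &str) -> String {
    // ASCII lowercasing keeps byte offsets identical to `html`.
    let lower = html.to_ascii_lowercase();
    let open = format!("<{}", tag);
    let close = format!("</{}>", tag);
    let mut out = String::with_capacity(html.len());
    let mut pos = 0;
    while let Some(found) = lower[pos..].find(&open) {
        let start = pos + found;
        let name_end = start + open.len();
        // Guard against prefixes such as <styled>: the name must end here.
        let after = lower[name_end..].chars().next();
        let is_tag =
            matches!(after, Some(c) if c == '>' || c == '/' || c.is_ascii_whitespace());
        if !is_tag {
            out.push_str(&html[pos..name_end]);
            pos = name_end;
            continue;
        }
        out.push_str(&html[pos..start]);
        pos = match lower[start..].find(&close) {
            Some(end) => start + end + close.len(),
            None => html.len(),
        };
    }
    out.push_str(&html[pos..]);
    out
}

/// Removes `<!-- ... -->` comments; an unclosed comment runs to the end.
fn strip_comments(html: &str) -> String {
    let mut out = String::with_capacity(html.len());
    let mut rest = html;
    while let Some(start) = rest.find("<!--") {
        out.push_str(&rest[..start]);
        rest = match rest[start + 4..].find("-->") {
            Some(end) => &rest[start + 4 + end + 3..],
            None => "",
        };
    }
    out.push_str(rest);
    out
}

/// Applies the built-in removals, then any extra configured tags (assumed parameter).
fn remove_non_content(html: &str, extra_tags: &[&str]) -> String {
    let mut tags = vec!["script", "style", "noscript"];
    tags.extend_from_slice(extra_tags);
    let mut cleaned = html.to_string();
    for tag in tags {
        cleaned = strip_element(&cleaned, tag);
    }
    strip_comments(&cleaned)
}
```

Elements are stripped before comments so that comment markers inside script bodies do not affect the surrounding markup.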
pub fn normalize_whitespace(&self, text: &str) -> String
Normalize whitespace in text
This method:
- Collapses multiple spaces into single spaces
- Normalizes different whitespace characters (tabs, nbsp, etc.)
- Preserves paragraph breaks (double newlines) if structure preservation is enabled
- Trims leading and trailing whitespace
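The rules above can be illustrated with a small standalone function. This is a sketch, not the method's actual body: the `preserve_paragraphs` flag stands in for the processor's configuration, and treating any run of blank lines as a single paragraph break is an assumption.

```rust
/// Whitespace normalization sketch (hypothetical free function).
/// `split_whitespace` splits on Unicode whitespace, which covers
/// tabs, newlines and U+00A0 (non-breaking space).
fn normalize_whitespace_sketch(text: &str, preserve_paragraphs: bool) -> String {
    if !preserve_paragraphs {
        // Collapse every whitespace run, including newlines, to one space.
        return text.split_whitespace().collect::<Vec<_>>().join(" ");
    }
    let mut paragraphs: Vec<String> = Vec::new();
    let mut words: Vec<&str> = Vec::new();
    for line in text.lines() {
        if line.trim().is_empty() {
            // A blank line closes the current paragraph, if any.
            if !words.is_empty() {
                paragraphs.push(words.join(" "));
                words.clear();
            }
        } else {
            words.extend(line.split_whitespace());
        }
    }
    if !words.is_empty() {
        paragraphs.push(words.join(" "));
    }
    // Paragraphs are rejoined with exactly one double newline.
    paragraphs.join("\n\n")
}
```

Because words are rejoined with single separators, leading and trailing whitespace disappears without a separate trim step.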
pub fn truncate_with_ellipsis(&self, text: &str, max: usize) -> String
Truncate text with ellipsis at a word boundary
This method truncates text to approximately the given maximum length, breaking at word boundaries to avoid cutting words in half. Appends “…” to indicate truncation.
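One way word-boundary truncation might work is sketched below. It is an assumption that lengths are counted in characters and that the ellipsis counts toward the limit; the actual method may count bytes or handle these details differently.

```rust
/// Word-boundary truncation sketch (hypothetical free function).
/// Assumes `max` is measured in chars and includes the ellipsis.
fn truncate_with_ellipsis_sketch(text: &str, max: usize) -> String {
    if text.chars().count() <= max {
        return text.to_string();
    }
    // Reserve one char for the ellipsis.
    let budget = max.saturating_sub(1);
    // Byte offset just past the first `budget` chars (always a char boundary).
    let cut = text
        .char_indices()
        .nth(budget)
        .map_or(text.len(), |(i, _)| i);
    let head = &text[..cut];
    // If the cut lands inside a word, back up to the previous whitespace.
    let next_is_space = text[cut..].chars().next().map_or(true, char::is_whitespace);
    let kept = if next_is_space {
        head
    } else {
        match head.rfind(char::is_whitespace) {
            Some(i) => &head[..i],
            None => head, // a single long word: fall back to a hard cut
        }
    };
    format!("{}…", kept.trim_end())
}
```

Slicing at offsets taken from `char_indices` avoids panics on multi-byte characters, which a plain byte-index cut would risk.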
pub fn decode_html_entities(text: &str) -> String
Decode HTML entities in text
Handles common HTML entities including:
- Named entities (&amp;amp;, &amp;lt;, &amp;gt;, &amp;quot;, &amp;nbsp;, etc.)
- Numeric entities (&amp;#39;, &amp;#x27;, etc.)
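A decoder for the two entity families can be sketched as a single left-to-right pass. The table of named entities here is a small assumed subset, and `decode_entities_sketch` and `decode_one` are hypothetical names; the real function may support many more names.

```rust
/// Entity decoding sketch (hypothetical; covers a handful of names).
/// Unknown or malformed entities are left in the output unchanged.
fn decode_entities_sketch(text: &str) -> String {
    let mut out = String::with_capacity(text.len());
    let mut rest = text;
    while let Some(amp) = rest.find('&') {
        out.push_str(&rest[..amp]);
        rest = &rest[amp..];
        // An entity ends at the first ';'; cap its length so a stray '&'
        // does not swallow a distant semicolon.
        let decoded = rest
            .find(';')
            .filter(|&end| end <= 10)
            .and_then(|end| decode_one(&rest[1..end]).map(|c| (c, end + 1)));
        match decoded {
            Some((c, consumed)) => {
                out.push(c);
                rest = &rest[consumed..];
            }
            None => {
                // Not an entity: keep the ampersand literally.
                out.push('&');
                rest = &rest[1..];
            }
        }
    }
    out.push_str(rest);
    out
}

/// Decodes the text between '&' and ';' into a single char.
fn decode_one(name: &str) -> Option<char> {
    if let Some(num) = name.strip_prefix('#') {
        // Numeric form: decimal (#39) or hexadecimal (#x27).
        let code = match num.strip_prefix(|c: char| c == 'x' || c == 'X') {
            Some(hex) => u32::from_str_radix(hex, 16).ok()?,
            None => num.parse::<u32>().ok()?,
        };
        return char::from_u32(code);
    }
    match name {
        "amp" => Some('&'),
        "lt" => Some('<'),
        "gt" => Some('>'),
        "quot" => Some('"'),
        "apos" => Some('\''),
        "nbsp" => Some('\u{a0}'),
        _ => None,
    }
}
```

Because decoded output is never rescanned, doubly escaped input is unescaped by only one level.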
Trait Implementations
impl Clone for ContentProcessor
fn clone(&self) -> ContentProcessor
fn clone_from(&mut self, source: &Self)
Performs copy-assignment from source.
Auto Trait Implementations
impl Freeze for ContentProcessor
impl RefUnwindSafe for ContentProcessor
impl Send for ContentProcessor
impl Sync for ContentProcessor
impl Unpin for ContentProcessor
impl UnwindSafe for ContentProcessor
Blanket Implementations
impl<T> BorrowMut<T> for T
where
    T: ?Sized,
fn borrow_mut(&mut self) -> &mut T
impl<T> CloneToUninit for T
where
    T: Clone,
impl<T> Instrument for T
fn instrument(self, span: Span) -> Instrumented<Self>
fn in_current_span(self) -> Instrumented<Self>
impl<T> IntoEither for T
fn into_either(self, into_left: bool) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.
fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.