ContentProcessor

Struct ContentProcessor 

pub struct ContentProcessor { /* private fields */ }

Content processor for HTML documents

Provides methods to clean, extract, and normalize HTML content.

Implementations

impl ContentProcessor

pub fn new(config: ContentProcessorConfig) -> Self

Create a new content processor with the given configuration

pub fn with_defaults() -> Self

Create a content processor with default settings

pub fn with_max_length(max_length: usize) -> Self

Create a content processor with a maximum length limit

pub fn process(&self, raw_html: &str) -> ProcessedContent

Process raw HTML and return cleaned content

This is the main entry point for content processing. It:

  1. Removes script, style, and other non-content elements
  2. Extracts text from remaining HTML
  3. Decodes HTML entities
  4. Normalizes whitespace
  5. Optionally truncates to max_length
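
The ordering of these steps matters, and a small sketch may make that concrete. The code below is not the crate's implementation: `strip_tags`, `decode_basic`, and `process_sketch` are hypothetical, std-only stand-ins, the input is assumed to be free of scripts already (step 1 is skipped), and truncation is a plain character cut.

```rust
/// Hypothetical stand-in for step 2: drops everything between '<' and '>'.
fn strip_tags(html: &str) -> String {
    let mut out = String::new();
    let mut in_tag = false;
    for c in html.chars() {
        match c {
            '<' => in_tag = true,
            '>' if in_tag => in_tag = false,
            _ if !in_tag => out.push(c),
            _ => {}
        }
    }
    out
}

/// Hypothetical stand-in for step 3: decodes only &lt;, &gt; and &amp;.
/// &amp; is handled last so that "&amp;lt;" is decoded once, to "&lt;".
fn decode_basic(text: &str) -> String {
    text.replace("&lt;", "<")
        .replace("&gt;", ">")
        .replace("&amp;", "&")
}

/// Runs steps 2 to 5 in the documented order (assumed behaviour, not the real API).
fn process_sketch(raw_html: &str, max_length: Option<usize>) -> String {
    let text = strip_tags(raw_html);
    let decoded = decode_basic(&text);
    // Step 4: collapse every whitespace run into a single space.
    let normalized = decoded.split_whitespace().collect::<Vec<_>>().join(" ");
    // Step 5: optional cut, counted in characters rather than bytes.
    match max_length {
        Some(max) if normalized.chars().count() > max => {
            normalized.chars().take(max).collect()
        }
        _ => normalized,
    }
}
```

With this sketch, `process_sketch("<p>Use  &lt;b&gt; for   bold &amp; more</p>", None)` returns `Use <b> for bold & more`. Decoding entities before stripping tags would turn the escaped `&lt;b&gt;` into a real tag and lose it, which is one plausible reason for extracting text first.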

pub fn extract_text(&self, html: &str) -> String

Extract text content from HTML

Uses the scraper crate to parse HTML and extract text nodes, preserving paragraph structure if configured.
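
To illustrate what "preserving paragraph structure" could mean, here is a minimal std-only sketch. It deliberately swaps the scraper-based parsing for a naive `<`...`>` scanner, so it will mishandle attribute values that contain `>` and other real-world HTML. The function name, the `preserve_structure` flag, and the list of block-level tags are assumptions made for illustration.

```rust
/// Hypothetical, std-only stand-in for text extraction.
/// This is a naive scanner, not the scraper-based parser the method uses.
fn extract_text_sketch(html: &str, preserve_structure: bool) -> String {
    let mut out = String::new();
    let mut tag = String::new();
    let mut in_tag = false;
    for c in html.chars() {
        if in_tag {
            if c == '>' {
                in_tag = false;
                let name = tag.trim().to_ascii_lowercase();
                // Closing block-level tags end a paragraph; inline tags add nothing.
                if matches!(name.as_str(), "/p" | "/div" | "/li" | "/h1" | "/h2" | "/h3") {
                    out.push_str(if preserve_structure { "\n\n" } else { " " });
                }
                tag.clear();
            } else {
                tag.push(c);
            }
        } else if c == '<' {
            in_tag = true;
        } else {
            out.push(c);
        }
    }
    out.trim().to_string()
}
```

For the input `<p>One</p><p>Two</p>`, the sketch returns the two words separated by a blank line when `preserve_structure` is true, and `One Two` otherwise.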

pub fn remove_scripts_styles(&self, html: &str) -> String

Remove script, style, and other non-content elements from HTML

This method performs a comprehensive cleanup of HTML by:

  • Removing <script> tags and their content
  • Removing <style> tags and their content
  • Removing <noscript> tags and their content
  • Removing HTML comments
  • Removing other configured tags
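
The cleanup listed above can be approximated with plain string searches. The following is a sketch under stated assumptions, not the method's actual implementation: `remove_comments`, `remove_element`, and `remove_non_content` are hypothetical helpers, and the `extra_tags` parameter stands in for the "other configured tags".

```rust
/// Removes `<!-- ... -->` comments; an unterminated comment drops the rest.
fn remove_comments(html: &str) -> String {
    let mut out = String::new();
    let mut rest = html;
    while let Some(start) = rest.find("<!--") {
        out.push_str(&rest[..start]);
        rest = match rest[start + 4..].find("-->") {
            Some(end) => &rest[start + 4 + end + 3..],
            None => "",
        };
    }
    out.push_str(rest);
    out
}

/// Removes every `<tag ...> ... </tag>` block, matching the name case-insensitively.
fn remove_element(html: &str, tag: &str) -> String {
    // ASCII lowercasing keeps byte offsets identical to the original string.
    let lower = html.to_ascii_lowercase();
    let open = format!("<{}", tag);
    let close = format!("</{}>", tag);
    let mut out = String::new();
    let mut pos = 0;
    while let Some(found) = lower[pos..].find(&open) {
        let start = pos + found;
        let after = start + open.len();
        // Require a delimiter after the name so `<style` does not match `<styles>`.
        let next = lower.as_bytes().get(after).copied();
        let is_tag = matches!(next, Some(b'>' | b' ' | b'\t' | b'\n' | b'/'));
        if !is_tag {
            out.push_str(&html[pos..after]);
            pos = after;
            continue;
        }
        out.push_str(&html[pos..start]);
        pos = match lower[after..].find(&close) {
            Some(end) => after + end + close.len(),
            None => html.len(), // unterminated element: drop the rest
        };
    }
    out.push_str(&html[pos..]);
    out
}

/// Applies comment removal, then the three fixed tags, then any extra tags.
fn remove_non_content(html: &str, extra_tags: &[&str]) -> String {
    let mut cleaned = remove_comments(html);
    for tag in ["script", "style", "noscript"].iter().chain(extra_tags.iter()) {
        cleaned = remove_element(&cleaned, tag);
    }
    cleaned
}
```

Passing `&["nav"]` as `extra_tags` would additionally strip `<nav>` blocks. A real HTML parser handles cases this sketch does not, such as a closing tag that appears inside a script string.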

pub fn normalize_whitespace(&self, text: &str) -> String

Normalize whitespace in text

This method:

  • Collapses multiple spaces into single spaces
  • Normalizes different whitespace characters (tabs, nbsp, etc.)
  • Preserves paragraph breaks (double newlines) if structure preservation is enabled
  • Trims leading and trailing whitespace
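
The rules above can be sketched with the standard library alone. This is an illustration, not the method's code: the function name and the `preserve_paragraphs` flag are assumptions standing in for the structure-preservation setting.

```rust
/// Hypothetical sketch of whitespace normalization.
fn normalize_whitespace_sketch(text: &str, preserve_paragraphs: bool) -> String {
    // `split_whitespace` treats spaces, tabs, newlines and U+00A0 (nbsp) alike,
    // so rejoining with ' ' both collapses runs and trims the ends.
    fn collapse(s: &str) -> String {
        s.split_whitespace().collect::<Vec<_>>().join(" ")
    }
    if !preserve_paragraphs {
        return collapse(text);
    }
    // Treat a double newline as a paragraph break and normalize each paragraph.
    text.replace("\r\n", "\n")
        .split("\n\n")
        .map(collapse)
        .filter(|p| !p.is_empty())
        .collect::<Vec<_>>()
        .join("\n\n")
}
```

With `preserve_paragraphs` set to false, paragraph breaks are collapsed into single spaces like any other whitespace run.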

pub fn truncate_with_ellipsis(&self, text: &str, max: usize) -> String

Truncate text with ellipsis at a word boundary

This method truncates text to approximately the given maximum length, breaking at word boundaries to avoid cutting words in half. Appends “…” to indicate truncation.
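
A word-boundary truncation along these lines might look like the sketch below. It is an assumption-laden illustration rather than the method itself: the function name is hypothetical, the limit is counted in characters, and the ellipsis is appended after the cut, so the result can be one character longer than `max`.

```rust
/// Hypothetical sketch of truncation at a word boundary.
fn truncate_with_ellipsis_sketch(text: &str, max: usize) -> String {
    if text.chars().count() <= max {
        return text.to_string();
    }
    // Byte offset just past the first `max` characters (always a char boundary).
    let cut = text
        .char_indices()
        .nth(max)
        .map(|(i, _)| i)
        .unwrap_or(text.len());
    let head = &text[..cut];
    // Back up to the last whitespace so the final word is not split.
    let head = match head.rfind(char::is_whitespace) {
        Some(i) if i > 0 => &head[..i],
        _ => head,
    };
    format!("{}…", head.trim_end())
}
```

Slicing by byte index directly could panic in the middle of a multi-byte character, which is why the sketch derives the cut position from `char_indices`.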

pub fn decode_html_entities(text: &str) -> String

Decode HTML entities in text

Handles common HTML entities including:

  • Named entities (&amp;, &lt;, &gt;, &quot;, &nbsp;, etc.)
  • Numeric entities (&#39;, &#x27;, etc.)
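
As a rough illustration of both entity kinds, the sketch below decodes a small, hand-picked set of named entities plus decimal and hexadecimal numeric references. It is not the function's actual implementation: `decode_entities_sketch` and `decode_one` are hypothetical names, the named-entity table is far from complete, and the 10-byte lookahead limit is an arbitrary choice.

```rust
/// Hypothetical sketch: decodes a few named entities and numeric references.
fn decode_entities_sketch(text: &str) -> String {
    let mut out = String::with_capacity(text.len());
    let mut rest = text;
    while let Some(amp) = rest.find('&') {
        out.push_str(&rest[..amp]);
        rest = &rest[amp..];
        // An entity ends at the next ';'; give up if it is missing or too far away.
        let decoded = rest
            .find(';')
            .filter(|&end| end <= 10)
            .and_then(|end| decode_one(&rest[1..end]).map(|c| (c, end + 1)));
        match decoded {
            Some((c, consumed)) => {
                out.push(c);
                rest = &rest[consumed..];
            }
            None => {
                // Not a recognized entity: keep the '&' literally.
                out.push('&');
                rest = &rest[1..];
            }
        }
    }
    out.push_str(rest);
    out
}

/// Decodes the text between '&' and ';', for example "amp", "#39" or "#x27".
fn decode_one(name: &str) -> Option<char> {
    if let Some(num) = name.strip_prefix('#') {
        let code = match num.strip_prefix('x').or_else(|| num.strip_prefix('X')) {
            Some(hex) => u32::from_str_radix(hex, 16).ok()?,
            None => num.parse::<u32>().ok()?,
        };
        return char::from_u32(code);
    }
    match name {
        "amp" => Some('&'),
        "lt" => Some('<'),
        "gt" => Some('>'),
        "quot" => Some('"'),
        "apos" => Some('\''),
        "nbsp" => Some('\u{a0}'),
        _ => None,
    }
}
```

Unknown names such as `&unknown;` and bare ampersands pass through unchanged, and each entity is decoded once, so `&amp;lt;` becomes `&lt;` rather than `<`.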

Trait Implementations

impl Clone for ContentProcessor

fn clone(&self) -> ContentProcessor

Returns a duplicate of the value.

fn clone_from(&mut self, source: &Self)

Performs copy-assignment from source.

impl Debug for ContentProcessor

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Formats the value using the given formatter.
