Struct MetadataConfig 

pub struct MetadataConfig {
    pub extract_document: bool,
    pub extract_headers: bool,
    pub extract_links: bool,
    pub extract_images: bool,
    pub extract_structured_data: bool,
    pub max_structured_data_size: usize,
}

Configuration for metadata extraction granularity.

Controls which metadata categories are extracted from an HTML document during HTML-to-Markdown conversion, and caps the size of collected structured data for safety. Each category can be enabled or disabled independently, so callers collect only the information they need.

§Fields

  • extract_document: Enable document-level metadata extraction (title, description, author, Open Graph, Twitter Card, etc.)
  • extract_headers: Enable heading element extraction (h1-h6) with hierarchy tracking
  • extract_links: Enable anchor element extraction with link type classification
  • extract_images: Enable image element extraction with source and dimension metadata
  • extract_structured_data: Enable structured data extraction (JSON-LD, Microdata, RDFa)
  • max_structured_data_size: Safety limit on total structured data size in bytes

§Examples

let config = MetadataConfig {
    extract_document: true,
    extract_headers: true,
    extract_links: true,
    extract_images: true,
    extract_structured_data: true,
    max_structured_data_size: 1_000_000,
};

assert!(config.extract_headers);

Fields§

§extract_document: bool

Extract document-level metadata (title, description, author, etc.).

When enabled, collects metadata from <head> section including:

  • <title> element content
  • <meta name="description"> and other standard meta tags
  • Open Graph (og:*) properties for social media optimization
  • Twitter Card (twitter:*) properties
  • Language and text direction attributes
  • Canonical URL and base href references
§extract_headers: bool

Extract h1-h6 header elements and their hierarchy.

When enabled, collects all heading elements with:

  • Header level (1-6)
  • Text content (normalized)
  • HTML id attribute if present
  • Document tree depth for hierarchy tracking
  • Byte offset in original HTML for positioning
§extract_links: bool

Extract anchor (a) elements as links with type classification.

When enabled, collects all hyperlinks with:

  • href attribute value
  • Link text content
  • Title attribute (tooltip text)
  • Automatic link type classification (anchor, internal, external, email, phone, other)
  • Rel attribute values
  • Additional custom attributes
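
The link categories listed above could, in principle, be derived from the href value alone. The sketch below is illustrative only: `LinkKind` and `classify_href` are hypothetical names, and the prefix rules are assumptions rather than the crate's actual classification logic.

```rust
/// Hypothetical mirror of the link categories named above.
#[derive(Debug, PartialEq, Eq)]
enum LinkKind {
    Anchor,
    Internal,
    External,
    Email,
    Phone,
    Other,
}

/// Classify an href by its prefix. These rules are assumptions made for
/// illustration; the real extractor may decide differently.
fn classify_href(href: &str) -> LinkKind {
    let href = href.trim();
    if href.starts_with('#') {
        LinkKind::Anchor
    } else if href.starts_with("mailto:") {
        LinkKind::Email
    } else if href.starts_with("tel:") {
        LinkKind::Phone
    } else if href.starts_with("http://")
        || href.starts_with("https://")
        || href.starts_with("//")
    {
        // Protocol-relative URLs are checked before the single-slash case.
        LinkKind::External
    } else if href.starts_with('/') || href.starts_with("./") || href.starts_with("../") {
        LinkKind::Internal
    } else {
        LinkKind::Other
    }
}
```

Under these assumed rules a value such as `javascript:void(0)` falls through to `Other`.
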
§extract_images: bool

Extract image elements and data URIs.

When enabled, collects all image elements with:

  • Source URL or data URI
  • Alt text for accessibility
  • Title attribute
  • Dimensions (width, height) if available
  • Automatic image type classification (data URI, external, relative, inline SVG)
  • Additional custom attributes
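
As a rough illustration of the image categories above, the sketch below classifies an image by its source. `ImageKind` and `classify_image` are hypothetical names, and treating a missing source as an inline SVG element is an assumption of this sketch, not documented behavior.

```rust
/// Hypothetical mirror of the image categories named above.
#[derive(Debug, PartialEq, Eq)]
enum ImageKind {
    DataUri,
    External,
    Relative,
    InlineSvg,
}

/// Classify an image by its source. `None` stands for an inline <svg>
/// element, which has no source URL (an assumption of this sketch).
fn classify_image(src: Option<&str>) -> ImageKind {
    match src.map(str::trim) {
        None => ImageKind::InlineSvg,
        Some(s) if s.starts_with("data:") => ImageKind::DataUri,
        Some(s)
            if s.starts_with("http://")
                || s.starts_with("https://")
                || s.starts_with("//") =>
        {
            ImageKind::External
        }
        // Everything else is treated as a path relative to the document.
        Some(_) => ImageKind::Relative,
    }
}
```
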
§extract_structured_data: bool

Extract structured data (JSON-LD, Microdata, RDFa).

When enabled, collects machine-readable structured data including:

  • JSON-LD script blocks with schema detection
  • Microdata attributes (itemscope, itemtype, itemprop)
  • RDFa markup
  • Extracted schema type if detectable
§max_structured_data_size: usize

Maximum total size of structured data to collect (bytes).

Guards against memory exhaustion from malformed or adversarial documents that contain excessively large structured data blocks. Once the accumulated size of structured data exceeds this limit, further collection stops.

Default: 1_000_000 bytes (1 MB).
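
One way to picture the limit is as a running byte budget. The following is a minimal sketch, not the crate's implementation: `StructuredDataCollector` and its methods are hypothetical, and whether the real extractor drops the overflowing block or truncates it is not stated here, so this sketch simply drops it and stops.

```rust
/// Hypothetical collector that enforces a total byte budget.
struct StructuredDataCollector {
    max_total_bytes: usize,
    used_bytes: usize,
    exhausted: bool,
    blocks: Vec<String>,
}

impl StructuredDataCollector {
    fn new(max_total_bytes: usize) -> Self {
        Self {
            max_total_bytes,
            used_bytes: 0,
            exhausted: false,
            blocks: Vec::new(),
        }
    }

    /// Try to store one block. Returns false once the budget would be
    /// exceeded; after that point every later block is rejected as well.
    fn push(&mut self, block: &str) -> bool {
        if self.exhausted {
            return false;
        }
        let next = self.used_bytes.saturating_add(block.len());
        if next > self.max_total_bytes {
            self.exhausted = true;
            return false;
        }
        self.used_bytes = next;
        self.blocks.push(block.to_owned());
        true
    }
}
```

With a budget of 10 bytes, two 5-byte blocks are accepted and any block after that is rejected, however small.
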

Implementations§

impl MetadataConfig

pub const fn any_enabled(&self) -> bool

Check if any metadata extraction is enabled.

Returns true if at least one extraction category is enabled, false if all are disabled. Use it as an early-exit check so that metadata collection can be skipped when the application does not need it.

§Returns

true if any of the extraction flags are enabled, false if all are disabled.

§Examples
// All enabled
let config = MetadataConfig::default();
assert!(config.any_enabled());

// Selectively enabled
let config = MetadataConfig {
    extract_headers: true,
    extract_document: false,
    extract_links: false,
    extract_images: false,
    extract_structured_data: false,
    max_structured_data_size: 1_000_000,
};
assert!(config.any_enabled());

// All disabled
let config = MetadataConfig {
    extract_document: false,
    extract_headers: false,
    extract_links: false,
    extract_images: false,
    extract_structured_data: false,
    max_structured_data_size: 1_000_000,
};
assert!(!config.any_enabled());
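
The early-exit pattern described above can be sketched with a local stand-in for the struct. The `MetadataConfig` below only mirrors the documented fields, the body of `any_enabled` is an assumed implementation (a logical OR over the five flags), and `enabled_categories` is a hypothetical caller.

```rust
/// Local stand-in that mirrors the documented fields.
struct MetadataConfig {
    extract_document: bool,
    extract_headers: bool,
    extract_links: bool,
    extract_images: bool,
    extract_structured_data: bool,
    #[allow(dead_code)]
    max_structured_data_size: usize,
}

impl MetadataConfig {
    /// Assumed implementation: OR of the five boolean flags.
    const fn any_enabled(&self) -> bool {
        self.extract_document
            || self.extract_headers
            || self.extract_links
            || self.extract_images
            || self.extract_structured_data
    }
}

/// Hypothetical caller: bail out before doing any per-category work.
fn enabled_categories(config: &MetadataConfig) -> Vec<&'static str> {
    if !config.any_enabled() {
        return Vec::new(); // early exit, nothing to collect
    }
    let mut names = Vec::new();
    if config.extract_document { names.push("document"); }
    if config.extract_headers { names.push("headers"); }
    if config.extract_links { names.push("links"); }
    if config.extract_images { names.push("images"); }
    if config.extract_structured_data { names.push("structured_data"); }
    names
}
```
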

pub const fn apply_update(&mut self, update: MetadataConfigUpdate)

Apply a partial update to this metadata configuration.

Any specified fields in the update (Some values) will override the current values. Unspecified fields (None) are left unchanged. This allows selective modification of configuration without affecting unrelated settings.

§Arguments
  • update - Partial metadata config update with fields to override
§Examples
let mut config = MetadataConfig::default();
// config starts with all extraction enabled

let update = MetadataConfigUpdate {
    extract_document: Some(false),
    extract_images: Some(false),
    // All other fields are None, so they won't change
    ..Default::default()
};

config.apply_update(update);

assert!(!config.extract_document);
assert!(!config.extract_images);
assert!(config.extract_headers);  // Unchanged
assert!(config.extract_links);    // Unchanged
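
The override semantics (Some replaces, None keeps) can be sketched with local stand-ins for both types. The field names follow this page; the method body is an assumed implementation, not the crate's source.

```rust
/// Local stand-in mirroring the documented fields of `MetadataConfig`.
#[derive(Debug, Clone, PartialEq)]
struct MetadataConfig {
    extract_document: bool,
    extract_headers: bool,
    extract_links: bool,
    extract_images: bool,
    extract_structured_data: bool,
    max_structured_data_size: usize,
}

impl Default for MetadataConfig {
    /// Documented defaults: everything enabled, 1 MB structured data limit.
    fn default() -> Self {
        Self {
            extract_document: true,
            extract_headers: true,
            extract_links: true,
            extract_images: true,
            extract_structured_data: true,
            max_structured_data_size: 1_000_000,
        }
    }
}

/// Local stand-in for `MetadataConfigUpdate`: every field is optional.
#[derive(Debug, Default)]
struct MetadataConfigUpdate {
    extract_document: Option<bool>,
    extract_headers: Option<bool>,
    extract_links: Option<bool>,
    extract_images: Option<bool>,
    extract_structured_data: Option<bool>,
    max_structured_data_size: Option<usize>,
}

impl MetadataConfig {
    /// Assumed implementation: overwrite a field only when the update has Some.
    fn apply_update(&mut self, update: MetadataConfigUpdate) {
        if let Some(v) = update.extract_document { self.extract_document = v; }
        if let Some(v) = update.extract_headers { self.extract_headers = v; }
        if let Some(v) = update.extract_links { self.extract_links = v; }
        if let Some(v) = update.extract_images { self.extract_images = v; }
        if let Some(v) = update.extract_structured_data { self.extract_structured_data = v; }
        if let Some(v) = update.max_structured_data_size { self.max_structured_data_size = v; }
    }
}
```

Applying an update whose fields are all None leaves the configuration unchanged, which makes the operation safe to call unconditionally.
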

pub fn from_update(update: MetadataConfigUpdate) -> Self

Create new metadata configuration from a partial update.

Creates a new MetadataConfig struct with defaults, then applies the update. Fields not specified in the update (None) keep their default values. This is a convenience method for constructing a configuration from a partial specification without needing to explicitly call .default() first.

§Arguments
  • update - Partial metadata config update with fields to set
§Returns

New MetadataConfig with specified updates applied to defaults

§Examples
let update = MetadataConfigUpdate {
    extract_document: Some(false),
    extract_headers: Some(true),
    extract_links: Some(true),
    extract_images: None,  // Will use default (true)
    extract_structured_data: None,  // Will use default (true)
    max_structured_data_size: None,  // Will use default (1MB)
};

let config = MetadataConfig::from_update(update);

assert!(!config.extract_document);
assert!(config.extract_headers);
assert!(config.extract_links);
assert!(config.extract_images);  // Default
assert!(config.extract_structured_data);  // Default

Trait Implementations§

impl Clone for MetadataConfig

fn clone(&self) -> MetadataConfig

Returns a duplicate of the value.

fn clone_from(&mut self, source: &Self)

Performs copy-assignment from source.

impl Debug for MetadataConfig

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Formats the value using the given formatter.

impl Default for MetadataConfig

fn default() -> Self

Create default metadata configuration.

Defaults to extracting all metadata types with 1MB limit on structured data.

impl<'de> Deserialize<'de> for MetadataConfig

fn deserialize<__D>(__deserializer: __D) -> Result<Self, __D::Error>
where __D: Deserializer<'de>,

Deserialize this value from the given Serde deserializer.

impl From<MetadataConfigUpdate> for MetadataConfig

fn from(update: MetadataConfigUpdate) -> Self

Converts to this type from the input type.
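
Because of this impl, an update can be turned into a full configuration with `.into()`. The sketch below uses local stand-ins trimmed to two fields for brevity, and it assumes that `from` simply delegates to `from_update`; that delegation is a guess based on the matching signatures, not something this page states.

```rust
/// Trimmed stand-in for `MetadataConfigUpdate` (two fields only).
#[derive(Debug, Default)]
struct MetadataConfigUpdate {
    extract_images: Option<bool>,
    max_structured_data_size: Option<usize>,
}

/// Trimmed stand-in for `MetadataConfig` (two fields only).
#[derive(Debug, PartialEq)]
struct MetadataConfig {
    extract_images: bool,
    max_structured_data_size: usize,
}

impl Default for MetadataConfig {
    fn default() -> Self {
        Self { extract_images: true, max_structured_data_size: 1_000_000 }
    }
}

impl MetadataConfig {
    /// Start from the defaults and override whatever the update specifies.
    fn from_update(update: MetadataConfigUpdate) -> Self {
        let defaults = Self::default();
        Self {
            extract_images: update.extract_images.unwrap_or(defaults.extract_images),
            max_structured_data_size: update
                .max_structured_data_size
                .unwrap_or(defaults.max_structured_data_size),
        }
    }
}

impl From<MetadataConfigUpdate> for MetadataConfig {
    /// Assumption: the conversion is equivalent to `from_update`.
    fn from(update: MetadataConfigUpdate) -> Self {
        Self::from_update(update)
    }
}
```

An empty update converts to the default configuration, so `MetadataConfigUpdate::default().into()` and `MetadataConfig::default()` are interchangeable in this sketch.
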

impl Serialize for MetadataConfig

fn serialize<__S>(&self, __serializer: __S) -> Result<__S::Ok, __S::Error>
where __S: Serializer,

Serialize this value into the given Serde serializer.

Blanket Implementations§

impl<T> Any for T
where T: 'static + ?Sized,

fn type_id(&self) -> TypeId

Gets the TypeId of self.

impl<T> Borrow<T> for T
where T: ?Sized,

fn borrow(&self) -> &T

Immutably borrows from an owned value.

impl<T> BorrowMut<T> for T
where T: ?Sized,

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value.

impl<T> CloneToUninit for T
where T: Clone,

unsafe fn clone_to_uninit(&self, dest: *mut u8)

🔬This is a nightly-only experimental API. (clone_to_uninit)

Performs copy-assignment from self to dest.

impl<T> From<T> for T

fn from(t: T) -> T

Returns the argument unchanged.

impl<T, U> Into<U> for T
where U: From<T>,

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

impl<T> ToOwned for T
where T: Clone,

type Owned = T

The resulting type after obtaining ownership.

fn to_owned(&self) -> T

Creates owned data from borrowed data, usually by cloning.

fn clone_into(&self, target: &mut T)

Uses borrowed data to replace owned data, usually by cloning.

impl<T, U> TryFrom<U> for T
where U: Into<T>,

type Error = Infallible

The type returned in the event of a conversion error.

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.

impl<T> DeserializeOwned for T
where T: for<'de> Deserialize<'de>,