pub struct MetadataConfig {
pub extract_document: bool,
pub extract_headers: bool,
pub extract_links: bool,
pub extract_images: bool,
pub extract_structured_data: bool,
pub max_structured_data_size: usize,
}
Configuration for metadata extraction granularity.
Controls which metadata categories are extracted from an HTML document during HTML-to-Markdown conversion, and caps the total size of structured data collected. Each category can be enabled or disabled independently.
§Fields
- extract_document: Enable document-level metadata extraction (title, description, author, Open Graph, Twitter Card, etc.)
- extract_headers: Enable heading element extraction (h1-h6) with hierarchy tracking
- extract_links: Enable anchor element extraction with link type classification
- extract_images: Enable image element extraction with source and dimension metadata
- extract_structured_data: Enable structured data extraction (JSON-LD, Microdata, RDFa)
- max_structured_data_size: Safety limit on total structured data size in bytes
§Examples
let config = MetadataConfig {
extract_document: true,
extract_headers: true,
extract_links: true,
extract_images: true,
extract_structured_data: true,
max_structured_data_size: 1_000_000,
};
assert!(config.extract_headers);
Fields§
extract_document: bool
Extract document-level metadata (title, description, author, etc.).
When enabled, collects metadata from the <head> section, including:
- <title> element content
- <meta name="description"> and other standard meta tags
- Open Graph (og:*) properties for social media optimization
- Twitter Card (twitter:*) properties
- Language and text direction attributes
- Canonical URL and base href references
extract_headers: bool
Extract h1-h6 header elements and their hierarchy.
When enabled, collects all heading elements with:
- Header level (1-6)
- Text content (normalized)
- HTML id attribute if present
- Document tree depth for hierarchy tracking
- Byte offset in original HTML for positioning
extract_links: bool
Extract anchor (a) elements as links with type classification.
When enabled, collects all hyperlinks with:
- href attribute value
- Link text content
- Title attribute (tooltip text)
- Automatic link type classification (anchor, internal, external, email, phone, other)
- Rel attribute values
- Additional custom attributes
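The link type classification listed above can be pictured as a prefix check on the href value. The following is a minimal sketch only: the LinkKind enum, the classify_href function, and the prefix rules are all assumptions made for illustration, and the crate's actual type names and rules may differ.

```rust
/// Hypothetical link categories mirroring the list above; the crate's real
/// enum may use different names.
#[derive(Debug, PartialEq, Eq)]
enum LinkKind {
    Anchor,
    Internal,
    External,
    Email,
    Phone,
    Other,
}

/// Classify an href by its prefix. This heuristic is an assumption, not the
/// crate's documented rule set.
fn classify_href(href: &str) -> LinkKind {
    let lower = href.trim().to_ascii_lowercase();
    if lower.starts_with('#') {
        LinkKind::Anchor
    } else if lower.starts_with("mailto:") {
        LinkKind::Email
    } else if lower.starts_with("tel:") {
        LinkKind::Phone
    } else if lower.starts_with("http://")
        || lower.starts_with("https://")
        || lower.starts_with("//")
    {
        // Absolute and protocol-relative URLs are treated as external here.
        LinkKind::External
    } else if lower.starts_with('/') || lower.starts_with("./") || lower.starts_with("../") {
        LinkKind::Internal
    } else {
        LinkKind::Other
    }
}

fn main() {
    assert_eq!(classify_href("#intro"), LinkKind::Anchor);
    assert_eq!(classify_href("mailto:team@example.com"), LinkKind::Email);
    assert_eq!(classify_href("/docs/start"), LinkKind::Internal);
    assert_eq!(classify_href("https://example.com"), LinkKind::External);
}
```

A real classifier would likely compare hosts against the document's base URL before deciding between internal and external; the sketch skips that step.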
extract_images: bool
Extract image elements and data URIs.
When enabled, collects all image elements with:
- Source URL or data URI
- Alt text for accessibility
- Title attribute
- Dimensions (width, height) if available
- Automatic image type classification (data URI, external, relative, inline SVG)
- Additional custom attributes
extract_structured_data: bool
Extract structured data (JSON-LD, Microdata, RDFa).
When enabled, collects machine-readable structured data including:
- JSON-LD script blocks with schema detection
- Microdata attributes (itemscope, itemtype, itemprop)
- RDFa markup
- Extracted schema type if detectable
max_structured_data_size: usize
Maximum total size of structured data to collect (bytes).
Prevents memory exhaustion attacks on malformed or adversarial documents
containing excessively large structured data blocks. When the accumulated
size of structured data exceeds this limit, further collection stops.
Default: 1_000_000 bytes (1 MB)
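One way to picture the size limit is as a byte budget shared by all structured data blocks. The sketch below is an illustration under assumptions: StructuredDataCollector, its fields, and the rule that collection stops permanently once a block does not fit are hypothetical and not taken from the crate's source.

```rust
/// Hypothetical collector; names and behavior are illustrative only.
struct StructuredDataCollector {
    max_total_bytes: usize,
    used_bytes: usize,
    stopped: bool,
    blocks: Vec<String>,
}

impl StructuredDataCollector {
    fn new(max_total_bytes: usize) -> Self {
        Self {
            max_total_bytes,
            used_bytes: 0,
            stopped: false,
            blocks: Vec::new(),
        }
    }

    /// Keep `block` if it fits in the remaining budget and return whether it
    /// was kept. After the first block that does not fit, every later block
    /// is rejected as well (assumed reading of "further collection stops").
    fn push(&mut self, block: &str) -> bool {
        if self.stopped {
            return false;
        }
        // used_bytes never exceeds max_total_bytes, so this cannot underflow.
        let remaining = self.max_total_bytes - self.used_bytes;
        if block.len() > remaining {
            self.stopped = true;
            return false;
        }
        self.used_bytes += block.len();
        self.blocks.push(block.to_owned());
        true
    }
}

fn main() {
    let mut collector = StructuredDataCollector::new(32);
    assert!(collector.push(r#"{"@type":"Article"}"#));
    // The second block would exceed the 32-byte budget, so it is dropped.
    assert!(!collector.push(r#"{"@type":"Organization"}"#));
    assert_eq!(collector.blocks.len(), 1);
}
```

Because the budget is checked before the block is copied, memory use stays bounded by the limit regardless of how large the rejected block is.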
Implementations§
impl MetadataConfig
pub const fn any_enabled(&self) -> bool
Check if any metadata extraction is enabled.
Returns true if at least one extraction category is enabled, false if all are disabled.
This is useful for early exit optimization when the application doesn’t need metadata.
§Returns
true if any of the extraction flags are enabled, false if all are disabled.
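The early-exit use mentioned above might look like the following sketch. The struct here is a local stand-in that copies the documented fields so the example compiles without the crate; the body of any_enabled is an assumed implementation, and collect_metadata is a hypothetical caller, not a crate function.

```rust
/// Local stand-in for the documented struct (same field names).
#[allow(dead_code)]
#[derive(Debug, Clone)]
struct MetadataConfig {
    extract_document: bool,
    extract_headers: bool,
    extract_links: bool,
    extract_images: bool,
    extract_structured_data: bool,
    max_structured_data_size: usize,
}

impl MetadataConfig {
    /// Assumed implementation: a logical OR over the five extraction flags.
    const fn any_enabled(&self) -> bool {
        self.extract_document
            || self.extract_headers
            || self.extract_links
            || self.extract_images
            || self.extract_structured_data
    }
}

/// Hypothetical caller: skips all work when no category is requested.
/// Returning the input length stands in for real extraction.
fn collect_metadata(html: &str, config: &MetadataConfig) -> Option<usize> {
    if !config.any_enabled() {
        return None; // early exit, nothing to extract
    }
    Some(html.len())
}

fn main() {
    let off = MetadataConfig {
        extract_document: false,
        extract_headers: false,
        extract_links: false,
        extract_images: false,
        extract_structured_data: false,
        max_structured_data_size: 1_000_000,
    };
    assert!(collect_metadata("<p>hi</p>", &off).is_none());

    let links_only = MetadataConfig { extract_links: true, ..off.clone() };
    assert!(collect_metadata("<p>hi</p>", &links_only).is_some());
}
```

Note that max_structured_data_size plays no part in the check: a non-zero limit with every flag off still counts as nothing enabled.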
§Examples
// All enabled
let config = MetadataConfig::default();
assert!(config.any_enabled());
// Selectively enabled
let config = MetadataConfig {
extract_headers: true,
extract_document: false,
extract_links: false,
extract_images: false,
extract_structured_data: false,
max_structured_data_size: 1_000_000,
};
assert!(config.any_enabled());
// All disabled
let config = MetadataConfig {
extract_document: false,
extract_headers: false,
extract_links: false,
extract_images: false,
extract_structured_data: false,
max_structured_data_size: 1_000_000,
};
assert!(!config.any_enabled());
pub const fn apply_update(&mut self, update: MetadataConfigUpdate)
Apply a partial update to this metadata configuration.
Any specified fields in the update (Some values) will override the current values. Unspecified fields (None) are left unchanged. This allows selective modification of configuration without affecting unrelated settings.
§Arguments
update - Partial metadata config update with fields to override
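The Some-overrides, None-keeps rule described above is a common partial-update pattern. The sketch below shows one way it could be written. Both structs are local stand-ins that copy the documented field names so the example compiles on its own, the method is a plain fn rather than a const fn to stay compatible with older compilers, and the method body is an assumption rather than the crate's source.

```rust
/// Local stand-in for the documented config struct.
#[derive(Debug, Clone, PartialEq)]
struct MetadataConfig {
    extract_document: bool,
    extract_headers: bool,
    extract_links: bool,
    extract_images: bool,
    extract_structured_data: bool,
    max_structured_data_size: usize,
}

impl Default for MetadataConfig {
    // Defaults as documented: every category on, 1_000_000-byte limit.
    fn default() -> Self {
        Self {
            extract_document: true,
            extract_headers: true,
            extract_links: true,
            extract_images: true,
            extract_structured_data: true,
            max_structured_data_size: 1_000_000,
        }
    }
}

/// Stand-in for MetadataConfigUpdate: the same fields, each wrapped in Option.
#[derive(Debug, Default)]
struct MetadataConfigUpdate {
    extract_document: Option<bool>,
    extract_headers: Option<bool>,
    extract_links: Option<bool>,
    extract_images: Option<bool>,
    extract_structured_data: Option<bool>,
    max_structured_data_size: Option<usize>,
}

impl MetadataConfig {
    /// Overwrite only the fields the update sets; None leaves a field as it is.
    fn apply_update(&mut self, update: MetadataConfigUpdate) {
        if let Some(v) = update.extract_document { self.extract_document = v; }
        if let Some(v) = update.extract_headers { self.extract_headers = v; }
        if let Some(v) = update.extract_links { self.extract_links = v; }
        if let Some(v) = update.extract_images { self.extract_images = v; }
        if let Some(v) = update.extract_structured_data { self.extract_structured_data = v; }
        if let Some(v) = update.max_structured_data_size { self.max_structured_data_size = v; }
    }
}

fn main() {
    let mut config = MetadataConfig::default();
    config.apply_update(MetadataConfigUpdate {
        extract_images: Some(false),
        max_structured_data_size: Some(4096),
        ..Default::default()
    });
    assert!(!config.extract_images);
    assert!(config.extract_links); // untouched field keeps its value
    assert_eq!(config.max_structured_data_size, 4096);
}
```

Under this pattern an all-None update is a no-op, which makes it safe to apply updates that originate from optional user input.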
§Examples
let mut config = MetadataConfig::default();
// config starts with all extraction enabled
let update = MetadataConfigUpdate {
extract_document: Some(false),
extract_images: Some(false),
// All other fields are None, so they won't change
..Default::default()
};
config.apply_update(update);
assert!(!config.extract_document);
assert!(!config.extract_images);
assert!(config.extract_headers); // Unchanged
assert!(config.extract_links); // Unchanged
pub fn from_update(update: MetadataConfigUpdate) -> Self
Create new metadata configuration from a partial update.
Creates a new MetadataConfig struct with defaults, then applies the update.
Fields not specified in the update (None) keep their default values.
This is a convenience method for constructing a configuration from a partial specification
without needing to explicitly call .default() first.
§Arguments
update - Partial metadata config update with fields to set
§Returns
New MetadataConfig with specified updates applied to defaults
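The "defaults, then apply the update" construction described above can be sketched as follows. To keep it short, the example uses hypothetical Config and ConfigUpdate types trimmed to two of the documented fields; the method bodies are assumptions about how such a constructor could be written, not the crate's source.

```rust
/// Hypothetical two-field stand-in for the config struct.
#[derive(Debug, Clone, PartialEq)]
struct Config {
    extract_document: bool,
    max_structured_data_size: usize,
}

impl Default for Config {
    // Documented defaults: extraction on, 1_000_000-byte limit.
    fn default() -> Self {
        Self { extract_document: true, max_structured_data_size: 1_000_000 }
    }
}

/// Hypothetical partial update with matching Option fields.
#[derive(Debug, Default)]
struct ConfigUpdate {
    extract_document: Option<bool>,
    max_structured_data_size: Option<usize>,
}

impl Config {
    /// Some(value) overrides a field, None keeps it.
    fn apply_update(&mut self, update: ConfigUpdate) {
        if let Some(v) = update.extract_document {
            self.extract_document = v;
        }
        if let Some(v) = update.max_structured_data_size {
            self.max_structured_data_size = v;
        }
    }

    /// Start from the defaults, then layer the update on top.
    fn from_update(update: ConfigUpdate) -> Self {
        let mut config = Self::default();
        config.apply_update(update);
        config
    }
}

fn main() {
    let config = Config::from_update(ConfigUpdate {
        extract_document: Some(false),
        max_structured_data_size: None,
    });
    assert!(!config.extract_document);
    assert_eq!(config.max_structured_data_size, 1_000_000); // default kept
}
```

Building the constructor on top of the update method keeps the override rule in one place, so the two cannot drift apart.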
§Examples
let update = MetadataConfigUpdate {
extract_document: Some(false),
extract_headers: Some(true),
extract_links: Some(true),
extract_images: None, // Will use default (true)
extract_structured_data: None, // Will use default (true)
max_structured_data_size: None, // Will use default (1MB)
};
let config = MetadataConfig::from_update(update);
assert!(!config.extract_document);
assert!(config.extract_headers);
assert!(config.extract_links);
assert!(config.extract_images); // Default
assert!(config.extract_structured_data); // Default
Trait Implementations§
impl Clone for MetadataConfig
fn clone(&self) -> MetadataConfig
fn clone_from(&mut self, source: &Self)
Performs copy-assignment from source.