Crate dom_content_extraction

§dom-content-extraction

A library for extracting main content from HTML documents using text density analysis. This crate implements the Content Extraction via Text Density (CETD) algorithm described in the paper by Fei Sun, Dandan Song and Lejian Liao.

§Overview

Web pages typically contain various elements beyond the main content, such as navigation menus, advertisements, sidebars, and footers. This library helps identify and extract the main content by analyzing text density patterns within the HTML document structure.

The core concept is that content-rich sections of a webpage tend to have different text density characteristics compared to navigational or peripheral elements. By building a density tree and applying composite text density calculations, we can identify and extract the main content regions.
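
As a rough illustration of that idea, the sketch below computes the simplest form of text density, characters per tag, for two hand-counted regions. The `Region` struct, the `text_density` function, and the counts are hypothetical and are not part of this crate's API.

```rust
/// Hypothetical, hand-counted metrics for one region of a page (not a crate type).
struct Region {
    name: &'static str,
    char_count: u32,
    tag_count: u32,
}

/// Characters per tag. A tag count of zero is treated as one to avoid dividing by zero.
fn text_density(region: &Region) -> f32 {
    region.char_count as f32 / region.tag_count.max(1) as f32
}

fn main() {
    let regions = [
        Region { name: "article", char_count: 1800, tag_count: 12 },
        Region { name: "nav", char_count: 120, tag_count: 24 },
    ];
    for region in &regions {
        // The article region scores far higher than the navigation region.
        println!("{}: {:.1} chars per tag", region.name, text_density(region));
    }
}
```

A region with long runs of text and few tags scores high, while a link-heavy menu with many tags and little text scores low.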

§Main Components

DensityTree: The primary structure representing text density analysis of an HTML document
DensityNode: Individual nodes in the density tree containing text metrics
Helper functions for node text extraction and link analysis

§Basic Usage

use dom_content_extraction::{DensityTree, DomExtractionError, scraper::Html};

// `main` returns the crate's error type so that `?` can be used below
fn main() -> Result<(), DomExtractionError> {
    // Parse your HTML document
    let html_content = "<html><body><article>Main content</article></body></html>";
    let document = Html::parse_document(html_content);

    // Build the density tree from the parsed document
    let mut dtree = DensityTree::from_document(&document)?;

    // Calculate density sums for better content identification
    dtree.calculate_density_sum()?;

    // Extract the main content
    let content = dtree.extract_content(&document)?;
    println!("{}", content);
    Ok(())
}

§Advanced Usage

For more precise control, you can work directly with the density-sorted nodes:

use dom_content_extraction::{DensityTree, DomExtractionError, get_node_text, scraper::Html};

// `main` returns the crate's error type so that `?` can be used below
fn main() -> Result<(), DomExtractionError> {
    let document = Html::parse_document("<html>...</html>");
    let dtree = DensityTree::from_document(&document)?;

    // Get nodes sorted by density
    let sorted_nodes = dtree.sorted_nodes();

    // Process the three densest nodes (iterating from the end of the sorted list)
    for node in sorted_nodes.iter().rev().take(3) {
        println!("Node density: {}", node.density);
        let text = get_node_text(node.node_id, &document)?;
        println!("Node content: {}", text);
    }
    Ok(())
}

§Algorithm Details

The content extraction process involves several steps:

1. Building a density tree that mirrors the HTML document structure
2. Calculating text density metrics for each node:
   - Character count
   - Tag count
   - Link character count
   - Link tag count
3. Computing composite text density using a formula that considers:
   - Text to tag ratio
   - Link density
   - Content distribution
4. Identifying high-density regions that likely contain main content
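
The composite density step can be sketched as follows. This is a minimal, standalone reading of the composite text density formula from the CETD paper, built on the four per-node metrics listed above. The `Metrics` struct and the helper names are hypothetical, the handling of zero counts is an assumption, and the crate's own implementation may differ in detail.

```rust
/// Hypothetical per-node counts (not a crate type), stored as f64 for the arithmetic below.
#[derive(Clone, Copy)]
struct Metrics {
    char_count: f64,
    tag_count: f64,
    link_char_count: f64,
    link_tag_count: f64,
}

/// Assumption: a zero count is replaced by one so that ratios stay finite.
fn non_zero(x: f64) -> f64 {
    if x == 0.0 { 1.0 } else { x }
}

/// Composite text density of `node`, using `body` for the page-wide link ratio.
fn composite_text_density(node: Metrics, body: Metrics) -> f64 {
    let c = node.char_count;
    if c == 0.0 {
        return 0.0; // no text, no density
    }
    let t = non_zero(node.tag_count);
    let lc = non_zero(node.link_char_count);
    let lt = non_zero(node.link_tag_count);
    let non_link_chars = non_zero(node.char_count - node.link_char_count);
    let body_link_ratio = body.link_char_count / non_zero(body.char_count);

    // Logarithm base grows with link density, which shrinks the final score.
    let base = (c / non_link_chars * lc + body_link_ratio * c + std::f64::consts::E).ln();
    // Argument rewards text that is long relative to its links and tags.
    let arg = (c / lc) * (t / lt);
    (c / t) * arg.log(base)
}

fn main() {
    let body = Metrics { char_count: 2400.0, tag_count: 60.0, link_char_count: 300.0, link_tag_count: 30.0 };
    let article = Metrics { char_count: 1800.0, tag_count: 12.0, link_char_count: 40.0, link_tag_count: 2.0 };
    let nav = Metrics { char_count: 120.0, tag_count: 24.0, link_char_count: 110.0, link_tag_count: 20.0 };
    println!("article: {:.2}", composite_text_density(article, body));
    println!("nav:     {:.2}", composite_text_density(nav, body));
}
```

With these made-up counts the article node scores well above the navigation node, because almost all of the navigation text sits inside links.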

§Error Handling

The library uses custom error types to handle various failure cases:

DomExtractionError::NoBodyElement: When the HTML document lacks a body tag
DomExtractionError::NodeAccessError: When a node cannot be accessed in the tree

§Performance Considerations

The library performs a full traversal of the HTML document to build the density tree
Memory usage scales with document size and complexity
Text density calculations are performed once and cached
Node sorting operations are O(n log n) where n is the number of content nodes
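
To make the sorting cost concrete, here is a small standalone sketch that sorts nodes by a floating-point density and then takes the densest ones from the end, in the same way the Advanced Usage example iterates with `.iter().rev().take(3)`. `ToyNode` and `densest_labels` are hypothetical stand-ins, and the ascending sort order is an assumption, not a statement about the crate's internals.

```rust
/// Hypothetical stand-in for a density node, with only what sorting needs.
#[derive(Debug)]
struct ToyNode {
    label: &'static str,
    density: f32,
}

/// Sort ascending by density (a comparison sort, O(n log n)),
/// then return the labels of the `k` densest nodes, densest first.
fn densest_labels(nodes: &mut [ToyNode], k: usize) -> Vec<&'static str> {
    // `total_cmp` gives a total order on floats, so NaN values cannot panic the sort.
    nodes.sort_by(|a, b| a.density.total_cmp(&b.density));
    nodes.iter().rev().take(k).map(|n| n.label).collect()
}

fn main() {
    let mut nodes = vec![
        ToyNode { label: "nav", density: 5.0 },
        ToyNode { label: "article", density: 150.0 },
        ToyNode { label: "footer", density: 2.5 },
    ];
    println!("{:?}", densest_labels(&mut nodes, 2)); // prints ["article", "nav"]
}
```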

§Feature Flags

Currently, no optional features are provided. All functionality is included in the default build.

§Examples

More examples can be found in the examples/ directory of the source repository:

check.rs: Basic content extraction from test documents
ce_score.rs: Evaluation tool for measuring extraction accuracy

§References

Sun, F., Song, D., & Liao, L. (2011). “DOM Based Content Extraction via Text Density”. In Proceedings of SIGIR 2011.
CleanEval dataset: https://sigwac.org.uk/cleaneval/

Re-exports§

pub use cetd::DensityNode;
pub use cetd::DensityTree;
pub use utils::get_node_links;
pub use utils::get_node_text;
pub use scraper;

Modules§

cetd
tree
unicode
utils

Enums§

DomExtractionError

Functions§

get_content
