§dom-content-extraction
A library for extracting main content from HTML documents using text density analysis. This crate implements the Content Extraction via Text Density (CETD) algorithm described in the paper by Fei Sun, Dandan Song and Lejian Liao.
§Overview
Web pages typically contain various elements beyond the main content, such as navigation menus, advertisements, sidebars, and footers. This library helps identify and extract the main content by analyzing text density patterns within the HTML document structure.
The core concept is that content-rich sections of a webpage tend to have different text density characteristics compared to navigational or peripheral elements. By building a density tree and applying composite text density calculations, we can identify and extract the main content regions.
§Main Components
- DensityTree: The primary structure representing the text density analysis of an HTML document
- DensityNode: Individual nodes in the density tree containing text metrics
- Helper functions for node text extraction and link analysis
§Basic Usage
use dom_content_extraction::{DensityTree, scraper::Html};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Parse your HTML document
    let html_content = "<html><body><article>Main content</article></body></html>";
    let document = Html::parse_document(html_content);
    // Build the density tree for the document
    let mut dtree = DensityTree::from_document(&document)?;
    // Calculate density sums for better content identification
    dtree.calculate_density_sum()?;
    // Extract the main content
    let content = dtree.extract_content(&document)?;
    println!("{content}");
    Ok(())
}
§Advanced Usage
For more precise control, you can work directly with the density-sorted nodes:
use dom_content_extraction::{DensityTree, get_node_text, scraper::Html};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let document = Html::parse_document("<html>...</html>");
    let dtree = DensityTree::from_document(&document)?;
    // Get nodes sorted by density (ascending)
    let sorted_nodes = dtree.sorted_nodes();
    // Process the three densest nodes
    for node in sorted_nodes.iter().rev().take(3) {
        println!("Node density: {}", node.density);
        let text = get_node_text(node.node_id, &document)?;
        println!("Node content: {text}");
    }
    Ok(())
}
§Algorithm Details
The content extraction process involves several steps:
- Building a density tree that mirrors the HTML document structure
- Calculating text density metrics for each node:
  - Character count
  - Tag count
  - Link character count
  - Link tag count
- Computing a composite text density using a formula that considers:
  - Text-to-tag ratio
  - Link density
  - Content distribution
- Identifying high-density regions that likely contain the main content
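To make the intuition behind these steps concrete, here is a minimal, self-contained sketch of per-node density metrics. It is not this crate's implementation: the names (`NodeMetrics`, `text_density`, `composite_density`) are illustrative, and the composite measure below is a simplified link-penalized density in the spirit of CETD, not the paper's exact formula.

```rust
/// Simplified per-node text metrics, mirroring the four counts listed above.
/// Illustrative only; not part of the dom-content-extraction API.
#[derive(Debug, Default)]
struct NodeMetrics {
    char_count: f64,      // characters of text under the node
    tag_count: f64,       // tags under the node
    link_char_count: f64, // characters that sit inside links
    link_tag_count: f64,  // link tags under the node
}

impl NodeMetrics {
    /// Plain text density: characters per tag.
    fn text_density(&self) -> f64 {
        self.char_count / self.tag_count.max(1.0)
    }

    /// Link-penalized density: nodes whose text is mostly link text
    /// (menus, footers) score lower than long-form content.
    fn composite_density(&self) -> f64 {
        let non_link_chars = (self.char_count - self.link_char_count).max(1.0);
        self.text_density() * (non_link_chars / self.char_count.max(1.0))
    }
}

fn main() {
    // An article-like node: lots of text, few tags, few link characters.
    let article = NodeMetrics {
        char_count: 1200.0,
        tag_count: 10.0,
        link_char_count: 50.0,
        link_tag_count: 2.0,
    };
    // A navbar-like node: little text, many tags, almost all of it links.
    let navbar = NodeMetrics {
        char_count: 200.0,
        tag_count: 40.0,
        link_char_count: 190.0,
        link_tag_count: 20.0,
    };
    assert!(article.composite_density() > navbar.composite_density());
    println!(
        "article: {:.1}, navbar: {:.2}",
        article.composite_density(),
        navbar.composite_density()
    );
}
```

The penalty term is why a link-heavy navigation block loses to an article even when both contain text: its raw chars-per-tag ratio is already low, and the link fraction shrinks it further.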
§Error Handling
The library uses custom error types to handle various failure cases:
- DomExtractionError::NoBodyElement: When the HTML document lacks a body tag
- DomExtractionError::NodeAccessError: When a node cannot be accessed in the tree
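A sketch of branching on these two cases rather than bubbling them up with `?`. The enum here is a local stand-in mirroring the documented variants so the snippet is self-contained; the real type comes from the crate and may carry extra data.

```rust
use std::fmt;

/// Local stand-in for the crate's error type, mirroring the two
/// documented variants. Illustrative only.
#[derive(Debug)]
enum DomExtractionError {
    NoBodyElement,
    NodeAccessError,
}

impl fmt::Display for DomExtractionError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            DomExtractionError::NoBodyElement => write!(f, "document has no <body> element"),
            DomExtractionError::NodeAccessError => write!(f, "node could not be accessed"),
        }
    }
}

/// Map each failure case to a remediation hint instead of propagating it.
fn describe(err: &DomExtractionError) -> &'static str {
    match err {
        DomExtractionError::NoBodyElement => "input is not a full HTML document",
        DomExtractionError::NodeAccessError => "density tree is out of sync with the DOM",
    }
}

fn main() {
    let err = DomExtractionError::NoBodyElement;
    println!("{err}: {}", describe(&err));
}
```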
§Performance Considerations
- The library performs a full traversal of the HTML document to build the density tree
- Memory usage scales with document size and complexity
- Text density calculations are performed once and cached
- Node sorting operations are O(n log n) where n is the number of content nodes
§Feature Flags
Currently, no optional features are provided. All functionality is included in the default build.
§Examples
More examples can be found in the examples/ directory of the source repository:
- check.rs: Basic content extraction from test documents
- ce_score.rs: Evaluation tool for measuring extraction accuracy
§References
- Sun, F., Song, D., & Liao, L. (2011). “DOM Based Content Extraction via Text Density”
- CleanEval dataset: https://sigwac.org.uk/cleaneval/
Re-exports§
pub use cetd::DensityNode;
pub use cetd::DensityTree;
pub use utils::get_node_links;
pub use utils::get_node_text;
pub use scraper;