Extract plain text from Wikipedia article HTML.
This crate parses Wikipedia article HTML using tree-sitter and extracts clean, structured plain text — skipping navigation, infoboxes, references, and other non-prose content.
§Quick start
use wikipedia_article_transform::WikiPage;
let html = r#"<html><body><p id="intro">Hello world.</p></body></html>"#;
let text = WikiPage::extract_text_plain(html).unwrap();
assert_eq!(text, "Hello world.");
For richer output with section tracking and inline structure, use WikiPage::extract_text:
use wikipedia_article_transform::{WikiPage, ArticleItem};
let html = r#"<html><body><h2>History</h2><p id="p1">Some text.</p></body></html>"#;
let mut page = WikiPage::new().unwrap();
let items = page.extract_text(html).unwrap();
if let ArticleItem::Paragraph(seg) = &items[0] {
assert_eq!(seg.section, "History");
assert_eq!(seg.section_level, 2);
assert_eq!(seg.text, "Some text.");
}
§Optional feature: fetch
Enable the fetch feature to retrieve Wikipedia articles directly via the REST API:
wikipedia-article-transform = { version = "0.1", features = ["fetch"] }
Re-exports§
pub use formatters::ArticleFormat;
Modules§
- formatters
- Output formatters for Wikipedia article items.
Structs§
- ImageSegment - An image extracted from a <figure> block in a Wikipedia article.
- TextSegment - A single paragraph-level text segment extracted from a Wikipedia article.
- WikiPage - A reusable Wikipedia HTML parser.
Enums§
- ArticleItem - A single item extracted from a Wikipedia article, in document order.
- InlineNode - An inline content node within a paragraph.
Functions§
- strip_references - Remove all reference-related content from a list of ArticleItems.