Crate wikipedia_article_transform

Extract plain text from Wikipedia article HTML.

This crate parses Wikipedia article HTML using tree-sitter and extracts clean, structured plain text — skipping navigation, infoboxes, references, and other non-prose content.

§Quick start

use wikipedia_article_transform::WikiPage;

let html = r#"<html><body><p id="intro">Hello world.</p></body></html>"#;
let text = WikiPage::extract_text_plain(html).unwrap();
assert_eq!(text, "Hello world.");

For richer output with section tracking and inline structure, use WikiPage::extract_text:

use wikipedia_article_transform::{WikiPage, ArticleItem};

let html = r#"<html><body><h2>History</h2><p id="p1">Some text.</p></body></html>"#;
let mut page = WikiPage::new().unwrap();
let items = page.extract_text(html).unwrap();
if let ArticleItem::Paragraph(seg) = &items[0] {
    assert_eq!(seg.section, "History");
    assert_eq!(seg.section_level, 2);
    assert_eq!(seg.text, "Some text.");
}
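For a full extraction pass, the returned items can be dispatched on by variant in document order. A minimal sketch, assuming the crate is a dependency; only the Paragraph variant is confirmed above, so other variants (such as one wrapping ImageSegment) are handled by the catch-all arm:

```rust
use wikipedia_article_transform::{WikiPage, ArticleItem};

let html = r#"<html><body><h2>History</h2><p id="p1">Some text.</p></body></html>"#;
let mut page = WikiPage::new().unwrap();
for item in page.extract_text(html).unwrap() {
    match item {
        // Paragraph segments carry their section context and plain text.
        ArticleItem::Paragraph(seg) => println!("[{}] {}", seg.section, seg.text),
        // Other variants (e.g. images) can be skipped or handled separately.
        _ => {}
    }
}
```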

§Optional feature: fetch

Enable the fetch feature to download Wikipedia articles directly from the Wikipedia REST API:

wikipedia-article-transform = { version = "0.1", features = ["fetch"] }
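With the feature enabled, fetching and extracting might be combined as below. Note this is an illustration only: this page does not document the fetch API surface, so the function name fetch_article and its signature are hypothetical assumptions, not part of the documented API.

```rust
use wikipedia_article_transform::WikiPage;

// Hypothetical helper name; the actual fetch API is not shown on this page.
let html = wikipedia_article_transform::fetch_article("Rust_(programming_language)").unwrap();
let text = WikiPage::extract_text_plain(&html).unwrap();
println!("{text}");
```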

Re-exports§

pub use formatters::ArticleFormat;

Modules§

formatters
Output formatters for Wikipedia article items.

Structs§

ImageSegment
An image extracted from a <figure> block in a Wikipedia article.
TextSegment
A single paragraph-level text segment extracted from a Wikipedia article.
WikiPage
A reusable Wikipedia HTML parser.

Enums§

ArticleItem
A single item extracted from a Wikipedia article, in document order.
InlineNode
An inline content node within a paragraph.

Functions§

strip_references
Remove all reference-related content from a list of ArticleItems.
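When citation markers and reference sections are unwanted, strip_references can be applied to the extracted items before formatting. A hedged sketch, assuming a by-value signature (take the item list, return the filtered list); the exact signature is not shown on this page:

```rust
use wikipedia_article_transform::{strip_references, WikiPage};

let html = r#"<html><body><p id="p1">A fact.<sup>[1]</sup></p></body></html>"#;
let mut page = WikiPage::new().unwrap();
let items = page.extract_text(html).unwrap();
// Assumed by-value signature: consumes and returns the item list.
let items = strip_references(items);
```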