Crate wikipedia_article_transform

Extract plain text from Wikipedia article HTML.

This crate parses Wikipedia article HTML using tree-sitter and extracts clean, structured plain text — skipping navigation, infoboxes, references, and other non-prose content.

§Quick start

use wikipedia_article_transform::WikiPage;

let html = r#"<html><body><p id="intro">Hello world.</p></body></html>"#;
let text = WikiPage::extract_text_plain(html).unwrap();
assert_eq!(text, "Hello world.");

For richer output with section tracking and inline structure, use WikiPage::extract_text:

use wikipedia_article_transform::{WikiPage, ArticleItem};

let html = r#"<html><body><h2>History</h2><p id="p1">Some text.</p></body></html>"#;
let mut page = WikiPage::new().unwrap();
let items = page.extract_text(html).unwrap();
if let ArticleItem::Paragraph(seg) = &items[0] {
    assert_eq!(seg.section, "History");
    assert_eq!(seg.section_level, 2);
    assert_eq!(seg.text, "Some text.");
}
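For a full extraction pass, the returned items can be dispatched on by variant in document order. A minimal sketch, assuming the crate is a dependency; only the Paragraph variant is confirmed above, so other variants (such as one wrapping ImageSegment) are handled by the catch-all arm:

```rust
use wikipedia_article_transform::{WikiPage, ArticleItem};

let html = r#"<html><body><h2>History</h2><p id="p1">Some text.</p></body></html>"#;
let mut page = WikiPage::new().unwrap();
for item in page.extract_text(html).unwrap() {
    match item {
        // Paragraph segments carry their section context and plain text.
        ArticleItem::Paragraph(seg) => println!("[{}] {}", seg.section, seg.text),
        // Other variants (e.g. images) can be skipped or handled separately.
        _ => {}
    }
}
```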

§Optional feature: fetch

Enable the fetch feature to download Wikipedia articles directly from the Wikipedia REST API:

wikipedia-article-transform = { version = "0.1", features = ["fetch"] }
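With the feature enabled, fetching and extracting might be combined as below. Note this is an illustration only: this page does not document the fetch API surface, so the function name fetch_article and its signature are hypothetical assumptions, not part of the documented API.

```rust
use wikipedia_article_transform::WikiPage;

// Hypothetical helper name; the actual fetch API is not shown on this page.
let html = wikipedia_article_transform::fetch_article("Rust_(programming_language)").unwrap();
let text = WikiPage::extract_text_plain(&html).unwrap();
println!("{text}");
```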

Re-exports§

pub use formatters::ArticleFormat;

Modules§

formatters
Output formatters for Wikipedia article items.

Structs§

ImageSegment
An image extracted from a <figure> block in a Wikipedia article.
TextSegment
A single paragraph-level text segment extracted from a Wikipedia article.
WikiPage
A reusable Wikipedia HTML parser.

Enums§

ArticleItem
A single item extracted from a Wikipedia article, in document order.
InlineNode
An inline content node within a paragraph.

Functions§

strip_references
Remove all reference-related content from a list of ArticleItems.
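When citation markers and reference sections are unwanted, strip_references can be applied to the extracted items before formatting. A hedged sketch, assuming a by-value signature (take the item list, return the filtered list); the exact signature is not shown on this page:

```rust
use wikipedia_article_transform::{strip_references, WikiPage};

let html = r#"<html><body><p id="p1">A fact.<sup>[1]</sup></p></body></html>"#;
let mut page = WikiPage::new().unwrap();
let items = page.extract_text(html).unwrap();
// Assumed by-value signature: consumes and returns the item list.
let items = strip_references(items);
```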