# wikipedia-article-transform
Extract plain text from Wikipedia article HTML using tree-sitter.
Parses the full HTML structure of a Wikipedia article and returns clean prose text, skipping infoboxes, navigation elements, references, hatnotes, and other non-content markup. Section headings are tracked so each paragraph knows which section it belongs to, and inline elements (`<b>`, `<i>`, `<a>`) are captured for rich output formats.
## Usage
### As a library
```toml
[dependencies]
wikipedia-article-transform = "0.1"
```
#### Simple: get all text as a string
```rust
use wikipedia_article_transform::WikiPage;

let html = r#"<h2>History</h2><p id="p1">Some text here.</p>"#;
let text = WikiPage::extract_text_plain(html)?;
// "Some text here."
```
#### Structured: get paragraphs with section context
```rust
use wikipedia_article_transform::WikiPage;

let mut page = WikiPage::new()?;
let segments = page.extract_text(html)?;
for seg in &segments {
    // each segment carries its text plus the section it belongs to
}
```
Reuse the same `WikiPage` across many articles: it resets state internally on each call and avoids reinitialising the parser.
### Filtering by section
```rust
let prose: Vec<_> = segments
    .iter()
    .filter(|seg| true /* filter on the segment's section here */)
    .collect();
```
### Output formatting
```rust
use wikipedia_article_transform::ArticleFormat;

// Plain text with # heading lines
let plain = segments.format_plain();

// Semantic JSON with per-paragraph citations:
// { "intro": [{ "text": "...", "citations": [{ "label": "1", "text": "..." }] }], "sections": [...] }
let json = segments.format_json()?;

// Markdown with **bold**, _italic_, [links](href)
let markdown = segments.format_markdown();
```
### With the `cli` feature
Fetch and extract a live Wikipedia article directly:
```toml
[dependencies]
wikipedia-article-transform = { version = "0.1", features = ["cli"] }
```
```rust
use wikipedia_article_transform::get_text;

let segments = get_text("en", "Alan Turing").await?;
println!("{}", segments.format_plain());
```
## Python bindings
Build locally with maturin:

```sh
maturin develop --release
```
Then import and use the built module from Python.
## CLI
Install with the `cli` feature (required for the binary):

```sh
cargo install wikipedia-article-transform --features cli
```

or with uv:

```sh
uv tool install wikipedia-article-transform
```
Fetch an article with the `wikipedia-article-transform` binary. Output is plain text by default; the semantic JSON section tree and Markdown with inline formatting are also available. Titles with spaces are accepted (normalized internally).
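The title normalization mentioned above follows Wikipedia's convention of replacing spaces with underscores in canonical titles. A minimal sketch of that idea in Python (the crate's actual normalization is not specified here and may handle more cases, such as capitalization or percent-encoding):

```python
def normalize_title(title: str) -> str:
    """Collapse runs of whitespace and join with underscores,
    as Wikipedia does for canonical article titles."""
    return "_".join(title.split())

print(normalize_title("Alan  Turing"))  # Alan_Turing
```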
## Python CLI
After installing the Python package, the same command shape applies.
## Web API (`web` feature)
Run the HTTP API server:

```sh
cargo run --features web --bin wikipedia-article-transform-web
```
Routes:

- `GET /healthz`
- `GET /{language}/{title}.md`
- `GET /{language}/{title}.txt`
- `GET /{language}/{title}.json`
The server binds to `0.0.0.0:$PORT` (`PORT` defaults to 10000) and sets output-specific content types.
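The routes above compose mechanically from a language code, a title, and an extension. A small helper sketch (the localhost base URL uses the default port noted above; everything else is illustrative):

```python
BASE = "http://localhost:10000"  # default PORT per the note above

def article_url(language: str, title: str, ext: str) -> str:
    """Build a URL for one of the server's article routes."""
    assert ext in {"md", "txt", "json"}
    return f"{BASE}/{language}/{title}.{ext}"

for ext in ("md", "txt", "json"):
    print(article_url("en", "Alan_Turing", ext))
```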
## JSON output shape
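Matching the comment in the formatting example above, the JSON output groups intro paragraphs (each with its citations) ahead of a list of sections; beyond that comment, the exact nesting here is illustrative:

```json
{
  "intro": [
    { "text": "...", "citations": [{ "label": "1", "text": "..." }] }
  ],
  "sections": [...]
}
```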
## Skipped elements
The following are excluded from extracted text:
| Element / class | Reason |
|---|---|
| `<script>`, `<style>` | Code, not prose |
| `<link>` | Metadata |
| `.infobox` | Structured data table |
| `.reflist`, `.reference`, `.citation` | Reference list |
| `.navbox` | Navigation template |
| `.hatnote` | Disambiguation notice |
| `.shortdescription` | Hidden metadata |
| `.noprint` | Print-only elements |
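The rules above amount to dropping whole subtrees by tag name or CSS class. The crate does this over a tree-sitter parse; purely to illustrate the rule set, here is a sketch using Python's stdlib HTML parser:

```python
from html.parser import HTMLParser

SKIP_TAGS = {"script", "style"}
SKIP_CLASSES = {"infobox", "reflist", "reference", "citation",
                "navbox", "hatnote", "shortdescription", "noprint"}
VOID = {"link", "meta", "br", "img", "hr", "input"}  # no end tag, no text content

class ProseExtractor(HTMLParser):
    """Collects text, suppressing everything inside a skipped subtree."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # >0 while inside a skipped subtree
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return  # e.g. <link>: metadata, contributes no prose
        classes = set((dict(attrs).get("class") or "").split())
        if self.skip_depth or tag in SKIP_TAGS or classes & SKIP_CLASSES:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.parts.append(data.strip())

p = ProseExtractor()
p.feed('<p>Keep this.</p><table class="infobox"><tr><td>Drop this.</td></tr></table>')
print(" ".join(p.parts))  # Keep this.
```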
## Feature flags
| Feature | Default | Description |
|---|---|---|
| `cli` | no | Enables `get_text()` and the CLI binary (adds `reqwest` + `tokio`) |
| `web` | no | Enables the Actix API server binary (`wikipedia-article-transform-web`) |
## Agent Skills
This tool is also available as an agentic skill. Install it with:

```sh
npx skills add https://github.com/santhoshtr/wikipedia-article-transform
```
## License
MIT