wikipedia-article-transform 0.4.0

Transform Wikipedia article HTML into plain-text and Markdown formats

Extract plain text from Wikipedia article HTML using tree-sitter.

Parses the full HTML structure of a Wikipedia article and returns clean prose text — skipping infoboxes, navigation elements, references, hatnotes, and other non-content markup. Section headings are tracked so each paragraph knows which section it belongs to, and inline elements (<b>, <i>, <a>) are captured for rich output formats.

Usage

As a library

[dependencies]
wikipedia-article-transform = "0.4"

Simple: get all text as a string

use wikipedia_article_transform::WikiPage;

let html = r#"<h2>History</h2><p id="p1">Some text here.</p>"#;
let text = WikiPage::extract_text_plain(html)?;
// "Some text here."

Structured: get paragraphs with section context

use wikipedia_article_transform::WikiPage;

let mut page = WikiPage::new()?;
let segments = page.extract_text(html)?;

for seg in &segments {
    println!("[{}] {}", seg.section, seg.text);
}

Reuse the same WikiPage across many articles — it resets state internally on each call and avoids reinitialising the parser.

Filtering by section

let prose: Vec<_> = segments.iter()
    .filter(|s| s.section.starts_with("History"))
    .collect();

Output formatting

use wikipedia_article_transform::ArticleFormat;

// Plain text with # heading lines
let plain = segments.format_plain();

// Semantic JSON with per-paragraph citations:
// { "intro": [{ "text": "...", "citations": [{ "label": "1", "text": "..." }] }], "sections": [...] }
let json = segments.format_json()?;

// Markdown with **bold**, _italic_, [links](href)
let markdown = segments.format_markdown();
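Since the plain output marks headings with leading `# ` lines, it can be split back into per-section text downstream. A short illustrative helper (not part of the crate; it only assumes the `# ` heading convention described above):

```python
def split_plain_sections(plain: str) -> dict[str, str]:
    """Group plain-text output into {heading: body} using '# ' heading lines."""
    sections: dict[str, str] = {"": ""}  # "" collects text before any heading
    current = ""
    for line in plain.splitlines():
        if line.startswith("# "):
            current = line[2:].strip()
            sections.setdefault(current, "")
        else:
            sections[current] += line + "\n"
    return sections

text = "Intro para.\n# History\nSome text here.\n"
print(split_plain_sections(text)["History"].strip())  # Some text here.
```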

With the cli feature

Fetch and extract a live Wikipedia article directly:

[dependencies]
wikipedia-article-transform = { version = "0.4", features = ["cli"] }

use wikipedia_article_transform::{get_text, ArticleFormat};

let segments = get_text("en", "Rust_(programming_language)").await?;
println!("{}", segments.format_markdown());

Python bindings

Build locally with maturin:

cd python
pip install maturin
maturin develop --release

Use in Python:

from wikipedia_article_transform import fetch_article_html, extract

html = fetch_article_html("en", "Rust_(programming_language)")
print(extract(html, format="markdown", language="en"))

CLI

Install with the cli feature (required for the binary):

cargo install wikipedia-article-transform --features cli

Or with uv:

uv tool install wikipedia-article-transform

Fetch an article:

# Plain text (default)
wikipedia-article-transform fetch --language en --title "Rust_(programming_language)"

# Semantic JSON section tree
wikipedia-article-transform fetch --language ml --title "കേരളം" --format json

# Markdown with inline formatting
wikipedia-article-transform fetch --language en --title "Liquid_oxygen" --format markdown

# Titles with spaces are accepted (normalized internally)
wikipedia-article-transform fetch --language en --title "Marie Curie" --format json
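The space handling follows the usual MediaWiki title convention (spaces become underscores). A rough illustration of that convention, not the crate's actual normalization code:

```python
def normalize_title(title: str) -> str:
    """Approximate MediaWiki title normalization: spaces -> underscores."""
    return title.strip().replace(" ", "_")

print(normalize_title("Marie Curie"))  # Marie_Curie
```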

Python CLI

After installing the Python package, use the same command shape:

wikipedia-article-transform fetch --language en --title "Rust_(programming_language)"
wikipedia-article-transform fetch --language ml --title "കേരളം" --format json
wikipedia-article-transform fetch --language en --title "Liquid_oxygen" --format markdown

Web API (web feature)

Run the HTTP API server:

cargo run --features web --bin wikipedia-article-transform-web

Routes:

GET /healthz
GET /{language}/{title}.md
GET /{language}/{title}.txt
GET /{language}/{title}.json

Examples:

curl "http://localhost:10000/en/Oxygen.md"
curl "http://localhost:10000/en/Oxygen.txt"
curl "http://localhost:10000/en/Oxygen.json"

The server binds to 0.0.0.0:$PORT (PORT defaults to 10000) and sets output-specific content types.
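The route shape is easy to build programmatically on the client side. A small sketch (the helper name and default base URL are illustrative; percent-encoding covers non-ASCII titles such as the Malayalam example above):

```python
from urllib.parse import quote

def article_url(language: str, title: str, fmt: str = "json",
                base: str = "http://localhost:10000") -> str:
    """Build a GET URL for the web API: /{language}/{title}.{fmt}."""
    # Spaces follow the MediaWiki underscore convention; remaining
    # characters are percent-encoded for the path segment.
    safe_title = quote(title.replace(" ", "_"))
    return f"{base}/{language}/{safe_title}.{fmt}"

print(article_url("en", "Liquid oxygen", "md"))
# http://localhost:10000/en/Liquid_oxygen.md
```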

JSON output shape

{
  "intro": [
    {
      "text": "Paragraphs before the first heading...",
      "citations": []
    }
  ],
  "sections": [
    {
      "heading": "Safety and precautions",
      "level": 2,
      "paragraphs": [
        {
          "text": "Overview text...",
          "citations": [
            { "label": "3", "text": "Reference text..." }
          ]
        }
      ],
      "subsections": [
        {
          "heading": "Combustion and other hazards",
          "level": 3,
          "paragraphs": [
            {
              "text": "Liquid oxygen spills...",
              "citations": []
            }
          ],
          "subsections": []
        }
      ]
    }
  ],
  "references": {
    "cite_note-Example-3": "Reference text..."
  }
}
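Because subsections nest recursively, collecting every paragraph takes a short walk of the tree. A sketch against the shape above (the sample data is abbreviated from the example):

```python
def collect_paragraphs(article: dict) -> list[str]:
    """Flatten intro and all (sub)section paragraphs into one list of texts."""
    texts = [p["text"] for p in article.get("intro", [])]

    def walk(section: dict) -> None:
        texts.extend(p["text"] for p in section.get("paragraphs", []))
        for sub in section.get("subsections", []):
            walk(sub)

    for section in article.get("sections", []):
        walk(section)
    return texts

article = {
    "intro": [{"text": "Intro.", "citations": []}],
    "sections": [{
        "heading": "Safety and precautions", "level": 2,
        "paragraphs": [{"text": "Overview text...", "citations": []}],
        "subsections": [{
            "heading": "Combustion and other hazards", "level": 3,
            "paragraphs": [{"text": "Liquid oxygen spills...", "citations": []}],
            "subsections": [],
        }],
    }],
}
print(collect_paragraphs(article))
# ['Intro.', 'Overview text...', 'Liquid oxygen spills...']
```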

Skipped elements

The following are excluded from extracted text:

Element / class                  Reason
<script>, <style>                Code, not prose
<link>                           Metadata
.infobox                         Structured data table
.reflist, .reference, .citation  Reference list
.navbox                          Navigation template
.hatnote                         Disambiguation notice
.shortdescription                Hidden metadata
.noprint                         Interface elements excluded from print

Feature flags

Feature  Default  Description
cli      no       Enables get_text() and the CLI binary (adds reqwest + tokio)
web      no       Enables the Actix API server binary (wikipedia-article-transform-web)
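For example, to enable both optional features in a dependent crate (version shown is illustrative):

```toml
[dependencies]
wikipedia-article-transform = { version = "0.4", features = ["cli", "web"] }
```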

Agent Skills

This tool is also available as an agentic skill. You may install it with:

npx skills add https://github.com/santhoshtr/wikipedia-article-transform

License

MIT