wicket

A high-performance tool that extracts plain text from Wikipedia XML dump files.

wicket is a Rust reimplementation of wikiextractor, offering significantly faster processing through parallel execution and efficient streaming.

Features

Streaming XML parsing that handles multi-gigabyte dumps without loading them into memory
Parallel text extraction using multiple CPU cores via rayon
Automatic bzip2 decompression for .xml.bz2 dump files
Output compatible with wikiextractor (doc format and JSON format)
File splitting with configurable maximum size per file
Namespace filtering to extract only specific page types
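The file-splitting feature can be pictured with a small, standard-library-only sketch. It assumes the wikiextractor-style rule that a document is never cut in half: a new file starts whenever the next document would push the current one past the limit. The function name `split_into_files` is hypothetical and not part of wicket's API, and wicket's real writer may behave differently.

```rust
/// Group already-formatted documents into file bodies of at most `max_bytes`.
/// Assumption: a document is never split, so a single document larger than
/// `max_bytes` still ends up in a file of its own.
fn split_into_files(docs: &[&str], max_bytes: usize) -> Vec<String> {
    let mut files: Vec<String> = Vec::new();
    let mut current = String::new();

    for doc in docs {
        // Close the current file if this document would overflow it.
        if !current.is_empty() && current.len() + doc.len() > max_bytes {
            files.push(std::mem::take(&mut current));
        }
        current.push_str(doc);
    }

    // Flush whatever is left in the last, partially filled file.
    if !current.is_empty() {
        files.push(current);
    }
    files
}
```

Calling `split_into_files(&["aaaa", "bbbb", "cc"], 8)` yields two bodies, `"aaaabbbb"` and `"cc"`, because the third document would exceed the 8-byte limit.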

Installation

From crates.io

cargo install wicket-cli

From source

Requires Rust 1.85 or later.

git clone https://github.com/mosuka/wext.git
cd wext
cargo build --release

Quick Start

# Download a Wikipedia dump
wget https://dumps.wikimedia.org/simplewiki/latest/simplewiki-latest-pages-articles.xml.bz2

# Extract plain text
wicket simplewiki-latest-pages-articles.xml.bz2 -o output/

# JSON output
wicket simplewiki-latest-pages-articles.xml.bz2 -o output/ --json

# Write to stdout
wicket simplewiki-latest-pages-articles.xml.bz2 -o - -q | head -50

CLI Options

Option	Description	Default
`<INPUT>`	Input Wikipedia XML dump file (`.xml` or `.xml.bz2`)	(required)
`-o, --output`	Output directory, or `-` for stdout	`text`
`-b, --bytes`	Maximum bytes per output file (e.g., `1M`, `500K`, `1G`)	`1M`
`-c, --compress`	Compress output files using bzip2	`false`
`--json`	Write output in JSON format	`false`
`--processes`	Number of parallel workers	CPU count
`-q, --quiet`	Suppress progress output on stderr	`false`
`--namespaces`	Comma-separated namespace IDs to extract	`0`
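To make the value syntax of `--bytes` and `--namespaces` concrete, here is a rough sketch of how such arguments could be parsed. Both function names are hypothetical, and the sketch assumes that `K`, `M`, and `G` are binary multiples (1024-based); wicket's own parser may interpret them differently.

```rust
/// Parse a size such as "500K", "1M", or "1G" into a byte count.
/// Assumption: suffixes are powers of 1024; a bare number means bytes.
fn parse_size(input: &str) -> Option<u64> {
    let s = input.trim();
    let (digits, multiplier) = match s.chars().last()? {
        'K' | 'k' => (&s[..s.len() - 1], 1u64 << 10),
        'M' | 'm' => (&s[..s.len() - 1], 1u64 << 20),
        'G' | 'g' => (&s[..s.len() - 1], 1u64 << 30),
        _ => (s, 1),
    };
    // Reject non-numeric input and values that overflow u64.
    digits.parse::<u64>().ok()?.checked_mul(multiplier)
}

/// Parse a comma-separated namespace list such as "0,14".
/// Returns None if any entry is not an integer.
fn parse_namespaces(input: &str) -> Option<Vec<i32>> {
    input
        .split(',')
        .map(|part| part.trim().parse::<i32>().ok())
        .collect()
}
```

Under these assumptions the default `1M` corresponds to 1,048,576 bytes, and the default `0` selects only the main article namespace.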

Output Formats

Doc Format (default)

<doc id="1" url="https://en.wikipedia.org/wiki/April" title="April">
April is the fourth month of the year...
</doc>

JSON Format

{"id":"1","url":"https://en.wikipedia.org/wiki/April","title":"April","text":"April is the fourth month of the year..."}
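For consumers that want to produce or test against these two layouts, the following standard-library sketch renders both. The `Page` struct and the functions `to_doc`, `to_json`, and `json_escape` are hypothetical helpers, not wicket's API; the trailing newline after `</doc>` and the lack of attribute escaping in the doc header are assumptions.

```rust
/// Hypothetical page record; field names mirror the JSON keys shown above.
struct Page<'a> {
    id: u64,
    url: &'a str,
    title: &'a str,
    text: &'a str,
}

/// Escape a string for use inside a JSON string literal.
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            // Remaining control characters use the \uXXXX form.
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Render the doc format: a header line, the text, and a closing tag.
fn to_doc(p: &Page) -> String {
    format!(
        "<doc id=\"{}\" url=\"{}\" title=\"{}\">\n{}\n</doc>\n",
        p.id, p.url, p.title, p.text
    )
}

/// Render one JSON object per page, with the id written as a string.
fn to_json(p: &Page) -> String {
    format!(
        "{{\"id\":\"{}\",\"url\":\"{}\",\"title\":\"{}\",\"text\":\"{}\"}}",
        p.id,
        json_escape(p.url),
        json_escape(p.title),
        json_escape(p.text)
    )
}
```

With the April page from the samples above, `to_json` reproduces the JSON line shown, and `to_doc` reproduces the three-line doc block.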

Library Usage

Add to your Cargo.toml:

[dependencies]
wicket = "0.1.0"

use wicket::{open_dump, clean_wikitext, format_page, make_url, OutputFormat};

fn main() -> Result<(), Box<dyn std::error::Error>> {
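    // Open the dump, keeping only namespace 0 (main articles).
    let reader = open_dump("dump.xml.bz2".as_ref(), &[0])?;
    // Copy the base URL now, because the loop below consumes the reader.
    let url_base = reader.url_base().to_string();

    // Process only the first five pages.
    for result in reader.take(5) {
        let article = result?;
        // Strip wiki markup, then render the page in doc format.
        let text = clean_wikitext(&article.text);
        let url = make_url(&url_base, &article.title);
        let output = format_page(
            article.id, &article.title, &url, &text, OutputFormat::Doc,
        );
        println!("{}", output);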
    }

    Ok(())
}
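The loop above cleans pages one at a time. As a rough illustration of how the cleaning step could be spread over several cores, the sketch below uses scoped threads from the standard library rather than rayon, so it is not how wicket itself is implemented. `clean_placeholder` is a hypothetical stand-in for `clean_wikitext` that only strips `[[` and `]]`, and `clean_parallel` is likewise an illustrative name.

```rust
use std::thread;

/// Hypothetical stand-in for real wikitext cleaning: removes link brackets only.
fn clean_placeholder(raw: &str) -> String {
    raw.replace("[[", "").replace("]]", "")
}

/// Clean pages on up to `workers` threads, preserving the input order.
fn clean_parallel(pages: &[String], workers: usize) -> Vec<String> {
    let workers = workers.max(1);
    // Ceiling division; at least 1 because `chunks(0)` would panic.
    let chunk_size = ((pages.len() + workers - 1) / workers).max(1);

    thread::scope(|scope| {
        // Spawn every worker first so the chunks are processed concurrently.
        let handles: Vec<_> = pages
            .chunks(chunk_size)
            .map(|chunk| {
                scope.spawn(move || {
                    chunk
                        .iter()
                        .map(|page| clean_placeholder(page))
                        .collect::<Vec<String>>()
                })
            })
            .collect();

        // Joining in spawn order keeps the output aligned with the input.
        handles
            .into_iter()
            .flat_map(|handle| handle.join().expect("worker thread panicked"))
            .collect()
    })
}
```

Because each thread borrows a disjoint chunk of the input slice, no locking is needed, and the results come back in the original page order.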

License

MIT OR Apache-2.0