wicket 0.1.0

Wikipedia corpus knowledge extractor.
wicket

A high-performance tool that extracts plain text from Wikipedia XML dump files.

wicket is a Rust reimplementation of wikiextractor, offering significantly faster processing through parallel execution and efficient streaming.

Features

  • Streaming XML parsing that handles multi-gigabyte dumps without loading them into memory
  • Parallel text extraction using multiple CPU cores via rayon
  • Automatic bzip2 decompression for .xml.bz2 dump files
  • Output compatible with wikiextractor (doc format and JSON format)
  • File splitting with configurable maximum size per file
  • Namespace filtering to extract only specific page types
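The file-splitting feature can be pictured with a small sketch. This is not wicket's implementation: the function name `split_into_files` and the rule that an oversized page gets a file of its own are assumptions made for illustration only.

```rust
/// Group already-formatted pages into batches ("files") whose combined size
/// stays within `max_bytes`.
///
/// Assumption (hypothetical behavior, not confirmed by wicket): a page that is
/// larger than `max_bytes` is kept whole and placed in a batch of its own.
fn split_into_files(pages: &[String], max_bytes: usize) -> Vec<Vec<&str>> {
    let mut files: Vec<Vec<&str>> = Vec::new();
    let mut current: Vec<&str> = Vec::new();
    let mut current_size = 0usize;

    for page in pages {
        // Start a new file when adding this page would exceed the limit.
        if !current.is_empty() && current_size + page.len() > max_bytes {
            files.push(std::mem::take(&mut current));
            current_size = 0;
        }
        current_size += page.len();
        current.push(page.as_str());
    }

    // Flush the last, partially filled file.
    if !current.is_empty() {
        files.push(current);
    }
    files
}
```

Pages are never cut in the middle here, so a file can exceed the limit only when a single page is larger than the limit.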

Installation

From crates.io

cargo install wicket-cli

From source

Requires Rust 1.85 or later.

git clone https://github.com/mosuka/wext.git
cd wext
cargo build --release

Quick Start

# Download a Wikipedia dump
wget https://dumps.wikimedia.org/simplewiki/latest/simplewiki-latest-pages-articles.xml.bz2

# Extract plain text
wicket simplewiki-latest-pages-articles.xml.bz2 -o output/

# JSON output
wicket simplewiki-latest-pages-articles.xml.bz2 -o output/ --json

# Write to stdout
wicket simplewiki-latest-pages-articles.xml.bz2 -o - -q | head -50

CLI Options

Option          Description                                           Default
<INPUT>         Input Wikipedia XML dump file (.xml or .xml.bz2)      (required)
-o, --output    Output directory, or - for stdout                     text
-b, --bytes     Maximum bytes per output file (e.g., 1M, 500K, 1G)    1M
-c, --compress  Compress output files using bzip2                     false
--json          Write output in JSON format                           false
--processes     Number of parallel workers                            CPU count
-q, --quiet     Suppress progress output on stderr                    false
--namespaces    Comma-separated namespace IDs to extract              0
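To make the value syntax of --bytes and --namespaces concrete, here is a minimal sketch of how such arguments could be parsed. The helper names `parse_size` and `parse_namespaces` are hypothetical, and the sketch assumes binary suffixes (1K = 1024 bytes); wicket's own parser may behave differently.

```rust
/// Parse a size such as "500K", "1M", or "1G" into a byte count.
///
/// Assumption: K, M, and G are binary multiples (1024, 1024^2, 1024^3).
/// Returns `None` for empty input, non-numeric input, or overflow.
fn parse_size(input: &str) -> Option<u64> {
    let s = input.trim();
    let (digits, multiplier) = match s.chars().last()? {
        'K' | 'k' => (&s[..s.len() - 1], 1u64 << 10),
        'M' | 'm' => (&s[..s.len() - 1], 1u64 << 20),
        'G' | 'g' => (&s[..s.len() - 1], 1u64 << 30),
        _ => (s, 1),
    };
    digits.parse::<u64>().ok()?.checked_mul(multiplier)
}

/// Parse a comma-separated list of namespace IDs such as "0,14".
///
/// IDs are parsed as signed integers because MediaWiki also defines
/// negative namespaces. Returns `None` if any entry is not an integer.
fn parse_namespaces(input: &str) -> Option<Vec<i32>> {
    input
        .split(',')
        .map(|part| part.trim().parse::<i32>().ok())
        .collect()
}
```

Under these assumptions, `parse_size("1M")` yields 1048576 bytes and `parse_namespaces("0,14")` yields the IDs 0 and 14.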

Output Formats

Doc Format (default)

<doc id="1" url="https://en.wikipedia.org/wiki/April" title="April">
April is the fourth month of the year...
</doc>

JSON Format

{"id":"1","url":"https://en.wikipedia.org/wiki/April","title":"April","text":"April is the fourth month of the year..."}
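Downstream tools often need to read the doc format back in. The following is a rough sketch of such a reader using only the standard library. The names `DocRecord`, `attr`, and `parse_docs` are hypothetical, and the sketch assumes that each `<doc ...>` header and each `</doc>` sits on its own line and that attribute values contain no unescaped double quotes.

```rust
/// One record recovered from doc-format output (hypothetical type).
#[derive(Debug, PartialEq)]
struct DocRecord {
    id: String,
    url: String,
    title: String,
    text: String,
}

/// Extract the value of `name="..."` from a `<doc ...>` header line.
/// Assumes the value contains no unescaped double quote.
fn attr(header: &str, name: &str) -> Option<String> {
    let key = format!(" {}=\"", name);
    let start = header.find(&key)? + key.len();
    let end = header[start..].find('"')? + start;
    Some(header[start..end].to_string())
}

/// Split doc-format text into records, skipping any block whose header
/// lacks one of the three expected attributes.
fn parse_docs(input: &str) -> Vec<DocRecord> {
    let mut records = Vec::new();
    // Header line and body lines of the block currently being read.
    let mut current: Option<(String, Vec<&str>)> = None;

    for line in input.lines() {
        if line.starts_with("<doc ") {
            current = Some((line.to_string(), Vec::new()));
        } else if line == "</doc>" {
            if let Some((header, body)) = current.take() {
                if let (Some(id), Some(url), Some(title)) =
                    (attr(&header, "id"), attr(&header, "url"), attr(&header, "title"))
                {
                    records.push(DocRecord { id, url, title, text: body.join("\n") });
                }
            }
        } else if let Some((_, body)) = current.as_mut() {
            body.push(line);
        }
    }
    records
}
```

For the JSON format, a JSON library is the more robust choice, since each output line is a self-contained object.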

Library Usage

Add to your Cargo.toml:

[dependencies]
wicket = "0.1.0"

use wicket::{open_dump, clean_wikitext, format_page, make_url, OutputFormat};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let reader = open_dump("dump.xml.bz2".as_ref(), &[0])?;
    let url_base = reader.url_base().to_string();

    for result in reader.take(5) {
        let article = result?;
        let text = clean_wikitext(&article.text);
        let url = make_url(&url_base, &article.title);
        let output = format_page(
            article.id, &article.title, &url, &text, OutputFormat::Doc,
        );
        println!("{}", output);
    }

    Ok(())
}

License

MIT OR Apache-2.0