# wicket
A high-performance tool that extracts plain text from Wikipedia XML dump files.
wicket is a Rust reimplementation of wikiextractor, offering significantly faster processing through parallel execution and efficient streaming.
## Features
- Streaming XML parsing that handles multi-gigabyte dumps without loading them into memory
- Parallel text extraction using multiple CPU cores via rayon
- Automatic bzip2 decompression for `.xml.bz2` dump files
- Output compatible with wikiextractor (doc format and JSON format)
- File splitting with configurable maximum size per file
- Namespace filtering to extract only specific page types
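The streaming-plus-parallel design can be sketched roughly as follows. This is an illustrative sketch only, using `std::thread` instead of wicket's actual rayon-based internals, and `strip_markup` is a stand-in for the real wikitext cleaner:

```rust
use std::thread;

// Stand-in for wicket's real wikitext cleaner: here we just drop
// the most common markup characters to illustrate the shape.
fn strip_markup(page: &str) -> String {
    page.chars()
        .filter(|c| !matches!(c, '[' | ']' | '\'' | '{' | '}'))
        .collect()
}

fn main() {
    // In the real tool, pages are streamed out of the XML dump one at
    // a time; a small in-memory batch stands in for one chunk here.
    let pages = vec![
        "'''April''' is the fourth month of the year...".to_string(),
        "[[August]] is the eighth month...".to_string(),
    ];

    // Fan the chunk out across workers, keeping results in input order.
    let handles: Vec<_> = pages
        .into_iter()
        .map(|p| thread::spawn(move || strip_markup(&p)))
        .collect();
    for h in handles {
        println!("{}", h.join().unwrap());
    }
}
```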
## Installation

### From crates.io

Assuming the crate is published under the project name:

```sh
cargo install wicket
```

### From source

From a checkout of the repository:

```sh
cargo install --path .
```

Requires Rust 1.85 or later.
## Quick Start

```sh
# Download a Wikipedia dump
wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

# Extract plain text
wicket enwiki-latest-pages-articles.xml.bz2 -o text

# JSON output
wicket enwiki-latest-pages-articles.xml.bz2 --json

# Write to stdout
wicket enwiki-latest-pages-articles.xml.bz2 -o -
```
## CLI Options

| Option | Description | Default |
|---|---|---|
| `<INPUT>` | Input Wikipedia XML dump file (`.xml` or `.xml.bz2`) | (required) |
| `-o, --output` | Output directory, or `-` for stdout | `text` |
| `-b, --bytes` | Maximum bytes per output file (e.g., `1M`, `500K`, `1G`) | `1M` |
| `-c, --compress` | Compress output files using bzip2 | `false` |
| `--json` | Write output in JSON format | `false` |
| `--processes` | Number of parallel workers | CPU count |
| `-q, --quiet` | Suppress progress output on stderr | `false` |
| `--namespaces` | Comma-separated namespace IDs to extract | `0` |
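The `-b, --bytes` option accepts human-readable size suffixes. A minimal sketch of how such values are typically interpreted (illustrative only; wicket's actual argument parser may accept more forms):

```rust
// Parse a size like "1M", "500K", or "1G" into a byte count.
// Illustrative sketch; wicket's real parsing may differ.
fn parse_size(s: &str) -> Option<u64> {
    let s = s.trim();
    let (num, mult) = match s.chars().last()? {
        'K' | 'k' => (&s[..s.len() - 1], 1024),
        'M' | 'm' => (&s[..s.len() - 1], 1024 * 1024),
        'G' | 'g' => (&s[..s.len() - 1], 1024 * 1024 * 1024),
        _ => (s, 1),
    };
    num.parse::<u64>().ok().map(|n| n * mult)
}

fn main() {
    assert_eq!(parse_size("1M"), Some(1024 * 1024));
    assert_eq!(parse_size("500K"), Some(500 * 1024));
    assert_eq!(parse_size("bogus"), None);
    println!("ok");
}
```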
## Output Formats

### Doc Format (default)

```
<doc id="..." url="..." title="April">
April is the fourth month of the year...
</doc>
```
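Downstream code can split this output on the `<doc ...>` / `</doc>` markers. A minimal sketch of pulling out one document's title and body (assumes the well-formed attribute layout shown above):

```rust
// Extract the title attribute and body text from one doc-format block.
// Minimal sketch: assumes well-formed input like the example above.
fn parse_doc(block: &str) -> Option<(String, String)> {
    let header_end = block.find('>')?;
    let header = &block[..header_end];
    let title_start = header.find("title=\"")? + "title=\"".len();
    let title_end = title_start + header[title_start..].find('"')?;
    let title = header[title_start..title_end].to_string();
    let body = block[header_end + 1..]
        .trim_end_matches("</doc>")
        .trim()
        .to_string();
    Some((title, body))
}

fn main() {
    let block = "<doc id=\"1\" url=\"...\" title=\"April\">\n\
                 April is the fourth month of the year...\n</doc>";
    let (title, body) = parse_doc(block).unwrap();
    assert_eq!(title, "April");
    assert!(body.starts_with("April is"));
    println!("ok");
}
```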
### JSON Format
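With `--json`, each article is written as one JSON object per line, with fields mirroring wikiextractor's JSON output (values here are schematic):

```json
{"id": "...", "url": "...", "title": "April", "text": "April is the fourth month of the year..."}
```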
## Library Usage

Add to your `Cargo.toml`:

```toml
[dependencies]
wicket = "0.1.0"
```

Then import the items you need (see the crate documentation):

```rust
use wicket::…;
```
## Documentation
## License
MIT OR Apache-2.0