wikiparse-rs 0.1.2

Blazingly fast MediaWiki/Wikipedia SQL dump parser

wikiparse-rs

wikiparse-rs is a blazingly fast CLI tool and Rust library for parsing uncompressed MediaWiki/Wikipedia SQL dumps in a streaming fashion. It reads rows from INSERT statements in supported Wikipedia tables and exports them as CSV or JSON.

Install

Install as a CLI tool from crates.io:

cargo install wikiparse-rs

Install as a CLI tool from GitHub:

cargo install --git https://github.com/erykksc/wikiparse-rs wikiparse-rs

Install as a CLI tool from a local checkout (from this repository root):

cargo install --path .

Use as a library in another Rust project:

cargo add wikiparse-rs

Import it in Rust as wikiparse_rs.

Quick usage

Export a table dump to CSV:

wikiparse-rs --table page --format csv --input /path/to/page.sql > page.csv

Export from stdin (default when --input is omitted):

cat /path/to/page.sql | wikiparse-rs --table page --format csv > page.csv

Export a table dump to JSON:

wikiparse-rs --table linktarget --format json --input /path/to/linktarget.sql > linktarget.json

Example usage

Iterate over typed page rows and destructure fields inline:

use std::fs::File;
use std::io::{self, BufReader};

use wikiparse_rs::parsers::page::{PageRow, iter_rows};

fn main() -> io::Result<()> {
    let file = File::open("page.sql")?;
    let reader = BufReader::new(file);

    // Each item is a Result, so `?` propagates parse and I/O errors;
    // take(10) stops after the first ten rows.
    for row in iter_rows(reader).take(10) {
        // Destructure only the fields we need; `..` ignores the rest.
        let PageRow {
            id: page_id,
            title: name,
            ..
        } = row?;
        // Titles are raw bytes in the dump, so decode lossily for printing.
        println!("{page_id},{}", String::from_utf8_lossy(&name));
    }

    Ok(())
}

CLI command

The wikiparse-rs binary is designed for scriptable dump export.

  • --table: which supported MediaWiki table to parse (for example page, pagelinks, linktarget)
  • --format: output format, csv or json
  • --input: path to the SQL dump file, or - for stdin (defaults to stdin when omitted)
  • --limit: optional row limit for quick sampling

This makes the command useful as a standalone binary to transform large SQL dumps into CSV/JSON for downstream tools.
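The flags compose; for example, --limit is handy for inspecting the first few rows of a multi-gigabyte dump before committing to a full export (the path here is a placeholder):

```shell
# Peek at the first 100 rows as JSON without parsing the whole dump
wikiparse-rs --table page --format json --input /path/to/page.sql --limit 100 > sample.json
```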

Compressed dumps can be streamed without extracting first:

zcat /path/to/page.sql.gz | wikiparse-rs --table page --format csv > page.csv
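The same pattern works for bzip2-compressed dumps, assuming bzip2 is installed:

```shell
bzcat /path/to/page.sql.bz2 | wikiparse-rs --table page --format csv > page.csv
```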

Show progress while streaming a compressed dump with pv:

pv /path/to/page.sql.gz | zcat | wikiparse-rs --table page --format csv > page.csv

Column selection with xsv

After exporting CSV, you can select only the columns you need with xsv:

wikiparse-rs --table page --format csv --input /path/to/page.sql \
  | xsv select page_id,page_title,page_namespace
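If xsv is not installed, cut can select columns by position instead of name; it is a rougher tool, since it breaks if any field contains an embedded comma. A sketch using inline sample data in place of real wikiparse-rs output:

```shell
# Sample CSV standing in for wikiparse-rs CSV output (header plus one row)
sample='page_id,page_namespace,page_title
1,0,Main_Page'

# Keep columns 1 and 3 (page_id and page_title) by position
printf '%s\n' "$sample" | cut -d, -f1,3
# prints:
# page_id,page_title
# 1,Main_Page
```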