parsoid 0.3.0-alpha.2

Wrapper around Parsoid HTML that provides convenient accessors for processing and manipulation

parsoid-rs


The parsoid crate is a wrapper around Parsoid HTML that provides convenient accessors for processing and extraction.

Inspired by mwparserfromhell and parsoid-jsapi, and built on top of Servo's html5ever.

Architecture

Conceptually, this crate is just a convenience layer on top of an HTML processing library. There are two primary constructs to be aware of: Wikicode and Wikinodes (e.g. WikiLink, Template, etc.).

Wikicode is a container for an entire wiki page, equivalent to an <html> or <body> node. It provides convenience functions like filter_templates() to easily operate on and mutate a specific type of Wikinode (Template in this case).

Wikinode is an enum representing all of the different types of Wikinodes, mostly to enable functions that accept/return various types of nodes.
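
As a rough illustration of why an enum helps here, the sketch below uses hypothetical stand-in types (SketchNode, CommentSketch, LinkSketch, and link_targets are invented for this example and are not the crate's real definitions). It shows how one enum lets a single function accept a mixed list of nodes and pick out one specific kind:

```rust
// Hypothetical stand-ins for illustration only; not the crate's real types.
#[derive(Debug, Clone, PartialEq)]
struct CommentSketch {
    text: String,
}

#[derive(Debug, Clone, PartialEq)]
struct LinkSketch {
    target: String,
}

// One enum covering every node kind, so functions can accept or return
// "any node" without generics or trait objects.
#[derive(Debug, Clone, PartialEq)]
enum SketchNode {
    Comment(CommentSketch),
    Link(LinkSketch),
    Generic(String),
}

impl SketchNode {
    // Narrow the enum back down to a specific node type, if it matches.
    fn as_link(&self) -> Option<&LinkSketch> {
        match self {
            SketchNode::Link(link) => Some(link),
            _ => None,
        }
    }
}

// Accepts a mixed list of nodes and collects only the link targets.
fn link_targets(nodes: &[SketchNode]) -> Vec<String> {
    nodes
        .iter()
        .filter_map(|node| node.as_link())
        .map(|link| link.target.clone())
        .collect()
}
```

Given a slice holding a comment, a link to "./Foo", and a generic node, link_targets returns a vector containing only "./Foo".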

A Wikinode provides convenience functions for working with specific types of MediaWiki constructs. For example, the WikiLink type wraps an <a rel="mw:WikiLink" href="...">...</a> node and provides functions for accessing or mutating its href attribute. To access the link text, you would need to use .children() and modify or append to those nodes.
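
The wrapping idea can be sketched with the standard library alone. Everything below is an assumption made for illustration: Element is a toy node type rather than html5ever's, and WikiLinkSketch, wrap, href, set_href, and element are hypothetical names, not the crate's API. The point is only that a typed wrapper can check the rel attribute once and then expose a focused accessor and mutator for href:

```rust
use std::cell::RefCell;
use std::collections::HashMap;
use std::rc::Rc;

// Toy element node: a tag name plus mutable attributes (hypothetical).
struct Element {
    name: String,
    attrs: RefCell<HashMap<String, String>>,
}

// Hypothetical typed wrapper around an <a rel="mw:WikiLink"> element.
#[derive(Clone)]
struct WikiLinkSketch(Rc<Element>);

impl WikiLinkSketch {
    // Only elements that look like wiki links can be wrapped.
    fn wrap(node: Rc<Element>) -> Option<Self> {
        let is_link = node.name == "a"
            && node.attrs.borrow().get("rel").map(String::as_str) == Some("mw:WikiLink");
        if is_link {
            Some(WikiLinkSketch(node))
        } else {
            None
        }
    }

    // Read the href attribute, if present.
    fn href(&self) -> Option<String> {
        self.0.attrs.borrow().get("href").cloned()
    }

    // Overwrite the href attribute on the underlying shared element.
    fn set_href(&self, href: &str) {
        self.0
            .attrs
            .borrow_mut()
            .insert("href".to_string(), href.to_string());
    }
}

// Helper for building toy elements in examples.
fn element(name: &str, attrs: &[(&str, &str)]) -> Rc<Element> {
    let map = attrs
        .iter()
        .map(|(k, v)| (k.to_string(), v.to_string()))
        .collect();
    Rc::new(Element {
        name: name.to_string(),
        attrs: RefCell::new(map),
    })
}
```

Because the wrapper holds a shared pointer to the element, a call to set_href is visible to anything else that holds the same node.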

The following nodes have been implemented so far:

  • Comment (<!-- ... -->)
  • ExtLink ([https://example.org Text])
  • WikiLink ([[Foo|bar]])
  • Generic - any node that we don't have a more specific type for.

Each Wikinode is effectively a wrapper around Rc<Node>, which should make it cheap to clone (I think).
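
The cost claim follows from how Rc works in general: cloning an Rc copies a pointer and bumps a reference count rather than copying the data it points to. A minimal sketch, using hypothetical NodeSketch and Handle types in place of the crate's own, shows that a clone shares the same allocation:

```rust
use std::rc::Rc;

// Stand-in for a DOM node (hypothetical, not the crate's Node type).
struct NodeSketch {
    html: String,
}

// Stand-in for a Wikinode-style wrapper: cloning it only clones the Rc.
#[derive(Clone)]
struct Handle(Rc<NodeSketch>);

// Returns whether both handles point at the same allocation, and how many
// strong references exist after one clone.
fn clone_shares_node() -> (bool, usize) {
    let a = Handle(Rc::new(NodeSketch {
        html: "<p>hi</p>".to_string(),
    }));
    let b = a.clone(); // bumps the reference count; the String is not copied
    (Rc::ptr_eq(&a.0, &b.0), Rc::strong_count(&a.0))
}
```

Here clone_shares_node returns (true, 2): both handles refer to one node, and no HTML is duplicated.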

Testing

The featured_articles example iterates through the first 500 featured articles on the English Wikipedia to test the parsing code.

License

parsoid-rs is (C) 2020 Kunal Mehta, released under the GPL v2 or any later version; see COPYING for details.