parsoid-rs
The parsoid crate is a wrapper around Parsoid HTML that provides convenient accessors for processing and extraction.
It is inspired by mwparserfromhell and parsoid-jsapi, and is built on top of Servo's html5ever.
Architecture
Conceptually this crate is just a convenience layer on top of an HTML processing library. There are two primary constructs to be aware of: Wikicode and Wikinodes (e.g. WikiLink, Template, etc.).
Wikicode represents a container of an entire wiki page, equivalent to an <html> or <body> node. It provides convenience functions like filter_templates() to easily operate on and mutate a specific Wikinode (Template in this case).
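The container-plus-filter idea can be sketched with a toy model. Everything below (ToyNode, ToyTemplate, ToyWikicode, sample_page) is a hypothetical stand-in written for illustration, not the crate's actual types or signatures. The sketch only shows why handles returned by a filter function can mutate the page they came from: they share the underlying nodes.

```rust
use std::cell::RefCell;
use std::rc::Rc;

/// Toy stand-in for a DOM node; the real crate works on a parsed HTML tree.
struct ToyNode {
    kind: &'static str,    // e.g. "template" or "link"
    name: RefCell<String>, // interior mutability so shared handles can edit it
}

/// Toy handle for a template: cloning copies the pointer, not the node.
#[derive(Clone)]
struct ToyTemplate(Rc<ToyNode>);

impl ToyTemplate {
    fn name(&self) -> String {
        self.0.name.borrow().clone()
    }

    fn set_name(&self, new_name: &str) {
        *self.0.name.borrow_mut() = new_name.to_string();
    }
}

/// Toy container playing the role of a whole page.
struct ToyWikicode {
    nodes: Vec<Rc<ToyNode>>,
}

impl ToyWikicode {
    /// Returns handles that point at the page's own template nodes.
    fn filter_templates(&self) -> Vec<ToyTemplate> {
        self.nodes
            .iter()
            .filter(|n| n.kind == "template")
            .map(|n| ToyTemplate(Rc::clone(n)))
            .collect()
    }
}

/// Helper that builds one toy node (hypothetical, for the example only).
fn node(kind: &'static str, name: &str) -> Rc<ToyNode> {
    Rc::new(ToyNode { kind, name: RefCell::new(name.to_string()) })
}

/// A small page with two templates and one link.
fn sample_page() -> ToyWikicode {
    ToyWikicode {
        nodes: vec![
            node("template", "Infobox"),
            node("link", "Foo"),
            node("template", "Citation needed"),
        ],
    }
}
```

Calling set_name on a handle returned by filter_templates() changes what a later filter_templates() call on the same page reports, because both calls hand out pointers to the same nodes.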
Wikinode is an enum representing all of the different types of Wikinodes, mostly to enable functions that accept or return various types of nodes.
A Wikinode provides convenience functions for working with specific types of MediaWiki constructs. For example, the WikiLink type wraps a node of the form <a rel="mw:WikiLink" href="...">...</a> and provides functions for accessing or mutating the href attribute. To access the link text, use .children() and modify or append to those nodes.
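As a rough illustration of the enum-of-wrappers pattern described above, the sketch below models an element as a tag name plus attributes and classifies it by its rel attribute. All names here (ToyElement, ToyWikiLink, ToyWikinode, classify, element, href, set_href) are assumptions made for the example; the crate's real types are built on html5ever and may look quite different.

```rust
use std::cell::RefCell;
use std::collections::HashMap;
use std::rc::Rc;

/// Toy element: a tag name plus mutable attributes.
struct ToyElement {
    tag: String,
    attrs: RefCell<HashMap<String, String>>,
}

/// Toy wrapper for <a rel="mw:WikiLink" href="...">.
#[derive(Clone)]
struct ToyWikiLink(Rc<ToyElement>);

impl ToyWikiLink {
    /// Read the href attribute, if present.
    fn href(&self) -> Option<String> {
        self.0.attrs.borrow().get("href").cloned()
    }

    /// Overwrite the href attribute on the shared element.
    fn set_href(&self, href: &str) {
        self.0
            .attrs
            .borrow_mut()
            .insert("href".to_string(), href.to_string());
    }
}

/// Toy enum so one function can return different kinds of node.
#[derive(Clone)]
enum ToyWikinode {
    WikiLink(ToyWikiLink),
    Generic(Rc<ToyElement>), // fallback when no specific type applies
}

/// Pick the most specific wrapper for an element.
fn classify(element: Rc<ToyElement>) -> ToyWikinode {
    let is_wikilink = element.tag == "a"
        && element.attrs.borrow().get("rel").map(String::as_str) == Some("mw:WikiLink");
    if is_wikilink {
        ToyWikinode::WikiLink(ToyWikiLink(element))
    } else {
        ToyWikinode::Generic(element)
    }
}

/// Helper that builds a toy element (hypothetical, for the example only).
fn element(tag: &str, attrs: &[(&str, &str)]) -> Rc<ToyElement> {
    let attrs: HashMap<String, String> = attrs
        .iter()
        .map(|(k, v)| (k.to_string(), v.to_string()))
        .collect();
    Rc::new(ToyElement { tag: tag.to_string(), attrs: RefCell::new(attrs) })
}
```

A caller would match on the enum returned by classify() and, in the WikiLink arm, read or rewrite the target through href() and set_href(). Elements without the mw:WikiLink marker fall through to the Generic variant.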
The following nodes have been implemented so far:
Comment (<!-- ... -->)
ExtLink ([https://example.org Text])
WikiLink ([[Foo|bar]])
Generic - any node that we don't have a more specific type for.
Each Wikinode is effectively a wrapper around Rc<Node>, so cloning one should be cheap: cloning an Rc increments a reference count rather than copying the underlying node.
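The cost of cloning such a wrapper can be checked with the standard library alone. The sketch below uses hypothetical ToyNode and ToyWrapper types (not the crate's own) to show that cloning a wrapper around an Rc yields a second pointer to the same allocation and bumps the strong reference count, without duplicating the node's data.

```rust
use std::rc::Rc;

/// Toy node holding some data that would be expensive to copy.
#[allow(dead_code)]
struct ToyNode {
    html: String,
}

/// Toy wrapper in the same shape as "a wrapper around Rc<Node>".
#[derive(Clone)]
struct ToyWrapper(Rc<ToyNode>);

/// Clones a wrapper and reports (strong reference count, same allocation?).
fn clone_shares_node() -> (usize, bool) {
    let first = ToyWrapper(Rc::new(ToyNode {
        html: "<a rel=\"mw:WikiLink\" href=\"./Foo\">bar</a>".to_string(),
    }));
    // Cloning the wrapper clones the Rc: a pointer copy plus a counter bump.
    let second = first.clone();
    (
        Rc::strong_count(&first.0),
        Rc::ptr_eq(&first.0, &second.0),
    )
}
```

Here clone_shares_node() returns (2, true): two live handles, one underlying node. Note that Rc is not thread-safe, so handles of this shape cannot be sent across threads.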
Testing
The featured_articles example iterates through the first 500 featured articles on the English Wikipedia to test the parsing code.
License
parsoid-rs is (C) 2020 Kunal Mehta, released under the GPL v2 or any later version; see COPYING for details.