Crate parsoid

parsoid-rs

The parsoid crate is a wrapper around Parsoid HTML that provides convenient accessors for processing and extraction.

Inspired by mwparserfromhell and parsoid-jsapi, and built on top of Kuchiki (朽木).

Quick starts

Fetch HTML and extract the value of a template parameter:

use parsoid::prelude::*;
let client = ParsoidClient::new("https://en.wikipedia.org/api/rest_v1", "parsoid-rs demo")?;
let code = client.get("Taylor_Swift").await?.into_mutable();
for template in code.filter_templates()? {
    if template.name() == "Template:Infobox person" {
        let birth_name = template.param("birth_name").unwrap();
        assert_eq!(birth_name, "Taylor Alison Swift");
    }
}

Add a link to a page and convert it to wikitext:

use parsoid::prelude::*;
let client = ParsoidClient::new("https://en.wikipedia.org/api/rest_v1", "parsoid-rs demo")?;
let code = client.get("Wikipedia:Sandbox").await?.into_mutable();
let link = WikiLink::new(
    "./Special:Random",
    &Wikicode::new_text("Visit a random page")
);
code.append(&link);
let wikitext = client.transform_to_wikitext(&code).await?;
assert!(wikitext.ends_with("[[Special:Random|Visit a random page]]"));

This crate provides no functionality for actually saving a page; you’ll need to use something like mwbot for that.

Architecture

Conceptually this crate provides wiki-related types on top of an HTML processing library. There are three primary constructs to be aware of: Wikicode, Wikinode, and Template.

Wikicode represents a container of an entire wiki page, equivalent to an <html> or <body> node. It provides some convenience functions, like filter_links(), to easily operate on and mutate a specific Wikinode. (For convenience, Wikicode is also a Wikinode.)

use parsoid::prelude::*;
let client = ParsoidClient::new("https://en.wikipedia.org/api/rest_v1", "parsoid-rs demo")?;
let code = client.get("Taylor Swift").await?.into_mutable();
for link in code.filter_links() {
    if link.target() == "You Belong with Me" {
        // ...do something
    }
}

Filter functions are only provided for common types as an optimization, but it’s straightforward to implement the same filtering for other types:

use parsoid::prelude::*;
let client = ParsoidClient::new("https://en.wikipedia.org/api/rest_v1", "parsoid-rs demo")?;
let code = client.get("Taylor Swift").await?.into_mutable();
let entities: Vec<HtmlEntity> = code
    .descendants()
    .filter_map(|node| node.as_html_entity())
    .collect();

Wikinode is an enum representing all of the different types of Wikinodes, mostly to enable functions that accept/return various types of nodes.
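The enum-plus-accessor pattern described above can be sketched with a toy type using only the standard library. ToyNode, its variants, and collect_entities are hypothetical names made up for this illustration; they are not the crate's definitions, and the real Wikinode has many more variants.

```rust
// Toy stand-in for an enum of node types. The variant and method names
// are assumptions for illustration, not the parsoid crate's API.
#[derive(Debug, Clone, PartialEq)]
enum ToyNode {
    WikiLink { target: String },
    HtmlEntity(String),
    Generic(String),
}

impl ToyNode {
    /// Returns the entity text if this node is an HTML entity.
    fn as_html_entity(&self) -> Option<&str> {
        match self {
            ToyNode::HtmlEntity(text) => Some(text.as_str()),
            _ => None,
        }
    }

    /// Returns the link target if this node is a wikilink.
    fn as_wikilink_target(&self) -> Option<&str> {
        match self {
            ToyNode::WikiLink { target } => Some(target.as_str()),
            _ => None,
        }
    }
}

/// Accepts any mix of node types and keeps only the HTML entities,
/// mirroring the `filter_map` idiom used with `descendants()`.
fn collect_entities(nodes: &[ToyNode]) -> Vec<&str> {
    nodes
        .iter()
        .filter_map(|node| node.as_html_entity())
        .collect()
}
```

Because every accessor returns an Option, a function can take a heterogeneous list of nodes and narrow it to one type with filter_map, without the caller having to match on the enum.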

A Wikinode provides convenience functions for working with specific types of MediaWiki constructs. For example, the WikiLink type wraps around a node of <a rel="mw:WikiLink" href="...">...</a>. It provides functions for accessing or mutating the href attribute. To access the link text you would need to use .children() and modify or append to those nodes. Standard mutators like .append() and .insert_after() are part of the WikinodeIterator trait, which is automatically imported in the prelude.

The following nodes have been implemented so far:

  • BehaviorSwitch: __TOC__, {{DISPLAYTITLE:}}
  • Category: [[Category:Foo]]
  • Comment: <!-- ... -->
  • ExtLink: [https://example.org Text]
  • Heading: == Some text ==
  • HtmlEntity: &nbsp;
  • IncludeOnly: <includeonly>foo</includeonly>
  • InterwikiLink: [[:en:Foo]]
  • LanguageLink: [[en:Foo]]
  • Nowiki: <nowiki>[[foo]]</nowiki>
  • Redirect: #REDIRECT [[Foo]]
  • Section: Contains a Heading and its contents
  • WikiLink: [[Foo|bar]]
  • Generic: any node that we don’t have a more specific type for

Each Wikinode is effectively a wrapper around Rc<Node>, making it cheap to clone around.
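To see why an Rc-backed wrapper is cheap to clone, here is a minimal standard-library sketch. ToyNode and ToyLink are hypothetical stand-ins invented for this example; kuchiki's actual node type and the crate's WikiLink are richer than this.

```rust
use std::cell::RefCell;
use std::rc::Rc;

// Toy stand-in for an HTML node with a single mutable attribute.
struct ToyNode {
    href: RefCell<String>,
}

// Wrapper in the spirit of a Wikinode: it holds only a reference-counted
// pointer, so cloning it copies a pointer rather than the node.
#[derive(Clone)]
struct ToyLink(Rc<ToyNode>);

impl ToyLink {
    fn new(href: &str) -> Self {
        ToyLink(Rc::new(ToyNode {
            href: RefCell::new(href.to_string()),
        }))
    }

    /// Reads the current href value.
    fn target(&self) -> String {
        self.0.href.borrow().clone()
    }

    /// Overwrites the href on the shared node.
    fn set_target(&self, href: &str) {
        *self.0.href.borrow_mut() = href.to_string();
    }

    /// True when both handles point at the same underlying node.
    fn shares_node_with(&self, other: &ToyLink) -> bool {
        Rc::ptr_eq(&self.0, &other.0)
    }
}
```

In this sketch a clone is a second handle to the same node, so a mutation made through one handle is visible through the other. That is the trade-off of the design: clones are cheap, but they are not independent copies.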

Templates

Unlike Wikinodes, Templates do not have a 1:1 mapping with an HTML node; it’s possible to have multiple templates in one node. The main way to get Template instances is to call Wikicode::filter_templates().
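The many-templates-per-node relationship can be pictured with a toy model. It assumes that a transclusion node carries an ordered list of parts, each either a template call or a run of plain text between calls; Part, TransclusionNode, and template_names are hypothetical names for this sketch and say nothing about how the crate stores this internally.

```rust
// One entry in a transclusion node's list of parts (toy model).
enum Part {
    /// A template call such as {{foo}}.
    Template { name: String },
    /// Literal wikitext that sits between template calls.
    Text(String),
}

// A single HTML node that was produced by one or more template calls.
struct TransclusionNode {
    parts: Vec<Part>,
}

/// Lists the templates found in one node, in source order.
fn template_names(node: &TransclusionNode) -> Vec<&str> {
    node.parts
        .iter()
        .filter_map(|part| match part {
            Part::Template { name } => Some(name.as_str()),
            Part::Text(_) => None,
        })
        .collect()
}
```

In this model, wikitext like {{foo}} and {{bar}} that renders into a single node yields two templates from that one node, which is why templates are obtained through a dedicated filter function rather than by walking nodes one at a time.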

See the Template documentation for more details and examples.

noinclude and onlyinclude

Similar to Templates, <noinclude> and <onlyinclude> do not have a 1:1 mapping with a single HTML node, as they may span multiple nodes. The main way to get NoInclude or OnlyInclude instances is to call filter_noinclude() and filter_onlyinclude() respectively.
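A construct that spans several nodes can be modelled as the run of siblings between a start marker and an end marker. The sketch below assumes that marker-based layout; Sibling and collect_spans are hypothetical names for illustration only, not the crate's NoInclude or OnlyInclude implementation.

```rust
// Toy model of a flat list of sibling nodes.
#[derive(Debug, Clone, PartialEq)]
enum Sibling {
    /// Stands in for the marker that opens a span.
    Start,
    /// Stands in for the marker that closes a span.
    End,
    /// Any ordinary node, represented here by its text.
    Content(String),
}

/// Groups the content found between each Start/End pair.
/// Content outside a pair, and a span that is never closed, are ignored.
fn collect_spans(siblings: &[Sibling]) -> Vec<Vec<String>> {
    let mut spans = Vec::new();
    let mut current: Option<Vec<String>> = None;
    for sibling in siblings {
        match sibling {
            Sibling::Start => current = Some(Vec::new()),
            Sibling::End => {
                if let Some(done) = current.take() {
                    spans.push(done);
                }
            }
            Sibling::Content(text) => {
                if let Some(open) = current.as_mut() {
                    open.push(text.clone());
                }
            }
        }
    }
    spans
}
```

Each returned inner list corresponds to one logical span, even though its contents were separate sibling nodes in the input.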

See the module-level documentation for more details and examples.

Safety

This library is implemented using only safe Rust and should not panic. However, the HTML is expected to meet some level of well-formedness. For example, if a node has rel="mw:WikiLink", it is assumed it is an <a> element. This is not designed to be fully defensive for arbitrary HTML and should only be used with HTML from Parsoid itself or mutated by this or another similar library (contributions to improve this will gladly be welcomed!).

Additionally, Wikicode does not implement Send, which means it cannot be safely moved to or shared across threads. This is a limitation of the underlying kuchiki library being used.

An ImmutableWikicode is provided as a workaround: it is Send and contains all the same information Wikicode does, but is immutable. Switching between the two is straightforward by using into_immutable() and into_mutable() or by using the standard From and Into traits.
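The mutable/immutable split can be illustrated with two toy types built only on the standard library. MutableDoc and ImmutableDoc are hypothetical stand-ins, not the crate's types: the first holds an Rc and is therefore not Send, while the second owns a plain String and can cross a thread boundary.

```rust
use std::cell::RefCell;
use std::rc::Rc;
use std::thread;

// Toy mutable document: the Rc inside makes this type !Send.
struct MutableDoc {
    html: Rc<RefCell<String>>,
}

// Toy immutable snapshot: a plain String, so it is Send and Sync.
struct ImmutableDoc {
    html: String,
}

impl From<MutableDoc> for ImmutableDoc {
    fn from(doc: MutableDoc) -> Self {
        // Copy the current HTML out of the shared cell.
        let html = doc.html.borrow().clone();
        ImmutableDoc { html }
    }
}

impl From<ImmutableDoc> for MutableDoc {
    fn from(doc: ImmutableDoc) -> Self {
        MutableDoc {
            html: Rc::new(RefCell::new(doc.html)),
        }
    }
}

/// Converts to the immutable form, then reads it on another thread.
/// Passing the MutableDoc itself into `thread::spawn` would not compile.
fn length_on_worker_thread(doc: MutableDoc) -> usize {
    let frozen: ImmutableDoc = doc.into();
    let handle = thread::spawn(move || frozen.html.len());
    handle.join().expect("worker thread panicked")
}
```

The same shape applies when a page has to be held across a thread or task boundary: convert to the immutable form before crossing it, and convert back to the mutable form on the side that needs to edit.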

Contributing

parsoid is a part of the mwbot-rs project. We’re always looking for new contributors, so please reach out if you’re interested!

Modules

Re-export of IndexMap

Wikinodes represent various MediaWiki syntax constructs

Prelude to import to pull in traits and useful types

Structs

HTTP client to get Parsoid HTML from MediaWiki’s REST APIs

An immutable version of Wikicode that implements Send and Sync. It can be used interchangeably with a normal Wikicode for API methods.

Represents a MediaWiki template ({{foo}})

Container for HTML, usually represents the entire page

Enums

Primary error class

Traits

Trait for Wikinodes that actually span multiple elements

Collection of iterators and mutators that allow operating on a tree of Wikinodes

Type Definitions