parsoid 0.9.1

Wrapper around Parsoid HTML that provides convenient accessors for processing and manipulation
# parsoid

[![crates.io](https://img.shields.io/crates/v/parsoid.svg)](https://crates.io/crates/parsoid)
[![docs.rs](https://img.shields.io/docsrs/parsoid?label=docs.rs)](https://docs.rs/parsoid)
[![docs (main)](https://img.shields.io/badge/doc.wikimedia.org-green?label=docs%40main)](https://doc.wikimedia.org/mwbot-rs/mwbot/parsoid/)
[![pipeline status](https://gitlab.wikimedia.org/repos/mwbot-rs/mwbot/badges/main/pipeline.svg)](https://gitlab.wikimedia.org/repos/mwbot-rs/mwbot/-/commits/main)
[![coverage report](https://img.shields.io/endpoint?url=https%3A%2F%2Fdoc.wikimedia.org%2Fcover%2Fmwbot-rs%2Fmwbot%2Fcoverage%2Fcoverage.json)](https://doc.wikimedia.org/cover/mwbot-rs/mwbot/coverage)

The `parsoid` crate is a wrapper around [Parsoid HTML](https://www.mediawiki.org/wiki/Specs/HTML/2.8.0)
that provides convenient accessors for processing and extraction.

Inspired by [mwparserfromhell](https://github.com/earwig/mwparserfromhell/),
[parsoid-jsapi](https://github.com/wikimedia/parsoid-jsapi) and built on top
of [Kuchiki (朽木)](https://github.com/kuchiki-rs/kuchiki).

## Quick starts

Fetch HTML and extract the value of a template parameter:
```rust
use parsoid::prelude::*;
let client = ParsoidClient::new("https://en.wikipedia.org/api/rest_v1", "parsoid-rs demo")?;
let code = client.get("Taylor_Swift").await?.into_mutable();
for template in code.filter_templates()? {
    if template.name() == "Template:Infobox person" {
        let birth_name = template.param("birth_name").unwrap();
        assert_eq!(birth_name, "Taylor Alison Swift");
    }
}
```

Add a link to a page and convert it to wikitext:
```rust
use parsoid::prelude::*;
let client = ParsoidClient::new("https://en.wikipedia.org/api/rest_v1", "parsoid-rs demo")?;
let code = client.get("Wikipedia:Sandbox").await?.into_mutable();
let link = WikiLink::new(
    "./Special:Random",
    &Wikicode::new_text("Visit a random page")
);
code.append(&link);
let wikitext = client.transform_to_wikitext(&code).await?;
assert!(wikitext.ends_with("[[Special:Random|Visit a random page]]"));
```
This crate provides no functionality for actually saving a page; you'll
need to use something like [`mwbot`](https://docs.rs/mwbot) for that.

### Architecture
Conceptually this crate provides wiki-related types on top of an HTML processing
library. There are three primary constructs to be aware of: `Wikicode`,
`Wikinode`, and `Template`.

`Wikicode` represents a container of an entire wiki page, equivalent to a
`<html>` or `<body>` node. It provides some convenience functions like
`filter_links()` to easily operate on and mutate a specific type of Wikinode.
(For convenience, `Wikicode` is also a `Wikinode`.)

```rust
use parsoid::prelude::*;
let client = ParsoidClient::new("https://en.wikipedia.org/api/rest_v1", "parsoid-rs demo")?;
let code = client.get("Taylor Swift").await?.into_mutable();
for link in code.filter_links() {
    if link.target() == "You Belong with Me" {
        // ...do something
    }
}
```

Filter functions are only provided for common types as an optimization,
but it's straightforward to implement them for other types:
```rust
use parsoid::prelude::*;
let client = ParsoidClient::new("https://en.wikipedia.org/api/rest_v1", "parsoid-rs demo")?;
let code = client.get("Taylor Swift").await?.into_mutable();
let entities: Vec<HtmlEntity> = code
    .descendants()
    .filter_map(|node| node.as_html_entity())
    .collect();
```

`Wikinode` is an enum representing all of the different types of Wikinodes,
mostly to enable functions that accept/return various types of nodes.
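
To illustrate the idea, here is a minimal, standard-library-only sketch of an enum with an `as_*` accessor that pairs with `filter_map`. The `Node`, `Link`, `Comment`, and `links` names are hypothetical stand-ins and are not this crate's actual types or signatures.

```rust
// Hypothetical stand-ins for illustration only; not the crate's real types.
#[derive(Debug, Clone, PartialEq)]
pub struct Link {
    pub target: String,
}

#[derive(Debug, Clone, PartialEq)]
pub struct Comment {
    pub text: String,
}

/// One enum covering several node kinds, so a single function can
/// accept or return any of them.
#[derive(Debug, Clone, PartialEq)]
pub enum Node {
    Link(Link),
    Comment(Comment),
    Generic(String),
}

impl Node {
    /// Returns the inner `Link` if this node is one, `None` otherwise.
    pub fn as_link(&self) -> Option<Link> {
        match self {
            Node::Link(link) => Some(link.clone()),
            _ => None,
        }
    }
}

/// Collects only the links out of a mixed list of nodes.
pub fn links(nodes: &[Node]) -> Vec<Link> {
    nodes.iter().filter_map(|node| node.as_link()).collect()
}
```

An accessor that returns `Option` keeps call sites short: anything that is not the requested type is simply dropped by `filter_map`.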

A Wikinode provides convenience functions for working with specific
types of MediaWiki constructs. For example, the `WikiLink` type wraps around
a node of `<a rel="mw:WikiLink" href="...">...</a>`. It provides functions
for accessing or mutating the `href` attribute. To access the link text
you would need to use `.children()` and modify or append to those nodes.
Standard mutators like `.append()` and `.insert_after()` are part of the
`WikinodeIterator` trait, which is automatically imported in the prelude.

The following nodes have been implemented so far:
* `BehaviorSwitch`: `__TOC__`
* `Category`: `[[Category:Foo]]`
* `Comment`: `<!-- ... -->`
* `ExtLink`: `[https://example.org Text]`
* `Heading`: `== Some text ==`
* `HtmlEntity`: `&nbsp;`
* `IncludeOnly`: `<includeonly>foo</includeonly>`
* `InterwikiLink`: `[[:en:Foo]]`
* `LanguageLink`: `[[en:Foo]]`
* `Nowiki`: `<nowiki>[[foo]]</nowiki>`
* `Redirect`: `#REDIRECT [[Foo]]`
* `Section`: Contains a `Heading` and its contents
* `WikiLink`: `[[Foo|bar]]`
* `Generic`: any node that we don't have a more specific type for

Each Wikinode is effectively a wrapper around `Rc<Node>`, making it cheap to
clone around.
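
As a rough sketch of why this is cheap, the following standard-library-only example models a handle around `Rc`. `NodeData` and `LinkHandle` are hypothetical names, and the real crate wraps an HTML tree node rather than a single string.

```rust
use std::cell::RefCell;
use std::rc::Rc;

/// Hypothetical node payload, used here only for illustration.
#[derive(Debug)]
pub struct NodeData {
    pub href: RefCell<String>,
}

/// Hypothetical handle: cloning copies the pointer, not the node data.
#[derive(Debug, Clone)]
pub struct LinkHandle(Rc<NodeData>);

impl LinkHandle {
    pub fn new(href: &str) -> Self {
        LinkHandle(Rc::new(NodeData {
            href: RefCell::new(href.to_string()),
        }))
    }

    /// Reads the current link target.
    pub fn target(&self) -> String {
        self.0.href.borrow().clone()
    }

    /// Mutates the shared node through interior mutability.
    pub fn set_target(&self, href: &str) {
        *self.0.href.borrow_mut() = href.to_string();
    }

    /// True when both handles point at the same underlying node.
    pub fn same_node(&self, other: &LinkHandle) -> bool {
        Rc::ptr_eq(&self.0, &other.0)
    }
}
```

One consequence of this design is that a clone is another handle to the same node, so a change made through one handle is visible through every other.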

### Templates
Unlike Wikinodes, Templates do not have a 1:1 mapping with an HTML node; it's
possible to have multiple templates in one node. The main way to get
`Template` instances is to call `Wikicode::filter_templates()`.

See the [`Template`](./struct.Template.html) documentation for more details
and examples.
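
The "multiple templates in one node" case can be pictured with a small standard-library-only model. It assumes a node carries an ordered list of parts, each of which is either a template or plain text. `Part`, `TransclusionNode`, and `template_names` are hypothetical names, not this crate's API.

```rust
/// Hypothetical part of a transclusion: a template call or literal text.
#[derive(Debug, Clone, PartialEq)]
pub enum Part {
    Template { name: String },
    Text(String),
}

/// Hypothetical node that may hold several templates at once.
#[derive(Debug, Clone, PartialEq)]
pub struct TransclusionNode {
    pub parts: Vec<Part>,
}

/// Flattens every node's parts and keeps only the template names,
/// so the result can be longer than the list of nodes.
pub fn template_names(nodes: &[TransclusionNode]) -> Vec<String> {
    nodes
        .iter()
        .flat_map(|node| node.parts.iter())
        .filter_map(|part| match part {
            Part::Template { name } => Some(name.clone()),
            Part::Text(_) => None,
        })
        .collect()
}
```

Under this model, two nodes can yield three templates, which is why a template cannot simply be treated as one more kind of node.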

### noinclude and onlyinclude
Similar to Templates, `<noinclude>` and `<onlyinclude>` do not have a
1:1 mapping with a single HTML node, as they may span multiple nodes. The main
way to get `NoInclude` or `OnlyInclude` instances is to call
`filter_noinclude()` and `filter_onlyinclude()` respectively.

See the [module-level](./inclusion/) documentation for more details and
examples.
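
One way to picture a region that spans several nodes is a run of siblings delimited by a start marker and an end marker. The sketch below is standard-library-only and rests on that assumption. `Sibling` and `spans` are hypothetical names and do not reflect how this crate implements `filter_noinclude()`.

```rust
/// Hypothetical flat view of sibling nodes under one parent.
#[derive(Debug, Clone, PartialEq)]
pub enum Sibling {
    /// Stands in for the marker that opens a region.
    Start,
    /// Stands in for the marker that closes a region.
    End,
    /// Any ordinary node, reduced to its text for this sketch.
    Content(String),
}

/// Groups the content found between each start/end marker pair.
/// Content outside a region, or in a region that is never closed, is ignored.
pub fn spans(siblings: &[Sibling]) -> Vec<Vec<String>> {
    let mut regions = Vec::new();
    let mut current: Option<Vec<String>> = None;
    for sibling in siblings {
        match sibling {
            Sibling::Start => current = Some(Vec::new()),
            Sibling::End => {
                if let Some(done) = current.take() {
                    regions.push(done);
                }
            }
            Sibling::Content(text) => {
                if let Some(open) = current.as_mut() {
                    open.push(text.clone());
                }
            }
        }
    }
    regions
}
```

Each returned group corresponds to one region, however many sibling nodes it covers.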

### Safety
This library is implemented using only safe Rust and should not panic.
However, the HTML is expected to meet some level of well-formedness. For
example, if a node has `rel="mw:WikiLink"`, it is assumed it is an `<a>`
element. This is not designed to be fully defensive for arbitrary HTML
and should only be used with HTML from Parsoid itself or mutated by
this or another similar library (contributions to improve this will gladly
be welcomed!).

Additionally, `Wikicode` does not implement [`Send`](https://doc.rust-lang.org/std/marker/trait.Send.html),
which means it cannot be safely sent to another thread. This is a
limitation of the underlying kuchikiki library being used.

An `ImmutableWikicode` is provided as a workaround - it is `Send` and
contains all the same information `Wikicode` does, but is immutable.
Switching between the two is straightforward by using `into_immutable()` and
`into_mutable()` or by using the standard `From` and `Into` traits.
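
The pattern can be sketched with the standard library alone: a type holding an `Rc` is not `Send`, while a plain-data snapshot is. `MutableDoc`, `ImmutableDoc`, and `len_on_worker` are hypothetical names chosen for this illustration. Only the `into_immutable()` and `into_mutable()` method names come from the text above, and the bodies here are assumptions rather than the crate's implementation.

```rust
use std::rc::Rc;
use std::thread;

/// Hypothetical mutable document: holding an `Rc` makes it `!Send`.
pub struct MutableDoc {
    html: Rc<String>,
}

/// Hypothetical immutable snapshot: a plain `String` is `Send`.
pub struct ImmutableDoc {
    html: String,
}

impl MutableDoc {
    pub fn new(html: &str) -> Self {
        MutableDoc { html: Rc::new(html.to_string()) }
    }

    /// Copies the data out of the `Rc` into a thread-safe snapshot.
    pub fn into_immutable(self) -> ImmutableDoc {
        ImmutableDoc { html: (*self.html).clone() }
    }
}

impl ImmutableDoc {
    /// Rebuilds a mutable document on the current thread.
    pub fn into_mutable(self) -> MutableDoc {
        MutableDoc { html: Rc::new(self.html) }
    }

    pub fn html(&self) -> &str {
        &self.html
    }
}

/// Moves a snapshot to a worker thread and returns the HTML length.
/// Passing a `MutableDoc` here instead would fail to compile.
pub fn len_on_worker(doc: ImmutableDoc) -> usize {
    thread::spawn(move || doc.html().len())
        .join()
        .expect("worker thread panicked")
}
```

A typical flow under these assumptions is to convert to the immutable form before crossing a thread or `.await` boundary that requires `Send`, then convert back where mutation is needed.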

### Testing
Use the `build_corpus` example to download the first 500 featured articles
on the English Wikipedia to create a test corpus.

The `featured_articles` example will iterate through those downloaded examples
to test the parsing code, clean roundtripping, etc.

### Contributing
`parsoid` is a part of the [`mwbot-rs` project](https://www.mediawiki.org/wiki/Mwbot-rs).
We're always looking for new contributors, so please [reach out](https://www.mediawiki.org/wiki/Mwbot-rs#Contributing)
if you're interested!

## License
This crate is released under GPL-3.0-or-later.
See [COPYING](./COPYING) for details.