Struct WikiPage 

pub struct WikiPage { /* private fields */ }
A reusable Wikipedia HTML parser.

Reusing a single WikiPage instance across multiple articles is more efficient than creating one per article, since it avoids re-initialising the tree-sitter parser and grammar on each call.

Example

use wikipedia_article_transform::{WikiPage, ArticleItem};

let mut page = WikiPage::new().unwrap();
let items = page.extract_text("<p>Hello.</p>").unwrap();
if let ArticleItem::Paragraph(seg) = &items[0] {
    assert_eq!(seg.text, "Hello.");
}

Implementations

impl WikiPage

pub fn new() -> Result<Self>

Creates a new WikiPage, initialising the tree-sitter HTML parser.

pub fn set_base_url(&mut self, language: &str)

Sets the base URL used to resolve relative link hrefs.

Call this before extract_text when the HTML comes from a known origin. The language parameter is a Wikipedia language code (e.g. "en", "ml").

use wikipedia_article_transform::WikiPage;

let mut page = WikiPage::new().unwrap();
page.set_base_url("en");
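
How the language code maps to a base URL is not spelled out here. As a rough illustration only, the sketch below assumes the base is https://{language}.wikipedia.org and that relative hrefs are either Parsoid-style (./Title) or root-relative (/wiki/Title). The helper resolve_href is hypothetical and is not part of this crate.

```rust
/// Hypothetical helper: builds an absolute URL from a Wikipedia language code
/// and an href. The `https://{language}.wikipedia.org` layout is an assumption
/// made for this sketch, not documented behaviour of `set_base_url`.
fn resolve_href(language: &str, href: &str) -> String {
    // Absolute links are returned unchanged.
    if href.starts_with("http://") || href.starts_with("https://") {
        return href.to_string();
    }
    let origin = format!("https://{language}.wikipedia.org");
    if let Some(title) = href.strip_prefix("./") {
        // Parsoid-style relative link, e.g. "./Rust".
        format!("{origin}/wiki/{title}")
    } else if href.starts_with('/') {
        // Root-relative link, e.g. "/wiki/Kerala".
        // Protocol-relative links ("//host/...") are not handled here.
        format!("{origin}{href}")
    } else {
        // Bare title: treated as an article name under /wiki/.
        format!("{origin}/wiki/{href}")
    }
}
```

Under these assumptions, resolve_href("en", "./Rust") yields https://en.wikipedia.org/wiki/Rust, and an href that is already absolute passes through untouched.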

pub fn extract_text(&mut self, html: &str) -> Result<Vec<ArticleItem>>

Parses html and returns one ArticleItem per paragraph or image, in document order.

If any <ol class="references"> lists are found, a final ArticleItem::References item is appended containing all citations.

The parser state is reset on each call, so the same WikiPage can be reused safely across multiple articles.

Skipped elements: <script>, <style>, <link>, and elements with classes shortdescription, hatnote, infobox, reference, navbox, noprint, reflist, citation.
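
The skip rule above can be pictured as a simple predicate over an element's tag name and class attribute. The sketch below is an illustration, not the crate's implementation: is_skipped is a hypothetical helper, and it assumes classes are compared as exact whitespace-separated tokens (so "references" does not match "reference").

```rust
/// Tags named in the "Skipped elements" list.
const SKIPPED_TAGS: [&str; 3] = ["script", "style", "link"];

/// Classes named in the "Skipped elements" list.
const SKIPPED_CLASSES: [&str; 8] = [
    "shortdescription",
    "hatnote",
    "infobox",
    "reference",
    "navbox",
    "noprint",
    "reflist",
    "citation",
];

/// Hypothetical helper: returns true if an element with this tag name and
/// `class` attribute value falls under the skip list.
/// Assumption: class names are matched as exact tokens, not substrings.
fn is_skipped(tag: &str, class_attr: &str) -> bool {
    if SKIPPED_TAGS.contains(&tag) {
        return true;
    }
    // The class attribute is a whitespace-separated list of tokens.
    class_attr
        .split_whitespace()
        .any(|class| SKIPPED_CLASSES.contains(&class))
}
```

With exact token matching, a div with class "navbox vertical-navbox" is skipped, while an ol with class "references" is not, which fits the description of reference lists being collected rather than dropped.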

pub fn extract_text_plain(html: &str) -> Result<String>

Convenience function: parses html and returns all paragraph text joined by "\n\n".
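
The joining step can be sketched as follows. This is an illustration under stated assumptions: Item is a simplified, hypothetical stand-in for ArticleItem (the real variants carry richer data), and join_paragraphs is not a function of this crate.

```rust
/// Hypothetical, simplified stand-in for the crate's item type.
#[allow(dead_code)]
enum Item {
    Paragraph(String),
    Image(String),
    References(Vec<String>),
}

/// Keeps paragraph text only and joins it with a blank line between paragraphs.
fn join_paragraphs(items: &[Item]) -> String {
    items
        .iter()
        .filter_map(|item| match item {
            // Only paragraphs contribute text; images and references are dropped.
            Item::Paragraph(text) => Some(text.as_str()),
            _ => None,
        })
        .collect::<Vec<_>>()
        .join("\n\n")
}
```

Collecting into a Vec before joining keeps the separator strictly between paragraphs, so the result has no leading or trailing blank line.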

Trait Implementations

impl Default for WikiPage

fn default() -> Self

Returns the “default value” for a type.

Auto Trait Implementations

Blanket Implementations

impl<T> Any for T
where T: 'static + ?Sized,

fn type_id(&self) -> TypeId

Gets the TypeId of self.

impl<T> Borrow<T> for T
where T: ?Sized,

fn borrow(&self) -> &T

Immutably borrows from an owned value.

impl<T> BorrowMut<T> for T
where T: ?Sized,

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value.

impl<T> From<T> for T

fn from(t: T) -> T

Returns the argument unchanged.

impl<T, U> Into<U> for T
where U: From<T>,

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

impl<T, U> TryFrom<U> for T
where U: Into<T>,

type Error = Infallible

The type returned in the event of a conversion error.

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.