pub struct WikiPage { /* private fields */ }
A reusable Wikipedia HTML parser.
Reusing a single WikiPage instance across multiple articles is more efficient
than creating one per article, since it avoids re-initialising the tree-sitter
parser and grammar on each call.
§Example
use wikipedia_article_transform::{WikiPage, ArticleItem};
let mut page = WikiPage::new().unwrap();
let items = page.extract_text("<p>Hello.</p>").unwrap();
if let ArticleItem::Paragraph(seg) = &items[0] {
assert_eq!(seg.text, "Hello.");
}
Implementations§
impl WikiPage
pub fn set_base_url(&mut self, language: &str)
Set the base URL for resolving relative link hrefs.
Call this before extract_text when the HTML comes from a known origin.
The language parameter is a Wikipedia language code (e.g. "en", "ml").
use wikipedia_article_transform::WikiPage;
let mut page = WikiPage::new().unwrap();
page.set_base_url("en");
pub fn extract_text(&mut self, html: &str) -> Result<Vec<ArticleItem>>
Parses html and returns one ArticleItem per paragraph or image, in document order.
If any <ol class="references"> lists are found, a final
ArticleItem::References item is appended containing all citations.
The parser state is reset on each call, so the same WikiPage can be
reused safely across multiple articles.
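For instance, one parser can be driven over a whole batch of articles. A minimal sketch (the HTML snippets here stand in for full article bodies):
use wikipedia_article_transform::WikiPage;
let mut page = WikiPage::new().unwrap();
for html in ["<p>First article.</p>", "<p>Second article.</p>"] {
    // each call resets the parser state before processing
    let items = page.extract_text(html).unwrap();
    assert!(!items.is_empty());
}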
Skipped elements: <script>, <style>, and <link> tags, as well as any element
whose class list includes shortdescription, hatnote, infobox, reference,
navbox, noprint, reflist, or citation.
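A sketch of consuming the result; only the Paragraph payload's text field appears in this page's examples, so the other item kinds are handled with a catch-all rather than assumed payload shapes:
use wikipedia_article_transform::{WikiPage, ArticleItem};
let mut page = WikiPage::new().unwrap();
let items = page.extract_text("<p>Hello.</p>").unwrap();
for item in &items {
    match item {
        ArticleItem::Paragraph(seg) => println!("{}", seg.text),
        // images, and the trailing References item if present
        _ => {}
    }
}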
pub fn extract_text_plain(html: &str) -> Result<String>
Convenience function: parses html and returns all paragraph text joined by "\n\n".
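For example, assuming the two paragraphs below survive extraction unchanged:
use wikipedia_article_transform::WikiPage;
let text = WikiPage::extract_text_plain("<p>One.</p><p>Two.</p>").unwrap();
assert_eq!(text, "One.\n\nTwo.");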