// Copyright 2019 Fredrik Portström <https://portstrom.com>
// This is free software distributed under the terms specified in
// the file LICENSE at the top-level directory of this distribution.

//! Parse wiki text from Mediawiki into a tree of elements.
//!
//! # Introduction
//!
//! Wiki text is a format that follows the PHP maxim “Make everything as inconsistent and confusing as possible”. There are hundreds of millions of interesting documents written in this format, distributed under free licenses on sites that use the Mediawiki software, mainly Wikipedia and Wiktionary. Being able to parse wiki text and process these documents would allow access to a significant part of the world's knowledge.
//!
//! The Mediawiki software itself transforms a wiki text document into an HTML document in an outdated format to be displayed in a browser for a human reader. It does so through a [step by step procedure](https://www.mediawiki.org/wiki/Manual:Parser.php) of string substitutions, with some of the steps depending on the result of previous steps. [The main file for this procedure](https://doc.wikimedia.org/mediawiki-core/master/php/Parser_8php_source.html) has 6200 lines of code and the [second biggest file](https://doc.wikimedia.org/mediawiki-core/master/php/Preprocessor__DOM_8php_source.html) has 2000, and then there is a [1400-line file](https://doc.wikimedia.org/mediawiki-core/master/php/ParserOptions_8php_source.html) just to take options for the parser.
//!
//! What would be more interesting is to parse the wiki text document into a structure that can be used by a computer program to reason about the facts in the document and present them in different ways, making them available for a great variety of applications.
//!
//! Some people have tried to parse wiki text using regular expressions. This is incredibly naive and fails as soon as the wiki text is non-trivial. The capabilities of regular expressions don't come anywhere close to the complexity of the weirdness required to correctly parse wiki text. One project made a brave attempt to use a parser generator to parse wiki text. Wiki text, however, was never designed for formal parsers, so even parser generators are of no help in correctly parsing wiki text.
//!
//! Wiki text has a long history of poorly designed additions carelessly piled on top of each other. The syntax of wiki text is different in each wiki depending on its configuration. You can't even know what's a start tag until you see the corresponding end tag, and you can't know where the end tag is unless you parse the entire hierarchy of nested tags between the start tag and the end tag. In short: If you think you understand wiki text, you don't understand wiki text.
//!
//! Parse Wiki Text attempts to take all uncertainty out of parsing wiki text by converting it to another format that is easy to work with. The target format is Rust objects that can ergonomically be processed using iterators and match expressions.
//!
//! # Design goals
//!
//! ## Correctness
//!
//! Parse Wiki Text is designed to parse wiki text exactly as parsed by Mediawiki. Even when there is obviously a bug in Mediawiki, Parse Wiki Text replicates that exact bug. If there is something Parse Wiki Text doesn't parse exactly the same as Mediawiki, please report it as an issue.
//!
//! ## Speed
//!
//! Parse Wiki Text is designed to parse a page in as little time as possible. It parses tens of thousands of pages per second on each processor core and can quickly parse an entire wiki with millions of pages. If there is anything that can be changed to make Parse Wiki Text faster, please report it as an issue.
//!
//! ## Safety
//!
//! Parse Wiki Text is designed to work with untrusted inputs. If any input doesn't parse safely with reasonable resources, please report it as an issue. No unsafe code is used.
//!
//! ## Platform support
//!
//! Parse Wiki Text is designed to run in a wide variety of environments, such as:
//!
//! - servers running machine code
//! - browsers running Web Assembly
//! - embedded in other programming languages
//!
//! Parse Wiki Text can be deployed anywhere with no dependencies.
//!
//! # Caution
//!
//! Wiki text is a legacy format used by legacy software. Parse Wiki Text is intended only to recover information that has been written for wikis running legacy software, replicating the exact bugs found in the legacy software. Please don't use wiki text as a format for new applications. Wiki text is a horrible format with an astonishing number of inconsistencies, bad design choices and bugs. For new applications, please use a format that is designed to be easy to process, such as JSON or, even better, [CBOR](http://cbor.io). See [Wikidata](https://www.wikidata.org/wiki/Wikidata:Main_Page) for an example of a wiki that uses JSON as its format and provides a rich interface for editing data instead of letting people write code. If you need to take information written in wiki text and reuse it in a new application, you can use Parse Wiki Text to convert it to an intermediate format that you can further process into a modern format.
//!
//! # Site configuration
//!
//! Wiki text has plenty of features that are parsed in a way that depends on the configuration of the wiki. This means the configuration must be known before parsing.
//!
//! - External links are parsed only when the scheme of the URI of the link is in the configured list of valid protocols. When the scheme is not valid, the link is parsed as plain text.
//! - Categories and images superficially look the same as links, but are parsed differently. These can only be distinguished by knowing the namespace aliases from the configuration of the wiki.
//! - Text matching the configured set of magic words is parsed as magic words.
//! - Extension tags have the same syntax as HTML tags, but are parsed differently. The configuration tells which tag names are to be treated as extension tags.
//!
//! The configuration can be seen by making a request to the [site info](https://www.mediawiki.org/wiki/API:Siteinfo) resource on the wiki. The utility [Fetch site configuration](https://github.com/portstrom/fetch_site_configuration) fetches the parts of the configuration needed for parsing pages in the wiki, and outputs Rust code for instantiating a parser with that configuration. Parse Wiki Text contains a default configuration that can be used for testing.
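//!
//! As a minimal sketch of the external link rule above, the following decides whether the text after `[` starts with a configured protocol. The names `starts_with_protocol`, `Bracketed` and `classify` are hypothetical and not part of this crate, and matching is assumed to be ASCII case-insensitive. The parser itself keeps its protocols in an internal trie rather than a slice.
//!
/*!
```rust
/// Hypothetical helper, not part of this crate: returns true if `text`
/// starts with one of the configured protocols, ignoring ASCII case.
fn starts_with_protocol(text: &str, protocols: &[&str]) -> bool {
    protocols.iter().any(|protocol| {
        // Compare bytes so that slicing never splits a multi-byte character.
        text.len() >= protocol.len()
            && text.as_bytes()[..protocol.len()].eq_ignore_ascii_case(protocol.as_bytes())
    })
}

/// Illustrative classification of the text following `[`.
#[derive(Debug, PartialEq)]
enum Bracketed {
    /// The scheme is configured, so the code would be parsed as an external link.
    ExternalLink,
    /// The scheme is not configured, so the code would be parsed as plain text.
    PlainText,
}

/// Applies the rule: a configured protocol gives a link, anything else gives text.
fn classify(text: &str, protocols: &[&str]) -> Bracketed {
    if starts_with_protocol(text, protocols) {
        Bracketed::ExternalLink
    } else {
        Bracketed::PlainText
    }
}
```
*/
//!
//! With the protocols `http://`, `https://` and `mailto:` configured, `classify` treats text starting with `HTTPS://` as an external link and text starting with `gopher://` as plain text.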
//!
//! # Limitations
//!
//! Wiki text was never designed to be possible to parse into a structured format. It's designed to be parsed in multiple passes, where each pass depends on the output of the previous pass. Most importantly, templates are expanded in an earlier pass and formatting codes are parsed in a later pass. This means the formatting codes you see in the original text are not necessarily the same as the parser will see after templates have been expanded. Luckily this is as bad for human editors as it is for computers, so people tend to avoid writing templates that cause formatting codes to be parsed in a way that differs from what they would expect from reading the original wiki text before expanding templates. Parse Wiki Text assumes that templates never change the meaning of formatting codes around them.
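//!
//! The effect of the pass ordering can be sketched with plain string substitution. The helpers `expand_template` and `count_italic_toggles` are hypothetical and far simpler than real template expansion. They only show that the formatting codes visible before expansion can differ from those present after it.
//!
/*!
```rust
/// Hypothetical, drastically simplified template expansion: replaces every
/// occurrence of `{{name}}` with `body`. Real expansion also handles
/// parameters, nesting and much more.
fn expand_template(source: &str, name: &str, body: &str) -> String {
    let call = ["{{", name, "}}"].concat();
    source.replace(&call, body)
}

/// Counts non-overlapping occurrences of the italic code `''`.
/// Bold (`'''`) is deliberately not handled in this sketch.
fn count_italic_toggles(text: &str) -> usize {
    text.matches("''").count()
}
```
*/
//!
//! With a hypothetical template `i` whose body is `''`, the text `{{i}}word''` contains one italic code before expansion and two after it. A parser that doesn't expand templates sees an unbalanced toggle where a multi-pass parser sees a balanced pair.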
//!
//! # Sandbox
//!
//! A sandbox ([Github](https://github.com/portstrom/parse_wiki_text_sandbox), [try online](https://portstrom.com/parse_wiki_text_sandbox/)) is available that allows interactively entering wiki text and inspecting the result of parsing it.
//!
//! # Comparison with Mediawiki Parser
//!
//! There is another crate called Mediawiki Parser ([crates.io](https://crates.io/crates/mediawiki_parser), [Github](https://github.com/vroland/mediawiki-parser)) that does basically the same thing, parsing wiki text to a tree of elements. That crate however doesn't take into account any of the astonishing amount of weirdness required to correctly parse wiki text. That crate admittedly only parses a subset of wiki text, with the intention to report errors for any text that is too weird to fit that subset. That is a good intention, but on examination the subset is quickly found to be too small to parse pages from actual wikis. Even worse, the error reporting is just an empty promise, and there's no indication when a text is incorrectly parsed.
//!
//! That crate could possibly be improved to always report errors when a text isn't in the supported subset, but pages found in real wikis very often don't conform to the small subset of wiki text that can be parsed without weirdness, so it still wouldn't be useful. Improving that crate to correctly parse a large enough subset of wiki text would be as much effort as starting over from scratch, which is why Parse Wiki Text was made without taking anything from Mediawiki Parser. Parse Wiki Text aims to correctly parse all wiki text, not just a subset, and report warnings when encountering weirdness that should be avoided.
//!
//! # Examples
//!
//! The default configuration is used for testing purposes only.
//! For parsing a real wiki you need a site-specific configuration.
//! Reuse the same configuration when parsing multiple pages for efficiency.
//!
//! ```
//! use parse_wiki_text::{Configuration, Node};
//! let wiki_text = concat!(
//!     "==Our values==\n",
//!     "*Correctness\n",
//!     "*Speed\n",
//!     "*Ergonomics"
//! );
//! let result = Configuration::default().parse(wiki_text);
//! assert!(result.warnings.is_empty());
//! # let mut found = false;
//! for node in result.nodes {
//!     if let Node::UnorderedList { items, .. } = node {
//!         println!("Our values are:");
//!         for item in items {
//!             println!("- {}", item.nodes.iter().map(|node| match node {
//!                 Node::Text { value, .. } => value,
//!                 _ => ""
//!             }).collect::<String>());
//! #           found = true;
//!         }
//!     }
//! }
//! # assert!(found);
//! ```
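//!
//! Every parsed element carries `start` and `end` as byte positions in the wiki text, which makes it possible to recover the source of an element by slicing. The following is a sketch only: `Span` and `source_of` are hypothetical names standing in for a parsed element and are not part of this crate, and `end` is assumed to be exclusive.
//!
/*!
```rust
/// Hypothetical stand-in for any parsed element; only the positions matter here.
struct Span {
    /// Byte position where the element starts.
    start: usize,
    /// Byte position where the element ends (assumed exclusive).
    end: usize,
}

/// Returns the source text covered by the span, or `None` if the positions
/// are out of range or do not fall on UTF-8 character boundaries.
fn source_of<'a>(wiki_text: &'a str, span: &Span) -> Option<&'a str> {
    wiki_text.get(span.start..span.end)
}
```
*/
//!
//! Because the positions count bytes rather than characters, text containing non-ASCII characters shifts the positions of everything after it. Using `str::get` instead of indexing avoids a panic if a position is ever out of range or inside a multi-byte character.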

#![forbid(unsafe_code)]
#![warn(missing_docs)]

mod bold_italic;
mod case_folding_simple;
mod character_entity;
mod comment;
mod configuration;
mod default;
mod external_link;
mod heading;
mod html_entities;
mod line;
mod link;
mod list;
mod magic_word;
mod parse;
mod positioned;
mod redirect;
mod state;
mod table;
mod tag;
mod template;
mod trie;
mod warning;

pub use configuration::ConfigurationSource;
use configuration::Namespace;
use state::{OpenNode, OpenNodeType, State};
use std::{
    borrow::Cow,
    collections::{HashMap, HashSet},
};
use trie::Trie;
pub use warning::{Warning, WarningMessage};

/// Configuration for the parser.
///
/// A configuration to correctly parse a real wiki can be created with `Configuration::new`. A configuration for testing and quick and dirty prototyping can be created with `Default::default`.
pub struct Configuration {
    character_entities: Trie<char>,
    link_trail_character_set: HashSet<char>,
    magic_words: Trie<()>,
    namespaces: Trie<Namespace>,
    protocols: Trie<()>,
    redirect_magic_words: Trie<()>,
    tag_name_map: HashMap<String, TagClass>,
}

/// List item of a definition list.
#[derive(Debug)]
pub struct DefinitionListItem<'a> {
    /// The byte position in the wiki text where the element ends.
    pub end: usize,

    /// The content of the element.
    pub nodes: Vec<Node<'a>>,

    /// The byte position in the wiki text where the element starts.
    pub start: usize,

    /// The type of list item.
    pub type_: DefinitionListItemType,
}

/// Identifier for the type of a definition list item.
#[derive(Clone, Copy, Debug, Eq, Hash, PartialEq)]
pub enum DefinitionListItemType {
    /// Parsed from the code `:`.
    Details,

    /// Parsed from the code `;`.
    Term,
}

/// List item of an ordered list or unordered list.
#[derive(Debug)]
pub struct ListItem<'a> {
    /// The byte position in the wiki text where the element ends.
    pub end: usize,

    /// The content of the element.
    pub nodes: Vec<Node<'a>>,

    /// The byte position in the wiki text where the element starts.
    pub start: usize,
}

/// Parsed node.
#[derive(Debug)]
pub enum Node<'a> {
    /// Toggle bold text. Parsed from the code `'''`.
    Bold {
        /// The byte position in the wiki text where the element ends.
        end: usize,

        /// The byte position in the wiki text where the element starts.
        start: usize,
    },

    /// Toggle bold and italic text. Parsed from the code `'''''`.
    BoldItalic {
        /// The byte position in the wiki text where the element ends.
        end: usize,

        /// The byte position in the wiki text where the element starts.
        start: usize,
    },

    /// Category. Parsed from code starting with `[[`, a category namespace and `:`.
    Category {
        /// The byte position in the wiki text where the element ends.
        end: usize,

        /// Additional information for sorting entries on the category page, if any.
        ordinal: Vec<Node<'a>>,

        /// The byte position in the wiki text where the element starts.
        start: usize,

        /// The category referred to.
        target: &'a str,
    },

    /// Character entity. Parsed from code starting with `&` and ending with `;`.
    CharacterEntity {
        /// The character represented.
        character: char,

        /// The byte position in the wiki text where the element ends.
        end: usize,

        /// The byte position in the wiki text where the element starts.
        start: usize,
    },

    /// Comment. Parsed from code starting with `<!--`.
    Comment {
        /// The byte position in the wiki text where the element ends.
        end: usize,

        /// The byte position in the wiki text where the element starts.
        start: usize,
    },

    /// Definition list. Parsed from code starting with `:` or `;`.
    DefinitionList {
        /// The byte position in the wiki text where the element ends.
        end: usize,

        /// The list items of the list.
        items: Vec<DefinitionListItem<'a>>,

        /// The byte position in the wiki text where the element starts.
        start: usize,
    },

    /// End tag. Parsed from code starting with `</` and a valid tag name.
    EndTag {
        /// The byte position in the wiki text where the element ends.
        end: usize,

        /// The tag name.
        name: Cow<'a, str>,

        /// The byte position in the wiki text where the element starts.
        start: usize,
    },

    /// External link. Parsed from code starting with `[` and a valid protocol.
    ExternalLink {
        /// The byte position in the wiki text where the element ends.
        end: usize,

        /// The content of the element.
        nodes: Vec<Node<'a>>,

        /// The byte position in the wiki text where the element starts.
        start: usize,
    },

    /// Heading. Parsed from code starting with `=` and ending with `=`.
    Heading {
        /// The byte position in the wiki text where the element ends.
        end: usize,

        /// The level of the heading from 1 to 6.
        level: u8,

        /// The content of the element.
        nodes: Vec<Node<'a>>,

        /// The byte position in the wiki text where the element starts.
        start: usize,
    },

    /// Horizontal divider. Parsed from code starting with `----`.
    HorizontalDivider {
        /// The byte position in the wiki text where the element ends.
        end: usize,

        /// The byte position in the wiki text where the element starts.
        start: usize,
    },

    /// Image. Parsed from code starting with `[[`, a file namespace and `:`.
    Image {
        /// The byte position in the wiki text where the element ends.
        end: usize,

        /// The byte position in the wiki text where the element starts.
        start: usize,

        /// The file name of the image.
        target: &'a str,

        /// Additional information for the image.
        text: Vec<Node<'a>>,
    },

    /// Toggle italic text. Parsed from the code `''`.
    Italic {
        /// The byte position in the wiki text where the element ends.
        end: usize,

        /// The byte position in the wiki text where the element starts.
        start: usize,
    },

    /// Link. Parsed from code starting with `[[` and ending with `]]`.
    Link {
        /// The byte position in the wiki text where the element ends.
        end: usize,

        /// The byte position in the wiki text where the element starts.
        start: usize,

        /// The target of the link.
        target: &'a str,

        /// The text to display for the link.
        text: Vec<Node<'a>>,
    },

    /// Magic word. Parsed from the code `__`, a valid magic word and `__`.
    MagicWord {
        /// The byte position in the wiki text where the element ends.
        end: usize,

        /// The byte position in the wiki text where the element starts.
        start: usize,
    },

    /// Ordered list. Parsed from code starting with `#`.
    OrderedList {
        /// The byte position in the wiki text where the element ends.
        end: usize,

        /// The list items of the list.
        items: Vec<ListItem<'a>>,

        /// The byte position in the wiki text where the element starts.
        start: usize,
    },

    /// Paragraph break. Parsed from an empty line between elements that can appear within a paragraph.
    ParagraphBreak {
        /// The byte position in the wiki text where the element ends.
        end: usize,

        /// The byte position in the wiki text where the element starts.
        start: usize,
    },

    /// Parameter. Parsed from code starting with `{{{` and ending with `}}}`.
    Parameter {
        /// The default value of the parameter.
        default: Option<Vec<Node<'a>>>,

        /// The byte position in the wiki text where the element ends.
        end: usize,

        /// The name of the parameter.
        name: Vec<Node<'a>>,

        /// The byte position in the wiki text where the element starts.
        start: usize,
    },

    /// Block of preformatted text. Parsed from code starting with a space at the beginning of a line.
    Preformatted {
        /// The byte position in the wiki text where the element ends.
        end: usize,

        /// The content of the element.
        nodes: Vec<Node<'a>>,

        /// The byte position in the wiki text where the element starts.
        start: usize,
    },

    /// Redirect. Parsed at the start of the wiki text from code starting with `#` followed by a redirect magic word.
    Redirect {
        /// The byte position in the wiki text where the element ends.
        end: usize,

        /// The target of the redirect.
        target: &'a str,

        /// The byte position in the wiki text where the element starts.
        start: usize,
    },

    /// Start tag. Parsed from code starting with `<` and a valid tag name.
    StartTag {
        /// The byte position in the wiki text where the element ends.
        end: usize,

        /// The tag name.
        name: Cow<'a, str>,

        /// The byte position in the wiki text where the element starts.
        start: usize,
    },

    /// Table. Parsed from code starting with `{|`.
    Table {
        /// The HTML attributes of the element.
        attributes: Vec<Node<'a>>,

        /// The captions of the table.
        captions: Vec<TableCaption<'a>>,

        /// The byte position in the wiki text where the element ends.
        end: usize,

        /// The rows of the table.
        rows: Vec<TableRow<'a>>,

        /// The byte position in the wiki text where the element starts.
        start: usize,
    },

    /// Extension tag. Parsed from code starting with `<` and the tag name of a valid extension tag.
    Tag {
        /// The byte position in the wiki text where the element ends.
        end: usize,

        /// The tag name.
        name: Cow<'a, str>,

        /// The content of the tag, between the start tag and the end tag, if any.
        nodes: Vec<Node<'a>>,

        /// The byte position in the wiki text where the element starts.
        start: usize,
    },

    /// Template. Parsed from code starting with `{{` and ending with `}}`.
    Template {
        /// The byte position in the wiki text where the element ends.
        end: usize,

        /// The name of the template.
        name: Vec<Node<'a>>,

        /// The parameters of the template.
        parameters: Vec<Parameter<'a>>,

        /// The byte position in the wiki text where the element starts.
        start: usize,
    },

    /// Plain text.
    Text {
        /// The byte position in the wiki text where the element ends.
        end: usize,

        /// The byte position in the wiki text where the element starts.
        start: usize,

        /// The text.
        value: &'a str,
    },

    /// Unordered list. Parsed from code starting with `*`.
    UnorderedList {
        /// The byte position in the wiki text where the element ends.
        end: usize,

        /// The list items of the list.
        items: Vec<ListItem<'a>>,

        /// The byte position in the wiki text where the element starts.
        start: usize,
    },
}

/// Output of parsing wiki text.
#[derive(Debug)]
pub struct Output<'a> {
    /// The top level of parsed nodes.
    pub nodes: Vec<Node<'a>>,

    /// Warnings from the parser telling that something is not well-formed.
    pub warnings: Vec<Warning>,
}

/// Template parameter.
#[derive(Debug)]
pub struct Parameter<'a> {
    /// The byte position in the wiki text where the element ends.
    pub end: usize,

    /// The name of the parameter, if any.
    pub name: Option<Vec<Node<'a>>>,

    /// The byte position in the wiki text where the element starts.
    pub start: usize,

    /// The value of the parameter.
    pub value: Vec<Node<'a>>,
}

/// Element that has a start position and end position.
pub trait Positioned {
    /// The byte position in the wiki text where the element ends.
    fn end(&self) -> usize;

    /// The byte position in the wiki text where the element starts.
    fn start(&self) -> usize;
}

#[derive(Copy, Clone, Debug, Eq, Hash, PartialEq)]
enum TagClass {
    ExtensionTag,
    Tag,
}

/// Table caption.
#[derive(Debug)]
pub struct TableCaption<'a> {
    /// The HTML attributes of the element.
    pub attributes: Option<Vec<Node<'a>>>,

    /// The content of the element.
    pub content: Vec<Node<'a>>,

    /// The byte position in the wiki text where the element ends.
    pub end: usize,

    /// The byte position in the wiki text where the element starts.
    pub start: usize,
}

/// Table cell.
#[derive(Debug)]
pub struct TableCell<'a> {
    /// The HTML attributes of the element.
    pub attributes: Option<Vec<Node<'a>>>,

    /// The content of the element.
    pub content: Vec<Node<'a>>,

    /// The byte position in the wiki text where the element ends.
    pub end: usize,

    /// The byte position in the wiki text where the element starts.
    pub start: usize,

    /// The type of cell.
    pub type_: TableCellType,
}

/// Type of table cell.
#[derive(Copy, Clone, Debug, Eq, Hash, PartialEq)]
pub enum TableCellType {
    /// Heading cell.
    Heading,

    /// Ordinary cell.
    Ordinary,
}

/// Table row.
#[derive(Debug)]
pub struct TableRow<'a> {
    /// The HTML attributes of the element.
    pub attributes: Vec<Node<'a>>,

    /// The cells in the row.
    pub cells: Vec<TableCell<'a>>,

    /// The byte position in the wiki text where the element ends.
    pub end: usize,

    /// The byte position in the wiki text where the element starts.
    pub start: usize,
}