parse_wiki_text_2/
lib.rs

// Copyright 2019 Fredrik Portström <https://portstrom.com>
// This is free software distributed under the terms specified in
// the file LICENSE at the top-level directory of this distribution.

//! Parse wiki text from Mediawiki into a tree of elements.
//!
//! # Introduction
//!
//! Wiki text is a format that follows the PHP maxim “Make everything as inconsistent and confusing as possible”. There are hundreds of millions of interesting documents written in this format, distributed under free licenses on sites that use the Mediawiki software, mainly Wikipedia and Wiktionary. Being able to parse wiki text and process these documents would allow access to a significant part of the world's knowledge.
//!
//! The Mediawiki software itself transforms a wiki text document into an HTML document in an outdated format to be displayed in a browser for a human reader. It does so through a [step by step procedure](https://www.mediawiki.org/wiki/Manual:Parser.php) of string substitutions, with some of the steps depending on the result of previous steps. [The main file for this procedure](https://doc.wikimedia.org/mediawiki-core/master/php/Parser_8php_source.html) has 6200 lines of code and the [second biggest file](https://doc.wikimedia.org/mediawiki-core/master/php/Preprocessor__DOM_8php_source.html) has 2000, and then there is a [1400 line file](https://doc.wikimedia.org/mediawiki-core/master/php/ParserOptions_8php_source.html) just to take options for the parser.
//!
//! What would be more interesting is to parse the wiki text document into a structure that can be used by a computer program to reason about the facts in the document and present them in different ways, making them available for a great variety of applications.
//!
//! Some people have tried to parse wiki text using regular expressions. This is incredibly naive and fails as soon as the wiki text is non-trivial. The capabilities of regular expressions don't come anywhere close to the complexity of the weirdness required to correctly parse wiki text. One project made a brave attempt to use a parser generator to parse wiki text. Wiki text was however never designed for formal parsers, so even parser generators are of no help in correctly parsing wiki text.
//!
//! Wiki text has a long history of poorly designed additions carelessly piled on top of each other. The syntax of wiki text is different in each wiki depending on its configuration. You can't even know what's a start tag until you see the corresponding end tag, and you can't know where the end tag is unless you parse the entire hierarchy of nested tags between the start tag and the end tag. In short: If you think you understand wiki text, you don't understand wiki text.
//!
//! Parse Wiki Text attempts to take all uncertainty out of parsing wiki text by converting it to another format that is easy to work with. The target format is Rust objects that can ergonomically be processed using iterators and match expressions.
//!
//! # Design goals
//!
//! ## Correctness
//!
//! Parse Wiki Text is designed to parse wiki text exactly as parsed by Mediawiki. Even when there is obviously a bug in Mediawiki, Parse Wiki Text replicates that exact bug. If there is something Parse Wiki Text doesn't parse exactly the same as Mediawiki, please report it as an issue.
//!
//! ## Speed
//!
//! Parse Wiki Text is designed to parse a page in as little time as possible. It parses tens of thousands of pages per second on each processor core and can quickly parse an entire wiki with millions of pages. If there is anything that can be changed to make Parse Wiki Text faster, please report it as an issue.
//!
//! ## Safety
//!
//! Parse Wiki Text is designed to work with untrusted inputs. If any input doesn't parse safely with reasonable resources, please report it as an issue. No unsafe code is used.
//!
//! ## Platform support
//!
//! Parse Wiki Text is designed to run in a wide variety of environments, such as:
//!
//! - servers running machine code
//! - browsers running Web Assembly
//! - embedded in other programming languages
//!
//! Parse Wiki Text can be deployed anywhere with no dependencies.
//!
//! # Caution
//!
//! Wiki text is a legacy format used by legacy software. Parse Wiki Text is intended only to recover information that has been written for wikis running legacy software, replicating the exact bugs found in the legacy software. Please don't use wiki text as a format for new applications. Wiki text is a horrible format with an astonishing amount of inconsistencies, bad design choices and bugs. For new applications, please use a format that is designed to be easy to process, such as JSON or even better [CBOR](http://cbor.io). See [Wikidata](https://www.wikidata.org/wiki/Wikidata:Main_Page) for an example of a wiki that uses JSON as its format and provides a rich interface for editing data instead of letting people write code. If you need to take information written in wiki text and reuse it in a new application, you can use Parse Wiki Text to convert it to an intermediate format that you can further process into a modern format.
//!
//! # Site configuration
//!
//! Wiki text has plenty of features that are parsed in a way that depends on the configuration of the wiki. This means the configuration must be known before parsing.
//!
//! - External links are parsed only when the scheme of the URI of the link is in the configured list of valid protocols. When the scheme is not valid, the link is parsed as plain text.
//! - Categories and images superficially look the same as links, but are parsed differently. They can only be distinguished by knowing the namespace aliases from the configuration of the wiki.
//! - Text matching the configured set of magic words is parsed as magic words.
//! - Extension tags have the same syntax as HTML tags, but are parsed differently. The configuration tells which tag names are to be treated as extension tags.
//!
//! The configuration can be seen by making a request to the [site info](https://www.mediawiki.org/wiki/API:Siteinfo) resource on the wiki. The utility [Fetch site configuration](https://github.com/portstrom/fetch_site_configuration) fetches the parts of the configuration needed for parsing pages in the wiki, and outputs Rust code for instantiating a parser with that configuration. Parse Wiki Text contains a default configuration that can be used for testing.
//!
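//! As a sketch, a generated configuration is passed to `Configuration::new` through a `ConfigurationSource`. The field values below are illustrative only, not taken from any real wiki; use the output of Fetch site configuration instead:
//!
//! ```ignore
//! use parse_wiki_text_2::{Configuration, ConfigurationSource};
//!
//! // Illustrative fragment only; a real configuration lists every namespace
//! // alias, protocol and magic word of the target wiki.
//! let configuration = Configuration::new(&ConfigurationSource {
//!	category_namespaces: &["category"],
//!	extension_tags: &["nowiki", "ref", "references"],
//!	file_namespaces: &["file", "image"],
//!	link_trail: "abcdefghijklmnopqrstuvwxyz",
//!	magic_words: &["NOTOC", "TOC"],
//!	protocols: &["http://", "https://"],
//!	redirect_magic_words: &["REDIRECT"],
//! });
//! ```
//!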
//! # Limitations
//!
//! Wiki text was never designed to be parsed into a structured format. It's designed to be parsed in multiple passes, where each pass depends on the output of the previous pass. Most importantly, templates are expanded in an earlier pass and formatting codes are parsed in a later pass. This means the formatting codes you see in the original text are not necessarily the same as what the parser will see after templates have been expanded. Luckily this is as bad for human editors as it is for computers, so people tend to avoid writing templates that cause formatting codes to be parsed in a way that differs from what they would expect from reading the original wiki text before expanding templates. Parse Wiki Text assumes that templates never change the meaning of the formatting codes around them.
//!
//! # Sandbox
//!
//! A sandbox ([Github](https://github.com/portstrom/parse_wiki_text_sandbox), [try online](https://portstrom.com/parse_wiki_text_sandbox/)) is available that allows interactively entering wiki text and inspecting the result of parsing it.
//!
//! # Comparison with Mediawiki Parser
//!
//! There is another crate called Mediawiki Parser ([crates.io](https://crates.io/crates/mediawiki_parser), [Github](https://github.com/vroland/mediawiki-parser)) that does basically the same thing, parsing wiki text to a tree of elements. That crate, however, doesn't take into account any of the astonishing amount of weirdness required to correctly parse wiki text. It admittedly parses only a subset of wiki text, with the intention of reporting errors for any text too weird to fit that subset. That is a good intention, but on examination the subset turns out to be too small to parse pages from actual wikis, and worse, the error reporting is an empty promise: there is no indication when a text is incorrectly parsed.
//!
//! That crate could possibly be improved to always report errors when a text isn't in the supported subset, but pages found in real wikis very often don't conform to the small subset of wiki text that can be parsed without weirdness, so it still wouldn't be useful. Improving it to correctly parse a large enough subset of wiki text would be as much effort as starting over from scratch, which is why Parse Wiki Text was made without taking anything from Mediawiki Parser. Parse Wiki Text aims to correctly parse all wiki text, not just a subset, and to report warnings when encountering weirdness that should be avoided.
//!
//! # Examples
//!
//! The default configuration is used for testing purposes only.
//! For parsing a real wiki you need a site-specific configuration.
//! Reuse the same configuration when parsing multiple pages for efficiency.
//!
//! ```
//! use parse_wiki_text_2::{Configuration, Node};
//! let wiki_text = "\
//!		==Our values==\n\
//!		*Correctness\n\
//!		*Speed\n\
//!		*Ergonomics\
//! ";
//! let result = Configuration::default().parse(wiki_text).expect("parsing timed out");
//! assert!(result.warnings.is_empty());
//! # let mut found = false;
//! for node in result.nodes {
//!	if let Node::UnorderedList { items, .. } = node {
//!		println!("Our values are:");
//!		for item in items {
//!			let text = item.nodes.iter().map(|node| match node {
//!				Node::Text { value, .. } => value,
//!				_ => ""
//!			}).collect::<String>();
//!			println!("- {text}");
//! #			found = true;
//!		}
//!	}
//! }
//! # assert!(found);
//! ```
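//!
//! Headings can be extracted the same way. The following short sketch uses only the default configuration and the node types defined in this crate:
//!
//! ```
//! use parse_wiki_text_2::{Configuration, Node};
//! let result = Configuration::default()
//!	.parse("==Design goals==\nCorrectness and speed.")
//!	.expect("parsing timed out");
//! for node in result.nodes {
//!	if let Node::Heading { level, nodes, .. } = node {
//!		// Collect the plain text of the heading, ignoring nested markup.
//!		let title = nodes.iter().map(|node| match node {
//!			Node::Text { value, .. } => value,
//!			_ => ""
//!		}).collect::<String>();
//!		println!("level {level}: {title}");
//!	}
//! }
//! ```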
#![forbid(unsafe_code)]
#![warn(missing_docs)]

mod bold_italic;
mod case_folding_simple;
mod character_entity;
mod comment;
mod configuration;
mod default;
mod external_link;
mod heading;
mod html_entities;
mod line;
mod link;
mod list;
mod magic_word;
mod parse;
mod positioned;
mod redirect;
mod state;
mod table;
mod tag;
mod template;
mod trie;
mod warning;

pub use configuration::ConfigurationSource;
use configuration::Namespace;
pub use parse::ParseError;
use state::{OpenNode, OpenNodeType, State};
use std::{
	borrow::Cow,
	collections::{HashMap, HashSet},
};
use trie::Trie;
pub use warning::{Warning, WarningMessage};

/// Configuration for the parser.
///
/// A configuration to correctly parse a real wiki can be created with `Configuration::new`. A configuration for testing and quick and dirty prototyping can be created with `Default::default`.
pub struct Configuration {
	character_entities: Trie<char>,
	link_trail_character_set: HashSet<char>,
	magic_words: Trie<()>,
	namespaces: Trie<Namespace>,
	protocols: Trie<()>,
	redirect_magic_words: Trie<()>,
	tag_name_map: HashMap<String, TagClass>,
}

/// List item of a definition list.
#[derive(Debug)]
pub struct DefinitionListItem<'a> {
	/// The byte position in the wiki text where the element ends.
	pub end: usize,

	/// The content of the element.
	pub nodes: Vec<Node<'a>>,

	/// The byte position in the wiki text where the element starts.
	pub start: usize,

	/// The type of list item.
	pub type_: DefinitionListItemType,
}

/// Identifier for the type of a definition list item.
#[derive(Clone, Copy, Debug, Eq, Hash, PartialEq)]
pub enum DefinitionListItemType {
	/// Parsed from the code `:`.
	Details,

	/// Parsed from the code `;`.
	Term,
}

/// List item of an ordered list or unordered list.
#[derive(Debug)]
pub struct ListItem<'a> {
	/// The byte position in the wiki text where the element ends.
	pub end: usize,

	/// The content of the element.
	pub nodes: Vec<Node<'a>>,

	/// The byte position in the wiki text where the element starts.
	pub start: usize,
}

/// Parsed node.
#[derive(Debug)]
pub enum Node<'a> {
	/// Toggle bold text. Parsed from the code `'''`.
	Bold {
		/// The byte position in the wiki text where the element ends.
		end: usize,

		/// The byte position in the wiki text where the element starts.
		start: usize,
	},

	/// Toggle bold and italic text. Parsed from the code `'''''`.
	BoldItalic {
		/// The byte position in the wiki text where the element ends.
		end: usize,

		/// The byte position in the wiki text where the element starts.
		start: usize,
	},

	/// Category. Parsed from code starting with `[[`, a category namespace and `:`.
	Category {
		/// The byte position in the wiki text where the element ends.
		end: usize,

		/// Additional information for sorting entries on the category page, if any.
		ordinal: Vec<Node<'a>>,

		/// The byte position in the wiki text where the element starts.
		start: usize,

		/// The category referred to.
		target: &'a str,
	},

	/// Character entity. Parsed from code starting with `&` and ending with `;`.
	CharacterEntity {
		/// The character represented.
		character: char,

		/// The byte position in the wiki text where the element ends.
		end: usize,

		/// The byte position in the wiki text where the element starts.
		start: usize,
	},

	/// Comment. Parsed from code starting with `<!--`.
	Comment {
		/// The byte position in the wiki text where the element ends.
		end: usize,

		/// The byte position in the wiki text where the element starts.
		start: usize,
	},

	/// Definition list. Parsed from code starting with `:` or `;`.
	DefinitionList {
		/// The byte position in the wiki text where the element ends.
		end: usize,

		/// The list items of the list.
		items: Vec<DefinitionListItem<'a>>,

		/// The byte position in the wiki text where the element starts.
		start: usize,
	},

	/// End tag. Parsed from code starting with `</` and a valid tag name.
	EndTag {
		/// The byte position in the wiki text where the element ends.
		end: usize,

		/// The tag name.
		name: Cow<'a, str>,

		/// The byte position in the wiki text where the element starts.
		start: usize,
	},

	/// External link. Parsed from code starting with `[` and a valid protocol.
	ExternalLink {
		/// The byte position in the wiki text where the element ends.
		end: usize,

		/// The content of the element.
		nodes: Vec<Node<'a>>,

		/// The byte position in the wiki text where the element starts.
		start: usize,
	},

	/// Heading. Parsed from code starting with `=` and ending with `=`.
	Heading {
		/// The byte position in the wiki text where the element ends.
		end: usize,

		/// The level of the heading from 1 to 6.
		level: u8,

		/// The content of the element.
		nodes: Vec<Node<'a>>,

		/// The byte position in the wiki text where the element starts.
		start: usize,
	},

	/// Horizontal divider. Parsed from code starting with `----`.
	HorizontalDivider {
		/// The byte position in the wiki text where the element ends.
		end: usize,

		/// The byte position in the wiki text where the element starts.
		start: usize,
	},

	/// Image. Parsed from code starting with `[[`, a file namespace and `:`.
	Image {
		/// The byte position in the wiki text where the element ends.
		end: usize,

		/// The byte position in the wiki text where the element starts.
		start: usize,

		/// The file name of the image.
		target: &'a str,

		/// Additional information for the image.
		text: Vec<Node<'a>>,
	},

	/// Toggle italic text. Parsed from the code `''`.
	Italic {
		/// The byte position in the wiki text where the element ends.
		end: usize,

		/// The byte position in the wiki text where the element starts.
		start: usize,
	},

	/// Link. Parsed from code starting with `[[` and ending with `]]`.
	Link {
		/// The byte position in the wiki text where the element ends.
		end: usize,

		/// The byte position in the wiki text where the element starts.
		start: usize,

		/// The target of the link.
		target: &'a str,

		/// The text to display for the link.
		text: Vec<Node<'a>>,
	},

	/// Magic word. Parsed from the code `__`, a valid magic word and `__`.
	MagicWord {
		/// The byte position in the wiki text where the element ends.
		end: usize,

		/// The byte position in the wiki text where the element starts.
		start: usize,
	},

	/// Ordered list. Parsed from code starting with `#`.
	OrderedList {
		/// The byte position in the wiki text where the element ends.
		end: usize,

		/// The list items of the list.
		items: Vec<ListItem<'a>>,

		/// The byte position in the wiki text where the element starts.
		start: usize,
	},

	/// Paragraph break. Parsed from an empty line between elements that can appear within a paragraph.
	ParagraphBreak {
		/// The byte position in the wiki text where the element ends.
		end: usize,

		/// The byte position in the wiki text where the element starts.
		start: usize,
	},

	/// Parameter. Parsed from code starting with `{{{` and ending with `}}}`.
	Parameter {
		/// The default value of the parameter.
		default: Option<Vec<Node<'a>>>,

		/// The byte position in the wiki text where the element ends.
		end: usize,

		/// The name of the parameter.
		name: Vec<Node<'a>>,

		/// The byte position in the wiki text where the element starts.
		start: usize,
	},

	/// Block of preformatted text. Parsed from code starting with a space at the beginning of a line.
	Preformatted {
		/// The byte position in the wiki text where the element ends.
		end: usize,

		/// The content of the element.
		nodes: Vec<Node<'a>>,

		/// The byte position in the wiki text where the element starts.
		start: usize,
	},

	/// Redirect. Parsed at the start of the wiki text from code starting with `#` followed by a redirect magic word.
	Redirect {
		/// The byte position in the wiki text where the element ends.
		end: usize,

		/// The target of the redirect.
		target: &'a str,

		/// The byte position in the wiki text where the element starts.
		start: usize,
	},

	/// Start tag. Parsed from code starting with `<` and a valid tag name.
	StartTag {
		/// The byte position in the wiki text where the element ends.
		end: usize,

		/// The tag name.
		name: Cow<'a, str>,

		/// The byte position in the wiki text where the element starts.
		start: usize,
	},

	/// Table. Parsed from code starting with `{|`.
	Table {
		/// The HTML attributes of the element.
		attributes: Vec<Node<'a>>,

		/// The captions of the table.
		captions: Vec<TableCaption<'a>>,

		/// The byte position in the wiki text where the element ends.
		end: usize,

		/// The rows of the table.
		rows: Vec<TableRow<'a>>,

		/// The byte position in the wiki text where the element starts.
		start: usize,
	},

	/// Extension tag. Parsed from code starting with `<` and the tag name of a valid extension tag.
	Tag {
		/// The byte position in the wiki text where the element ends.
		end: usize,

		/// The tag name.
		name: Cow<'a, str>,

		/// The content of the tag, between the start tag and the end tag, if any.
		nodes: Vec<Node<'a>>,

		/// The byte position in the wiki text where the element starts.
		start: usize,
	},

	/// Template. Parsed from code starting with `{{` and ending with `}}`.
	Template {
		/// The byte position in the wiki text where the element ends.
		end: usize,

		/// The name of the template.
		name: Vec<Node<'a>>,

		/// The parameters of the template.
		parameters: Vec<Parameter<'a>>,

		/// The byte position in the wiki text where the element starts.
		start: usize,
	},

	/// Plain text.
	Text {
		/// The byte position in the wiki text where the element ends.
		end: usize,

		/// The byte position in the wiki text where the element starts.
		start: usize,

		/// The text.
		value: &'a str,
	},

	/// Unordered list. Parsed from code starting with `*`.
	UnorderedList {
		/// The byte position in the wiki text where the element ends.
		end: usize,

		/// The list items of the list.
		items: Vec<ListItem<'a>>,

		/// The byte position in the wiki text where the element starts.
		start: usize,
	},
}

/// Output of parsing wiki text.
#[derive(Debug)]
pub struct Output<'a> {
	/// The top level of parsed nodes.
	pub nodes: Vec<Node<'a>>,

	/// Warnings from the parser indicating that something is not well-formed.
	pub warnings: Vec<Warning>,
}

/// Template parameter.
#[derive(Debug)]
pub struct Parameter<'a> {
	/// The byte position in the wiki text where the element ends.
	pub end: usize,

	/// The name of the parameter, if any.
	pub name: Option<Vec<Node<'a>>>,

	/// The byte position in the wiki text where the element starts.
	pub start: usize,

	/// The value of the parameter.
	pub value: Vec<Node<'a>>,
}

/// Element that has a start position and end position.
pub trait Positioned {
	/// The byte position in the wiki text where the element ends.
	fn end(&self) -> usize;

	/// The byte position in the wiki text where the element starts.
	fn start(&self) -> usize;
}

#[derive(Copy, Clone, Debug, Eq, Hash, PartialEq)]
enum TagClass {
	ExtensionTag,
	Tag,
}

/// Table caption.
#[derive(Debug)]
pub struct TableCaption<'a> {
	/// The HTML attributes of the element.
	pub attributes: Option<Vec<Node<'a>>>,

	/// The content of the element.
	pub content: Vec<Node<'a>>,

	/// The byte position in the wiki text where the element ends.
	pub end: usize,

	/// The byte position in the wiki text where the element starts.
	pub start: usize,
}

/// Table cell.
#[derive(Debug)]
pub struct TableCell<'a> {
	/// The HTML attributes of the element.
	pub attributes: Option<Vec<Node<'a>>>,

	/// The content of the element.
	pub content: Vec<Node<'a>>,

	/// The byte position in the wiki text where the element ends.
	pub end: usize,

	/// The byte position in the wiki text where the element starts.
	pub start: usize,

	/// The type of cell.
	pub type_: TableCellType,
}

/// Type of table cell.
#[derive(Copy, Clone, Debug, Eq, Hash, PartialEq)]
pub enum TableCellType {
	/// Heading cell.
	Heading,

	/// Ordinary cell.
	Ordinary,
}

/// Table row.
#[derive(Debug)]
pub struct TableRow<'a> {
	/// The HTML attributes of the element.
	pub attributes: Vec<Node<'a>>,

	/// The cells in the row.
	pub cells: Vec<TableCell<'a>>,

	/// The byte position in the wiki text where the element ends.
	pub end: usize,

	/// The byte position in the wiki text where the element starts.
	pub start: usize,
}