// Copyright 2019 Fredrik Portström <https://portstrom.com>
// This is free software distributed under the terms specified in
// the file LICENSE at the top-level directory of this distribution.

//! Parse wiki text from Mediawiki into a tree of elements.
//!
//! # Introduction
//!
//! Wiki text is a format that follows the PHP maxim “Make everything as inconsistent and confusing as possible”. There are hundreds of millions of interesting documents written in this format, distributed under free licenses on sites that use the Mediawiki software, mainly Wikipedia and Wiktionary. Being able to parse wiki text and process these documents would allow access to a significant part of the world's knowledge.
//!
//! The Mediawiki software itself transforms a wiki text document into an HTML document in an outdated format to be displayed in a browser for a human reader. It does so through a [step by step procedure](https://www.mediawiki.org/wiki/Manual:Parser.php) of string substitutions, with some of the steps depending on the result of previous steps. [The main file for this procedure](https://doc.wikimedia.org/mediawiki-core/master/php/Parser_8php_source.html) has 6200 lines of code, the [second biggest file](https://doc.wikimedia.org/mediawiki-core/master/php/Preprocessor__DOM_8php_source.html) has 2000, and then there is a [1400-line file](https://doc.wikimedia.org/mediawiki-core/master/php/ParserOptions_8php_source.html) just to take options for the parser.
//!
//! It would be more interesting to parse the wiki text document into a structure that a computer program can use to reason about the facts in the document and present them in different ways, making them available for a great variety of applications.
//!
//! Some people have tried to parse wiki text using regular expressions. This is incredibly naive and fails as soon as the wiki text is non-trivial. The capabilities of regular expressions don't come anywhere close to the weirdness required to correctly parse wiki text. One project made a brave attempt to use a parser generator to parse wiki text. Wiki text was, however, never designed for formal parsers, so even parser generators are of no help in parsing it correctly.
//!
//! Wiki text has a long history of poorly designed additions carelessly piled on top of each other. The syntax of wiki text is different in each wiki depending on its configuration. You can't even know what's a start tag until you see the corresponding end tag, and you can't know where the end tag is unless you parse the entire hierarchy of nested tags between the start tag and the end tag. In short: if you think you understand wiki text, you don't understand wiki text.
//!
//! Parse Wiki Text attempts to take all uncertainty out of parsing wiki text by converting it to another format that is easy to work with. The target format is Rust objects that can be processed ergonomically using iterators and match expressions.
//!
//! # Design goals
//!
//! ## Correctness
//!
//! Parse Wiki Text is designed to parse wiki text exactly as Mediawiki parses it. Even when there is obviously a bug in Mediawiki, Parse Wiki Text replicates that exact bug. If there is anything Parse Wiki Text doesn't parse exactly the same way as Mediawiki, please report it as an issue.
//!
//! ## Speed
//!
//! Parse Wiki Text is designed to parse a page in as little time as possible. It parses tens of thousands of pages per second on each processor core and can quickly parse an entire wiki with millions of pages. If there is anything that can be changed to make Parse Wiki Text faster, please report it as an issue.
//!
//! ## Safety
//!
//! Parse Wiki Text is designed to work with untrusted inputs. No unsafe code is used. If any input doesn't parse safely with reasonable resources, please report it as an issue.
//!
//! ## Platform support
//!
//! Parse Wiki Text is designed to run in a wide variety of environments, such as:
//!
//! - servers running machine code
//! - browsers running Web Assembly
//! - embedded in other programming languages
//!
//! Parse Wiki Text can be deployed anywhere, as it has no dependencies.
//!
//! # Caution
//!
//! Wiki text is a legacy format used by legacy software. Parse Wiki Text is intended only to recover information that has been written for wikis running legacy software, replicating the exact bugs found in that software. Please don't use wiki text as a format for new applications. Wiki text is a horrible format with an astonishing number of inconsistencies, bad design choices and bugs. For new applications, please use a format that is designed to be easy to process, such as JSON or, even better, [CBOR](http://cbor.io). See [Wikidata](https://www.wikidata.org/wiki/Wikidata:Main_Page) for an example of a wiki that uses JSON as its format and provides a rich interface for editing data instead of letting people write code. If you need to take information written in wiki text and reuse it in a new application, you can use Parse Wiki Text to convert it to an intermediate format that you can further process into a modern format.
//!
//! # Site configuration
//!
//! Wiki text has plenty of features whose parsing depends on the configuration of the wiki. This means the configuration must be known before parsing.
//!
//! - External links are parsed only when the scheme of the link's URI is in the configured list of valid protocols. When the scheme is not valid, the link is parsed as plain text.
//! - Categories and images superficially look the same as links, but are parsed differently. They can only be distinguished by knowing the namespace aliases from the configuration of the wiki.
//! - Text matching the configured set of magic words is parsed as magic words.
//! - Extension tags have the same syntax as HTML tags, but are parsed differently. The configuration determines which tag names are treated as extension tags.
//!
//! The configuration can be seen by making a request to the [site info](https://www.mediawiki.org/wiki/API:Siteinfo) resource on the wiki. The utility [Fetch site configuration](https://github.com/portstrom/fetch_site_configuration) fetches the parts of the configuration needed for parsing pages in the wiki, and outputs Rust code for instantiating a parser with that configuration. Parse Wiki Text contains a default configuration that can be used for testing.
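//!
//! For example, whether a bracketed URI is parsed as an external link depends entirely on whether its scheme is among the configured protocols. A minimal sketch using the default configuration, assuming it recognizes the `https` protocol:
//!
//! ```
//! use parse_wiki_text::{Configuration, Node};
//! let result = Configuration::default().parse("[https://example.com Example]");
//! // With `https` configured as a protocol, an `ExternalLink` node is produced;
//! // with an unrecognized scheme, the same text would parse as plain text.
//! assert!(result.nodes.iter().any(|node| match node {
//!     Node::ExternalLink { .. } => true,
//!     _ => false,
//! }));
//! ```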
//!
//! # Limitations
//!
//! Wiki text was never designed to be parsed into a structured format. It's designed to be parsed in multiple passes, where each pass depends on the output of the previous pass. Most importantly, templates are expanded in an earlier pass and formatting codes are parsed in a later pass. This means the formatting codes you see in the original text are not necessarily the same ones the parser will see after templates have been expanded. Luckily, this is as bad for human editors as it is for computers, so people tend to avoid writing templates that cause formatting codes to be parsed in a way that differs from what they would expect from reading the original wiki text before expanding templates. Parse Wiki Text assumes that templates never change the meaning of formatting codes around them.
//!
//! # Sandbox
//!
//! A sandbox ([Github](https://github.com/portstrom/parse_wiki_text_sandbox), [try online](https://portstrom.com/parse_wiki_text_sandbox/)) is available that allows interactively entering wiki text and inspecting the result of parsing it.
//!
//! # Comparison with Mediawiki Parser
//!
//! There is another crate called Mediawiki Parser ([crates.io](https://crates.io/crates/mediawiki_parser), [Github](https://github.com/vroland/mediawiki-parser)) that does basically the same thing, parsing wiki text to a tree of elements. That crate, however, doesn't take into account any of the astonishing amount of weirdness required to correctly parse wiki text. Admittedly, it only parses a subset of wiki text, with the intention of reporting errors for any text too weird to fit that subset. That is a good intention, but on examination the subset turns out to be too small to parse pages from actual wikis, and, even worse, the error reporting is an empty promise: there is no indication when a text is incorrectly parsed.
//!
//! That crate could possibly be improved to always report errors when a text isn't in the supported subset, but pages found in real wikis very often don't conform to the small subset of wiki text that can be parsed without weirdness, so it still wouldn't be useful. Improving that crate to correctly parse a large enough subset of wiki text would be as much effort as starting over from scratch, which is why Parse Wiki Text was made without taking anything from Mediawiki Parser. Parse Wiki Text aims to correctly parse all wiki text, not just a subset, and to report warnings when encountering weirdness that should be avoided.
//!
//! # Examples
//!
//! The default configuration is used for testing purposes only.
//! For parsing a real wiki you need a site-specific configuration.
//! Reuse the same configuration when parsing multiple pages for efficiency.
//!
//! ```
//! use parse_wiki_text::{Configuration, Node};
//! let wiki_text = concat!(
//!     "==Our values==\n",
//!     "*Correctness\n",
//!     "*Speed\n",
//!     "*Ergonomics"
//! );
//! let result = Configuration::default().parse(wiki_text);
//! assert!(result.warnings.is_empty());
//! # let mut found = false;
//! for node in result.nodes {
//!     if let Node::UnorderedList { items, .. } = node {
//!         println!("Our values are:");
//!         for item in items {
//!             println!("- {}", item.nodes.iter().map(|node| match node {
//!                 Node::Text { value, .. } => value,
//!                 _ => ""
//!             }).collect::<String>());
//!             # found = true;
//!         }
//!     }
//! }
//! # assert!(found);
//! ```
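//!
//! Parsing always produces an `Output`; problems in the wiki text are reported as warnings rather than errors. Since `Output` derives `Debug`, each `Warning` can at minimum be inspected with debug formatting. A minimal sketch (the input here is hypothetical and the exact warnings produced depend on the input and configuration):
//!
//! ```
//! use parse_wiki_text::Configuration;
//! let result = Configuration::default().parse("<b>some text");
//! // Print any warnings the parser emitted for this input.
//! for warning in &result.warnings {
//!     println!("{:?}", warning);
//! }
//! ```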

#![forbid(unsafe_code)]
#![warn(missing_docs)]

mod bold_italic;
mod case_folding_simple;
mod character_entity;
mod comment;
mod configuration;
mod default;
mod external_link;
mod heading;
mod html_entities;
mod line;
mod link;
mod list;
mod magic_word;
mod parse;
mod positioned;
mod redirect;
mod state;
mod table;
mod tag;
mod template;
mod trie;
mod warning;

pub use configuration::ConfigurationSource;
use configuration::Namespace;
use state::{OpenNode, OpenNodeType, State};
use std::{
    borrow::Cow,
    collections::{HashMap, HashSet},
};
use trie::Trie;
pub use warning::{Warning, WarningMessage};

/// Configuration for the parser.
///
/// A configuration to correctly parse a real wiki can be created with `Configuration::new`. A configuration for testing and quick and dirty prototyping can be created with `Default::default`.
pub struct Configuration {
    character_entities: Trie<char>,
    link_trail_character_set: HashSet<char>,
    magic_words: Trie<()>,
    namespaces: Trie<Namespace>,
    protocols: Trie<()>,
    redirect_magic_words: Trie<()>,
    tag_name_map: HashMap<String, TagClass>,
}

/// List item of a definition list.
#[derive(Debug)]
pub struct DefinitionListItem<'a> {
    /// The byte position in the wiki text where the element ends.
    pub end: usize,

    /// The content of the element.
    pub nodes: Vec<Node<'a>>,

    /// The byte position in the wiki text where the element starts.
    pub start: usize,

    /// The type of list item.
    pub type_: DefinitionListItemType,
}

/// Identifier for the type of a definition list item.
#[derive(Clone, Copy, Debug, Eq, Hash, PartialEq)]
pub enum DefinitionListItemType {
    /// Parsed from the code `:`.
    Details,

    /// Parsed from the code `;`.
    Term,
}

/// List item of an ordered list or unordered list.
#[derive(Debug)]
pub struct ListItem<'a> {
    /// The byte position in the wiki text where the element ends.
    pub end: usize,

    /// The content of the element.
    pub nodes: Vec<Node<'a>>,

    /// The byte position in the wiki text where the element starts.
    pub start: usize,
}

/// Parsed node.
#[derive(Debug)]
pub enum Node<'a> {
    /// Toggle bold text. Parsed from the code `'''`.
    Bold {
        /// The byte position in the wiki text where the element ends.
        end: usize,

        /// The byte position in the wiki text where the element starts.
        start: usize,
    },

    /// Toggle bold and italic text. Parsed from the code `'''''`.
    BoldItalic {
        /// The byte position in the wiki text where the element ends.
        end: usize,

        /// The byte position in the wiki text where the element starts.
        start: usize,
    },

    /// Category. Parsed from code starting with `[[`, a category namespace and `:`.
    Category {
        /// The byte position in the wiki text where the element ends.
        end: usize,

        /// Additional information for sorting entries on the category page, if any.
        ordinal: Vec<Node<'a>>,

        /// The byte position in the wiki text where the element starts.
        start: usize,

        /// The category referred to.
        target: &'a str,
    },

    /// Character entity. Parsed from code starting with `&` and ending with `;`.
    CharacterEntity {
        /// The character represented.
        character: char,

        /// The byte position in the wiki text where the element ends.
        end: usize,

        /// The byte position in the wiki text where the element starts.
        start: usize,
    },

    /// Comment. Parsed from code starting with `<!--`.
    Comment {
        /// The byte position in the wiki text where the element ends.
        end: usize,

        /// The byte position in the wiki text where the element starts.
        start: usize,
    },

    /// Definition list. Parsed from code starting with `:` or `;`.
    DefinitionList {
        /// The byte position in the wiki text where the element ends.
        end: usize,

        /// The list items of the list.
        items: Vec<DefinitionListItem<'a>>,

        /// The byte position in the wiki text where the element starts.
        start: usize,
    },

    /// End tag. Parsed from code starting with `</` and a valid tag name.
    EndTag {
        /// The byte position in the wiki text where the element ends.
        end: usize,

        /// The tag name.
        name: Cow<'a, str>,

        /// The byte position in the wiki text where the element starts.
        start: usize,
    },

    /// External link. Parsed from code starting with `[` and a valid protocol.
    ExternalLink {
        /// The byte position in the wiki text where the element ends.
        end: usize,

        /// The content of the element.
        nodes: Vec<Node<'a>>,

        /// The byte position in the wiki text where the element starts.
        start: usize,
    },

    /// Heading. Parsed from code starting with `=` and ending with `=`.
    Heading {
        /// The byte position in the wiki text where the element ends.
        end: usize,

        /// The level of the heading from 1 to 6.
        level: u8,

        /// The content of the element.
        nodes: Vec<Node<'a>>,

        /// The byte position in the wiki text where the element starts.
        start: usize,
    },

    /// Horizontal divider. Parsed from code starting with `----`.
    HorizontalDivider {
        /// The byte position in the wiki text where the element ends.
        end: usize,

        /// The byte position in the wiki text where the element starts.
        start: usize,
    },

    /// Image. Parsed from code starting with `[[`, a file namespace and `:`.
    Image {
        /// The byte position in the wiki text where the element ends.
        end: usize,

        /// The byte position in the wiki text where the element starts.
        start: usize,

        /// The file name of the image.
        target: &'a str,

        /// Additional information for the image.
        text: Vec<Node<'a>>,
    },

    /// Toggle italic text. Parsed from the code `''`.
    Italic {
        /// The byte position in the wiki text where the element ends.
        end: usize,

        /// The byte position in the wiki text where the element starts.
        start: usize,
    },

    /// Link. Parsed from code starting with `[[` and ending with `]]`.
    Link {
        /// The byte position in the wiki text where the element ends.
        end: usize,

        /// The byte position in the wiki text where the element starts.
        start: usize,

        /// The target of the link.
        target: &'a str,

        /// The text to display for the link.
        text: Vec<Node<'a>>,
    },

    /// Magic word. Parsed from the code `__`, a valid magic word and `__`.
    MagicWord {
        /// The byte position in the wiki text where the element ends.
        end: usize,

        /// The byte position in the wiki text where the element starts.
        start: usize,
    },

    /// Ordered list. Parsed from code starting with `#`.
    OrderedList {
        /// The byte position in the wiki text where the element ends.
        end: usize,

        /// The list items of the list.
        items: Vec<ListItem<'a>>,

        /// The byte position in the wiki text where the element starts.
        start: usize,
    },

    /// Paragraph break. Parsed from an empty line between elements that can appear within a paragraph.
    ParagraphBreak {
        /// The byte position in the wiki text where the element ends.
        end: usize,

        /// The byte position in the wiki text where the element starts.
        start: usize,
    },

    /// Parameter. Parsed from code starting with `{{{` and ending with `}}}`.
    Parameter {
        /// The default value of the parameter.
        default: Option<Vec<Node<'a>>>,

        /// The byte position in the wiki text where the element ends.
        end: usize,

        /// The name of the parameter.
        name: Vec<Node<'a>>,

        /// The byte position in the wiki text where the element starts.
        start: usize,
    },

    /// Block of preformatted text. Parsed from code starting with a space at the beginning of a line.
    Preformatted {
        /// The byte position in the wiki text where the element ends.
        end: usize,

        /// The content of the element.
        nodes: Vec<Node<'a>>,

        /// The byte position in the wiki text where the element starts.
        start: usize,
    },

    /// Redirect. Parsed at the start of the wiki text from code starting with `#` followed by a redirect magic word.
    Redirect {
        /// The byte position in the wiki text where the element ends.
        end: usize,

        /// The byte position in the wiki text where the element starts.
        start: usize,

        /// The target of the redirect.
        target: &'a str,
    },

    /// Start tag. Parsed from code starting with `<` and a valid tag name.
    StartTag {
        /// The byte position in the wiki text where the element ends.
        end: usize,

        /// The tag name.
        name: Cow<'a, str>,

        /// The byte position in the wiki text where the element starts.
        start: usize,
    },

    /// Table. Parsed from code starting with `{|`.
    Table {
        /// The HTML attributes of the element.
        attributes: Vec<Node<'a>>,

        /// The captions of the table.
        captions: Vec<TableCaption<'a>>,

        /// The byte position in the wiki text where the element ends.
        end: usize,

        /// The rows of the table.
        rows: Vec<TableRow<'a>>,

        /// The byte position in the wiki text where the element starts.
        start: usize,
    },

    /// Extension tag. Parsed from code starting with `<` and the tag name of a valid extension tag.
    Tag {
        /// The byte position in the wiki text where the element ends.
        end: usize,

        /// The tag name.
        name: Cow<'a, str>,

        /// The content of the tag, between the start tag and the end tag, if any.
        nodes: Vec<Node<'a>>,

        /// The byte position in the wiki text where the element starts.
        start: usize,
    },

    /// Template. Parsed from code starting with `{{` and ending with `}}`.
    Template {
        /// The byte position in the wiki text where the element ends.
        end: usize,

        /// The name of the template.
        name: Vec<Node<'a>>,

        /// The parameters of the template.
        parameters: Vec<Parameter<'a>>,

        /// The byte position in the wiki text where the element starts.
        start: usize,
    },

    /// Plain text.
    Text {
        /// The byte position in the wiki text where the element ends.
        end: usize,

        /// The byte position in the wiki text where the element starts.
        start: usize,

        /// The text.
        value: &'a str,
    },

    /// Unordered list. Parsed from code starting with `*`.
    UnorderedList {
        /// The byte position in the wiki text where the element ends.
        end: usize,

        /// The list items of the list.
        items: Vec<ListItem<'a>>,

        /// The byte position in the wiki text where the element starts.
        start: usize,
    },
}

/// Output of parsing wiki text.
#[derive(Debug)]
pub struct Output<'a> {
    /// The top level of parsed nodes.
    pub nodes: Vec<Node<'a>>,

    /// Warnings from the parser indicating that something is not well-formed.
    pub warnings: Vec<Warning>,
}

/// Template parameter.
#[derive(Debug)]
pub struct Parameter<'a> {
    /// The byte position in the wiki text where the element ends.
    pub end: usize,

    /// The name of the parameter, if any.
    pub name: Option<Vec<Node<'a>>>,

    /// The byte position in the wiki text where the element starts.
    pub start: usize,

    /// The value of the parameter.
    pub value: Vec<Node<'a>>,
}

/// Element that has a start position and end position.
pub trait Positioned {
    /// The byte position in the wiki text where the element ends.
    fn end(&self) -> usize;

    /// The byte position in the wiki text where the element starts.
    fn start(&self) -> usize;
}

#[derive(Copy, Clone, Debug, Eq, Hash, PartialEq)]
enum TagClass {
    ExtensionTag,
    Tag,
}

/// Table caption.
#[derive(Debug)]
pub struct TableCaption<'a> {
    /// The HTML attributes of the element.
    pub attributes: Option<Vec<Node<'a>>>,

    /// The content of the element.
    pub content: Vec<Node<'a>>,

    /// The byte position in the wiki text where the element ends.
    pub end: usize,

    /// The byte position in the wiki text where the element starts.
    pub start: usize,
}

/// Table cell.
#[derive(Debug)]
pub struct TableCell<'a> {
    /// The HTML attributes of the element.
    pub attributes: Option<Vec<Node<'a>>>,

    /// The content of the element.
    pub content: Vec<Node<'a>>,

    /// The byte position in the wiki text where the element ends.
    pub end: usize,

    /// The byte position in the wiki text where the element starts.
    pub start: usize,

    /// The type of cell.
    pub type_: TableCellType,
}

/// Type of table cell.
#[derive(Copy, Clone, Debug, Eq, Hash, PartialEq)]
pub enum TableCellType {
    /// Heading cell.
    Heading,

    /// Ordinary cell.
    Ordinary,
}

/// Table row.
#[derive(Debug)]
pub struct TableRow<'a> {
    /// The HTML attributes of the element.
    pub attributes: Vec<Node<'a>>,

    /// The cells in the row.
    pub cells: Vec<TableCell<'a>>,

    /// The byte position in the wiki text where the element ends.
    pub end: usize,

    /// The byte position in the wiki text where the element starts.
    pub start: usize,
}