parse_wiki_text_2/lib.rs
// Copyright 2019 Fredrik Portström <https://portstrom.com>
// This is free software distributed under the terms specified in
// the file LICENSE at the top-level directory of this distribution.

//! Parse wiki text from Mediawiki into a tree of elements.
//!
//! # Introduction
//!
//! Wiki text is a format that follows the PHP maxim “Make everything as inconsistent and confusing as possible”. There are hundreds of millions of interesting documents written in this format, distributed under free licenses on sites that use the Mediawiki software, mainly Wikipedia and Wiktionary. Being able to parse wiki text and process these documents would allow access to a significant part of the world's knowledge.
//!
//! The Mediawiki software itself transforms a wiki text document into an HTML document in an outdated format to be displayed in a browser for a human reader. It does so through a [step by step procedure](https://www.mediawiki.org/wiki/Manual:Parser.php) of string substitutions, with some of the steps depending on the result of previous steps. [The main file for this procedure](https://doc.wikimedia.org/mediawiki-core/master/php/Parser_8php_source.html) has 6200 lines of code and the [second biggest file](https://doc.wikimedia.org/mediawiki-core/master/php/Preprocessor__DOM_8php_source.html) has 2000, and then there is a [1400 line file](https://doc.wikimedia.org/mediawiki-core/master/php/ParserOptions_8php_source.html) just to take options for the parser.
//!
//! What would be more interesting is to parse the wiki text document into a structure that can be used by a computer program to reason about the facts in the document and present them in different ways, making them available for a great variety of applications.
//!
//! Some people have tried to parse wiki text using regular expressions. This is incredibly naive and fails as soon as the wiki text is non-trivial. The capabilities of regular expressions don't come anywhere close to the complexity of the weirdness required to correctly parse wiki text. One project made a brave attempt to use a parser generator to parse wiki text. Wiki text was however never designed for formal parsers, so even parser generators are of no help in correctly parsing wiki text.
//!
//! Wiki text has a long history of poorly designed additions carelessly piled on top of each other. The syntax of wiki text is different in each wiki depending on its configuration. You can't even know what's a start tag until you see the corresponding end tag, and you can't know where the end tag is unless you parse the entire hierarchy of nested tags between the start tag and the end tag. In short: if you think you understand wiki text, you don't understand wiki text.
//!
//! Parse Wiki Text attempts to take all uncertainty out of parsing wiki text by converting it to another format that is easy to work with. The target format is Rust objects that can be ergonomically processed using iterators and match expressions.
//!
//! # Design goals
//!
//! ## Correctness
//!
//! Parse Wiki Text is designed to parse wiki text exactly as parsed by Mediawiki. Even when there is obviously a bug in Mediawiki, Parse Wiki Text replicates that exact bug. If there is something Parse Wiki Text doesn't parse exactly the same as Mediawiki, please report it as an issue.
//!
//! ## Speed
//!
//! Parse Wiki Text is designed to parse a page in as little time as possible. It parses tens of thousands of pages per second on each processor core and can quickly parse an entire wiki with millions of pages. If there is anything that can be changed to make Parse Wiki Text faster, please report it as an issue.
//!
//! ## Safety
//!
//! Parse Wiki Text is designed to work with untrusted inputs. If any input doesn't parse safely with reasonable resources, please report it as an issue. No unsafe code is used.
//!
//! ## Platform support
//!
//! Parse Wiki Text is designed to run in a wide variety of environments, such as:
//!
//! - servers running machine code
//! - browsers running Web Assembly
//! - embedded in other programming languages
//!
//! Parse Wiki Text can be deployed anywhere with no dependencies.
//!
//! # Caution
//!
//! Wiki text is a legacy format used by legacy software. Parse Wiki Text is intended only to recover information that has been written for wikis running legacy software, replicating the exact bugs found in the legacy software. Please don't use wiki text as a format for new applications. Wiki text is a horrible format with an astonishing amount of inconsistencies, bad design choices and bugs. For new applications, please use a format that is designed to be easy to process, such as JSON or even better [CBOR](http://cbor.io). See [Wikidata](https://www.wikidata.org/wiki/Wikidata:Main_Page) for an example of a wiki that uses JSON as its format and provides a rich interface for editing data instead of letting people write code. If you need to take information written in wiki text and reuse it in a new application, you can use Parse Wiki Text to convert it to an intermediate format that you can further process into a modern format.
//!
//! # Site configuration
//!
//! Wiki text has plenty of features that are parsed in a way that depends on the configuration of the wiki. This means the configuration must be known before parsing.
//!
//! - External links are parsed only when the scheme of the URI of the link is in the configured list of valid protocols. When the scheme is not valid, the link is parsed as plain text.
//! - Categories and images superficially look the same as links, but are parsed differently. They can only be distinguished by knowing the namespace aliases from the configuration of the wiki.
//! - Text matching the configured set of magic words is parsed as magic words.
//! - Extension tags have the same syntax as HTML tags, but are parsed differently. The configuration specifies which tag names are to be treated as extension tags.
//!
//! The configuration can be seen by making a request to the [site info](https://www.mediawiki.org/wiki/API:Siteinfo) resource on the wiki. The utility [Fetch site configuration](https://github.com/portstrom/fetch_site_configuration) fetches the parts of the configuration needed for parsing pages in the wiki, and outputs Rust code for instantiating a parser with that configuration. Parse Wiki Text contains a default configuration that can be used for testing.
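//!
//! For example, whether `[` starts an external link depends entirely on the configured protocols. As a rough sketch (assuming the default test configuration accepts the `https:` scheme), this dependency can be observed like this:
//!
//! ```
//! use parse_wiki_text_2::{Configuration, Node};
//! let output = Configuration::default()
//!     .parse("[https://example.com Example]")
//!     .expect("parsing timed out");
//! // With `https` among the valid protocols, an `ExternalLink` node appears;
//! // with a configuration lacking it, the same text would parse as plain text.
//! let has_link = output
//!     .nodes
//!     .iter()
//!     .any(|node| matches!(node, Node::ExternalLink { .. }));
//! println!("parsed as external link: {has_link}");
//! ```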
//!
//! # Limitations
//!
//! Wiki text was never designed to be parsed into a structured format. It's designed to be parsed in multiple passes, where each pass depends on the output of the previous pass. Most importantly, templates are expanded in an earlier pass and formatting codes are parsed in a later pass. This means the formatting codes you see in the original text are not necessarily the same as the parser will see after templates have been expanded. Luckily this is as bad for human editors as it is for computers, so people tend to avoid writing templates that cause formatting codes to be parsed in a way that differs from what they would expect from reading the original wiki text before expanding templates. Parse Wiki Text assumes that templates never change the meaning of formatting codes around them.
//!
//! # Sandbox
//!
//! A sandbox ([Github](https://github.com/portstrom/parse_wiki_text_sandbox), [try online](https://portstrom.com/parse_wiki_text_sandbox/)) is available that allows interactively entering wiki text and inspecting the result of parsing it.
//!
//! # Comparison with Mediawiki Parser
//!
//! There is another crate called Mediawiki Parser ([crates.io](https://crates.io/crates/mediawiki_parser), [Github](https://github.com/vroland/mediawiki-parser)) that does basically the same thing, parsing wiki text to a tree of elements. That crate, however, doesn't account for any of the astonishing amount of weirdness required to correctly parse wiki text. It admittedly parses only a subset of wiki text, with the intention of reporting errors for any text too weird to fit that subset. That is a good intention, but on examination the subset quickly turns out to be too small to parse pages from actual wikis, and even worse, the error reporting is an empty promise: there is no indication when a text is incorrectly parsed.
//!
//! That crate could possibly be improved to always report errors when a text isn't in the supported subset, but pages found in real wikis very often don't conform to the small subset of wiki text that can be parsed without weirdness, so it still wouldn't be useful. Improving that crate to correctly parse a large enough subset of wiki text would take as much effort as starting over from scratch, which is why Parse Wiki Text was written without taking anything from Mediawiki Parser. Parse Wiki Text aims to correctly parse all wiki text, not just a subset, and to report warnings when encountering weirdness that should be avoided.
//!
//! # Examples
//!
//! The default configuration is used for testing purposes only.
//! For parsing a real wiki you need a site-specific configuration.
//! Reuse the same configuration when parsing multiple pages for efficiency.
//!
//! ```
//! use parse_wiki_text_2::{Configuration, Node};
//! let wiki_text = "\
//! ==Our values==\n\
//! *Correctness\n\
//! *Speed\n\
//! *Ergonomics\
//! ";
//! let result = Configuration::default().parse(wiki_text).expect("parsing timed out");
//! assert!(result.warnings.is_empty());
//! # let mut found = false;
//! for node in result.nodes {
//!     if let Node::UnorderedList { items, .. } = node {
//!         println!("Our values are:");
//!         for item in items {
//!             let text = item.nodes.iter().map(|node| match node {
//!                 Node::Text { value, .. } => value,
//!                 _ => ""
//!             }).collect::<String>();
//!             println!("- {text}");
//!             # found = true;
//!         }
//!     }
//! }
//! # assert!(found);
//! ```

#![forbid(unsafe_code)]
#![warn(missing_docs)]

mod bold_italic;
mod case_folding_simple;
mod character_entity;
mod comment;
mod configuration;
mod default;
mod external_link;
mod heading;
mod html_entities;
mod line;
mod link;
mod list;
mod magic_word;
mod parse;
mod positioned;
mod redirect;
mod state;
mod table;
mod tag;
mod template;
mod trie;
mod warning;

pub use configuration::ConfigurationSource;
use configuration::Namespace;
pub use parse::ParseError;
use state::{OpenNode, OpenNodeType, State};
use std::{
    borrow::Cow,
    collections::{HashMap, HashSet},
};
use trie::Trie;
pub use warning::{Warning, WarningMessage};

/// Configuration for the parser.
///
/// A configuration to correctly parse a real wiki can be created with `Configuration::new`. A configuration for testing and quick and dirty prototyping can be created with `Default::default`.
pub struct Configuration {
    character_entities: Trie<char>,
    link_trail_character_set: HashSet<char>,
    magic_words: Trie<()>,
    namespaces: Trie<Namespace>,
    protocols: Trie<()>,
    redirect_magic_words: Trie<()>,
    tag_name_map: HashMap<String, TagClass>,
}

/// List item of a definition list.
#[derive(Debug)]
pub struct DefinitionListItem<'a> {
    /// The byte position in the wiki text where the element ends.
    pub end: usize,

    /// The content of the element.
    pub nodes: Vec<Node<'a>>,

    /// The byte position in the wiki text where the element starts.
    pub start: usize,

    /// The type of list item.
    pub type_: DefinitionListItemType,
}

/// Identifier for the type of a definition list item.
#[derive(Clone, Copy, Debug, Eq, Hash, PartialEq)]
pub enum DefinitionListItemType {
    /// Parsed from the code `:`.
    Details,

    /// Parsed from the code `;`.
    Term,
}

/// List item of an ordered list or unordered list.
#[derive(Debug)]
pub struct ListItem<'a> {
    /// The byte position in the wiki text where the element ends.
    pub end: usize,

    /// The content of the element.
    pub nodes: Vec<Node<'a>>,

    /// The byte position in the wiki text where the element starts.
    pub start: usize,
}

/// Parsed node.
#[derive(Debug)]
pub enum Node<'a> {
    /// Toggle bold text. Parsed from the code `'''`.
    Bold {
        /// The byte position in the wiki text where the element ends.
        end: usize,

        /// The byte position in the wiki text where the element starts.
        start: usize,
    },

    /// Toggle bold and italic text. Parsed from the code `'''''`.
    BoldItalic {
        /// The byte position in the wiki text where the element ends.
        end: usize,

        /// The byte position in the wiki text where the element starts.
        start: usize,
    },

    /// Category. Parsed from code starting with `[[`, a category namespace and `:`.
    Category {
        /// The byte position in the wiki text where the element ends.
        end: usize,

        /// Additional information for sorting entries on the category page, if any.
        ordinal: Vec<Node<'a>>,

        /// The byte position in the wiki text where the element starts.
        start: usize,

        /// The category referred to.
        target: &'a str,
    },

    /// Character entity. Parsed from code starting with `&` and ending with `;`.
    CharacterEntity {
        /// The character represented.
        character: char,

        /// The byte position in the wiki text where the element ends.
        end: usize,

        /// The byte position in the wiki text where the element starts.
        start: usize,
    },

    /// Comment. Parsed from code starting with `<!--`.
    Comment {
        /// The byte position in the wiki text where the element ends.
        end: usize,

        /// The byte position in the wiki text where the element starts.
        start: usize,
    },

    /// Definition list. Parsed from code starting with `:` or `;`.
    DefinitionList {
        /// The byte position in the wiki text where the element ends.
        end: usize,

        /// The list items of the list.
        items: Vec<DefinitionListItem<'a>>,

        /// The byte position in the wiki text where the element starts.
        start: usize,
    },

    /// End tag. Parsed from code starting with `</` and a valid tag name.
    EndTag {
        /// The byte position in the wiki text where the element ends.
        end: usize,

        /// The tag name.
        name: Cow<'a, str>,

        /// The byte position in the wiki text where the element starts.
        start: usize,
    },

    /// External link. Parsed from code starting with `[` and a valid protocol.
    ExternalLink {
        /// The byte position in the wiki text where the element ends.
        end: usize,

        /// The content of the element.
        nodes: Vec<Node<'a>>,

        /// The byte position in the wiki text where the element starts.
        start: usize,
    },

    /// Heading. Parsed from code starting with `=` and ending with `=`.
    Heading {
        /// The byte position in the wiki text where the element ends.
        end: usize,

        /// The level of the heading from 1 to 6.
        level: u8,

        /// The content of the element.
        nodes: Vec<Node<'a>>,

        /// The byte position in the wiki text where the element starts.
        start: usize,
    },

    /// Horizontal divider. Parsed from code starting with `----`.
    HorizontalDivider {
        /// The byte position in the wiki text where the element ends.
        end: usize,

        /// The byte position in the wiki text where the element starts.
        start: usize,
    },

    /// Image. Parsed from code starting with `[[`, a file namespace and `:`.
    Image {
        /// The byte position in the wiki text where the element ends.
        end: usize,

        /// The byte position in the wiki text where the element starts.
        start: usize,

        /// The file name of the image.
        target: &'a str,

        /// Additional information for the image.
        text: Vec<Node<'a>>,
    },

    /// Toggle italic text. Parsed from the code `''`.
    Italic {
        /// The byte position in the wiki text where the element ends.
        end: usize,

        /// The byte position in the wiki text where the element starts.
        start: usize,
    },

    /// Link. Parsed from code starting with `[[` and ending with `]]`.
    Link {
        /// The byte position in the wiki text where the element ends.
        end: usize,

        /// The byte position in the wiki text where the element starts.
        start: usize,

        /// The target of the link.
        target: &'a str,

        /// The text to display for the link.
        text: Vec<Node<'a>>,
    },

    /// Magic word. Parsed from the code `__`, a valid magic word and `__`.
    MagicWord {
        /// The byte position in the wiki text where the element ends.
        end: usize,

        /// The byte position in the wiki text where the element starts.
        start: usize,
    },

    /// Ordered list. Parsed from code starting with `#`.
    OrderedList {
        /// The byte position in the wiki text where the element ends.
        end: usize,

        /// The list items of the list.
        items: Vec<ListItem<'a>>,

        /// The byte position in the wiki text where the element starts.
        start: usize,
    },

    /// Paragraph break. Parsed from an empty line between elements that can appear within a paragraph.
    ParagraphBreak {
        /// The byte position in the wiki text where the element ends.
        end: usize,

        /// The byte position in the wiki text where the element starts.
        start: usize,
    },

    /// Parameter. Parsed from code starting with `{{{` and ending with `}}}`.
    Parameter {
        /// The default value of the parameter.
        default: Option<Vec<Node<'a>>>,

        /// The byte position in the wiki text where the element ends.
        end: usize,

        /// The name of the parameter.
        name: Vec<Node<'a>>,

        /// The byte position in the wiki text where the element starts.
        start: usize,
    },

    /// Block of preformatted text. Parsed from code starting with a space at the beginning of a line.
    Preformatted {
        /// The byte position in the wiki text where the element ends.
        end: usize,

        /// The content of the element.
        nodes: Vec<Node<'a>>,

        /// The byte position in the wiki text where the element starts.
        start: usize,
    },

    /// Redirect. Parsed at the start of the wiki text from code starting with `#` followed by a redirect magic word.
    Redirect {
        /// The byte position in the wiki text where the element ends.
        end: usize,

        /// The byte position in the wiki text where the element starts.
        start: usize,

        /// The target of the redirect.
        target: &'a str,
    },

    /// Start tag. Parsed from code starting with `<` and a valid tag name.
    StartTag {
        /// The byte position in the wiki text where the element ends.
        end: usize,

        /// The tag name.
        name: Cow<'a, str>,

        /// The byte position in the wiki text where the element starts.
        start: usize,
    },

    /// Table. Parsed from code starting with `{|`.
    Table {
        /// The HTML attributes of the element.
        attributes: Vec<Node<'a>>,

        /// The captions of the table.
        captions: Vec<TableCaption<'a>>,

        /// The byte position in the wiki text where the element ends.
        end: usize,

        /// The rows of the table.
        rows: Vec<TableRow<'a>>,

        /// The byte position in the wiki text where the element starts.
        start: usize,
    },

    /// Extension tag. Parsed from code starting with `<` and the tag name of a valid extension tag.
    Tag {
        /// The byte position in the wiki text where the element ends.
        end: usize,

        /// The tag name.
        name: Cow<'a, str>,

        /// The content of the tag, between the start tag and the end tag, if any.
        nodes: Vec<Node<'a>>,

        /// The byte position in the wiki text where the element starts.
        start: usize,
    },

    /// Template. Parsed from code starting with `{{` and ending with `}}`.
    Template {
        /// The byte position in the wiki text where the element ends.
        end: usize,

        /// The name of the template.
        name: Vec<Node<'a>>,

        /// The parameters of the template.
        parameters: Vec<Parameter<'a>>,

        /// The byte position in the wiki text where the element starts.
        start: usize,
    },

    /// Plain text.
    Text {
        /// The byte position in the wiki text where the element ends.
        end: usize,

        /// The byte position in the wiki text where the element starts.
        start: usize,

        /// The text.
        value: &'a str,
    },

    /// Unordered list. Parsed from code starting with `*`.
    UnorderedList {
        /// The byte position in the wiki text where the element ends.
        end: usize,

        /// The list items of the list.
        items: Vec<ListItem<'a>>,

        /// The byte position in the wiki text where the element starts.
        start: usize,
    },
}

/// Output of parsing wiki text.
#[derive(Debug)]
pub struct Output<'a> {
    /// The top level of parsed nodes.
    pub nodes: Vec<Node<'a>>,

    /// Warnings from the parser indicating that something is not well-formed.
    pub warnings: Vec<Warning>,
}

/// Template parameter.
#[derive(Debug)]
pub struct Parameter<'a> {
    /// The byte position in the wiki text where the element ends.
    pub end: usize,

    /// The name of the parameter, if any.
    pub name: Option<Vec<Node<'a>>>,

    /// The byte position in the wiki text where the element starts.
    pub start: usize,

    /// The value of the parameter.
    pub value: Vec<Node<'a>>,
}

/// Element that has a start position and end position.
pub trait Positioned {
    /// The byte position in the wiki text where the element ends.
    fn end(&self) -> usize;

    /// The byte position in the wiki text where the element starts.
    fn start(&self) -> usize;
}

#[derive(Copy, Clone, Debug, Eq, Hash, PartialEq)]
enum TagClass {
    ExtensionTag,
    Tag,
}

/// Table caption.
#[derive(Debug)]
pub struct TableCaption<'a> {
    /// The HTML attributes of the element.
    pub attributes: Option<Vec<Node<'a>>>,

    /// The content of the element.
    pub content: Vec<Node<'a>>,

    /// The byte position in the wiki text where the element ends.
    pub end: usize,

    /// The byte position in the wiki text where the element starts.
    pub start: usize,
}

/// Table cell.
#[derive(Debug)]
pub struct TableCell<'a> {
    /// The HTML attributes of the element.
    pub attributes: Option<Vec<Node<'a>>>,

    /// The content of the element.
    pub content: Vec<Node<'a>>,

    /// The byte position in the wiki text where the element ends.
    pub end: usize,

    /// The byte position in the wiki text where the element starts.
    pub start: usize,

    /// The type of cell.
    pub type_: TableCellType,
}

/// Type of table cell.
#[derive(Copy, Clone, Debug, Eq, Hash, PartialEq)]
pub enum TableCellType {
    /// Heading cell.
    Heading,

    /// Ordinary cell.
    Ordinary,
}

/// Table row.
#[derive(Debug)]
pub struct TableRow<'a> {
    /// The HTML attributes of the element.
    pub attributes: Vec<Node<'a>>,

    /// The cells in the row.
    pub cells: Vec<TableCell<'a>>,

    /// The byte position in the wiki text where the element ends.
    pub end: usize,

    /// The byte position in the wiki text where the element starts.
    pub start: usize,
}