// pest. The Elegant Parser
// Copyright (c) 2018 Dragoș Tiselice
//
// Licensed under the Apache License, Version 2.0
// <LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0> or the MIT
// license <LICENSE-MIT or http://opensource.org/licenses/MIT>, at your
// option. All files in the project carrying such notice may not be copied,
// modified, or distributed except according to those terms.
#![no_std]
#![doc(
    html_logo_url = "https://raw.githubusercontent.com/pest-parser/pest/master/pest-logo.svg",
    html_favicon_url = "https://raw.githubusercontent.com/pest-parser/pest/master/pest-logo.svg"
)]
#![warn(missing_docs, rust_2018_idioms, unused_qualifications)]
#![allow(clippy::doc_overindented_list_items)]
//! # pest. The Elegant Parser
//!
//! pest is a general purpose parser written in Rust with a focus on accessibility, correctness,
//! and performance. It uses parsing expression grammars (or [PEG]) as input, which are similar in
//! spirit to regular expressions, but which offer the enhanced expressivity needed to parse
//! complex languages.
//!
//! [PEG]: https://en.wikipedia.org/wiki/Parsing_expression_grammar
//!
//! ## Getting started
//!
//! The recommended way to start parsing with pest is to read the official [book].
//!
//! Other helpful resources:
//!
//! * API reference on [docs.rs]
//! * play with grammars and share them on our [fiddle]
//! * find answers to common questions, or ask new ones, on [GitHub Discussions]
//! * leave feedback, ask questions, or greet us on [Gitter] or [Discord]
//!
//! [book]: https://pest.rs/book
//! [docs.rs]: https://docs.rs/pest
//! [fiddle]: https://pest.rs/#editor
//! [Gitter]: https://gitter.im/pest-parser/pest
//! [Discord]: https://discord.gg/XEGACtWpT2
//! [GitHub Discussions]: https://github.com/pest-parser/pest/discussions
//!
//! ## Usage
//!
//! The core of pest is the trait [`Parser`], which provides an interface to the parsing
//! functionality.
//!
//! The accompanying crate `pest_derive` can automatically generate a [`Parser`] from a PEG
//! grammar. Using `pest_derive` is highly encouraged, but it is also possible to implement
//! [`Parser`] manually if required.
//!
//! ## `.pest` files
//!
//! Grammar definitions reside in custom `.pest` files located in the crate's `src` directory.
//! Parsers are automatically generated from these files using `#[derive(Parser)]` and a special
//! `#[grammar = "..."]` attribute on a dummy struct.
//!
//! ```ignore
//! #[derive(Parser)]
//! #[grammar = "path/to/my_grammar.pest"] // relative to src
//! struct MyParser;
//! ```
//!
//! The syntax of `.pest` files is documented in the [`pest_derive` crate].
//!
//! ## Inline grammars
//!
//! Grammars can also be inlined by using the `#[grammar_inline = "..."]` attribute.
//!
//! [`Parser`]: trait.Parser.html
//! [`pest_derive` crate]: https://docs.rs/pest_derive/
//!
//! ## Grammar
//!
//! A grammar is a series of rules separated by whitespace, possibly containing comments.
//!
//! ### Comments
//!
//! Comments start with `//` and end at the end of the line.
//!
//! ```text
//! // a comment
//! ```
//!
//! ### Rules
//!
//! Rules have the following form:
//!
//! ```ignore
//! name = optional_modifier { expression }
//! ```
//!
//! The name of the rule is formed from alphanumeric characters or `_`, with the condition that
//! the first character is not a digit. The name is used to create token pairs: when the rule
//! starts being parsed, the starting token of the pair is produced, and the ending token is
//! produced when the rule finishes parsing.
//!
//! The token pair notation `a(b(), c())` denotes the tokens: start `a`, start `b`, end
//! `b`, start `c`, end `c`, end `a`.
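//!
//! As a small illustration (the rule names `a`, `b`, and `c` are placeholders chosen to mirror
//! the notation above), the following grammar nests two rules inside a third:
//!
//! ```ignore
//! b = { "b" }
//! c = { "c" }
//! a = { b ~ c }
//! ```
//!
//! Parsing `"bc"` with rule `a` produces the token pairs `a(b(), c())`.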
//!
//! #### Modifiers
//!
//! Modifiers are optional and can be one of `_`, `@`, `$`, or `!`. These modifiers change the
//! behavior of the rules.
//!
//! 1. Silent (`_`)
//!
//!     Silent rules do not create token pairs during parsing, nor are they reported in errors.
//!
//!     ```ignore
//!     a = _{ "a" }
//!     b = { a ~ "b" }
//!     ```
//!
//!     Parsing `"ab"` produces the token pair `b()`.
//!
//! 2. Atomic (`@`)
//!
//!     Atomic rules do not accept whitespace or comments within their expressions and have a
//!     cascading effect on any rule they call. That is, rules that are not atomic but are called
//!     by atomic rules behave atomically.
//!
//!     Any rules called by atomic rules do not generate token pairs.
//!
//!     ```ignore
//!     a = { "a" }
//!     b = @{ a ~ "b" }
//!
//!     WHITESPACE = _{ " " }
//!     ```
//!
//!     Parsing `"ab"` produces the token pair `b()`, while `"a b"` produces an error.
//!
//! 3. Compound-atomic (`$`)
//!
//!     Compound-atomic rules are identical to atomic rules with the exception that rules called
//!     by them are not forbidden from generating token pairs.
//!
//!     ```ignore
//!     a = { "a" }
//!     b = ${ a ~ "b" }
//!
//!     WHITESPACE = _{ " " }
//!     ```
//!
//!     Parsing `"ab"` produces the token pairs `b(a())`, while `"a b"` produces an error.
//!
//! 4. Non-atomic (`!`)
//!
//!     Non-atomic rules are identical to normal rules with the exception that they stop the
//!     cascading effect of atomic and compound-atomic rules.
//!
//!     ```ignore
//!     a = { "a" }
//!     b = !{ a ~ "b" }
//!     c = @{ b }
//!
//!     WHITESPACE = _{ " " }
//!     ```
//!
//!     Parsing either `"ab"` or `"a b"` produces the token pairs `c(a())`.
//!
//! #### Expressions
//!
//! Expressions can be either terminals or non-terminals.
//!
//! 1. Terminals
//!
//!     | Terminal   | Usage                                                          |
//!     |------------|----------------------------------------------------------------|
//!     | `"a"`      | matches the exact string `"a"`                                 |
//!     | `^"a"`     | matches the exact string `"a"` case insensitively (ASCII only) |
//!     | `'a'..'z'` | matches one character between `'a'` and `'z'`                  |
//!     | `a`        | matches rule `a`                                               |
//!
//!     Strings and characters follow
//!     [Rust's escape mechanisms](https://doc.rust-lang.org/reference/tokens.html#byte-escapes), while
//!     identifiers can contain alphanumeric characters and underscores (`_`), as long as they do not
//!     start with a digit.
//!
//! 2. Non-terminals
//!
//!     | Non-terminal          | Usage                                                      |
//!     |-----------------------|------------------------------------------------------------|
//!     | `(e)`                 | matches `e`                                                |
//!     | `e1 ~ e2`             | matches the sequence `e1` `e2`                             |
//!     | <code>e1 \| e2</code> | matches either `e1` or `e2`                                |
//!     | `e*`                  | matches `e` zero or more times                             |
//!     | `e+`                  | matches `e` one or more times                              |
//!     | `e{n}`                | matches `e` exactly `n` times                              |
//!     | `e{, n}`              | matches `e` at most `n` times                              |
//!     | `e{n,}`               | matches `e` at least `n` times                             |
//!     | `e{m, n}`             | matches `e` between `m` and `n` times inclusively          |
//!     | `e?`                  | optionally matches `e`                                     |
//!     | `&e`                  | matches `e` without making progress                        |
//!     | `!e`                  | matches if `e` doesn't match without making progress       |
//!     | `PUSH(e)`             | matches `e` and pushes its captured string down the stack  |
//!
//!     where `e`, `e1`, and `e2` are expressions.
//!
//! Matching is greedy, without backtracking. Note the difference in behavior for
//! these two rules in matching identifiers that don't end in an underscore:
//!
//! ```ignore
//! // input: ab_bb_b
//!
//! identifier = @{ "a" ~ ("b"|"_")* ~ "b" }
//! // matches:      a     b_bb_b       nothing -> error!
//!
//! identifier = @{ "a" ~ ("_"* ~ "b")* }
//! // matches:      a     b, _bb, _b   in three repetitions
//! ```
//!
//! Expressions can modify the stack only if they match the input. For example,
//! if `e1` in the compound expression `e1 | e2` does not match the input, then
//! it does not modify the stack, so `e2` sees the stack in the same state as
//! `e1` did. Repetitions and optionals (`e*`, `e+`, `e{, n}`, `e{n,}`,
//! `e{m,n}`, `e?`) can modify the stack each time `e` matches. The `!e` and `&e`
//! expressions are a special case; they never modify the stack.
//!
//! Many languages have "keyword" tokens (e.g. `if`, `for`, `while`) as well as general
//! tokens (e.g. identifiers) that match any word. To match a keyword, you generally
//! need to require that it is not immediately followed by another letter or digit
//! (otherwise it would be matched as an identifier).
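//!
//! A minimal sketch of this restriction, using a negative lookahead (`!e`). The rule names
//! `keyword` and `identifier` and the chosen character classes are illustrative assumptions,
//! not something pest defines:
//!
//! ```ignore
//! keyword    = @{ ("if" | "for" | "while") ~ !(ASCII_ALPHANUMERIC | "_") }
//! identifier = @{ !keyword ~ (ASCII_ALPHA | "_") ~ (ASCII_ALPHANUMERIC | "_")* }
//! ```
//!
//! With these rules, `"if"` should match `keyword`, while `"iffy"` should fail `keyword`
//! (the lookahead sees another letter) and match `identifier` instead.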
//!
//! ## Special rules
//!
//! Special rules can be called within the grammar. They are:
//!
//! * `WHITESPACE` - runs between rules and sub-rules
//! * `COMMENT` - runs between rules and sub-rules
//! * `ANY` - matches exactly one `char`
//! * `SOI` - (start-of-input) matches only when a `Parser` is still at the starting position
//! * `EOI` - (end-of-input) matches only when a `Parser` has reached its end
//! * `PUSH` - matches a string and pushes it to the stack
//! * `PUSH_LITERAL` - pushes a literal string to the stack
//! * `POP` - pops a string from the stack and matches it
//! * `POP_ALL` - pops the entire state of the stack and matches it
//! * `PEEK` - peeks a string from the stack and matches it
//! * `PEEK[a..b]` - peeks part of the stack and matches it
//! * `PEEK_ALL` - peeks the entire state of the stack and matches it
//! * `DROP` - drops the top of the stack (fails to match if the stack is empty)
//!
//! `WHITESPACE` and `COMMENT` should be defined manually if needed. All other rules cannot be
//! overridden.
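//!
//! For instance, `SOI` and `EOI` can be used to require that a rule consumes the whole input.
//! A minimal sketch (the rule names `number` and `file` are hypothetical, and no `WHITESPACE`
//! rule is assumed to be defined):
//!
//! ```ignore
//! number = { ASCII_DIGIT+ }
//! file   = { SOI ~ number ~ EOI }
//! ```
//!
//! Here `file` should accept `"42"` but reject `"42abc"`, whereas `number` on its own would
//! match the leading `42` and stop.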
//!
//! ## `WHITESPACE` and `COMMENT`
//!
//! When defined, these rules get matched automatically in sequences (`~`) and repetitions
//! (`*`, `+`) between expressions. Atomic rules and those rules called by atomic rules are exempt
//! from this behavior.
//!
//! Each of these rules should be defined to match a single whitespace character or a single
//! comment only, since they are run in repetitions.
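//!
//! As an illustration, a grammar that skips spaces, tabs, newlines, and C-style block comments
//! might define these rules as follows (a sketch; the comment syntax is an assumption about the
//! language being parsed):
//!
//! ```ignore
//! WHITESPACE = _{ " " | "\t" | NEWLINE }
//! COMMENT    = _{ "/*" ~ (!"*/" ~ ANY)* ~ "*/" }
//! ```
//!
//! Each rule matches one whitespace character or one comment, and the repetition is applied
//! automatically. Both are marked silent here so that they do not produce token pairs.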
//!
//! If both `WHITESPACE` and `COMMENT` are defined, this grammar:
//!
//! ```ignore
//! a = { b ~ c }
//! ```
//!
//! is effectively transformed into this one behind the scenes:
//!
//! ```ignore
//! a = { b ~ WHITESPACE* ~ (COMMENT ~ WHITESPACE*)* ~ c }
//! ```
//!
//! ## `PUSH`, `PUSH_LITERAL`, `POP`, `DROP`, and `PEEK`
//!
//! `PUSH(e)` simply pushes the captured string of the expression `e` onto a stack. This stack can
//! then later be used to match input based on its content with `POP` and `PEEK`.
//!
//! `PUSH_LITERAL("a")` pushes the argument to the stack without considering the input. The
//! argument must be a literal string. This is often useful in conjunction with another rule before
//! it. For example, `"[" ~ PUSH_LITERAL("]")` will look for an opening bracket `[` and, if it
//! finds one, will push a closing bracket `]` to the stack. **Note**: `PUSH_LITERAL` requires the
//! `grammar-extras` feature to be enabled.
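//!
//! Building on that example, a sketch of matching paired delimiters (the rule name `group` is
//! hypothetical, and the `grammar-extras` feature is assumed to be enabled):
//!
//! ```ignore
//! group = { ("(" ~ PUSH_LITERAL(")") | "[" ~ PUSH_LITERAL("]")) ~ ASCII_DIGIT+ ~ POP }
//! ```
//!
//! The intent is that `"(12)"` and `"[12]"` match, while `"(12]"` does not, because `POP`
//! expects the closing delimiter that was pushed for the opening one.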
//!
//! `PEEK` always matches the string at the top of the stack. So, if the stack contains
//! `["b", "a"]` (`"a"` being on top), this grammar:
//!
//! ```ignore
//! a = { PEEK }
//! ```
//!
//! is effectively transformed at parse time into:
//!
//! ```ignore
//! a = { "a" }
//! ```
//!
//! `POP` works the same way with the exception that it pops the string off of the stack if the
//! match worked. With the stack from above, if `POP` matches `"a"`, the stack will be mutated
//! to `["b"]`.
//!
//! `DROP` makes it possible to remove the string at the top of the stack
//! without matching it. If the stack is nonempty, `DROP` drops the top of the
//! stack. If the stack is empty, then `DROP` fails to match.
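//!
//! Taken together, these rules allow a later part of the input to depend on an earlier one.
//! A minimal sketch (the rule name `quoted` is hypothetical) that accepts a string delimited
//! by either kind of quote, as long as the closing quote is the same as the opening one:
//!
//! ```ignore
//! quoted = @{ PUSH("'" | "\"") ~ (!PEEK ~ ANY)* ~ POP }
//! ```
//!
//! `PUSH` records the opening quote, `!PEEK ~ ANY` consumes any character that is not that
//! quote, and `POP` matches the closing quote and removes it from the stack.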
//!
//! ### Advanced peeking
//!
//! `PEEK[start..end]` and `PEEK_ALL` allow peeking deeper into the stack. The syntax works exactly
//! like Rust’s exclusive slice syntax. Additionally, negative indices can be used to indicate an
//! offset from the top. If the end lies before or at the start, the expression matches (as does
//! a `PEEK_ALL` on an empty stack). With the stack `["c", "b", "a"]` (`"a"` on top):
//!
//! ```ignore
//! fill = PUSH("c") ~ PUSH("b") ~ PUSH("a")
//! v = { PEEK_ALL } = { "a" ~ "b" ~ "c" }  // top to bottom
//! w = { PEEK[..] } = { "c" ~ "b" ~ "a" }  // bottom to top
//! x = { PEEK[1..2] } = { PEEK[1..-1] } = { "b" }
//! y = { PEEK[..-2] } = { PEEK[0..1] } = { "c" }
//! z = { PEEK[1..] } = { PEEK[-2..3] } = { "b" ~ "a" }
//! n = { PEEK[2..-2] } = { PEEK[2..1] } = { "" }
//! ```
//!
//! For historical reasons, `PEEK_ALL` matches from top to bottom, while `PEEK[start..end]` matches
//! from bottom to top. There is currently no syntax to match a slice of the stack top to bottom.
//!
//! ## `Rule`
//!
//! All rules defined or used in the grammar populate a generated `enum` called `Rule`. This
//! implements `pest`'s `RuleType` and can be used throughout the API.
//!
//! ## Built-in rules
//!
//! Pest also comes with a number of built-in rules for convenience. They are:
//!
//! * `ASCII_DIGIT` - matches a numeric character from 0..9
//! * `ASCII_NONZERO_DIGIT` - matches a numeric character from 1..9
//! * `ASCII_BIN_DIGIT` - matches a numeric character from 0..1
//! * `ASCII_OCT_DIGIT` - matches a numeric character from 0..7
//! * `ASCII_HEX_DIGIT` - matches a character from 0..9 or a..f or A..F
//! * `ASCII_ALPHA_LOWER` - matches a character from a..z
//! * `ASCII_ALPHA_UPPER` - matches a character from A..Z
//! * `ASCII_ALPHA` - matches a character from a..z or A..Z
//! * `ASCII_ALPHANUMERIC` - matches a character from a..z or A..Z or 0..9
//! * `ASCII` - matches a character from \x00..\x7f
//! * `NEWLINE` - matches either "\n" or "\r\n" or "\r"
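//!
//! As a brief sketch of how these can be combined (the rule names below are hypothetical):
//!
//! ```ignore
//! hex_color  = @{ "#" ~ ASCII_HEX_DIGIT{6} }
//! identifier = @{ ASCII_ALPHA ~ (ASCII_ALPHANUMERIC | "_")* }
//! lines      = { (identifier ~ NEWLINE)* }
//! ```
//!
//! Here `hex_color` is intended to match strings such as `#1a2B3c`.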

#![doc(html_root_url = "https://docs.rs/pest")]

extern crate alloc;
#[cfg(feature = "std")]
extern crate std;

pub use crate::parser::Parser;
pub use crate::parser_state::{
    set_call_limit, set_error_detail, state, Atomicity, Lookahead, MatchDir, ParseResult,
    ParserState,
};
pub use crate::position::Position;
pub use crate::span::{merge_spans, Lines, LinesSpan, Span};
pub use crate::stack::Stack;
pub use crate::token::Token;
use core::fmt::Debug;
use core::hash::Hash;

pub mod error;
pub mod iterators;
mod macros;
mod parser;
mod parser_state;
mod position;
pub mod pratt_parser;
#[deprecated(
    since = "2.4.0",
    note = "Use `pest::pratt_parser` instead (it is an equivalent which also supports unary prefix/suffix operators).
While prec_climber is going to be kept in 2.x minor and patch releases, it may be removed in a future major release."
)]
pub mod prec_climber;
mod span;
mod stack;
mod token;

#[doc(hidden)]
pub mod unicode;

/// A trait which parser rules must implement.
///
/// This trait is set up so that any type that implements all of its required traits will
/// automatically implement this trait as well.
///
/// This is essentially a [trait alias](https://github.com/rust-lang/rfcs/pull/1733). When trait
/// aliases are implemented, this may be replaced by one.
pub trait RuleType: Copy + Debug + Eq + Hash + Ord {}

impl<T: Copy + Debug + Eq + Hash + Ord> RuleType for T {}