rustlr 0.2.6

LR(1)/LALR(1) parser generator for rust
Documentation
<h2 id="chapter-1-unambiguous-lr-grammar-for-simple-calculator">Chapter 1: Unambiguous LR Grammar for Simple Calculator.</h2>
<p>Please note that this tutorial has been rewritten for <strong><a href="https://docs.rs/rustlr/latest/rustlr/index.html">Rustlr version 0.2.x</a></strong>, which contains significant changes over the 0.1.x versions, although it remains compatible with parsers already created. The original version of this chapter is available <a href="https://cs.hofstra.edu/~cscccl/rustlr_project/test1grammar0.html">here</a>.</p>
<p>This tutorial is written for those with sufficient background in computer science and in Rust programming, with some knowledge of context free grammars and basic LR parsing concepts. Those who are already familiar with similar LR parser generation tools may wish to skip to the more advanced example in <a href="https://cs.hofstra.edu/~cscccl/rustlr_project/calculatorgrammar.html">Chapter 2</a>.</p>
<p>The tutorial will start with a sample grammar.</p>
<pre class="ignore"><code>valuetype i32
nonterminals E T F
terminals + * ( ) num
topsym E

E --&gt; E:e + T:t { e.value + t.value }
E --&gt; T:t { t.value }
T --&gt; T:(t) * F:(f) { t*f }
T --&gt; F:(f) { f }
F --&gt; ( E:e )  { e.value }
F --&gt; num:n { n.value }

EOF
</code></pre>
<p>These are the contents of a Rustlr grammar file, called <a href="https://cs.hofstra.edu/~cscccl/rustlr_project/test1.grammar">test1.grammar</a>. This classic example of LR parsing is found in virtually all compiler textbooks. It is an unambiguous grammar. After you <strong><code>cargo install rustlr</code></strong> you can produce a LALR parser from this grammar file with:</p>
<blockquote>
<p>rustlr test1.grammar</p>
</blockquote>
<p>The first and the only required argument to the executable is the path of the grammar file. Optional arguments (after the grammar path) that can be given to the executable are:</p>
<ul>
<li><strong>-lr1</strong> : this will create a full LR(1) parser if LALR does not suffice. The default is LALR, which works for most examples. A sample grammar requiring full LR(1) can be found <strong><a href="https://cs.hofstra.edu/~cscccl/rustlr_project/nonlalr.grammar">here</a>.</strong> Rustlr will always try to resolve shift-reduce conflicts by precedence and associativity declarations (see later examples) and reduce-reduce conflicts by rule order. So it will generate some kind of parser in any case. The next chapter will explain in detail how conflicts are resolved.</li>
<li><strong>-o filepath</strong> : changes the default destination of the generated parser, which is a file called <a href="https://cs.hofstra.edu/~cscccl/rustlr_project/test1parser.rs">test1parser.rs</a>.</li>
<li><strong>-trace n</strong> : where n is a non-negative integer defining the trace level. Level 0 prints nothing; level 1, which is the default, prints a little more information. Each greater level will print all information in lower levels. -trace 3 will print the states of the LR finite state machine, which could be useful for debugging and training the parser for error message output.</li>
<li><strong>-nozc</strong> : this produces an older version of the runtime parser that does not use the new zero-copy lexical analyzer trait. This option is only retained for backwards compatibility with grammars and lexical scanners written prior to rustlr version 0.2.0.</li>
</ul>
<p>The generated parser will be a program <a href="https://cs.hofstra.edu/~cscccl/rustlr_project/test1parser.rs">test1parser.rs</a> that contains a <strong><code>make_parser</code></strong> function. RustLr will derive the name of the grammar (test1) from the file path, unless there is a declaration of the form</p>
<blockquote>
<p>grammarname somename</p>
</blockquote>
<p>in the grammar spec, in which case the parser generated will be called "somenameparser.rs". The parser must import some elements of rustlr so it should be used in a crate. We will come back to how to use the generated parser later.</p>
<h4 id="grammar-format">GRAMMAR FORMAT</h4>
<p>The first line in the grammar specification:</p>
<blockquote>
<p>valuetype i32</p>
</blockquote>
<p>(alternatively <code>absyntype i32</code>) defines the type of value returned by the parser. In most cases that would be some enum that defines an abstract syntax tree, but here we will just calculate an i32 value. The default valuetype (if none declared) is i64.</p>
<p><strong>The valuetype you choose must implement the Default trait.</strong></p>
<p>RustLr requires that all grammar symbols be defined before any production rules using multiple "nonterminals" or "terminals" directives.</p>
<h4 id="top-nonterminal">Top Nonterminal</h4>
<blockquote>
<p>topsym E</p>
</blockquote>
<p>You should designate one particular non-terminal symbol as the top symbol: The parser generator will always create an extra production rule of the form <code>START --&gt;  topsym EOF</code></p>
<h4 id="grammar-production-rules">Grammar Production Rules</h4>
<p>You will get an error message if the grammar symbols are not defined before the grammar rules. Each rule is indicated by a non-terminal symbol followed by <code>--&gt;</code>, <code>::=</code> , or <code>==&gt;</code>. The symbol <code>::=</code> is interpreted to be the same as <code>--&gt;</code>. <code>==&gt;</code> is for rules that span multiple lines that you will find used in other grammars (later chapters). You can specify multiple production rules with the same left-hand side nonterminal using | which you will also find used in other grammars.</p>
<p>The right hand side of each rule must separate each symbol with whitespaces. For each grammar symbol such as E, you can optionally bind a "label" such as <code>E:a</code>, <code>E:(a)</code>, <code>E:@pattern@</code> or <code>E:v@pattern@</code>. Each type of binding carries a different meaning and affects how they will be used in the semantic action part of the rule. The grammar used in this Chapter will only use the first two forms: <code>a</code> and <code>(a)</code>.</p>
<p>The right-hand side of a rule may be empty, which will make the non-terminal on the left side of <code>--&gt;</code> "nullable".</p>
<h4 id="semantic-actions">SEMANTIC ACTIONS</h4>
<p>Each rule can optionally end with a semantic action inside { and }, which can only follow all grammar symbols making up the right-hand side of the production rule. This is a piece of Rust code that would be injected <em>verbatim</em> into the generated parser. This code will have access to any labels associated with the symbols defined using ":". In a label such as <code>E:e</code>, e is of type <a href="https://docs.rs/rustlr/latest/rustlr/zc_parser/struct.StackedItem.html">StackedItem</a>, which includes the following fields:</p>
<ul>
<li><strong>.value</strong> : <code>e.value</code> refers to the semantic value associated with this symbol, which in this case is of type i32 but in general will be of the type defined by the "valuetype" or "absyntype" directive.</li>
<li><strong>.line</strong> : the line number in the original source where this syntactic construct begins. Lines start at 1.</li>
<li><strong>.column</strong> : the column number (character position on the line) where this syntactic construct begins. Columns start at 1.</li>
</ul>
<p>However, if we are only interested in the .value of the label, we can also capture the value directly using the form demonstrated by <code>T:(t)</code>: in this case <code>t</code> refers <em>only</em> to the .value of the popped StackedItem. In case the valuetype can be described by an irrefutable pattern, such as <code>(i32,i32)</code>, a label such as <code>E:(a,b)</code> can also used to directly capture the value. The other kinds of labels (with the <code>@</code> symbol) will be described in the next chapter.</p>
<p><strong>The semantic action code must return a value of type valuetype</strong> (in this case i32). If no semantic action is given, then a default one is created that just returns valuetype::default(), which is why the valuetype must implement the Default trait. Here's an example, taken from the generated parser, of how the code is injected:</p>
<pre><code>rule.Ruleaction = |parser|{ let mut t = parser.popstack(); let mut _item1_ = parser.popstack(); let mut e = parser.popstack();  e.value + t.value };
</code></pre>
<p>This is the semantic action generated from the rule</p>
<blockquote>
<pre><code> E --&gt; E:e + T:t { e.value + t.value }
</code></pre>
</blockquote>
<p>Notice that if a symbol carries no label, then rustlr generates a name <code>_item{n}_</code> for it. The parser generator is not responsible if you write an invalid semantic action that's rejected by the Rust compiler. Within the { } block, you may also call other actions on the parser, including reporting error messages and telling the parser to abort. However, you should not try to "pop the stack" or change the parser state in other ways: leave that to the generated code.</p>
<h4 id="creating-a-lexer-and-invoking-the-parser"><strong>CREATING A LEXER AND INVOKING THE PARSER</strong></h4>
<p>Here is the <a href="https://cs.hofstra.edu/~cscccl/rustlr_project/test1main.rs">main.rs</a> associated with this grammar, which forms a simple calculator. Its principal contents define a lexical analyzer that conforms to the <a href="https://docs.rs/rustlr/latest/rustlr/lexer_interface/trait.Tokenizer.html">Tokenizer</a> trait.</p>
<pre><code>struct Scanner&lt;&#39;t&gt;(StrTokenizer&lt;&#39;t&gt;);
impl&lt;&#39;t&gt; Tokenizer&lt;&#39;t,i32&gt; for Scanner&lt;&#39;t&gt;
{
   // this function must convert any kind of token produced by the lexer
   // into TerminalTokens expected by the parser.  The built-in lexer,
   // StrTokenizer, produces RawTokens along with their line/column numbers.
   fn nextsym(&amp;mut self) -&gt; Option&lt;TerminalToken&lt;&#39;t,i32&gt;&gt;   {
     let tokopt = self.0.next_token();
     if let None = tokopt {return None;}
     let tok = tokopt.unwrap();
     match tok.0 {  // tok.1,tok.2 are line,column numbers
       RawToken::Num(n) =&gt; Some(TerminalToken::from_raw(tok,&quot;num&quot;,n as i32)),
       RawToken::Symbol(s) =&gt; Some(TerminalToken::from_raw(tok,s,0)),
       _ =&gt; Some(TerminalToken::from_raw(tok,&quot;&lt;&lt;Lexical Error&gt;&gt;&quot;,0)),
     }//match
   }
}
fn main() {
  let mut input = &quot;5+2*3&quot;;
  let args:Vec&lt;String&gt; = std::env::args().collect(); // command-line args
  if args.len()&gt;1 {input = &amp;args[1];}
  let mut parser1 = zc1parser::make_parser();
  let mut tokenizer1 =Scanner(StrTokenizer::from_str(input));
  let result = parser1.parse(&amp;mut tokenizer1);
  println!(&quot;result after parsing {}: {}&quot;,input,result);  
}//main
</code></pre>
<p>To run the program, <strong><code>cargo new</code></strong> a new crate and copy the contents of [main.rs](<a href="https://cs.hofstra.edu/~cscccl/rustlr_project/test1main.rs">https://cs.hofstra.edu/~cscccl/rustlr_project/test1main.rs</a> and <a href="https://cs.hofstra.edu/~cscccl/rustlr_project/test1parser.rs">test1parser.rs</a> to src/main.rs and src/test1parser.rs respectively. Add to Cargo.toml under [dependencies]:</p>
<pre><code>rustlr = &quot;0.2&quot;  
</code></pre>
<p><strong><code>cargo run "2+3*4"</code></strong> will print 14 and <code>cargo run "(2+3)*4"</code> will print 20.</p>
<h4 id="rustlrs-lexical-interface">RustLr's Lexical Interface</h4>
<p>To create a lexical scanner for your grammar, you must become familiar with the <a href="https://docs.rs/rustlr/latest/rustlr/lexer_interface/trait.Tokenizer.html">Tokenizer</a> trait and the <a href="https://docs.rs/rustlr/latest/rustlr/lexer_interface/struct.TerminalToken.html">TerminalToken</a> struct which are defined by rustlr:</p>
<pre><code>pub struct TerminalToken&lt;&#39;t,AT:Default&gt;
{
  pub sym: &amp;&#39;t str,
  pub value: AT,
  pub line:usize,
  pub column:usize,
}
pub trait Tokenizer&lt;&#39;t,AT:Default&gt; 
{
  fn nextsym(&amp;mut self) -&gt; Option&lt;TerminalToken&lt;&#39;t,AT&gt;&gt;;
  //... the above is the only required function when impl Tokenizer
}  
</code></pre>
<p>TerminalTokens are the structures expected by rustlr's built-in runtime parser (called <a href="https://docs.rs/rustlr/latest/rustlr/zc_parser/struct.ZCParser.html">ZCParser</a>, whereas RuntimeParser refers to an older version). <strong>The .sym field of each token must correspond to the name of a terminal symbol of the grammar</strong> being parsed. The value must be of the valuetype or 'absyntype' of the grammar. Each TerminalToken also includes the starting line and column of where the token begins in the source.</p>
<p>The parser requires a ref mut to a Tokenizer-trait object as an argument.</p>
<p>The nextsym function of the trait object must produce Some(TerminalToken) until end of input is reached, at which point it should return None. The <a href="https://docs.rs/rustlr/latest/rustlr/lexer_interface/struct.TerminalToken.html#method.new">TerminalToken::new</a> function can be called to create a new token. Very importantly, the "sym" &amp;str of the TerminalToken must match identically with the name of a terminal symbol of your grammar (yes that's worth repeating). The "value" of the token is something of type valuetype/absyntype as defined by the grammar. In this case each integer constant must be translated into a token with .sym=="num" and .value = the value of integer as an i32.</p>
<p>This example uses the built-in <a href="https://docs.rs/rustlr/latest/rustlr/lexer_interface/struct.StrTokenizer.html">StrTokenizer</a> as lexer. This tokenizer suffices for the examples that have been so-far created by rustlr. It is capable of recognizing multi-line string literals and comments, alphanumeric and non alpha-numeric symbols, decimal and hexadecimal constants (unsigned), floating point constants, character literals such as <code>'a'</code>. It also has the option of returning newline and whitespaces (with count) as tokens. But it does have limitations. It is not the most efficient (not always one-pass, uses regex). It returns all integers as i64 (it would recognize "1u8" as two separate tokens, a number and an alphanumeric symbol "u8"). Negative integers must also be recognized at the parser as opposed to lexer level. The lexer was not designed to recognize binary input. But StrTokenizer does "get the job done" in many cases that are required in compiling and analyzing source code.</p>
<p><a href="https://docs.rs/rustlr/latest/rustlr/lexer_interface/struct.StrTokenizer.html">StrTokenizer</a> produces a structure called <a href="https://docs.rs/rustlr/latest/rustlr/lexer_interface/enum.RawToken.html">RawToken</a>. The <a href="https://docs.rs/rustlr/latest/rustlr/lexer_interface/struct.TerminalToken.html#method.from_raw">TerminalToken::from_raw</a> function converts a tuple that consists of (RawToken,line,column) into a TerminalToken. <a href="https://docs.rs/rustlr/latest/rustlr/lexer_interface/enum.RawToken.html">RawToken</a> is an enum that includes <code>Num</code> that carries an i64 value, and <code>Symbol</code>, which carries a string of non-alphanumeric symbols such as <code>*</code>.</p>
<p>Besides the <a href="https://docs.rs/rustlr/latest/rustlr/lexer_interface/struct.TerminalToken.html#method.from_raw">TerminalToken::from_raw</a> function, there is no link between the specific tokenizer and the parser. Any lexer can be adopted to impl the <a href="https://docs.rs/rustlr/latest/rustlr/lexer_interface/trait.Tokenizer.html">Tokenizer</a> trait by converting whatever kind of tokens they produce into TerminalTokens in the <strong><a href="https://docs.rs/rustlr/latest/rustlr/lexer_interface/trait.Tokenizer.html#tymethod.nextsym">nextsym</a></strong> function required by the trait.</p>
<p>An instance of the runtime parser is created by calling the <strong><code>make_parser</code></strong> function, which is the only exported function of the generated parser. Once a lexer has also been created, parsing can commence by calling</p>
<blockquote>
<pre><code> `parser1.parse(&amp;mut tokenizer1)`
</code></pre>
</blockquote>
<p>This function will return a value of type valuetype. It will return a valuetype-value even if parsing failed (but error messages will be printed). After .parse returns, you can also check if an error had occurred by calling <code>parser1.error_occurred()</code> before deciding to use the valuetype result that was returned.</p>
<h4 id="reserved-symbols">Reserved Symbols</h4>
<p>The following terminal symbols are reserved and should not be used in a grammar:</p>
<blockquote>
<pre><code> EOF   ANY_ERROR   :  |  @  {  }  --&gt;  ::=  ==&gt;  &lt;==  
</code></pre>
</blockquote>
<p>The following symbols should also NOT be used as non-terminals in your grammar:</p>
<blockquote>
<pre><code>START valuetype absyntype grammarname resync resynch topsym errsym 
nonterminal terminal nonterminals terminals flexname lexname typedterminal
left right externtype externaltype lifetime
</code></pre>
</blockquote>
<p>For example, if ":" is to be one of the terminal symbols of your language, then you should call it something like COLON inside the grammar. You will then adopt your lexical analyzer so that ":" is translated into a <a href="https://docs.rs/rustlr/latest/rustlr/lexer_interface/struct.TerminalToken.html">TerminalToken</a> with .sym="COLON" before sending the token to the parser. If you want to treat a whitespace as a token your lexer must similarly translate whitespaces into something like WHITESPACE. Non-terminal symbol START and terminal EOF will always be added as additional symbols to the grammar. The other symbols that should not be used for non-terminals are for avoiding clash with grammar directives.</p>
<p>The following identifiers (variable names) are reserved and should only be used carefully from within the semantic actions of a grammar production (rust code inside {}s):</p>
<ul>
<li><strong><code>parser</code></strong> : the code generated from the semantic actions is of the form <code>|parser|{...}</code>. The <em>parser</em> refers to the instance of the runtime parser <a href="https://docs.rs/rustlr/latest/rustlr/zc_parser/struct.ZCParser.html">ZCParser</a>. It is valid to invoke certain functions on this object inside the semantic actions, including <a href="https://docs.rs/rustlr/latest/rustlr/zc_parser/struct.ZCParser.html#method.report">parser.report</a> (to report an error message), <a href="https://docs.rs/rustlr/latest/rustlr/zc_parser/struct.ZCParser.html#method.abort">parser.abort</a> and most importantly, <a href="https://docs.rs/rustlr/latest/rustlr/zc_parser/struct.ZCParser.html#method.lbx">parser.lbx</a>, which forms an <a href="https://docs.rs/rustlr/latest/rustlr/generic_absyn/struct.LBox.html">LBox</a> smartpointer by inserting into it line/column information that accompanies an abstract syntax value (see next chapter). However, there are other functions on parser that are exported, but should only be called by the automatically generated portion of the code. For example, calling parser.popstack() would remove an extra state/value from the parse stack and corrupt the core parsing algorithm.</li>
<li><strong><code>_item0_, item1_, item{n}_</code></strong> : these variables may be generated to hold the values that are popped from the stack.</li>
<li><strong><code>SYMBOLS, TABLE</code></strong>: these are constant arrays holding essential information about the LR state machine.</li>
<li>function names <strong><code>make_parser</code></strong>, <strong><code>load_extras</code></strong>, <strong><code>_semaction_for_{n}_</code></strong></li>
</ul>
<h4 id="a-self-contained-example"><strong>A self-contained example</strong></h4>
<p>Most rustlr projects will consist of mulitple files: the .grammar file, a module defining the abstract syntax type, a module defining a lexical analyzer, the generated parser as another module, and presumably a main to launch the program. In <a href="https://cs.hofstra.edu/~cscccl/rustlr_project/brackets.grammar">this additional example</a>, enough code has been injected into the .grammar so that rustlr can generate a relatively <a href="https://cs.hofstra.edu/~cscccl/rustlr_project/bracketsparser.rs">self-contained program</a>, that includes a lexer and a main, and illustrates a few extra features of Rustlr. This example also uses charscanner, which is another tokenizer that comes with Rustlr, this time designed to parse one character at a time.</p>
<hr />