Crate lang_pt

Language parsing tool (lang_pt) is a library for building recursive descent, top-down parsers that parse languages or text into an Abstract Syntax Tree (AST).

Overview

Parsers for languages like JavaScript are often handwritten because of the complexity of the language. However, custom parser code increases development and maintenance costs. This library was created to reduce that effort when building a parser for a high-level language (HLL). The goal is a flexible library that supports a wide range of grammars while keeping performance reasonably close to that of a custom-written parser.

Design

A language parser is usually developed either by writing custom code by hand or by using a parser generator tool. With a parser generator, the grammar is written in a Domain-Specific Language (DSL) specified by the tool, which then compiles the grammar into parser code in the target runtime language. This library takes a different approach: it provides a set of production utilities for implementing the grammar directly in Rust. Instead of writing the grammar in a generator-specific language, one uses utilities such as Concat and Union to implement concatenation and alternation of symbols.
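
For instance, an alternation of two token productions and a concatenation over the result can be sketched as follows (a minimal sketch; string_field, number_field, and colon_field stand for hypothetical TokenField productions like those defined in the full example below):

// NOTE: string_field, number_field, and colon_field are hypothetical
// TokenField productions (see the JSON example below for real ones).
let value = Rc::new(Union::init("value"));
value
    .set_symbols(vec![string_field.clone(), number_field.clone()])
    .unwrap();
// `pair` concatenates its symbols in sequence: value ':' value
let pair = Rc::new(Concat::new(
    "pair",
    vec![value.clone(), colon_field.clone(), value.clone()],
));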

The tool is also equipped with utilities such as Lookahead, Validator, and NonStructural to support custom validation, precedence-based parsing, and similar needs, so it can parse a wide range of languages that require custom functionality to be injected into the grammar. In addition, production utilities such as SeparatedList and Suffixes ease writing the grammar for a language.
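
For instance, the JSON example below declares the comma-separated members of an object in a single line using SeparatedList, with into_nullable additionally allowing an empty member list:

let json_object_item_list =
    Rc::new(SeparatedList::new(&json_object_item_node, &hidden_comma, true).into_nullable());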

Example

The following is a JSON parser implemented with lang_pt.

// # Tokenization

use lang_pt::production::ProductionBuilder;
use lang_pt::{
    lexeme::{Pattern, Punctuations},
    production::{Concat, EOFProd, Node, SeparatedList, TokenField, TokenFieldSet, Union},
    DefaultParser, NodeImpl, TokenImpl, Tokenizer,
};
use std::rc::Rc;

#[derive(Debug, PartialEq, Eq, PartialOrd, Ord, Hash, Clone, Copy)]
// JSON token
pub enum JSONToken {
    EOF,
    String,
    Space,
    Colon,
    Comma,
    Number,
    Constant,
    OpenBrace,
    CloseBrace,
    OpenBracket,
    CloseBracket,
}

#[derive(Debug, PartialEq, Eq, PartialOrd, Ord, Hash, Clone, Copy)]
// Node value for AST
pub enum JSONNode {
    Key,
    String,
    Number,
    Constant,
    Array,
    Object,
    Item,
    Main,
    NULL,
}

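// `Space` is the only non-structural token: whitespace is skipped by the
// parser, while all other tokens participate in the grammar.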
impl TokenImpl for JSONToken {
    fn eof() -> Self { JSONToken::EOF }
    fn is_structural(&self) -> bool {
        match self {
            JSONToken::Space => false,
            _ => true,
        }
    }
}
impl NodeImpl for JSONNode {
    fn null() -> Self { JSONNode::NULL }
}

let punctuations = Rc::new(
    Punctuations::new(vec![
        ("{", JSONToken::OpenBrace),
        ("}", JSONToken::CloseBrace),
        ("[", JSONToken::OpenBracket),
        ("]", JSONToken::CloseBracket),
        (",", JSONToken::Comma),
        (":", JSONToken::Colon),
    ])
    .unwrap(),
);

let dq_string = Rc::new(
    Pattern::new(
        JSONToken::String,
        r#"^"([^"\\\r\n]|(\\[^\S\r\n]*[\r\n][^\S\r\n]*)|\\.)*""#, //["\\bfnrtv]
    )
    .unwrap(),
);

let lex_space = Rc::new(Pattern::new(JSONToken::Space, r"^\s+").unwrap());
let number_literal = Rc::new(
    Pattern::new(JSONToken::Number, r"^([0-9]+)(\.[0-9]+)?([eE][+-]?[0-9]+)?").unwrap(),
);
let const_literal = Rc::new(Pattern::new(JSONToken::Constant, r"^(true|false|null)").unwrap());

let tokenizer = Tokenizer::new(vec![
    lex_space,
    punctuations,
    dq_string,
    number_literal,
    const_literal,
]);

// # Parser

let eof = Rc::new(EOFProd::new(None));

let json_key = Rc::new(TokenField::new(JSONToken::String, Some(JSONNode::Key)));

let json_primitive_values = Rc::new(TokenFieldSet::new(vec![
    (JSONToken::String, Some(JSONNode::String)),
    (JSONToken::Constant, Some(JSONNode::Constant)),
    (JSONToken::Number, Some(JSONNode::Number)),
]));


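// Tokens mapped to `None` are matched and consumed but omitted from the AST.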
let hidden_open_brace = Rc::new(TokenField::new(JSONToken::OpenBrace, None));
let hidden_close_brace = Rc::new(TokenField::new(JSONToken::CloseBrace, None));
let hidden_open_bracket = Rc::new(TokenField::new(JSONToken::OpenBracket, None));
let hidden_close_bracket = Rc::new(TokenField::new(JSONToken::CloseBracket, None));
let hidden_comma = Rc::new(TokenField::new(JSONToken::Comma, None));
let hidden_colon = Rc::new(TokenField::new(JSONToken::Colon, None));
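
// `init` creates empty productions whose symbols are set later (via
// `set_symbols` below), enabling the recursive JSON value grammar.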
let json_object = Rc::new(Concat::init("json_object"));
let json_value_union = Rc::new(Union::init("json_value_union"));
let json_object_item = Rc::new(Concat::new(
    "json_object_item",
    vec![
        json_key.clone(),
        hidden_colon.clone(),
        json_value_union.clone(),
    ],
));

let json_object_item_node = Rc::new(Node::new(&json_object_item, Some(JSONNode::Item)));
let json_object_item_list =
    Rc::new(SeparatedList::new(&json_object_item_node, &hidden_comma, true).into_nullable());
let json_array_item_list =
    Rc::new(SeparatedList::new(&json_value_union, &hidden_comma, true).into_nullable());
let json_array_node = Rc::new(
    Concat::new(
        "json_array",
        vec![
            hidden_open_bracket.clone(),
            json_array_item_list.clone(),
            hidden_close_bracket.clone(),
        ],
    )
    .into_node(Some(JSONNode::Array)),
);

let json_object_node = Rc::new(Node::new(&json_object, Some(JSONNode::Object)));

json_value_union
    .set_symbols(vec![
        json_primitive_values.clone(),
        json_object_node.clone(),
        json_array_node.clone(),
    ])
    .unwrap();

json_object
    .set_symbols(vec![
        hidden_open_brace.clone(),
        json_object_item_list,
        hidden_close_brace.clone(),
    ])
    .unwrap();

let main = Rc::new(Concat::new("root", vec![json_value_union, eof]));
let main_node = Rc::new(Node::new(&main, Some(JSONNode::Main)));
let parser = DefaultParser::new(Rc::new(tokenizer), main_node).unwrap();
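
The parser can now turn JSON text into an AST. A minimal usage sketch, assuming DefaultParser exposes a parse method over the input bytes, as in the crate's repository examples:

let code = r#"{"name":"lang_pt","keywords":["parser","tokenizer"],"stable":true}"#;
// Parse the input bytes into a Debug-printable tree of AST nodes.
let tree_list = parser.parse(code.as_bytes()).unwrap();
println!("{:?}", tree_list);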

License

lang_pt is provided under the MIT license. See LICENSE.

Modules

A module of example grammar implementations for parsing languages with this parser tool.
A module of lexeme utilities, which analyze string slices at incremental positions of the input and create tokens.
A module of production utilities, the helpers used to write the grammar for the parser.

Structs

The Abstract Syntax Tree (AST) of the parsed input.
A structure that stores the furthest successful parse position and the parsed result for the Packrat parsing technique.
A unique key used to save and retrieve parsed results for the Packrat parsing technique.
A wrapper around the input to be parsed, together with line information.
A state-based tokenizer for lexical analysis.
A parser structure for building a tokenizer-based parsing program.
A wrapper indicating the indices of structural tokens in the TokenStream.
An error returned when validation of the production utilities or grammar fails.
An element of the tokenized data.
A parser structure for parsing input without a tokenizer.
An error returned when the parser fails to parse the input due to a syntax error.
Line and column information at a code point.
An Ok result value returned from a production utility when it successfully consumes a production derivation.
A wrapper indicating the index of an element of the tokenized data in the TokenStream.
A wrapper implementation of the tokenized data.
The base tokenization structure for lexical analysis.

Enums

An enum for assigning multiple debugging levels to lexeme and production utilities.
An error indicating a failure while consuming the input into an AST.

Traits

An interface implemented by all lexeme utilities, which are the primary elements of a tokenizer.
A trait implemented by production utilities, which are used to write the production rules of the grammar.
A trait with a tokenize method that takes UTF-8 input bytes and produces a token stream.
A trait implemented by token types to provide default token values (such as the end-of-file token) and to mark tokens as structural.
A trait implemented by node value types to provide the default (null) value assigned to an ASTNode.

Type Definitions

A result returned from a Production when it tries to consume input.