Crate lib_ruby_parser

§lib-ruby-parser

lib-ruby-parser is a Ruby parser written in Rust.

Basic usage:

use lib_ruby_parser::{Parser, ParserOptions};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let options = ParserOptions {
        buffer_name: "(eval)".to_string(),
        ..Default::default()
    };
    let mut parser = Parser::new(b"2 + 2".to_vec(), options);

    println!("{:#?}", parser.do_parse());

    Ok(())
}

Full documentation

§Features

TL;DR: it’s fast, it’s precise, and it has a beautiful interface.

Comparison with Ripper/RubyVM::AST:

  1. It’s based on MRI’s parse.y, and so it returns exactly the same sequence of tokens.
  2. It’s been tested on the top 300 gems (by total downloads; that’s about 4M LOC), rubyspec and ruby/ruby repos, and there’s no difference from Ripper.lex.
  3. It’s ~5 times faster than Ripper: Ripper parses 4M LOC in ~24s, while lib-ruby-parser does it in ~4.4s. That’s ~950K LOC/s. You can find benchmarks in the bench/ directory; they don’t include any IO or GC.
  4. It has a much, much better interface. AST is strongly typed and well documented.
  5. It doesn’t throw away information about tokens. All nodes have information about their source locations.

Comparison with whitequark/parser:

  1. It’s much faster (whitequark/parser needs ~245s to parse the same 4M LOC corpus on the same machine).
  2. It has a very similar interface (both in terms of AST structure and error reporting).
  3. However, the AST is strongly typed, so if something is nullable it’s explicitly defined and documented.
  4. Importantly, it doesn’t depend on Ruby.

The testing corpus has 4,176,379 LOC and 170,114,575 bytes, so the approximate parsing speed on my local machine is:

Parser               Total time   Bytes per second   Lines per second
lib-ruby-parser      ~4.4s        ~38,000,000        ~950,000
ripper               ~24s         ~7,000,000         ~175,000
whitequark/parser    ~245s        ~700,000           ~17,000
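
The per-second figures follow directly from the corpus size and the total times. A minimal sketch of that arithmetic, using only the standard library (the `Throughput` struct and `throughput` function are illustrative names, not part of lib-ruby-parser):

```rust
/// Throughput of one parser over the whole corpus.
struct Throughput {
    bytes_per_sec: f64,
    lines_per_sec: f64,
}

/// Divides the corpus size by the wall-clock time of a run.
fn throughput(total_bytes: u64, total_lines: u64, seconds: f64) -> Throughput {
    Throughput {
        bytes_per_sec: total_bytes as f64 / seconds,
        lines_per_sec: total_lines as f64 / seconds,
    }
}

fn main() {
    // Corpus size quoted in the text.
    const BYTES: u64 = 170_114_575;
    const LINES: u64 = 4_176_379;

    // Approximate total times taken from the table.
    let runs = [
        ("lib-ruby-parser", 4.4),
        ("ripper", 24.0),
        ("whitequark/parser", 245.0),
    ];

    for (name, seconds) in runs {
        let t = throughput(BYTES, LINES, seconds);
        println!(
            "{}: {:.0} bytes/s, {:.0} lines/s",
            name, t.bytes_per_sec, t.lines_per_sec
        );
    }
}
```

The table rounds these results, which is why every entry carries a `~`.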

§Grammar versioning

lib-ruby-parser follows MRI/master. There are no plans to support multiple Ruby versions the way whitequark/parser does.

§Library versioning

Ruby version    lib-ruby-parser version
3.0.0           3.0.0+
3.1.0           4.0.0+ruby-3.1.0

Starting from 4.0.0 lib-ruby-parser follows SemVer. Base version increments according to API changes, while metadata matches current Ruby version, i.e. X.Y.Z+ruby-A.B.C means:

  • X.Y.Z base version
  • that parses Ruby A.B.C

Both versions bump separately.
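
To illustrate the `X.Y.Z+ruby-A.B.C` scheme, here is a minimal sketch that splits such a string into its two parts. `split_version` is a hypothetical helper written for this example; it is not provided by the crate.

```rust
/// Splits a version such as "4.0.0+ruby-3.1.0" into the base version and,
/// when present, the Ruby version carried in the build metadata.
fn split_version(version: &str) -> (&str, Option<&str>) {
    match version.split_once('+') {
        // Metadata that does not start with "ruby-" yields `None`.
        Some((base, metadata)) => (base, metadata.strip_prefix("ruby-")),
        None => (version, None),
    }
}

fn main() {
    for v in ["4.0.0+ruby-3.1.0", "3.0.0"] {
        let (base, ruby) = split_version(v);
        println!("{} -> base {}, ruby {:?}", v, base, ruby);
    }
}
```

Under SemVer, build metadata is ignored when determining version precedence, so only the base part takes part in dependency resolution.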

§Encodings

By default lib-ruby-parser can only parse source files encoded in UTF-8 or ASCII-8BIT/BINARY.

It’s possible to pass a decoder function in ParserOptions that takes a recognized (by the library) encoding and a byte array. It must return a UTF-8 encoded byte array or an error:

use lib_ruby_parser::source::{InputError, Decoder, DecoderResult};
use lib_ruby_parser::{Parser, ParserOptions, ParserResult, LocExt};

fn decode(encoding: String, input: Vec<u8>) -> DecoderResult {
    if "US-ASCII" == encoding.to_uppercase() {
        // reencode and return Ok(result)
        return DecoderResult::Ok(b"# encoding: us-ascii\ndecoded".to_vec());
    }
    DecoderResult::Err(InputError::DecodingError(
        "only us-ascii is supported".to_string(),
    ))
}

let options = ParserOptions {
    decoder: Some(Decoder::new(Box::new(decode))),
    ..Default::default()
};
let mut parser = Parser::new(b"# encoding: us-ascii\n3 + 3".to_vec(), options);
let ParserResult { ast, input, .. } = parser.do_parse();

assert_eq!(ast.unwrap().expression().source(&input).unwrap(), "decoded".to_string())
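
The `decode` function above returns a hard-coded result where a real decoder would re-encode the input. As a rough sketch of what that re-encoding step might look like, the helper below converts ISO-8859-1 (Latin-1) bytes to UTF-8 using only the standard library. `latin1_to_utf8` is a hypothetical name, and which encoding names the library actually passes to a decoder is not covered here.

```rust
/// Re-encodes ISO-8859-1 bytes as UTF-8.
///
/// Every Latin-1 byte maps to the Unicode code point with the same number,
/// so casting each byte to `char` is enough.
fn latin1_to_utf8(input: &[u8]) -> Vec<u8> {
    input
        .iter()
        .map(|&byte| byte as char)
        .collect::<String>()
        .into_bytes()
}

fn main() {
    // 0xE9 is "é" in Latin-1; in UTF-8 it becomes the two bytes 0xC3 0xA9.
    let utf8 = latin1_to_utf8(b"caf\xE9");
    println!("{:?}", utf8);
    println!("{}", String::from_utf8(utf8).expect("output is valid UTF-8"));
}
```

A decoder function could call such a helper for the encodings it supports and return an error for everything else, as the example above does.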

§Invalid string values

Ruby doesn’t require string literals to be valid in their encodings. This is why the following code is valid:

# encoding: utf-8

"\xFF"

The byte 0xFF (255) is invalid in UTF-8, but MRI accepts it anyway.

But not all languages support it, and this is why string and symbol nodes encapsulate a custom StringValue instead of a plain String.

If your language supports invalid strings you can use raw .bytes of this StringValue. For example, a Ruby wrapper for this library could do that.

If your language doesn’t support it, it’s better to call .to_string_lossy(), which replaces all unsupported chars with a special U+FFFD REPLACEMENT CHARACTER (�).
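
Rust itself is one of the languages that rejects invalid strings. The sketch below shows the difference between strict and lossy conversion using only the standard library. It does not touch the crate's StringValue; that its lossy conversion behaves like `String::from_utf8_lossy` is an assumption. `to_lossy` is a hypothetical helper.

```rust
/// Converts arbitrary bytes to a `String`, replacing invalid UTF-8
/// sequences with U+FFFD REPLACEMENT CHARACTER.
fn to_lossy(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).into_owned()
}

fn main() {
    // The bytes of the Ruby literal "a\xFFb".
    let bytes: &[u8] = b"a\xFFb";

    // Strict conversion fails because 0xFF never occurs in valid UTF-8.
    assert!(std::str::from_utf8(bytes).is_err());

    // Lossy conversion succeeds and marks the bad byte.
    let lossy = to_lossy(bytes);
    assert_eq!(lossy, "a\u{FFFD}b");
    println!("{}", lossy);
}
```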

§Regexes

Ruby constructs regexes from literals during parsing to:

  1. validate them
  2. declare local variables if the regex is used for matching AND it contains named captures

To mirror this behavior lib-ruby-parser uses Oniguruma to compile, validate and parse regex literals.

This feature is disabled by default, but you can turn it on by enabling the "onig" feature.

§Bison

The grammar of lib-ruby-parser is built using a custom bison skeleton that was written for this project.

For development you need the latest version of Bison installed locally. Of course, it’s not necessary for release builds from crates.io (because the compiled parser.rs is included in the release build AND the build.rs that converts it is excluded).

If you use it from GitHub directly you also need Bison (because parser.rs is listed in .gitignore).

§Bindings for other languages

§Profiling

You can use the parse example:

$ cargo run --bin parse --features=bin-parse -- --print=N --run-profiler --glob "blob/**/*.rb"

§Benchmarking

A codebase of 4M LOC can be generated using the download.rb script:

$ ruby gems/download.rb

Then, run a script that compares Ripper and lib-ruby-parser (attached results are from Mar 2024):

$ ./scripts/bench.sh
Running lib-ruby-parser
Run 1:
Time taken: 4.4287733330 (total files: 17895)
Run 2:
Time taken: 4.4292764170 (total files: 17895)
Run 3:
Time taken: 4.4460961250 (total files: 17895)
Run 4:
Time taken: 4.4284508330 (total files: 17895)
Run 5:
Time taken: 4.4695665830 (total files: 17895)
--------
Running MRI/ripper
Run 1:
Time taken: 24.790103999897838 (total files: 17894)
Run 2:
Time taken: 23.145863000303507 (total files: 17894)
Run 3:
Time taken: 25.50493900012225 (total files: 17894)
Run 4:
Time taken: 24.570900999940932 (total files: 17894)
Run 5:
Time taken: 26.0963700003922 (total files: 17894)
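
To condense the five runs per parser into one number each, the sketch below averages the timings printed above and derives the speed-up. It uses only the standard library; `mean` and the two constants are names chosen for this example.

```rust
/// Timings in seconds, copied from the benchmark output above.
const LIB_RUBY_PARSER: [f64; 5] = [
    4.4287733330,
    4.4292764170,
    4.4460961250,
    4.4284508330,
    4.4695665830,
];
const RIPPER: [f64; 5] = [
    24.790103999897838,
    23.145863000303507,
    25.50493900012225,
    24.570900999940932,
    26.0963700003922,
];

/// Arithmetic mean of a non-empty slice of samples.
fn mean(samples: &[f64]) -> f64 {
    samples.iter().sum::<f64>() / samples.len() as f64
}

fn main() {
    let lib = mean(&LIB_RUBY_PARSER);
    let ripper = mean(&RIPPER);
    println!("lib-ruby-parser mean: {:.2}s", lib);
    println!("MRI/ripper mean:      {:.2}s", ripper);
    println!("speed-up:             {:.1}x", ripper / lib);
}
```

The resulting ratio lands between 5 and 6, in line with the "~5 times faster" figure quoted under Features.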

§Fuzz testing

First, make sure to switch to nightly:

$ rustup default nightly

Then install cargo-fuzz:

$ cargo install cargo-fuzz

And run the fuzzer (adjust --jobs as needed, or remove it to run a single process):

$ RUST_BACKTRACE=1 cargo fuzz run parse --jobs=8 -- -max_len=50

A Ruby parser written in Rust.

Uses bison under the hood.

Modules§

  • Module with all known lex states
  • Module with all known node types
  • Module with everything related to output of the Parser, but not related to AST, like Comment, Input, Decoder
  • Module to perform recursive traversing

Structs§

  • Representation of a byte sequence
  • Diagnostic message that comes from the parser when there’s an error or warning
  • State of the lexer
  • A struct responsible for converting a given input into a sequence of tokens
  • Representation of any location in the given input
  • A Bison parser, automatically generated from src/parser/parse.y.
  • Configuration of the parser
  • Combination of all data that Parser can give you
  • Representation of a reserved word in Ruby
  • Stack of local variables in nested scopes
  • A token that is emitted by a lexer and consumed by a parser

Enums§

  • Enum of all possible diagnostic messages (both warnings and errors)
  • Error level of the diagnostic message
  • Generic combination of all known nodes.

Traits§

  • A set of parser-specific extensions for Loc

Functions§