Crate nlprule

Rule-based grammatical error correction through parsing LanguageTool rules.

Overview

nlprule has the following core abstractions:

  • A Tokenizer to split a text into tokens and analyze it by chunking, lemmatizing and part-of-speech tagging. It can also be used independently of the grammatical rules.
  • A Rules structure containing a set of grammatical error correction rules.

Examples

Correct a text:

use nlprule::{Tokenizer, Rules};

let tokenizer = Tokenizer::new("path/to/en_tokenizer.bin")?;
let rules = Rules::new("path/to/en_rules.bin")?;

assert_eq!(
    rules.correct("She was not been here since Monday.", &tokenizer),
    String::from("She was not here since Monday.")
);

Get suggestions and correct a text:

use nlprule::{Tokenizer, Rules, types::Suggestion, rules::apply_suggestions};

let tokenizer = Tokenizer::new("path/to/en_tokenizer.bin")?;
let rules = Rules::new("path/to/en_rules.bin")?;

let text = "She was not been here since Monday.";

let suggestions = rules.suggest(text, &tokenizer);
assert_eq!(*suggestions[0].span().char(), 4usize..16);
assert_eq!(suggestions[0].replacements(), vec!["was not", "has not been"]);
assert_eq!(suggestions[0].source(), "GRAMMAR/WAS_BEEN/1");
assert_eq!(suggestions[0].message(), "Did you mean was not or has not been?");

let corrected = apply_suggestions(text, &suggestions);

assert_eq!(corrected, "She was not here since Monday.");

Tokenize & analyze a text:

use nlprule::Tokenizer;

let tokenizer = Tokenizer::new("path/to/en_tokenizer.bin")?;

let text = "A brief example is shown.";

// returns an iterator over sentences
let sentence = tokenizer.pipe(text).next().expect("`text` contains one sentence.");

println!("{:#?}", sentence);
assert_eq!(sentence.tokens()[1].word().text().as_str(), "brief");
assert_eq!(sentence.tokens()[1].word().tags()[0].pos().as_str(), "JJ");
assert_eq!(sentence.tokens()[1].chunks(), vec!["I-NP-singular"]);
// some other information like char / byte span, lemmas etc. is also set!

Binaries are distributed with GitHub releases.
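
If a binary is bundled with your application (for example via include_bytes!), it can also be loaded from memory instead of from a path. A minimal sketch; the include_bytes! paths are placeholders for wherever the release artifacts are stored at build time:

use nlprule::{Rules, Tokenizer};

// Placeholder paths: point these at the downloaded release artifacts.
let mut tokenizer_bytes: &[u8] = include_bytes!("../binaries/en_tokenizer.bin");
let mut rules_bytes: &[u8] = include_bytes!("../binaries/en_rules.bin");

// `&[u8]` implements `Read`, so the binaries deserialize directly from memory.
let tokenizer = Tokenizer::from_reader(&mut tokenizer_bytes)?;
let rules = Rules::from_reader(&mut rules_bytes)?;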

Re-exports

pub use rules::Rules;
pub use tokenizer::Tokenizer;

Modules

rule

Implementations related to single rules.

rules

Sets of grammatical error correction rules.
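
The individual rules in a set can also be inspected. The following is only a sketch: it assumes a rules() accessor on Rules and an id() getter on each Rule (matching the "GRAMMAR/WAS_BEEN/1" identifiers returned by Suggestion::source above); consult the rule and rules module docs for the exact API:

use nlprule::Rules;

let rules = Rules::new("path/to/en_rules.bin")?;

// Assumed accessors: `rules()` for the contained rules, `id()` for a
// rule's identifier. Print the first few rule identifiers.
for rule in rules.rules().iter().take(5) {
    println!("{}", rule.id());
}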

tokenizer

A tokenizer to split raw text into tokens. Tokens are assigned lemmas and part-of-speech tags by lookup from a Tagger, and chunks containing information about noun/verb phrases and grammatical case by a statistical Chunker. Tokens are then disambiguated (i.e. information from the initial assignment is changed) in a rule-based way by DisambiguationRules.
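
Building on the analysis example above, the lemmas and tags assigned by the Tagger can be inspected per token. A short sketch; the lemma() accessor is assumed to sit alongside the pos() accessor used earlier:

use nlprule::Tokenizer;

let tokenizer = Tokenizer::new("path/to/en_tokenizer.bin")?;

for sentence in tokenizer.pipe("A brief example is shown.") {
    for token in sentence.tokens() {
        // Each tag pairs a lemma with a part-of-speech label; a token can
        // carry several candidate tags after lookup and disambiguation.
        for tag in token.word().tags() {
            println!(
                "{}: lemma={}, pos={}",
                token.word().text().as_str(),
                tag.lemma().as_str(),
                tag.pos().as_str()
            );
        }
    }
}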

types

Fundamental types used by this crate.

Macros

rules_filename

Gets the canonical filename for the rules binary for a language code in ISO 639-1 (two-letter) format.

tokenizer_filename

Gets the canonical filename for the tokenizer binary for a language code in ISO 639-1 (two-letter) format.
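
The macros expand to string literals, so they can be used wherever a literal is required, e.g. inside concat! or include_bytes!. A short sketch, assuming the canonical names follow the en_tokenizer.bin / en_rules.bin pattern from the examples above:

use nlprule::{rules_filename, tokenizer_filename};

// The language code must be a literal; the expansion happens at compile time.
assert_eq!(tokenizer_filename!("en"), "en_tokenizer.bin");
assert_eq!(rules_filename!("en"), "en_rules.bin");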

Enums

Error

The error type for this crate.

Functions

rules_filename

Gets the canonical filename for the rules binary for a language code in ISO 639-1 (two-letter) format.

tokenizer_filename

Gets the canonical filename for the tokenizer binary for a language code in ISO 639-1 (two-letter) format.
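
Unlike the macros above, the functions build the filename at runtime, which helps when the language code is only known from configuration or user input. A sketch assuming the signature fn(&str) -> String:

use nlprule::{rules_filename, tokenizer_filename, Rules, Tokenizer};

let lang = "en"; // e.g. chosen at runtime from configuration

// Assumed to mirror the macro variants, but computed at runtime.
let tokenizer = Tokenizer::new(format!("path/to/{}", tokenizer_filename(lang)))?;
let rules = Rules::new(format!("path/to/{}", rules_filename(lang)))?;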