Crate nlprule

Rule-based grammatical error correction through parsing LanguageTool rules.

Overview

NLPRule has the following core abstractions:

  • A Tokenizer to split a text into tokens and analyze it by chunking, lemmatizing and part-of-speech tagging. It can also be used independently of the grammatical rules (see the sketch after this list).
  • A Rules structure containing a set of grammatical error correction rules.
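
For example, the tokenizer can analyze a text on its own. This is a minimal sketch: it assumes a pipe method that runs the full analysis pipeline; the exact method name and return type may differ between versions.

use nlprule::Tokenizer;

let tokenizer = Tokenizer::new("path/to/en_tokenizer.bin")?;

// `pipe` is assumed here to run the full pipeline (tokenization, tagging,
// chunking, disambiguation); see the tokenizer module for the exact API.
for token in tokenizer.pipe("She was not here since Monday.") {
    println!("{:?}", token);
}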

Example: correct a text

use nlprule::{Tokenizer, Rules};

let tokenizer = Tokenizer::new("path/to/en_tokenizer.bin")?;
let rules = Rules::new("path/to/en_rules.bin")?;

assert_eq!(
    rules.correct("She was not been here since Monday.", &tokenizer),
    String::from("She was not here since Monday.")
);

Example: get suggestions and correct a text

use nlprule::{Tokenizer, Rules, types::Suggestion, rules::apply_suggestions};

let tokenizer = Tokenizer::new("path/to/en_tokenizer.bin")?;
let rules = Rules::new("path/to/en_rules.bin")?;

let text = "She was not been here since Monday.";

let suggestions = rules.suggest(text, &tokenizer);
assert_eq!(
    suggestions,
    vec![Suggestion {
        start: 4, // these are character indices!
        end: 16,
        replacements: vec!["was not".into(), "has not been".into()],
        source: "WAS_BEEN.1".into(),
        message: "Did you mean was not or has not been?".into()
    }]
);

let corrected = apply_suggestions(text, &suggestions);

assert_eq!(corrected, "She was not here since Monday.");

Binaries are distributed with GitHub releases.

The 't lifetime

By convention the lifetime 't in this crate is the lifetime of the input text. Almost all structures with a lifetime are bound to this lifetime.
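
For illustration, a helper with the conventional signature might look as follows. This is a sketch only: the pipe method and its return type are assumptions and may differ between versions.

use nlprule::{types::Token, Tokenizer};

// The returned tokens borrow from `text`: they carry the 't lifetime of the
// input and cannot outlive it.
fn analyze<'t>(tokenizer: &Tokenizer, text: &'t str) -> Vec<Token<'t>> {
    tokenizer.pipe(text)
}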

Re-exports

pub use rules::Rules;
pub use tokenizer::Tokenizer;

Modules

rule

Implementations related to single rules.

rules

Sets of grammatical error correction rules.

tokenizer

A tokenizer to split raw text into tokens. Tokens are assigned lemmas and part-of-speech tags by lookup from a Tagger, and chunks containing information about noun/verb phrases and grammatical case by a statistical Chunker. Tokens are then disambiguated (i.e. information from the initial assignment is changed) in a rule-based way by DisambiguationRules.

types

Fundamental types used by this crate.

Macros

rules_filename

Gets the canonical filename for the rules binary for a language code in ISO 639-1 (two-letter) format.

tokenizer_filename

Gets the canonical filename for the tokenizer binary for a language code in ISO 639-1 (two-letter) format.
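
For example, the macros can be used with string literals. A sketch: the expansions shown in the comments are assumed to match the en_rules.bin / en_tokenizer.bin names used in the examples above.

use nlprule::{rules_filename, tokenizer_filename};

// Expands at compile time from an ISO 639-1 code literal.
let rules_file = rules_filename!("en"); // assumed: "en_rules.bin"
let tokenizer_file = tokenizer_filename!("en"); // assumed: "en_tokenizer.bin"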

Enums

Error

The error type for this crate.

Functions

rules_filename

Gets the canonical filename for the rules binary for a language code in ISO 639-1 (two-letter) format.

tokenizer_filename

Gets the canonical filename for the tokenizer binary for a language code in ISO 639-1 (two-letter) format.
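
These functions are the runtime counterparts of the macros above, useful when the language code is only known at runtime. A sketch, assuming they take the code as &str and return the filename as a String:

use nlprule::{rules_filename, tokenizer_filename};

let lang = "en"; // could come from configuration at runtime
let rules_file: String = rules_filename(lang);
let tokenizer_file: String = tokenizer_filename(lang);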