Crate nlprule
Rule-based grammatical error correction through parsing LanguageTool rules.
Overview
nlprule has the following core abstractions:
- A Tokenizer to split a text into tokens and analyze it by chunking, lemmatizing and part-of-speech tagging. It can also be used independently of the grammatical rules.
- A Rules structure containing a set of grammatical error correction rules.
Examples
Correct a text:
```rust
use nlprule::{Tokenizer, Rules};

let tokenizer = Tokenizer::new("path/to/en_tokenizer.bin")?;
let rules = Rules::new("path/to/en_rules.bin")?;

assert_eq!(
    rules.correct("She was not been here since Monday.", &tokenizer),
    String::from("She was not here since Monday.")
);
```
Get suggestions and correct a text:
```rust
use nlprule::{Tokenizer, Rules, types::Suggestion, rules::apply_suggestions};

let tokenizer = Tokenizer::new("path/to/en_tokenizer.bin")?;
let rules = Rules::new("path/to/en_rules.bin")?;

let text = "She was not been here since Monday.";

let suggestions = rules.suggest(text, &tokenizer);
assert_eq!(*suggestions[0].span().char(), 4usize..16);
assert_eq!(suggestions[0].replacements(), vec!["was not", "has not been"]);
assert_eq!(suggestions[0].source(), "GRAMMAR/WAS_BEEN/1");
assert_eq!(suggestions[0].message(), "Did you mean was not or has not been?");

let corrected = apply_suggestions(text, &suggestions);
assert_eq!(corrected, "She was not here since Monday.");
```
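In practice you typically iterate over all suggestions rather than indexing a specific one. A minimal sketch using only the accessors shown above, wrapped in a `main` returning the crate's `Error` so the `?` operator works (the binary paths are placeholders):

```rust
use nlprule::{Rules, Tokenizer};

fn main() -> Result<(), nlprule::Error> {
    let tokenizer = Tokenizer::new("path/to/en_tokenizer.bin")?;
    let rules = Rules::new("path/to/en_rules.bin")?;

    let text = "She was not been here since Monday.";
    for suggestion in rules.suggest(text, &tokenizer) {
        // Each suggestion carries the affected character span, the id of the
        // rule that produced it, a message and one or more replacements.
        println!(
            "{:?} [{}] {}: {:?}",
            suggestion.span().char(),
            suggestion.source(),
            suggestion.message(),
            suggestion.replacements(),
        );
    }
    Ok(())
}
```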
Tokenize & analyze a text:
```rust
use nlprule::Tokenizer;

let tokenizer = Tokenizer::new("path/to/en_tokenizer.bin")?;

let text = "A brief example is shown.";

// returns an iterator over sentences
let sentence = tokenizer
    .pipe(text)
    .next()
    .expect("`text` contains one sentence.");
println!("{:#?}", sentence);

assert_eq!(sentence.tokens()[1].word().text.as_ref(), "brief");
assert_eq!(sentence.tokens()[1].word().tags[0].pos.as_ref(), "JJ");
assert_eq!(sentence.tokens()[1].chunks(), vec!["I-NP-singular"]);
// some other information like char / byte span, lemmas etc. is also set!
```
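Since `pipe` yields one sentence at a time, the same accessors also work in a loop over a longer text. A brief sketch, again with a placeholder path, that prints every token with its part-of-speech tags:

```rust
use nlprule::Tokenizer;

fn main() -> Result<(), nlprule::Error> {
    let tokenizer = Tokenizer::new("path/to/en_tokenizer.bin")?;

    let text = "A brief example is shown. It even has two sentences.";
    for sentence in tokenizer.pipe(text) {
        for token in sentence.tokens() {
            // Collect the part-of-speech tags assigned to this token.
            let pos: Vec<&str> = token
                .word()
                .tags
                .iter()
                .map(|data| data.pos.as_ref())
                .collect();
            println!("{}: {:?}", token.word().text.as_ref(), pos);
        }
    }
    Ok(())
}
```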
Binaries are distributed with GitHub releases.
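Once the release assets for a language have been downloaded, the canonical names produced by the rules_filename / tokenizer_filename helpers documented below can be used to assemble the paths. A sketch, assuming the assets were saved to a local binaries/ directory (a made-up name for illustration) and that the constructors accept any path-like argument:

```rust
use nlprule::{rules_filename, tokenizer_filename, Rules, Tokenizer};

fn main() -> Result<(), nlprule::Error> {
    // `binaries/` is a hypothetical download directory; the macros expand to
    // the canonical filenames (e.g. "en_tokenizer.bin" for English).
    let tokenizer = Tokenizer::new(format!("binaries/{}", tokenizer_filename!("en")))?;
    let rules = Rules::new(format!("binaries/{}", rules_filename!("en")))?;

    assert_eq!(
        rules.correct("She was not been here since Monday.", &tokenizer),
        "She was not here since Monday."
    );
    Ok(())
}
```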
Re-exports
pub use rules::Rules;
pub use tokenizer::Tokenizer;
Modules
rule | Implementations related to single rules.
rules | Sets of grammatical error correction rules.
tokenizer | A tokenizer to split raw text into tokens. Tokens are assigned lemmas and part-of-speech tags by lookup from a Tagger, and chunks containing information about noun / verb and grammatical case by a statistical Chunker. Tokens are then disambiguated (i.e. information from the initial assignment is changed) in a rule-based way by DisambiguationRules.
types | Fundamental types used by this crate.
Macros
rules_filename | Gets the canonical filename for the rules binary for a language code in ISO 639-1 (two-letter) format.
tokenizer_filename | Gets the canonical filename for the tokenizer binary for a language code in ISO 639-1 (two-letter) format.
Enums
Error
Functions
rules_filename | Gets the canonical filename for the rules binary for a language code in ISO 639-1 (two-letter) format.
tokenizer_filename | Gets the canonical filename for the tokenizer binary for a language code in ISO 639-1 (two-letter) format.
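The function versions cover language codes that are only known at run time, where the compile-time macros cannot be used. A small sketch, assuming the functions take the code as a &str and return the filename as a String:

```rust
use nlprule::{rules_filename, tokenizer_filename};

// Hypothetical helper: build both canonical filenames for a language code
// chosen at run time.
fn binary_filenames(lang_code: &str) -> (String, String) {
    (tokenizer_filename(lang_code), rules_filename(lang_code))
}

fn main() {
    let (tokenizer_bin, rules_bin) = binary_filenames("en");
    // Expected output (assumption): en_tokenizer.bin / en_rules.bin
    println!("{} / {}", tokenizer_bin, rules_bin);
}
```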