This is a simple statistical truecasing library.
Truecasing is the restoration of the original letter case in text: for example, turning all-uppercase or all-lowercase text into text with proper casing (a capitalized first letter, capitalized names, etc.).
This crate attempts to solve this problem by gathering statistics from a set of training sentences, then using those statistics to truecase sentences with broken casing. It comes with a command-line utility that makes training a model easy.
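To make the idea of "gathering statistics" concrete, here is a minimal sketch of one possible word-frequency approach. It is not this crate's actual algorithm or API: the ToyTruecaser type and the capitalize_first helper are hypothetical names invented for illustration. The sketch assumes words are separated by whitespace, ignores punctuation, and skips the first word of each training sentence because its capital letter comes from its position rather than from the word itself.

```rust
use std::collections::HashMap;

/// Hypothetical toy truecaser (NOT the crate's implementation).
/// For every lowercased word it counts how often each cased form was seen.
#[derive(Default)]
struct ToyTruecaser {
    // lowercase word -> (observed cased form -> number of occurrences)
    counts: HashMap<String, HashMap<String, usize>>,
}

/// Uppercases the first character of a word and leaves the rest unchanged.
fn capitalize_first(word: &str) -> String {
    let mut chars = word.chars();
    match chars.next() {
        Some(first) => first.to_uppercase().chain(chars).collect(),
        None => String::new(),
    }
}

impl ToyTruecaser {
    /// Records the casing of every word except the first one, whose
    /// capitalization is caused by its position in the sentence.
    fn add_sentence(&mut self, sentence: &str) {
        for word in sentence.split_whitespace().skip(1) {
            *self
                .counts
                .entry(word.to_lowercase())
                .or_default()
                .entry(word.to_string())
                .or_insert(0) += 1;
        }
    }

    /// Replaces each known word with its most frequent cased form,
    /// lowercases unknown words, and capitalizes the first word.
    fn truecase(&self, text: &str) -> String {
        let mut words: Vec<String> = text
            .split_whitespace()
            .map(|word| {
                let lower = word.to_lowercase();
                self.counts
                    .get(&lower)
                    .and_then(|forms| {
                        forms
                            .iter()
                            // Highest count wins; ties go to the
                            // lexicographically smaller form so the
                            // result is deterministic.
                            .max_by(|a, b| a.1.cmp(b.1).then_with(|| b.0.cmp(a.0)))
                            .map(|(form, _)| form.clone())
                    })
                    .unwrap_or(lower)
            })
            .collect();
        if let Some(first) = words.first_mut() {
            let capitalized = capitalize_first(first.as_str());
            *first = capitalized;
        }
        words.join(" ")
    }
}
```

Fed the three training sentences from the quick usage example below, this toy version also restores "I", "Shakespeare" and "USSR" in the sample input. A real model has to handle punctuation, unseen words and context, which this sketch deliberately leaves out.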
§Quick usage example
use truecase::{Model, ModelTrainer};
// build a statistical model from sample sentences
let mut trainer = ModelTrainer::new();
trainer.add_sentence("There are very few writers as good as Shakespeare");
trainer.add_sentence("You and I will have to disagree about this");
trainer.add_sentence("She never came back from USSR");
let model = trainer.into_model();
// use gathered statistics to restore case in caseless text
let truecased_text = model.truecase("i don't think shakespeare was born in ussr");
assert_eq!(truecased_text, "I don't think Shakespeare was born in USSR");
§Building a model using the CLI tool
- Create a file containing training sentences. Each sentence must be on its own line and have proper casing. The bigger the training set, the better and more accurate the model will be.
- Use the truecase CLI tool to build a model. This may take some time, depending on the size of the training set. The following command will read training data from the training_sentences.txt file and write the model into the model.json file:
truecase train -i training_sentences.txt -o model.json
Run truecase train --help for more details.
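If the training sentences already live in a program rather than in a file, the first step above can be sketched with the standard library alone. The write_training_file function below is a hypothetical helper, not part of this crate. It assumes the only formatting requirement is the one stated above (one properly cased sentence per line), so it collapses internal line breaks and skips empty entries.

```rust
use std::fs::File;
use std::io::{self, BufWriter, Write};
use std::path::Path;

/// Hypothetical helper: writes one sentence per line to `path` and
/// returns the number of sentences written.
fn write_training_file<P: AsRef<Path>>(path: P, sentences: &[&str]) -> io::Result<usize> {
    let mut out = BufWriter::new(File::create(path)?);
    let mut written = 0;
    for sentence in sentences {
        // Collapse runs of whitespace (including newlines) into single
        // spaces so that every sentence stays on its own line.
        let line = sentence.split_whitespace().collect::<Vec<_>>().join(" ");
        if line.is_empty() {
            continue; // nothing to learn from an empty sentence
        }
        writeln!(out, "{}", line)?;
        written += 1;
    }
    out.flush()?;
    Ok(written)
}
```

The resulting file can then be passed to the truecase train command shown above via the -i option.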
Structs§
- Model - Truecasing model.
- ModelTrainer - Trainer for new truecasing models.