rust-bert
Rust-native Transformer-based models implementation. Port of Hugging Face's Transformers library, using the tch-rs crate and pre-processing from rust-tokenizers. Supports multithreaded tokenization and GPU inference. This repository exposes the model base architectures, task-specific heads (see below) and ready-to-use pipelines.
The following models are currently implemented:
| | DistilBERT | BERT | RoBERTa | GPT | GPT2 | BART | Electra |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| Masked LM | ✅ | ✅ | ✅ | | | | ✅ |
| Sequence classification | ✅ | ✅ | ✅ | | | | |
| Token classification | ✅ | ✅ | ✅ | | | | ✅ |
| Question answering | ✅ | ✅ | ✅ | | | | |
| Multiple choices | | ✅ | ✅ | | | | |
| Next token prediction | | | | ✅ | ✅ | ✅ | |
| Natural Language Generation | | | | ✅ | ✅ | ✅ | |
| Summarization | | | | | | ✅ | |
Ready-to-use pipelines
Based on Hugging Face's pipelines, ready-to-use end-to-end NLP pipelines are available as part of this crate. The following capabilities are currently available:
1. Question Answering
Extractive question answering from a given question and context. DistilBERT model fine-tuned on SQuAD (Stanford Question Answering Dataset).
```rust
use rust_bert::pipelines::question_answering::{QaInput, QuestionAnsweringModel};

let qa_model = QuestionAnsweringModel::new(Default::default())?;
let question = String::from("Where does Amy live ?");
let context = String::from("Amy lives in Amsterdam");
// arguments: inputs, number of answers per question, batch size
let answers = qa_model.predict(&[QaInput { question, context }], 1, 32);
```
Output:
```
[Answer { score: 0.9976814985275269, start: 13, end: 21, answer: "Amsterdam" }]
```
2. Summarization
Abstractive summarization using a pretrained BART model.
```rust
use rust_bert::pipelines::summarization::SummarizationModel;

let summarization_model = SummarizationModel::new(Default::default())?;
let input = ["..."]; // full news article on water vapour on K2-18b (elided)
let output = summarization_model.summarize(&input);
```
(example from: WikiNews)
Output:
"Scientists have found water vapour on K2-18b, a planet 110 light-years from Earth.
This is the first such discovery in a planet in its star's habitable zone.
The planet is not too hot and not too cold for liquid water to exist."
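The generation behaviour of the summarization pipeline (output length, beam width, etc.) can be adjusted through its configuration. Below is a minimal sketch; the field names are assumed from the crate's SummarizationConfig struct and should be checked against the version of rust-bert in use:

```rust
use rust_bert::pipelines::summarization::{SummarizationConfig, SummarizationModel};

// Sketch only: field names are assumptions to verify in the crate docs.
let summarization_config = SummarizationConfig {
    min_length: 30,  // minimum number of generated tokens
    max_length: 100, // maximum number of generated tokens
    num_beams: 4,    // beam search width
    ..Default::default()
};
let summarization_model = SummarizationModel::new(summarization_config)?;
```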
3. Natural Language Generation
Generate language based on a prompt. GPT2 and GPT are available as base models. Includes techniques such as beam search, top-k and nucleus sampling, temperature setting and repetition penalty. Supports batch generation of sentences from several prompts. Sequences will be left-padded with the model's padding token if present, or with the unknown token otherwise. This may impact the results; it is recommended to submit prompts of similar length for best results.
```rust
use rust_bert::pipelines::generation::{GPT2Generator, LanguageGenerator};

let model = GPT2Generator::new(Default::default())?;
let input_context_1 = "The dog";
let input_context_2 = "The cat was";
// arguments: prompts, optional attention mask
let output = model.generate(Some(vec![input_context_1, input_context_2]), None);
```
Example output:
```
[
    "The dog's owners, however, did not want to be named. According to the lawsuit, the animal's owner, a 29-year"
    "The dog has always been part of the family. \"He was always going to be my dog and he was always looking out for me"
    "The dog has been able to stay in the home for more than three months now. \"It's a very good dog. She's"
    "The cat was discovered earlier this month in the home of a relative of the deceased. The cat's owner, who wished to remain anonymous,"
    "The cat was pulled from the street by two-year-old Jazmine.\"I didn't know what to do,\" she said"
    "The cat was attacked by two stray dogs and was taken to a hospital. Two other cats were also injured in the attack and are being treated."
]
```
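The decoding techniques mentioned above are exposed through the generator's configuration. The sketch below assumes the GenerateConfig struct and its field names; verify them against the crate documentation for the version in use:

```rust
use rust_bert::pipelines::generation::{GPT2Generator, GenerateConfig};

// Sketch only: field names are assumptions to verify in the crate docs.
let generate_config = GenerateConfig {
    max_length: 30,          // maximum number of generated tokens
    do_sample: true,         // enable sampling
    num_beams: 5,            // beam search width
    temperature: 1.1,        // softmax temperature
    top_k: 50,               // top-k sampling cut-off
    top_p: 0.9,              // nucleus (top-p) sampling threshold
    repetition_penalty: 1.2, // penalty applied to repeated tokens
    num_return_sequences: 3, // number of sequences returned per prompt
    ..Default::default()
};
let model = GPT2Generator::new(generate_config)?;
```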
4. Sentiment analysis
Predicts the binary sentiment for a sentence. DistilBERT model fine-tuned on SST-2.
```rust
use rust_bert::pipelines::sentiment::SentimentModel;

let sentiment_classifier = SentimentModel::new(Default::default())?;
let input = ["..."]; // three movie reviews to classify (elided)
let output = sentiment_classifier.predict(&input);
```
(Example courtesy of IMDb)
Output:
```
[
    Sentiment { polarity: Positive, score: 0.9981985493795946 },
    Sentiment { polarity: Negative, score: 0.9927982091903687 },
    Sentiment { polarity: Positive, score: 0.9997248985164333 }
]
```
5. Named Entity Recognition
Extracts entities (Person, Location, Organization, Miscellaneous) from text. BERT cased large model fine-tuned on CoNLL03, contributed by the MDZ Digital Library team at the Bavarian State Library.
```rust
use rust_bert::pipelines::ner::NERModel;

let ner_model = NERModel::new(Default::default())?;
let input = ["My name is Amy. I live in Paris.", "Paris is a city in France."];
let output = ner_model.predict(&input);
```
Output:
```
[
    Entity { word: "Amy", score: 0.9986, label: "I-PER" }
    Entity { word: "Paris", score: 0.9985, label: "I-LOC" }
    Entity { word: "Paris", score: 0.9988, label: "I-LOC" }
    Entity { word: "France", score: 0.9993, label: "I-LOC" }
]
```
Base models
The base models and task-specific heads are also available for users looking to expose their own transformer-based models. Examples of how to prepare the data using the native Rust tokenizers library are available in ./examples for BERT, DistilBERT, RoBERTa, GPT, GPT2 and BART.
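As an illustration, a minimal sketch of loading a BERT base model with a masked-LM head outside of the ready-to-use pipelines is shown below. The file paths are placeholders, and the exact constructor signatures (in particular the tokenizer's) may differ between versions:

```rust
use rust_bert::bert::{BertConfig, BertForMaskedLM};
use rust_bert::Config;
use rust_tokenizers::BertTokenizer;
use tch::{nn, Device};

// Placeholder paths to locally stored configuration, vocabulary and weights
let device = Device::cuda_if_available();
let mut vs = nn::VarStore::new(device);
let tokenizer = BertTokenizer::from_file("path/to/vocab.txt", true);
let config = BertConfig::from_file("path/to/config.json");
let bert_model = BertForMaskedLM::new(&vs.root(), &config);
vs.load("path/to/model.ot")?;
```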
Note that when importing models from Pytorch, the parameter naming convention needs to be aligned with the Rust schema. Loading of the pre-trained weights will fail if any of the model parameter weights cannot be found in the weight files. To skip this quality check, the alternative method load_partial can be invoked from the variable store, as sketched below.
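A minimal sketch of this relaxed loading path, using the load_partial method of tch's variable store:

```rust
use tch::{nn, Device};

let mut vs = nn::VarStore::new(Device::Cpu);
// ... build the model graph in `vs` here ...
// load_partial skips the completeness check and returns the names of the
// variables that could not be found in the weight file
let missing_variables = vs.load_partial("path/to/model.ot")?;
```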
Setup
A number of pretrained model configurations, weights and vocabularies are downloaded directly from Hugging Face's model repository. The list of models available with Rust-compatible weights is available at https://huggingface.co/models?filter=rust.
The models will be downloaded to the location set in the RUSTBERT_CACHE environment variable if it is defined, otherwise to ~/.cache/.rustbert.
Additional models can be added if of interest; please raise an issue to request them.
In order to load custom weights into the library, these need to be converted to a binary format that can be read by Libtorch (the original .bin files are pickles and cannot be used directly). Several Python scripts to load Pytorch weights and convert them to the appropriate format are provided, and can be adapted based on the model's needs. The steps are listed below; a sketch of the final conversion step follows the list.
- Compile the package: `cargo build`
- Download the model files & perform the necessary conversions:
  - Set up a virtual environment and install dependencies
  - Run the conversion script: `python ./utils/download-dependencies_{MODEL_TO_DOWNLOAD}.py`. The dependencies will be downloaded to the user's home directory, under `~/rustbert/{}`. Alternatively, you may load local weight files and run the conversion directly.
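The final conversion step essentially re-serializes the exported arrays into an archive readable by Libtorch. A minimal sketch using tch is shown below, assuming the Python script exported the weights as a NumPy .npz archive (file names are placeholders):

```rust
use tch::Tensor;

fn main() -> anyhow::Result<()> {
    // read all named arrays from the .npz archive exported from Python
    let tensors = Tensor::read_npz("model.npz")?;
    // save them in a binary format readable by Libtorch / tch
    Tensor::save_multi(&tensors, "model.ot")?;
    Ok(())
}
```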
Acknowledgements
Thank you to Hugging Face for hosting a set of weights compatible with this Rust library. The ready-to-use pretrained models are listed at https://huggingface.co/models?filter=rust.