# Rust SBert [![Latest Version]][crates.io] [![Latest Doc]][docs.rs] ![Build Status]
[Latest Version]: https://img.shields.io/crates/v/sbert.svg
[crates.io]: https://crates.io/crates/sbert
[Latest Doc]: https://docs.rs/sbert/badge.svg
[docs.rs]: https://docs.rs/sbert
[Build Status]: https://travis-ci.com/cpcdoy/rust-sbert.svg?branch=master
Rust port of [sentence-transformers][] using [rust-bert][] and [tch-rs][].
Supports both [rust-tokenizers][] and Hugging Face's [tokenizers][].
## Supported models
- **distiluse-base-multilingual-cased**: Supported languages: Arabic, Chinese, Dutch, English, French, German, Italian, Korean, Polish, Portuguese, Russian, Spanish, Turkish. Performance on the extended STS2017: 80.1
- **DistilRoBERTa**-based classifiers
## Usage
### Example
The API is designed to be simple to use, letting you create high-quality multilingual sentence embeddings in just a few lines.
Load the SBert model by specifying the directory containing its weights:
```rust
use std::env;
use std::path::PathBuf;

let mut home: PathBuf = env::current_dir().unwrap();
home.push("path-to-model");
```
You can use different versions of the models that use different tokenizers:
```rust
// To use Hugging Face's tokenizers:
let sbert_model = SBertHF::new(home.to_str().unwrap());
// To use rust-tokenizers:
let sbert_model = SBertRT::new(home.to_str().unwrap());
```
Now, you can encode your sentences:
```rust
let texts = [
    "You can encode",
    "As many sentences",
    "As you want",
    "Enjoy ;)",
];
let batch_size = 64;
let output = sbert_model.forward(texts.to_vec(), batch_size).unwrap();
```
The `batch_size` argument can also be set to `None`, in which case the model uses its default value.
You can then use the `output` sentence embeddings in any downstream application.
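For example, a common use is comparing two of the returned embeddings with cosine similarity. The sketch below assumes `forward` returns one `Vec<f32>` embedding per input sentence; the `cosine_similarity` helper is illustrative and not part of this crate:

```rust
// Cosine similarity between two embedding vectors (illustrative helper,
// not provided by the sbert crate).
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (norm_a * norm_b)
}

fn main() {
    // Toy vectors standing in for real sentence embeddings:
    let e1 = vec![1.0_f32, 0.0, 1.0];
    let e2 = vec![1.0_f32, 0.0, 0.0];
    println!("{:.3}", cosine_similarity(&e1, &e2)); // → 0.707
}
```

A score close to 1.0 means the two sentences are semantically similar.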
### Convert models from Python to Rust
First, download a model provided by UKPLab (all models are available [here][models]):
```bash
mkdir -p models/distiluse-base-multilingual-cased
wget -P models https://public.ukp.informatik.tu-darmstadt.de/reimers/sentence-transformers/v0.2/distiluse-base-multilingual-cased.zip
unzip models/distiluse-base-multilingual-cased.zip -d models/distiluse-base-multilingual-cased
```
Then, convert the model to a suitable format (requires [pytorch][]):
```bash
python utils/prepare_distilbert.py models/distiluse-base-multilingual-cased
```
A dockerized environment is also available for running the conversion script:
```bash
docker build -t tch-converter -f utils/Dockerfile .
docker run \
-v $(pwd)/models/distiluse-base-multilingual-cased:/model \
tch-converter:latest \
python prepare_distilbert.py /model
```
Finally, set `"output_attentions": true` in `distiluse-base-multilingual-cased/0_distilbert/config.json`.
[sentence-transformers]: https://github.com/UKPLab/sentence-transformers
[rust-bert]: https://github.com/guillaume-be/rust-bert
[tch-rs]: https://github.com/LaurentMazare/tch-rs
[rust-tokenizers]: https://github.com/guillaume-be/rust-tokenizers
[tokenizers]: https://github.com/huggingface/tokenizers/tree/master/tokenizers
[models]: https://public.ukp.informatik.tu-darmstadt.de/reimers/sentence-transformers/v0.2/
[pytorch]: https://pytorch.org/get-started/locally