# Rust SBert [![Latest Version]][crates.io] [![Latest Doc]][docs.rs] ![Build Status]

[Latest Version]: https://img.shields.io/crates/v/sbert.svg
[crates.io]: https://crates.io/crates/sbert
[Latest Doc]: https://docs.rs/sbert/badge.svg
[docs.rs]: https://docs.rs/sbert
[Build Status]: https://travis-ci.com/cpcdoy/rust-sbert.svg?branch=master

Rust port of [sentence-transformers][] using [rust-bert][] and [tch-rs][].

Supports both [rust-tokenizers][] and Hugging Face's [tokenizers][].

## Supported models

- **distiluse-base-multilingual-cased**: Supported languages: Arabic, Chinese, Dutch, English, French, German, Italian, Korean, Polish, Portuguese, Russian, Spanish, Turkish. Performance on the extended STS2017: 80.1

- **DistilRoBERTa**-based classifiers

## Usage

### Example

The API is designed to be easy to use, letting you create quality multilingual sentence embeddings in a straightforward way.

Load the SBert model and its weights by specifying the model directory:

```Rust
use std::env;
use std::path::PathBuf;

// Build the path to the model directory
let mut home: PathBuf = env::current_dir().unwrap();
home.push("path-to-model");
```

Two variants of the model are available, depending on which tokenizer you want to use:

```Rust
// To use Hugging Face tokenizer
let sbert_model = SBertHF::new(home.to_str().unwrap());

// To use Rust-tokenizers
let sbert_model = SBertRT::new(home.to_str().unwrap());
```

Now, you can encode your sentences:

```Rust
let texts = ["You can encode",
             "As many sentences",
             "As you want",
             "Enjoy ;)"];

let batch_size = 64;

let output = sbert_model.forward(texts.to_vec(), batch_size).unwrap();
```

The `batch_size` parameter can be set to `None` to let the model use its default value.

You can then use the `output` sentence embeddings in any downstream application.
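For example, a common next step is to compare sentences by cosine similarity. Below is a minimal sketch of such a comparison over two embedding vectors; `cosine_similarity` is a hypothetical helper written here for illustration, not part of the sbert crate:

```Rust
// Cosine similarity between two embedding vectors.
// Hypothetical helper, not provided by the sbert crate.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b.iter()).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (norm_a * norm_b)
}
```

A score close to `1.0` indicates semantically similar sentences, while a score near `0.0` indicates unrelated ones.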

### Convert models from Python to Rust

First, download a model provided by UKP Lab (all models are listed [here][models]):

```Bash
mkdir -p models/distiluse-base-multilingual-cased

wget -P models https://public.ukp.informatik.tu-darmstadt.de/reimers/sentence-transformers/v0.2/distiluse-base-multilingual-cased.zip

unzip models/distiluse-base-multilingual-cased.zip -d models/distiluse-base-multilingual-cased
```

Then, convert the model into a format usable from Rust (requires [pytorch][]):

```Bash
python utils/prepare_distilbert.py models/distiluse-base-multilingual-cased
```

A dockerized environment is also available for running the conversion script:

```Bash
docker build -t tch-converter -f utils/Dockerfile .

docker run \
  -v "$(pwd)/models/distiluse-base-multilingual-cased:/model" \
  tch-converter:latest \
  python prepare_distilbert.py /model
```

Finally, set `"output_attentions": true` in `distiluse-base-multilingual-cased/0_distilbert/config.json`.
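If you prefer to make this edit from the command line, a small sketch using Python's standard `json` module (the config path is an assumption based on the directory layout produced above):

```Bash
python - <<'EOF'
import json, pathlib

# Path assumed from the layout produced by the conversion steps above
cfg_path = pathlib.Path("models/distiluse-base-multilingual-cased/0_distilbert/config.json")
cfg = json.loads(cfg_path.read_text())
cfg["output_attentions"] = True  # required by the Rust port
cfg_path.write_text(json.dumps(cfg, indent=2))
EOF
```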

[sentence-transformers]: https://github.com/UKPLab/sentence-transformers
[rust-bert]: https://github.com/guillaume-be/rust-bert
[tch-rs]: https://github.com/LaurentMazare/tch-rs
[rust-tokenizers]: https://github.com/guillaume-be/rust-tokenizers
[tokenizers]: https://github.com/huggingface/tokenizers/tree/master/tokenizers
[models]: https://public.ukp.informatik.tu-darmstadt.de/reimers/sentence-transformers/v0.2/
[pytorch]: https://pytorch.org/get-started/locally