§sif-embedding
This crate provides simple but powerful sentence embedding algorithms based on Smooth Inverse Frequency and Common Component Removal described in the following papers:
- Sanjeev Arora, Yingyu Liang, and Tengyu Ma, A Simple but Tough-to-Beat Baseline for Sentence Embeddings, ICLR 2017
- Kawin Ethayarajh, Unsupervised Random Walk Sentence Embeddings: A Strong but Simple Baseline, RepL4NLP 2018
This library will help you if
- DNN-based sentence embeddings are too slow for your application,
- you do not have the option of using GPUs, or
- you want baseline sentence embeddings for your development.
§Getting started
Given models of word embeddings and probabilities, sif-embedding can immediately compute sentence embeddings.
This crate does not limit which input models you can use; however, finalfusion and wordfreq are the easiest and most reasonable choices because these libraries can handle various pre-trained models and are plugged into this crate. See the instructions below for enabling these libraries in this crate.
`Sif` or `USif` implements the sentence embedding algorithms, and the `SentenceEmbedder` trait defines their common behavior.
The following code shows an example of computing sentence embeddings using finalfusion and wordfreq:
```rust
use std::error::Error;
use std::io::BufReader;

use finalfusion::compat::text::ReadText;
use finalfusion::embeddings::Embeddings;
use wordfreq::WordFreq;

use sif_embedding::{Sif, SentenceEmbedder};

fn main() -> Result<(), Box<dyn Error>> {
    // Loads word embeddings from a pretrained model.
    let word_embeddings_text = "las 0.0 1.0 2.0\nvegas -3.0 -4.0 -5.0\n";
    let mut reader = BufReader::new(word_embeddings_text.as_bytes());
    let word_embeddings = Embeddings::read_text(&mut reader)?;

    // Loads word probabilities from a pretrained model.
    let word_probs = WordFreq::new([("las", 0.4), ("vegas", 0.6)]);

    // Prepares input sentences.
    let sentences = ["las vegas", "mega vegas"];

    // Fits the model with input sentences.
    let model = Sif::new(&word_embeddings, &word_probs);
    let model = model.fit(&sentences)?;

    // Computes sentence embeddings in shape (n, m),
    // where n is the number of sentences and m is the number of dimensions.
    let sent_embeddings = model.embeddings(sentences)?;
    assert_eq!(sent_embeddings.shape(), &[2, 3]);

    Ok(())
}
```
`model.embeddings` requires O(n_sentences * n_dimensions) memory. If your input sentences are too large to fit in memory, you can compute sentence embeddings in batches:
```rust
for batch in sentences.chunks(batch_size) {
    let sent_embeddings = model.embeddings(batch)?;
    // ...
}
```
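To illustrate the core weighting scheme behind these models, the following self-contained sketch computes the SIF-weighted average of word vectors for one sentence, v_s = (1/|s|) * Σ a/(a + p(w)) * v_w, as described by Arora et al. (2017). It uses only the standard library; the function name `sif_average` and the weight parameter a = 1e-3 are illustrative choices, not crate APIs, and the full algorithm additionally removes the common component (the first principal component), which is omitted here.

```rust
use std::collections::HashMap;

/// Computes the SIF-weighted average of word vectors for one sentence:
/// v_s = (1/|s|) * sum over w in s of a / (a + p(w)) * v_w.
fn sif_average(
    sentence: &str,
    embeddings: &HashMap<&str, [f64; 3]>,
    probs: &HashMap<&str, f64>,
    a: f64,
) -> [f64; 3] {
    let mut sum = [0.0; 3];
    let mut n = 0.0;
    for word in sentence.split_whitespace() {
        if let (Some(v), Some(&p)) = (embeddings.get(word), probs.get(word)) {
            // Frequent words (large p) are down-weighted; rare words dominate.
            let w = a / (a + p);
            for (s, x) in sum.iter_mut().zip(v.iter()) {
                *s += w * x;
            }
            n += 1.0;
        }
    }
    if n > 0.0 {
        for s in sum.iter_mut() {
            *s /= n;
        }
    }
    sum
}

fn main() {
    let embeddings = HashMap::from([
        ("las", [0.0, 1.0, 2.0]),
        ("vegas", [-3.0, -4.0, -5.0]),
    ]);
    let probs = HashMap::from([("las", 0.4), ("vegas", 0.6)]);
    let v = sif_average("las vegas", &embeddings, &probs, 1e-3);
    // "vegas" is more frequent than "las", so it contributes slightly less.
    println!("{:?}", v);
}
```

Because a is tiny relative to the word probabilities, both weights are small but unequal, which is exactly how SIF suppresses frequent words before common component removal.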
§Feature specifications
This crate provides the following features:
- Backend features
  - `openblas-static` (or alias `openblas`)
  - `openblas-system`
  - `netlib-static` (or alias `netlib`)
  - `netlib-system`
  - `intel-mkl-static` (or alias `intel-mkl`)
  - `intel-mkl-system`
- Pre-trained model features
  - `finalfusion`
  - `wordfreq`

No feature is enabled by default. The descriptions of the features can be found below.
§Instructions: Backend specifications
This crate depends on ndarray-linalg and allows you to specify any backend supported by ndarray-linalg. You must always specify exactly one of the following features:
- `openblas-static` (or alias `openblas`)
- `openblas-system`
- `netlib-static` (or alias `netlib`)
- `netlib-system`
- `intel-mkl-static` (or alias `intel-mkl`)
- `intel-mkl-system`

The feature names correspond to those of ndarray-linalg (v0.16.0). Refer to its documentation for the specifications.
For example, if you want to use the OpenBLAS backend with static linking, specify the dependencies as follows:
```toml
# Cargo.toml
[features]
default = ["openblas-static"]
openblas-static = ["sif-embedding/openblas-static", "openblas-src/static"]

[dependencies.sif-embedding]
version = "0.6"

[dependencies.openblas-src]
version = "0.10.4"
optional = true
default-features = false
features = ["cblas"]
```
In addition, declare `openblas-src` at the root of your crate as follows:

```rust
// main.rs / lib.rs
#[cfg(feature = "openblas-static")]
extern crate openblas_src as _src;
```
§Instructions: Pre-trained models
The embedding techniques require two pre-trained models as input:
- Word embeddings
- Word probabilities
You can use arbitrary models through the `WordEmbeddings` and `WordProbabilities` traits.
This crate already implements these traits for the two external libraries:
- finalfusion (v0.17): Library to handle different types of word embeddings such as Glove and fastText.
- wordfreq (v0.2): Library to look up the frequencies of words in many languages.
To enable the features, specify the dependencies as follows:
```toml
# Cargo.toml
[dependencies.sif-embedding]
version = "0.6"
features = ["finalfusion", "wordfreq"]
```
A tutorial to learn how to use external pre-trained models in finalfusion and wordfreq can be found here.
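If neither library fits your setting, you can back the traits with your own model. The sketch below is a hypothetical, simplified stand-in: the trait names `WordEmbeddings` and `WordProbabilities` come from this crate, but the method names, signatures, and the `ToyModel` type are illustrative assumptions for this example, not the crate's actual definitions (see the trait docs for those).

```rust
use std::collections::HashMap;

// Hypothetical, simplified stand-ins for the crate's traits; the real
// signatures in sif-embedding may differ.
trait WordEmbeddings {
    fn embedding(&self, word: &str) -> Option<Vec<f32>>;
}

trait WordProbabilities {
    fn probability(&self, word: &str) -> f32;
}

// A toy in-memory model backing both traits.
struct ToyModel {
    vectors: HashMap<String, Vec<f32>>,
    probs: HashMap<String, f32>,
}

impl WordEmbeddings for ToyModel {
    fn embedding(&self, word: &str) -> Option<Vec<f32>> {
        self.vectors.get(word).cloned()
    }
}

impl WordProbabilities for ToyModel {
    // Unknown words get probability 0.0 in this sketch.
    fn probability(&self, word: &str) -> f32 {
        self.probs.get(word).copied().unwrap_or(0.0)
    }
}

fn main() {
    let model = ToyModel {
        vectors: HashMap::from([("las".to_string(), vec![0.0, 1.0, 2.0])]),
        probs: HashMap::from([("las".to_string(), 0.4)]),
    };
    assert_eq!(model.probability("las"), 0.4);
    assert!(model.embedding("vegas").is_none());
    println!("ok");
}
```

The point of the trait-based design is that the embedding algorithms only need lookup operations, so any store (in-memory map, mmap'd file, database) can serve as a backend.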
Modules§
- `finalfusion` - WordEmbeddings implementations for `finalfusion::embeddings::Embeddings`. This module is available if the `finalfusion` feature is enabled.
- `sif` - SIF: Smooth Inverse Frequency + Common Component Removal.
- `usif` - uSIF: Unsupervised Smooth Inverse Frequency + Piecewise Common Component Removal.
- `util` - Utilities.
- `wordfreq` - WordProbabilities implementations for `wordfreq::WordFreq`. This module is available if the `wordfreq` feature is enabled.
Constants§
- `DEFAULT_N_SAMPLES_TO_FIT` - Default number of samples to fit.
- `DEFAULT_SEPARATOR` - Default separator for splitting sentences into words.
Traits§
- `SentenceEmbedder` - Common behavior of our models for sentence embeddings.
- `WordEmbeddings` - Word embeddings.
- `WordProbabilities` - Word probabilities.
Type Aliases§
- `Float` - Common type of floating-point numbers.