Crate sif_embedding
sif-embedding
This crate provides simple but powerful sentence embedding algorithms based on Smooth Inverse Frequency and Common Component Removal described in the following papers:
- Sanjeev Arora, Yingyu Liang, and Tengyu Ma, A Simple but Tough-to-Beat Baseline for Sentence Embeddings, ICLR 2017
- Kawin Ethayarajh, Unsupervised Random Walk Sentence Embeddings: A Strong but Simple Baseline, RepL4NLP 2018
This library will help you if
- NN-based sentence embeddings are too slow for your application, or
- you do not have the option of using GPUs.
Getting started
Given models of word embeddings and probabilities, sif-embedding can immediately compute sentence embeddings.
This crate does not impose any restrictions on the input models; however, using finalfusion and wordfreq is the easiest and most reasonable approach because these libraries can handle various pre-trained models and are already plugged into this crate. See the instructions below to install these libraries alongside this crate.
Sif or USif implements the sentence embedding algorithms, and SentenceEmbedder defines the common behavior of the embedding models.
The following code shows an example of computing sentence embeddings using finalfusion and wordfreq.
```rust
use std::io::BufReader;

use finalfusion::compat::text::ReadText;
use finalfusion::embeddings::Embeddings;
use wordfreq::WordFreq;

use sif_embedding::{Sif, SentenceEmbedder};

// Loads word embeddings from a pretrained model.
let word_embeddings_text = "las 0.0 1.0 2.0\nvegas -3.0 -4.0 -5.0\n";
let mut reader = BufReader::new(word_embeddings_text.as_bytes());
let word_embeddings = Embeddings::read_text(&mut reader)?;

// Loads word probabilities from a pretrained model.
let word_probs = WordFreq::new([("las", 0.4), ("vegas", 0.6)]);

// Computes sentence embeddings in shape (n, m),
// where n is the number of sentences and m is the number of dimensions.
let model = Sif::new(&word_embeddings, &word_probs);
let (sent_embeddings, model) = model.fit_embeddings(&["las vegas", "mega vegas"])?;
assert_eq!(sent_embeddings.shape(), &[2, 3]);

// Once fitted, the parameters can be used to compute sentence embeddings.
let sent_embeddings = model.embeddings(["vegas pro"])?;
assert_eq!(sent_embeddings.shape(), &[1, 3]);
```
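A common next step is to compare the resulting embeddings, for example with cosine similarity. The helper below is not part of this crate; it is a stdlib-only sketch over plain `f32` slices (converting an `ndarray` row to a slice is omitted for brevity), and the embedding values are hypothetical.

```rust
/// Computes the cosine similarity between two equal-length vectors.
/// Returns 0.0 if either vector has zero norm.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

fn main() {
    // Hypothetical 3-dimensional sentence embeddings.
    let e1 = [1.0_f32, 0.0, 0.0];
    let e2 = [1.0_f32, 0.0, 0.0];
    let e3 = [0.0_f32, 1.0, 0.0];
    // Identical directions give similarity 1.0; orthogonal ones give 0.0.
    assert!((cosine_similarity(&e1, &e2) - 1.0).abs() < 1e-6);
    assert!(cosine_similarity(&e1, &e3).abs() < 1e-6);
}
```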
Instructions: Backend specifications
This crate depends on ndarray-linalg and allows you to specify any backend supported by ndarray-linalg. You must always specify one of the following features:

- `openblas-static` (or alias `openblas`)
- `openblas-system`
- `netlib-static` (or alias `netlib`)
- `netlib-system`
- `intel-mkl-static` (or alias `intel-mkl`)
- `intel-mkl-system`

The feature names correspond to those of ndarray-linalg (v0.16.0); refer to its documentation for the specifications.
For example, if you want to use the OpenBLAS backend with static linking, specify the dependencies as follows:
```toml
# Cargo.toml

[features]
default = ["openblas-static"]
openblas-static = ["sif-embedding/openblas-static", "openblas-src/static"]

[dependencies.sif-embedding]
version = "0.5"

[dependencies.openblas-src]
version = "0.10.4"
optional = true
default-features = false
features = ["cblas"]
```
In addition, declare `openblas-src` at the root of your crate as follows:
```rust
// main.rs / lib.rs
#[cfg(feature = "openblas-static")]
extern crate openblas_src as _src;
```
Instructions: Pre-trained models
The embedding techniques require two pre-trained models as input:
- Word embeddings
- Word probabilities
You can use arbitrary models through the WordEmbeddings and WordProbabilities traits.
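To illustrate what plugging in your own probability model involves, here is a self-contained sketch of a toy unigram model backed by raw corpus counts. The trait below is a local stand-in defined only for this example; consult the crate's WordProbabilities documentation for the actual required methods.

```rust
use std::collections::HashMap;

// A stand-in trait defined locally for illustration only; the real
// `sif_embedding::WordProbabilities` trait may require different methods.
trait WordProbabilitiesLike {
    /// Returns the unigram probability of `word` (0.0 if unknown).
    fn probability(&self, word: &str) -> f32;
}

/// A toy probability model backed by raw corpus counts.
struct CountModel {
    counts: HashMap<String, u64>,
    total: u64,
}

impl CountModel {
    fn new<I: IntoIterator<Item = (String, u64)>>(counts: I) -> Self {
        let counts: HashMap<String, u64> = counts.into_iter().collect();
        let total = counts.values().sum();
        Self { counts, total }
    }
}

impl WordProbabilitiesLike for CountModel {
    fn probability(&self, word: &str) -> f32 {
        if self.total == 0 {
            return 0.0;
        }
        *self.counts.get(word).unwrap_or(&0) as f32 / self.total as f32
    }
}

fn main() {
    let model = CountModel::new([("las".to_string(), 2), ("vegas".to_string(), 3)]);
    // 3 of 5 total counts belong to "vegas"; unseen words get probability 0.0.
    assert!((model.probability("vegas") - 0.6).abs() < 1e-6);
    assert_eq!(model.probability("unknown"), 0.0);
}
```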
This crate already implements these traits for the two external libraries:
- finalfusion (v0.17): Library to handle various types of word embeddings, such as GloVe and fastText.
- wordfreq (v0.2): Library to look up the frequencies of words in many languages.
To enable the features, specify the dependencies as follows:
```toml
# Cargo.toml

[dependencies.sif-embedding]
version = "0.5"
features = ["finalfusion", "wordfreq"]
```
A tutorial to learn how to use external pre-trained models in finalfusion and wordfreq can be found here.
Re-exports
Modules
- WordEmbeddings implementations for finalfusion::embeddings::Embeddings.
- SIF: Smooth Inverse Frequency + Common Component Removal.
- uSIF: Unsupervised Smooth Inverse Frequency + Piecewise Common Component Removal.
- Utilities.
- WordProbabilities implementations for wordfreq::WordFreq.
Constants
- Default separator for splitting sentences into words.
Traits
- Common behavior of our models for sentence embeddings.
- Word embeddings.
- Word probabilities.
Type Definitions
- Common type of floating-point numbers.