Crate sif_embedding


sif-embedding

This crate provides simple but powerful sentence embedding algorithms based on Smooth Inverse Frequency and Common Component Removal, described in the following papers:

  • Sanjeev Arora, Yingyu Liang, and Tengyu Ma. A Simple but Tough-to-Beat Baseline for Sentence Embeddings. ICLR 2017.
  • Kawin Ethayarajh. Unsupervised Random Walk Sentence Embeddings: A Strong but Simple Baseline. RepL4NLP 2018.

This library will help you if

  • NN-based sentence embeddings are too slow for your application, or
  • you do not have the option of using GPUs.
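To give an intuition for why this is fast, here is a minimal, dependency-free sketch of the SIF weighting step: each word vector is scaled by a / (a + p(w)), where p(w) is the word's unigram probability and a is a small smoothing parameter (commonly around 1e-3), and the scaled vectors are averaged. This is an illustration of the technique only, not this crate's implementation; the vectors and probabilities are made up, and the common component removal step is omitted here.

```rust
/// Computes a SIF-weighted average of word vectors (illustration only).
/// `words` holds (word, embedding, unigram probability) triples; it is
/// assumed to be non-empty. `a` is the smoothing parameter.
fn sif_embedding(words: &[(&str, [f64; 3], f64)], a: f64) -> [f64; 3] {
    let mut sum = [0.0; 3];
    for (_, vec, p) in words {
        // Rare words (small p) get weights close to 1; frequent words
        // are down-weighted, which is the core idea of SIF.
        let w = a / (a + p);
        for d in 0..3 {
            sum[d] += w * vec[d];
        }
    }
    // Average over the words in the sentence.
    let n = words.len() as f64;
    for d in 0..3 {
        sum[d] /= n;
    }
    sum
}

fn main() {
    let sent = [
        ("las", [0.0, 1.0, 2.0], 0.4),
        ("vegas", [-3.0, -4.0, -5.0], 0.6),
    ];
    let emb = sif_embedding(&sent, 1e-3);
    println!("sif-weighted embedding: {:?}", emb);
}
```

The whole computation is a weighted average plus one small linear-algebra step, which is why it runs quickly on a CPU.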

Getting started

Given models of word embeddings and probabilities, sif-embedding can immediately compute sentence embeddings.

This crate does not restrict which input models you can use; however, finalfusion and wordfreq are the easiest and most reasonable choices because these libraries can handle various pre-trained models and are already plugged into this crate. See the instructions below for enabling these libraries in this crate.

Sif and USif implement the sentence embedding algorithms, and the SentenceEmbedder trait defines their common interface. The following code shows an example of computing sentence embeddings using finalfusion and wordfreq.

use std::io::BufReader;

use finalfusion::compat::text::ReadText;
use finalfusion::embeddings::Embeddings;
use wordfreq::WordFreq;

use sif_embedding::{Sif, SentenceEmbedder};

// Loads word embeddings from a pretrained model.
let word_embeddings_text = "las 0.0 1.0 2.0\nvegas -3.0 -4.0 -5.0\n";
let mut reader = BufReader::new(word_embeddings_text.as_bytes());
let word_embeddings = Embeddings::read_text(&mut reader)?;

// Loads word probabilities from a pretrained model.
let word_probs = WordFreq::new([("las", 0.4), ("vegas", 0.6)]);

// Computes sentence embeddings in shape (n, m),
// where n is the number of sentences and m is the number of dimensions.
let model = Sif::new(&word_embeddings, &word_probs);
let (sent_embeddings, model) = model.fit_embeddings(&["las vegas", "mega vegas"])?;
assert_eq!(sent_embeddings.shape(), &[2, 3]);

// Once fitted, the parameters can be used to compute sentence embeddings.
let sent_embeddings = model.embeddings(["vegas pro"])?;
assert_eq!(sent_embeddings.shape(), &[1, 3]);
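For intuition, the common component removal step applied after the weighted averaging can also be sketched without dependencies: estimate the dominant direction of the sentence-embedding matrix and subtract each embedding's projection onto it. The power iteration below is a stand-in for a proper SVD and is not this crate's implementation (the crate delegates its linear algebra to an ndarray-linalg backend; see the backend instructions below).

```rust
/// Returns the dot product of two equal-length vectors.
fn dot(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Removes the common component from sentence embeddings (illustration only):
/// estimates the top singular direction `u` of the embedding matrix X by
/// power iteration on X^T X, then subtracts each row's projection onto `u`.
/// Assumes `embs` is non-empty and all rows have the same length.
fn remove_common_component(embs: &mut Vec<Vec<f64>>) {
    let dim = embs[0].len();
    let mut u = vec![1.0; dim];
    for _ in 0..100 {
        // One power-iteration step: v = X^T (X u), then normalize.
        let xu: Vec<f64> = embs.iter().map(|e| dot(e, &u)).collect();
        let mut v = vec![0.0; dim];
        for (e, &c) in embs.iter().zip(&xu) {
            for d in 0..dim {
                v[d] += c * e[d];
            }
        }
        let norm = dot(&v, &v).sqrt();
        if norm == 0.0 {
            return; // all embeddings are zero; nothing to remove
        }
        for d in 0..dim {
            v[d] /= norm;
        }
        u = v;
    }
    // Subtract each embedding's projection onto the common direction.
    for e in embs.iter_mut() {
        let c = dot(e, &u);
        for d in 0..dim {
            e[d] -= c * u[d];
        }
    }
}

fn main() {
    // Both toy embeddings lie on the x-axis, so removing the common
    // component collapses them to the zero vector.
    let mut embs = vec![vec![3.0, 0.0], vec![4.0, 0.0]];
    remove_common_component(&mut embs);
    println!("{:?}", embs);
}
```

Subtracting this shared direction removes the component common to all sentences (often dominated by frequent, syntax-heavy words), which sharpens the differences between sentence embeddings.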

Instructions: Backend specifications

This crate depends on ndarray-linalg and allows you to specify any backend supported by ndarray-linalg. You must always enable exactly one of the following features:

  • openblas-static (or alias openblas)
  • openblas-system
  • netlib-static (or alias netlib)
  • netlib-system
  • intel-mkl-static (or alias intel-mkl)
  • intel-mkl-system

The feature names correspond to those of ndarray-linalg (v0.16.0). Refer to the ndarray-linalg documentation for the specifications.

For example, if you want to use the OpenBLAS backend with static linking, specify the dependencies as follows:

# Cargo.toml

[features]
default = ["openblas-static"]
openblas-static = ["sif-embedding/openblas-static", "openblas-src/static"]

[dependencies.sif-embedding]
version = "0.5"

[dependencies.openblas-src]
version = "0.10.4"
optional = true
default-features = false
features = ["cblas"]

In addition, declare openblas-src at the root of your crate as follows:

// main.rs / lib.rs

#[cfg(feature = "openblas-static")]
extern crate openblas_src as _src;

Instructions: Pre-trained models

The embedding techniques require two pre-trained models as input:

  • Word embeddings
  • Word probabilities

You can use arbitrary models through the WordEmbeddings and WordProbabilities traits.

This crate already implements these traits for two external libraries:

  • finalfusion (v0.17): Library to handle different types of word embeddings such as GloVe and fastText.
  • wordfreq (v0.2): Library to look up the frequencies of words in many languages.

To enable the features, specify the dependencies as follows:

# Cargo.toml

[dependencies.sif-embedding]
version = "0.5"
features = ["finalfusion", "wordfreq"]

A tutorial on using external pre-trained models with finalfusion and wordfreq can be found here.

Re-exports

  • pub use sif::Sif;
  • pub use usif::USif;

Modules

Constants

Traits

Type Definitions

  • Common type of floating-point numbers.