Crate sif_embedding


§sif-embedding

This crate provides simple but powerful sentence embedding algorithms based on Smooth Inverse Frequency (SIF) and Common Component Removal, described in the following papers:

  • Arora, Liang, and Ma. A Simple but Tough-to-Beat Baseline for Sentence Embeddings. ICLR 2017.
  • Ethayarajh. Unsupervised Random Walk Sentence Embeddings: A Strong but Simple Baseline. RepL4NLP 2018.

This library will help you if:

  • DNN-based sentence embeddings are too slow for your application,
  • you do not have the option of using GPUs, or
  • you want baseline sentence embeddings for your development.
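
As background, SIF computes a sentence vector as a probability-weighted average of word vectors, weighting each word w by a / (a + p(w)) for a small smoothing constant a, and then removes the common component shared across sentences. The following self-contained sketch shows only the weighting step; the helper names, the constant, and the toy probabilities are illustrative and do not come from this crate's API (common component removal is omitted for brevity):

```rust
/// SIF weight for each word: a / (a + p(w)), where `a` is a small
/// smoothing constant (the SIF paper suggests values around 1e-3).
fn sif_weights(probs: &[f32], a: f32) -> Vec<f32> {
    probs.iter().map(|&p| a / (a + p)).collect()
}

/// Weighted average of word vectors (one vector per word).
fn weighted_average(vectors: &[Vec<f32>], weights: &[f32]) -> Vec<f32> {
    let dim = vectors[0].len();
    let mut sum = vec![0.0f32; dim];
    for (v, &w) in vectors.iter().zip(weights) {
        for (s, &x) in sum.iter_mut().zip(v) {
            *s += w * x;
        }
    }
    let n = vectors.len() as f32;
    sum.into_iter().map(|s| s / n).collect()
}

fn main() {
    // Toy probabilities: p("las") = 0.4, p("vegas") = 0.6.
    let weights = sif_weights(&[0.4, 0.6], 1e-3);
    // Rarer words receive larger weights; frequent words are down-weighted.
    assert!(weights[0] > weights[1]);

    let vectors = vec![vec![0.0, 1.0, 2.0], vec![-3.0, -4.0, -5.0]];
    let sentence = weighted_average(&vectors, &weights);
    assert_eq!(sentence.len(), 3);
}
```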

§Getting started

Given models of word embeddings and probabilities, sif-embedding can immediately compute sentence embeddings.

This crate places no restrictions on where the input models come from; however, using finalfusion and wordfreq is the easiest and most reasonable way because these libraries can handle various pre-trained models and are plugged into this crate. See the instructions below for installing these libraries.

Sif and USif implement the sentence embedding algorithms, and SentenceEmbedder defines their common behavior. The following code shows an example of computing sentence embeddings using finalfusion and wordfreq.

use std::io::BufReader;

use finalfusion::compat::text::ReadText;
use finalfusion::embeddings::Embeddings;
use wordfreq::WordFreq;

use sif_embedding::{Sif, SentenceEmbedder};

// Loads word embeddings from a pretrained model.
let word_embeddings_text = "las 0.0 1.0 2.0\nvegas -3.0 -4.0 -5.0\n";
let mut reader = BufReader::new(word_embeddings_text.as_bytes());
let word_embeddings = Embeddings::read_text(&mut reader)?;

// Loads word probabilities from a pretrained model.
let word_probs = WordFreq::new([("las", 0.4), ("vegas", 0.6)]);

// Prepares input sentences.
let sentences = ["las vegas", "mega vegas"];

// Fits the model with input sentences.
let model = Sif::new(&word_embeddings, &word_probs);
let model = model.fit(&sentences)?;

// Computes sentence embeddings in shape (n, m),
// where n is the number of sentences and m is the number of dimensions.
let sent_embeddings = model.embeddings(sentences)?;
assert_eq!(sent_embeddings.shape(), &[2, 3]);

model.embeddings requires O(n_sentences * n_dimensions) memory. If your input sentences are too large to fit in memory, you can compute sentence embeddings in batches.

// `batch_size` is the number of sentences per batch, chosen to fit your memory budget.
for batch in sentences.chunks(batch_size) {
    let sent_embeddings = model.embeddings(batch)?;
    // ...
}
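
The batching above relies on the standard slice method `chunks`, which yields consecutive non-overlapping slices of at most `batch_size` elements, with a shorter final batch. A self-contained illustration (the sentences and batch size are arbitrary):

```rust
fn main() {
    let sentences = ["a", "b", "c", "d", "e"];
    let batch_size = 2;
    // Five sentences split into batches of at most two: [a, b], [c, d], [e].
    let batches: Vec<&[&str]> = sentences.chunks(batch_size).collect();
    assert_eq!(batches.len(), 3);
    assert_eq!(batches[0], ["a", "b"]);
    assert_eq!(batches[2], ["e"]);
}
```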

§Feature specifications

This crate provides the following features:

  • Backend features
    • openblas-static (or alias openblas)
    • openblas-system
    • netlib-static (or alias netlib)
    • netlib-system
    • intel-mkl-static (or alias intel-mkl)
    • intel-mkl-system
  • Pre-trained model features
    • finalfusion
    • wordfreq

No feature is enabled by default. The descriptions of the features can be found below.

§Instructions: Backend specifications

This crate depends on ndarray-linalg and allows you to specify any backend supported by ndarray-linalg. You must always enable exactly one of the following features:

  • openblas-static (or alias openblas)
  • openblas-system
  • netlib-static (or alias netlib)
  • netlib-system
  • intel-mkl-static (or alias intel-mkl)
  • intel-mkl-system

The feature names correspond to those of ndarray-linalg (v0.16.0). Refer to the documentation for the specifications.

For example, if you want to use the OpenBLAS backend with static linking, specify the dependencies as follows:

# Cargo.toml

[features]
default = ["openblas-static"]
openblas-static = ["sif-embedding/openblas-static", "openblas-src/static"]

[dependencies.sif-embedding]
version = "0.6"

[dependencies.openblas-src]
version = "0.10.4"
optional = true
default-features = false
features = ["cblas"]

In addition, declare openblas-src at the root of your crate as follows:

// main.rs / lib.rs

#[cfg(feature = "openblas-static")]
extern crate openblas_src as _src;

§Instructions: Pre-trained models

The embedding techniques require two pre-trained models as input:

  • Word embeddings
  • Word probabilities

You can use arbitrary models through the WordEmbeddings and WordProbabilities traits.
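
To illustrate the shape of such an adapter, the following self-contained sketch implements stand-in traits over plain `HashMap`s. The trait and type names here (`WordEmbeddingsLike`, `WordProbabilitiesLike`, `MapEmbeddings`, `MapProbs`) and the method signatures are hypothetical; consult this crate's documentation of WordEmbeddings and WordProbabilities for the actual signatures:

```rust
use std::collections::HashMap;

// Stand-in traits mirroring the *spirit* of this crate's
// `WordEmbeddings` and `WordProbabilities`; the real signatures may differ.
trait WordEmbeddingsLike {
    /// Returns the vector for `word`, or `None` for out-of-vocabulary words.
    fn embedding(&self, word: &str) -> Option<&[f32]>;
    fn embedding_size(&self) -> usize;
}

trait WordProbabilitiesLike {
    /// Returns the probability of `word`, or 0.0 for unknown words.
    fn probability(&self, word: &str) -> f32;
}

struct MapEmbeddings {
    dim: usize,
    table: HashMap<String, Vec<f32>>,
}

impl WordEmbeddingsLike for MapEmbeddings {
    fn embedding(&self, word: &str) -> Option<&[f32]> {
        self.table.get(word).map(|v| v.as_slice())
    }
    fn embedding_size(&self) -> usize {
        self.dim
    }
}

struct MapProbs(HashMap<String, f32>);

impl WordProbabilitiesLike for MapProbs {
    fn probability(&self, word: &str) -> f32 {
        *self.0.get(word).unwrap_or(&0.0)
    }
}

fn main() {
    let emb = MapEmbeddings {
        dim: 3,
        table: HashMap::from([("las".to_string(), vec![0.0, 1.0, 2.0])]),
    };
    let probs = MapProbs(HashMap::from([("las".to_string(), 0.4)]));
    assert_eq!(emb.embedding("las"), Some(&[0.0, 1.0, 2.0][..]));
    assert_eq!(emb.embedding_size(), 3);
    assert!(emb.embedding("vegas").is_none());
    assert_eq!(probs.probability("vegas"), 0.0);
}
```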

This crate already implements these traits for the two external libraries:

  • finalfusion (v0.17): Library to handle different types of word embeddings such as Glove and fastText.
  • wordfreq (v0.2): Library to look up the frequencies of words in many languages.

To enable the features, specify the dependencies as follows:

# Cargo.toml

[dependencies.sif-embedding]
version = "0.6"
features = ["finalfusion", "wordfreq"]

A tutorial on how to use external pre-trained models with finalfusion and wordfreq is provided with this crate.

Re-exports§

pub use sif::Sif;
pub use usif::USif;

Modules§

finalfusion
WordEmbeddings implementations for finalfusion::embeddings::Embeddings. This module is available if the finalfusion feature is enabled.
sif
SIF: Smooth Inverse Frequency + Common Component Removal.
usif
uSIF: Unsupervised Smooth Inverse Frequency + Piecewise Common Component Removal.
util
Utilities.
wordfreq
WordProbabilities implementations for wordfreq::WordFreq. This module is available if the wordfreq feature is enabled.

Constants§

DEFAULT_N_SAMPLES_TO_FIT
Default number of samples to fit.
DEFAULT_SEPARATOR
Default separator for splitting sentences into words.

Traits§

SentenceEmbedder
Common behavior of our models for sentence embeddings.
WordEmbeddings
Word embeddings.
WordProbabilities
Word probabilities.

Type Aliases§

Float
Common type of floating numbers.