Struct sif_embedding::sif::Sif
pub struct Sif<'w, 'u, V, T> { /* private fields */ }
An implementation of Smooth Inverse Frequency (SIF), a simple but powerful sentence embedding technique described in the paper:
Sanjeev Arora, Yingyu Liang, and Tengyu Ma, A Simple but Tough-to-Beat Baseline for Sentence Embeddings, ICLR 2017.
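For intuition, the weighting step from the paper can be sketched with the standard library alone. This is a minimal illustration, not this crate's internals: the names `sif_weight` and `weighted_average`, the smoothing parameter `a`, and the choice to skip unknown words and average over known words only are all assumptions of the sketch. The paper weights each word by `a / (a + p(w))`, where `p(w)` is its unigram probability, so frequent words contribute less.

```rust
use std::collections::HashMap;

/// SIF weight as given in the paper: a / (a + p(w)).
/// `a` is a small smoothing parameter; `prob` is the unigram probability p(w).
fn sif_weight(a: f64, prob: f64) -> f64 {
    a / (a + prob)
}

/// Weighted average of the word vectors in `sentence` (hypothetical helper).
/// Assumption of this sketch: words missing from either map are skipped,
/// and the sum is divided by the number of words that were found.
fn weighted_average(
    sentence: &str,
    vectors: &HashMap<&str, Vec<f64>>,
    probs: &HashMap<&str, f64>,
    a: f64,
    dim: usize,
) -> Vec<f64> {
    let mut sum = vec![0.0; dim];
    let mut n_words = 0usize;
    for word in sentence.split_ascii_whitespace() {
        if let (Some(vec), Some(&p)) = (vectors.get(word), probs.get(word)) {
            let w = sif_weight(a, p);
            // Accumulate the weighted word vector.
            for (s, x) in sum.iter_mut().zip(vec) {
                *s += w * x;
            }
            n_words += 1;
        }
    }
    // Leave the zero vector untouched when no word was found.
    if n_words > 0 {
        for s in sum.iter_mut() {
            *s /= n_words as f64;
        }
    }
    sum
}
```

The second step of the paper, removing the common component from these averaged vectors, is not covered by this sketch.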
Examples
use std::io::BufReader;
use finalfusion::compat::text::ReadText;
use finalfusion::embeddings::Embeddings;
use sif_embedding::{Sif, UnigramLM};
// Load word embeddings from a pretrained model.
let word_model = "las 0.0 1.0 2.0\nvegas -3.0 -4.0 -5.0\n";
let mut reader = BufReader::new(word_model.as_bytes());
let word_embeddings = Embeddings::read_text(&mut reader).unwrap();
// Create a unigram language model.
let word_weights = [("las", 10.), ("vegas", 20.)];
let unigram_lm = UnigramLM::new(word_weights);
// Compute sentence embeddings.
let sif = Sif::new(&word_embeddings, &unigram_lm);
let sent_embeddings = sif.embeddings(["go to las vegas", "mega vegas"]);
assert_eq!(sent_embeddings.shape(), &[2, 3]);
Implementations
impl<'w, 'u, V, T> Sif<'w, 'u, V, T>
where
    V: Vocab,
    T: Storage,
pub const fn new(
    word_embeddings: &'w Embeddings<V, T>,
    unigram_lm: &'u UnigramLM
) -> Self
Creates a new instance.
pub const fn separator(self, separator: char) -> Self
Sets a separator for sentence segmentation (default: ASCII whitespace).
pub fn clear_common_component(self) -> Self
Clears the common component retained by Self::embeddings_mut().
pub const fn is_common_component_retained(&self) -> bool
Checks if the common component is retained by Self::embeddings_mut().
pub fn embeddings<I, S>(&self, sentences: I) -> Array2<Float>
where
    I: IntoIterator<Item = S>,
    S: AsRef<str>,
Computes embeddings for input sentences, returning a 2D-array of shape (n_sentences, embedding_size), where n_sentences is the number of input sentences, and embedding_size is Self::embedding_size().
Behaviors depending on the internal state
The behavior of this method varies depending on the internal state of the instance:
- If the common component c_0 is retained by Self::embeddings_mut(), this method uses it to compute embeddings;
- Otherwise, it computes c_0 from the input sentences and uses it to compute embeddings.
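As a rough illustration of what "using c_0 to compute embeddings" means in the cited paper: c_0 is the first principal component of the sentence vectors, and each sentence vector has its projection onto c_0 subtracted. The sketch below shows only that subtraction, using the standard library. The helper names `dot` and `remove_common_component` are hypothetical, and how this crate estimates or stores c_0 is not shown on this page.

```rust
/// Dot product of two equal-length slices.
fn dot(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Removes from `v` its projection onto `c0`:
/// v - ((v . c0) / (c0 . c0)) * c0.
/// Dividing by c0 . c0 means `c0` does not have to be unit length.
fn remove_common_component(v: &[f64], c0: &[f64]) -> Vec<f64> {
    let norm_sq = dot(c0, c0);
    // A zero vector defines no direction, so there is nothing to remove.
    if norm_sq == 0.0 {
        return v.to_vec();
    }
    let scale = dot(v, c0) / norm_sq;
    v.iter().zip(c0).map(|(x, c)| x - scale * c).collect()
}
```

After removal, the result is orthogonal to `c0`. Reusing a retained c_0 amounts to applying the same direction to new sentences rather than re-estimating it from them.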
pub fn embeddings_mut<I, S>(&mut self, sentences: I) -> Array2<Float>
where
    I: IntoIterator<Item = S>,
    S: AsRef<str>,
Computes embeddings for input sentences, returning a 2D-array of shape (n_sentences, embedding_size), where n_sentences is the number of input sentences, and embedding_size is Self::embedding_size().
It also retains the common component c_0 from the input sentences, allowing for its reuse in Self::embeddings().
If the input is empty, the common component will be cleared.
pub fn embedding_size(&self) -> usize
Returns the number of dimensions for sentence embeddings, which is equivalent to that of the input word embeddings.