sif_embedding/lib.rs
//! # sif-embedding
//!
//! This crate provides simple but powerful sentence embedding algorithms based on
//! *Smooth Inverse Frequency* and *Common Component Removal* described in the following papers:
//!
//! - Sanjeev Arora, Yingyu Liang, and Tengyu Ma,
//!   [A Simple but Tough-to-Beat Baseline for Sentence Embeddings](https://openreview.net/forum?id=SyK00v5xx),
//!   ICLR 2017
//! - Kawin Ethayarajh,
//!   [Unsupervised Random Walk Sentence Embeddings: A Strong but Simple Baseline](https://aclanthology.org/W18-3012/),
//!   RepL4NLP 2018
//!
//! This library will help you if
//!
//! - DNN-based sentence embeddings are too slow for your application,
//! - you do not have the option of using GPUs, or
//! - you want baseline sentence embeddings for your development.
//!
//! ## Getting started
//!
//! Given models of word embeddings and word probabilities,
//! sif-embedding can immediately compute sentence embeddings.
//!
//! This crate does not limit which libraries you use to provide the input models;
//! however, [finalfusion](https://docs.rs/finalfusion/) and [wordfreq](https://docs.rs/wordfreq/)
//! are the easiest and most reasonable choices
//! because these libraries can handle various pre-trained models and are already plugged into this crate.
//! See [the instructions](#instructions-pre-trained-models) to enable these libraries in this crate.
//!
//! [`Sif`] and [`USif`] implement the sentence embedding algorithms.
//! [`SentenceEmbedder`] defines their common interface.
//! The following code shows an example of computing sentence embeddings using finalfusion and wordfreq.
//!
//! ```
//! # fn main() -> Result<(), Box<dyn std::error::Error>> {
//! use std::io::BufReader;
//!
//! use finalfusion::compat::text::ReadText;
//! use finalfusion::embeddings::Embeddings;
//! use wordfreq::WordFreq;
//!
//! use sif_embedding::{Sif, SentenceEmbedder};
//!
//! // Loads word embeddings from a pretrained model.
//! let word_embeddings_text = "las 0.0 1.0 2.0\nvegas -3.0 -4.0 -5.0\n";
//! let mut reader = BufReader::new(word_embeddings_text.as_bytes());
//! let word_embeddings = Embeddings::read_text(&mut reader)?;
//!
//! // Loads word probabilities from a pretrained model.
//! let word_probs = WordFreq::new([("las", 0.4), ("vegas", 0.6)]);
//!
//! // Prepares input sentences.
//! let sentences = ["las vegas", "mega vegas"];
//!
//! // Fits the model with input sentences.
//! let model = Sif::new(&word_embeddings, &word_probs);
//! let model = model.fit(&sentences)?;
//!
//! // Computes sentence embeddings in shape (n, m),
//! // where n is the number of sentences and m is the number of dimensions.
//! let sent_embeddings = model.embeddings(sentences)?;
//! assert_eq!(sent_embeddings.shape(), &[2, 3]);
//! # Ok(())
//! # }
//! ```
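//!
//! Each row of the returned matrix can be used directly as a sentence vector.
//! For example, the cosine similarity between the two sentences above can be computed
//! with plain ndarray operations (a sketch continuing the example above, not an API of this crate):
//!
//! ```ignore
//! let e1 = sent_embeddings.row(0);
//! let e2 = sent_embeddings.row(1);
//! // Cosine similarity = dot(e1, e2) / (||e1|| * ||e2||).
//! let cosine = e1.dot(&e2) / (e1.dot(&e1).sqrt() * e2.dot(&e2).sqrt());
//! ```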
//!
//! `model.embeddings` requires `O(n_sentences * n_dimensions)` memory.
//! If your input sentences are too large to fit in memory,
//! you can compute sentence embeddings in batches.
//!
//! ```ignore
//! for batch in sentences.chunks(batch_size) {
//!     let sent_embeddings = model.embeddings(batch)?;
//!     ...
//! }
//! ```
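//!
//! Each iteration then allocates only `O(batch_size * n_dimensions)` memory, and the resulting
//! rows can be consumed immediately. The following is a minimal sketch assuming `batch_size`
//! and the fitted `model` from the example above (hypothetical glue code, not an API of this crate):
//!
//! ```ignore
//! for batch in sentences.chunks(batch_size) {
//!     let sent_embeddings = model.embeddings(batch)?;
//!     // Consume the rows immediately (e.g., write them out) instead of keeping them in memory.
//!     for (sentence, embedding) in batch.iter().zip(sent_embeddings.rows()) {
//!         println!("{sentence}\t{embedding}");
//!     }
//! }
//! ```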
//!
//! ## Feature specifications
//!
//! This crate provides the following features:
//!
//! - Backend features
//!     - `openblas-static` (or alias `openblas`)
//!     - `openblas-system`
//!     - `netlib-static` (or alias `netlib`)
//!     - `netlib-system`
//!     - `intel-mkl-static` (or alias `intel-mkl`)
//!     - `intel-mkl-system`
//! - Pre-trained model features
//!     - `finalfusion`
//!     - `wordfreq`
//!
//! No feature is enabled by default.
//! The features are described in the sections below.
//!
//! ## Instructions: Backend specifications
//!
//! This crate depends on [ndarray-linalg](https://github.com/rust-ndarray/ndarray-linalg) and
//! allows you to specify any backend supported by ndarray-linalg.
//! **You must always specify one of the following features:**
//!
//! - `openblas-static` (or alias `openblas`)
//! - `openblas-system`
//! - `netlib-static` (or alias `netlib`)
//! - `netlib-system`
//! - `intel-mkl-static` (or alias `intel-mkl`)
//! - `intel-mkl-system`
//!
//! The feature names correspond to those of ndarray-linalg (v0.16.0).
//! Refer to [the documentation](https://github.com/rust-ndarray/ndarray-linalg/tree/ndarray-linalg-v0.16.0)
//! for the specifications.
//!
//! For example, if you want to use the [OpenBLAS](https://www.openblas.net/) backend with static linking,
//! specify the dependencies as follows:
//!
//! ```toml
//! # Cargo.toml
//!
//! [features]
//! default = ["openblas-static"]
//! openblas-static = ["sif-embedding/openblas-static", "openblas-src/static"]
//!
//! [dependencies.sif-embedding]
//! version = "0.6"
//!
//! [dependencies.openblas-src]
//! version = "0.10.4"
//! optional = true
//! default-features = false
//! features = ["cblas"]
//! ```
//!
//! In addition, declare `openblas-src` at the root of your crate as follows:
//!
//! ```
//! // main.rs / lib.rs
//!
//! #[cfg(feature = "openblas-static")]
//! extern crate openblas_src as _src;
//! ```
//!
//! ## Instructions: Pre-trained models
//!
//! The embedding techniques require two pre-trained models as input:
//!
//! - Word embeddings
//! - Word probabilities
//!
//! You can use arbitrary models through the [`WordEmbeddings`] and [`WordProbabilities`] traits.
//!
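//! For example, a toy in-memory model can implement both traits as follows
//! (a minimal sketch for illustration; `ToyModel` is a hypothetical type, not part of this crate):
//!
//! ```
//! use ndarray::{arr1, CowArray, Ix1};
//! use sif_embedding::{Float, WordEmbeddings, WordProbabilities};
//!
//! struct ToyModel;
//!
//! impl WordEmbeddings for ToyModel {
//!     fn embedding(&self, word: &str) -> Option<CowArray<Float, Ix1>> {
//!         // Returns a fixed 2-dimensional vector for the single known word.
//!         if word == "hello" {
//!             Some(CowArray::from(arr1(&[1.0, 2.0])))
//!         } else {
//!             None
//!         }
//!     }
//!
//!     fn embedding_size(&self) -> usize {
//!         2
//!     }
//! }
//!
//! impl WordProbabilities for ToyModel {
//!     fn probability(&self, word: &str) -> Float {
//!         if word == "hello" { 1.0 } else { 0.0 }
//!     }
//!
//!     fn n_words(&self) -> usize {
//!         1
//!     }
//!
//!     fn entries(&self) -> Box<dyn Iterator<Item = (String, Float)> + '_> {
//!         Box::new(std::iter::once(("hello".to_string(), 1.0)))
//!     }
//! }
//! ```
//!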
//! This crate already implements these traits for two external libraries:
//!
//! - [finalfusion (v0.17)](https://docs.rs/finalfusion/): Library to handle different types of word embeddings, such as GloVe and fastText.
//! - [wordfreq (v0.2)](https://docs.rs/wordfreq/): Library to look up the frequencies of words in many languages.
//!
//! To enable these features, specify the dependencies as follows:
//!
//! ```toml
//! # Cargo.toml
//!
//! [dependencies.sif-embedding]
//! version = "0.6"
//! features = ["finalfusion", "wordfreq"]
//! ```
//!
//! A tutorial on using external pre-trained models with finalfusion and wordfreq can be found
//! [here](https://github.com/kampersanda/sif-embedding/tree/main/tutorial).
#![deny(missing_docs)]

// These declarations are required to recognize the backend.
// https://github.com/rust-ndarray/ndarray-linalg/blob/ndarray-linalg-v0.16.0/lax/src/lib.rs
#[cfg(any(feature = "intel-mkl-static", feature = "intel-mkl-system"))]
extern crate intel_mkl_src as _src;
#[cfg(any(feature = "netlib-static", feature = "netlib-system"))]
extern crate netlib_src as _src;
#[cfg(any(feature = "openblas-static", feature = "openblas-system"))]
extern crate openblas_src as _src;

pub mod sif;
pub mod usif;
pub mod util;

#[cfg(feature = "finalfusion")]
pub mod finalfusion;
#[cfg(feature = "wordfreq")]
pub mod wordfreq;

pub use sif::Sif;
pub use usif::USif;

use anyhow::Result;
use ndarray::{Array2, CowArray, Ix1};

/// Common type of floating-point numbers.
pub type Float = f32;

/// Default separator for splitting sentences into words.
pub const DEFAULT_SEPARATOR: char = ' ';

/// Default number of samples used to fit a model.
pub const DEFAULT_N_SAMPLES_TO_FIT: usize = 1 << 16;

/// Word embeddings.
pub trait WordEmbeddings {
    /// Returns the embedding of a word.
    fn embedding(&self, word: &str) -> Option<CowArray<Float, Ix1>>;

    /// Returns the number of dimensions of the word embeddings.
    fn embedding_size(&self) -> usize;
}

/// Word probabilities.
pub trait WordProbabilities {
    /// Returns the probability of a word.
    fn probability(&self, word: &str) -> Float;

    /// Returns the number of words in the vocabulary.
    fn n_words(&self) -> usize;

    /// Returns an iterator over the words and probabilities in the vocabulary.
    fn entries(&self) -> Box<dyn Iterator<Item = (String, Float)> + '_>;
}

/// Common behavior of our models for sentence embeddings.
pub trait SentenceEmbedder: Sized {
    /// Returns the number of dimensions for sentence embeddings.
    fn embedding_size(&self) -> usize;

    /// Fits the model with input sentences.
    fn fit<S>(self, sentences: &[S]) -> Result<Self>
    where
        S: AsRef<str>;

    /// Computes embeddings for input sentences using the fitted model.
    fn embeddings<I, S>(&self, sentences: I) -> Result<Array2<Float>>
    where
        I: IntoIterator<Item = S>,
        S: AsRef<str>;
}