//! # sif-embedding
//!
//! This crate provides simple but powerful sentence embedding algorithms based on
//! *Smooth Inverse Frequency* and *Common Component Removal* described in the following papers:
//!
//! - Sanjeev Arora, Yingyu Liang, and Tengyu Ma,
//! [A Simple but Tough-to-Beat Baseline for Sentence Embeddings](https://openreview.net/forum?id=SyK00v5xx),
//! ICLR 2017
//! - Kawin Ethayarajh,
//! [Unsupervised Random Walk Sentence Embeddings: A Strong but Simple Baseline](https://aclanthology.org/W18-3012/),
//! RepL4NLP 2018
//!
//! This library will help you if
//!
//! - DNN-based sentence embeddings are too slow for your application,
//! - you do not have the option of using GPUs, or
//! - you want baseline sentence embeddings for your development.
//!
//! ## Getting started
//!
//! Given models of word embeddings and probabilities,
//! sif-embedding can immediately compute sentence embeddings.
//!
//! This crate places no restrictions on which input models you use;
//! however, [finalfusion](https://docs.rs/finalfusion/) and [wordfreq](https://docs.rs/wordfreq/)
//! are the easiest and most practical choices
//! because these libraries handle various pre-trained models and are already plugged into this crate.
//! See [the instructions](#instructions-pre-trained-models) for enabling these libraries in this crate.
//!
//! [`Sif`] and [`USif`] implement the two embedding algorithms,
//! and [`SentenceEmbedder`] defines their common interface.
//! The following example computes sentence embeddings using finalfusion and wordfreq.
//!
//! ```
//! # fn main() -> Result<(), Box<dyn std::error::Error>> {
//! use std::io::BufReader;
//!
//! use finalfusion::compat::text::ReadText;
//! use finalfusion::embeddings::Embeddings;
//! use wordfreq::WordFreq;
//!
//! use sif_embedding::{Sif, SentenceEmbedder};
//!
//! // Loads word embeddings from a pretrained model.
//! let word_embeddings_text = "las 0.0 1.0 2.0\nvegas -3.0 -4.0 -5.0\n";
//! let mut reader = BufReader::new(word_embeddings_text.as_bytes());
//! let word_embeddings = Embeddings::read_text(&mut reader)?;
//!
//! // Loads word probabilities from a pretrained model.
//! let word_probs = WordFreq::new([("las", 0.4), ("vegas", 0.6)]);
//!
//! // Prepares input sentences.
//! let sentences = ["las vegas", "mega vegas"];
//!
//! // Fits the model with input sentences.
//! let model = Sif::new(&word_embeddings, &word_probs);
//! let model = model.fit(&sentences)?;
//!
//! // Computes sentence embeddings in shape (n, m),
//! // where n is the number of sentences and m is the number of dimensions.
//! let sent_embeddings = model.embeddings(sentences)?;
//! assert_eq!(sent_embeddings.shape(), &[2, 3]);
//! # Ok(())
//! # }
//! ```
//!
//! `model.embeddings` requires `O(n_sentences * n_dimensions)` memory.
//! If your input sentences are too large to fit in memory,
//! you can compute sentence embeddings in batches:
//!
//! ```ignore
//! for batch in sentences.chunks(batch_size) {
//!     let sent_embeddings = model.embeddings(batch)?;
//!     ...
//! }
//! ```
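//!
//! To make the pattern concrete, here is a self-contained sketch of the loop,
//! where `embed` is a hypothetical stand-in for `model.embeddings`
//! (it is not part of this crate):
//!
//! ```
//! /// Hypothetical stand-in for `model.embeddings`:
//! /// maps each sentence to a 3-dimensional vector.
//! fn embed(batch: &[&str]) -> Vec<Vec<f32>> {
//!     batch.iter().map(|s| vec![s.len() as f32; 3]).collect()
//! }
//!
//! let sentences = ["las vegas", "mega vegas", "new york", "big apple"];
//! let batch_size = 2;
//! let mut n_processed = 0;
//! for batch in sentences.chunks(batch_size) {
//!     // Each iteration only allocates O(batch_size * n_dimensions) memory.
//!     let sent_embeddings = embed(batch);
//!     n_processed += sent_embeddings.len();
//! }
//! assert_eq!(n_processed, sentences.len());
//! ```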
//!
//! ## Feature specifications
//!
//! This crate provides the following features:
//!
//! - Backend features
//! - `openblas-static` (or alias `openblas`)
//! - `openblas-system`
//! - `netlib-static` (or alias `netlib`)
//! - `netlib-system`
//! - `intel-mkl-static` (or alias `intel-mkl`)
//! - `intel-mkl-system`
//! - Pre-trained model features
//! - `finalfusion`
//! - `wordfreq`
//!
//! No feature is enabled by default.
//! The descriptions of the features can be found below.
//!
//! ## Instructions: Backend specifications
//!
//! This crate depends on [ndarray-linalg](https://github.com/rust-ndarray/ndarray-linalg) and
//! allows you to specify any backend supported by ndarray-linalg.
//! **You must always specify one of the features** from:
//!
//! - `openblas-static` (or alias `openblas`)
//! - `openblas-system`
//! - `netlib-static` (or alias `netlib`)
//! - `netlib-system`
//! - `intel-mkl-static` (or alias `intel-mkl`)
//! - `intel-mkl-system`
//!
//! The feature names correspond to those of ndarray-linalg (v0.16.0).
//! Refer to [the documentation](https://github.com/rust-ndarray/ndarray-linalg/tree/ndarray-linalg-v0.16.0)
//! for the specifications.
//!
//! For example, if you want to use the [OpenBLAS](https://www.openblas.net/) backend with static linking,
//! specify the dependencies as follows:
//!
//! ```toml
//! # Cargo.toml
//!
//! [features]
//! default = ["openblas-static"]
//! openblas-static = ["sif-embedding/openblas-static", "openblas-src/static"]
//!
//! [dependencies.sif-embedding]
//! version = "0.6"
//!
//! [dependencies.openblas-src]
//! version = "0.10.4"
//! optional = true
//! default-features = false
//! features = ["cblas"]
//! ```
//!
//! In addition, declare `openblas-src` at the root of your crate as follows:
//!
//! ```
//! // main.rs / lib.rs
//!
//! #[cfg(feature = "openblas-static")]
//! extern crate openblas_src as _src;
//! ```
//!
//! ## Instructions: Pre-trained models
//!
//! The embedding techniques require two pre-trained models as input:
//!
//! - Word embeddings
//! - Word probabilities
//!
//! You can use arbitrary models through the [`WordEmbeddings`] and [`WordProbabilities`] traits.
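//!
//! For example, a minimal in-memory implementation of [`WordProbabilities`]
//! might look as follows (the `UnigramLM` type is illustrative, not part of this crate):
//!
//! ```
//! use std::collections::HashMap;
//!
//! use sif_embedding::{Float, WordProbabilities};
//!
//! /// Hypothetical unigram model backed by a hash map.
//! struct UnigramLM {
//!     probs: HashMap<String, Float>,
//! }
//!
//! impl WordProbabilities for UnigramLM {
//!     fn probability(&self, word: &str) -> Float {
//!         self.probs.get(word).copied().unwrap_or(0.)
//!     }
//!
//!     fn n_words(&self) -> usize {
//!         self.probs.len()
//!     }
//!
//!     fn entries(&self) -> Box<dyn Iterator<Item = (String, Float)> + '_> {
//!         Box::new(self.probs.iter().map(|(w, p)| (w.clone(), *p)))
//!     }
//! }
//!
//! let lm = UnigramLM {
//!     probs: HashMap::from([("las".to_string(), 0.4), ("vegas".to_string(), 0.6)]),
//! };
//! assert_eq!(lm.n_words(), 2);
//! assert_eq!(lm.probability("las"), 0.4);
//! assert_eq!(lm.probability("unknown"), 0.);
//! ```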
//!
//! This crate already implements these traits for two external libraries:
//!
//! - [finalfusion (v0.17)](https://docs.rs/finalfusion/): Library handling different types of word embeddings such as GloVe and fastText.
//! - [wordfreq (v0.2)](https://docs.rs/wordfreq/): Library for looking up word frequencies in many languages.
//!
//! To enable the features, specify the dependencies as follows:
//!
//! ```toml
//! # Cargo.toml
//!
//! [dependencies.sif-embedding]
//! version = "0.6"
//! features = ["finalfusion", "wordfreq"]
//! ```
//!
//! A tutorial on using external pre-trained models with finalfusion and wordfreq is available
//! [here](https://github.com/kampersanda/sif-embedding/tree/main/tutorial).
#![deny(missing_docs)]
// These declarations are required to recognize the backend.
// https://github.com/rust-ndarray/ndarray-linalg/blob/ndarray-linalg-v0.16.0/lax/src/lib.rs
#[cfg(any(feature = "intel-mkl-static", feature = "intel-mkl-system"))]
extern crate intel_mkl_src as _src;
#[cfg(any(feature = "netlib-static", feature = "netlib-system"))]
extern crate netlib_src as _src;
#[cfg(any(feature = "openblas-static", feature = "openblas-system"))]
extern crate openblas_src as _src;
pub mod sif;
pub mod usif;
pub mod util;
#[cfg(feature = "finalfusion")]
pub mod finalfusion;
#[cfg(feature = "wordfreq")]
pub mod wordfreq;
pub use sif::Sif;
pub use usif::USif;
use anyhow::Result;
use ndarray::{Array2, CowArray, Ix1};
/// Common floating-point number type.
pub type Float = f32;
/// Default separator for splitting sentences into words.
pub const DEFAULT_SEPARATOR: char = ' ';
/// Default number of samples to fit.
pub const DEFAULT_N_SAMPLES_TO_FIT: usize = 1 << 16;
/// Word embeddings.
pub trait WordEmbeddings {
/// Returns the embedding of a word.
fn embedding(&self, word: &str) -> Option<CowArray<Float, Ix1>>;
/// Returns the number of dimensions of the word embeddings.
fn embedding_size(&self) -> usize;
}
/// Word probabilities.
pub trait WordProbabilities {
/// Returns the probability of a word.
fn probability(&self, word: &str) -> Float;
/// Returns the number of words in the vocabulary.
fn n_words(&self) -> usize;
/// Returns an iterator over words and probabilities in the vocabulary.
fn entries(&self) -> Box<dyn Iterator<Item = (String, Float)> + '_>;
}
/// Common behavior of our models for sentence embeddings.
pub trait SentenceEmbedder: Sized {
/// Returns the number of dimensions for sentence embeddings.
fn embedding_size(&self) -> usize;
/// Fits the model with input sentences.
fn fit<S>(self, sentences: &[S]) -> Result<Self>
where
S: AsRef<str>;
/// Computes embeddings for input sentences using the fitted model.
fn embeddings<I, S>(&self, sentences: I) -> Result<Array2<Float>>
where
I: IntoIterator<Item = S>,
S: AsRef<str>;
}