Struct Embedding

Source
pub struct Embedding<S>
where S: VectorSpace,
{ /* private fields */ }

§Embeddings

Embeddings are a way to represent the meaning of text in a numerical format. They can be used to compare the meaning of two different texts, search for documents with an embedding database, or train classification models.

§Creating Embeddings

You can create embeddings from text using a Bert embedding model. You can call embed on a Bert instance to get an embedding for a single sentence or embed_batch to get embeddings for a list of sentences at once:

let mut bert = Bert::new().await.unwrap();
let sentences = vec![
    "Kalosm can be used to build local AI applications",
    "With private LLMs data never leaves your computer",
    "The quick brown fox jumps over the lazy dog",
];
let embeddings = bert.embed_batch(&sentences).await.unwrap();
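
You can also call embed on its own to get an embedding for a single sentence. A minimal sketch, reusing the bert instance created above:

// Embed a single sentence with the same Bert model
let embedding = bert
    .embed("Kalosm can be used to build local AI applications")
    .await
    .unwrap();
println!("{:?}", embedding);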

Once you have embeddings, you can compare them to each other with a distance metric. The cosine similarity is a common metric for comparing embeddings that measures the cosine of the angle between the two vectors:

// Find the cosine similarity between each pair of sentences
let n_sentences = sentences.len();
for (i, e_i) in embeddings.iter().enumerate() {
    for j in (i + 1)..n_sentences {
        let e_j = embeddings.get(j).unwrap();
        let cosine_similarity = e_j.cosine_similarity(e_i);
        println!("score: {cosine_similarity:.2} '{}' '{}'", sentences[i], sentences[j])
    }
}

You should see that the first two sentences are similar to each other, while the third sentence is not similar to either of the first two:

score: 0.82 'Kalosm can be used to build local AI applications' 'With private LLMs data never leaves your computer'
score: 0.72 'With private LLMs data never leaves your computer' 'The quick brown fox jumps over the lazy dog'
score: 0.72 'Kalosm can be used to build local AI applications' 'The quick brown fox jumps over the lazy dog'
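
The cosine of the angle between two vectors is their dot product divided by the product of their lengths, so you can recompute the score by hand from the raw floats returned by to_vec. This sketch is purely illustrative; the built-in cosine_similarity method is the normal way to compare embeddings:

// Recompute the similarity between the first two sentences manually
let a: Vec<f32> = embeddings.get(0).unwrap().to_vec();
let b: Vec<f32> = embeddings.get(1).unwrap().to_vec();
let dot: f32 = a.iter().zip(&b).map(|(x, y)| x * y).sum();
let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
println!("manual cosine similarity: {:.2}", dot / (norm_a * norm_b));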

§Searching for Similar Text

Embeddings can also be a powerful tool for search. Unlike traditional text-based search, searching with embeddings doesn’t directly look for keywords in the text. Instead, it looks for text with similar meanings, which can make search more robust and accurate.

In the previous example, we used the cosine similarity to find the similarity between two sentences. Even though the first two sentences have no words in common, their embeddings are similar because they have related meanings.

You can use a vector database to store (embedding, value) pairs in an easily searchable way. You can create a vector database with VectorDB::new:

// Create a good default Bert model for search
let bert = Bert::new_for_search().await.unwrap();
let sentences = [
    "Kalosm can be used to build local AI applications",
    "With private LLMs data never leaves your computer",
    "The quick brown fox jumps over the lazy dog",
];
// Embed sentences into the vector space
let embeddings = bert.embed_batch(sentences).await.unwrap();
println!("embeddings {:?}", embeddings);

// Create a vector database from the embeddings along with a map between the embedding ids and the sentences
let db = VectorDB::new().unwrap();
let embeddings = db.add_embeddings(embeddings).unwrap();
let embedding_id_to_sentence: HashMap<EmbeddingId, &str> =
    HashMap::from_iter(embeddings.into_iter().zip(sentences));

// Embed a query into the vector space. We use `embed_query` instead of `embed` because some models embed queries differently than normal text.
let embedding = bert.embed_query("What is Kalosm?").await.unwrap();
let closest = db.get_closest(embedding, 1).unwrap();
if let [closest] = closest.as_slice() {
    let distance = closest.distance;
    let text = embedding_id_to_sentence.get(&closest.value).unwrap();
    println!("distance: {distance}");
    println!("closest:  {text}");
}

The vector database should find that the closest sentence to “What is Kalosm?” is “Kalosm can be used to build local AI applications”:

distance: 0.18480265
closest: Kalosm can be used to build local AI applications
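
You can also ask for more than one result at a time. A small sketch that embeds the same query again and retrieves the two closest sentences, assuming the results carry the same distance and value fields used above:

// Retrieve the two closest sentences instead of just one
let embedding = bert.embed_query("What is Kalosm?").await.unwrap();
let closest = db.get_closest(embedding, 2).unwrap();
for result in &closest {
    let text = embedding_id_to_sentence.get(&result.value).unwrap();
    println!("distance: {} text: {}", result.distance, text);
}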

§Classification with Embeddings

Since embeddings represent something about the meaning of text, you can use them to quickly train classification models. Instead of training a whole new model to understand text and classify it, you can just train a classifier on top of a frozen embedding model.

Even with a relatively small dataset, a classifier built on top of an embedding model can achieve impressive results. Let's start by creating a dataset of questions and statements:

#[derive(Debug, Clone, Copy, Class)]
enum SentenceType {
    Question,
    Statement,
}
// Create a dataset for the classifier
let bert = Bert::builder()
    .with_source(BertSource::snowflake_arctic_embed_extra_small())
    .build()
    .await?;
let mut dataset = TextClassifierDatasetBuilder::<SentenceType, _>::new(&bert);
const QUESTIONS: [&str; 10] = [
    "What is the capital of France",
    "What is the capital of the United States",
    "What is the best way to learn a new language",
    "What is the best way to learn a new programming language",
    "What is a framework",
    "What is a library",
    "What is a good way to learn a new language",
    "What is a good way to learn a new programming language",
    "What is the city with the most people in the world",
    "What is the most spoken language in the world",
];
const STATEMENTS: [&str; 10] = [
    "The president of France is Emmanuel Macron",
    "The capital of France is Paris",
    "The capital of the United States is Washington, DC",
    "The light bulb was invented by Thomas Edison",
    "The best way to learn a new programming language is to start with the basics and gradually build on them",
    "A framework is a set of libraries and tools that help developers build applications",
    "A library is a collection of code that can be used by other developers",
    "A good way to learn a new language is to practice it every day",
    "The city with the most people in the world is Tokyo",
    "The most spoken language in the United States is English",
];

for question in QUESTIONS {
    dataset.add(question, SentenceType::Question).await?;
}
for statement in STATEMENTS {
    dataset.add(statement, SentenceType::Statement).await?;
}
let dev = accelerated_device_if_available()?;
let dataset = dataset.build(&dev)?;

Next, train a classifier on the dataset:

// Create a classifier
let mut classifier = TextClassifier::<SentenceType, BertSpace>::new(Classifier::new(
    &dev,
    ClassifierConfig::new().layers_dims([10]),
)?);

// Train the classifier
classifier.train(
    &dataset, // The dataset to train on
    &dev,     // The device to train on
    100,      // The number of epochs to train for
    0.0003,   // The learning rate
    50,       // The batch size
)?;

// Run the classifier on some input
loop {
    let input = prompt_input("Input: ").unwrap();
    let embedding = bert.embed(input).await?;
    let output = classifier.run(embedding)?;
    println!("Output: {:?}", output);
}
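
Instead of an interactive loop, you can also run the trained classifier directly on a few held-out sentences. The two inputs below are hypothetical examples that are not part of the training set:

// Check the classifier on sentences it has never seen
for sentence in ["Where is the Eiffel Tower", "The Eiffel Tower is in Paris"] {
    let embedding = bert.embed(sentence).await?;
    let output = classifier.run(embedding)?;
    println!("{} => {:?}", sentence, output);
}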

Implementations§

Source§

impl<S> Embedding<S>
where S: VectorSpace,

Source

pub fn cosine_similarity(&self, other: &Embedding<S>) -> f32

Compute the cosine similarity between this embedding and another embedding.

Source§

impl<S1> Embedding<S1>
where S1: VectorSpace,

Source

pub fn cast<S2>(self) -> Embedding<S2>
where S2: VectorSpace,

Cast this embedding to a different vector space.

Source§

impl<S> Embedding<S>
where S: VectorSpace,

Source

pub fn new(embedding: Tensor) -> Embedding<S>

Create a new embedding from a tensor.

Source

pub fn vector(&self) -> &Tensor

Get the tensor that represents this embedding.

Source

pub fn to_vec(&self) -> Vec<f32>

Get the tensor that represents this embedding as a Vec of floats.
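
For example, to_vec can be handy for checking the dimensionality of an embedding or handing the raw floats to other code. A minimal sketch, assuming a Bert model as in the examples above:

// Inspect the raw floats behind an embedding
let bert = Bert::new().await.unwrap();
let embedding = bert.embed("hello world").await.unwrap();
let raw: Vec<f32> = embedding.to_vec();
println!("dimensions: {}", raw.len());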

Trait Implementations§

Source§

impl<S> Add for Embedding<S>
where S: VectorSpace,

Source§

type Output = Embedding<S>

The resulting type after applying the + operator.
Source§

fn add(self, other: Embedding<S>) -> <Embedding<S> as Add>::Output

Performs the + operation. Read more
Source§

impl<S> Clone for Embedding<S>
where S: VectorSpace,

Source§

fn clone(&self) -> Embedding<S>

Returns a copy of the value. Read more
1.0.0 · Source§

fn clone_from(&mut self, source: &Self)

Performs copy-assignment from source. Read more
Source§

impl<S> Debug for Embedding<S>
where S: VectorSpace,

Source§

fn fmt(&self, f: &mut Formatter<'_>) -> Result<(), Error>

Formats the value using the given formatter. Read more
Source§

impl<'de, S> Deserialize<'de> for Embedding<S>
where S: VectorSpace,

Source§

fn deserialize<Des>(deserializer: Des) -> Result<Embedding<S>, <Des as Deserializer<'de>>::Error>
where Des: Deserializer<'de>,

Deserialize this value from the given Serde deserializer. Read more
Source§

impl<S> Div<f64> for Embedding<S>
where S: VectorSpace,

Source§

type Output = Embedding<S>

The resulting type after applying the / operator.
Source§

fn div(self, other: f64) -> <Embedding<S> as Div<f64>>::Output

Performs the / operation. Read more
Source§

impl<S, I> From<I> for Embedding<S>
where S: VectorSpace, I: IntoIterator<Item = f32>,

Source§

fn from(iter: I) -> Embedding<S>

Converts to this type from the input type.
Source§

impl<S> IntoEmbedding<S> for Embedding<S>
where S: VectorSpace,

Convert an embedding of the same vector space into an embedding with an embedding model.

Source§

async fn into_embedding<E>(self, _: &E) -> Result<Embedding<S>, Error>
where E: Embedder<VectorSpace = S>,

Convert the type into an embedding with the given embedding model.
Source§

async fn into_query_embedding<E>(self, _: &E) -> Result<Embedding<S>, Error>
where E: Embedder<VectorSpace = S>,

Convert the type into a query embedding with the given embedding model.
Source§

impl<S> Mul<f64> for Embedding<S>
where S: VectorSpace,

Source§

type Output = Embedding<S>

The resulting type after applying the * operator.
Source§

fn mul(self, other: f64) -> <Embedding<S> as Mul<f64>>::Output

Performs the * operation. Read more
Source§

impl<S> Serialize for Embedding<S>
where S: VectorSpace,

Source§

fn serialize<Ser>(&self, serializer: Ser) -> Result<<Ser as Serializer>::Ok, <Ser as Serializer>::Error>
where Ser: Serializer,

Serialize this value into the given Serde serializer. Read more
Source§

impl<S> Sub for Embedding<S>
where S: VectorSpace,

Source§

type Output = Embedding<S>

The resulting type after applying the - operator.
Source§

fn sub(self, other: Embedding<S>) -> <Embedding<S> as Sub>::Output

Performs the - operation. Read more
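
The Add, Sub, Mul<f64>, and Div<f64> implementations above, combined with From<I>, let you build embeddings from raw floats and combine them arithmetically, for example to average two embeddings. A minimal sketch; BertSpace stands in for any VectorSpace, and the three-dimensional vectors are purely illustrative:

// Build two embeddings from raw floats and average them
let a: Embedding<BertSpace> = vec![1.0f32, 0.0, 0.0].into();
let b: Embedding<BertSpace> = vec![0.0f32, 1.0, 0.0].into();
let average = (a + b) / 2.0;
println!("{:?}", average.to_vec());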

Auto Trait Implementations§

§

impl<S> Freeze for Embedding<S>

§

impl<S> !RefUnwindSafe for Embedding<S>

§

impl<S> Send for Embedding<S>

§

impl<S> Sync for Embedding<S>

§

impl<S> Unpin for Embedding<S>
where S: Unpin,

§

impl<S> !UnwindSafe for Embedding<S>

Blanket Implementations§

Source§

impl<T> Any for T
where T: 'static + ?Sized,

Source§

fn type_id(&self) -> TypeId

Gets the TypeId of self. Read more
Source§

impl<T> Borrow<T> for T
where T: ?Sized,

Source§

fn borrow(&self) -> &T

Immutably borrows from an owned value. Read more
Source§

impl<T> BorrowMut<T> for T
where T: ?Sized,

Source§

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value. Read more
Source§

impl<T> CloneToUninit for T
where T: Clone,

Source§

unsafe fn clone_to_uninit(&self, dest: *mut u8)

🔬This is a nightly-only experimental API. (clone_to_uninit)
Performs copy-assignment from self to dest. Read more
Source§

impl<T> From<T> for T

Source§

fn from(t: T) -> T

Returns the argument unchanged.

Source§

impl<T> Instrument for T

Source§

fn instrument(self, span: Span) -> Instrumented<Self>

Instruments this type with the provided Span, returning an Instrumented wrapper. Read more
Source§

fn in_current_span(self) -> Instrumented<Self>

Instruments this type with the current Span, returning an Instrumented wrapper. Read more
Source§

impl<T, U> Into<U> for T
where U: From<T>,

Source§

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

Source§

impl<T> IntoEither for T

Source§

fn into_either(self, into_left: bool) -> Either<Self, Self>

Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise. Read more
Source§

fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
where F: FnOnce(&Self) -> bool,

Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise. Read more
Source§

impl<T> Pointable for T

Source§

const ALIGN: usize

The alignment of pointer.
Source§

type Init = T

The type for initializers.
Source§

unsafe fn init(init: <T as Pointable>::Init) -> usize

Initializes a with the given initializer. Read more
Source§

unsafe fn deref<'a>(ptr: usize) -> &'a T

Dereferences the given pointer. Read more
Source§

unsafe fn deref_mut<'a>(ptr: usize) -> &'a mut T

Mutably dereferences the given pointer. Read more
Source§

unsafe fn drop(ptr: usize)

Drops the object pointed to by the given pointer. Read more
Source§

impl<T> ToOwned for T
where T: Clone,

Source§

type Owned = T

The resulting type after obtaining ownership.
Source§

fn to_owned(&self) -> T

Creates owned data from borrowed data, usually by cloning. Read more
Source§

fn clone_into(&self, target: &mut T)

Uses borrowed data to replace owned data, usually by cloning. Read more
Source§

impl<T, U> TryFrom<U> for T
where U: Into<T>,

Source§

type Error = Infallible

The type returned in the event of a conversion error.
Source§

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.
Source§

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

Source§

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.
Source§

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.
Source§

impl<V, T> VZip<V> for T
where V: MultiLane<T>,

Source§

fn vzip(self) -> V

Source§

impl<T> WithSubscriber for T

Source§

fn with_subscriber<S>(self, subscriber: S) -> WithDispatch<Self>
where S: Into<Dispatch>,

Attaches the provided Subscriber to this type, returning a WithDispatch wrapper. Read more
Source§

fn with_current_subscriber(self) -> WithDispatch<Self>

Attaches the current default Subscriber to this type, returning a WithDispatch wrapper. Read more
Source§

impl<T> DeserializeOwned for T
where T: for<'de> Deserialize<'de>,

Source§

impl<T> ErasedDestructor for T
where T: 'static,