pub struct Embedding { /* private fields */ }
§Embeddings
Embeddings are a way to represent the meaning of text in a numerical format. They can be used to compare the meaning of two different texts, search for documents with an embedding database, or train classification models.
§Creating Embeddings
You can create embeddings from text using a Bert embedding model. You can call embed on a Bert instance to get an embedding for a single sentence, or embed_batch to get embeddings for a list of sentences at once:
let mut bert = Bert::new().await.unwrap();
let sentences = vec![
    "Kalosm can be used to build local AI applications",
    "With private LLMs data never leaves your computer",
    "The quick brown fox jumps over the lazy dog",
];
let embeddings = bert.embed_batch(&sentences).await.unwrap();
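If you only need one embedding, embed works the same way. Here is a minimal sketch reusing the bert instance from the example above:

// Embed a single sentence with the same model
let embedding = bert.embed("Kalosm can be used to build local AI applications").await.unwrap();
println!("{:?}", embedding);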
Once you have embeddings, you can compare them to each other with a distance metric. Cosine similarity is a common metric for comparing embeddings; it measures the cosine of the angle between the two vectors:
// Find the cosine similarity between each pair of sentences
let n_sentences = sentences.len();
for (i, e_i) in embeddings.iter().enumerate() {
    for j in (i + 1)..n_sentences {
        let e_j = embeddings.get(j).unwrap();
        let cosine_similarity = e_j.cosine_similarity(e_i);
        println!("score: {cosine_similarity:.2} '{}' '{}'", sentences[i], sentences[j]);
    }
}
You should see that the first two sentences are similar to each other, while the third sentence is not similar to either of the first two:
score: 0.82 'Kalosm can be used to build local AI applications' 'With private LLMs data never leaves your computer'
score: 0.72 'With private LLMs data never leaves your computer' 'The quick brown fox jumps over the lazy dog'
score: 0.72 'Kalosm can be used to build local AI applications' 'The quick brown fox jumps over the lazy dog'
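Under the hood, the cosine similarity of two vectors is their dot product divided by the product of their magnitudes. The sketch below computes it for plain f32 slices; it is illustrative only and does not use the Embedding type:

// Cosine similarity of two raw vectors: dot(a, b) / (|a| * |b|)
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (norm_a * norm_b)
}

// Vectors pointing in the same direction score 1.0, orthogonal vectors score 0.0
assert_eq!(cosine_similarity(&[1.0, 0.0], &[2.0, 0.0]), 1.0);
assert_eq!(cosine_similarity(&[1.0, 0.0], &[0.0, 1.0]), 0.0);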
§Searching for Similar Text
Embeddings can also be a powerful tool for search. Unlike traditional text-based search, searching for text with embeddings doesn’t directly look for keywords in the text. Instead, it looks for text with similar meanings, which can make search more robust and accurate.
In the previous example, we used the cosine similarity to find the similarity between two sentences. Even though the first two sentences have no words in common, their embeddings are similar because they have related meanings.
You can use a vector database to store embedding/value pairs in an easily searchable way. You can create a vector database with VectorDB::new:
// Create a good default Bert model for search
let bert = Bert::new_for_search().await?;
let sentences = [
    "Kalosm can be used to build local AI applications",
    "With private LLMs data never leaves your computer",
    "The quick brown fox jumps over the lazy dog",
];
// Embed sentences into the vector space
let embeddings = bert.embed_batch(sentences).await?;
println!("embeddings {:?}", embeddings);
// Create a vector database from the embeddings along with a map between the embedding ids and the sentences
let db = VectorDB::new()?;
let embeddings = db.add_embeddings(embeddings)?;
let embedding_id_to_sentence: HashMap<EmbeddingId, &str> =
    HashMap::from_iter(embeddings.into_iter().zip(sentences));
// Embed a query into the vector space. We use `embed_query` instead of `embed` because some models embed queries differently than normal text.
let embedding = bert.embed_query("What is Kalosm?").await?;
let closest = db.search(&embedding).run()?;
if let [closest] = closest.as_slice() {
    let distance = closest.distance;
    let text = embedding_id_to_sentence.get(&closest.value).unwrap();
    println!("distance: {distance}");
    println!("closest: {text}");
}
The vector database should find that the closest sentence to “What is Kalosm?” is “Kalosm can be used to build local AI applications”:
distance: 0.18480265
closest: Kalosm can be used to build local AI applications
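Because the search compares meanings rather than keywords, a query that shares no words with a stored sentence can still match it. Here is a sketch of a second query against the same database, reusing only the calls shown above:

// Ask about data privacy; this query will likely land closest to the sentence about private LLMs
let embedding = bert.embed_query("Does my data stay on my machine?").await?;
let closest = db.search(&embedding).run()?;
if let [closest] = closest.as_slice() {
    let text = embedding_id_to_sentence.get(&closest.value).unwrap();
    println!("closest: {text}");
}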
§Classification with Embeddings
Since embeddings represent something about the meaning of text, you can use them to quickly train classification models. Instead of training a whole new model to understand text and classify it, you can just train a classifier on top of a frozen embedding model.
Even with a relatively small dataset, a classifier built on top of an embedding model can achieve impressive results. Let's start by creating a dataset of questions and statements:
#[derive(Debug, Clone, Copy, Class)]
enum SentenceType {
    Question,
    Statement,
}

// Create a dataset for the classifier
let bert = Bert::builder()
    .with_source(BertSource::snowflake_arctic_embed_extra_small())
    .build()
    .await?;
let mut dataset = TextClassifierDatasetBuilder::<SentenceType, _>::new(&bert);

const QUESTIONS: [&str; 10] = [
    "What is the capital of France",
    "What is the capital of the United States",
    "What is the best way to learn a new language",
    "What is the best way to learn a new programming language",
    "What is a framework",
    "What is a library",
    "What is a good way to learn a new language",
    "What is a good way to learn a new programming language",
    "What is the city with the most people in the world",
    "What is the most spoken language in the world",
];

const STATEMENTS: [&str; 10] = [
    "The president of France is Emmanuel Macron",
    "The capital of France is Paris",
    "The capital of the United States is Washington, DC",
    "The light bulb was invented by Thomas Edison",
    "The best way to learn a new programming language is to start with the basics and gradually build on them",
    "A framework is a set of libraries and tools that help developers build applications",
    "A library is a collection of code that can be used by other developers",
    "A good way to learn a new language is to practice it every day",
    "The city with the most people in the world is Tokyo",
    "The most spoken language in the United States is English",
];

for question in QUESTIONS {
    dataset.add(question, SentenceType::Question).await?;
}

for statement in STATEMENTS {
    dataset.add(statement, SentenceType::Statement).await?;
}

let dev = accelerated_device_if_available()?;
let dataset = dataset.build(&dev)?;
Next, train a classifier on the dataset:
// Create a classifier
let classifier = TextClassifier::<SentenceType>::new(Classifier::new(
    &dev,
    ClassifierConfig::new().layers_dims([10]),
)?);

// Train the classifier
classifier.train(
    &dataset, // The dataset to train on
    100,      // The number of epochs to train for
    0.0003,   // The learning rate
    50,       // The batch size
    |_| {},   // The callback to run as the model trains
)?;

// Run the classifier on some input
loop {
    let input = prompt_input("Input: ").unwrap();
    let embedding = bert.embed(input).await?;
    let output = classifier.run(embedding)?;
    println!("Output: {:?}", output);
}
§Implementations
§Trait Implementations
impl<'de> Deserialize<'de> for Embedding
fn deserialize<Des: Deserializer<'de>>(deserializer: Des) -> Result<Self, Des::Error>
impl IntoEmbedding for Embedding
Convert an embedding of the same vector space into an embedding with an embedding model.