Struct Embedding

Source
pub struct Embedding<S>
where S: VectorSpace,
{ /* private fields */ }

§Embeddings

Embeddings are a way to represent the meaning of text in a numerical format. They can be used to compare the meaning of two different texts, search for documents with an embedding database, or train classification models.

§Creating Embeddings

You can create embeddings from text using a Bert embedding model. You can call embed on a Bert instance to get an embedding for a single sentence or embed_batch to get embeddings for a list of sentences at once:

let mut bert = Bert::new().await.unwrap();
let sentences = vec![
    "Kalosm can be used to build local AI applications",
    "With private LLMs data never leaves your computer",
    "The quick brown fox jumps over the lazy dog",
];
let embeddings = bert.embed_batch(&sentences).await.unwrap();
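
You can also call embed on its own to get an embedding for a single sentence. A minimal sketch, reusing the bert instance created above:

// Embed a single sentence with the same Bert model
let embedding = bert
    .embed("Kalosm can be used to build local AI applications")
    .await
    .unwrap();
println!("{:?}", embedding);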

Once you have embeddings, you can compare them to each other with a distance metric. The cosine similarity is a common metric for comparing embeddings that measures the cosine of the angle between the two vectors:

// Find the cosine similarity between each pair of sentences
let n_sentences = sentences.len();
for (i, e_i) in embeddings.iter().enumerate() {
    for j in (i + 1)..n_sentences {
        let e_j = embeddings.get(j).unwrap();
        let cosine_similarity = e_j.cosine_similarity(e_i);
        println!("score: {cosine_similarity:.2} '{}' '{}'", sentences[i], sentences[j])
    }
}

You should see that the first two sentences are similar to each other, while the third sentence is not similar to either of the first two:

score: 0.82 'Kalosm can be used to build local AI applications' 'With private LLMs data never leaves your computer'
score: 0.72 'With private LLMs data never leaves your computer' 'The quick brown fox jumps over the lazy dog'
score: 0.72 'Kalosm can be used to build local AI applications' 'The quick brown fox jumps over the lazy dog'
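
The cosine of the angle between two vectors is their dot product divided by the product of their lengths, so you can recompute the score by hand from the raw floats returned by to_vec. This sketch is purely illustrative; the built-in cosine_similarity method is the normal way to compare embeddings:

// Recompute the similarity between the first two sentences manually
let a: Vec<f32> = embeddings.get(0).unwrap().to_vec();
let b: Vec<f32> = embeddings.get(1).unwrap().to_vec();
let dot: f32 = a.iter().zip(&b).map(|(x, y)| x * y).sum();
let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
println!("manual cosine similarity: {:.2}", dot / (norm_a * norm_b));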

§Searching for Similar Text

Embeddings can also be a powerful tool for search. Unlike traditional text-based search, searching with embeddings doesn’t directly look for keywords in the text. Instead, it looks for text with similar meanings, which can make search more robust and accurate.

In the previous example, we used the cosine similarity to find the similarity between two sentences. Even though the first two sentences have no words in common, their embeddings are similar because they have related meanings.

You can use a vector database to store (embedding, value) pairs in an easily searchable way. You can create a vector database with VectorDB::new:

// Create a good default Bert model for search
let bert = Bert::new_for_search().await.unwrap();
let sentences = [
    "Kalosm can be used to build local AI applications",
    "With private LLMs data never leaves your computer",
    "The quick brown fox jumps over the lazy dog",
];
// Embed sentences into the vector space
let embeddings = bert.embed_batch(sentences).await.unwrap();
println!("embeddings {:?}", embeddings);

// Create a vector database from the embeddings along with a map between the embedding ids and the sentences
let db = VectorDB::new().unwrap();
let embeddings = db.add_embeddings(embeddings).unwrap();
let embedding_id_to_sentence: HashMap<EmbeddingId, &str> =
    HashMap::from_iter(embeddings.into_iter().zip(sentences));

// Embed a query into the vector space. We use `embed_query` instead of `embed` because some models embed queries differently than normal text.
let embedding = bert.embed_query("What is Kalosm?").await.unwrap();
let closest = db.get_closest(embedding, 1).unwrap();
if let [closest] = closest.as_slice() {
    let distance = closest.distance;
    let text = embedding_id_to_sentence.get(&closest.value).unwrap();
    println!("distance: {distance}");
    println!("closest:  {text}");
}

The vector database should find that the closest sentence to “What is Kalosm?” is “Kalosm can be used to build local AI applications”:

distance: 0.18480265
closest: Kalosm can be used to build local AI applications
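
You can also ask for more than one result at a time. A small sketch that embeds the same query again and retrieves the two closest sentences, assuming the results carry the same distance and value fields used above:

// Retrieve the two closest sentences instead of just one
let embedding = bert.embed_query("What is Kalosm?").await.unwrap();
let closest = db.get_closest(embedding, 2).unwrap();
for result in &closest {
    let text = embedding_id_to_sentence.get(&result.value).unwrap();
    println!("distance: {} text: {}", result.distance, text);
}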

§Classification with Embeddings

Since embeddings represent something about the meaning of text, you can use them to quickly train classification models. Instead of training a whole new model to understand text and classify it, you can just train a classifier on top of a frozen embedding model.

Even with a relatively small dataset, a classifier built on top of an embedding model can achieve impressive results. Let's start by creating a dataset of questions and statements:

#[derive(Debug, Clone, Copy, Class)]
enum SentenceType {
    Question,
    Statement,
}
// Create a dataset for the classifier
let bert = Bert::builder()
    .with_source(BertSource::snowflake_arctic_embed_extra_small())
    .build()
    .await?;
let mut dataset = TextClassifierDatasetBuilder::<SentenceType, _>::new(&bert);
const QUESTIONS: [&str; 10] = [
    "What is the capital of France",
    "What is the capital of the United States",
    "What is the best way to learn a new language",
    "What is the best way to learn a new programming language",
    "What is a framework",
    "What is a library",
    "What is a good way to learn a new language",
    "What is a good way to learn a new programming language",
    "What is the city with the most people in the world",
    "What is the most spoken language in the world",
];
const STATEMENTS: [&str; 10] = [
    "The president of France is Emmanuel Macron",
    "The capital of France is Paris",
    "The capital of the United States is Washington, DC",
    "The light bulb was invented by Thomas Edison",
    "The best way to learn a new programming language is to start with the basics and gradually build on them",
    "A framework is a set of libraries and tools that help developers build applications",
    "A library is a collection of code that can be used by other developers",
    "A good way to learn a new language is to practice it every day",
    "The city with the most people in the world is Tokyo",
    "The most spoken language in the United States is English",
];

for question in QUESTIONS {
    dataset.add(question, SentenceType::Question).await?;
}
for statement in STATEMENTS {
    dataset.add(statement, SentenceType::Statement).await?;
}
let dev = accelerated_device_if_available()?;
let dataset = dataset.build(&dev)?;

Next, train a classifier on the dataset:

// Create a classifier
let mut classifier = TextClassifier::<SentenceType, BertSpace>::new(Classifier::new(
    &dev,
    ClassifierConfig::new().layers_dims([10]),
)?);

// Train the classifier
classifier.train(
    &dataset, // The dataset to train on
    &dev,     // The device to train on
    100,      // The number of epochs to train for
    0.0003,   // The learning rate
    50,       // The batch size
)?;

// Run the classifier on some input
loop {
    let input = prompt_input("Input: ").unwrap();
    let embedding = bert.embed(input).await?;
    let output = classifier.run(embedding)?;
    println!("Output: {:?}", output);
}
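
Instead of an interactive loop, you can also run the trained classifier directly on a few held-out sentences. The two inputs below are hypothetical examples that are not part of the training set:

// Check the classifier on sentences it has never seen
for sentence in ["Where is the Eiffel Tower", "The Eiffel Tower is in Paris"] {
    let embedding = bert.embed(sentence).await?;
    let output = classifier.run(embedding)?;
    println!("{} => {:?}", sentence, output);
}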

Implementations§

Source§

impl<S> Embedding<S>
where S: VectorSpace,

Source

pub fn cosine_similarity(&self, other: &Embedding<S>) -> f32

Compute the cosine similarity between this embedding and another embedding.

Source§

impl<S1> Embedding<S1>
where S1: VectorSpace,

Source

pub fn cast<S2>(self) -> Embedding<S2>
where S2: VectorSpace,

Cast this embedding to a different vector space.

Source§

impl<S> Embedding<S>
where S: VectorSpace,

Source

pub fn new(embedding: Tensor) -> Embedding<S>

Create a new embedding from a tensor.

Source

pub fn vector(&self) -> &Tensor

Get the tensor that represents this embedding.

Source

pub fn to_vec(&self) -> Vec<f32>

Get the tensor that represents this embedding as a Vec of floats.
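
For example, to_vec can be handy for checking the dimensionality of an embedding or handing the raw floats to other code. A minimal sketch, assuming a Bert model as in the examples above:

// Inspect the raw floats behind an embedding
let bert = Bert::new().await.unwrap();
let embedding = bert.embed("hello world").await.unwrap();
let raw: Vec<f32> = embedding.to_vec();
println!("dimensions: {}", raw.len());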

Trait Implementations§

Source§

impl<S> Add for Embedding<S>
where S: VectorSpace,

Source§

type Output = Embedding<S>

The resulting type after applying the + operator.
Source§

fn add(self, other: Embedding<S>) -> <Embedding<S> as Add>::Output

Performs the + operation. Read more
Source§

impl<S> Clone for Embedding<S>
where S: VectorSpace,

Source§

fn clone(&self) -> Embedding<S>

Returns a copy of the value. Read more
1.0.0 · Source§

fn clone_from(&mut self, source: &Self)

Performs copy-assignment from source. Read more
Source§

impl<S> Debug for Embedding<S>
where S: VectorSpace,

Source§

fn fmt(&self, f: &mut Formatter<'_>) -> Result<(), Error>

Formats the value using the given formatter. Read more
Source§

impl<'de, S> Deserialize<'de> for Embedding<S>
where S: VectorSpace,

Source§

fn deserialize<Des>(deserializer: Des) -> Result<Embedding<S>, <Des as Deserializer<'de>>::Error>
where Des: Deserializer<'de>,

Deserialize this value from the given Serde deserializer. Read more
Source§

impl<S> Div<f64> for Embedding<S>
where S: VectorSpace,

Source§

type Output = Embedding<S>

The resulting type after applying the / operator.
Source§

fn div(self, other: f64) -> <Embedding<S> as Div<f64>>::Output

Performs the / operation. Read more
Source§

impl<S, I> From<I> for Embedding<S>
where S: VectorSpace, I: IntoIterator<Item = f32>,

Source§

fn from(iter: I) -> Embedding<S>

Converts to this type from the input type.
Source§

impl<S> IntoEmbedding<S> for Embedding<S>
where S: VectorSpace,

Convert an embedding of the same vector space into an embedding with an embedding model.

Source§

async fn into_embedding<E>(self, _: &E) -> Result<Embedding<S>, Error>
where E: Embedder<VectorSpace = S>,

Convert the type into an embedding with the given embedding model.
Source§

async fn into_query_embedding<E>(self, _: &E) -> Result<Embedding<S>, Error>
where E: Embedder<VectorSpace = S>,

Convert the type into a query embedding with the given embedding model.
Source§

impl<S> Mul<f64> for Embedding<S>
where S: VectorSpace,

Source§

type Output = Embedding<S>

The resulting type after applying the * operator.
Source§

fn mul(self, other: f64) -> <Embedding<S> as Mul<f64>>::Output

Performs the * operation. Read more
Source§

impl<S> Serialize for Embedding<S>
where S: VectorSpace,

Source§

fn serialize<Ser>(&self, serializer: Ser) -> Result<<Ser as Serializer>::Ok, <Ser as Serializer>::Error>
where Ser: Serializer,

Serialize this value into the given Serde serializer. Read more
Source§

impl<S> Sub for Embedding<S>
where S: VectorSpace,

Source§

type Output = Embedding<S>

The resulting type after applying the - operator.
Source§

fn sub(self, other: Embedding<S>) -> <Embedding<S> as Sub>::Output

Performs the - operation. Read more
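
The Add, Sub, Mul<f64>, and Div<f64> implementations above, combined with From<I>, let you build embeddings from raw floats and combine them arithmetically, for example to average two embeddings. A minimal sketch; BertSpace stands in for any VectorSpace, and the three-dimensional vectors are purely illustrative:

// Build two embeddings from raw floats and average them
let a: Embedding<BertSpace> = vec![1.0f32, 0.0, 0.0].into();
let b: Embedding<BertSpace> = vec![0.0f32, 1.0, 0.0].into();
let average = (a + b) / 2.0;
println!("{:?}", average.to_vec());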

Auto Trait Implementations§

§

impl<S> Freeze for Embedding<S>

§

impl<S> !RefUnwindSafe for Embedding<S>

§

impl<S> Send for Embedding<S>

§

impl<S> Sync for Embedding<S>

§

impl<S> Unpin for Embedding<S>
where S: Unpin,

§

impl<S> !UnwindSafe for Embedding<S>

Blanket Implementations§

Source§

impl<T> Any for T
where T: 'static + ?Sized,

Source§

fn type_id(&self) -> TypeId

Gets the TypeId of self. Read more
Source§

impl<T> Borrow<T> for T
where T: ?Sized,

Source§

fn borrow(&self) -> &T

Immutably borrows from an owned value. Read more
Source§

impl<T> BorrowMut<T> for T
where T: ?Sized,

Source§

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value. Read more
Source§

impl<T> CloneToUninit for T
where T: Clone,

Source§

unsafe fn clone_to_uninit(&self, dest: *mut u8)

🔬This is a nightly-only experimental API. (clone_to_uninit)
Performs copy-assignment from self to dest. Read more
Source§

impl<T> From<T> for T

Source§

fn from(t: T) -> T

Returns the argument unchanged.

Source§

impl<T> Instrument for T

Source§

fn instrument(self, span: Span) -> Instrumented<Self>

Instruments this type with the provided Span, returning an Instrumented wrapper. Read more
Source§

fn in_current_span(self) -> Instrumented<Self>

Instruments this type with the current Span, returning an Instrumented wrapper. Read more
Source§

impl<T, U> Into<U> for T
where U: From<T>,

Source§

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

Source§

impl<T> IntoEither for T

Source§

fn into_either(self, into_left: bool) -> Either<Self, Self>

Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise. Read more
Source§

fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
where F: FnOnce(&Self) -> bool,

Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise. Read more
Source§

impl<T> Pointable for T

Source§

const ALIGN: usize

The alignment of pointer.
Source§

type Init = T

The type for initializers.
Source§

unsafe fn init(init: <T as Pointable>::Init) -> usize

Initializes a with the given initializer. Read more
Source§

unsafe fn deref<'a>(ptr: usize) -> &'a T

Dereferences the given pointer. Read more
Source§

unsafe fn deref_mut<'a>(ptr: usize) -> &'a mut T

Mutably dereferences the given pointer. Read more
Source§

unsafe fn drop(ptr: usize)

Drops the object pointed to by the given pointer. Read more
Source§

impl<T> ToOwned for T
where T: Clone,

Source§

type Owned = T

The resulting type after obtaining ownership.
Source§

fn to_owned(&self) -> T

Creates owned data from borrowed data, usually by cloning. Read more
Source§

fn clone_into(&self, target: &mut T)

Uses borrowed data to replace owned data, usually by cloning. Read more
Source§

impl<T, U> TryFrom<U> for T
where U: Into<T>,

Source§

type Error = Infallible

The type returned in the event of a conversion error.
Source§

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.
Source§

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

Source§

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.
Source§

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.
Source§

impl<V, T> VZip<V> for T
where V: MultiLane<T>,

Source§

fn vzip(self) -> V

Source§

impl<T> WithSubscriber for T

Source§

fn with_subscriber<S>(self, subscriber: S) -> WithDispatch<Self>
where S: Into<Dispatch>,

Attaches the provided Subscriber to this type, returning a WithDispatch wrapper. Read more
Source§

fn with_current_subscriber(self) -> WithDispatch<Self>

Attaches the current default Subscriber to this type, returning a WithDispatch wrapper. Read more
Source§

impl<T> DeserializeOwned for T
where T: for<'de> Deserialize<'de>,

Source§

impl<T> ErasedDestructor for T
where T: 'static,