pub struct Embedding<S>
where
    S: VectorSpace,
{ /* private fields */ }
§Embeddings
Embeddings represent the meaning of text in a numerical format. They can be used to compare the meaning of two different texts, search for documents with an embedding database, or train classification models.
§Creating Embeddings
You can create embeddings from text using a Bert embedding model. Call embed on a Bert instance to get an embedding for a single sentence, or embed_batch to get embeddings for a list of sentences at once:
use kalosm::language::*;

let mut bert = Bert::new().await.unwrap();
let sentences = vec![
    "Kalosm can be used to build local AI applications",
    "With private LLMs data never leaves your computer",
    "The quick brown fox jumps over the lazy dog",
];
let embeddings = bert.embed_batch(&sentences).await.unwrap();
Once you have embeddings, you can compare them to each other with a distance metric. The cosine similarity is a common metric for comparing embeddings that measures the cosine of the angle between the two vectors:
// Find the cosine similarity between each pair of sentences
let n_sentences = sentences.len();
for (i, e_i) in embeddings.iter().enumerate() {
    for j in (i + 1)..n_sentences {
        let e_j = embeddings.get(j).unwrap();
        let cosine_similarity = e_j.cosine_similarity(e_i);
        println!("score: {cosine_similarity:.2} '{}' '{}'", sentences[i], sentences[j]);
    }
}
You should see that the first two sentences are similar to each other, while the third sentence is not similar to either of the first two:
score: 0.82 'Kalosm can be used to build local AI applications' 'With private LLMs data never leaves your computer'
score: 0.72 'With private LLMs data never leaves your computer' 'The quick brown fox jumps over the lazy dog'
score: 0.72 'Kalosm can be used to build local AI applications' 'The quick brown fox jumps over the lazy dog'
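For reference, the cosine similarity of two vectors is their dot product divided by the product of their magnitudes, so it ranges from -1 (opposite directions) to 1 (same direction). A minimal sketch of the computation over plain f32 slices, illustrative only and not Kalosm's internal implementation:

// Cosine similarity between two vectors: dot(a, b) / (|a| * |b|).
// Reference sketch over plain slices; Embedding::cosine_similarity
// computes the same quantity on its own representation.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (norm_a * norm_b)
}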
§Searching for Similar Text
Embeddings can also be a powerful tool for search. Unlike traditional text-based search, searching with embeddings doesn’t look directly for keywords in the text. Instead, it looks for text with similar meanings, which can make search more robust and accurate.
In the previous example, we used the cosine similarity to find the similarity between two sentences. Even though the first two sentences have no words in common, their embeddings are similar because they have related meanings.
You can use a vector database to store embedding/value pairs in an easily searchable way. You can create a vector database with VectorDB::new:
use kalosm::language::*;
use std::collections::HashMap;

// Create a good default Bert model for search
let bert = Bert::new_for_search().await.unwrap();
let sentences = [
    "Kalosm can be used to build local AI applications",
    "With private LLMs data never leaves your computer",
    "The quick brown fox jumps over the lazy dog",
];
// Embed sentences into the vector space
let embeddings = bert.embed_batch(sentences).await.unwrap();
println!("embeddings {:?}", embeddings);
// Create a vector database from the embeddings along with a map between the embedding ids and the sentences
let db = VectorDB::new().unwrap();
let embeddings = db.add_embeddings(embeddings).unwrap();
let embedding_id_to_sentence: HashMap<EmbeddingId, &str> =
    HashMap::from_iter(embeddings.into_iter().zip(sentences));
// Embed a query into the vector space. We use `embed_query` instead of `embed`
// because some models embed queries differently than normal text.
let embedding = bert.embed_query("What is Kalosm?").await.unwrap();
let closest = db.get_closest(embedding, 1).unwrap();
if let [closest] = closest.as_slice() {
    let distance = closest.distance;
    let text = embedding_id_to_sentence.get(&closest.value).unwrap();
    println!("distance: {distance}");
    println!("closest: {text}");
}
The vector database should find that the closest sentence to “What is Kalosm?” is “Kalosm can be used to build local AI applications”:
distance: 0.18480265
closest: Kalosm can be used to build local AI applications
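Under the hood, this lookup is a nearest-neighbor search over the stored vectors. The sketch below shows the brute-force version of the idea over plain vectors; it is illustrative only, since a real vector database like VectorDB uses an index rather than a linear scan to stay fast as the collection grows:

// Brute-force nearest-neighbor search: scan every stored vector and
// return the index of the one closest to the query. Squared Euclidean
// distance here; cosine distance works the same way.
fn closest(query: &[f32], stored: &[Vec<f32>]) -> Option<usize> {
    fn squared_distance(a: &[f32], b: &[f32]) -> f32 {
        a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
    }
    stored
        .iter()
        .map(|v| squared_distance(query, v))
        .enumerate()
        .min_by(|(_, a), (_, b)| a.partial_cmp(b).unwrap())
        .map(|(i, _)| i)
}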
§Classification with Embeddings
Since embeddings represent something about the meaning of text, you can use them to quickly train classification models. Instead of training a whole new model to understand text and classify it, you can just train a classifier on top of a frozen embedding model.
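The idea is that the embedding model stays frozen and only a small head on top of it is trained. As a rough, hand-rolled illustration (not Kalosm's Classifier), a classification head can be as simple as one linear layer whose highest-scoring row picks the class:

// A linear classification head over frozen embeddings: one weight row
// per class, score = w · embedding + bias, prediction = argmax of scores.
struct LinearHead {
    weights: Vec<Vec<f32>>, // one row of length embedding_dim per class
    biases: Vec<f32>,       // one bias per class
}

impl LinearHead {
    fn predict(&self, embedding: &[f32]) -> usize {
        self.weights
            .iter()
            .zip(&self.biases)
            // Score each class with a dot product plus bias
            .map(|(w, b)| w.iter().zip(embedding).map(|(wi, xi)| wi * xi).sum::<f32>() + b)
            .enumerate()
            // Pick the class with the highest score
            .max_by(|(_, a), (_, b)| a.partial_cmp(b).unwrap())
            .map(|(i, _)| i)
            .unwrap()
    }
}

Training only has to fit the weights and biases of this small head, which is why a handful of examples can be enough.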
Even with a relatively small dataset, a classifier built on top of an embedding model can achieve impressive results. Let’s start by creating a dataset of questions and statements:
#[derive(Debug, Clone, Copy, Class)]
enum SentenceType {
    Question,
    Statement,
}

// Create a dataset for the classifier
let bert = Bert::builder()
    .with_source(BertSource::snowflake_arctic_embed_extra_small())
    .build()
    .await?;
let mut dataset = TextClassifierDatasetBuilder::<SentenceType, _>::new(&bert);

const QUESTIONS: [&str; 10] = [
    "What is the capital of France",
    "What is the capital of the United States",
    "What is the best way to learn a new language",
    "What is the best way to learn a new programming language",
    "What is a framework",
    "What is a library",
    "What is a good way to learn a new language",
    "What is a good way to learn a new programming language",
    "What is the city with the most people in the world",
    "What is the most spoken language in the world",
];

const STATEMENTS: [&str; 10] = [
    "The president of France is Emmanuel Macron",
    "The capital of France is Paris",
    "The capital of the United States is Washington, DC",
    "The light bulb was invented by Thomas Edison",
    "The best way to learn a new programming language is to start with the basics and gradually build on them",
    "A framework is a set of libraries and tools that help developers build applications",
    "A library is a collection of code that can be used by other developers",
    "A good way to learn a new language is to practice it every day",
    "The city with the most people in the world is Tokyo",
    "The most spoken language in the United States is English",
];

for question in QUESTIONS {
    dataset.add(question, SentenceType::Question).await?;
}
for statement in STATEMENTS {
    dataset.add(statement, SentenceType::Statement).await?;
}

let dev = accelerated_device_if_available()?;
let dataset = dataset.build(&dev)?;
Next, train a classifier on the dataset:
// Create a classifier
let mut classifier = TextClassifier::<SentenceType, BertSpace>::new(Classifier::new(
    &dev,
    ClassifierConfig::new().layers_dims([10]),
)?);

// Train the classifier
classifier.train(
    &dataset, // The dataset to train on
    &dev,     // The device to train on
    100,      // The number of epochs to train for
    0.0003,   // The learning rate
    50,       // The batch size
)?;

// Run the classifier on some input
loop {
    let input = prompt_input("Input: ").unwrap();
    let embedding = bert.embed(input).await?;
    let output = classifier.run(embedding)?;
    println!("Output: {:?}", output);
}
Implementations§
impl<S> Embedding<S> where S: VectorSpace
pub fn cosine_similarity(&self, other: &Embedding<S>) -> f32
Compute the cosine similarity between this embedding and another embedding.
impl<S1> Embedding<S1> where S1: VectorSpace
pub fn cast<S2>(self) -> Embedding<S2> where S2: VectorSpace
Cast this embedding to a different vector space.
Trait Implementations§
impl<S> Add for Embedding<S> where S: VectorSpace
impl<S> Clone for Embedding<S> where S: VectorSpace
impl<S> Debug for Embedding<S> where S: VectorSpace
impl<'de, S> Deserialize<'de> for Embedding<S> where S: VectorSpace
fn deserialize<Des>(deserializer: Des) -> Result<Embedding<S>, <Des as Deserializer<'de>>::Error> where Des: Deserializer<'de>
impl<S> IntoEmbedding<S> for Embedding<S> where S: VectorSpace
Convert an embedding of the same vector space into an embedding with an embedding model.
impl<S> Serialize for Embedding<S> where S: VectorSpace
fn serialize<Ser>(&self, serializer: Ser) -> Result<<Ser as Serializer>::Ok, <Ser as Serializer>::Error> where Ser: Serializer
Auto Trait Implementations§
impl<S> Freeze for Embedding<S>
impl<S> !RefUnwindSafe for Embedding<S>
impl<S> Send for Embedding<S>
impl<S> Sync for Embedding<S>
impl<S> Unpin for Embedding<S> where S: Unpin
impl<S> !UnwindSafe for Embedding<S>
Blanket Implementations§
impl<T> BorrowMut<T> for T where T: ?Sized
fn borrow_mut(&mut self) -> &mut T
impl<T> CloneToUninit for T where T: Clone
impl<T> Instrument for T
fn instrument(self, span: Span) -> Instrumented<Self>
fn in_current_span(self) -> Instrumented<Self>
impl<T> IntoEither for T
fn into_either(self, into_left: bool) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.
fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.