glowrs 0.1.2

SentenceTransformers for candle-rs

Library Usage

The glowrs library provides an easy and familiar interface for using pre-trained transformer models to compute sentence embeddings and sentence similarity.

Example

use glowrs::SentenceTransformer;

fn main() {
    // Download (or load from cache) the model from the Hugging Face Hub.
    let encoder = SentenceTransformer::from_repo_string("sentence-transformers/all-MiniLM-L6-v2").unwrap();

    let sentences = vec![
        "Hello, how are you?",
        "Hey, how are you doing?"
    ];

    // Encode all sentences in a single batch; the boolean flag toggles
    // embedding normalization.
    let embeddings = encoder.encode_batch(sentences, true).unwrap();

    println!("{:?}", embeddings);
}
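
Since the library also targets sentence similarity, a natural next step is to compare the two embeddings. Below is a minimal sketch, assuming encode_batch returns a candle Tensor of shape (n_sentences, dim) that can be copied out with to_vec2 (both assumptions based on the Debug print above, not documented API):

use glowrs::SentenceTransformer;

fn main() {
    let encoder = SentenceTransformer::from_repo_string("sentence-transformers/all-MiniLM-L6-v2")
        .unwrap();

    let embeddings = encoder
        .encode_batch(vec!["Hello, how are you?", "Hey, how are you doing?"], true)
        .unwrap();

    // Copy the tensor into plain Rust vectors: one Vec<f32> per sentence.
    // (Assumes the candle Tensor API; adjust if encode_batch returns another type.)
    let rows: Vec<Vec<f32>> = embeddings.to_vec2().unwrap();

    // Cosine similarity between the two sentence embeddings.
    let dot: f32 = rows[0].iter().zip(&rows[1]).map(|(a, b)| a * b).sum();
    let norm = |v: &[f32]| v.iter().map(|x| x * x).sum::<f32>().sqrt();
    let cosine = dot / (norm(&rows[0]) * norm(&rows[1]));

    println!("cosine similarity: {cosine}");
}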

Features

  • Load models from Hugging Face Hub
  • More to come!

Server Usage

glowrs also provides a web server for sentence embedding inference, using candle as the tensor framework. It currently supports BERT-type models hosted on the Hugging Face Hub, such as those provided by sentence-transformers, Tom Aarsen, or Jina AI, as long as they provide safetensors model weights.

Example usage with the jina-embeddings-v2-base-en model:

cargo run --bin server --release -- --model-repo jinaai/jina-embeddings-v2-base-en

If you want to use a specific revision of the model, you can append it to the repository name like so:

cargo run --bin server --release -- --model-repo jinaai/jina-embeddings-v2-base-en:main

The SentenceTransformer will attempt to infer the model type from the model name. If it fails, you can specify the model type like so:

cargo run --bin server --release -- --model-repo jinaai/jina-embeddings-v2-base-en:main:bert

Currently, the bert and jinabert model types are supported.
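
The same repo string syntax presumably also works when loading a model directly through the library, given that the loader is named from_repo_string; this is an assumption, since the library example above only shows a bare repository name:

use glowrs::SentenceTransformer;

fn main() {
    // Pin the revision and force the model type explicitly, assuming
    // from_repo_string accepts the same repo:revision:model-type syntax
    // as the server's --model-repo flag.
    let encoder = SentenceTransformer::from_repo_string("jinaai/jina-embeddings-v2-base-en:main:bert").unwrap();

    let embeddings = encoder.encode_batch(vec!["Hello, how are you?"], true).unwrap();
    println!("{:?}", embeddings);
}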

If you want to serve multiple models, you can pass multiple model repositories to a single server instance:

cargo run --bin server --release -- --model-repo jinaai/jina-embeddings-v2-base-en sentence-transformers/paraphrase-multilingual-mpnet-base-v2

Warning: this is currently not supported with Metal acceleration.

Instructions:

Usage: server [OPTIONS]

Options:
  -m, --model-repo <MODEL_REPO>  
  -r, --revision <REVISION>      [default: main]
  -h, --help                     Print help

Build features

  • metal: Compile with Metal acceleration
  • cuda: Compile with CUDA acceleration
  • accelerate: Compile with Accelerate framework acceleration (CPU)
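
For example, to compile and run the server with Metal acceleration, use standard cargo feature syntax with one of the feature names listed above:

cargo run --bin server --release --features metal -- --model-repo jinaai/jina-embeddings-v2-base-en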

Features

  • OpenAI API compatible (/v1/embeddings) REST API endpoint
  • candle inference for bert and jina-bert models
  • Hardware acceleration (Metal for now)
  • Queueing
  • Multiple models
  • Batching
  • Performance metrics

curl

curl -X POST http://localhost:3000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "input": ["The food was delicious and the waiter...", "was too"], 
    "model": "jina-embeddings-v2-base-en",
    "encoding_format": "float"
  }'

Python OpenAI client

Install the OpenAI Python library:

pip install openai

Then use the embeddings endpoint as you normally would:

from openai import OpenAI
from time import time

# Point the client at the local glowrs server; the API key is a placeholder.
client = OpenAI(
    api_key="sk-something",
    base_url="http://127.0.0.1:3000/v1"
)

start = time()
print(client.embeddings.create(
    input=["This is a sentence that requires an embedding"] * 50,
    model="jinaai/jina-embeddings-v2-base-en"
))

print(f"Done in {time() - start:.2f}s")

# List the models served by this instance
print(client.models.list())

Details

  • Use TOKIO_WORKER_THREADS to set the number of tokio worker threads per queue; for example:
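
TOKIO_WORKER_THREADS=4 cargo run --bin server --release -- --model-repo jinaai/jina-embeddings-v2-base-en

By default, tokio's multi-threaded runtime uses one worker thread per CPU core; this environment variable overrides that default.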

Disclaimer

This is still a work in progress. Embedding performance is decent but could use proper benchmarking. Furthermore, at larger batch sizes the program is currently killed due to a known bug.

Do not use this in a production environment.