vv-llm-rs
Universal LLM client layer for Rust. One typed API for chat, streaming, embeddings, rerank, multimodal messages, tool calls, and vendor endpoint resolution.
[dependencies]
vv-llm = { path = "crates/vv-llm" }
The crate is published as vv-llm; Rust code imports it as vv_llm.
Supported Backends
OpenAI-compatible chat works with OpenAI, DeepSeek, Qwen, Gemini OpenAI-compatible endpoints, ZhiPuAI, Groq, Mistral, Moonshot, MiniMax, Yi, Baichuan, StepFun, xAI, Ernie, local OpenAI-compatible servers, and similar /v1/chat/completions APIs.
Native transports are also available for:
- Anthropic Messages API
- Anthropic on AWS Bedrock through Bedrock Converse
- OpenAI-compatible models on Google Vertex AI with automatic Google access-token exchange
- OpenAI-compatible embedding APIs
- JSON HTTP rerank APIs such as SiliconFlow rerank
Quick Start
Direct Client
use ;
async
Settings-Based Client
Use LlmSettings when models and endpoints should come from a shared configuration file.
use ;
async
Minimal settings shape:
Endpoint bindings may be strings or objects. Object bindings can override the provider model id and can be disabled:
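The string-or-object distinction can be pictured as a two-variant enum. The sketch below is an illustration only, not the crate's actual settings type: `EndpointBinding`, `endpoint_id`, `model_id`, `enabled`, and `resolve` are hypothetical names, and the real JSON keys may differ.

```rust
/// Hypothetical stand-in for an endpoint binding; not the crate's real type.
#[derive(Debug, Clone, PartialEq)]
enum EndpointBinding {
    /// String form: just the endpoint id.
    Id(String),
    /// Object form: may override the provider model id or be switched off.
    Detailed {
        endpoint_id: String,
        model_id: Option<String>,
        enabled: bool,
    },
}

/// Returns `(endpoint_id, provider_model_id)` for the first usable binding,
/// falling back to `default_model_id` when a binding carries no override.
fn resolve<'a>(
    bindings: &'a [EndpointBinding],
    default_model_id: &'a str,
) -> Option<(&'a str, &'a str)> {
    bindings.iter().find_map(|binding| match binding {
        EndpointBinding::Id(id) => Some((id.as_str(), default_model_id)),
        // Disabled object bindings are skipped entirely.
        EndpointBinding::Detailed { enabled: false, .. } => None,
        EndpointBinding::Detailed { endpoint_id, model_id, .. } => Some((
            endpoint_id.as_str(),
            model_id.as_deref().unwrap_or(default_model_id),
        )),
    })
}
```

Picking the first enabled binding is an assumption made for the sketch; the crate may apply a different selection rule.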
Streaming
create_stream returns normalized ChatStreamDelta values. Text deltas, tool-call deltas, usage, completion state, and supported reasoning deltas use the same Rust type across providers.
use StreamExt;
use ;
let mut stream = client
    .create_stream(request)
    .await?;
while let Some(item) = stream.next().await {
    // handle each normalized ChatStreamDelta here
}
OpenAI-compatible streams normalize content, tool calls, usage chunks, and tagged reasoning such as <think>...</think> or Gemini <thought>...</thought>. Anthropic Bedrock streams normalize text, tool use, reasoning, and usage events. The direct Anthropic SDK path currently exposes text streaming only because the upstream Rust crate does not expose tool/thinking stream request fields.
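To make the tagged-reasoning idea concrete, here is a minimal sketch that separates a `<think>...</think>` or `<thought>...</thought>` span from the visible text of a complete response string. It is not the crate's stream normalizer: `split_tagged_reasoning` is a hypothetical helper, it handles only the first tag pair, and it ignores the harder case where a tag is split across stream chunks.

```rust
/// Splits the first `<tag>...</tag>` span out of a complete response string.
/// Returns `(visible_content, reasoning)`.
fn split_tagged_reasoning(text: &str, tag: &str) -> (String, Option<String>) {
    let open = format!("<{tag}>");
    let close = format!("</{tag}>");

    let Some(start) = text.find(&open) else {
        return (text.to_string(), None);
    };
    let body_start = start + open.len();
    let Some(len) = text[body_start..].find(&close) else {
        // Unterminated tag: leave the text untouched rather than guess.
        return (text.to_string(), None);
    };

    let reasoning = text[body_start..body_start + len].trim().to_string();

    // Visible content is everything before the opening tag plus everything
    // after the closing tag.
    let mut content = String::with_capacity(text.len());
    content.push_str(&text[..start]);
    content.push_str(&text[body_start + len + close.len()..]);
    (content.trim().to_string(), Some(reasoning))
}
```

A streaming implementation would additionally need to buffer partial tags between chunks, which this sketch deliberately leaves out.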
Tool Calls
use ;
let request = ChatRequest ;
let response = client.create_completion.await?;
for call in response.tool_calls
Tool-result turns use MessageRole::Tool with tool_call_id, and assistant tool-call turns use Message.tool_calls.
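The turn ordering this implies (an assistant turn issues tool calls, then tool turns answer them by id) can be checked with a small validator. The types below are local stand-ins that only borrow the field names mentioned above; `Role`, `Turn`, and `tool_turns_are_consistent` are hypothetical and the crate's real `Message` and `MessageRole` may be shaped differently.

```rust
use std::collections::HashSet;

/// Local stand-in for a message role; not the crate's `MessageRole`.
#[derive(Debug, Clone, PartialEq)]
enum Role {
    User,
    Assistant,
    Tool,
}

/// Local stand-in for a conversation turn; not the crate's `Message`.
#[derive(Debug, Clone)]
struct Turn {
    role: Role,
    content: String,
    tool_calls: Vec<String>,      // ids of calls requested by the assistant
    tool_call_id: Option<String>, // set only on tool-result turns
}

/// Checks that each tool-result turn answers a call id that an earlier
/// assistant turn issued, and that no id is answered twice.
fn tool_turns_are_consistent(turns: &[Turn]) -> bool {
    let mut pending: HashSet<&str> = HashSet::new();
    for turn in turns {
        match turn.role {
            Role::Assistant => pending.extend(turn.tool_calls.iter().map(String::as_str)),
            Role::Tool => match turn.tool_call_id.as_deref() {
                // `remove` returns true only if the id was still pending.
                Some(id) if pending.remove(id) => {}
                _ => return false,
            },
            Role::User => {}
        }
    }
    true
}
```

Running a check like this before sending a request is one way to catch a tool result whose id does not match any assistant tool call.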
Multimodal Input
Text and image parts can be mixed in a user message. Image URLs should be data URLs for providers that require inline base64 payloads.
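For providers that need inline payloads, a data URL has the form `data:<mime>;base64,<payload>`. The sketch below builds one using a hand-written RFC 4648 encoder so it has no dependencies. `base64_encode` and `image_data_url` are hypothetical helpers, not part of vv_llm; in a real project an established base64 crate would be the usual choice.

```rust
const ALPHABET: &[u8; 64] =
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/// Standard (RFC 4648) base64 with `=` padding.
fn base64_encode(bytes: &[u8]) -> String {
    let mut out = String::with_capacity((bytes.len() + 2) / 3 * 4);
    for chunk in bytes.chunks(3) {
        // Missing trailing bytes are treated as zero and padded with '='.
        let b = [
            chunk[0],
            *chunk.get(1).unwrap_or(&0),
            *chunk.get(2).unwrap_or(&0),
        ];
        let n = (u32::from(b[0]) << 16) | (u32::from(b[1]) << 8) | u32::from(b[2]);
        out.push(ALPHABET[(n >> 18) as usize & 63] as char);
        out.push(ALPHABET[(n >> 12) as usize & 63] as char);
        out.push(if chunk.len() > 1 { ALPHABET[(n >> 6) as usize & 63] as char } else { '=' });
        out.push(if chunk.len() > 2 { ALPHABET[n as usize & 63] as char } else { '=' });
    }
    out
}

/// Builds a data URL such as `data:image/png;base64,...` from raw bytes.
fn image_data_url(mime: &str, bytes: &[u8]) -> String {
    format!("data:{mime};base64,{}", base64_encode(bytes))
}
```

The resulting string can then be used wherever the message's image part expects a URL.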
use ;
let message = Message ;
Embeddings And Rerank
use ;
let embedding_client = create_embedding_client;
let embeddings = embedding_client
.create_embeddings
.await?;
println!;
let rerank_client = new;
let rerank = rerank_client
.rerank
.await?;
println!;
Vertex AI And Bedrock
Vertex OpenAI-compatible endpoints are configured with endpoint_type: "openai_vertex" and Google credentials. User refresh-token credentials and service-account credentials are supported.
Anthropic Bedrock endpoints are configured with endpoint_type: "anthropic_bedrock", AWS region, and AWS credentials.
Features
- Unified chat API: one ChatClient trait for completions and streaming
- Settings resolution: load model catalogs, endpoint bindings, provider ids, and transport metadata from JSON
- OpenAI-compatible adapters: chat and embeddings through async-openai
- Anthropic support: direct Messages API plus Bedrock Converse transport
- Streaming normalization: provider stream events become ChatStreamDelta
- Tool calling: normalized function/tool definitions, assistant tool calls, and tool-result turns
- Multimodal messages: text and image parts for supported providers
- Vertex authentication: Google access-token exchange with in-process cache
- Retrieval clients: OpenAI-compatible embeddings and custom JSON rerank
- Token counting: tiktoken-based counts for GPT-3.5, GPT-4o, o1, and o3 families with deterministic fallback
- Typed errors: configuration, provider, HTTP, serialization, model, and endpoint errors
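As an illustration of what an in-process access-token cache involves, here is a minimal sketch that reuses a token until shortly before it expires. It is not the crate's Vertex implementation: `TokenCache`, `skew`, and `get_or_refresh` are hypothetical names, the token exchange is represented by a plain closure, and a production cache would also need to be async-aware and shareable across threads.

```rust
use std::time::{Duration, Instant};

/// Hypothetical in-process cache for a short-lived access token.
struct TokenCache {
    cached: Option<(String, Instant)>, // token and its expiry instant
    skew: Duration,                    // refresh this long before expiry
}

impl TokenCache {
    fn new(skew: Duration) -> Self {
        Self { cached: None, skew }
    }

    /// Returns the cached token while it is still fresh; otherwise calls
    /// `fetch`, which must return a new token and its lifetime.
    fn get_or_refresh<F>(&mut self, now: Instant, fetch: F) -> String
    where
        F: FnOnce() -> (String, Duration),
    {
        if let Some((token, expires_at)) = &self.cached {
            if now + self.skew < *expires_at {
                return token.clone();
            }
        }
        let (token, ttl) = fetch();
        self.cached = Some((token.clone(), now + ttl));
        token
    }
}
```

Passing `now` explicitly keeps the sketch easy to test; the safety margin in `skew` avoids handing out a token that expires mid-request.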
Utilities
use ;
| Function | Description |
|---|---|
| normalize_text_messages | Merge adjacent same-role text messages without merging images or tool data |
| count_tokens | Count tokens with supported model tokenizers |
| count_tokens_fallback | Deterministic whitespace fallback counter |
| RetryPolicy | Small retry metadata helper for callers that manage retries externally |
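The behavior described for message normalization and the whitespace fallback can be sketched with standard-library code alone. Everything here is an assumption made for illustration: `SimpleMessage`, `merge_adjacent_text`, and `count_tokens_whitespace` are hypothetical, the `text_only` flag stands in for "has no images or tool data", and joining merged text with a newline is a guess rather than documented behavior.

```rust
/// Hypothetical simplified message; the crate's real type carries more data.
#[derive(Debug, Clone, PartialEq)]
struct SimpleMessage {
    role: String,
    text: String,
    text_only: bool, // false when the message carries images or tool data
}

/// Merges adjacent messages that share a role, but only when both are
/// text-only. Messages with images or tool data are never merged.
fn merge_adjacent_text(messages: Vec<SimpleMessage>) -> Vec<SimpleMessage> {
    let mut merged: Vec<SimpleMessage> = Vec::with_capacity(messages.len());
    for message in messages {
        let can_merge = merged.last().map_or(false, |prev| {
            prev.text_only && message.text_only && prev.role == message.role
        });
        if can_merge {
            if let Some(prev) = merged.last_mut() {
                prev.text.push('\n'); // separator is an assumption
                prev.text.push_str(&message.text);
            }
        } else {
            merged.push(message);
        }
    }
    merged
}

/// Deterministic fallback: counts whitespace-separated chunks.
fn count_tokens_whitespace(text: &str) -> usize {
    text.split_whitespace().count()
}
```

A whitespace count only approximates real tokenizer output, but it is stable across runs and needs no model-specific vocabulary.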
Project Structure
vv-llm-rs/
  Cargo.toml
  crates/vv-llm/
    src/
      chat_clients/       # Chat clients, stream normalization, Vertex auth
      embedding_clients/  # OpenAI-compatible embedding client
      rerank_clients/     # Custom JSON HTTP rerank client
      settings.rs         # Settings parsing and model resolution
      types.rs            # Public request/response/error types
      utilities/          # Message normalization, token counting, retry metadata
    tests/
      fixtures/           # Sample settings and live-test assets
Development
Run checks from the workspace root:
Live integration tests are ignored by default. Put real credentials in crates/vv-llm/tests/fixtures/dev_settings.json, or set VV_LLM_SETTINGS_JSON, then run:
VV_LLM_RUN_LIVE_TESTS=1
Engineering documentation lives in docs/. Start there for architecture notes, provider adapter behavior, live-test policy, security rules, and maintenance workflows.
Releases are published to crates.io by the tag workflow documented in docs/RELEASE.md.
License
MIT