
Vllora LLM crate (vllora_llm)

This crate powers the Vllora AI Gateway’s LLM layer. It provides:

  • Unified chat-completions client over multiple providers (OpenAI-compatible, Anthropic, Gemini, Bedrock, …)
  • Gateway-native types (ChatCompletionRequest, ChatCompletionMessage, routing & tools support)
  • Streaming responses and telemetry hooks via a common ModelInstance trait
  • Tracing integration: out-of-the-box tracing support, with a console example in llm/examples/tracing (spans/events to stdout) and an OTLP example in llm/examples/tracing_otlp (send spans to external collectors such as New Relic)
  • Supported parameters: see the Supported parameters section below for a detailed table of which parameters each provider honors

Use it when you want to talk to the gateway’s LLM engine from Rust code, without worrying about provider-specific SDKs.


Installation

Run cargo add vllora_llm or add to your Cargo.toml:

[dependencies]
vllora_llm = "0.1"

Quick start

Here's a minimal example to get started:

use vllora_llm::client::VlloraLLMClient;
use vllora_llm::types::gateway::{ChatCompletionRequest, ChatCompletionMessage};
use vllora_llm::error::LLMResult;

#[tokio::main]
async fn main() -> LLMResult<()> {
    // 1) Build a chat completion request using gateway-native types
    let request = ChatCompletionRequest {
        model: "gpt-4.1-mini".to_string(),
        messages: vec![
            ChatCompletionMessage::new_text(
                "system".to_string(),
                "You are a helpful assistant.".to_string(),
            ),
            ChatCompletionMessage::new_text(
                "user".to_string(),
                "Stream numbers 1 to 20 in separate lines.".to_string(),
            ),
        ],
        ..Default::default()
    };

    // 2) Construct a VlloraLLMClient
    let client = VlloraLLMClient::new();

    // 3) Non-streaming: send the request and handle the final reply
    let response = client
        .completions()
        .create(request.clone())
        .await?;
    
    // ... handle response
    Ok(())
}

Note: By default, VlloraLLMClient::new() fetches API keys from environment variables following the pattern VLLORA_{PROVIDER_NAME}_API_KEY. For example, for OpenAI, it will look for VLLORA_OPENAI_API_KEY.
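
The provider name in that pattern is presumably upper-cased, so Anthropic would resolve to VLLORA_ANTHROPIC_API_KEY. A quick pre-flight check using only the standard library (the variable names here just follow the pattern above; nothing crate-specific):

// Optional pre-flight check before calling VlloraLLMClient::new().
// Variable names follow the VLLORA_{PROVIDER_NAME}_API_KEY pattern above.
for var in ["VLLORA_OPENAI_API_KEY", "VLLORA_ANTHROPIC_API_KEY"] {
    if std::env::var(var).is_err() {
        eprintln!("note: {var} is not set");
    }
}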


Quick start with async-openai-compatible types

If you already build OpenAI-compatible requests (e.g. via async-openai-compat), you can send both non‑streaming and streaming completions through VlloraLLMClient.

use async_openai::types::{
    ChatCompletionRequestMessage,
    ChatCompletionRequestSystemMessageArgs,
    ChatCompletionRequestUserMessageArgs,
    CreateChatCompletionRequestArgs,
};
use tokio_stream::StreamExt;

use vllora_llm::client::VlloraLLMClient;
use vllora_llm::error::LLMResult;
use vllora_llm::types::credentials::{ApiKeyCredentials, Credentials};

#[tokio::main]
async fn main() -> LLMResult<()> {
    // 1) Build an OpenAI-style request using async-openai-compatible types
    let openai_req = CreateChatCompletionRequestArgs::default()
        .model("gpt-4.1-mini")
        .messages([
            ChatCompletionRequestMessage::System(
                ChatCompletionRequestSystemMessageArgs::default()
                    .content("You are a helpful assistant.")
                    .build()?,
            ),
            ChatCompletionRequestMessage::User(
                ChatCompletionRequestUserMessageArgs::default()
                    .content("Stream numbers 1 to 20 in separate lines.")
                    .build()?,
            ),
        ])
        .build()?;

    // 2) Construct a VlloraLLMClient (here: direct OpenAI key)
    let client = VlloraLLMClient::new().with_credentials(Credentials::ApiKey(
        ApiKeyCredentials {
            api_key: std::env::var("VLLORA_OPENAI_API_KEY")
                .expect("VLLORA_OPENAI_API_KEY must be set"),
        },
    ));

    // 3) Non-streaming: send the request and print the final reply
    let response = client
        .completions()
        .create(openai_req.clone())
        .await?;

    if let Some(content) = &response.message().content {
        if let Some(text) = content.as_string() {
            println!("Non-streaming reply:\\n{text}");
        }
    }

    // 4) Streaming: send the same request and print chunks as they arrive
    let mut stream = client
        .completions()
        .create_stream(openai_req)
        .await?;

    while let Some(chunk) = stream.next().await {
        let chunk = chunk?;
        for choice in chunk.choices {
            if let Some(delta) = choice.delta.content {
                print!("{delta}");
            }
        }
    }

    Ok(())
}

Basic usage: completions client (gateway-native)

The main entrypoint is VlloraLLMClient, which gives you a CompletionsClient for chat completions using the gateway-native request/response types.

use std::sync::Arc;

use vllora_llm::client::{VlloraLLMClient, ModelInstance, DummyModelInstance};
use vllora_llm::types::gateway::{ChatCompletionRequest, ChatCompletionMessage};
use vllora_llm::error::LLMResult;

#[tokio::main]
async fn main() -> LLMResult<()> {
    // In production you would pass a real ModelInstance implementation
    // that knows how to call your configured providers / router.
    let instance: Arc<Box<dyn ModelInstance>> = Arc::new(Box::new(DummyModelInstance {}));

    // Build the high-level client
    let client = VlloraLLMClient::new_with_instance(instance);

    // Build a simple chat completion request
    let request = ChatCompletionRequest {
        model: "gpt-4.1-mini".to_string(), // or any gateway-configured model id
        messages: vec![
            ChatCompletionMessage::new_text(
                "system".to_string(),
                "You are a helpful assistant.".to_string(),
            ),
            ChatCompletionMessage::new_text(
                "user".to_string(),
                "Say hello in one short sentence.".to_string(),
            ),
        ],
        ..Default::default()
    };

    // Send the request and get a single response message
    let response = client.completions().create(request).await?;

    let message = response.message();
    if let Some(content) = &message.content {
        if let Some(text) = content.as_string() {
            println!("Model reply: {text}");
        }
    }

    Ok(())
}

Key pieces:

  • VlloraLLMClient: wraps a ModelInstance and exposes .completions().
  • CompletionsClient::create: sends a one-shot completion request and returns a ChatCompletionMessageWithFinishReason.
  • Gateway types (ChatCompletionRequest, ChatCompletionMessage) abstract over provider-specific formats.
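
These accessors compose into a small convenience helper. A minimal sketch, not part of the crate (complete_text is a hypothetical name; it assumes only the message() / content.as_string() accessors shown in the example above):

use vllora_llm::client::VlloraLLMClient;
use vllora_llm::error::LLMResult;
use vllora_llm::types::gateway::ChatCompletionRequest;

// Hypothetical helper: run a one-shot completion and extract the plain text.
async fn complete_text(
    client: &VlloraLLMClient,
    request: ChatCompletionRequest,
) -> LLMResult<Option<String>> {
    let response = client.completions().create(request).await?;
    let message = response.message();
    if let Some(content) = &message.content {
        if let Some(text) = content.as_string() {
            return Ok(Some(text.to_string()));
        }
    }
    Ok(None)
}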


Streaming completions

CompletionsClient::create_stream returns a ResultStream that yields streaming chunks:

use std::sync::Arc;

use tokio_stream::StreamExt;

use vllora_llm::client::{VlloraLLMClient, ModelInstance, DummyModelInstance};
use vllora_llm::types::gateway::{ChatCompletionRequest, ChatCompletionMessage};
use vllora_llm::error::LLMResult;

#[tokio::main]
async fn main() -> LLMResult<()> {
    let instance: Arc<Box<dyn ModelInstance>> = Arc::new(Box::new(DummyModelInstance {}));
    let client = VlloraLLMClient::new_with_instance(instance);

    let request = ChatCompletionRequest {
        model: "gpt-4.1-mini".to_string(),
        messages: vec![ChatCompletionMessage::new_text(
            "user".to_string(),
            "Stream the alphabet, one chunk at a time.".to_string(),
        )],
        ..Default::default()
    };

    let mut stream = client.completions().create_stream(request).await?;

    while let Some(chunk) = stream.next().await {
        let chunk = chunk?;
        for choice in chunk.choices {
            if let Some(delta) = choice.delta.content {
                print!("{delta}");
            }
        }
    }

    Ok(())
}

The stream API mirrors OpenAI-style streaming but uses gateway-native ChatCompletionChunk types.
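
If you also want the complete text once the stream finishes, the loop from the example can accumulate deltas into a buffer as it prints them. A minimal sketch, assuming only the chunk/choice/delta shapes used above (full_reply is a local name introduced here):

// Drop-in variant of the example's loop: print each delta and keep a copy.
let mut full_reply = String::new();
while let Some(chunk) = stream.next().await {
    let chunk = chunk?;
    for choice in chunk.choices {
        if let Some(delta) = choice.delta.content {
            print!("{delta}");
            full_reply.push_str(&delta.to_string());
        }
    }
}
println!();
println!("(received {} characters)", full_reply.len());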


Supported parameters

The table below lists which ChatCompletionRequest (and provider-specific) parameters are honored by each provider when using VlloraLLMClient:

| Parameter | OpenAI / Proxy | Anthropic | Gemini | Bedrock | Notes |
|---|---|---|---|---|---|
| model | yes | yes | yes | yes | Taken from ChatCompletionRequest.model or engine config. |
| max_tokens | yes | yes | yes | yes | Mapped to provider-specific max_tokens / max_output_tokens. |
| temperature | yes | yes | yes | yes | Sampling temperature. |
| top_p | yes | yes | yes | yes | Nucleus sampling. |
| n | no | no | yes | no | For Gemini, mapped to candidate_count; other providers always use n = 1. |
| stop / stop_sequences | yes | yes | yes | yes | Converted to each provider’s stop / stop-sequences field. |
| presence_penalty | yes | no | yes | no | OpenAI / Gemini only. |
| frequency_penalty | yes | no | yes | no | OpenAI / Gemini only. |
| logit_bias | yes | no | no | no | OpenAI-only token bias map. |
| user | yes | no | no | no | OpenAI “end-user id” field. |
| seed | yes | no | yes | no | Deterministic sampling where supported. |
| response_format (JSON schema, etc.) | yes | no | yes | no | Gemini additionally normalizes JSON schema for its API. |
| prompt_cache_key | yes | no | no | no | OpenAI-only prompt caching hint. |
| provider_specific.top_k | no | yes | no | no | Anthropic-only: maps to Claude top_k. |
| provider_specific.thinking | no | yes | no | no | Anthropic “thinking” options (e.g. budget tokens). |
| additional_parameters map (Bedrock) | no | no | no | yes | Free-form JSON, passed through to Bedrock model params. |

Additionally, for Anthropic, the first system message in the conversation is mapped into a SystemPrompt (either as a single text string or as multiple TextContentBlocks), and any cache_control options on those blocks are translated into Anthropic’s ephemeral cache-control settings.

All other fields on ChatCompletionRequest (such as stream, tools, tool_choice, functions, function_call) are handled at the gateway layer and/or per-provider tool integration, but are not mapped 1:1 into provider primitive parameters.
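
As a hedged illustration of how a few of these map onto the request type (the max_tokens and temperature field names and their Option types here are assumptions inferred from the table, not verified against the struct definition):

use vllora_llm::types::gateway::{ChatCompletionRequest, ChatCompletionMessage};

// Field names below are assumed from the table above; check the
// ChatCompletionRequest docs for the exact definitions.
let request = ChatCompletionRequest {
    model: "gpt-4.1-mini".to_string(),
    messages: vec![ChatCompletionMessage::new_text(
        "user".to_string(),
        "Summarize this in two sentences.".to_string(),
    )],
    max_tokens: Some(256),  // mapped to max_tokens / max_output_tokens per provider
    temperature: Some(0.2), // honored by all four providers
    ..Default::default()
};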

Provider-specific examples

There are runnable examples under llm/examples/ that mirror the patterns above:

  • openai: Direct OpenAI chat completions using VlloraLLMClient (non-streaming + streaming).
  • anthropic: Anthropic (Claude) chat completions via the unified client.
  • gemini: Gemini chat completions via the unified client.
  • bedrock: AWS Bedrock chat completions (Nova etc.) via the unified client.
  • proxy: Using InferenceModelProvider::Proxy("proxy_name") to call an OpenAI completions-compatible endpoint.
  • tracing: Same OpenAI-style flow as openai, but with tracing_subscriber::fmt() configured to emit spans and events to the console (stdout).
  • tracing_otlp: Shows how to wire vllora_telemetry::events::layer to an OTLP HTTP exporter (e.g. New Relic / any OTLP collector) and emit spans from VlloraLLMClient calls to a remote telemetry backend.

Each example is a standalone Cargo binary; you can cd into a directory and run:

cargo run

after setting the provider-specific environment variables noted in the example’s main.rs.
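
For the tracing example, the console setup is roughly the following (a minimal sketch with tracing_subscriber as a dependency; see llm/examples/tracing for the real wiring):

use vllora_llm::client::VlloraLLMClient;

#[tokio::main]
async fn main() {
    // Emit spans and events from VlloraLLMClient calls to stdout.
    tracing_subscriber::fmt().init();

    let client = VlloraLLMClient::new();
    // ... build a ChatCompletionRequest and call client.completions().create(...)
    //     as in the Quick start; spans now show up on the console.
    let _ = client;
}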

Notes

  • Real usage: In the full LangDB / Vllora gateway, concrete ModelInstance implementations are created by the core executor based on your models.yaml and routing rules; the examples above use DummyModelInstance only to illustrate the public API of the CompletionsClient.
  • Error handling: All client methods return LLMResult<T>, which wraps rich LLMError variants (network, mapping, provider errors, etc.); see the sketch after this list.
  • More features: The same types in vllora_llm::types::gateway are used for tools, MCP, routing, embeddings, and image generation; see the main repository docs at https://vllora.dev/docs for higher-level gateway features.
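
A minimal error-handling sketch to go with the notes above (it assumes only that LLMError implements Display, which is standard for Rust error types; run_once is a hypothetical helper name):

use vllora_llm::client::VlloraLLMClient;
use vllora_llm::types::gateway::ChatCompletionRequest;

// Sketch: handle an LLMResult without matching specific LLMError variants.
async fn run_once(client: &VlloraLLMClient, request: ChatCompletionRequest) {
    match client.completions().create(request).await {
        Ok(response) => {
            // use response.message() as in the earlier examples
            let _ = response;
        }
        Err(err) => eprintln!("completion failed: {err}"),
    }
}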

Roadmap and issues

  • GitHub issues / roadmap: See open LLM crate issues for planned and outstanding work.
  • Planned enhancements:
    • Integrate the responses API
    • Support built-in MCP tool calls
    • Gemini prompt caching support
    • Full support for thinking messages

License