foundry-local-sdk-1.2.0-rc1 has been yanked.

Visit the last successful build: foundry-local-sdk-1.0.0

Foundry Local Rust SDK

The Foundry Local Rust SDK provides an async Rust interface for running AI models locally on your machine. Discover, download, load, and run inference — all without cloud dependencies.

Features

Local-first AI — Run models entirely on your machine with no cloud calls
Model catalog — Browse and discover available models; check what's cached or loaded
Automatic model management — Download, load, unload, and remove models from cache
Chat completions — OpenAI-compatible chat API with both non-streaming and streaming responses
Embeddings — Generate text embeddings via OpenAI-compatible API
Audio transcription — Transcribe audio files locally with streaming support
Tool calling — Function/tool calling with streaming, multi-turn conversation support
Response format control — Text, JSON, JSON Schema, and Lark grammar constrained output
Multi-variant models — Models can have multiple variants (e.g., different quantizations) with automatic selection of the best cached variant
Embedded web service — Start a local HTTP server for OpenAI-compatible API access
WinML support — Automatic execution provider download on Windows for NPU/GPU acceleration
Configurable inference — Control temperature, max tokens, top-k, top-p, frequency penalty, random seed, and more
Async-first — Every operation is async; designed for use with the tokio runtime
Safe FFI — Dynamically loads the native Foundry Local Core engine with a safe Rust wrapper

Prerequisites

Rust 1.70+ (stable toolchain)
An internet connection during first build (to download native libraries)

Installation

cargo add foundry-local-sdk

Or add to your Cargo.toml:

[dependencies]

foundry-local-sdk = "0.1"

You also need an async runtime. Most examples use tokio:

[dependencies]

tokio = { version = "1", features = ["rt-multi-thread", "macros"] }

tokio-stream = "0.1"       # for StreamExt on streaming responses

Feature Flags

Feature	Description
`winml`	Use the WinML backend (Windows only). Selects different ONNX Runtime and GenAI packages for NPU/GPU acceleration.
`nightly`	Resolve the latest nightly build of the Core package from the ORT-Nightly feed.

Enable features in Cargo.toml:

[dependencies]

foundry-local-sdk = { version = "0.1", features = ["winml"] }

Note: The winml feature is only relevant on Windows. On macOS and Linux, the standard build is used regardless. No code changes are needed — your application code stays the same.

With winml enabled on Windows, the build downloads Microsoft.Windows.AI.MachineLearning.dll from the pinned Microsoft.Windows.AI.MachineLearning NuGet version. Set FOUNDRY_LOCAL_WINDOWS_AI_MACHINELEARNING_VERSION before cargo build to use a newer runtime DLL, or set FOUNDRY_NATIVE_OVERRIDE_DIR to a directory containing the DLL.

Explicit EP Management

You can explicitly discover and download execution providers:

use foundry_local_sdk::{FoundryLocalConfig, FoundryLocalManager};

let manager = FoundryLocalManager::create(FoundryLocalConfig::new("my_app"))?;

// Discover available EPs and their status
let eps = manager.discover_eps()?;
for ep in &eps {
    println!("{} — registered: {}", ep.name, ep.is_registered);
}

// Download and register all available EPs
let result = manager.download_and_register_eps(None).await?;
println!("Success: {}, Status: {}", result.success, result.status);

// Download only specific EPs
let result = manager.download_and_register_eps(Some(&[eps[0].name.as_str()])).await?;

Per-EP download progress

Use download_and_register_eps_with_progress to receive typed (ep_name, percent) updates as each EP downloads (percent is 0.0–100.0):

use std::sync::{Arc, Mutex};

let current_ep = Arc::new(Mutex::new(String::new()));
let ep = Arc::clone(&current_ep);
manager.download_and_register_eps_with_progress(None, move |ep_name: &str, percent: f64| {
    let mut current = ep.lock().unwrap();
    if ep_name != current.as_str() {
        if !current.is_empty() {
            println!();
        }
        *current = ep_name.to_string();
    }
    print!("\r  {}  {:5.1}%", ep_name, percent);
}).await?;
println!();

Cancelling model and EP downloads

Use a shared Arc<AtomicBool> with the download builders. Set the flag from another task or signal handler to stop the in-progress download.

use std::sync::{
    atomic::AtomicBool,
    Arc,
};

// manager and model already initialized
let cancel_flag = Arc::new(AtomicBool::new(false));
// call cancel_flag.store(true, ...) from another task or signal handler to cancel

manager
    .download_and_register_eps_builder()
    .cancel(Arc::clone(&cancel_flag))
    .run()
    .await?;
model
    .download_builder()
    .cancel(Arc::clone(&cancel_flag))
    .run()
    .await?;

Catalog access does not block on EP downloads. Call download_and_register_eps when you need hardware-accelerated execution providers.

Quick Start

use foundry_local_sdk::{
    ChatCompletionRequestMessage, ChatCompletionRequestSystemMessage,
    ChatCompletionRequestUserMessage, FoundryLocalConfig, FoundryLocalManager,
};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // 1. Initialize the manager — loads native libraries and starts the engine
    let manager = FoundryLocalManager::create(FoundryLocalConfig::new("my_app"))?;

    // 2. Get a model from the catalog and load it
    let model = manager.catalog().get_model("phi-3.5-mini").await?;
    model.load().await?;

    // 3. Create a chat client and run inference
    let client = model.create_chat_client()
        .temperature(0.7)
        .max_tokens(256);

    let messages: Vec<ChatCompletionRequestMessage> = vec![
        ChatCompletionRequestSystemMessage::from("You are a helpful assistant.").into(),
        ChatCompletionRequestUserMessage::from("What is the capital of France?").into(),
    ];

    let response = client.complete_chat(&messages, None).await?;
    println!("{}", response.choices[0].message.content.as_deref().unwrap_or(""));

    // 4. Clean up
    model.unload().await?;

    Ok(())
}

Usage

Browsing the Model Catalog

The Catalog lets you discover what models are available, which are already cached locally, and which are currently loaded in memory.

let catalog = manager.catalog();

// List all available models
let models = catalog.get_models().await?;
for model in &models {
    println!("{} (id: {})", model.alias(), model.id());
}

// Look up a specific model by alias
let model = catalog.get_model("phi-3.5-mini").await?;

// Look up a specific variant by its unique model ID
let variant = catalog.get_model_variant("phi-3.5-mini-generic-gpu-4").await?;

// See what's already downloaded
let cached = catalog.get_cached_models().await?;

// See what's currently loaded in memory
let loaded = catalog.get_loaded_models().await?;

Model Lifecycle

Each model may have multiple variants (different quantizations, hardware targets). The SDK auto-selects the best available variant, preferring cached versions. All models are represented by the Model type.

let model = catalog.get_model("phi-3.5-mini").await?;

// Inspect available variants
println!("Selected: {}", model.id());
for v in model.variants() {
    println!("  {} (info.cached: {})", v.id(), v.info().cached);
}

Download, load, and unload:

// Download with progress reporting
model.download(Some(|progress: f64| {
    print!("\r{progress:.1}%");
    std::io::Write::flush(&mut std::io::stdout()).ok();
})).await?;

// Or use the builder when combining progress, cancellation, or future options
let cancel_flag = std::sync::Arc::new(std::sync::atomic::AtomicBool::new(false));
model.download_builder()
    .progress(|progress| {
        print!("\r{progress:.1}%");
        std::io::Write::flush(&mut std::io::stdout()).ok();
    })
    .cancel(cancel_flag.clone())
    .run()
    .await?;

// Load into memory
model.load().await?;

// Unload when done
model.unload().await?;

// Remove from local cache entirely
model.remove_from_cache().await?;

Chat Completions

The ChatClient follows the OpenAI Chat Completion API structure.

let client = model.create_chat_client()

// Configure generation settings (fluent builder)
    .temperature(0.7)
    .max_tokens(256)
    .top_p(0.9)
    .frequency_penalty(0.5);

// Non-streaming completion
let response = client.complete_chat(
    &[
        ChatCompletionRequestSystemMessage::from("You are a helpful assistant.").into(),
        ChatCompletionRequestUserMessage::from("Explain Rust's ownership model.").into(),
    ],
    None,
).await?;

println!("{}", response.choices[0].message.content.as_deref().unwrap_or(""));

Streaming Responses

For real-time token-by-token output, use streaming:

use tokio_stream::StreamExt;

let mut stream = client.complete_streaming_chat(
    &[ChatCompletionRequestUserMessage::from("Write a short poem about Rust.").into()],
    None,
).await?;

while let Some(chunk) = stream.next().await {
    let chunk = chunk?;
    if let Some(content) = &chunk.choices[0].delta.content {
        print!("{content}");
    }
}

// Errors from the native core are delivered as stream items —
// no separate close() call needed.

Tool Calling

Define functions the model can call and handle the multi-turn conversation:

use foundry_local_sdk::{
    ChatCompletionRequestMessage, ChatCompletionRequestToolMessage,
    ChatCompletionTools, ChatToolChoice, FinishReason,
};
use serde_json::json;

// Define available tools
let tools: Vec<ChatCompletionTools> = serde_json::from_value(json!([{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": { "type": "string", "description": "City name" }
            },
            "required": ["location"]
        }
    }
}]))?;

let client = model.create_chat_client()
    .max_tokens(512)
    .tool_choice(ChatToolChoice::Auto);

let mut messages: Vec<ChatCompletionRequestMessage> = vec![
    ChatCompletionRequestUserMessage::from("What's the weather in Seattle?").into(),
];

// First request — model may call a tool
let response = client.complete_chat(&messages, Some(&tools)).await?;
let choice = &response.choices[0];

if choice.finish_reason == Some(FinishReason::ToolCalls) {
    if let Some(tool_calls) = &choice.message.tool_calls {
        for tc in tool_calls {
            // Execute the tool (your application logic)
            let result = execute_tool(&tc.function.name, &tc.function.arguments);

            // Add assistant message with tool calls, then the tool result
            messages.push(serde_json::from_value(json!({
                "role": "assistant",
                "content": null,
                "tool_calls": [{ "id": tc.id, "type": "function",
                    "function": { "name": tc.function.name,
                                  "arguments": tc.function.arguments } }]
            }))?);
            messages.push(ChatCompletionRequestToolMessage {
                content: result.into(),
                tool_call_id: tc.id.clone(),
            }.into());
        }

        // Continue the conversation with tool results
        let final_response = client.complete_chat(&messages, Some(&tools)).await?;
        println!("{}", final_response.choices[0].message.content.as_deref().unwrap_or(""));
    }
}

Tool calling also works with streaming via complete_streaming_chat — accumulate tool call fragments during streaming and check for FinishReason::ToolCalls.

Response Format Options

Control the output format of chat completions:

use foundry_local_sdk::ChatResponseFormat;

// Plain text (default)
let client = model.create_chat_client()
    .response_format(ChatResponseFormat::Text);

// Unstructured JSON output
let client = model.create_chat_client()
    .response_format(ChatResponseFormat::JsonObject);

// JSON constrained to a schema
let client = model.create_chat_client()
    .response_format(ChatResponseFormat::JsonSchema(r#"{
        "type": "object",
        "properties": {
            "name": { "type": "string" },
            "age": { "type": "integer" }
        },
        "required": ["name", "age"]
    }"#.to_string()));

// Output constrained by a Lark grammar (Foundry extension)
let client = model.create_chat_client()
    .response_format(ChatResponseFormat::LarkGrammar(grammar.to_string()));

Embeddings

Generate text embeddings using the EmbeddingClient:

let embedding_client = model.create_embedding_client();

// Single input
let response = embedding_client
    .generate_embedding("The quick brown fox jumps over the lazy dog")
    .await?;
let embedding = &response.data[0].embedding; // Vec<f32>
println!("Dimensions: {}", embedding.len());

// Batch input
let batch_response = embedding_client
    .generate_embeddings(&["The quick brown fox", "The capital of France is Paris"])
    .await?;
// batch_response.data[0].embedding, batch_response.data[1].embedding

Audio Transcription

Transcribe audio files locally using the AudioClient:

let model = manager.catalog().get_model("whisper-tiny").await?;
model.load().await?;

let audio_client = model.create_audio_client()
    .language("en");

// Non-streaming transcription
let result = audio_client.transcribe("recording.wav").await?;
println!("{}", result.text);

Streaming Transcription

use tokio_stream::StreamExt;

let mut stream = audio_client.transcribe_streaming("recording.wav").await?;
while let Some(chunk) = stream.next().await {
    print!("{}", chunk?.text);
}

Embedded Web Service

Start a local HTTP server that exposes an OpenAI-compatible REST API:

manager.start_web_service().await?;
let urls = manager.urls()?;
println!("Service running at: {:?}", urls);

// Any OpenAI-compatible client or tool can now connect to the endpoint.
// ...

manager.stop_web_service().await?;

Chat Client Settings

All settings are configured via chainable builder methods on ChatClient:

Method	Type	Description
`temperature(v)`	`f64`	Sampling temperature (0.0–2.0; higher = more random)
`max_tokens(v)`	`u32`	Maximum number of tokens to generate
`top_p(v)`	`f64`	Nucleus sampling probability (0.0–1.0)
`top_k(v)`	`u32`	Top-k sampling parameter (Foundry extension)
`frequency_penalty(v)`	`f64`	Frequency penalty
`presence_penalty(v)`	`f64`	Presence penalty
`n(v)`	`u32`	Number of completions to generate
`random_seed(v)`	`u64`	Random seed for reproducible results (Foundry extension)
`response_format(v)`	`ChatResponseFormat`	Output format (Text, JsonObject, JsonSchema, LarkGrammar)
`tool_choice(v)`	`ChatToolChoice`	Tool selection strategy (None, Auto, Required, Function)

Error Handling

All fallible operations return foundry_local_sdk::Result<T>, which is an alias for std::result::Result<T, FoundryLocalError>.

use foundry_local_sdk::FoundryLocalError;

match manager.catalog().get_model("nonexistent").await {
    Ok(model) => { /* use model */ }
    Err(FoundryLocalError::ModelOperation { reason }) => {
        eprintln!("Model error: {reason}");
    }
    Err(FoundryLocalError::CommandExecution { reason }) => {
        eprintln!("Core engine error: {reason}");
    }
    Err(e) => {
        eprintln!("Unexpected error: {e}");
    }
}

Error Variants

Variant	Description
`LibraryLoad { reason }`	The native core library could not be loaded
`CommandExecution { reason }`	A command executed against native core returned an error
`InvalidConfiguration { reason }`	The provided configuration is invalid
`ModelOperation { reason }`	A model operation failed (load, unload, download, etc.)
`HttpRequest(reqwest::Error)`	An HTTP request to an external service failed
`Serialization(serde_json::Error)`	JSON serialization/deserialization failed
`Validation { reason }`	A validation check on user-supplied input failed
`Io(std::io::Error)`	An I/O error occurred
`Internal { reason }`	An internal SDK error (e.g. poisoned lock)

Configuration

The SDK is configured via FoundryLocalConfig when creating the manager:

use foundry_local_sdk::{FoundryLocalConfig, LogLevel};

let config = FoundryLocalConfig::new("my_app")
    .log_level(LogLevel::Info)
    .model_cache_dir("/path/to/cache")
    .web_service_urls("http://127.0.0.1:5000");

let manager = FoundryLocalManager::create(config)?;

Setting	Builder method	Default	Description
App name	`new(name)`	(required)	Your application name
App data dir	`.app_data_dir(dir)`	`~/.{app_name}`	Application data directory
Model cache dir	`.model_cache_dir(dir)`	`{app_data_dir}/cache/models`	Where models are stored locally
Logs dir	`.logs_dir(dir)`	`{app_data_dir}/logs`	Log output directory
Log level	`.log_level(level)`	`Warn`	`Trace`, `Debug`, `Info`, `Warn`, `Error`, `Fatal`
Web service URLs	`.web_service_urls(urls)`	`None`	Bind address for the embedded web service
Service endpoint	`.service_endpoint(url)`	`None`	URL of an existing external service to connect to
Library path	`.library_path(path)`	Auto-discovered	Path to native Foundry Local Core libraries
Additional settings	`.additional_setting(k, v)`	`None`	Extra key-value settings passed to Core
Logger	`.logger(impl Logger)`	`None`	Application logger (stub — not yet wired)

How It Works

Native Library Download

The build.rs build script automatically downloads the required native libraries at compile time:

Queries NuGet/ORT-Nightly feeds for package metadata
Downloads .nupkg packages (zip archives)
Extracts platform-specific native libraries (.dll, .so, or .dylib)
Places them in Cargo's OUT_DIR for runtime discovery

Downloaded libraries are cached — subsequent builds skip the download step.

Runtime Loading

At runtime, the SDK uses libloading to dynamically load the Foundry Local Core library and resolve function pointers. No static linking or system-wide installation is required.

Platform Support

Platform	RID	Status
Windows x64	`win-x64`	✅
Windows ARM64	`win-arm64`	✅
Linux x64	`linux-x64`	✅
Linux ARM64	`linux-arm64`	✅
macOS ARM64	`osx-arm64`	✅

Running Examples

Sample applications are available in samples/rust/:

Sample	Description
`native-chat-completions`	Non-streaming and streaming chat completions
`tool-calling-foundry-local`	Function/tool calling with multi-turn conversations
`audio-transcription-example`	Audio transcription (non-streaming and streaming)
`foundry-local-webserver`	Embedded OpenAI-compatible REST API server

Run a sample with:

cd samples/rust

cargo run -p native-chat-completions

License

Microsoft Software License Terms — see LICENSE for details.

foundry-local-sdk 1.2.0-rc1