Foundry Local Rust SDK
The Foundry Local Rust SDK provides an async Rust interface for running AI models locally on your machine. Discover, download, load, and run inference — all without cloud dependencies.
Features
- Local-first AI — Run models entirely on your machine with no cloud calls
- Model catalog — Browse and discover available models; check what's cached or loaded
- Automatic model management — Download, load, unload, and remove models from cache
- Chat completions — OpenAI-compatible chat API with both non-streaming and streaming responses
- Embeddings — Generate text embeddings via OpenAI-compatible API
- Audio transcription — Transcribe audio files locally with streaming support
- Tool calling — Function/tool calling with streaming, multi-turn conversation support
- Response format control — Text, JSON, JSON Schema, and Lark grammar constrained output
- Multi-variant models — Models can have multiple variants (e.g., different quantizations) with automatic selection of the best cached variant
- Embedded web service — Start a local HTTP server for OpenAI-compatible API access
- WinML support — Automatic execution provider download on Windows for NPU/GPU acceleration
- Configurable inference — Control temperature, max tokens, top-k, top-p, frequency penalty, random seed, and more
- Async-first — Every operation is async; designed for use with the tokio runtime
- Safe FFI — Dynamically loads the native Foundry Local Core engine with a safe Rust wrapper
Prerequisites
- Rust 1.70+ (stable toolchain)
- An internet connection during first build (to download native libraries)
Installation
Add the crate to your Cargo.toml:
[dependencies]
foundry-local-sdk = "0.1"
You also need an async runtime. Most examples use tokio:
[dependencies]
tokio = { version = "1", features = ["rt-multi-thread", "macros"] }
tokio-stream = "0.1" # for StreamExt on streaming responses
Feature Flags
| Feature | Description |
|---|---|
| winml | Use the WinML backend (Windows only). Selects different ONNX Runtime and GenAI packages for NPU/GPU acceleration. |
| nightly | Resolve the latest nightly build of the Core package from the ORT-Nightly feed. |
Enable features in Cargo.toml:
[dependencies]
foundry-local-sdk = { version = "0.1", features = ["winml"] }
Note: The winml feature is only relevant on Windows. On macOS and Linux, the standard build is used regardless. No code changes are needed; your application code stays the same.
With winml enabled on Windows, the build downloads Microsoft.Windows.AI.MachineLearning.dll from the pinned Microsoft.Windows.AI.MachineLearning NuGet version. Set FOUNDRY_LOCAL_WINDOWS_AI_MACHINELEARNING_VERSION before cargo build to use a newer runtime DLL, or set FOUNDRY_NATIVE_OVERRIDE_DIR to a directory containing the DLL.
Explicit EP Management
You can explicitly discover and download execution providers:
use ;
let manager = create?;
// Discover available EPs and their status
let eps = manager.discover_eps?;
for ep in &eps
// Download and register all available EPs
let result = manager.download_and_register_eps.await?;
println!;
// Download only specific EPs
let result = manager.download_and_register_eps.await?;
Per-EP download progress
Use download_and_register_eps_with_progress to receive typed (ep_name, percent) updates
as each EP downloads (percent is 0.0–100.0):
use ;
let current_ep = new;
let ep = clone;
manager.download_and_register_eps_with_progress.await?;
println!;
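The (ep_name, percent) callback shape described above can be sketched with the standard library alone. This is an illustration of the pattern, not the SDK's API: `report_progress` and the EP names are hypothetical stand-ins, and the shared `Arc<Mutex<String>>` is just one way a callback might remember which EP is currently downloading.

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical driver that feeds (ep_name, percent) pairs to a callback.
/// It stands in for whatever produces progress updates in a real download.
fn report_progress<F: FnMut(&str, f64)>(updates: &[(&str, f64)], mut on_progress: F) {
    for &(name, percent) in updates {
        on_progress(name, percent);
    }
}

fn main() {
    // Shared state so the callback can record which EP is in flight.
    let current_ep = Arc::new(Mutex::new(String::new()));
    let ep = Arc::clone(&current_ep);

    // Made-up update sequence; percent is in the 0.0 to 100.0 range.
    let updates = [("ep-a", 0.0), ("ep-a", 100.0), ("ep-b", 50.0)];
    report_progress(&updates, move |name, percent| {
        let mut current = ep.lock().unwrap();
        if *current != name {
            // A new EP started downloading: print a header line once.
            println!("downloading {name}");
            *current = name.to_string();
        }
        println!("  {percent:5.1}%");
    });

    println!("last EP seen: {}", current_ep.lock().unwrap());
}
```

Keeping the EP name in shared state lets the code that awaits the download read the last-seen name after the callback has been moved away.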
Cancelling model and EP downloads
Use a shared Arc<AtomicBool> with the download builders. Set the flag from another task or signal handler to stop the in-progress download.
use ;
// manager and model already initialized
let cancel_flag = new;
// call cancel_flag.store(true, ...) from another task or signal handler to cancel
manager
.download_and_register_eps_builder
.cancel
.run
.await?;
model
.download_builder
.cancel
.run
.await?;
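The cancellation flag itself is plain standard-library Rust. The sketch below shows the shared `Arc<AtomicBool>` pattern in isolation, using a thread and a simulated chunked download; `simulated_download` is a hypothetical stand-in and says nothing about how the SDK polls the flag internally.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

/// Hypothetical stand-in for a chunked download. Returns the number of
/// chunks completed before finishing or observing the cancel flag.
fn simulated_download(total_chunks: u32, cancel: &AtomicBool) -> u32 {
    let mut done = 0;
    while done < total_chunks {
        // Check the flag between chunks so cancellation takes effect promptly.
        if cancel.load(Ordering::Relaxed) {
            break;
        }
        thread::sleep(Duration::from_millis(5));
        done += 1;
    }
    done
}

fn main() {
    let cancel_flag = Arc::new(AtomicBool::new(false));

    // The worker holds its own clone of the flag.
    let worker_flag = Arc::clone(&cancel_flag);
    let worker = thread::spawn(move || simulated_download(1_000, &worker_flag));

    // Request cancellation from the main thread after a short delay.
    thread::sleep(Duration::from_millis(50));
    cancel_flag.store(true, Ordering::Relaxed);

    let completed = worker.join().expect("worker thread panicked");
    println!("stopped after {completed} of 1000 chunks");
}
```

`Ordering::Relaxed` is enough here because the flag carries no other data with it; the worker only needs to see the `true` value eventually.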
Catalog access does not block on EP downloads. Call download_and_register_eps when you need hardware-accelerated execution providers.
Quick Start
use ;
async
Usage
Browsing the Model Catalog
The Catalog lets you discover what models are available, which are already cached locally, and which are currently loaded in memory.
let catalog = manager.catalog;
// List all available models
let models = catalog.get_models.await?;
for model in &models
// Look up a specific model by alias
let model = catalog.get_model.await?;
// Look up a specific variant by its unique model ID
let variant = catalog.get_model_variant.await?;
// See what's already downloaded
let cached = catalog.get_cached_models.await?;
// See what's currently loaded in memory
let loaded = catalog.get_loaded_models.await?;
Model Lifecycle
Each model may have multiple variants (different quantizations, hardware targets). The SDK auto-selects the best available variant, preferring cached versions. All models are represented by the Model type.
let model = catalog.get_model.await?;
// Inspect available variants
println!;
for v in model.variants
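To make "preferring cached versions" concrete, here is a minimal sketch of one possible selection rule: take the first cached variant, otherwise fall back to the first listed one. The `Variant` struct, its fields, and the rule itself are assumptions for illustration; the SDK's actual ranking may weigh other factors such as hardware targets.

```rust
/// Hypothetical, simplified view of a model variant.
#[derive(Debug)]
struct Variant {
    id: String,
    cached: bool,
}

/// Assumed rule: the first cached variant wins; otherwise the first listed.
fn select_variant(variants: &[Variant]) -> Option<&Variant> {
    variants
        .iter()
        .find(|v| v.cached)
        .or_else(|| variants.first())
}

fn main() {
    let variants = vec![
        Variant { id: "variant-a".into(), cached: false },
        Variant { id: "variant-b".into(), cached: true },
    ];
    if let Some(v) = select_variant(&variants) {
        println!("selected {} (cached: {})", v.id, v.cached);
    }
}
```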
Download, load, and unload:
// Download with progress reporting
model.download.await?;
// Or use the builder when combining progress, cancellation, or future options
let cancel_flag = new;
model.download_builder
.progress
.cancel
.run
.await?;
// Load into memory
model.load.await?;
// Unload when done
model.unload.await?;
// Remove from local cache entirely
model.remove_from_cache.await?;
Chat Completions
The ChatClient follows the OpenAI Chat Completion API structure.
let client = model.create_chat_client
// Configure generation settings (fluent builder)
.temperature
.max_tokens
.top_p
.frequency_penalty;
// Non-streaming completion
let response = client.complete_chat.await?;
println!;
Streaming Responses
For real-time token-by-token output, use streaming:
use StreamExt;
let mut stream = client.complete_streaming_chat.await?;
while let Some = stream.next.await
// Errors from the native core are delivered as stream items —
// no separate close() call needed.
Tool Calling
Define functions the model can call and handle the multi-turn conversation:
use ;
use json;
// Define available tools
let tools: = from_value?;
let client = model.create_chat_client
.max_tokens
.tool_choice;
let mut messages: = vec!;
// First request — model may call a tool
let response = client.complete_chat.await?;
let choice = &response.choices;
if choice.finish_reason == Some
Tool calling also works with streaming via complete_streaming_chat — accumulate tool call fragments during streaming and check for FinishReason::ToolCalls.
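Accumulating tool call fragments might look roughly like the sketch below. The `ToolCallDelta` and `ToolCall` structs are hypothetical shapes modeled on the OpenAI streaming format (an index, an optional name, and a piece of the arguments string); the SDK's real chunk types are not shown here and may differ.

```rust
use std::collections::BTreeMap;

/// Hypothetical streamed fragment of a tool call.
struct ToolCallDelta {
    index: usize,
    name: Option<String>,
    arguments: String,
}

/// Fully assembled tool call.
#[derive(Debug, Default, PartialEq)]
struct ToolCall {
    name: String,
    arguments: String,
}

/// Concatenates fragments per index; results are ordered by index.
fn accumulate(deltas: &[ToolCallDelta]) -> Vec<ToolCall> {
    let mut calls: BTreeMap<usize, ToolCall> = BTreeMap::new();
    for delta in deltas {
        let call = calls.entry(delta.index).or_default();
        if let Some(name) = &delta.name {
            call.name.push_str(name);
        }
        call.arguments.push_str(&delta.arguments);
    }
    calls.into_values().collect()
}

fn main() {
    let deltas = vec![
        ToolCallDelta { index: 0, name: Some("get_weather".into()), arguments: String::new() },
        ToolCallDelta { index: 0, name: None, arguments: "{\"city\":".into() },
        ToolCallDelta { index: 0, name: None, arguments: "\"Oslo\"}".into() },
    ];
    for call in accumulate(&deltas) {
        println!("{}({})", call.name, call.arguments);
    }
}
```

The arguments string is only valid JSON once the stream reports that the tool calls are complete, so parsing is best deferred until then.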
Response Format Options
Control the output format of chat completions:
use ChatResponseFormat;
// Plain text (default)
let client = model.create_chat_client
.response_format;
// Unstructured JSON output
let client = model.create_chat_client
.response_format;
// JSON constrained to a schema
let client = model.create_chat_client
.response_format;
// Output constrained by a Lark grammar (Foundry extension)
let client = model.create_chat_client
.response_format;
Embeddings
Generate text embeddings using the EmbeddingClient:
let embedding_client = model.create_embedding_client;
// Single input
let response = embedding_client
.generate_embedding
.await?;
let embedding = &response.data.embedding; // Vec<f32>
println!;
// Batch input
let batch_response = embedding_client
.generate_embeddings
.await?;
// batch_response.data[0].embedding, batch_response.data[1].embedding
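Once two embeddings are available as `Vec<f32>`, a common next step is comparing them. The helper below is not part of the SDK; it is a small std-only sketch of cosine similarity, and the name `cosine_similarity` is made up for this example.

```rust
/// Cosine similarity of two equal-length vectors.
/// Returns None for mismatched lengths, empty input, or a zero vector.
fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|y| y * y).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a * norm_b))
}

fn main() {
    // Toy vectors standing in for two embeddings.
    let first = vec![0.1_f32, 0.3, 0.5];
    let second = vec![0.2_f32, 0.1, 0.4];
    match cosine_similarity(&first, &second) {
        Some(score) => println!("similarity: {score:.3}"),
        None => println!("vectors are not comparable"),
    }
}
```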
Audio Transcription
Transcribe audio files locally using the AudioClient:
let model = manager.catalog.get_model.await?;
model.load.await?;
let audio_client = model.create_audio_client
.language;
// Non-streaming transcription
let result = audio_client.transcribe.await?;
println!;
Streaming Transcription
use StreamExt;
let mut stream = audio_client.transcribe_streaming.await?;
while let Some = stream.next.await
Embedded Web Service
Start a local HTTP server that exposes an OpenAI-compatible REST API:
manager.start_web_service.await?;
let urls = manager.urls?;
println!;
// Any OpenAI-compatible client or tool can now connect to the endpoint.
// ...
manager.stop_web_service.await?;
Chat Client Settings
All settings are configured via chainable builder methods on ChatClient:
| Method | Type | Description |
|---|---|---|
| temperature(v) | f64 | Sampling temperature (0.0–2.0; higher = more random) |
| max_tokens(v) | u32 | Maximum number of tokens to generate |
| top_p(v) | f64 | Nucleus sampling probability (0.0–1.0) |
| top_k(v) | u32 | Top-k sampling parameter (Foundry extension) |
| frequency_penalty(v) | f64 | Frequency penalty |
| presence_penalty(v) | f64 | Presence penalty |
| n(v) | u32 | Number of completions to generate |
| random_seed(v) | u64 | Random seed for reproducible results (Foundry extension) |
| response_format(v) | ChatResponseFormat | Output format (Text, JsonObject, JsonSchema, LarkGrammar) |
| tool_choice(v) | ChatToolChoice | Tool selection strategy (None, Auto, Required, Function) |
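For readers new to chainable builders, the pattern behind these methods can be sketched in a few lines. `ChatSettings` is a hypothetical stand-in, not the SDK's ChatClient; it only mirrors three of the method names and types from the table to show how each call consumes and returns the value.

```rust
/// Hypothetical settings holder; unset fields stay None.
#[derive(Debug, Default, Clone, PartialEq)]
struct ChatSettings {
    temperature: Option<f64>,
    max_tokens: Option<u32>,
    top_p: Option<f64>,
}

impl ChatSettings {
    fn new() -> Self {
        Self::default()
    }

    // Each setter takes `self` by value and returns it, which enables chaining.
    fn temperature(mut self, v: f64) -> Self {
        self.temperature = Some(v);
        self
    }

    fn max_tokens(mut self, v: u32) -> Self {
        self.max_tokens = Some(v);
        self
    }

    fn top_p(mut self, v: f64) -> Self {
        self.top_p = Some(v);
        self
    }
}

fn main() {
    let settings = ChatSettings::new()
        .temperature(0.7)
        .max_tokens(256)
        .top_p(0.9);
    println!("{settings:?}");
}
```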
Error Handling
All fallible operations return foundry_local_sdk::Result<T>, which is an alias for std::result::Result<T, FoundryLocalError>.
use FoundryLocalError;
match manager.catalog.get_model.await
Error Variants
| Variant | Description |
|---|---|
| LibraryLoad { reason } | The native core library could not be loaded |
| CommandExecution { reason } | A command executed against native core returned an error |
| InvalidConfiguration { reason } | The provided configuration is invalid |
| ModelOperation { reason } | A model operation failed (load, unload, download, etc.) |
| HttpRequest(reqwest::Error) | An HTTP request to an external service failed |
| Serialization(serde_json::Error) | JSON serialization/deserialization failed |
| Validation { reason } | A validation check on user-supplied input failed |
| Io(std::io::Error) | An I/O error occurred |
| Internal { reason } | An internal SDK error (e.g. poisoned lock) |
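The combination of a crate-level Result alias and an error enum is a common Rust idiom. The sketch below reproduces it with a local `SketchError` that borrows two variant shapes from the table (`Validation { reason }` and `Io(std::io::Error)`); it is a stand-in for illustration, not the SDK's FoundryLocalError, and `check_alias` and `read_file` are hypothetical helpers.

```rust
use std::error::Error;
use std::fmt;
use std::fs;
use std::io;

/// Local stand-in error type with two variant shapes from the table above.
#[derive(Debug)]
enum SketchError {
    Validation { reason: String },
    Io(io::Error),
}

/// Alias in the same style as a crate-level Result<T>.
type SketchResult<T> = Result<T, SketchError>;

impl fmt::Display for SketchError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            SketchError::Validation { reason } => write!(f, "validation failed: {reason}"),
            SketchError::Io(err) => write!(f, "I/O error: {err}"),
        }
    }
}

impl Error for SketchError {}

impl From<io::Error> for SketchError {
    fn from(err: io::Error) -> Self {
        SketchError::Io(err)
    }
}

/// Rejects empty or whitespace-only aliases.
fn check_alias(alias: &str) -> SketchResult<&str> {
    if alias.trim().is_empty() {
        return Err(SketchError::Validation {
            reason: "alias must not be empty".into(),
        });
    }
    Ok(alias)
}

/// `?` converts io::Error into SketchError::Io through the From impl.
fn read_file(path: &str) -> SketchResult<String> {
    Ok(fs::read_to_string(path)?)
}

fn main() {
    match check_alias("   ") {
        Ok(alias) => println!("alias ok: {alias}"),
        Err(SketchError::Validation { reason }) => eprintln!("bad input: {reason}"),
        Err(other) => eprintln!("failed: {other}"),
    }
    if let Err(err) = read_file("/no/such/file") {
        eprintln!("failed: {err}");
    }
}
```

Matching on a specific variant first and falling back to a catch-all arm keeps call sites short while still letting them react to the cases they care about.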
Configuration
The SDK is configured via FoundryLocalConfig when creating the manager:
use ;
let config = new
.log_level
.model_cache_dir
.web_service_urls;
let manager = create?;
| Setting | Builder method | Default | Description |
|---|---|---|---|
| App name | new(name) | (required) | Your application name |
| App data dir | .app_data_dir(dir) | ~/.{app_name} | Application data directory |
| Model cache dir | .model_cache_dir(dir) | {app_data_dir}/cache/models | Where models are stored locally |
| Logs dir | .logs_dir(dir) | {app_data_dir}/logs | Log output directory |
| Log level | .log_level(level) | Warn | Trace, Debug, Info, Warn, Error, Fatal |
| Web service URLs | .web_service_urls(urls) | None | Bind address for the embedded web service |
| Service endpoint | .service_endpoint(url) | None | URL of an existing external service to connect to |
| Library path | .library_path(path) | Auto-discovered | Path to native Foundry Local Core libraries |
| Additional settings | .additional_setting(k, v) | None | Extra key-value settings passed to Core |
| Logger | .logger(impl Logger) | None | Application logger (stub — not yet wired) |
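The directory defaults in the table compose from the app name. As a rough illustration of how those templates expand, the sketch below builds the three paths from a home directory and an app name. `default_dirs` and `DefaultDirs` are hypothetical names, and how the SDK actually resolves `~` on each platform is not covered here.

```rust
use std::env;
use std::path::{Path, PathBuf};

/// Hypothetical holder for the three default directories from the table.
struct DefaultDirs {
    app_data: PathBuf,
    model_cache: PathBuf,
    logs: PathBuf,
}

/// Expands ~/.{app_name}, {app_data_dir}/cache/models, and {app_data_dir}/logs.
fn default_dirs(home: &Path, app_name: &str) -> DefaultDirs {
    let app_data = home.join(format!(".{app_name}"));
    DefaultDirs {
        model_cache: app_data.join("cache").join("models"),
        logs: app_data.join("logs"),
        app_data,
    }
}

fn main() {
    // Assumption: HOME is set; fall back to the current directory otherwise.
    let home = env::var_os("HOME")
        .map(PathBuf::from)
        .unwrap_or_else(|| PathBuf::from("."));
    let dirs = default_dirs(&home, "my_app");
    println!("app data:    {}", dirs.app_data.display());
    println!("model cache: {}", dirs.model_cache.display());
    println!("logs:        {}", dirs.logs.display());
}
```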
How It Works
Native Library Download
The build.rs build script automatically downloads the required native libraries at compile time:
- Queries NuGet/ORT-Nightly feeds for package metadata
- Downloads .nupkg packages (zip archives)
- Extracts platform-specific native libraries (.dll, .so, or .dylib)
- Places them in Cargo's OUT_DIR for runtime discovery
Downloaded libraries are cached — subsequent builds skip the download step.
Runtime Loading
At runtime, the SDK uses libloading to dynamically load the Foundry Local Core library and resolve function pointers. No static linking or system-wide installation is required.
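Dynamic loading starts from a platform-specific file name. The standard library exposes the platform's prefix and suffix, so a loader can derive that name as sketched below. The base name `example_core` is a placeholder, not the real library name, and the actual loading step (done with libloading in the SDK) is left out to keep the example std-only.

```rust
use std::env::consts::{DLL_PREFIX, DLL_SUFFIX};

/// Builds the platform's shared-library file name for a base name,
/// e.g. a "lib" prefix with ".so" or ".dylib", or a bare ".dll" on Windows.
fn native_library_file_name(base: &str) -> String {
    format!("{}{}{}", DLL_PREFIX, base, DLL_SUFFIX)
}

fn main() {
    // "example_core" is a placeholder base name for illustration only.
    println!("{}", native_library_file_name("example_core"));
}
```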
Platform Support
| Platform | RID | Status |
|---|---|---|
| Windows x64 | win-x64 | ✅ |
| Windows ARM64 | win-arm64 | ✅ |
| Linux x64 | linux-x64 | ✅ |
| Linux ARM64 | linux-arm64 | ✅ |
| macOS ARM64 | osx-arm64 | ✅ |
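The table maps directly onto the values Rust reports in `std::env::consts`. A small sketch of that mapping follows; `runtime_identifier` is a hypothetical helper, and whether the build script performs the lookup this way is an assumption.

```rust
use std::env::consts::{ARCH, OS};

/// Maps Rust's OS/ARCH constants to the RIDs listed in the table.
/// Returns None for combinations the table does not list.
fn runtime_identifier(os: &str, arch: &str) -> Option<&'static str> {
    match (os, arch) {
        ("windows", "x86_64") => Some("win-x64"),
        ("windows", "aarch64") => Some("win-arm64"),
        ("linux", "x86_64") => Some("linux-x64"),
        ("linux", "aarch64") => Some("linux-arm64"),
        ("macos", "aarch64") => Some("osx-arm64"),
        _ => None,
    }
}

fn main() {
    match runtime_identifier(OS, ARCH) {
        Some(rid) => println!("this machine maps to {rid}"),
        None => println!("{OS}/{ARCH} is not in the supported list"),
    }
}
```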
Running Examples
Sample applications are available in samples/rust/:
| Sample | Description |
|---|---|
| native-chat-completions | Non-streaming and streaming chat completions |
| tool-calling-foundry-local | Function/tool calling with multi-turn conversations |
| audio-transcription-example | Audio transcription (non-streaming and streaming) |
| foundry-local-webserver | Embedded OpenAI-compatible REST API server |
Run a sample with:
License
Microsoft Software License Terms — see LICENSE for details.