foundry-local-sdk 1.0.0

Local AI model inference powered by the Foundry Local Core engine
Documentation
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
# Foundry Local Rust SDK


The Foundry Local Rust SDK provides an async Rust interface for running AI models locally on your machine. Discover, download, load, and run inference — all without cloud dependencies.

## Features


- **Local-first AI** — Run models entirely on your machine with no cloud calls
- **Model catalog** — Browse and discover available models; check what's cached or loaded
- **Automatic model management** — Download, load, unload, and remove models from cache
- **Chat completions** — OpenAI-compatible chat API with both non-streaming and streaming responses
- **Audio transcription** — Transcribe audio files locally with streaming support
- **Tool calling** — Function/tool calling with streaming, multi-turn conversation support
- **Response format control** — Text, JSON, JSON Schema, and Lark grammar constrained output
- **Multi-variant models** — Models can have multiple variants (e.g., different quantizations) with automatic selection of the best cached variant
- **Embedded web service** — Start a local HTTP server for OpenAI-compatible API access
- **WinML support** — Automatic execution provider download on Windows for NPU/GPU acceleration
- **Configurable inference** — Control temperature, max tokens, top-k, top-p, frequency penalty, random seed, and more
- **Async-first** — Every operation is `async`; designed for use with the `tokio` runtime
- **Safe FFI** — Dynamically loads the native Foundry Local Core engine with a safe Rust wrapper

## Prerequisites


- **Rust** 1.70+ (stable toolchain)
- An internet connection during first build (to download native libraries)

## Installation


```sh
cargo add foundry-local-sdk
```

Or add to your `Cargo.toml`:

```toml
[dependencies]
foundry-local-sdk = "0.1"
```

You also need an async runtime. Most examples use [tokio](https://crates.io/crates/tokio):

```toml
[dependencies]
tokio = { version = "1", features = ["rt-multi-thread", "macros"] }
tokio-stream = "0.1"       # for StreamExt on streaming responses
```

### Feature Flags


| Feature   | Description |
|-----------|-------------|
| `winml`   | Use the WinML backend (Windows only). Selects different ONNX Runtime and GenAI packages for NPU/GPU acceleration. |
| `nightly` | Resolve the latest nightly build of the Core package from the ORT-Nightly feed. |

Enable features in `Cargo.toml`:

```toml
[dependencies]
foundry-local-sdk = { version = "0.1", features = ["winml"] }
```

> **Note:** The `winml` feature is only relevant on Windows. On macOS and Linux, the standard build is used regardless. No code changes are needed — your application code stays the same.

### Explicit EP Management


You can explicitly discover and download execution providers:

```rust
use foundry_local_sdk::{FoundryLocalConfig, FoundryLocalManager};

let manager = FoundryLocalManager::create(FoundryLocalConfig::new("my_app"))?;

// Discover available EPs and their status
let eps = manager.discover_eps()?;
for ep in &eps {
    println!("{} — registered: {}", ep.name, ep.is_registered);
}

// Download and register all available EPs
let result = manager.download_and_register_eps(None).await?;
println!("Success: {}, Status: {}", result.success, result.status);

// Download only specific EPs
let result = manager.download_and_register_eps(Some(&[eps[0].name.as_str()])).await?;
```

#### Per-EP download progress


Use `download_and_register_eps_with_progress` to receive typed `(ep_name, percent)` updates
as each EP downloads (`percent` is 0.0–100.0):

```rust
use std::sync::{Arc, Mutex};

let current_ep = Arc::new(Mutex::new(String::new()));
let ep = Arc::clone(&current_ep);
manager.download_and_register_eps_with_progress(None, move |ep_name: &str, percent: f64| {
    let mut current = ep.lock().unwrap();
    if ep_name != current.as_str() {
        if !current.is_empty() {
            println!();
        }
        *current = ep_name.to_string();
    }
    print!("\r  {}  {:5.1}%", ep_name, percent);
}).await?;
println!();
```

Catalog access does not block on EP downloads. Call `download_and_register_eps` when you need hardware-accelerated execution providers.

## Quick Start


```rust
use foundry_local_sdk::{
    ChatCompletionRequestMessage, ChatCompletionRequestSystemMessage,
    ChatCompletionRequestUserMessage, FoundryLocalConfig, FoundryLocalManager,
};

#[tokio::main]

async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // 1. Initialize the manager — loads native libraries and starts the engine
    let manager = FoundryLocalManager::create(FoundryLocalConfig::new("my_app"))?;

    // 2. Get a model from the catalog and load it
    let model = manager.catalog().get_model("phi-3.5-mini").await?;
    model.load().await?;

    // 3. Create a chat client and run inference
    let client = model.create_chat_client()
        .temperature(0.7)
        .max_tokens(256);

    let messages: Vec<ChatCompletionRequestMessage> = vec![
        ChatCompletionRequestSystemMessage::from("You are a helpful assistant.").into(),
        ChatCompletionRequestUserMessage::from("What is the capital of France?").into(),
    ];

    let response = client.complete_chat(&messages, None).await?;
    println!("{}", response.choices[0].message.content.as_deref().unwrap_or(""));

    // 4. Clean up
    model.unload().await?;

    Ok(())
}
```

## Usage


### Browsing the Model Catalog


The `Catalog` lets you discover what models are available, which are already cached locally, and which are currently loaded in memory.

```rust
let catalog = manager.catalog();

// List all available models
let models = catalog.get_models().await?;
for model in &models {
    println!("{} (id: {})", model.alias(), model.id());
}

// Look up a specific model by alias
let model = catalog.get_model("phi-3.5-mini").await?;

// Look up a specific variant by its unique model ID
let variant = catalog.get_model_variant("phi-3.5-mini-generic-gpu-4").await?;

// See what's already downloaded
let cached = catalog.get_cached_models().await?;

// See what's currently loaded in memory
let loaded = catalog.get_loaded_models().await?;
```

### Model Lifecycle


Each model may have multiple variants (different quantizations, hardware targets). The SDK auto-selects the best available variant, preferring cached versions. All models are represented by the `Model` type.

```rust
let model = catalog.get_model("phi-3.5-mini").await?;

// Inspect available variants
println!("Selected: {}", model.id());
for v in model.variants() {
    println!("  {} (info.cached: {})", v.id(), v.info().cached);
}
```

Download, load, and unload:

```rust
// Download with progress reporting
model.download(Some(|progress: f64| {
    print!("\r{progress:.1}%");
    std::io::Write::flush(&mut std::io::stdout()).ok();
})).await?;

// Load into memory
model.load().await?;

// Unload when done
model.unload().await?;

// Remove from local cache entirely
model.remove_from_cache().await?;
```

### Chat Completions


The `ChatClient` follows the OpenAI Chat Completion API structure.

```rust
let client = model.create_chat_client()

// Configure generation settings (fluent builder)
    .temperature(0.7)
    .max_tokens(256)
    .top_p(0.9)
    .frequency_penalty(0.5);

// Non-streaming completion
let response = client.complete_chat(
    &[
        ChatCompletionRequestSystemMessage::from("You are a helpful assistant.").into(),
        ChatCompletionRequestUserMessage::from("Explain Rust's ownership model.").into(),
    ],
    None,
).await?;

println!("{}", response.choices[0].message.content.as_deref().unwrap_or(""));
```

### Streaming Responses


For real-time token-by-token output, use streaming:

```rust
use tokio_stream::StreamExt;

let mut stream = client.complete_streaming_chat(
    &[ChatCompletionRequestUserMessage::from("Write a short poem about Rust.").into()],
    None,
).await?;

while let Some(chunk) = stream.next().await {
    let chunk = chunk?;
    if let Some(content) = &chunk.choices[0].delta.content {
        print!("{content}");
    }
}

// Errors from the native core are delivered as stream items —
// no separate close() call needed.
```

### Tool Calling


Define functions the model can call and handle the multi-turn conversation:

```rust
use foundry_local_sdk::{
    ChatCompletionRequestMessage, ChatCompletionRequestToolMessage,
    ChatCompletionTools, ChatToolChoice, FinishReason,
};
use serde_json::json;

// Define available tools
let tools: Vec<ChatCompletionTools> = serde_json::from_value(json!([{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": { "type": "string", "description": "City name" }
            },
            "required": ["location"]
        }
    }
}]))?;

let client = model.create_chat_client()
    .max_tokens(512)
    .tool_choice(ChatToolChoice::Auto);

let mut messages: Vec<ChatCompletionRequestMessage> = vec![
    ChatCompletionRequestUserMessage::from("What's the weather in Seattle?").into(),
];

// First request — model may call a tool
let response = client.complete_chat(&messages, Some(&tools)).await?;
let choice = &response.choices[0];

if choice.finish_reason == Some(FinishReason::ToolCalls) {
    if let Some(tool_calls) = &choice.message.tool_calls {
        for tc in tool_calls {
            // Execute the tool (your application logic)
            let result = execute_tool(&tc.function.name, &tc.function.arguments);

            // Add assistant message with tool calls, then the tool result
            messages.push(serde_json::from_value(json!({
                "role": "assistant",
                "content": null,
                "tool_calls": [{ "id": tc.id, "type": "function",
                    "function": { "name": tc.function.name,
                                  "arguments": tc.function.arguments } }]
            }))?);
            messages.push(ChatCompletionRequestToolMessage {
                content: result.into(),
                tool_call_id: tc.id.clone(),
            }.into());
        }

        // Continue the conversation with tool results
        let final_response = client.complete_chat(&messages, Some(&tools)).await?;
        println!("{}", final_response.choices[0].message.content.as_deref().unwrap_or(""));
    }
}
```

Tool calling also works with streaming via `complete_streaming_chat` — accumulate tool call fragments during streaming and check for `FinishReason::ToolCalls`.

### Response Format Options


Control the output format of chat completions:

```rust
use foundry_local_sdk::ChatResponseFormat;

// Plain text (default)
let client = model.create_chat_client()
    .response_format(ChatResponseFormat::Text);

// Unstructured JSON output
let client = model.create_chat_client()
    .response_format(ChatResponseFormat::JsonObject);

// JSON constrained to a schema
let client = model.create_chat_client()
    .response_format(ChatResponseFormat::JsonSchema(r#"{
        "type": "object",
        "properties": {
            "name": { "type": "string" },
            "age": { "type": "integer" }
        },
        "required": ["name", "age"]
    }"#.to_string()));

// Output constrained by a Lark grammar (Foundry extension)
let client = model.create_chat_client()
    .response_format(ChatResponseFormat::LarkGrammar(grammar.to_string()));
```

### Audio Transcription


Transcribe audio files locally using the `AudioClient`:

```rust
let model = manager.catalog().get_model("whisper-tiny").await?;
model.load().await?;

let audio_client = model.create_audio_client()
    .language("en");

// Non-streaming transcription
let result = audio_client.transcribe("recording.wav").await?;
println!("{}", result.text);
```

#### Streaming Transcription


```rust
use tokio_stream::StreamExt;

let mut stream = audio_client.transcribe_streaming("recording.wav").await?;
while let Some(chunk) = stream.next().await {
    print!("{}", chunk?.text);
}
```

### Embedded Web Service


Start a local HTTP server that exposes an OpenAI-compatible REST API:

```rust
manager.start_web_service().await?;
let urls = manager.urls()?;
println!("Service running at: {:?}", urls);

// Any OpenAI-compatible client or tool can now connect to the endpoint.
// ...

manager.stop_web_service().await?;
```

### Chat Client Settings


All settings are configured via chainable builder methods on `ChatClient`:

| Method | Type | Description |
|--------|------|-------------|
| `temperature(v)` | `f64` | Sampling temperature (0.0–2.0; higher = more random) |
| `max_tokens(v)` | `u32` | Maximum number of tokens to generate |
| `top_p(v)` | `f64` | Nucleus sampling probability (0.0–1.0) |
| `top_k(v)` | `u32` | Top-k sampling parameter (Foundry extension) |
| `frequency_penalty(v)` | `f64` | Frequency penalty |
| `presence_penalty(v)` | `f64` | Presence penalty |
| `n(v)` | `u32` | Number of completions to generate |
| `random_seed(v)` | `u64` | Random seed for reproducible results (Foundry extension) |
| `response_format(v)` | `ChatResponseFormat` | Output format (Text, JsonObject, JsonSchema, LarkGrammar) |
| `tool_choice(v)` | `ChatToolChoice` | Tool selection strategy (None, Auto, Required, Function) |

## Error Handling


All fallible operations return `foundry_local_sdk::Result<T>`, which is an alias for `std::result::Result<T, FoundryLocalError>`.

```rust
use foundry_local_sdk::FoundryLocalError;

match manager.catalog().get_model("nonexistent").await {
    Ok(model) => { /* use model */ }
    Err(FoundryLocalError::ModelOperation { reason }) => {
        eprintln!("Model error: {reason}");
    }
    Err(FoundryLocalError::CommandExecution { reason }) => {
        eprintln!("Core engine error: {reason}");
    }
    Err(e) => {
        eprintln!("Unexpected error: {e}");
    }
}
```

### Error Variants


| Variant | Description |
|---------|-------------|
| `LibraryLoad { reason }` | The native core library could not be loaded |
| `CommandExecution { reason }` | A command executed against native core returned an error |
| `InvalidConfiguration { reason }` | The provided configuration is invalid |
| `ModelOperation { reason }` | A model operation failed (load, unload, download, etc.) |
| `HttpRequest(reqwest::Error)` | An HTTP request to an external service failed |
| `Serialization(serde_json::Error)` | JSON serialization/deserialization failed |
| `Validation { reason }` | A validation check on user-supplied input failed |
| `Io(std::io::Error)` | An I/O error occurred |
| `Internal { reason }` | An internal SDK error (e.g. poisoned lock) |

## Configuration


The SDK is configured via `FoundryLocalConfig` when creating the manager:

```rust
use foundry_local_sdk::{FoundryLocalConfig, LogLevel};

let config = FoundryLocalConfig::new("my_app")
    .log_level(LogLevel::Info)
    .model_cache_dir("/path/to/cache")
    .web_service_urls("http://127.0.0.1:5000");

let manager = FoundryLocalManager::create(config)?;
```

| Setting | Builder method | Default | Description |
|---------|---------------|---------|-------------|
| App name | `new(name)` | **(required)** | Your application name |
| App data dir | `.app_data_dir(dir)` | `~/.{app_name}` | Application data directory |
| Model cache dir | `.model_cache_dir(dir)` | `{app_data_dir}/cache/models` | Where models are stored locally |
| Logs dir | `.logs_dir(dir)` | `{app_data_dir}/logs` | Log output directory |
| Log level | `.log_level(level)` | `Warn` | `Trace`, `Debug`, `Info`, `Warn`, `Error`, `Fatal` |
| Web service URLs | `.web_service_urls(urls)` | `None` | Bind address for the embedded web service |
| Service endpoint | `.service_endpoint(url)` | `None` | URL of an existing external service to connect to |
| Library path | `.library_path(path)` | Auto-discovered | Path to native Foundry Local Core libraries |
| Additional settings | `.additional_setting(k, v)` | `None` | Extra key-value settings passed to Core |
| Logger | `.logger(impl Logger)` | `None` | Application logger (stub — not yet wired) |

## How It Works


### Native Library Download


The `build.rs` build script automatically downloads the required native libraries at compile time:

1. Queries NuGet/ORT-Nightly feeds for package metadata
2. Downloads `.nupkg` packages (zip archives)
3. Extracts platform-specific native libraries (`.dll`, `.so`, or `.dylib`)
4. Places them in Cargo's `OUT_DIR` for runtime discovery

Downloaded libraries are cached — subsequent builds skip the download step.

### Runtime Loading


At runtime, the SDK uses `libloading` to dynamically load the Foundry Local Core library and resolve function pointers. No static linking or system-wide installation is required.

## Platform Support


| Platform        | RID        | Status |
|-----------------|------------|--------|
| Windows x64     | `win-x64`  ||
| Windows ARM64   | `win-arm64`||
| Linux x64       | `linux-x64`||
| macOS ARM64     | `osx-arm64`||

## Running Examples


Sample applications are available in [`samples/rust/`](../../samples/rust/):

| Sample | Description |
|--------|-------------|
| `native-chat-completions` | Non-streaming and streaming chat completions |
| `tool-calling-foundry-local` | Function/tool calling with multi-turn conversations |
| `audio-transcription-example` | Audio transcription (non-streaming and streaming) |
| `foundry-local-webserver` | Embedded OpenAI-compatible REST API server |

Run a sample with:

```sh
cd samples/rust
cargo run -p native-chat-completions
```

## License


Microsoft Software License Terms — see [LICENSE](../../LICENSE) for details.