rig-llama-cpp 0.1.0

# rig-llama-cpp

A [Rig](https://github.com/0xPlaygrounds/rig) completion provider that runs GGUF models locally via [llama.cpp](https://github.com/ggml-org/llama.cpp) and its Rust bindings, [llama-cpp-2](https://github.com/utilityai/llama-cpp-rs).

## Usage

```rust
use rig::client::CompletionClient;
use rig::completion::Prompt;
use rig_llama_cpp::Client;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Only the model path is required; every other knob has a sensible
    // default. Chain .n_ctx (as here), .sampling, .fit, .kv_cache,
    // .checkpoints, or (with the `mtmd` feature) .mmproj to override.
    let client = Client::builder("path/to/model.gguf")
        .n_ctx(8192)
        .build()?;

    let agent = client
        .agent("local")
        .preamble("You are a helpful assistant.")
        .max_tokens(512)
        .build();

    let response = agent.prompt("Hello!").await?;
    println!("{response}");
    Ok(())
}
```

The legacy positional `Client::from_gguf(...)` constructor is still
available for callers pinned to the 0.1.x API.

## Features

- Local inference with any GGUF model
- Completion and streaming support
- Tool calling (for models with OpenAI-compatible chat templates)
- Reasoning / thinking output
- Vision (multimodal) inference for models with an `mmproj` projector — opt in via the `mtmd` feature
- Automatic GPU/CPU layer fitting — llama.cpp probes available device memory and picks `n_gpu_layers` for you, no manual tuning required
- Backend selection via Cargo feature flags
- Configurable sampling parameters (top-p, top-k, min-p, temperature, penalties)

## Feature Flags

No default GPU backend — pick the one that matches your hardware:

| Feature  | Use for                                                          |
| -------- | ---------------------------------------------------------------- |
| _(none)_ | CPU-only inference                                               |
| `vulkan` | Cross-vendor GPU on Linux/Windows                                |
| `cuda`   | NVIDIA GPUs                                                      |
| `metal`  | Apple Silicon / macOS                                            |
| `rocm`   | AMD GPUs on Linux                                                |
| `openmp` | OpenMP CPU threading; combine with any GPU backend               |
| `mtmd`   | Multimodal (vision) inference; enables `ClientBuilder::mmproj`   |

```sh
cargo build --features vulkan
cargo build --features "cuda,mtmd"
```

Toolchain and runtime requirements per backend are documented upstream
in [llama.cpp's build guide](https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md).
A successful build does not guarantee a successful run — if backend
init fails at runtime, [`LoadError::BackendInit`] is returned rather
than panicking, so the application can fall back gracefully.
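
The fallback described above could be structured roughly as follows. This is a minimal sketch, not the crate's API: the `LoadError` enum here is a hypothetical stand-in (only the `BackendInit` variant name comes from this README; its payload and the `Other` variant are assumptions), and `load_with_fallback` is an illustrative helper.

```rust
// Hypothetical stand-in for the crate's `LoadError`. Only the variant name
// `BackendInit` is taken from the README; the payloads are assumptions.
#[derive(Debug)]
enum LoadError {
    BackendInit(String),
    Other(String),
}

// Run `primary`; if it fails because the backend could not initialise,
// run `fallback` instead. Every other outcome is returned unchanged.
fn load_with_fallback<T>(
    primary: impl FnOnce() -> Result<T, LoadError>,
    fallback: impl FnOnce() -> Result<T, LoadError>,
) -> Result<T, LoadError> {
    match primary() {
        Err(LoadError::BackendInit(reason)) => {
            eprintln!("backend init failed ({reason}); falling back");
            fallback()
        }
        other => other,
    }
}

fn main() {
    // Simulated GPU failure: the fallback closure supplies the result.
    let recovered = load_with_fallback(
        || Err(LoadError::BackendInit("no usable GPU device".into())),
        || Ok("fallback client"),
    );
    println!("{recovered:?}");

    // Errors unrelated to backend init are not masked by the fallback.
    let failed: Result<&str, LoadError> = load_with_fallback(
        || Err(LoadError::Other("model file not found".into())),
        || Ok("fallback client"),
    );
    println!("{failed:?}");
}
```

In a real application the two closures would wrap whatever loading paths are available, for example a local `Client::builder(...).build()` call and a remote provider.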

## Examples

```sh
MODEL_PATH=./model.gguf cargo run --example completion
MODEL_PATH=./model.gguf cargo run --example streaming
MODEL_PATH=./model.gguf cargo run --example stream_chat
MODEL_PATH=./model.gguf cargo run --example structured_output
MODEL_PATH=./model.gguf cargo run --example kv_cache
MODEL_PATH=./embedding-model.gguf cargo run --example embeddings

# Vision (requires mtmd feature + mmproj file)
MODEL_PATH=./vision-model.gguf MMPROJ_PATH=./mmproj.gguf IMAGE_PATH=./image.jpg \
    cargo run --features mtmd --example vision

# Hot-swap the loaded model on the same worker thread
RIG_MODEL_A=./model_a.gguf RIG_MODEL_B=./model_b.gguf cargo run --example reload
```

Set `N_GPU_LAYERS=20` when running an example to offload 20 layers to the GPU.

By default, llama.cpp backend logs are suppressed so streaming and test output stay readable.
Set `RIG_LLAMA_CPP_LOGS=1` to re-enable raw backend logs when debugging model startup or decode issues.
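
If you want your own binary to follow the same environment-variable conventions, the parsing can be sketched with the standard library alone. The helper names below are hypothetical, and the handling of unset or malformed values (and of `RIG_LLAMA_CPP_LOGS` values other than `1`) is an assumption rather than a description of what the bundled examples or the crate do.

```rust
use std::env;

// Parse an optional `N_GPU_LAYERS` value. Unset, empty, or non-numeric
// input yields `None`, so the caller can keep its default behaviour.
fn parse_gpu_layers(raw: Option<&str>) -> Option<u32> {
    raw.and_then(|s| s.trim().parse().ok())
}

// Treat exactly "1" as "backend logs on"; any other value leaves them off.
// (Assumption: the crate may accept other values as well.)
fn logs_enabled(raw: Option<&str>) -> bool {
    raw.map(str::trim) == Some("1")
}

fn main() {
    // Fall back to a placeholder path when MODEL_PATH is not set.
    let model_path = env::var("MODEL_PATH").unwrap_or_else(|_| "./model.gguf".to_string());
    let gpu_layers = parse_gpu_layers(env::var("N_GPU_LAYERS").ok().as_deref());
    let logs = logs_enabled(env::var("RIG_LLAMA_CPP_LOGS").ok().as_deref());

    println!("model={model_path} gpu_layers={gpu_layers:?} backend_logs={logs}");
}
```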

## Testing

```sh
# Fast unit tests + doctests — no model required, run on every CI build.
cargo test --lib
cargo test --doc
```

The full integration suite (`tests/e2e/`) covers streaming completions,
vision, tool roundtrips, structured output, KV-cache quantization,
embedding, and sequential model reload. All tests are `#[ignore]`d and
auto-download their fixtures via `hf-hub` into the standard HuggingFace
cache (`~/.cache/huggingface/hub`) on first run — plan for ~20 GB.
Backend compilation is already covered upstream by `llama-cpp-rs`, and
the model fixtures are too large for hosted runners, so the e2e suite
does not run in CI.

```sh
cargo test --test e2e --features mtmd -- --ignored --nocapture
```
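
For readers unfamiliar with the `#[ignore]` mechanism, here is a small, std-only sketch of how such a test is shaped. The helper `hub_cache_dir` and the test name are hypothetical; the real suite in `tests/e2e/` downloads fixtures via `hf-hub` and will look different.

```rust
use std::path::{Path, PathBuf};

// Hypothetical helper: the default HuggingFace hub cache under a home
// directory, matching the `~/.cache/huggingface/hub` path mentioned above.
fn hub_cache_dir(home: &Path) -> PathBuf {
    home.join(".cache").join("huggingface").join("hub")
}

#[cfg(test)]
mod e2e_sketch {
    use super::*;

    // Skipped by a plain `cargo test`; runs only when `-- --ignored`
    // (or `-- --include-ignored`) is passed to the test binary.
    #[test]
    #[ignore = "needs multi-gigabyte model fixtures"]
    fn fixture_cache_is_populated() {
        let home = std::env::var("HOME").expect("HOME is not set");
        let cache = hub_cache_dir(Path::new(&home));
        // Placeholder check; a real test would load a model from here.
        assert!(cache.is_dir(), "no cache at {}", cache.display());
    }
}

fn main() {
    let home = std::env::var("HOME").unwrap_or_else(|_| ".".to_string());
    println!("{}", hub_cache_dir(Path::new(&home)).display());
}
```

Marking heavy tests this way keeps `cargo test --lib` fast while leaving the full suite one flag away.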

## Contributing

Issues and pull requests are welcome at
[github.com/camperking/rig-llama-cpp](https://github.com/camperking/rig-llama-cpp).

Before opening a PR, please run the same checks CI does
([`.github/workflows/ci.yml`](.github/workflows/ci.yml)):

```sh
cargo fmt --all --check
cargo clippy --no-deps --all-targets -- -D warnings
cargo clippy --no-deps --all-targets --features mtmd -- -D warnings
cargo test --lib
cargo test --doc
RUSTDOCFLAGS="-D warnings" cargo doc --no-deps --features mtmd
```

If your change touches inference behaviour, validate it locally with
`cargo test --test e2e --features mtmd -- --ignored --nocapture` — the
fixtures auto-download on first run (~20 GB; see the
[Testing](#testing) section).

For changes that affect the public API or the embedded `llama-cpp-2`
version, add an entry to [`CHANGELOG.md`](CHANGELOG.md) under
`[Unreleased]`. The crate's pre-1.0 SemVer policy is documented at the
top of that file.

## License

See [Cargo.toml](Cargo.toml) for dependency details.