zeph 0.21.2 - Docs.rs

# Local Inference (Candle)

Run HuggingFace GGUF models locally via [candle](https://github.com/huggingface/candle) without external API dependencies. Metal and CUDA GPU acceleration are supported.

```bash
cargo build --release --features candle,metal  # macOS with Metal GPU
```

## Configuration

```toml
[llm]
provider = "candle"

[llm.candle]
source = "huggingface"
repo_id = "TheBloke/Mistral-7B-Instruct-v0.2-GGUF"
filename = "mistral-7b-instruct-v0.2.Q4_K_M.gguf"
chat_template = "mistral"          # llama3, chatml, mistral, phi3, raw
embedding_repo = "sentence-transformers/all-MiniLM-L6-v2"  # optional BERT embeddings

[llm.candle.generation]
temperature = 0.7
top_p = 0.9
top_k = 40
max_tokens = 2048
repeat_penalty = 1.1
```

## Chat Templates

| Template | Models |
|----------|--------|
| `llama3` | Llama 3, Llama 3.1 |
| `chatml` | Qwen, Yi, OpenHermes |
| `mistral` | Mistral, Mixtral |
| `phi3` | Phi-3 |
| `raw` | No template (raw completion) |

## Device Auto-Detection

- **macOS** — Metal GPU (requires `--features metal`)
- **Linux with NVIDIA** — CUDA (requires `--features cuda`)
- **Fallback** — CPU

## Candle-Backed Classifiers

When built with the `classifiers` feature, Zeph uses Candle to run DeBERTa-based models directly for injection detection and PII detection — no external API calls required.

### Injection Detection (`CandleClassifier`)

`CandleClassifier` runs `protectai/deberta-v3-small-prompt-injection-v2` (sequence classification) to detect prompt injection attempts in incoming messages. When the model scores above `injection_threshold`, the message is flagged and existing injection-handling logic applies.

Long inputs are split into overlapping chunks (448 tokens each, 64-token overlap). The final score is the maximum across all chunks.

### PII Detection (`CandlePiiClassifier`)

`CandlePiiClassifier` runs `iiiorg/piiranha-v1-detect-personal-information` (NER token classification) to detect personal information in messages. Detected spans are merged with the existing regex-based PII filter — the union of both result sets is used.

Per-token confidence below `pii_threshold` is treated as O (no entity). Entity types include: `GIVENNAME`, `EMAIL`, `PHONE`, `DRIVERLICENSE`, `PASSPORT`, `IBAN`, and others as defined by the model.

### Configuration

```toml
[classifiers]
enabled = true                                            # Master switch (default: false)
timeout_ms = 5000                                        # Per-inference timeout in ms (default: 5000)
injection_model = "protectai/deberta-v3-small-prompt-injection-v2"
injection_threshold = 0.8                                # Minimum score to classify as injection (default: 0.8)
# injection_model_sha256 = "abc123..."                   # Optional: verify model file integrity at load
pii_enabled = true                                       # Enable NER PII detection (default: false)
pii_model = "iiiorg/piiranha-v1-detect-personal-information"
pii_threshold = 0.75                                     # Minimum per-token confidence (default: 0.75)
# pii_model_sha256 = "def456..."                         # Optional: verify model file integrity at load
```

**SHA-256 verification:** Set `injection_model_sha256` or `pii_model_sha256` to the hex digest of the model's safetensors file. Zeph verifies the file before loading and aborts startup on mismatch. Use this in security-sensitive deployments to detect corruption or tampering.

**Timeout fallback:** When an inference call exceeds `timeout_ms`, Zeph falls back to the existing regex-based detection. Classifiers never block the agent — degraded mode is always available.

**Model download:** Models are downloaded from HuggingFace on first use and cached locally. Subsequent startups load from cache. Set `injection_model` / `pii_model` to a custom HuggingFace repo ID to use alternative models with the same DeBERTa architecture.