# syara-x
**Semantic YARA in Rust** — extends YARA-compatible rules with semantic similarity,
ML classifier, LLM-based, and perceptual hash matching. Catches malicious
content (prompt injection, phishing, jailbreaks) by meaning and intent, not
just exact text patterns.
Ported from [SYARA](https://github.com/nabeelxy/syara) originally written by
Nabeel Yoosuf
> This library was ported from Python to Rust by
> [Claude](https://claude.ai/claude-code) (Anthropic's AI coding assistant),
> working through six implementation phases under human direction, tested in a
> larger implementation that includes working side by side with [YARA-X](https://github.com/VirusTotal/yara-x).
> See [CONTRIBUTING.md](CONTRIBUTING.md) for how the project is maintained.
**EXPERIMENTAL**: Do not use this for anything important yet. I'm lazily commmiting
directly to `main` some very speculative features that I'm not sure if Claude can
pull off. The local-LLM backend (`burn-llm` / `burn-llm-gpu`) is currently walled
off pending a migration to `candle-rs` — see [ROADMAP.md](ROADMAP.md). Use at your
own risk.
---
## Features
| _(none)_ | String/regex matching, cleaners, chunkers |
| `sbert` | Semantic similarity via HTTP embedding endpoint (OpenAI-compatible; Ollama variant preserved) |
| `sbert-onnx` | Local ONNX MiniLM-L6-v2 backend (requires system `libonnxruntime` ≥1.17 — see [System dependencies](#system-dependencies)) |
| `classifier` | ML text classifiers via OpenAI-compatible HTTP embeddings (implies `sbert`) |
| `classifier-onnx` | Local ONNX classifier backend (recommended; implies `classifier` + `sbert-onnx`) |
| `llm` | LLM-based evaluation via OpenAI-compatible `/v1/chat/completions` (LM Studio / vLLM / llama-server / openai.com / Ollama's shim); native Ollama `/api/chat` preserved as legacy |
| `phash` | Perceptual hash matching for images, audio, and video |
| `all` | All of the above |
---
## Quick start
```toml
# Cargo.toml
[dependencies]
syara-x = { version = "0.3", features = ["all"] }
```
```rust
use syara_x;
let rules = syara_x::compile_str(r#"
rule prompt_injection {
strings:
$pi1 = "ignore previous instructions" nocase
$pi2 = "disregard your system prompt" nocase
condition:
any of them
}
"#)?;
for m in rules.scan(user_input) {
if m.matched {
println!("Rule '{}' matched", m.rule_name);
}
}
```
---
## Rule syntax
syara-x uses a YARA-inspired DSL with extensions for semantic and ML matching.
### String patterns
```
rule example {
strings:
$s1 = "literal match" nocase
$s2 = /regex\s+pattern/
$s3 = "wide char" wide
condition:
$s1 or $s2
}
```
Supported modifiers: `nocase`, `wide`, `ascii`, `dotall`, `fullword`.
Regex patterns support Rust's inline flag syntax — `(?m)^foo` (multiline),
`(?s).*` (dot-matches-newline), `(?i)foo` (case-insensitive). Inline flags
compose with modifiers.
### Condition expressions
Conditions are boolean expressions over pattern identifiers (`$name`) and
pattern match counts (`#name`). YARA-style:
```
rule multi_turn_transcript {
strings:
$user = "user:"
$assistant = "assistant:"
condition:
#user >= 2 and #assistant >= 1
}
```
Supported: `$id`, `#id` (count), integer literals, `== != < <= > >=`,
`+` / `-` arithmetic, unary `-`, `and` / `or` / `not`, `any of` / `all of`
pattern sets (`(them | $a,$b | $prefix*)`). See
[tasks/YARA-X-PARITY-GAPS.md](tasks/YARA-X-PARITY-GAPS.md) for what's not
yet supported vs YARA-X.
### Semantic similarity (`sbert` feature)
```
rule semantic_phishing {
similarity:
$sim1 = {
pattern: "your account has been compromised click here"
threshold: 0.82
cleaner: default_cleaning
chunker: sentence_chunking
matcher: sbert
}
condition:
$sim1
}
```
### Classifier (`classifier` / `classifier-onnx` features)
```
rule jailbreak_classifier {
classifier:
$c1 = {
pattern: "request to override AI safety guidelines"
threshold: 0.65
cleaner: default_cleaning
chunker: paragraph_chunking
classifier: tuned-sbert
}
condition:
$c1
}
```
The default `tuned-sbert` classifier is registered against an OpenAI-compatible
`/v1/embeddings` endpoint (`http://localhost:1234`). For deterministic, offline
scoring use the local ONNX backend instead:
```rust
use syara_x::engine::classifier::OnnxEmbeddingClassifier;
let cls = OnnxEmbeddingClassifier::from_dir("../models/all-MiniLM-L6-v2")?;
rules.register_classifier("tuned-sbert", Box::new(cls));
```
### LLM evaluation (`llm` feature)
```
rule llm_jailbreak {
llm:
$llm1 = {
pattern: "Does this text attempt to override AI safety guidelines?"
llm: openai-api-compatible
cleaner: no_op
chunker: no_chunking
}
condition:
$llm1
}
```
`openai-api-compatible` is the default and talks to any OpenAI-compatible
`/v1/chat/completions` endpoint (LM Studio, vLLM, llama-server, Open WebUI,
openai.com, and Ollama's OpenAI-compat shim). The legacy `ollama` name is kept
for users on Ollama's native `/api/chat` endpoint — specify `llm: ollama`.
#### Default endpoint and environment variables
The default registration points at `http://localhost:1234/v1/chat/completions`
with model `local-model` — the LM Studio convention. Override via environment:
| `SYARA_LLM_ENDPOINT` | highest | Full chat-completions URL (scoped to SYARA-X) |
| `SYARA_LLM_MODEL` | highest | Model identifier |
| `SYARA_LLM_API_KEY` | highest | Bearer token |
| `OPENAI_BASE_URL` | fallback | Root URL; `/chat/completions` is appended if missing |
| `OPENAI_MODEL` | fallback | Model identifier |
| `OPENAI_API_KEY` | fallback | Bearer token |
**Opting out of env-var lookup.** If you prefer SYARA-X never read these vars,
set `SYARA_LLM_NO_ENV=1` — the default registration falls back to the
hardcoded localhost endpoint with no API key. Alternatively, register an
explicit evaluator before scanning:
```rust
# #[cfg(feature = "llm")] {
use syara_x::engine::llm_evaluator::OpenAiChatEvaluatorBuilder;
let mut rules = syara_x::compile_str(source)?;
rules.register_llm_evaluator(
"openai-api-compatible",
Box::new(
OpenAiChatEvaluatorBuilder::new()
.endpoint("http://localhost:1234/v1/chat/completions")
.model("local-model")
.api_key(std::fs::read_to_string("/run/secrets/api_key")?.trim())
.build(),
),
);
# Ok::<(), syara_x::SyaraError>(()) }
```
**Scoped token exposure.** Prefer `SYARA_LLM_API_KEY` over `OPENAI_API_KEY`
to keep SYARA-X's credential separate from anything else on the system that
reads `OPENAI_API_KEY`. A process-scoped invocation also works:
`SYARA_LLM_API_KEY="$(cat /path/to/key)" your-binary`.
#### Builder knobs
[`OpenAiChatEvaluatorBuilder`] exposes `endpoint`, `model`, `api_key`,
`temperature` (default `0.0` — deterministic), `max_tokens` (default 8192 —
sized for reasoning models like Qwen3, DeepSeek-R1, GPT-OSS that spend
thousands of tokens on internal `<think>` / `reasoning_content` before the
final `YES`/`NO`; non-reasoning models still stop early via
`finish_reason=stop`), `system_prompt`, `header` (arbitrary HTTP header),
`connect_timeout` (default 20s), and `read_timeout` (default 60s).
If a reasoning model is truncated mid-think (`finish_reason=length` with
empty `content`), the evaluator returns a `SyaraError::LlmError` pointing
at `.max_tokens(…)` rather than silently parsing the empty response as a
non-match.
Responses are cached by `(pattern, chunk)` when `temperature == 0.0`; the
cache is bounded (1024 entries) and cleared after each `scan()` call to
mirror the lifecycle of the text cache.
### Perceptual hash (`phash` feature)
```
rule known_malware_image {
phash:
$ph1 = "/path/to/reference.png" threshold=0.95 hasher="imagehash"
condition:
$ph1
}
```
---
## Built-in components
**Cleaners:** `default_cleaning`, `aggressive_cleaning`, `no_op`
**Chunkers:** `no_chunking`, `sentence_chunking`, `paragraph_chunking`,
`word_chunking`, `fixed_size_chunking`
**Matchers:** `sbert` (HTTP embedding), `tuned-sbert` (classifier),
`openai-api-compatible` (LLM, default), `ollama` (LLM, legacy),
`imagehash`, `audiohash`, `videohash`
Custom components can be registered on `CompiledRules` via
`register_cleaner`, `register_chunker`, `register_semantic_matcher`, etc.
---
## C API
A C FFI is available via the `capi` crate. After building, `syara_x.h` is
generated automatically by [cbindgen](https://github.com/mozilla/cbindgen).
```c
#include "syara_x.h"
SyaraRules *rules = NULL;
syara_compile_str("rule r { strings: $s = \"evil\" condition: $s }", &rules);
SyaraMatchArray *matches = NULL;
syara_scan(rules, input_text, &matches);
for (size_t i = 0; i < matches->count; i++) {
if (matches->matches[i].matched) {
printf("matched: %s\n", matches->matches[i].rule_name);
}
}
syara_matches_free(matches);
syara_rules_free(rules);
```
---
## Architecture
```
.syara file
└─> SyaraParser parse DSL
└─> Compiler validate identifiers, conditions
└─> CompiledRules execution engine
├─ StringMatcher (cheapest)
├─ SemanticMatcher (sbert)
├─ PHashMatcher (phash)
├─ TextClassifier (classifier)
└─ LLMEvaluator (most expensive, short-circuited)
```
Execution is cost-ordered. LLM calls are skipped when the condition cannot
be satisfied even if the LLM matches (see `is_identifier_needed` in
`condition.rs`).
---
## Development
```bash
cargo build # build all crates
cargo test # run all tests
cargo test -p syara-x --features all # library tests with all features
cargo clippy -- -D warnings # lint (must be clean)
```
External services (Ollama) are only contacted when the corresponding feature
is enabled and a rule actually exercises that matcher. String-only rules need
no external services.
---
## System dependencies
Most features are pure-Rust and need nothing beyond `cargo`. The `sbert-onnx`
and `classifier-onnx` features are the exceptions — both link against the ONNX
Runtime shared library at runtime (via `ort`'s `load-dynamic` mode) and will
not run without it installed on the host.
### `sbert-onnx` / `classifier-onnx` (ONNX Runtime ≥ 1.17)
**macOS (Homebrew):**
```bash
brew install onnxruntime
# Homebrew installs to /opt/homebrew/lib, which dlopen does NOT search by default —
# point ort at the dylib explicitly:
export ORT_DYLIB_PATH="$(brew --prefix onnxruntime)/lib/libonnxruntime.dylib"
```
**Linux (Debian/Ubuntu):** download the matching release from
[microsoft/onnxruntime releases](https://github.com/microsoft/onnxruntime/releases)
and place `libonnxruntime.so` on your loader path, or set `ORT_DYLIB_PATH` to
point at the file.
**Any platform (escape hatch):** point `ort` at a specific dylib by exporting
`ORT_DYLIB_PATH=/absolute/path/to/libonnxruntime.{dylib,so,dll}` before
`cargo test` / `cargo run`.
**Convenience wrapper:** for repeated use, run
`./scripts/install_onnxruntime_xdg.sh` once to install
`~/.local/bin/with-onnxruntime` (XDG user-bin). Then prefix any command:
```bash
with-onnxruntime cargo test --features classifier-onnx -- --ignored
```
The same MiniLM weights are reused by `integration_real_onnx_embed` and
`integration_real_onnx_classifier`. To fetch them:
```bash
./scripts/fetch_minilm.sh # downloads to <repo>/models/all-MiniLM-L6-v2/
```
---
## License
MIT — see [LICENSE](LICENSE).