precursor 0.2.3

Pre-protocol payload tagging, similarity clustering, and packet/firmware triage CLI.
# precursor

<p align="center">
  <img src="assets/logo/precursor-logo.svg" alt="Precursor logo" width="720">
</p>

<p align="center">
  <a href="https://precursor.hashdb.io">Live demos</a> ·
  <a href="https://github.com/Obsecurus/precursor/tree/main/samples/scenarios">Scenario corpus</a> ·
  <a href="https://crates.io/crates/precursor">Crates.io</a>
</p>

`precursor` is a CLI for **pre-protocol payload tagging + similarity clustering** across packet payloads, logs, and raw binary fragments.
It combines PCRE2 named-capture matching, TLSH/LZJD/FBHash similarity, optional MRSHv2 adapter mode, and JSON outputs designed for fast triage loops before full parser or protocol tooling exists.

## Release 0.2.1 Highlights

| Area | What landed |
| --- | --- |
| Packet Inference | Single-packet protocol scoring via `-P` / `-A` / `-k` |
| Blob + Binary Processing | `-z, --input-blob` and `-B, --input-binary` for multiline and raw-byte stream analysis |
| Similarity Workflows | TLSH, LZJD, or FBHash clustering + protocol hints (`--protocol-hints`) for discovery loops |
| Sigma Compatibility | `--sigma-rule` converts selectors into named PCRE captures and enforces Sigma `condition` logic |
| Regex Engine Scaffold | `--regex-engine pcre2|vectorscan` with compatibility diagnostics and PCRE2 fallback path |
| Output Contract | Stable `protocol_*`, `similarity_hash`, `tags`, `xxh3_64_sum` JSON fields |
| Reliability | Runtime ingest path no longer relies on panic-prone `expect(...)` calls |
| Scenario Corpus | Versioned packet/firmware/ICS + public PCAP/log/Sigma-derived samples in `samples/scenarios/` |
| Release Ops | Dependency auto-bump/tag workflows + benchmark harness + Pages site |

## 60-second teaser

```bash
cat samples/scenarios/pre-protocol-packet-triage/payloads.b64 \
  | precursor -p samples/scenarios/pre-protocol-packet-triage/patterns.pcre \
      -m base64 -t -d --similarity-mode lzjd -P --protocol-hints
```

Representative output shape:

```json
{
  "tags": ["http_method"],
  "similarity_hash": "lzjd:128:...",
  "protocol_label": "http",
  "protocol_confidence": 0.93,
  "protocol_candidates": [
    {"protocol": "http", "score": 0.93, "evidence": ["matched HTTP request/headers"]}
  ],
  "xxh3_64_sum": "..."
}
```

Why this matters:
- Fast pre-protocol triage when DPI/parsers are unavailable.
- Stable JSON fields for SOC pipelines and enrichment tooling.
- Built-in cluster context for LLM-assisted protocol discovery.

## Best Fit

- Rapid triage of payload lines from logs, brokers, sensors, or ad-hoc captures.
- Label-first workflows where regex capture names become downstream tags.
- Similarity-first clustering when full protocol parsers are unavailable or too brittle.
- Early-stage protocol discovery where hints are fed to humans or LLM tooling.

## Non-Goals

- Not a replacement for full IDS/NSM stacks (Suricata, Zeek).
- Not a malware rule engine replacement (YARA / YARA-X).
- Not a full protocol parser stack; this is pre-protocol triage and clustering.

## Architecture

<p align="center">
  <img src="architecture.svg" alt="Precursor architecture diagram" />
</p>

## Install

### Cargo

```bash
cargo install precursor
```

### From source

```bash
git clone https://github.com/Obsecurus/precursor.git
cd precursor
cargo build --release
./target/release/precursor --help
```

Release archives include generated shell completion files (`bash`, `fish`, `zsh`, `powershell`) and a `precursor.1` man page.

### Optional: Build with MRSHv2 adapter mode

```bash
mock_dir="$(mktemp -d)"
ci/build_mrshv2_mock.sh "$mock_dir"
PRECURSOR_MRSHV2_LIB_DIR="$mock_dir" cargo build --features similarity-mrshv2
```

## Quick start

### 1) Match string payloads from stdin

```bash
printf 'hello world\nbye world\n' \
  | precursor '(?<hello_tag>hello)' -m string
```

### 2) Match base64 payloads

```bash
printf 'aGVsbG8gd29ybGQ=\n' \
  | precursor '(?<hello_tag>hello)' -m base64
```

### 3) Load patterns from file

```bash
printf 'aGVsbG8gd29ybGQ=\n' \
  | precursor -p patterns/new -m base64
```

### 4) Extract payload from JSON before matching

```bash
printf '{"payload":"aGVsbG8gd29ybGQ="}\n' \
  | precursor '(?<hello_tag>hello)' -j '.payload' -m base64
```

### 5) Enable similarity diffing (TLSH default)

```bash
cat payloads.b64 \
  | precursor -p patterns/new -m base64 -t -d -x 80
```

### 6) Switch to LZJD similarity mode

```bash
cat payloads.raw \
  | precursor -p patterns/new -m string -t -d --similarity-mode lzjd -x 80
```

### 7) Switch to FBHash similarity mode

```bash
cat payloads.raw \
  | precursor -p patterns/new -m string -t -d --similarity-mode fbhash -x 90
```

### 8) Emit protocol-discovery hints for an LLM loop

```bash
cat payloads.b64 \
  | precursor -p patterns/new -m base64 -t -d --protocol-hints --protocol-hints-limit 20
```

### 9) Enable single-packet protocol inference output

```bash
cat payloads.b64 \
  | precursor -p patterns/new -m base64 -P -A 0.7 -k 5
```

### 10) Match a multiline payload as one blob

```bash
printf 'GET /blob HTTP/1.1\nHost: blob.example\n' \
  | precursor '(?<multi>GET /blob HTTP/1\.1\nHost: blob\.example)' -m string -z
```

### 11) Match a raw-binary blob (short flag)

```bash
printf '\x7fELF\x02\x01\x01\x00' \
  | precursor '(?<elf_magic>^\x7fELF)' -B
```

### 12) Load Sigma rules with condition gating

```bash
cat samples/scenarios/sigma-linux-shell-command-triage/payloads.log \
  | precursor --sigma-rule samples/scenarios/sigma-linux-shell-command-triage/sigma_rule.yml \
      -m string -t -d --similarity-mode lzjd
```

### 13) Run vectorscan compatibility mode (PCRE2 fallback)

```bash
cat payloads.raw \
  | precursor '(?<http_get>GET)' -m string --regex-engine vectorscan
```

### 14) Replay a real public Log4Shell PCAP (FBHash mode)

```bash
cat samples/scenarios/public-log4shell-foxit-pcap/payloads.string \
  | precursor -p samples/scenarios/public-log4shell-foxit-pcap/patterns.pcre \
      -m string -t -d --similarity-mode fbhash -P --protocol-hints
```

### 15) Triage public firmware blobs in binary folder mode

```bash
precursor -p samples/scenarios/public-firmware-binwalk-magic/patterns.pcre \
  -f samples/scenarios/public-firmware-binwalk-magic/blobs \
  --input-mode binary -t -d --similarity-mode lzjd -P --protocol-hints
```

## CLI reference

```text
precursor [PATTERN] [OPTIONS]
```

At least one pattern source is required: positional `PATTERN`, `--pattern-file`, or `--sigma-rule`.

Pattern source:
- positional `PATTERN` (single named-capture regex)
- `-p, --pattern-file <PATH>` (one named-capture pattern per line)
- `--sigma-rule <PATH>` (Sigma YAML selectors converted to named-capture PCRE patterns with `condition` enforcement)

Input:
- `-f, --input-folder <PATH>`: read newline-delimited content from files
- stdin: read newline-delimited input from standard input
- `-z, --input-blob`: process each input source as one blob instead of line splitting
- `-B, --input-binary`: treat each source as raw binary bytes (implies blob processing semantics)
- `-m, --input-mode <base64|string|hex|binary>`: decode mode (default: `base64`)
- `-j, --input-json-key <QUERY>`: extract payload from JSON input first

Similarity:
- `-t, --tlsh`: compute TLSH hash for matched payloads
- `-d, --tlsh-diff`: compute pairwise TLSH distance among matched payloads
- `-a, --tlsh-algorithm <48_1|128_1|128_3|256_1|256_3>`
- `-x, --tlsh-distance <N>`: max distance threshold (default: `100`)
- `-l, --tlsh-length`: include payload length in diff scoring
- `-y, --tlsh-sim-only`: only output payloads that have TLSH similarities
- `--similarity-mode <tlsh|lzjd|mrshv2|fbhash>`:
  - `tlsh`, `lzjd`, and `fbhash` are implemented in default builds
  - `mrshv2` is implemented behind `--features similarity-mrshv2` and native adapter linking
  - `fbhash` currently uses an in-tree FBHash-inspired chunk-vector model for stream-friendly pairwise scoring
- `--protocol-hints`: emit LLM-oriented protocol-discovery hint JSON to `stderr`
- `--protocol-hints-limit <N>`: limit hint candidate count (default: `25`)
- `-P, --single-packet`: enable heuristic protocol inference on each matched payload
- `-A, --abstain-threshold <0.0-1.0>`: minimum confidence required to emit a non-`unknown` label (default: `0.65`)
- `-k, --protocol-top-k <N>`: candidate count included in `protocol_candidates` (default: `3`)
- `--regex-engine <pcre2|vectorscan>`: regex engine selection (`vectorscan` mode emits compatibility checks and executes through current PCRE2 path)

Other:
- `-s, --stats`: emit run statistics JSON to `stderr`

## Output model

Each matched payload is emitted as JSON on `stdout` with fields such as:
- `tags`: array of matched capture names
- `tlsh`: active similarity hash when enabled (legacy field name preserved for compatibility)
- `similarity_hash`: backend-agnostic similarity hash field
- `xxh3_64_sum`: stable payload key for report correlation
- `tlsh_similarities`: distance map when `--tlsh-diff` is enabled
- `protocol_label`: top protocol guess (or `unknown` when abstaining)
- `protocol_confidence`: confidence score for `protocol_label`
- `protocol_abstained`: whether inference abstained under threshold
- `protocol_candidates`: scored candidate list with evidence strings
- `sigma_rule_matches`: Sigma rule titles whose `condition` evaluated true (when `--sigma-rule` is used)
- `sigma_rule_ids`: stable Sigma rule IDs/slugs that evaluated true

When `--stats` is enabled, a summary JSON object is emitted to `stderr`.
See `STATS.md` for schema, field meanings, and `jq` examples.
When `--protocol-hints` is enabled, an additional hint JSON block is emitted to `stderr` for LLM-guided protocol discovery workflows, including `protocol_*` fields when single-packet inference is enabled.
When both `--single-packet` and `--tlsh-diff` are enabled, protocol confidence is cluster-boosted using similarity neighbor counts.
When `--input-blob` is enabled (or `--input-binary` is set), each file/stdin stream is treated as a single candidate payload.

### Stats quick view

```bash
cat payloads.b64 \
  | precursor -p patterns/new -m base64 -t -d --similarity-mode lzjd --stats \
  1>/tmp/records.ndjson 2>/tmp/stats.json

jq '.Environment + {input_count: .Input.Count, similarities: .Compare.Similarities}' /tmp/stats.json
```

Notes:
- `Compare` may be empty when too few matched payloads produce pairwise distances.
- Historical record field names like `tlsh_similarities` are preserved for compatibility across TLSH/LZJD/FBHash modes.

## Positioning vs adjacent tools

- Use **Suricata/Zeek** for full protocol-aware IDS/NSM and rich ecosystem integrations.
- Use **YARA/YARA-X** for signature-based scanning of files and malware-centric workflows.
- Use **Sigma** for backend-agnostic detection content and SIEM portability.
- Use **Precursor** when you need lightweight payload tagging + similarity clustering, or when you want to run Sigma keyword intent directly against raw payload streams via `--sigma-rule`.

## Scenario corpus and demos

- Scenario corpus: `samples/scenarios/`
- Demo runner: `samples/scenarios/run_all.sh`
- Static demo site source: `site/`
- Includes packet/firmware/ICS plus public PCAP-derived Log4Shell probes, real fox-it Log4Shell PCAP replay, Sigma shell-command triage, Zeek DNS log triage, and real binwalk firmware blob samples.
- Site includes mini replay reels that visually walk through PCAP replay, firmware blob triage, and Sigma labeling behavior.
- `tshark` is only required when regenerating PCAP-derived payload extracts.

```bash
samples/scenarios/run_all.sh ./target/release/precursor
```

## Benchmarks

Generate a reproducible scenario snapshot:

```bash
cargo build --release
ci/benchmark_scenarios.sh ./target/release/precursor benchmarks/latest.md
```

Committed baseline:
- `benchmarks/baseline-2026-02-13.md`

## GitHub Pages + precursor.hashdb.io

- Pages workflow: `.github/workflows/pages.yml`
- Site content: `site/`
- Custom domain file: `site/CNAME`
- Configure DNS at your provider with:
  - record type: `CNAME`
  - host: `precursor`
  - value: `obsecurus.github.io`

## Current roadmap

See `ROADMAP.md` for prioritized milestones and release criteria.
See `SIMILARITY_BACKENDS.md` for MRSHv2/FBHash feasibility and backend sequencing.
See `SIGMA_INTEGRATION.md` for Sigma feature coverage and next steps.
See `HARDWARE_ACCELERATION.md` for regex acceleration/offload strategy.
See `STATS.md` for run-statistics schema and usage guidance.

## Development

```bash
cargo fmt --all --check
cargo test --workspace
mock_dir="$(mktemp -d)"
ci/build_mrshv2_mock.sh "$mock_dir" >/dev/null
PRECURSOR_MRSHV2_LIB_DIR="$mock_dir" LD_LIBRARY_PATH="$mock_dir:${LD_LIBRARY_PATH:-}" cargo test --workspace --features similarity-mrshv2
```

CI currently tests multiple toolchains and targets, including a pinned Rust `1.86.0` lane.
Dependabot is configured for weekly Cargo and GitHub Actions updates, with auto patch-version bump + auto-tag workflows so dependency updates can flow into release builds.

## Background

- GreyNoise blog: https://www.greynoise.io/blog/precursor-a-quantum-leap-in-arbitrary-payload-similarity-analysis
- GreyNoise Labs writeup: https://www.labs.greynoise.io/grimoire/2023-10-11-precursor/

## License

Dual-licensed under MIT and Unlicense.