precursor
precursor is a CLI for pre-protocol payload tagging + similarity clustering across packet payloads, logs, and raw binary fragments.
It combines PCRE2 named-capture matching, TLSH/LZJD/FBHash similarity, optional MRSHv2 adapter mode, and JSON outputs designed for fast triage loops before full parser or protocol tooling exists.
Release 0.2.1 Highlights
| Area | What landed |
|---|---|
| Packet Inference | Single-packet protocol scoring via -P / -A / -k |
| Blob + Binary Processing | -z, --input-blob and -B, --input-binary for multiline and raw-byte stream analysis |
| Similarity Workflows | TLSH, LZJD, or FBHash clustering + protocol hints (--protocol-hints) for discovery loops |
| Sigma Compatibility | --sigma-rule converts selectors into named PCRE captures and enforces Sigma condition logic |
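| Regex Engine Scaffold | --regex-engine selection (pcre2 or vectorscan); vectorscan mode emits compatibility checks and executes through the current PCRE2 path |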
| Output Contract | Stable protocol_*, similarity_hash, tags, xxh3_64_sum JSON fields |
| Reliability | Runtime ingest path no longer relies on panic-prone expect(...) calls |
| Scenario Corpus | Versioned packet/firmware/ICS + public PCAP/log/Sigma-derived samples in samples/scenarios/ |
| Release Ops | Dependency auto-bump/tag workflows + benchmark harness + Pages site |
60-second teaser
Representative output shape:
Why this matters:
- Fast pre-protocol triage when DPI/parsers are unavailable.
- Stable JSON fields for SOC pipelines and enrichment tooling.
- Built-in cluster context for LLM-assisted protocol discovery.
Best Fit
- Rapid triage of payload lines from logs, brokers, sensors, or ad-hoc captures.
- Label-first workflows where regex capture names become downstream tags.
- Similarity-first clustering when full protocol parsers are unavailable or too brittle.
- Early-stage protocol discovery where hints are fed to humans or LLM tooling.
Non-Goals
- Not a replacement for full IDS/NSM stacks (Suricata, Zeek).
- Not a malware rule engine replacement (YARA / YARA-X).
- Not a full protocol parser stack; this is pre-protocol triage and clustering.
Architecture
Install
Cargo
From source
Release archives include generated shell completion files (bash, fish, zsh, powershell) and a precursor.1 man page.
Optional: Build with MRSHv2 adapter mode
mock_dir=""
PRECURSOR_MRSHV2_LIB_DIR=""
Quick start
1) Match string payloads from stdin
2) Match base64 payloads
3) Load patterns from file
4) Extract payload from JSON before matching
5) Enable similarity diffing (TLSH default)
6) Switch to LZJD similarity mode
7) Switch to FBHash similarity mode
8) Emit protocol-discovery hints for an LLM loop
9) Enable single-packet protocol inference output
10) Match a multiline payload as one blob
11) Match a raw-binary blob (short flag)
12) Load Sigma rules with condition gating
13) Run vectorscan compatibility mode (PCRE2 fallback)
14) Replay a real public Log4Shell PCAP (FBHash mode)
15) Triage public firmware blobs in binary folder mode
CLI reference
precursor [PATTERN] [OPTIONS]
At least one pattern source is required: positional PATTERN, --pattern-file, or --sigma-rule.
Pattern source:
- positional PATTERN (single named-capture regex)
- -p, --pattern-file <PATH> (one named-capture pattern per line)
- --sigma-rule <PATH> (Sigma YAML selectors converted to named-capture PCRE patterns with condition enforcement)
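To illustrate how capture names turn into tags, here is a minimal Python sketch. It uses Python's built-in re module as a stand-in for PCRE2 (both accept the (?P<name>...) named-group syntax). The pattern names and the sample payload are hypothetical, and this is not precursor's actual implementation.

```python
import re

# Hypothetical named-capture patterns, one per entry as in a pattern file.
PATTERNS = [
    r"(?P<http_get>^GET\s+/\S*\s+HTTP/1\.[01])",
    r"(?P<jndi_lookup>\$\{jndi:(?:ldap|rmi|dns)://)",
]

def tag_payload(payload: str, patterns=PATTERNS) -> list[str]:
    """Return the sorted names of every named group that matched."""
    tags = set()
    for source in patterns:
        match = re.search(source, payload)
        if match:
            # groupdict() maps each capture name to its text, or None if unused.
            tags.update(
                name for name, text in match.groupdict().items() if text is not None
            )
    return sorted(tags)

# Invented sample payload (documentation-range IP address).
sample = "GET /?x=${jndi:ldap://198.51.100.7/a} HTTP/1.1"
print(tag_payload(sample))
```

A payload that matches several patterns collects several tags, which is what makes the capture name a convenient label for downstream filtering.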
Input:
- -f, --input-folder <PATH>: read newline-delimited content from files
- stdin: read newline-delimited input from standard input
- -z, --input-blob: process each input source as one blob instead of line splitting
- -B, --input-binary: treat each source as raw binary bytes (implies blob processing semantics)
- -m, --input-mode <base64|string|hex|binary>: decode mode (default: base64)
- -j, --input-json-key <QUERY>: extract payload from JSON input first
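The decode modes can be pictured with a short Python sketch. The function name decode_payload is hypothetical, and treating string and binary as pass-through is an assumption about the semantics rather than a description of precursor's internals.

```python
import base64
import binascii

def decode_payload(raw: bytes, mode: str = "base64") -> bytes:
    """Decode one input record according to the selected mode."""
    if mode == "base64":
        # validate=True rejects characters outside the base64 alphabet.
        return base64.b64decode(raw.strip(), validate=True)
    if mode == "hex":
        return binascii.unhexlify(raw.strip())
    if mode in ("string", "binary"):
        # Assumption: these modes hand the bytes through unchanged.
        return raw
    raise ValueError(f"unsupported input mode: {mode}")

print(decode_payload(b"474554202f", mode="hex"))
```

Because base64 is the default, plain-text input needs the string mode to be selected explicitly; otherwise the decode step will reject or mangle it.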
Similarity:
- -t, --tlsh: compute TLSH hash for matched payloads
- -d, --tlsh-diff: compute pairwise TLSH distance among matched payloads
- -a, --tlsh-algorithm <48_1|128_1|128_3|256_1|256_3>
- -x, --tlsh-distance <N>: max distance threshold (default: 100)
- -l, --tlsh-length: include payload length in diff scoring
- -y, --tlsh-sim-only: only output payloads that have TLSH similarities
- --similarity-mode <tlsh|lzjd|mrshv2|fbhash>:
  - tlsh, lzjd, and fbhash are implemented in default builds
  - mrshv2 is implemented behind --features similarity-mrshv2 and native adapter linking
  - fbhash currently uses an in-tree FBHash-inspired chunk-vector model for stream-friendly pairwise scoring
- --protocol-hints: emit LLM-oriented protocol-discovery hint JSON to stderr
- --protocol-hints-limit <N>: limit hint candidate count (default: 25)
- -P, --single-packet: enable heuristic protocol inference on each matched payload
- -A, --abstain-threshold <0.0-1.0>: minimum confidence required to emit a non-unknown label (default: 0.65)
- -k, --protocol-top-k <N>: candidate count included in protocol_candidates (default: 3)
- --regex-engine <pcre2|vectorscan>: regex engine selection (vectorscan mode emits compatibility checks and executes through current PCRE2 path)
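The interaction between the abstain threshold and the top-k candidate count can be sketched as follows. This is an illustrative Python model only: the scoring input, the candidate keys (protocol, score), and the choice to report the best score even when abstaining are all assumptions, not precursor's documented behavior.

```python
def infer_label(scores: dict[str, float],
                abstain_threshold: float = 0.65,
                top_k: int = 3) -> dict:
    """Rank candidates and abstain when the best score is below the threshold."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    # Keep only the top_k candidates (hypothetical candidate shape).
    candidates = [{"protocol": name, "score": score} for name, score in ranked[:top_k]]
    best_name, best_score = ranked[0] if ranked else ("unknown", 0.0)
    abstained = best_score < abstain_threshold
    return {
        "protocol_label": "unknown" if abstained else best_name,
        "protocol_confidence": best_score,
        "protocol_abstained": abstained,
        "protocol_candidates": candidates,
    }

# Invented scores for illustration.
print(infer_label({"http": 0.82, "dns": 0.10, "modbus": 0.05, "mqtt": 0.03}))
print(infer_label({"http": 0.40, "dns": 0.35}))
```

Raising the threshold trades coverage for precision: more payloads come back as unknown, but the labels that are emitted carry higher confidence.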
Other:
- -s, --stats: emit run statistics JSON to stderr
Output model
Each matched payload is emitted as JSON on stdout with fields such as:
- tags: array of matched capture names
- tlsh: active similarity hash when enabled (legacy field name preserved for compatibility)
- similarity_hash: backend-agnostic similarity hash field
- xxh3_64_sum: stable payload key for report correlation
- tlsh_similarities: distance map when --tlsh-diff is enabled
- protocol_label: top protocol guess (or unknown when abstaining)
- protocol_confidence: confidence score for protocol_label
- protocol_abstained: whether inference abstained under threshold
- protocol_candidates: scored candidate list with evidence strings
- sigma_rule_matches: Sigma rule titles whose condition evaluated true (when --sigma-rule is used)
- sigma_rule_ids: stable Sigma rule IDs/slugs that evaluated true
When --stats is enabled, a summary JSON object is emitted to stderr.
See STATS.md for schema, field meanings, and jq examples.
When --protocol-hints is enabled, an additional hint JSON block is emitted to stderr for LLM-guided protocol discovery workflows, including protocol_* fields when single-packet inference is enabled.
When both --single-packet and --tlsh-diff are enabled, protocol confidence is cluster-boosted using similarity neighbor counts.
When --input-blob is enabled (or --input-binary is set), each file/stdin stream is treated as a single candidate payload.
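As a rough illustration of consuming this output downstream, the Python sketch below groups records by tag. It assumes one JSON object per line on stdout and a string-valued xxh3_64_sum; both are assumptions, and the sample records are invented (only the field names come from the list above).

```python
import json
from collections import defaultdict

def group_by_tag(lines) -> dict:
    """Map each tag to the xxh3_64_sum keys of the records that carry it."""
    groups = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        for tag in record.get("tags", []):
            groups[tag].append(record.get("xxh3_64_sum"))
    return dict(groups)

# Hypothetical records standing in for captured stdout.
sample_output = [
    '{"tags": ["http_get"], "xxh3_64_sum": "a1", "protocol_label": "http"}',
    '{"tags": ["http_get", "jndi_lookup"], "xxh3_64_sum": "b2", "protocol_label": "unknown"}',
]
print(group_by_tag(sample_output))
```

Keying on xxh3_64_sum rather than on list position keeps the grouping stable when the same payloads are re-run or reports are merged.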
Stats quick view
Notes:
- Compare may be empty when too few matched payloads produce pairwise distances.
- Historical record field names like tlsh_similarities are preserved for compatibility across TLSH/LZJD/FBHash modes.
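The first note follows from how pairwise comparison works: n matched payloads yield n(n-1)/2 pairs, so a single match yields none. The Python sketch below shows that shape with a deliberately simple byte-histogram distance. That distance is a toy stand-in, not TLSH, LZJD, or FBHash, and the inclusive threshold and function names are assumptions.

```python
from itertools import combinations

def byte_histogram_distance(a: bytes, b: bytes) -> int:
    """Toy stand-in for a similarity-hash distance: L1 distance of byte counts."""
    counts = [0] * 256
    for byte in a:
        counts[byte] += 1
    for byte in b:
        counts[byte] -= 1
    return sum(abs(c) for c in counts)

def pairwise_similarities(payloads: dict[str, bytes], max_distance: int = 100) -> dict:
    """Return {key: {other_key: distance}} for every pair within max_distance."""
    result = {key: {} for key in payloads}
    for (ka, pa), (kb, pb) in combinations(payloads.items(), 2):
        distance = byte_histogram_distance(pa, pb)
        if distance <= max_distance:  # assumption: threshold is inclusive
            result[ka][kb] = distance
            result[kb][ka] = distance
    return result

print(pairwise_similarities({"a1": b"GET /a", "b2": b"GET /b"}))
print(pairwise_similarities({"a1": b"GET /a"}))  # one payload, no pairs
```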
Positioning vs adjacent tools
- Use Suricata/Zeek for full protocol-aware IDS/NSM and rich ecosystem integrations.
- Use YARA/YARA-X for signature-based scanning of files and malware-centric workflows.
- Use Sigma for backend-agnostic detection content and SIEM portability.
- Use Precursor when you need lightweight payload tagging + similarity clustering, or when you want to run Sigma keyword intent directly against raw payload streams via --sigma-rule.
Scenario corpus and demos
- Scenario corpus: samples/scenarios/
- Demo runner: samples/scenarios/run_all.sh
- Static demo site source: site/
- Includes packet/firmware/ICS plus public PCAP-derived Log4Shell probes, real fox-it Log4Shell PCAP replay, Sigma shell-command triage, Zeek DNS log triage, and real binwalk firmware blob samples.
- Site includes mini replay reels that visually walk through PCAP replay, firmware blob triage, and Sigma labeling behavior.
- tshark is only required when regenerating PCAP-derived payload extracts.
Benchmarks
Generate a reproducible scenario snapshot:
Committed baseline: benchmarks/baseline-2026-02-13.md
GitHub Pages + precursor.hashdb.io
- Pages workflow: .github/workflows/pages.yml
- Site content: site/
- Custom domain file: site/CNAME
- Configure DNS at your provider with:
  - record type: CNAME
  - host: precursor
  - value: obsecurus.github.io
Current roadmap
See ROADMAP.md for prioritized milestones and release criteria.
See SIMILARITY_BACKENDS.md for MRSHv2/FBHash feasibility and backend sequencing.
See SIGMA_INTEGRATION.md for Sigma feature coverage and next steps.
See HARDWARE_ACCELERATION.md for regex acceleration/offload strategy.
See STATS.md for run-statistics schema and usage guidance.
Development
mock_dir=""
PRECURSOR_MRSHV2_LIB_DIR="" LD_LIBRARY_PATH=":"
CI currently tests multiple toolchains and targets, including a pinned Rust 1.86.0 lane.
Dependabot is configured for weekly Cargo and GitHub Actions updates, with auto patch-version bump + auto-tag workflows so dependency updates can flow into release builds.
Background
- GreyNoise blog: https://www.greynoise.io/blog/precursor-a-quantum-leap-in-arbitrary-payload-similarity-analysis
- GreyNoise Labs writeup: https://www.labs.greynoise.io/grimoire/2023-10-11-precursor/
License
Dual-licensed under MIT and Unlicense.