Overview
skill-veil is an open-source static analysis and policy tool for the agent extension supply chain.
It helps answer a narrow but useful operational question:
should this skill, prompt pack, instruction file, MCP manifest, or related artifact be allowed, reviewed, or blocked before it lands in a repo or CI pipeline?
It is strongest as a static security and policy layer, not as a universal malware engine.
Key Features
| Feature | Description |
|---|---|
| Agent Extension Coverage | First-class support for SKILL.md, AGENTS.md, CLAUDE.md, SYSTEM.md, prompt packs, and MCP manifests |
| Artifact Analysis | Inspects referenced scripts, manifests, lockfiles, Docker artifacts, and operational configs |
| Policy Engine | Three actions (log, require_approval, block) with profiles, waivers, baselines, and overrides |
| CI-Friendly Output | Text, JSON, SARIF, SHIELD, diff mode, compact CI summary, and PR gating support |
| External Rule Packs | Versioned official and community rule packs with fixtures and validation |
| Benchmarking | Labeled corpus, confidence calibration, threshold tuning, and release history dashboard |
What It Detects
| Category | Examples |
|---|---|
| Behavior | Remote execution, install hooks, deferred execution, persistence |
| Supply Chain | Unpinned dependencies, missing lockfiles, remote MCP endpoints |
| Prompt Risk | Persistent instruction tampering, cognitive rootkits, prompt packs |
| Tooling Risk | Tool abuse, autonomy escalation, approval bypass patterns |
| Runtime Risk | Privileged containers, host mounts, process execution, secret access |
| Artifacts | package.json, requirements.txt, pyproject.toml, Cargo.toml, Dockerfile, docker-compose, lockfiles, Makefile, .npmrc, pip.conf |
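To make the "unpinned dependencies" category concrete, here is a minimal Python sketch of one way such a check could work for requirements.txt. It is an illustration only, not skill-veil's actual rule logic: the function name `unpinned_requirements`, the regular expression, and the decision to treat only exact `==` or `===` versions as pinned are all assumptions made for this example.

```python
import re

# Assumption for this sketch: a requirement is "pinned" only when it uses an
# exact "==" or "===" version with no wildcard.
PINNED = re.compile(
    r"^[A-Za-z0-9._-]+(\[[^\]]*\])?\s*={2,3}\s*[^\s;*]+\s*(;.*)?$"
)


def unpinned_requirements(text: str) -> list[str]:
    """Return requirement lines that lack an exact version pin."""
    flagged = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):
            # Skip blank lines and pip options such as "-r other.txt".
            continue
        if not PINNED.match(line):
            flagged.append(line)
    return flagged


sample = """\
requests==2.31.0
flask>=2.0
numpy
# a comment
-r other.txt
urllib3==1.26.*
"""

print(unpinned_requirements(sample))
# ['flask>=2.0', 'numpy', 'urllib3==1.26.*']
```

A real analyzer would also need to handle URL requirements, hashes, and editable installs, which this sketch deliberately ignores.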
Installation
From Source
From a GitHub Release
# Example
Full installation notes: docs/installation.md
Quick Start
# Scan a strict entrypoint
# Scan a package with manifests and related artifacts
# Scan agent-extension targets beyond SKILL.md
Usage
Command Line Interface
# Auto scan
# Strict explicit-entrypoint scan
# Package scan
# Dataset / marketplace / monorepo mode
Common Commands
| Command | Description |
|---|---|
| scan | Auto-discover and scan files or directories |
| scan-file | Scan a strict explicit entrypoint |
| scan-package | Scan a package without promoting docs to entrypoints |
| scan-dataset | Scan many packages in a repo, dataset, or marketplace mirror |
| benchmark | Run the labeled benchmark corpus |
| baseline create | Create a baseline from a JSON report |
| baseline update | Update a baseline safely |
| waivers validate | Validate waiver configuration |
| diff | Compare two JSON reports with baseline/waiver awareness |
| rules validate | Validate external rule packs |
| rules test | Test one rule against inline content |
| rules test-pack | Run pack fixtures |
| rules pack-info | Summarize external rule packs |
| policy validate | Validate a policy file |
Useful Options
| Option | Description |
|---|---|
| --format text/json/sarif/shield | Output format |
| --preset local/ci/strict/enterprise | Apply output and policy presets |
| --quiet-summary | Compact text output |
| --explain-policy | Focus on policy reasoning instead of finding details |
| --baseline | Accepted findings baseline |
| --waivers | Waiver file |
| --policy | Policy file |
| --ci-summary | Compact diff summary for CI |
| --fail-on <mode> | CI diff failure mode (new-active or new-blocking) |
| --dashboard-output | Write benchmark history dashboard |
Examples
Review a suspicious package
Generate a report for CI
Baseline + diff workflow
Benchmark with history and dashboard
Rule pack development
Optional YARA support
YARA usage notes and an example rule live in:
External dataset validation
For marketplace mirrors or local corpora that are intentionally kept out of Git:
Curated example packages
- safe skill: examples/safe-skill/
- suspicious skill: examples/suspicious-skill/
- malicious skill: examples/malicious-skill/
- manifest-heavy package: examples/manifest-package/
- referenced script package: examples/referenced-script-package/
- agent instructions: examples/agent-instructions/
- prompt pack: examples/prompt-pack/
- MCP manifest: examples/mcp-server/
Daily analyst triage
That view is intentionally short and stable for daily review:
- package id
- verdict
- package health
- blast radius
- top rule
- strongest scope/reason
Use Cases
1. Review a third-party skill before installing it
Use this when someone shares a SKILL.md, AGENTS.md, or similar entrypoint
and you want a fast local decision.
What you get:
- findings grouped by severity and category
- a final action: log, require_approval, or block
- policy escalation reasons if the artifact implies extra blast radius
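One plausible way to reduce many per-finding actions to a single final action is to take the strictest one. The Python sketch below illustrates that idea only. The ordering table `STRICTNESS` and the helper `final_action` are hypothetical names, and the real policy engine may combine actions differently, for example through escalation rules.

```python
# Actions ordered from least to most strict, using the names from this README.
# The numeric ranks are an assumption made for this sketch.
STRICTNESS = {"log": 0, "require_approval": 1, "block": 2}


def final_action(actions):
    """Return the strictest action in the iterable, defaulting to "log"."""
    return max(actions, key=STRICTNESS.__getitem__, default="log")


print(final_action(["log", "require_approval", "log"]))  # require_approval
```

With this ordering, a single block finding is enough to block the artifact, and an empty finding list falls back to log.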
2. Review a whole package, not only the root document
Use this when a skill repo also contains manifests, install hooks, scripts, or container files.
This is the most important mode for real reviews because it inspects:
- the explicit entrypoint
- referenced scripts
- manifests and lockfiles
- Docker and runtime artifacts
3. Scan agent instruction files and prompt packs
Use this when the risky part is not a classic skill but a persistent instruction surface.
This is useful for:
- persistent prompt tampering
- cognitive rootkits
- approval bypass patterns
- prompt-pack review before publishing or importing
4. Review an MCP manifest before enabling a server
Use this when you want to inspect an MCP server descriptor for remote connectivity, command execution, or tool-scope concerns.
5. Add a CI gate to block only new active findings
Use this when you already have accepted debt and only want to stop regressions.
This is the practical workflow for teams because it separates:
- existing accepted findings
- waived findings
- new active findings
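The three-way separation above can be sketched in Python as follows. This is a hedged illustration rather than the tool's implementation: the `Finding` class, its `fingerprint` key, the rule ids, and the interpretation of the `new-active` and `new-blocking` modes are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    rule_id: str
    path: str
    action: str  # "log", "require_approval", or "block"

    @property
    def fingerprint(self) -> str:
        # Hypothetical identity key; the real tool may identify findings differently.
        return f"{self.rule_id}:{self.path}"


def partition(findings, baseline, waivers):
    """Split findings into accepted, waived, and new active groups."""
    groups = {"accepted": [], "waived": [], "new_active": []}
    for finding in findings:
        if finding.fingerprint in waivers:
            groups["waived"].append(finding)
        elif finding.fingerprint in baseline:
            groups["accepted"].append(finding)
        else:
            groups["new_active"].append(finding)
    return groups


def should_fail(groups, mode="new-active"):
    """Decide whether CI fails, under the assumed meaning of each mode."""
    new = groups["new_active"]
    if mode == "new-active":
        return bool(new)  # any new, unaccepted finding fails the build
    if mode == "new-blocking":
        return any(f.action == "block" for f in new)  # only new block-level findings fail
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical rule ids and paths, for illustration only.
findings = [
    Finding("remote-exec", "install.sh", "block"),
    Finding("unpinned-dep", "requirements.txt", "require_approval"),
    Finding("host-mount", "docker-compose.yml", "require_approval"),
]
baseline = {"remote-exec:install.sh"}
waivers = {"unpinned-dep:requirements.txt"}

groups = partition(findings, baseline, waivers)
print({name: [f.rule_id for f in items] for name, items in groups.items()})
print(should_fail(groups, "new-active"), should_fail(groups, "new-blocking"))
```

In this example only the host-mount finding is new, so the gate fails in the new-active mode but passes in the new-blocking mode.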
6. Manage accepted risk with baseline and waivers
Use this when some findings are known and reviewed, but you still want the tool to stay strict about new ones.
7. Scan a catalog, dataset, or marketplace mirror
Use this when you have many packages and want aggregate review instead of single-file analysis.
This is the right mode for:
- internal marketplaces
- downloaded skill corpora
- large monorepos of agent extensions
8. Measure whether the scanner got better or worse
Use this when changing rules, scoring, or analyzers.
This tells you:
- precision and recall
- false positive rate
- exact label accuracy
- confidence calibration
- threshold recommendations
- release-to-release trend
Output Formats
| Format | Use Case |
|---|---|
| text | Local review |
| json | Automation, baselines, diff, dashboards |
| sarif | GitHub Code Scanning |
| shield | Policy-oriented markdown |
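For readers unfamiliar with SARIF, the Python sketch below builds a minimal SARIF 2.1.0 document of the kind that code scanning services ingest. It shows the general shape of the format only. The input dictionary keys (`rule_id`, `level`, `message`, `path`, `line`), the sample finding, and the helper `to_sarif` are hypothetical and do not describe skill-veil's report schema.

```python
import json


def to_sarif(tool_name, findings):
    """Build a minimal SARIF 2.1.0 log from simple finding dictionaries."""
    results = [
        {
            "ruleId": f["rule_id"],
            "level": f["level"],  # SARIF levels: "none", "note", "warning", "error"
            "message": {"text": f["message"]},
            "locations": [
                {
                    "physicalLocation": {
                        "artifactLocation": {"uri": f["path"]},
                        "region": {"startLine": f["line"]},
                    }
                }
            ],
        }
        for f in findings
    ]
    return {
        "$schema": "https://json.schemastore.org/sarif-2.1.0.json",
        "version": "2.1.0",
        "runs": [{"tool": {"driver": {"name": tool_name}}, "results": results}],
    }


# Hypothetical finding, for illustration only.
report = to_sarif(
    "skill-veil",
    [
        {
            "rule_id": "remote-exec",
            "level": "error",
            "message": "Script downloads and executes remote content.",
            "path": "install.sh",
            "line": 3,
        }
    ],
)
print(json.dumps(report, indent=2))
```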
Benchmarking
The repository ships with a labeled benchmark corpus and release history.
Current benchmark reporting includes:
- precision
- recall
- false positive rate
- accuracy
- exact label accuracy
- TP / FP / TN / FN
- corpus coverage by label and focus category
- confidence calibration by evidence, category, and signal pair
- threshold recommendations
- markdown dashboard for release-to-release comparison
Methodology: docs/benchmark-methodology.md
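The headline metrics above follow the standard confusion-matrix definitions. The Python sketch below computes them from TP / FP / TN / FN counts. The function name `benchmark_metrics` and the sample counts are invented for illustration and are not taken from the project's benchmark results.

```python
def benchmark_metrics(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    """Compute standard binary-classification metrics from confusion counts."""

    def ratio(numerator: int, denominator: int) -> float:
        # Return 0.0 instead of dividing by zero when a class is absent.
        return numerator / denominator if denominator else 0.0

    return {
        "precision": ratio(tp, tp + fp),            # flagged items that were truly bad
        "recall": ratio(tp, tp + fn),               # truly bad items that were flagged
        "false_positive_rate": ratio(fp, fp + tn),  # benign items wrongly flagged
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
    }


# Invented counts, for illustration only.
for name, value in benchmark_metrics(tp=40, fp=10, tn=45, fn=5).items():
    print(f"{name}: {value:.3f}")
```

Precision and false positive rate are the two numbers to watch when tightening rules, since a gain in recall that floods reviewers with benign findings rarely helps in CI.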
Rule Packs
External versioned packs under rules/official/ are the primary default rule
source. Embedded rules are a fallback only.
Rule pack docs:
Documentation
- docs/architecture.md
- docs/changelog.md
- docs/roadmap.md
- docs/threat-model.md
- docs/usage-local.md
- docs/usage-ci.md
- docs/agent-extensions.md
- docs/policy-model.md
- docs/policy-presets.md
- docs/finding-model.md
- docs/verdict-model.md
- docs/analyst-interpretation.md
- docs/json-report-schema-v3.md
- docs/artifact-analysis.md
- docs/release-process.md
Contributing
Contributions are welcome.
Start here:
Support the Project
If skill-veil is useful to you, consider supporting its maintenance:
License
This project is licensed under the MIT License. See LICENSE.
Attribution:
- Repository: github.com/seifreed/skill-veil