# Windows API Signature Metadata
`sigmd` parses Windows SDK and [phnt] headers via libclang, mines SAL
annotations to derive function/method signatures and per-call buffer
length expressions, and emits an architecture-keyed metadata bundle for
tracers, hookers, and instrumentation tooling. The on-disk format is a
single [rkyv] archive. There is no deserialization step at load time,
just `mmap` and go.
## Features
- **Win32 SDK and NT API coverage.** Functions exported from `kernel32`,
`user32`, `advapi32`, ..., plus the NT-private surface (`Nt*`,
`Rtl*`, `Ldr*`, `NtUser*`, `NtGdi*`, ...) sourced from [phnt].
- **COM interfaces and methods.** From `IUnknown` and `IDispatch` all
the way through DXGI, D3D, Shell, ..., each with its UUID, base
interface, and full method signatures.
- **SAL-derived buffer length expressions.** `_In_reads_bytes_(cb)`,
`_Out_writes_bytes_to_(max, *written)`, and friends are lowered into
a small expression IR (constants, parameter refs, return value,
`+`/`-`/`*`/`/`, pointer dereference) so a tracer knows how many
bytes to copy out of each pointer at call time.
- **Per-architecture sizeof resolution.** Each header is parsed twice
with `--target=i686-pc-windows-msvc` and
`--target=x86_64-pc-windows-msvc`, because `SIZE_T`, `ULONG_PTR`,
`LPARAM`, and others differ across architectures, and `sizeof(IDENT)`
inside SAL only resolves correctly under the right target.
- **Near-instant load.** Output is a ~45 MB rkyv archive read zero-copy
via `mmap`. An optional ~178 MB JSON mirror is available for human
inspection (`--json`).
- **Multiplatform.** Headers are fetched cross-platform via [xwin]. The
build itself runs anywhere libclang 18 or newer does (Linux, macOS,
Windows).
- **Configurable.** A YAML config selects header globs, include paths,
ignore lists, and custom-type mappings.
## Comparison with `win32metadata`
Microsoft's [win32metadata] is the canonical machine-readable Win32
surface, distributed as `Windows.Win32.winmd`. It's a great upgrade over
hand-maintained bindings, but for instrumentation work it has three
gaps:
- **No SAL.** `winmd` carries types but not buffer length annotations,
so a tracer still has to hand-author "for `ReadFile`, the byte count
is parameter 2, the buffer is parameter 1, the post-call length is
`*lpNumberOfBytesRead`" tables for every call it wants to capture.
- **No NT surface.** `Nt*`/`Rtl*`/`Ldr*`/`NtUser*` aren't part of
`win32metadata`. `sigmd` pulls them from [phnt].
- **Format.** `winmd` is a CLR metadata blob, and consumers need a CLR
reader. `sigmd` ships a flat rkyv archive and a small Rust crate to
navigate it. The archive is roughly 2x the size, in exchange for SAL
and NT coverage.
## How it works
```
header.h
  │
  ├─► clang --target=i686-pc-windows-msvc ─────┐
  │   (force-include inject.h: SAL → annotate) │
  └─► clang --target=x86_64-pc-windows-msvc ───┤
                                               ▼
                                        per-arch dedup
                                       (high score wins,
                                    __OVERRIDE / ignore_*)
                                               │
                                               ▼
                                             rkyv ──► metadata.bin  (~45 MB)
                                                  └─► metadata.json (~178 MB, opt-in --json)
```
### Inject header
Every translation unit is compiled with
`-include assets/windows/include/inject.h`, which redefines the SAL
macros (`_In_*`, `_Out_*`, `_Inout_*`, `_COM_Outptr_*`, ...) to expand
into `__attribute__((annotate("__SAL:<counter>:<name>")))` markers.
Without this, libclang strips SAL during parsing. The `<counter>` comes
from `__COUNTER__` and makes every marker unique, which sidesteps
clang's content-based dedup of `AnnotateAttr` across redeclarations (the
"`gethostname` pathology", where two redeclarations share an argument
name and their annotations get collapsed into one).
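On the consuming side, each marker arrives as a string of the form `__SAL:<counter>:<name>`. A minimal sketch of splitting such a string follows. `SalMarker` and `parse_sal_marker` are hypothetical names for illustration, not sigmd's API, and the sketch assumes everything after the second colon can be kept as an opaque name.

```rust
/// Hypothetical parsed form of one annotate-attribute string.
#[derive(Debug, PartialEq)]
struct SalMarker<'a> {
    /// Value `__COUNTER__` expanded to; it only keeps markers unique.
    counter: u32,
    /// Everything after the second colon, treated as opaque here.
    name: &'a str,
}

/// Splits `__SAL:<counter>:<name>`; returns `None` for any other annotation.
fn parse_sal_marker(text: &str) -> Option<SalMarker<'_>> {
    let rest = text.strip_prefix("__SAL:")?;
    let (counter, name) = rest.split_once(':')?;
    Some(SalMarker {
        counter: counter.parse().ok()?,
        name,
    })
}
```

Two markers with the same SAL name but different counters parse to different values, which is the property that keeps content-based dedup from merging them.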
### Double parse
Each header is compiled once for x86 and once for x64
because `sizeof(IDENT)` resolves to a different `Constant` between the
two. `SIZE_T` is 4 bytes on x86 and 8 on x64, and SAL expressions like
`_In_reads_bytes_(sizeof(SIZE_T))` need both answers.
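To illustrate why one parse is not enough, here is a minimal sketch of a per-architecture sizeof lookup. The `Arch` enum, `sizeof_table`, and `resolve_sizeof` are hypothetical, and the table is hard-coded for illustration; in `sigmd` the sizes come from the two clang parses rather than from a literal list.

```rust
use std::collections::HashMap;

/// The two targets each header is parsed for.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Arch {
    X86,
    X64,
}

/// Hypothetical sizeof table: type name -> size in bytes for one target.
fn sizeof_table(arch: Arch) -> HashMap<&'static str, u64> {
    // Pointer-sized types are where the two targets disagree.
    let pointer = match arch {
        Arch::X86 => 4,
        Arch::X64 => 8,
    };
    HashMap::from([
        ("SIZE_T", pointer),
        ("ULONG_PTR", pointer),
        ("LPARAM", pointer),
        ("ULONG", 4), // same size on both targets
    ])
}

/// Resolves `sizeof(IDENT)` to a constant for the given target.
fn resolve_sizeof(arch: Arch, ident: &str) -> Option<u64> {
    sizeof_table(arch).get(ident).copied()
}
```

With this table, `resolve_sizeof(Arch::X86, "SIZE_T")` yields 4 and `resolve_sizeof(Arch::X64, "SIZE_T")` yields 8, so the same SAL expression lowers to a different constant per architecture.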
### Score-based dedup
The same function is often declared in several headers (a kernel32
forwarder versus the real ntdll export, ANSI versus `W` variants in
different SDK headers, ...). For each duplicate name, `sigmd` keeps the
variant with the highest score:
```
score(function) =
    sum over parameters:
        +1            (always)
        +10           if HAS_IN_ATTRIBUTE
        +10           if HAS_OUT_ATTRIBUTE
        +100          if the parameter declaration is not invalid
    +100_000          if the function declaration is not invalid
    +1_000_000        if the function carries the __OVERRIDE annotation
```
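The scoring rule can be sketched in Rust as follows. The `ParamInfo` and `FunctionInfo` structs and the `pick_best` helper are hypothetical stand-ins for whatever `sigmd` tracks internally. The tie-breaking behaviour (the last candidate with the top score wins) is an artifact of this sketch, not something the rule above specifies.

```rust
/// Hypothetical per-parameter facts gathered from one declaration.
struct ParamInfo {
    has_in: bool,
    has_out: bool,
    is_invalid: bool,
}

/// Hypothetical per-declaration facts.
struct FunctionInfo {
    params: Vec<ParamInfo>,
    is_invalid: bool,
    is_override: bool,
}

/// Applies the weights from the scoring table above.
fn score(f: &FunctionInfo) -> u64 {
    let mut total = 0;
    for p in &f.params {
        total += 1;
        if p.has_in {
            total += 10;
        }
        if p.has_out {
            total += 10;
        }
        if !p.is_invalid {
            total += 100;
        }
    }
    if !f.is_invalid {
        total += 100_000;
    }
    if f.is_override {
        total += 1_000_000;
    }
    total
}

/// Returns the index of the highest-scoring candidate, if any.
fn pick_best(candidates: &[FunctionInfo]) -> Option<usize> {
    candidates
        .iter()
        .enumerate()
        .max_by_key(|(_, f)| score(f))
        .map(|(index, _)| index)
}
```

The weights are spaced so that, for functions with a realistic number of parameters, a valid declaration outranks an invalid one and an `__OVERRIDE` declaration outranks everything else.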
### Overrides
Some functions have broken or missing SAL. `gethostname` ships with a
SAL annotation that confuses every analyzer that reads it, and
`Process32First` has none at all. `assets/windows/src/override/*.c`
provides hand-written replacements tagged with `__OVERRIDE`, so the
+1,000,000 score term guarantees they win cross-TU arbitration over
whatever the SDK shipped.
### Undocumented APIs
APIs like `CreateProcessInternalA/W` aren't in
the public SDK at all. Hand-written declarations for them live in
`assets/windows/src/undoc/*.c` and feed the same parser.
### Buffer model
Each buffer descriptor carries a `direction`
(`Input`/`Output`), a `phase` (`Pre` = length known from input args
before the call, `Post` = length known from output args after the call),
and a `length` expression. The expression IR supports constants,
parameter references, the function's return value, `+`/`-`/`*`/`/`, and
pointer dereference. `sizeof(IDENT)` is resolved at build time using the
per-architecture sizeof table, so the wire format never needs a `SizeOf`
operator.
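As a rough illustration of how a tracer might evaluate such a length expression, here is a minimal sketch. The `Expr` enum mirrors the operators listed above, but its names, the `Call` struct, and `eval` are hypothetical and not the crate's actual types. Reading through a pointer is delegated to a caller-supplied closure, since the width of the read and the way target memory is accessed are outside what this section specifies.

```rust
/// Hypothetical mirror of the length-expression IR.
enum Expr {
    Const(u64),
    /// Value of the n-th argument (0-based).
    Param(usize),
    ReturnValue,
    Add(Box<Expr>, Box<Expr>),
    Sub(Box<Expr>, Box<Expr>),
    Mul(Box<Expr>, Box<Expr>),
    Div(Box<Expr>, Box<Expr>),
    /// Read an integer through the pointer the inner expression yields.
    Deref(Box<Expr>),
}

/// What the evaluator needs from one captured call.
struct Call<'a> {
    args: &'a [u64],
    return_value: u64,
    /// Reads an integer from the traced process; `None` if unreadable.
    read: &'a dyn Fn(u64) -> Option<u64>,
}

/// Evaluates `expr`; `None` on overflow, division by zero, a missing
/// argument, or an unreadable pointer.
fn eval(expr: &Expr, call: &Call<'_>) -> Option<u64> {
    Some(match expr {
        Expr::Const(c) => *c,
        Expr::Param(i) => *call.args.get(*i)?,
        Expr::ReturnValue => call.return_value,
        Expr::Add(a, b) => eval(a, call)?.checked_add(eval(b, call)?)?,
        Expr::Sub(a, b) => eval(a, call)?.checked_sub(eval(b, call)?)?,
        Expr::Mul(a, b) => eval(a, call)?.checked_mul(eval(b, call)?)?,
        Expr::Div(a, b) => eval(a, call)?.checked_div(eval(b, call)?)?,
        Expr::Deref(p) => (call.read)(eval(p, call)?)?,
    })
}
```

For `ReadFile`, the pre-call capacity of the output buffer would be `Expr::Param(2)`, and the post-call length would be `Expr::Deref(Box::new(Expr::Param(3)))`, that is, `*lpNumberOfBytesRead`.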
## Usage
**Prerequisites:** Rust 1.95+, libclang 18 or newer, and the [phnt]
submodule:
```sh
git submodule update --init
```
**Fetch the Windows SDK** (one-time, cross-platform via [xwin]):
```sh
./scripts/fetch-winsdk.sh --output winsdk
```
**Build the metadata bundle:**
```sh
cargo run --release -- --config sigmd.yaml build --output metadata.bin
```
Pass `--json metadata.json` to also emit a human-readable mirror. The
JSON output uses the `serde` feature, which is on by default through
the `cli` feature.
**As a library.** The crate doubles as a zero-copy reader for the
archive. Library consumers should disable default features to avoid
the CLI dependency tree (clang-sys, clap, rayon, indicatif, ...):
```toml
[dependencies]
sigmd = { version = "0.1", default-features = false }
```
```rust,ignore
use std::fs;

use sigmd::Database;

let bytes = fs::read("metadata.bin")?;
let db = Database::from_bytes(&bytes)?;

// Look up a function in the x64 metadata.
let func = db.x64().function("NtCreateFile").expect("NtCreateFile");

for param in func.parameters() {
    println!(
        " {:?}: {} (indir = {})",
        param.name(),
        param.ty().name(),
        param.ty().indirections(),
    );
}

// Walk output buffers. `length()` is the expression to evaluate
// against the call's arguments at trace time.
for buf in func.output_buffers() {
    println!(" out param {} len = {:?}", buf.parameter(), buf.length());
}
```
## Configuration
```yaml
# Glob patterns for SDK headers to be included in the database.
sdk:
  - winsdk/sdk/include/um/**/*.h
  - winsdk/sdk/include/shared/**/*.h
  - assets/windows/src/undoc/**/*.c
  - assets/windows/src/override/**/*.c
  - ...

# SDK header directories used for resolving includes.
# Passed to clang as `-isystem`.
include:
  - winsdk/sdk/include/um
  - assets/windows/include
  - assets/windows/include/phnt
  - ...

# Force-included headers.
# Passed to clang as `-include`.
inject:
  - assets/windows/include/inject.h

database:
  # Functions with these names will not be included in the database.
  # Leading OR trailing wildcards are supported.
  # For example, `*CreateFile` or `CreateFile*`.
  ignore_functions:
    # `RtlFreeAnsiString` has the same entry point as `RtlFreeUnicodeString`.
    # Let's prefer `RtlFreeUnicodeString`, since the latter is more commonly
    # used.
    - RtlFreeAnsiString

  # Types with these names will not be included in the database.
  # Leading OR trailing wildcards are supported.
  ignore_interfaces: []

  # Custom type tags. See `TypeKind::Custom(u8)`.
  custom_types:
    - id: 1
      name: ANSI_STRING
      matches: [_ANSI_STRING, _STRING, _LSA_STRING]
```
The `id` in `custom_types` is part of the on-disk wire format. Never
reuse a retired id.
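The wildcard rule for `ignore_functions` and `ignore_interfaces` (a leading or a trailing `*`, not both) can be sketched as below. `matches_ignore` is a hypothetical helper, and the sketch assumes matching is case-sensitive, which the config comments do not state.

```rust
/// Matches `name` against a pattern with an optional leading or trailing `*`.
/// A pattern with both wildcards is not supported by this sketch.
fn matches_ignore(pattern: &str, name: &str) -> bool {
    if let Some(suffix) = pattern.strip_prefix('*') {
        // `*CreateFile` matches any name ending in `CreateFile`.
        name.ends_with(suffix)
    } else if let Some(prefix) = pattern.strip_suffix('*') {
        // `CreateFile*` matches any name starting with `CreateFile`.
        name.starts_with(prefix)
    } else {
        // No wildcard: exact match only.
        pattern == name
    }
}
```

Under these assumptions, `*CreateFile` matches `NtCreateFile` but not `CreateFileW`, and `CreateFile*` does the opposite.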
## TODO
- **WDK headers** (kernel-mode drivers, IRPs, `IO_STACK_LOCATION`, ...).
- **Linux / macOS** SDKs (libc, syscalls, kAPI surface).
- **Type information**: structs, unions, enums. `sigmd` currently
records only function and method signatures plus per-call buffer
descriptors. Layouts of the underlying types are not exported.
- **Stronger cross-TU dedup signals** beyond `is_invalid_declaration`
and `__OVERRIDE`.
## License
MIT.
<!-- readme end -->
[phnt]: https://github.com/winsiderss/phnt
[xwin]: https://github.com/Jake-Shadle/xwin
[rkyv]: https://github.com/rkyv/rkyv
[win32metadata]: https://github.com/microsoft/win32metadata