Windows API Signature Metadata
sigmd parses Windows SDK and phnt headers via libclang, mines SAL
annotations to derive function/method signatures and per-call buffer
length expressions, and emits an architecture-keyed metadata bundle for
tracers, hookers, and instrumentation tooling. The on-disk format is a
single rkyv archive. There is no deserialization step at load time,
just mmap and go.
Features
- Win32 SDK and NT API coverage. Functions exported from kernel32, user32, advapi32, ..., plus the NT-private surface (Nt*, Rtl*, Ldr*, NtUser*, NtGdi*, ...) sourced from phnt.
- COM interfaces and methods. IUnknown, IDispatch, all the way through DXGI, D3D, Shell, ... UUIDs, base interface, full method signatures.
- SAL-derived buffer length expressions. _In_reads_bytes_(cb), _Out_writes_bytes_to_(max, *written), and friends are lowered into a small expression IR (constants, parameter refs, return value, +, -, *, /, pointer dereference) so a tracer knows how many bytes to copy out of each pointer at call time.
- Per-architecture sizeof resolution. Each header is parsed twice with --target=i686-pc-windows-msvc and --target=x86_64-pc-windows-msvc, because SIZE_T, ULONG_PTR, LPARAM, and others differ across architectures, and sizeof(IDENT) inside SAL only resolves correctly under the right target.
- Near-instant load. Output is a ~45 MB rkyv archive read zero-copy via mmap. An optional ~178 MB JSON mirror is available for human inspection (--json).
- Multiplatform. Headers are fetched cross-platform via xwin. The build itself runs anywhere libclang 18 or newer does (Linux, macOS, Windows).
- Configurable. A YAML config selects header globs, include paths, ignore lists, and custom-type mappings.
Comparison with win32metadata
Microsoft's win32metadata is the canonical machine-readable Win32
surface, distributed as Windows.Win32.winmd. It's a great upgrade over
hand-maintained bindings, but for instrumentation work it has three
gaps:
- No SAL. winmd carries types but not buffer length annotations, so a tracer still has to hand-author "for ReadFile, the byte count is parameter 2, the buffer is parameter 1, the post-call length is *lpNumberOfBytesRead" tables for every call it wants to capture.
- No NT surface. Nt*/Rtl*/Ldr*/NtUser* aren't part of win32metadata. sigmd pulls them from phnt.
- Format. winmd is a CLR metadata blob, and consumers need a CLR reader. sigmd ships a flat rkyv archive and a small Rust crate to navigate it. Roughly 2x the size, in exchange for SAL and NT.
How it works
header.h
  │
  ├─► clang --target=i686-pc-windows-msvc ─────┐
  │   (force-include inject.h: SAL → annotate) │
  └─► clang --target=x86_64-pc-windows-msvc ───┤
                                               ▼
                                         per-arch dedup
                                       (high score wins,
                                     __OVERRIDE / ignore_*)
                                               │
                                               ▼
                                             rkyv ──► metadata.bin (~45 MB)
                                                  └─► metadata.json (~178 MB, opt-in --json)
Inject header
Every translation unit is compiled with
-include assets/windows/include/inject.h, which redefines the SAL
macros (_In_*, _Out_*, _Inout_*, _COM_Outptr_*, ...) to expand
into __attribute__((annotate("__SAL:<counter>:<name>"))) markers.
Without this, libclang strips SAL during parsing. The <counter> comes
from __COUNTER__ and dodges clang's content-based dedup of
AnnotateAttr across redeclarations (the "gethostname pathology"
where two redeclarations share an arg name and get collapsed into one).
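To illustrate the consumer side of these markers, here is a minimal sketch of decoding one annotation string back into its counter and macro name. The `SalMarker` type and `parse_sal_marker` function are hypothetical names, not part of sigmd, and the sketch assumes the string has exactly the `__SAL:<counter>:<name>` shape described above. How macro arguments are encoded is not covered here, so everything after the second colon is kept as an opaque name.

```rust
/// A decoded `__SAL:<counter>:<name>` annotation string (hypothetical type).
#[derive(Debug, PartialEq)]
pub struct SalMarker<'a> {
    /// Value of `__COUNTER__` at the expansion site; only there for uniqueness.
    pub counter: u32,
    /// Everything after the second colon, e.g. `_In_reads_bytes_`.
    pub name: &'a str,
}

/// Returns `None` for annotations that do not look like SAL markers.
pub fn parse_sal_marker(annotation: &str) -> Option<SalMarker<'_>> {
    // Reject unrelated `annotate` attributes early.
    let rest = annotation.strip_prefix("__SAL:")?;
    // Split only at the first colon so the name may itself contain colons.
    let (counter, name) = rest.split_once(':')?;
    Some(SalMarker {
        counter: counter.parse().ok()?,
        name,
    })
}
```

Because the counter differs at every expansion site, two otherwise identical annotations stay distinct, and a reader can simply discard the counter after parsing.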
Double parse
Each header is compiled once for x86 and once for x64
because sizeof(IDENT) resolves to a different Constant between the
two. SIZE_T is 4 bytes on x86 and 8 on x64, and SAL expressions like
_In_reads_bytes_(sizeof(SIZE_T)) need both answers.
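One way to picture the result of the two passes is a per-architecture lookup table from identifier to size. The sketch below is an assumption about the shape of such a table, not sigmd's actual data structure. `Arch`, `SizeofTable`, and `sample_table` are hypothetical names, and the sample entries are hard-coded rather than produced by clang.

```rust
use std::collections::HashMap;

/// The two targets each header is parsed under.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum Arch {
    X86,
    X64,
}

/// Hypothetical per-architecture `sizeof` table.
pub struct SizeofTable {
    sizes: HashMap<Arch, HashMap<String, u64>>,
}

impl SizeofTable {
    /// Resolves `sizeof(ident)` for one architecture, if the identifier is known.
    pub fn resolve(&self, arch: Arch, ident: &str) -> Option<u64> {
        self.sizes.get(&arch)?.get(ident).copied()
    }
}

/// Hard-coded sample: SIZE_T is pointer-sized, ULONG is 4 bytes on both targets.
pub fn sample_table() -> SizeofTable {
    let mut sizes = HashMap::new();
    sizes.insert(
        Arch::X86,
        HashMap::from([("SIZE_T".to_string(), 4), ("ULONG".to_string(), 4)]),
    );
    sizes.insert(
        Arch::X64,
        HashMap::from([("SIZE_T".to_string(), 8), ("ULONG".to_string(), 4)]),
    );
    SizeofTable { sizes }
}
```

With a table like this, `_In_reads_bytes_(sizeof(SIZE_T))` lowers to the constant 4 in the x86 bundle and 8 in the x64 bundle.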
Score-based dedup
The same function appears in many headers
(kernel32 forwarder vs ntdll real, ANSI vs W variants in different SDK
headers, ...). For each duplicate name, sigmd keeps the variant with
the highest score:
score(function) =
    sum over parameters:
        +1          (always)
        +10         if HAS_IN_ATTRIBUTE
        +10         if HAS_OUT_ATTRIBUTE
        +100        if the parameter declaration is not invalid
    +100_000        if the function declaration is not invalid
    +1_000_000      if the function carries the __OVERRIDE annotation
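The scoring rule can be sketched in Rust as follows. The `Param` and `Function` structs and their field names are hypothetical stand-ins for whatever sigmd tracks internally. Only the weights come from the formula above. Tie-breaking between equal scores is not specified here, so the sketch simply takes what `max_by_key` returns.

```rust
/// Hypothetical per-parameter facts used for scoring.
pub struct Param {
    pub has_in: bool,
    pub has_out: bool,
    pub is_invalid: bool,
}

/// Hypothetical per-declaration facts used for scoring.
pub struct Function {
    pub params: Vec<Param>,
    pub is_invalid: bool,
    pub has_override: bool,
}

/// Computes the dedup score using the weights from the formula.
pub fn score(f: &Function) -> u64 {
    let per_param: u64 = f
        .params
        .iter()
        .map(|p| {
            1 + 10 * u64::from(p.has_in)
                + 10 * u64::from(p.has_out)
                + 100 * u64::from(!p.is_invalid)
        })
        .sum();
    per_param
        + 100_000 * u64::from(!f.is_invalid)
        + 1_000_000 * u64::from(f.has_override)
}

/// Picks the highest-scoring candidate among duplicates of one name.
pub fn pick_best(candidates: &[Function]) -> Option<&Function> {
    candidates.iter().max_by_key(|f| score(f))
}
```

For example, a valid two-parameter declaration with one input and one output annotation scores 100,222, while a valid override with no parameters still scores 1,100,000.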
Overrides
Some functions have broken or missing SAL: gethostname
ships with a SAL annotation that confuses every analyzer that reads it,
and Process32First has none at all. The files matched by
assets/windows/src/override/*.c provide hand-written replacements tagged
with __OVERRIDE, so the +1,000,000 term guarantees they win cross-TU
arbitration over whatever the SDK shipped.
Undocumented APIs
APIs like CreateProcessInternalA/W aren't in
the public SDK at all. Hand-written declarations for them live in
assets/windows/src/undoc/*.c and feed the same parser.
Buffer model
Each buffer descriptor carries a direction
(Input/Output), a phase (Pre = length known from input args
before the call, Post = length known from output args after the call),
and a length expression. The expression IR supports constants,
parameter references, the function's return value, +/-/*//, and
pointer dereference. sizeof(IDENT) is resolved at build time using the
per-architecture sizeof table, so the wire format never needs a SizeOf
operator.
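A minimal sketch of how a tracer might evaluate such a length expression is shown below. The `Expr` enum mirrors the operators listed above, but its variant names, the `CallContext` struct, and the `read_ptr` callback are assumptions made for illustration and do not reflect sigmd's actual types. In particular, the width of a pointer dereference is left to the callback.

```rust
/// Hypothetical length-expression IR with the operators listed above.
#[derive(Debug, Clone, PartialEq)]
pub enum Expr {
    Const(u64),
    Param(usize),
    ReturnValue,
    Add(Box<Expr>, Box<Expr>),
    Sub(Box<Expr>, Box<Expr>),
    Mul(Box<Expr>, Box<Expr>),
    Div(Box<Expr>, Box<Expr>),
    Deref(Box<Expr>),
}

/// What the tracer knows about one call (hypothetical).
pub struct CallContext<'a> {
    /// Raw argument values, zero-indexed.
    pub args: &'a [u64],
    /// `None` while evaluating Pre-phase expressions.
    pub return_value: Option<u64>,
    /// Reads an integer from the traced process; `None` on failure.
    pub read_ptr: &'a dyn Fn(u64) -> Option<u64>,
}

/// Evaluates a length expression; `None` means "cannot be computed".
pub fn eval(expr: &Expr, ctx: &CallContext<'_>) -> Option<u64> {
    match expr {
        Expr::Const(c) => Some(*c),
        Expr::Param(i) => ctx.args.get(*i).copied(),
        Expr::ReturnValue => ctx.return_value,
        // Checked arithmetic turns overflow and division by zero into `None`.
        Expr::Add(a, b) => eval(a, ctx)?.checked_add(eval(b, ctx)?),
        Expr::Sub(a, b) => eval(a, ctx)?.checked_sub(eval(b, ctx)?),
        Expr::Mul(a, b) => eval(a, ctx)?.checked_mul(eval(b, ctx)?),
        Expr::Div(a, b) => eval(a, ctx)?.checked_div(eval(b, ctx)?),
        Expr::Deref(p) => (ctx.read_ptr)(eval(p, ctx)?),
    }
}
```

For ReadFile, the Pre-phase capacity would be `Param(2)` and the Post-phase length `Deref(Param(3))`, that is, the value stored at lpNumberOfBytesRead once the call returns.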
Usage
Prerequisites: Rust 1.95+, libclang 18 or newer, and the phnt submodule:
Fetch the Windows SDK (one-time, cross-platform via xwin):
Build the metadata bundle:
Pass --json metadata.json to also emit a human-readable mirror. The
JSON output uses the serde feature, which is on by default through
the cli feature.
As a library. The crate doubles as a zero-copy reader for the archive. Library consumers should disable default features to avoid the CLI dependency tree (clang-sys, clap, rayon, indicatif, ...):
[dependencies]
sigmd = { version = "0.1", default-features = false }
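use std::fs;
use sigmd::Database;

// The archive path and the function name below are placeholders.
let bytes = fs::read("metadata.bin")?;
let db = Database::from_bytes(&bytes)?;

// Look up a function in the x64 metadata.
let func = db.x64.function("ReadFile").expect("function not found");

for param in func.parameters() {
    // Inspect each parameter here.
}

// Walk output buffers. `length()` is the expression to evaluate
// against the call's arguments at trace time.
for buf in func.output_buffers() {
    // Evaluate `buf.length()` here.
}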
Configuration
# Glob patterns for SDK headers to be included in the database.
sdk:
  - winsdk/sdk/include/um/**/*.h
  - winsdk/sdk/include/shared/**/*.h
  - assets/windows/src/undoc/**/*.c
  - assets/windows/src/override/**/*.c
  - ...

# SDK header directories used for resolving includes.
# Passed to clang as `-isystem`.
include:
  - winsdk/sdk/include/um
  - assets/windows/include
  - assets/windows/include/phnt
  - ...

# Force-included headers.
# Passed to clang as `-include`.
inject:
  - assets/windows/include/inject.h

database:
  # Functions with these names will not be included in the database.
  # Leading OR trailing wildcards are supported.
  # For example, `*CreateFile` or `CreateFile*`.
  ignore_functions:
    # `RtlFreeAnsiString` has the same entry point as `RtlFreeUnicodeString`.
    # Let's prefer `RtlFreeUnicodeString`, since the latter is more commonly
    # used.
    - RtlFreeAnsiString

  # Types with these names will not be included in the database.
  # Leading OR trailing wildcards are supported.
  ignore_interfaces:

  # Custom type tags. See `TypeKind::Custom(u8)`.
  custom_types:
    - id: 1
      name: ANSI_STRING
      matches:
The id in custom_types is part of the on-disk wire format. Never
reuse a retired id.
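The wildcard rule used by the ignore lists ("leading OR trailing") can be sketched like this. `matches_ignore` and `is_ignored` are hypothetical helpers, not sigmd functions, and the sketch assumes case-sensitive matching. A pattern with a `*` at both ends is outside the documented rule, and the sketch gives it no special treatment.

```rust
/// Matches one ignore pattern against a function or interface name.
/// Supports a single leading `*` or a single trailing `*`, otherwise
/// requires an exact match.
pub fn matches_ignore(pattern: &str, name: &str) -> bool {
    if let Some(suffix) = pattern.strip_prefix('*') {
        // `*CreateFile` matches any name ending in `CreateFile`.
        name.ends_with(suffix)
    } else if let Some(prefix) = pattern.strip_suffix('*') {
        // `CreateFile*` matches any name starting with `CreateFile`.
        name.starts_with(prefix)
    } else {
        pattern == name
    }
}

/// Returns true if any pattern in the list matches the name.
pub fn is_ignored(patterns: &[&str], name: &str) -> bool {
    patterns.iter().any(|p| matches_ignore(p, name))
}
```

Under these assumptions, `CreateFile*` drops CreateFileA, CreateFileW, and CreateFileMappingW alike, so exact names such as RtlFreeAnsiString are the safer choice when only one entry should go.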
TODO
- WDK headers (kernel-mode drivers, IRPs, IO_STACK_LOCATION, ...).
- Linux / macOS SDKs (libc, syscalls, kAPI surface).
- Type information: structs, unions, enums. sigmd currently records only function and method signatures plus per-call buffer descriptors. Layouts of the underlying types are not exported.
- Stronger cross-TU dedup signals beyond is_invalid_declaration and __OVERRIDE.
License
MIT.