sigmd 0.1.0

Windows API signature metadata

sigmd parses Windows SDK and phnt headers via libclang, mines SAL annotations to derive function/method signatures and per-call buffer length expressions, and emits an architecture-keyed metadata bundle for tracers, hookers, and instrumentation tooling. The on-disk format is a single rkyv archive, so there is no deserialization step at load time: a consumer maps the file and reads it in place.

Features

  • Win32 SDK and NT API coverage. Functions exported from kernel32, user32, advapi32, ..., plus the NT-private surface (Nt*, Rtl*, Ldr*, NtUser*, NtGdi*, ...) sourced from phnt.

  • COM interfaces and methods. IUnknown, IDispatch, all the way through DXGI, D3D, Shell, ... UUIDs, base interface, full method signatures.

  • SAL-derived buffer length expressions. _In_reads_bytes_(cb), _Out_writes_bytes_to_(max, *written), and friends are lowered into a small expression IR (constants, parameter refs, return value, the arithmetic operators +, -, *, and /, and pointer dereference) so a tracer knows how many bytes to copy out of each pointer at call time.

  • Per-architecture sizeof resolution. Each header is parsed twice, once with --target=i686-pc-windows-msvc and once with --target=x86_64-pc-windows-msvc, because SIZE_T, ULONG_PTR, LPARAM, and others differ across architectures, and sizeof(IDENT) inside SAL only resolves correctly under the right target.

  • Near-instant load. Output is a ~45 MB rkyv archive read zero-copy via mmap. An optional ~178 MB JSON mirror is available for human inspection (--json).

  • Multiplatform. Headers are fetched cross-platform via xwin. The build itself runs anywhere libclang 18 or newer does (Linux, macOS, Windows).

  • Configurable. A YAML config selects header globs, include paths, ignore lists, and custom-type mappings.

Comparison with win32metadata

Microsoft's win32metadata is the canonical machine-readable Win32 surface, distributed as Windows.Win32.winmd. It's a great upgrade over hand-maintained bindings, but for instrumentation work it has three gaps:

  • No SAL. winmd carries types but not buffer length annotations, so a tracer still has to hand-author "for ReadFile, the byte count is parameter 2, the buffer is parameter 1, the post-call length is *lpNumberOfBytesRead" tables for every call it wants to capture.
  • No NT surface. Nt*/Rtl*/Ldr*/NtUser* aren't part of win32metadata. sigmd pulls them from phnt.
  • Format. winmd is a CLR metadata blob, and consumers need a CLR reader. sigmd ships a flat rkyv archive and a small Rust crate to navigate it. The archive is roughly 2x the size of the winmd, in exchange for SAL and NT coverage.

How it works

   header.h
      │
      ├─► clang --target=i686-pc-windows-msvc   ──┐
      │   (force-include inject.h: SAL → annotate)│
      └─► clang --target=x86_64-pc-windows-msvc  ─┤
                                                  ▼
                                       per-arch dedup
                                       (high score wins,
                                        __OVERRIDE / ignore_*)
                                                  │
                                                  ▼
                                       rkyv ──► metadata.bin  (~45 MB)
                                            └─► metadata.json (~178 MB, opt-in --json)

Inject header

Every translation unit is compiled with -include assets/windows/include/inject.h, which redefines the SAL macros (_In_*, _Out_*, _Inout_*, _COM_Outptr_*, ...) to expand into __attribute__((annotate("__SAL:<counter>:<name>"))) markers. Without this, libclang strips SAL during parsing. The <counter> comes from __COUNTER__ and dodges clang's content-based dedup of AnnotateAttr across redeclarations (the "gethostname pathology" where two redeclarations share an arg name and get collapsed into one).
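
To make the marker format concrete, here is a minimal sketch of how a consumer of the annotated AST might split a `__SAL:<counter>:<name>` string back into its parts. The `SalMarker` type and `parse_sal_marker` function are hypothetical names chosen for illustration, not part of sigmd's API, and the sketch assumes the counter is a plain decimal integer.

```rust
/// Parsed form of an annotation string such as `__SAL:17:_In_`.
/// Hypothetical type, used only for this illustration.
#[derive(Debug, PartialEq)]
struct SalMarker<'a> {
    counter: u32,
    name: &'a str,
}

/// Split a marker into its counter and SAL name. Returns `None` for
/// annotations that do not follow the `__SAL:<counter>:<name>` shape.
fn parse_sal_marker(s: &str) -> Option<SalMarker<'_>> {
    let rest = s.strip_prefix("__SAL:")?;
    let (counter, name) = rest.split_once(':')?;
    Some(SalMarker {
        counter: counter.parse().ok()?,
        name,
    })
}

fn main() {
    let a = parse_sal_marker("__SAL:3:_In_").unwrap();
    let b = parse_sal_marker("__SAL:4:_In_").unwrap();
    // Same SAL name, different counters: the two annotation strings differ,
    // so content-based dedup has nothing identical to merge.
    assert_eq!(a.name, b.name);
    assert_ne!(a.counter, b.counter);
    println!("{a:?} / {b:?}");
}
```

The counter carries no meaning of its own; it only keeps otherwise identical annotation strings distinct.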

Double parse

Each header is compiled once for x86 and once for x64 because sizeof(IDENT) resolves to a different Constant between the two. SIZE_T is 4 bytes on x86 and 8 on x64, and SAL expressions like _In_reads_bytes_(sizeof(SIZE_T)) need both answers.
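
As a rough illustration of why both answers are needed, the sketch below models a per-architecture sizeof table and resolves `sizeof(SIZE_T)` under each target. The `Arch`, `Sizes`, `sizeof_table`, and `resolve_sizeof` names are assumptions made for this example and do not reflect sigmd's internal types; the sizes listed are the standard Windows values for those typedefs.

```rust
use std::collections::HashMap;

/// Target architecture of a parse. Hypothetical type for illustration.
#[derive(Clone, Copy)]
enum Arch {
    X86,
    X64,
}

/// Size in bytes of one identifier under each target.
#[derive(Clone, Copy)]
struct Sizes {
    x86: u64,
    x64: u64,
}

/// A tiny stand-in for the table that the two parses would populate.
fn sizeof_table() -> HashMap<&'static str, Sizes> {
    HashMap::from([
        ("SIZE_T", Sizes { x86: 4, x64: 8 }),
        ("ULONG_PTR", Sizes { x86: 4, x64: 8 }),
        ("ULONG", Sizes { x86: 4, x64: 4 }),
    ])
}

/// Resolve `sizeof(ident)` to a constant for the given architecture.
fn resolve_sizeof(table: &HashMap<&'static str, Sizes>, arch: Arch, ident: &str) -> Option<u64> {
    let sizes = table.get(ident)?;
    Some(match arch {
        Arch::X86 => sizes.x86,
        Arch::X64 => sizes.x64,
    })
}

fn main() {
    let table = sizeof_table();
    // _In_reads_bytes_(sizeof(SIZE_T)) lowers to a different constant per target.
    assert_eq!(resolve_sizeof(&table, Arch::X86, "SIZE_T"), Some(4));
    assert_eq!(resolve_sizeof(&table, Arch::X64, "SIZE_T"), Some(8));
}
```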

Score-based dedup

The same function appears in many headers (a kernel32 forwarder vs the real ntdll implementation, A vs W variants in different SDK headers, ...). For each duplicate name, sigmd keeps the variant with the highest score:

score(function) =
    sum over parameters:
        +1                       (always)
        +10  if HAS_IN_ATTRIBUTE
        +10  if HAS_OUT_ATTRIBUTE
        +100 if the parameter declaration is not invalid
    +100_000   if the function declaration is not invalid
    +1_000_000 if the function carries the __OVERRIDE annotation
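
The formula above can be sketched in Rust as follows. The `Param` and `Function` structs are hypothetical stand-ins for whatever sigmd tracks internally; only the weights come from the formula itself.

```rust
/// Per-parameter facts the score depends on. Hypothetical type.
struct Param {
    has_in: bool,
    has_out: bool,
    is_invalid: bool,
}

/// Per-function facts the score depends on. Hypothetical type.
struct Function {
    params: Vec<Param>,
    is_invalid: bool,
    is_override: bool,
}

/// Score one candidate declaration; the highest score wins dedup.
fn score(f: &Function) -> u64 {
    let mut total = 0;
    for p in &f.params {
        total += 1; // every parameter counts
        if p.has_in {
            total += 10;
        }
        if p.has_out {
            total += 10;
        }
        if !p.is_invalid {
            total += 100;
        }
    }
    if !f.is_invalid {
        total += 100_000;
    }
    if f.is_override {
        total += 1_000_000;
    }
    total
}

fn main() {
    let param = |has_in| Param { has_in, has_out: false, is_invalid: false };
    // A well-annotated SDK declaration with two _In_ parameters.
    let sdk = Function { params: vec![param(true), param(true)], is_invalid: false, is_override: false };
    // A hand-written override with no annotations at all.
    let over = Function { params: vec![param(false), param(false)], is_invalid: false, is_override: true };
    assert_eq!(score(&sdk), 100_222);
    assert_eq!(score(&over), 1_100_202);
    assert!(score(&over) > score(&sdk));
}
```

The weights are spaced so that each tier dominates the ones below it for any realistic parameter count: a valid declaration beats an invalid one regardless of annotations, and an override beats everything.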

Overrides

Some functions have broken or missing SAL: gethostname ships with a SAL annotation that confuses every analyzer that reads it, and Process32First has none at all. assets/windows/src/override/*.c provides hand-written replacements tagged with __OVERRIDE, so the +1,000,000 term guarantees they win cross-TU arbitration over whatever the SDK shipped.

Undocumented APIs

APIs like CreateProcessInternalA/W aren't in the public SDK at all. Hand-written declarations for them live in assets/windows/src/undoc/*.c and feed the same parser.

Buffer model

Each buffer descriptor carries a direction (Input/Output), a phase (Pre = length known from input args before the call, Post = length known from output args after the call), and a length expression. The expression IR supports constants, parameter references, the function's return value, the arithmetic operators +, -, *, and /, and pointer dereference. sizeof(IDENT) is resolved at build time using the per-architecture sizeof table, so the wire format never needs a SizeOf operator.
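
A minimal sketch of such an expression IR and its evaluator is shown below, using ReadFile as the worked case. The `Expr` enum, the `Call` struct, and `eval` are illustrative assumptions rather than the crate's actual types, and memory is modeled as a simple address-to-value map where a real tracer would read the target process.

```rust
use std::collections::HashMap;

/// Length expression IR. Hypothetical shape mirroring the operators listed above.
#[allow(dead_code)]
#[derive(Debug)]
enum Expr {
    Const(u64),
    Param(usize), // zero-based parameter index
    ReturnValue,
    Deref(Box<Expr>),
    Add(Box<Expr>, Box<Expr>),
    Sub(Box<Expr>, Box<Expr>),
    Mul(Box<Expr>, Box<Expr>),
    Div(Box<Expr>, Box<Expr>),
}

/// Everything the evaluator may consult for one traced call.
struct Call<'a> {
    args: &'a [u64],
    ret: u64,
    /// Stand-in for target memory: address -> value stored there.
    memory: &'a HashMap<u64, u64>,
}

/// Evaluate an expression; `None` on a bad index, unmapped address,
/// overflow, underflow, or division by zero.
fn eval(e: &Expr, call: &Call<'_>) -> Option<u64> {
    match e {
        Expr::Const(c) => Some(*c),
        Expr::Param(i) => call.args.get(*i).copied(),
        Expr::ReturnValue => Some(call.ret),
        Expr::Deref(inner) => call.memory.get(&eval(inner, call)?).copied(),
        Expr::Add(a, b) => eval(a, call)?.checked_add(eval(b, call)?),
        Expr::Sub(a, b) => eval(a, call)?.checked_sub(eval(b, call)?),
        Expr::Mul(a, b) => eval(a, call)?.checked_mul(eval(b, call)?),
        Expr::Div(a, b) => eval(a, call)?.checked_div(eval(b, call)?),
    }
}

fn main() {
    // ReadFile(hFile, lpBuffer, nNumberOfBytesToRead, lpNumberOfBytesRead, lpOverlapped)
    let memory = HashMap::from([(0x2000, 512)]); // *lpNumberOfBytesRead == 512
    let args = [0x10, 0x1000, 4096, 0x2000, 0];
    let call = Call { args: &args, ret: 1, memory: &memory };

    let pre_len = Expr::Param(2); // capacity, known before the call
    let post_len = Expr::Deref(Box::new(Expr::Param(3))); // bytes written, known after
    assert_eq!(eval(&pre_len, &call), Some(4096));
    assert_eq!(eval(&post_len, &call), Some(512));
}
```

Returning `Option` rather than panicking keeps a tracer alive when a length expression cannot be evaluated, for example when an output pointer is NULL.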

Usage

Prerequisites: Rust 1.95+, libclang 18 or newer, and the phnt submodule:

git submodule update --init

Fetch the Windows SDK (one-time, cross-platform via xwin):

./scripts/fetch-winsdk.sh --output winsdk

Build the metadata bundle:

cargo run --release -- --config sigmd.yaml build --output metadata.bin

Pass --json metadata.json to also emit a human-readable mirror. The JSON output uses the serde feature, which is on by default through the cli feature.

As a library. The crate doubles as a zero-copy reader for the archive. Library consumers should disable default features to avoid the CLI dependency tree (clang-sys, clap, rayon, indicatif, ...):

[dependencies]
sigmd = { version = "0.1", default-features = false }

use std::fs;
use sigmd::Database;

let bytes = fs::read("metadata.bin")?;
let db = Database::from_bytes(&bytes)?;

// Look up a function in the x64 metadata.
let func = db.x64().function("NtCreateFile").expect("NtCreateFile");
for param in func.parameters() {
    println!(
        "  {:?}: {} (indir = {})",
        param.name(),
        param.ty().name(),
        param.ty().indirections(),
    );
}

// Walk output buffers. `length()` is the expression to evaluate
// against the call's arguments at trace time.
for buf in func.output_buffers() {
    println!("  out param {} len = {:?}", buf.parameter(), buf.length());
}

Configuration

# Glob patterns for SDK headers to be included in the database.
sdk:
  - winsdk/sdk/include/um/**/*.h
  - winsdk/sdk/include/shared/**/*.h
  - assets/windows/src/undoc/**/*.c
  - assets/windows/src/override/**/*.c
  - ...

# SDK header directories used for resolving includes.
# Passed to clang as `-isystem`.
include:
  - winsdk/sdk/include/um
  - assets/windows/include
  - assets/windows/include/phnt
  - ...

# Force-included headers.
# Passed to clang as `-include`.
inject:
  - assets/windows/include/inject.h

database:
  # Functions with these names will not be included in the database.
  # Leading OR trailing wildcards are supported.
  # For example, `*CreateFile` or `CreateFile*`.
  ignore_functions:
    # `RtlFreeAnsiString` has the same entry point as `RtlFreeUnicodeString`.
    # Let's prefer `RtlFreeUnicodeString`, since the latter is more commonly
    # used.
    - RtlFreeAnsiString

  # Types with these names will not be included in the database.
  # Leading OR trailing wildcards are supported.
  ignore_interfaces: []

# Custom type tags. See `TypeKind::Custom(u8)`.
custom_types:
  - id: 1
    name: ANSI_STRING
    matches: [_ANSI_STRING, _STRING, _LSA_STRING]

The id in custom_types is part of the on-disk wire format. Never reuse a retired id.
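
The wildcard rule for ignore_functions and ignore_interfaces (a leading or a trailing `*`) can be sketched as below. `matches_ignore` is a hypothetical helper written for illustration, not the crate's implementation; patterns carrying a wildcard on both ends are outside the documented rule and are not given special treatment here.

```rust
/// Match a name against one ignore pattern.
/// `*Foo` matches names ending in `Foo`, `Foo*` matches names starting
/// with `Foo`, and anything else must match exactly.
fn matches_ignore(pattern: &str, name: &str) -> bool {
    if let Some(suffix) = pattern.strip_prefix('*') {
        name.ends_with(suffix)
    } else if let Some(prefix) = pattern.strip_suffix('*') {
        name.starts_with(prefix)
    } else {
        pattern == name
    }
}

fn main() {
    assert!(matches_ignore("*CreateFile", "NtCreateFile"));
    assert!(matches_ignore("CreateFile*", "CreateFileW"));
    assert!(matches_ignore("RtlFreeAnsiString", "RtlFreeAnsiString"));
    assert!(!matches_ignore("CreateFile*", "NtCreateFile"));
}
```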

TODO

  • WDK headers (kernel-mode drivers, IRPs, IO_STACK_LOCATION, ...).
  • Linux / macOS SDKs (libc, syscalls, kAPI surface).
  • Type information: structs, unions, enums. sigmd currently records only function and method signatures plus per-call buffer descriptors. Layouts of the underlying types are not exported.
  • Stronger cross-TU dedup signals beyond is_invalid_declaration and __OVERRIDE.

License

MIT.