sigmd 0.1.0

[![Crates.io](https://img.shields.io/crates/v/sigmd.svg)](https://crates.io/crates/sigmd)
[![Downloads](https://img.shields.io/crates/d/sigmd.svg)](https://crates.io/crates/sigmd)
[![Docs](https://docs.rs/sigmd/badge.svg)](https://docs.rs/sigmd/latest/sigmd/)
[![License](https://img.shields.io/badge/license-MIT-blue.svg)](https://github.com/vmi-rs/sigmd/blob/main/LICENSE)

<!-- readme start -->
# Windows API Signature Metadata

`sigmd` parses Windows SDK and [phnt] headers via libclang, mines SAL
annotations to derive function/method signatures and per-call buffer
length expressions, and emits an architecture-keyed metadata bundle for
tracers, hookers, and instrumentation tooling. The on-disk format is a
single [rkyv] archive. There is no deserialization step at load time,
just `mmap` and go.

## Features

- **Win32 SDK and NT API coverage.** Functions exported from `kernel32`,
  `user32`, `advapi32`, ..., plus the NT-private surface (`Nt*`,
  `Rtl*`, `Ldr*`, `NtUser*`, `NtGdi*`, ...) sourced from [phnt].

- **COM interfaces and methods.** From `IUnknown` and `IDispatch` all the
  way through DXGI, D3D, Shell, ..., each with its UUID, base interface,
  and full method signatures.

- **SAL-derived buffer length expressions.** `_In_reads_bytes_(cb)`,
  `_Out_writes_bytes_to_(max, *written)`, and friends are lowered into
  a small expression IR (constants, parameter refs, return value,
  `+`/`-`/`*`/`/`, pointer dereference) so a tracer knows how many
  bytes to copy out of each pointer at call time.

- **Per-architecture sizeof resolution.** Each header is parsed twice
  with `--target=i686-pc-windows-msvc` and
  `--target=x86_64-pc-windows-msvc`, because `SIZE_T`, `ULONG_PTR`,
  `LPARAM`, and others differ across architectures, and `sizeof(IDENT)`
  inside SAL only resolves correctly under the right target.

- **Near-instant load.** Output is a ~45 MB rkyv archive read zero-copy
  via `mmap`. An optional ~178 MB JSON mirror is available for human
  inspection (`--json`).

- **Multiplatform.** Headers are fetched cross-platform via [xwin]. The
  build itself runs anywhere libclang 18 or newer does (Linux, macOS,
  Windows).

- **Configurable.** A YAML config selects header globs, include paths,
  ignore lists, and custom-type mappings.

## Comparison with `win32metadata`

Microsoft's [win32metadata] is the canonical machine-readable Win32
surface, distributed as `Windows.Win32.winmd`. It's a great upgrade over
hand-maintained bindings, but for instrumentation work it has three
gaps:

- **No SAL.** `winmd` carries types but not buffer length annotations,
  so a tracer still has to hand-author "for `ReadFile`, the byte count
  is parameter 2, the buffer is parameter 1, the post-call length is
  `*lpNumberOfBytesRead`" tables for every call it wants to capture.
- **No NT surface.** `Nt*`/`Rtl*`/`Ldr*`/`NtUser*` aren't part of
  `win32metadata`. `sigmd` pulls them from [phnt].
- **Format.** `winmd` is a CLR metadata blob, and consumers need a CLR
  reader. `sigmd` ships a flat rkyv archive and a small Rust crate to
  navigate it. The archive is roughly 2x the size of the `winmd`, in
  exchange for SAL and NT.

## How it works

```
   header.h
      ├─► clang --target=i686-pc-windows-msvc   ──┐
      │   (force-include inject.h: SAL → annotate)│
      └─► clang --target=x86_64-pc-windows-msvc  ─┤
                                       per-arch dedup
                                       (high score wins,
                                        __OVERRIDE / ignore_*)
                                       rkyv ──► metadata.bin  (~45 MB)
                                            └─► metadata.json (~178 MB, opt-in --json)
```

### Inject header

Every translation unit is compiled with
`-include assets/windows/include/inject.h`, which redefines the SAL
macros (`_In_*`, `_Out_*`, `_Inout_*`, `_COM_Outptr_*`, ...) to expand
into `__attribute__((annotate("__SAL:<counter>:<name>")))` markers.
Without this, libclang strips SAL during parsing. The `<counter>` comes
from `__COUNTER__` and dodges clang's content-based dedup of
`AnnotateAttr` across redeclarations (the "`gethostname` pathology"
where two redeclarations share an arg name and get collapsed into one).

### Double parse

Each header is compiled once for x86 and once for x64
because `sizeof(IDENT)` resolves to a different `Constant` between the
two. `SIZE_T` is 4 bytes on x86 and 8 on x64, and SAL expressions like
`_In_reads_bytes_(sizeof(SIZE_T))` need both answers.
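
The effect of the double parse can be sketched as a lookup keyed by
architecture. The table below is a hand-written sample covering a few
well-known types; in `sigmd` the sizes come from clang, and the names
`Arch`, `sizeof_ident`, and `lower_sizeof` are assumptions made for this
example.

```rust
/// Target architecture of a parse. Hypothetical type for this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Arch {
    X86,
    X64,
}

/// Hand-written sample of a per-architecture sizeof table.
/// A real table would be filled in from clang's view of each target.
pub fn sizeof_ident(arch: Arch, ident: &str) -> Option<u64> {
    let pointer_sized = match arch {
        Arch::X86 => 4,
        Arch::X64 => 8,
    };
    match ident {
        // Pointer-sized integer types differ between the two targets.
        "SIZE_T" | "ULONG_PTR" | "LPARAM" => Some(pointer_sized),
        // Fixed-width types are the same on both.
        "ULONG" | "DWORD" => Some(4),
        _ => None,
    }
}

/// Lowers the text `sizeof(IDENT)` to a constant for one architecture.
pub fn lower_sizeof(arch: Arch, expr: &str) -> Option<u64> {
    let ident = expr.trim().strip_prefix("sizeof(")?.strip_suffix(')')?;
    sizeof_ident(arch, ident.trim())
}
```

With this sample table, `lower_sizeof(Arch::X86, "sizeof(SIZE_T)")` yields
`Some(4)` and the same call with `Arch::X64` yields `Some(8)`, which is why
a single parse cannot serve both targets.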

### Score-based dedup

The same function appears in many headers
(kernel32 forwarder vs ntdll real, ANSI vs W variants in different SDK
headers, ...). For each duplicate name, `sigmd` keeps the variant with
the highest score:

```
score(function) =
    sum over parameters:
        +1                       (always)
        +10  if HAS_IN_ATTRIBUTE
        +10  if HAS_OUT_ATTRIBUTE
        +100 if the parameter declaration is not invalid
    +100_000   if the function declaration is not invalid
    +1_000_000 if the function carries the __OVERRIDE annotation
```
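
A direct transcription of the formula into Rust might look like the
following. The `Param` and `Function` structs and their field names are
stand-ins invented for this sketch; only the weights are taken from the
formula above.

```rust
/// Stand-in for a parsed parameter. Field names are hypothetical.
pub struct Param {
    pub has_in: bool,
    pub has_out: bool,
    pub is_invalid: bool,
}

/// Stand-in for a parsed function declaration.
pub struct Function {
    pub params: Vec<Param>,
    pub is_invalid: bool,
    pub is_override: bool,
}

/// Computes the dedup score using the weights listed above.
pub fn score(f: &Function) -> u64 {
    let mut total = 0u64;
    for p in &f.params {
        total += 1; // every parameter counts once
        if p.has_in {
            total += 10;
        }
        if p.has_out {
            total += 10;
        }
        if !p.is_invalid {
            total += 100;
        }
    }
    if !f.is_invalid {
        total += 100_000;
    }
    if f.is_override {
        total += 1_000_000;
    }
    total
}
```

The weights are spaced so that, for any realistic parameter count, a valid
function declaration outranks any amount of per-parameter detail, and
`__OVERRIDE` outranks everything else.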

### Overrides

Some functions have broken or missing SAL: `gethostname`
ships with a SAL annotation that confuses every analyzer that reads it,
and `Process32First` has none at all. `assets/windows/src/override/*.c`
provides hand-written replacements tagged with `__OVERRIDE`, so the
+1,000,000 term guarantees they win cross-TU arbitration over whatever
the SDK shipped.
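
The arbitration step can be sketched as "keep the best-scoring variant per
name". The `Variant` struct, the `dedup` function, and the tie-breaking rule
(first seen wins) are assumptions made for this example and are not taken
from the `sigmd` sources; scores are assumed to be precomputed.

```rust
use std::collections::HashMap;

/// One declaration of a function as seen in some translation unit.
/// Hypothetical type: `score` is assumed to be computed beforehand.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Variant {
    pub name: String,
    pub source: String,
    pub score: u64,
}

/// Keeps the highest-scoring variant for each function name.
/// On a tie the variant seen first is kept (an assumption of this sketch).
pub fn dedup(variants: Vec<Variant>) -> HashMap<String, Variant> {
    let mut best: HashMap<String, Variant> = HashMap::new();
    for v in variants {
        let replace = best
            .get(&v.name)
            .map_or(true, |current| v.score > current.score);
        if replace {
            best.insert(v.name.clone(), v);
        }
    }
    best
}
```

Because the override bonus is larger than any score an untagged
declaration can plausibly reach, an `__OVERRIDE` variant wins regardless of
the order in which translation units are processed.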

### Undocumented APIs

APIs like `CreateProcessInternalA/W` aren't in
the public SDK at all. Hand-written declarations live in
`assets/windows/src/undoc/*.c` and feed the same parser.

### Buffer model

Each buffer descriptor carries a `direction`
(`Input`/`Output`), a `phase` (`Pre` = length known from input args
before the call, `Post` = length known from output args after the call),
and a `length` expression. The expression IR supports constants,
parameter references, the function's return value, `+`/`-`/`*`/`/`, and
pointer dereference. `sizeof(IDENT)` is resolved at build time using the
per-architecture sizeof table, so the wire format never needs a `SizeOf`
operator.
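
To make the expression IR concrete, here is a minimal evaluator sketch over
the operators listed above. The `Expr` enum, `CallContext`, and `eval` are
illustrative names and do not mirror the crate's actual types. Pointer
dereference is delegated to a caller-supplied reader, since the width of
the read depends on the pointee type, which this sketch ignores.

```rust
/// Illustrative length-expression IR, not the crate's real type.
#[derive(Debug, Clone)]
pub enum Expr {
    Const(u64),
    Param(usize), // zero-based parameter index
    ReturnValue,
    Add(Box<Expr>, Box<Expr>),
    Sub(Box<Expr>, Box<Expr>),
    Mul(Box<Expr>, Box<Expr>),
    Div(Box<Expr>, Box<Expr>),
    Deref(Box<Expr>),
}

/// What a tracer knows about one call. Hypothetical type.
pub struct CallContext<'a> {
    pub args: &'a [u64],
    /// `None` before the call returns (the `Pre` phase).
    pub return_value: Option<u64>,
    /// Reads an integer from target memory. `None` on failure.
    pub read_u64: &'a dyn Fn(u64) -> Option<u64>,
}

/// Evaluates a length expression. Returns `None` on a missing input,
/// arithmetic overflow or underflow, division by zero, or failed read.
pub fn eval(expr: &Expr, ctx: &CallContext<'_>) -> Option<u64> {
    match expr {
        Expr::Const(v) => Some(*v),
        Expr::Param(i) => ctx.args.get(*i).copied(),
        Expr::ReturnValue => ctx.return_value,
        Expr::Add(a, b) => eval(a, ctx)?.checked_add(eval(b, ctx)?),
        Expr::Sub(a, b) => eval(a, ctx)?.checked_sub(eval(b, ctx)?),
        Expr::Mul(a, b) => eval(a, ctx)?.checked_mul(eval(b, ctx)?),
        Expr::Div(a, b) => eval(a, ctx)?.checked_div(eval(b, ctx)?),
        Expr::Deref(p) => (ctx.read_u64)(eval(p, ctx)?),
    }
}
```

For `ReadFile`, the `Pre` length of the buffer would be `Expr::Param(2)`
and the `Post` length `Expr::Deref(Box::new(Expr::Param(3)))`, that is,
`*lpNumberOfBytesRead`.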

## Usage

**Prerequisites:** Rust 1.95+, libclang 18 or newer, and the [phnt]
submodule:

```sh
git submodule update --init
```

**Fetch the Windows SDK** (one-time, cross-platform via [xwin]):

```sh
./scripts/fetch-winsdk.sh --output winsdk
```

**Build the metadata bundle:**

```sh
cargo run --release -- --config sigmd.yaml build --output metadata.bin
```

Pass `--json metadata.json` to also emit a human-readable mirror. The
JSON output uses the `serde` feature, which is on by default through
the `cli` feature.

**As a library.** The crate doubles as a zero-copy reader for the
archive. Library consumers should disable default features to avoid
the CLI dependency tree (clang-sys, clap, rayon, indicatif, ...):

```toml
[dependencies]
sigmd = { version = "0.1", default-features = false }
```

```rust,ignore
use std::fs;
use sigmd::Database;

let bytes = fs::read("metadata.bin")?;
let db = Database::from_bytes(&bytes)?;

// Look up a function in the x64 metadata.
let func = db.x64().function("NtCreateFile").expect("NtCreateFile");
for param in func.parameters() {
    println!(
        "  {:?}: {} (indir = {})",
        param.name(),
        param.ty().name(),
        param.ty().indirections(),
    );
}

// Walk output buffers. `length()` is the expression to evaluate
// against the call's arguments at trace time.
for buf in func.output_buffers() {
    println!("  out param {} len = {:?}", buf.parameter(), buf.length());
}
```

## Configuration

```yaml
# Glob patterns for SDK headers to be included in the database.
sdk:
  - winsdk/sdk/include/um/**/*.h
  - winsdk/sdk/include/shared/**/*.h
  - assets/windows/src/undoc/**/*.c
  - assets/windows/src/override/**/*.c
  - ...

# SDK header directories used for resolving includes.
# Passed to clang as `-isystem`.
include:
  - winsdk/sdk/include/um
  - assets/windows/include
  - assets/windows/include/phnt
  - ...

# Force-included headers.
# Passed to clang as `-include`.
inject:
  - assets/windows/include/inject.h

database:
  # Functions with these names will not be included in the database.
  # Leading OR trailing wildcards are supported.
  # For example, `*CreateFile` or `CreateFile*`.
  ignore_functions:
    # `RtlFreeAnsiString` has the same entry point as `RtlFreeUnicodeString`.
    # Prefer `RtlFreeUnicodeString`, which is more commonly used.
    - RtlFreeAnsiString

  # Interfaces with these names will not be included in the database.
  # Leading OR trailing wildcards are supported.
  ignore_interfaces: []

# Custom type tags. See `TypeKind::Custom(u8)`.
custom_types:
  - id: 1
    name: ANSI_STRING
    matches: [_ANSI_STRING, _STRING, _LSA_STRING]
```

The `id` in `custom_types` is part of the on-disk wire format. Never
reuse a retired id.
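
The ignore lists accept a leading or a trailing wildcard. One way such
matching could be implemented is sketched below. `matches_pattern` and
`is_ignored` are hypothetical helpers, and the choices to match
case-sensitively and to treat a pattern carrying both wildcards as
unsupported are assumptions of this example.

```rust
/// Matches `name` against a pattern with at most one wildcard,
/// either leading (`*CreateFile`) or trailing (`CreateFile*`).
/// A pattern with both is not given special treatment in this sketch.
pub fn matches_pattern(pattern: &str, name: &str) -> bool {
    if let Some(suffix) = pattern.strip_prefix('*') {
        name.ends_with(suffix)
    } else if let Some(prefix) = pattern.strip_suffix('*') {
        name.starts_with(prefix)
    } else {
        pattern == name
    }
}

/// Returns true if any pattern in the ignore list matches `name`.
pub fn is_ignored(ignore: &[&str], name: &str) -> bool {
    ignore.iter().any(|pattern| matches_pattern(pattern, name))
}
```

Under these rules `*CreateFile` matches `NtCreateFile` but not
`CreateFileW`, while `CreateFile*` does the opposite.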

## TODO

- **WDK headers** (kernel-mode drivers, IRPs, `IO_STACK_LOCATION`, ...).
- **Linux / macOS** SDKs (libc, syscalls, kAPI surface).
- **Type information**: structs, unions, enums. `sigmd` currently
  records only function and method signatures plus per-call buffer
  descriptors. Layouts of the underlying types are not exported.
- **Stronger cross-TU dedup signals** beyond `is_invalid_declaration`
  and `__OVERRIDE`.

## License

MIT.

<!-- readme end -->

[phnt]: https://github.com/winsiderss/phnt
[xwin]: https://github.com/Jake-Shadle/xwin
[rkyv]: https://github.com/rkyv/rkyv
[win32metadata]: https://github.com/microsoft/win32metadata