apple-codesign 0.22.0

// This Source Code Form is subject to the terms of the Mozilla Public
// License, v. 2.0. If a copy of the MPL was not distributed with this
// file, You can obtain one at https://mozilla.org/MPL/2.0/.

/*! Apple code signing technical specifications

This document outlines how Apple code signing is implemented at a technical
level.

# High Level Overview

Mach-O binaries embed an optional binary blob containing code signing
metadata. This binary blob contains content digests of various aspects
of the binary (such as the executable code) as well as an optional
cryptographic signature which effectively attests to the digested
content of the binary.

At run-time, stored digests are used to help ensure file integrity.

The cryptographic signature is used to verify the digests haven't
been tampered with as well as to validate trust with the entity that
produced that signature.

See
<https://developer.apple.com/library/archive/technotes/tn2206/_index.html#//apple_ref/doc/uid/DTS40007919>
for an additional overview of how code signing works on Apple platforms.

# The Important Data Structures

Mach-O is the executable binary format used by Apple platforms. A
Mach-O binary contains (among other things), a series of named *segments*
holding arbitrary data and *load commands* instructing the loader how
to load/execute the binary.

Code signing data is embedded within the `__LINKEDIT` segment in a Mach-O
binary. An `LC_CODE_SIGNATURE` load command identifies the offsets of
code signing data within `__LINKEDIT`.

The code signing data within a `__LINKEDIT` segment is itself a collection
of sub-records. A *SuperBlob* header defines the signing data format, the
length of data to follow, and the number of sub-sections, or *Blob* within.
Each *Blob* occupies a defined *slot*. *Slots* are effectively well-known
pieces of signing data. These include a *Code Directory*, *Entitlements*,
and a *Signature*, among others. See the [crate::CodeSigningSlot]
enumeration for the known defined slots.

Each *Blob* contains its own header magic effectively identifying the
content type within and how bytes should be interpreted. The magic
values are independent of the *slot* type. However, there appears to be
a relationship between the two. For example, the code directory slot
will have header magic identifying the payload as a code directory structure.

The *Code Directory* blob/slot defines information about the binary
being signed. There are many fields to this data structure. But the most
important ones to understand are the hashes / content digests. The *Code
Directory* contains digests (e.g. SHA-256) of various content in the binary,
such as Mach-O segment data (i.e. the executable code) and other blobs/slots.

The *Entitlements* blob/slot contains a *plist*.

Additional file-based resources can also be signed. These are referred to as
*Code Resources*. *Code Resources* are captured in a
`_CodeSignature/CodeResources` XML plist file in the bundle and the digest
of this file is captured by the *Code Directory*. There is a defined
`RESOURCEDIR` slot to hold its digest. However, there is no explicit
magic constant for resources, implying that this data can only be provided
externally and not embedded within the *SuperBlob*.

The *Signature* blob/slot contains a Cryptographic Message Syntax (CMS)
RFC 5652 defined `SignedData` BER encoded ASN.1 data structure. CMS is
a specification for cryptographically signing arbitrary content. The
`SignedData` structure contains an additional set of *signed attributes*
(think of it as arbitrary extra content to sign), a cryptographic signature
of the signed data, and likely the X.509 certificate of the signer and its
chain of certificate signers.

# How Signing Works

Code signing logically consists of the following steps:

1. Collecting content that needs to be signed/attested/trusted.
2. Computing content digests.
3. Cryptographically signing a message derived from the content digests.
4. Adding signature data to Mach-O binary.

## Collecting Content

Embedded code signatures support signing a myriad of data formats.
These include but aren't limited to:

* The Mach-O data outside the signature data in the `__LINKEDIT` segment.
* Requested entitlements for the binary.
* A code requirement statement / expression.
* Resource files.

If your binary is already part of a *bundle*, content collection can
occur automatically using heuristics. e.g. the `Contents/Resources`
directory contains additional files whose content should be signed.

## Computing Content Digests

Once content has been assembled, a series of digests are computed.

For the code digests, the Mach-O segments are iterated. The raw segment
data is chunked into *pages* and each hashed separately. This is to allow
code data to be lazily hashed as a page is loaded into the kernel.
(Otherwise you would have to hash often megabytes on process start, which
would add overhead.)

Code hashes are a bit nuanced. A hash is emitted at segment boundaries. i.e.
hashes don't span across multiple segments. The `__PAGEZERO` segment is
not hashed. The `__LINKEDIT` segment is hashed, but only up to the start
offset of the embedded signature data, if present.

Other content (such as the entitlements, code requirement statement, and
resource files) are serialized to *Blob* data. The mechanism for this
varies by type. e.g. the entitlements plist is embedded as UTF-8
data and the code requirement statement is serialized into an expression
tree. The resulting *Blob* is then digested.

The content digests are then assembled into a *Code Directory* data
structure. Digests of code data are referred to to *code slots* and
digests of other entitles (namely *Blob* data) occupy *special slots*.
The *Code Directory* also contains important other information, such
as describing the hash/digest mechanism used, the page size for code
hashing, and executable limits for the binary.

The content of the *Code Directory* serialized to a *Blob* is then itself
digested. This value is known as the *code directory hash*.

## Cryptographic Signing

A cryptographic signature is produced using the Cryptographic Message
Syntax (CMS) signing mechanism.

From a high level, CMS takes as inputs:

* Optional content to sign.
* Optional set of additional attributes (effectively key-value data) to sign.
* A signing key.
* Information about the signing key (including its CA chain).

From these, CMS will produce a BER encoded ASN.1 blob containing the
cryptographic signature and sufficient metadata to verify it (such
as the signed attributes and information about the signing certificate).

In CMS speak, the *encapsulated content* being signed is not defined.
However, the `message-digest` signed attribute is the digest of the
*Code Directory* *Blob* data. (This appears to be not compliant with RFC 5652,
which says *encapsulated content* should be present in the *SignedObject*
structure. Omitting the data is likely done to avoid redundant storage
of this data in the Mach-O binary and/or to simplify parsing, as *Code
Directory* data wouldn't be embedded within an ASN.1 stream.)

In addition, there is a signed attribute for the signing time. There is
also an XML plist defining an array of base64 encoded *Code Directory*
hashes. There are multiple *slots* in a *SuperBlob* for code directories
and the array in the signed XML plist appears to allow hashes of all of
them to be recorded.

(TODO it isn't clear what the signed content is when there are multiple
*Code Directory* slots in use. Presumably `message-digest` is computed
over all of them.)

CMS will concatenate the *Code Directory* data with the DER serialized
ASN.1 structures defining the *signed attributes*. This becomes the
*plaintext* message to be signed.

This *plaintext* message is combined with a private key and cryptographically
signed (likely using RSA). This produces a *signature*.

CMS then serializes the *signature*, *signed attributes*, signer
certificate info, and other important metadata to a BER encoded ASN.1
data structure. This raw slice of bytes is referred to as the
*embedded signature*.

## Adding Signature Data to Mach-O Binary

The above steps have already materialized several *Blob* data
structures. The individual pieces like the entitlements and code requirement
*Blob* were materialized in order to compute their hashes for the *Code
Directory* data structure. And the *Code Directory* *Blob* was constructed
so it could be signed by CMS.

The *embedded signature* data produced by CMS is assembled into a *Blob*
structure. At this point, we have all the *Blob* ready.

All the *Blobs* are assembled together into a *SuperBlob*. The
*SuperBlob* is then written to the `__LINKEDIT` segment of the
Mach-O binary. An appropriate `LC_CODE_SIGNATURE` load command is
also written to the Mach-O binary to instruct where the *SuperBlob*
data resides.

The `__LINKEDIT` segment is the last segment in the Mach-O binary and
the *SuperBlob* often occupies the final bytes of the `__LINKEDIT`
segment. So in many cases adding code signature data to a Mach-O
requires an optional truncation to remove the existing signature then
file appends for the `__LINKEDIT` data.

However, insertion or removal of `LC_CODE_SIGNATURE` will require
rewriting the entire file and adjusting offsets in various Mach-O
data structures accordingly. In many cases, an existing code signature
can be replaced by truncating the `__LINKEDIT` section, writing the
replacement data, and updating sizes/offsets in-place in the segments
index and `LC_CODE_SIGNATURE` load command.

Note that there is a chicken-and-egg problem related to writing the
Mach-O binary and computing the digests of that binary for the *Code
Directory*! The *Code Directory* needs to compute a digest over the
content of the Mach-O file up until the signature data. But this needs
to be done before a CMS signature is produced, as we need to digest
the *Code Directory* for a CMS signed attribute. We also need to know
the size of the CMS signature, as it is part of the signature data
embedded in the Mach-O binary and its size needs to be recorded in
the `LC_CODE_SIGNATURE` load command and segment definitions, which
are hashed by the *Code Directory*. This is a circular dependency. A
trick to working around it is to pad the Mach-O signature data with
extra NULLs and record this extra long value in `LC_CODE_SIGNATURE`
before code digests are computed. The *SuperBlob* parser appears to
be lenient about this solution. Further note that calculating the
exact final length before CMS signature generation may be impossible
due to the CMS signature being non-deterministic (due to the use of
signing times and timestamp servers tokens, which could be variable
length).

# How Bundle Signing Works

Signing bundles (e.g. `.app`, `.framework` directories) has its own
complexities beyond signing individual binaries.

Bundles consist of multiple files, perhaps multiple binaries. These files
can be classified as:

1. The main executable.
2. The `Info.plist` file.
3. Support/resources files.
4. Code signature files.

When signing bundles, the high-level process is the following:

1. Find and sign all nested binaries and bundles (bundles can contain
   other bundles) except the main binary and bundle.
2. Identify support/resources files and calculate their hashes, capturing
   this metadata in a `CodeResources` XML file.
3. Sign the main binary with an embedded reference to the digest of the
   `CodeResources` file.

# How Verification Works

What happens when a binary is loaded? Read on to find out.

Please note that we don't know for sure what all occurs when a binary is
loaded because the code is proprietary. We do have some high-level
documentation from Apple and we can empirically observe what occurs.
We can also infer what is happening based on the signing technical
implementation, assuming Apple follows correct practices. But some content
of this section is speculation and is merely what *likely* occurs.

When a Mach-O binary is loaded, the loader looks for an
`LC_CODE_SIGNATURE` load command. If not found, there is no embedded
signature data and running the binary may be rejected.

The associated code signature data is located in the `__LINKEDIT` section
and parsed so *Blob* are discovered. How deeply it is parsed at this stage,
we don't know.

Data for the *Signature* slot/blob is obtained. This is the CMS *SignedData*
structure (BER encoded ASN.1). This structure is decoded and the cryptographic
signature, signed attributes, and X.509 certificates involved in the signing
are obtained from within.

We do not know the full extent of trust verification that occurs. But
Apple will examine details of the signing certificate and ensure its use
is allowed. For example, if the signing certificate wasn't issued/signed
by Apple or doesn't have the appropriate extensions present (such as bits
indicating the certificate is appropriate for code signing), it may refuse
to proceed. This trust validation likely occurs immediately after the
CMS data is parsed, as soon as the signing certificate information becomes
available for scrutiny.

The original *plaintext* message that was signed is assembled. This is
done by DER encoding the *signed attributes* from the CMS *SignedData*
structure.

This *plaintext* message, the signature of it, and the public key used
to produce the signature are all used to verify the cryptographic integrity
of the *signed attributes*. This effectively answers the question *did
something with possession of certificate X sign exactly the signed attributes
in this message.*

Successful signature verification ensures that the *signed attributes*
haven't been tampered with since they were signed.

The CMS data may also contain *unsigned attributes*. There may be
a *time stamp token* here containing a signature of the time when the
signed message was produced. This may be validated as well.

One of the signed attributes is `message-digest`. In this use of CMS,
`message-digest` is the digest of the *Code Directory* *Blob* data. This
digest is possibly verified: we don't know for sure. According to RFC 5652
it should be verified. However, it may not need to be because the digest
of the *Code Directory* data is stored elsewhere...

A signed attribute contains an XML plist containing an array of base64 encoded
hashes of *Code Directory* *blobs*. This plist is likely parsed and the hashes
within are compared to the hashes from the *Code Directory* blobs/slots from
the *SuperBlob* record. If the digests are identical, it means that the *Code
Directory* data structures in the Mach-O binary haven't been modified since the
signature was created.

The *Code Directory* data structures contain digests of code data and
other *Blob* data from the *SuperBlob*. Since the digest of the *Code Directory*
data was verified via CMS and a trust relationship was (presumably) established
with the signer of that CMS data, verification and trust is transitively applied
to the other *Blob* data and code data (this is effectively a Merkle Tree).
This means that we can digest other *Blob* entries and code data and compare to
the digests within the *Code Directory* structures. If the digests are identical,
content hasn't changed since the signature was made.

It is unclear in what order other *Blob* data is read. But presumably important
data like the embedded entitlements and code requirement statement are read very
early during binary loading so an appropriate trust policy can be applied to
the binary.
*/