# Reading Binary Descriptions
**hhh** can parse a binary description file to create a binary file. Use the `-p` or `--parse` flag to signal that you want to parse a file and create a binary file. Input files are expected to be UTF-8 text files, but may also be encoded in UTF-16.
```
$ hhh -p description.hhh -o file.bin
```
## Basics
At its most basic, parsing operates as follows.
- If an offset is found (signaled by a number followed by a colon `:`), then this becomes the new offset for the following bytes
- Groups of bytes are processed as they are encountered, separated either by spaces or by commas
- Groups are assumed to be in big endian order, unless little endian is specified
- Comments can be C (`/*..*/`) or C++ (`//`), and are ignored
- Bytes not explicitly specified are assumed to be 0x00
Offsets are typically, but do not have to be, at the start of a line. Whitespace is not significant, except that a `//` comment runs to the end of its line.
Parsing the following creates a ten-byte file in which the first byte is 0x40, the last byte is 0x41, and all other bytes are zero.
```hhh -p
40 9:41
// Result
// 00000000: 40 00 00 00 00 00 00 00 00 41
```
This is the same as the following.
```hhh -p
0000: 40 00 00 00 00 00 00 00 00 41
// Result
// 00000000: 40 00 00 00 00 00 00 00 00 41
```
The same output can also be produced by the following, which specifies the offsets out of order.
```hhh -p
9:41 0:40
// Result
// 00000000: 40 00 00 00 00 00 00 00 00 41
```
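The rules above amount to sparse writes into a zero-filled buffer. As a rough illustration only (this is not how **hhh** is implemented, and `assemble` is a hypothetical helper), the following Python sketch applies `(offset, bytes)` pairs in order and pads any gaps with `0x00`.

```python
def assemble(writes):
    """Apply (offset, data) pairs in order; unwritten bytes stay 0x00."""
    image = bytearray()
    for offset, data in writes:
        end = offset + len(data)
        if end > len(image):
            # Grow the image with zero bytes so the write fits.
            image.extend(bytes(end - len(image)))
        image[offset:end] = data
    return bytes(image)


# Models the description `40 9:41`.
in_order = assemble([(0, b"\x40"), (9, b"\x41")])
# Models `9:41 0:40`; the writes do not overlap, so their order is irrelevant.
out_of_order = assemble([(9, b"\x41"), (0, b"\x40")])
```

Both calls return the same ten bytes. When writes do overlap, the later one wins.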
## Overwriting
Because you can specify offsets out of order, you can overwrite prior bytes in the stream. For example, you can start with a simple hex dump of a file, and then add "patches" to the end of the file that overwrite selected prior bytes.
```hhh -p
// Final byte sequence is 01 02 03 04 05
0000: fe ff 21
0002: 03 04 05 // Overwrites 21
0000: 01 02 // Overwrites fe ff
// Result
// 00000000: 01 02 03 04 05
```
## Underscores
Underscores may appear in a number, and are ignored.
```hhh -p
0000_0000: 48_ae_21_45
// Result (big endian)
// 00000000: 48 ae 21 45
```
```hhh -p --little-endian
0000_0000: 48_ae_21_45
// Result (little endian)
// 00000000: 45 21 ae 48
```
## Strings
Double-quoted strings can appear in the file. These are interpreted as byte sequences in UTF-8 and permit the usual escapes, detailed below in [String Escapes](#string-escapes).
```hhh -p
0: "你好世界!"
10: "हैलो वर्ल्ड!"
20: "Hola Mundo!"
30: "Hello world!"
// Result
// 00000000: e4 bd a0 e5 a5 bd e4 b8 96 e7 95 8c 21 00 00 00
// 00000010: e0 a4 b9 e0 a5 88 e0 a4 b2 e0 a5 8b 20 e0 a4 b5
// 00000020: 48 6f 6c 61 20 4d 75 6e 64 6f 21 8d e0 a4 a1 21
// 00000030: 48 65 6c 6c 6f 20 77 6f 72 6c 64 21
```
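In the result above, the row at `00000020` ends with `8d e0 a4 a1 21`. The Hindi string occupies 32 bytes in UTF-8 and so runs through offset `2f`; the string placed at offset `20` overwrites only the first eleven of the bytes it shares with it. The following Python sketch (an illustration, not part of **hhh**) reproduces that layout.

```python
strings = {
    0x00: "你好世界!",
    0x10: "हैलो वर्ल्ड!",
    0x20: "Hola Mundo!",
    0x30: "Hello world!",
}

image = bytearray()
for offset, text in strings.items():
    data = text.encode("utf-8")
    end = offset + len(data)
    if end > len(image):
        image.extend(bytes(end - len(image)))
    image[offset:end] = data  # later strings overwrite earlier bytes

hindi_length = len(strings[0x10].encode("utf-8"))
row_20 = image[0x20:0x30].hex(" ")
```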
## String Escapes
**hhh** allows a superset of the escapes used elsewhere, including those from C, C++, and Python.
|Escape|Meaning                          |
|------|---------------------------------|
|`\\`  |Backslash (U+005C)               |
|`\'`  |Apostrophe (U+0027)              |
|`\"`  |Quotation mark (U+0022)          |
|`\a`  |Bell (U+0007)                    |
|`\b`  |Backspace (U+0008)               |
|`\e`  |Escape (U+001B)                  |
|`\f`  |Form feed (U+000C)               |
|`\n`  |Newline (U+000A)                 |
|`\r`  |Carriage return (U+000D)         |
|`\t`  |Horizontal tab (U+0009)          |
|`\v`  |Vertical tab (U+000B)            |
|`\x`  |Byte escape <sup>1</sup>         |
|`\u`  |Unicode escape <sup>2</sup>      |
|`\N`  |Named Unicode escape <sup>3</sup>|
|`\?`  |Question mark (U+003F)           |
<sup>1</sup> Byte escapes require exactly two hexadecimal digits. For example, a newline (U+000A) can also be written as `\x0a` or `\x0A`.
<sup>2</sup> Unicode escapes have one or more hex digits inside curly braces. For example, a newline (U+000A) can be written in several ways, including `\u{a}` and `\u{00000A}`.
<sup>3</sup> Named Unicode escapes look up the Unicode character name given inside curly braces and generate the associated character. For example, a newline (U+000A) can be written as `\N{new line}` or `\N{NEW LINE}`. The list of all Unicode character names is part of [the Unicode standard](https://home.unicode.org/).
```hhh -p
"\a\N{new line}\N{asterisk}\x40\x41\u{214e}"
// Result
// 00000000: 07 0a 2a 40 41 e2 85 8e
```
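For comparison, Python's standard `unicodedata` module can resolve a character name, and Python accepts several of the same escapes. The sketch below is only an analogy (Python's escape syntax differs from **hhh** in places, for example `\u214e` rather than `\u{214e}`), but it arrives at the same bytes as the example above.

```python
import unicodedata

pieces = [
    b"\x07",                                         # \a, bell
    b"\x0a",                                         # newline
    unicodedata.lookup("ASTERISK").encode("utf-8"),  # \N{asterisk}
    b"\x40\x41",                                     # two byte escapes
    "\u214e".encode("utf-8"),                        # U+214E encoded as UTF-8
]
encoded = b"".join(pieces)
```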
## Width
By default the width of a simple hex number is the ceiling of the number of digits divided by two. That is, `0` and `00` are encoded as single bytes, `000` and `0000` are encoded as two bytes, etc. This is intended to match the user's intuition and correctly decode groups produced by a hexdump.
If radix prefixes are used, the rule changes. Now the width of a number is taken to be the smallest number of bytes that can hold the number. So `0x0` is encoded as a single byte, as is `0x0000000000`, but `0x10_00_00` is encoded as *three* bytes, because this is the smallest number of bytes that can hold the value.
A value may be followed by a slash `/` and then a decimal number giving the number of bytes for the value. This is similar to [Ghidra](https://ghidra-sre.org/)'s P-Code, except that P-Code uses a colon; in **hhh** the colon already marks an offset.
This specifies a four-byte (32-bit) zero.
```hhh -p
0/4
// Result
// 00000000: 00 00 00 00
```
Because of truncation, this also specifies a 32-bit zero for big endian byte order.
```hhh -p
fffe00000000/4
// Result
// 00000000: 00 00 00 00
```
Consider `24/2`. By default this specifies the sequence `00 24` (big endian). If little endian is selected, then this specifies the sequence `24 00`.
An offset is not a byte sequence, so you can't use a width specifier with it.
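The width rules can be summarized in a few lines of Python. This is a sketch under stated assumptions, not the **hhh** implementation: `unprefixed_width`, `prefixed_width`, and `to_bytes` are hypothetical helpers, and truncation is modeled as keeping the least significant bytes, which matches the examples above.

```python
def unprefixed_width(digits: str) -> int:
    """Width of a plain hex group: ceil(number of digits / 2)."""
    count = len(digits.replace("_", ""))
    return (count + 1) // 2


def prefixed_width(value: int) -> int:
    """Smallest number of bytes that can hold a non-negative value."""
    return max(1, (value.bit_length() + 7) // 8)


def to_bytes(value: int, width: int, little_endian: bool = False) -> bytes:
    """Keep the low-order `width` bytes of the value and encode them."""
    truncated = value & ((1 << (8 * width)) - 1)
    return truncated.to_bytes(width, "little" if little_endian else "big")
```

For instance, `to_bytes(0xfffe00000000, 4)` keeps only the low four bytes and so yields four zero bytes, as in the truncation example, while `to_bytes(0x24, 2, little_endian=True)` yields `24 00`.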
## Radix Prefixes
Radix prefixes specify the radix (hexadecimal, decimal, octal, binary) of the following number, and are enabled by the `--radix-prefixes` or `-r` switch or by the `[[prefix]]` directive.
|Prefix|Radix      |Example          |
|------|-----------|-----------------|
|`0b`  |Binary     |`0b1001100` = 76 |
|`0o`  |Octal      |`0o114` = 76     |
|none  |Decimal    |76 = 76          |
|`0x`  |Hexadecimal|`0x4c` = 76      |
Underscores can appear in these numbers to break up groups, such as `65_535`.
If prefixes are enabled, then values are assumed to be in decimal unless a radix prefix is explicitly given. The following encodes the string "Hello world!".
```hhh -p
[[prefix]] 0x00000000: 0x48 0x65 "l" "l" "o" 32 0x77/1 0x6f 0b1110010 0x6c 0x64 33/1
// Result
// 00000000: 48 65 6c 6c 6f 20 77 6f 72 6c 64 21
```
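Python's built-in `int(text, 0)` happens to recognize the same `0b`, `0o`, and `0x` prefixes and also accepts underscores between digits, so it is a convenient way to check values by hand. This is only an analogy: Python rejects decimal text with leading zeros when the base is 0, and nothing here implies that **hhh** behaves the same way.

```python
# Each spelling from the radix table should evaluate to 76.
values = [int(text, 0) for text in ("0b1001100", "0o114", "76", "0x4c")]

# Underscores between digits are ignored.
grouped = int("65_535", 0)
```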
When using radix prefixes, numbers are allowed to be negative, and are interpreted as the two's complement byte sequence using the same width rules as elsewhere.
```hhh -p
[[prefix]] 0x00000000: -16/8
// Result
// 00000000: ff ff ff ff ff ff ff f0
```
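The two's complement encoding can be checked with Python's `int.to_bytes`, used here purely as an illustration of the arithmetic.

```python
# -16 as a signed 64-bit value, big endian and little endian.
big = (-16).to_bytes(8, "big", signed=True)
little = (-16).to_bytes(8, "little", signed=True)

# The same bytes result from masking the value to 64 bits.
masked = ((-16) & (2**64 - 1)).to_bytes(8, "big")
```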
Numbers are arbitrary-precision signed values, thanks to the [`num`] crate.
## Byte Order (Endianness)
Endianness can be confusing. There are two forms.
- In big endian form, the most significant byte (the "big end") is written first, so the number `213e` is written as the byte sequence `21 3e`.
- In little endian form, the least significant byte (the "little end") is written first, so the number `213e` is written as the byte sequence `3e 21`.
The ordering set by the directives `[[little_endian]]` and `[[big_endian]]` as well as by the command line switch *only* changes how unprefixed byte groups are read.
### Without Prefixes
If radix prefixes are not used, then byte groups are interpreted according to the current default ordering, set by directive or command line switch. This effectively "undoes" the same setting on the hex output.
For example, if group is set to four and order to little endian, then the byte sequence `ff 01 02 03` is written as `030201ff` when creating a hex dump. In order to read this back correctly, we set the order to little endian and interpret `030201ff` as the byte sequence `ff 01 02 03`.
That is, the default ordering changes how byte groups are read.
Suppose we use `hhh --little-endian -b2 --no-ascii` with the following input.
```hhh --little-endian -b2 --no-ascii
HI!
// Result
// 00000000: 4948 0a21
```
Then we can recover the original file with `hhh -p --little-endian`.
```hhh -p --little-endian
00000000: 4948 0a21
// Result
// 00000000: 48 49 21 0a
```
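The round trip can be sketched in Python. The helper names `dump_groups` and `parse_groups` are hypothetical, and they model only the grouping and byte order, not the rest of the **hhh** output format.

```python
def dump_groups(data, group, little_endian):
    """Format data as hex groups, reversing each group for little endian."""
    groups = []
    for start in range(0, len(data), group):
        chunk = data[start:start + group]
        if little_endian:
            chunk = chunk[::-1]
        groups.append(chunk.hex())
    return groups


def parse_groups(groups, little_endian):
    """Read hex groups back, undoing the reversal applied by the dump."""
    out = bytearray()
    for text in groups:
        chunk = bytes.fromhex(text)
        out += chunk[::-1] if little_endian else chunk
    return bytes(out)


dumped = dump_groups(b"HI!\n", group=2, little_endian=True)
restored = parse_groups(dumped, little_endian=True)
```

Because the same reversal is applied in both directions, reading the dump with the setting that produced it restores the original bytes.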
### With Prefixes
If radix prefixes (see the section on [Radix Prefixes](#radix-prefixes)) are used, then numbers are presumed to be numbers and not simply byte sequences. This means that `0x21e3` represents the *number* `0x21e3`.
The ordering set by directives has *no effect* on the order of bytes in a prefixed number. If you wish to change the order, use a *suffix*, described in the section [Suffixes](#suffixes).
### Suffixes
If a width specifier is given, then an order specifier can also be given. This takes the form of either an `LE` (for little endian) or a `BE` (for big endian) and specifies that the number should be interpreted as written in little-endian or big-endian order. (This means `BE` is effectively a no-op.)
For numbers with radix prefixes, this simply changes the output order of the bytes, so (for example) `0x0102/2LE` *always* produces the byte sequence `0x02` `0x01`.
For byte groups it is applied after the default ordering set by directives or command line switch. This can be confusing. Essentially, suffixing the byte group with `LE` reverses it, but it may *already* have been reversed by setting the default to little endian. In short, **you should only use the suffixes with the default big endian ordering**.
- `[[little_endian]] 0102/2LE` represents the byte order `01 02`
- `[[little_endian]] 0102/2BE` represents the byte order `02 01`
```hhh -p
[[little_endian]] 0102/2LE
[[little_endian]] 0102/2BE
// Result
// 00000000: 01 02 02 01
```
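One way to picture the interaction is as two independent reversals: one for a little endian default and one more for an `LE` suffix. The sketch below models that reading for unprefixed byte groups whose width matches the number of digits. `emit_group` is a hypothetical helper, and the model is inferred from the examples above rather than taken from the **hhh** source.

```python
def emit_group(digits, default_little, suffix="BE"):
    """Model an unprefixed byte group with an optional LE/BE suffix."""
    data = bytes.fromhex(digits)
    if default_little:
        data = data[::-1]  # reversal from the default ordering
    if suffix == "LE":
        data = data[::-1]  # second reversal from the suffix
    return data
```

Under this model, `emit_group("0102", True, "LE")` reverses twice and returns `01 02`, while the same group with a big endian default returns `02 01`.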
## Expressions
Expressions in square brackets can be used to specify a number, either for an offset or for a byte or sequence of bytes. If a byte sequence is intended, a width specifier *should* follow the closing square bracket. Radix prefixes are enabled in the brackets, and decimal is assumed.
```hhh -p
0: [4 * 16 + 2]/2
// Result
// 00000000: 00 42
```
See the chapter [Expressions](expressions.md) for details.
## Repeat
A byte sequence can be enclosed in parentheses, with the closing parenthesis followed by a `*` and a decimal number. This repeats the byte sequence the given number of times. Parenthesized sequences cannot contain an offset specification, assignments, or a nested parenthesized sequence.
```hhh -p
00000020: (6f 40)*8
// Result
// 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
// 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
// 00000020: 6f 40 6f 40 6f 40 6f 40 6f 40 6f 40 6f 40 6f 40
```
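In Python terms, the repeat behaves like multiplying a `bytes` value. The following is an analogy for the example above, not a description of the **hhh** internals.

```python
offset = 0x20
pattern = bytes.fromhex("6f40")

# Zero fill up to the offset, then the pattern repeated eight times.
image = bytes(offset) + pattern * 8
```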
## Variables
Variables begin with a dollar sign `$` and consist of numbers, underscores, and letters. The variables `$_` and `$__` are special, and are described later.
Variables cannot use forward references; you cannot refer to a variable's value before it is assigned.
### Assignment
Assign (or re-assign) a name to a byte sequence using an equal sign. You can then use the variable to refer to that byte sequence, and can further modify it with length or order specifications.
```hhh -p
$x = 1000/4 // Now $x is the sequence 00 00 10 00
$y = 2000 // Now $y is the sequence 20 00
$x/3 // Emit the sequence 00 10 00
$y/2LE // Emit the sequence 00 20
// Result
// 00000000: 00 10 00 00 20
```
If the byte sequence is multiple groups, enclose them in parentheses. If it is a single group, parentheses are optional. You cannot nest parentheses.
When a byte sequence is assigned a name, the endianness rules are applied *before* assigning the value, so that the value is stored in the correct byte order.
Symbolic names can be referenced by giving the name with an initial dollar sign. You can include length and endianness specifiers. A specification of `LE` reverses the stored byte order.
The assigned value can be the value of an expression, but assignments cannot occur inside expressions. Variable references, however, can occur in expressions.
```hhh -p
$a = [17]
$x = [((2**4) | (2**3) + $a) <<2]
$x/2
// Result
// 00000000: 00 a4
```
### Capture
Alternatively you can *capture* a byte group after it is emitted using `->` followed by a variable name. That is, the sequence `214e->$x` both emits the byte sequence and assigns it to `$x`.
The following illustrates the difference between simple assignment and capture.
```hhh -p
$x = 214a // Assign the bytes 21 4a to $x
214a -> $x // Emit 21 4a and then capture those bytes to $x
// The above will only produce 21 4a.
// Result
// 00000000: 21 4a
```
Here is a longer example.
```hhh -p
// Define three symbolic names.
00000020->$here $here: (20 24)->$bytes1 0040/2LE->$bytes2 $bytes1 $bytes2
// $here = 0x20
// $bytes1 = 0x2024/2
// $bytes2 = 0x4000/2
// Result
// 00000000: 00 00 00 20 00 00 00 00 00 00 00 00 00 00 00 00
// 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
// 00000020: 20 24 40 00 20 24 40 00
```
The following produces a byte sequence in addition to assigning `$x`.
```hhh -p
$a = [17]
[((2**4) | (2**3) + 17) <<2]/4 -> $x
// Result
// 00000000: 00 00 00 a4
```
### Special Variables
The special symbolic name `$_` refers to the *current* offset and is always immediately replaced by that value. Thus you *cannot* assign to this variable.
```hhh -p
00000020: /* $_ is 0x20 */ $x=$_ 3f 41 2d 8a /* $_ is 0x24 */ $y=$_
00000030: $x/4 $y/4
// Result
// 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
// 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
// 00000020: 3f 41 2d 8a 00 00 00 00 00 00 00 00 00 00 00 00
// 00000030: 00 00 00 20 00 00 00 24
```
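A minimal model of `$_` is a cursor that advances as bytes are emitted. In this Python sketch the `Emitter` class and its members are hypothetical names introduced for illustration only.

```python
class Emitter:
    """Zero-filled image with a cursor playing the role of `$_`."""

    def __init__(self):
        self.image = bytearray()
        self.here = 0  # current offset, i.e. `$_`

    def seek(self, offset):
        self.here = offset

    def emit(self, data):
        end = self.here + len(data)
        if end > len(self.image):
            self.image.extend(bytes(end - len(self.image)))
        self.image[self.here:end] = data
        self.here = end


out = Emitter()
out.seek(0x20)
x = out.here                         # $x=$_ captures 0x20
out.emit(bytes.fromhex("3f412d8a"))
y = out.here                         # $y=$_ captures 0x24
out.seek(0x30)
out.emit(x.to_bytes(4, "big"))       # $x/4
out.emit(y.to_bytes(4, "big"))       # $y/4
```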
The special symbolic name `$__` (two underscores) refers to the enclosing biased base address. Most of the time, this is likely 0. The real use for this is determining the base address of a structure. See the section on [Structures](#structures) for details.
## Directives
Directives are given inside double square brackets `[[..]]` and take effect immediately. Only a single directive may be given inside each `[[..]]` pair.
```hhh -p
// Enable radix prefixes.
[[prefix]]
// Specify a bias.
[[bias(0x1000)]]
0x1000: 20 30 40 50
// Result
// 00000000: 14 1e 28 32
```
Directives are discussed in their own chapter: [Directives](directive.md). Variables and expressions may be used in directives, and directives can be used to set variables.
```hhh -p
[[set("a",17)]]
$a/2LE
// Result
// 00000000: 11 00
```
## Bias
The offsets can be treated as absolute values; the sequence `10:ff` means the byte `ff` should occur at absolute offset 16 (hex `10`). Alternatively, a *bias* can be subtracted from the explicitly given offsets. By default the bias is zero, so the final offset is the same as the explicitly given offset.
**NB**: When generating a hexdump, the bias is *added*. When parsing a binary description, the bias is *subtracted*.
The bias can be non-zero. For example, consider the following.
```hhh -p
[[bias(0x4000)]]
4000: fe ff 21 4e 8a
// Result
// 00000000: fe ff 21 4e 8a
```
The purpose of this is to allow incorporating hexdumps or other information where there is already an offset present. The bias just moves the offset up by the given amount. The bias can also be *negative*, moving the bytes down by some amount.
Suppose I have a chunk of bytes that starts at 0x4000 and I want to incorporate them at an arbitrary position. That is, I have some bytes and I want the chunk (starting at 0x4000) to be added *here*. There is a directive that does that: `relative`.
```hhh -p
0000: 86 00 0f 8a 01
[[relative(0x4000)]]
4000: fe ff 21 4e 8a
// Result
// 00000000: 86 00 0f 8a 01 fe ff 21 4e 8a
```
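The arithmetic behind both directives is small. In the Python sketch below, a written offset maps to the file position `offset - bias`, and `relative` is modeled as choosing the bias so that the given offset lands at the current position. The `BiasedEmitter` class is hypothetical, and the treatment of `relative` is an assumption inferred from the example above.

```python
class BiasedEmitter:
    def __init__(self):
        self.image = bytearray()
        self.position = 0  # file position of the next byte
        self.bias = 0

    def set_bias(self, bias):        # models [[bias(n)]]
        self.bias = bias

    def set_relative(self, offset):  # models [[relative(n)]] (assumed)
        self.bias = offset - self.position

    def seek(self, offset):          # models `offset:`
        self.position = offset - self.bias

    def emit(self, data):
        end = self.position + len(data)
        if end > len(self.image):
            self.image.extend(bytes(end - len(self.image)))
        self.image[self.position:end] = data
        self.position = end


out = BiasedEmitter()
out.seek(0x0000)
out.emit(bytes.fromhex("86000f8a01"))
out.set_relative(0x4000)             # 0x4000 now maps to position 5
out.seek(0x4000)
out.emit(bytes.fromhex("feff214e8a"))
```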
You can undo the effect of either directive with `[[bias(0)]]`, explicitly setting the bias back to zero.
```hhh -p
0000: 21
[[relative(0x4000)]]
4000: 22
[[bias(0)]]
23 24
// Result
// 00000000: 21 22 23 24
```
With the default bias of zero, `$_` simply reports the offset at each point where it is captured, as the following shows.
```hhh -p
00000002: $x=$_ 21 $y=$_ 22 $z=$_ 23 $x/1 $y/1 $z/1
00000008: $x=$_ 21 22 23 24 $y=$_ 25 26 27 28 $z=$_ $x/1 $y/1 $z/1
// Result
// 00000000: 00 00 21 22 23 02 03 04 21 22 23 24 25 26 27 28
// 00000010: 08 0c 10
```
## Structures
A *structure* can be embedded in a file using curly braces `{..}`. These can be nested, and create a new scope with a bias such that offsets start at zero. The closing brace then restores the outer bias.
The actual base offset of the current scope can be obtained by `$__` (two underscores). Thus within a `{..}` pair, `$__` gives the offset of the *start* of the current scope.
Suppose we have the following C structure.
```c
/* Four 8-byte fields (assuming a 64-bit long). */
struct mystruct {
    long link_off;
    long name_ptr;
    long next_ptr;
    long prev_ptr;
};
```
We might capture that directly (and provide an example) as follows.
```hhh -p
[[prefix]]
// Set up the offsets to elements of the structure.
$link_off = 0
$name_ptr = 8
$next_ptr = 16
$prev_ptr = 24
// Record the structure's total length.
$length = 32
0x0: 0xfe 0xff 0x10 0x22
0x10: $1name = $_ "First Element" 0 // $1name captures the address of this string
0x20: $2name = $_ "Second Element" 0 // $2name captures the address of this string
// Align on a 16-byte boundary.
[($_ + 15) & -16]:
// An instance of the structure.
{
$name_ptr: $1name/8LE // Pointer to the name.
$next_ptr: [$__ + $length]/8LE // Pointer to the next structure.
$prev_ptr: 0/8LE // Null
$length:
}
// Second instance of the structure.
{
$name_ptr: $2name/8LE // Pointer to the name.
$next_ptr: 0/8LE // Null
$prev_ptr: [$__ - $length]/8LE // Pointer to the previous structure.
$length:
}
0x21 0x43
// Result
// 00000000: fe ff 10 22 00 00 00 00 00 00 00 00 00 00 00 00
// 00000010: 46 69 72 73 74 20 45 6c 65 6d 65 6e 74 00 00 00
// 00000020: 53 65 63 6f 6e 64 20 45 6c 65 6d 65 6e 74 00 00
// 00000030: 00 00 00 00 00 00 00 00 10 00 00 00 00 00 00 00
// 00000040: 50 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
// 00000050: 00 00 00 00 00 00 00 00 20 00 00 00 00 00 00 00
// 00000060: 00 00 00 00 00 00 00 00 30 00 00 00 00 00 00 00
// 00000070: 21 43
```
In the above, the bias is initially zero. The structure contains four 8-byte values at specific zero-based offsets. To ensure that the correct size is used, the length is given as the final offset just prior to closing the curly braces.
Note that `$link_off` is not used in the structure, so its bytes remain zero. Also note the use of `$__` to reference the offset of the start of the structure.
This results in the following byte sequence. (Bytes not explicitly shown are zero.)
```
00000000: fe ff 10 22
00000010: 46 69 72 73 74 20 45 6c 65 6d 65 6e 74 00 // The first string
00000020: 53 65 63 6f 6e 64 20 45 6c 65 6d 65 6e 74 00 // The second string
// First structure.
00000030: 00 00 00 00 00 00 00 00 // link_off
00000038: 10 00 00 00 00 00 00 00 // name_ptr
00000040: 50 00 00 00 00 00 00 00 // next_ptr
00000048: 00 00 00 00 00 00 00 00 // prev_ptr
// Second structure.
00000050: 00 00 00 00 00 00 00 00 // link_off
00000058: 20 00 00 00 00 00 00 00 // name_ptr
00000060: 00 00 00 00 00 00 00 00 // next_ptr
00000068: 30 00 00 00 00 00 00 00 // prev_ptr
00000070: 21 43
```
Internally this is implemented with a simple stack. The opening `{` pushes the current bias onto the stack and then sets the bias so that the first address inside the braces is zero. The closing `}` restores the original bias from the stack.
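The stack described above might look like the following Python sketch. It is an illustration only: `ScopedEmitter` and its methods are hypothetical names, a written offset is assumed to map to the file position `offset - bias`, and `$__` is modeled as the file position at which the current scope began (which matches the example when the outer bias is zero).

```python
class ScopedEmitter:
    def __init__(self):
        self.image = bytearray()
        self.position = 0  # file position of the next byte
        self.bias = 0
        self.base = 0      # models `$__`
        self.stack = []    # saved (bias, base) pairs

    def open_scope(self):   # models `{`
        self.stack.append((self.bias, self.base))
        self.base = self.position
        # Choose the bias so that offset 0 maps to the current position.
        self.bias = -self.position

    def close_scope(self):  # models `}`
        self.bias, self.base = self.stack.pop()

    def seek(self, offset):  # models `offset:`
        self.position = offset - self.bias

    def emit(self, data):
        end = self.position + len(data)
        if end > len(self.image):
            self.image.extend(bytes(end - len(self.image)))
        self.image[self.position:end] = data
        self.position = end


out = ScopedEmitter()
out.seek(0x30)
out.open_scope()
out.seek(8)
out.emit((0x10).to_bytes(8, "little"))           # name_ptr
out.seek(16)
out.emit((out.base + 32).to_bytes(8, "little"))  # next_ptr
out.seek(32)                                     # `$length:`
out.close_scope()
```

After the scope closes, the bias is back to zero and the next byte lands at `0x50`, where the second structure would begin.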
[`num`]: https://crates.io/crates/num