hvm-core 0.1.13

HVM-Core is a massively parallel Interaction Combinator evaluator.
Documentation
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
# HVM-Core: a parallel Interaction Combinator evaluator

HVM-Core is a parallel evaluator for extended [Symmetric Interaction Combinators](https://www-lipn.univ-paris13.fr/~mazza/papers/CombSem-MSCS.pdf).

We provide a raw syntax for specifying nets, a reference implementation in Rust,
and a massively parallel evaluator in CUDA. In our benchmarks, HVMC performs up
to **6.8 billion interactions** per second in an RTX 4090. When evaluating
functional programs, it performed 5x faster than the best alternatives, making
it a great compile target for high-level languages seeking massive parallelism.

HVM-Core will be used as the compile target for [HVM](https://github.com/higherorderco/hvm)
on its upcoming update.

## Usage

Install HVM-Core as:

```
cargo install hvm-core
```

Then, run the reference single-core interpeter as:

```
hvmc run file.hvmc -s
```

You can also compile it to a fast executable as:

```
hvmc compile file.hvmc
./file
```

If you have a GPU, run it with thousands of threads as:

```
hvmc bend file.hvmc -s
```

**Note**: `bend` wasn't implemented yet; if you're here *today* and wants to run
it on the GPU, you'll need to manually edit `hvm2.cu` using the
`hvmc gen-cuda-book file.hvmc` command. Compile with `-arch=compute_89`. I'll
add that to the CLI soon. Accepting PRs though :)

## Example

HVMC is a low-level compile target for high-level languages. The
file below performs a simple recursive sum:

```javascript
@sum = (? (#1 @sumS) a a)

@sumS = ({2 a b} c)
  & @sum ~ (a e)
  & @sum ~ (b d)
  & #1   ~ <d <e c>>

@main = R
  & @sum ~ (#24 R)
```

If that look alien to you, don't worry; it does to me too. Fortunatelly, we have
`[HVM-Lang](https://github.com/HigherOrderCO/hvm-lang)`, a tool that generates
`.hvmc` files from a familiar syntax. On HVM-Lang, the program above is just:

```javascript
sum = λn match n {
  0   : 1
  1+p : (+ (sum p) (sum p))
}

main = (sum 24)
```

If you do want to understand the hardcore syntax, though, keep reading. For more
examples, see the [`/examples`](/examples) directory.

## Language

HVM-Core provides a textual syntax that allows us to construct interaction
combinator nets using a simple AST:

```
<TERM> ::=
  <ERA> ::= "*"
  <CON> ::= "(" <TERM> " " <TERM> ")"
  <DUP> ::= "[" <TERM> " " <TERM> "]"
  <CTR> ::= "{" <label> " " <TERM> " " <TERM> "}"
  <REF> ::= "@" <name>
  <U24> ::= "#" <value>
  <OP2> ::= "<" <TERM> " " <TERM> ">"
  <MAT> ::= "?" <TERM> <TERM>
  <VAR> ::= <name>

<NET> ::=
  <ROOT> ::= <TERM>
  <RDEX> ::= "&" <TERM> "~" <TERM> <NET>

<BOOK> ::= 
  <DEF> ::= "@" <NET> <BOOK>
  <END> ::= <EOF> 
```

On top of pure interaction combinators, HVMC includes a minimal set of
performance-critical features, including top-level definitions (as closed nets),
unboxed 24-bit numbers, binary numeric operations and if-then-else. Below is a
complete list of all term variants:

- `ERA`: an eraser node, as defined on the reference system.

- `CON`: a constructor node, as defined on the reference system.

- `DUP`: a duplicator node, as defined on the reference system.

- `CTR`: an "extra" node, which behaves exactly like CON/DUP nodes, but with a
  different symbol. When the label is 0/1, it corresponds to a CON/DUP node.

- `VAR`: a named variable, used to create a wire. Each name must occur twice,
  denoting both endpoints of a wire.

- `REF`: a reference to a top-level definition, which is itself a closed net.
  That reference is unrolled lazily, allowing for recursive functions to be
  implemented without the need for Church numerals and the like.

- `U24`: an unboxed 24-bit unsigned integer.

- `OP2`: a binary operation on u24 operands.

- `MAT`: a pattern-matching operator on u24 values.

Note that terms form a tree-like structure. Yet, interaction combinators are not
trees, but graphs; terms aren't enough to express all possible nets. To fix
that, we provide the `& <TERM> ~ <TERM>` syntax, which connects the top-most
main port of each tree. This allows us to build any closed nets with a single
free wire. For example, to build the closed net:

```
R       .............
:       :           :
:      /_\         /_\
:    ..: :..     ..: :..
:    :     :     :     :
:   /_\   /_\   /_\   /_\
:...: :   : :...: :   :.:
      *   :.......:
```

We could use the following syntax:

```
@main
  = R
  & ((R *) (x y))
  ~ ((y x) (z z))
```

Here, `@main` is the name of the closed net, `R` is used to denote its single
free wire, each CON node is denoted as `(x y)`, and the ERA node is represented
as a `*`. The wires from an aux port to a main port are denoted by the tree
hierarchy, the wires between aux ports are denoted by named variables, and the
single wire between main ports is denoted by the `& A ~ B` syntax. Note this
always represents an active pair (or redex)!

## Evaluator

HVMC comes with 2 evaluators: a reference interpreter in Rust, and a massively
parallel runtime in CUDA. Both evaluators are completely eager, which means they
will reduce *every* generated active pair (redex) in an ultra-greedy fashion.
Because of that, to perform recursion, it is advisable to pre-compile the source
language to a [supercombinator](https://en.wikipedia.org/wiki/Supercombinator)
formulation, as it allows sub-expressions to be unrolled lazily, preventing HVMC
from infinitely expanding recursive function bodies. For the same reason, terms
like the Y-Combinator aren't compatible with current HVMC. A lazy evaluator
version will be provided in the future.

The eager evaluator works by keeping a vector of current active pairs (redexes)
and, for-each redex, performing an "interaction", as described below. On the
single-core version, that "for-each" is done in a sequential loop. On the
multi-core version, the vector of redexes is actually split into a grid of
"redex bags", each bag owned by a "rewrite squad", which continuously pops and
executes a redex in parallel. Since interaction rules are symmetric on the 4
surrounding ports, we actually use 4 threads to perform an interaction, thus the
name "squad".

For example, on NVidia's RTX 4090, we keep a grid of 128x128 redex bags. Each
bag is processed by a rewrite squad. There are `16,384` active squads, for a
total of `65,536` active threads. Below is a visual representation:

```
REDEX ::= (Ptr32, Ptr32)

    A1 --|\        /|-- B2
         | |A0--B0| |
    A2 --|/        \|-- B1

REDEX_BAG ::= Vec<REDEX>

    [(A0,B0), (C0,D0), (E0,F0), ...]

REDEX_GRID ::= Matrix<128, 128 REDEX_BAG>

    [ ] -- [ ] -- [ ] -- [ ] -- [ ] -- [ ] -- [ ] -- [ ] ...
     |      |      |      |      |      |      |      |
    [ ] -- [ ] -- [ ] -- [ ] -- [ ] -- [ ] -- [ ] -- [ ] ...
     |      |      |      |      |      |      |      |
    [ ] -- [ ] -- [ ] -- [ ] -- [ ] -- [ ] -- [ ] -- [ ] ...
     |      |      |      |      |      |      |      |
    [ ] -- [ ] -- [ ] -- [ ] -- [ ] -- [ ] -- [ ] -- [ ] ...
     |      |      |      |      |      |      |      |
    [ ] -- [ ] -- [ ] -- [ ] -- [ ] -- [ ] -- [ ] -- [ ] ...
     |      |      |      |      |      |      |      |
    [ ] -- [ ] -- [ ] -- [ ] -- [ ] -- [ ] -- [ ] -- [ ] ...
     |      |      |      |      |      |      |      |
    [ ] -- [ ] -- [ ] -- [ ] -- [ ] -- [ ] -- [ ] -- [ ] ...
     |      |      |      |      |      |      |      |
    [ ] -- [ ] -- [ ] -- [ ] -- [ ] -- [ ] -- [ ] -- [ ] ...
     |      |      |      |      |      |      |      |
    ...    ...    ...    ...    ...    ...    ...    ... 

SQUAD ::= Vec<4, THREAD>

THREAD ::=

    loop {
      if (A, B) = pop_redex():
        atomic_rewrite(A, B)
      share_redexes()
    }

    atomic_rewrite(A, B):
      if con-con:
        thread 0: atomic_subst(A.1, B.1)
        thread 1: atomic_subst(A.2, B.2)
        thread 2: atomic_subst(B.1, A.1)
        thread 3: atomic_subst(B.2, A.2)
      else if con-dup:
        ...

    atomic_subst(a, b):
      a_ok = CAS(a, a.rev(), b)
      b_ok = CAS(b, b.rev(), a)
      if not (a_ok and b_ok):
        atomic_link(a, a.rev(), b)

    atomic_link(a, b):
      see algorith on 'paper/'

RESULT:

    - thread 0 links A1 -> B1
    - thread 1 links B1 -> A1
    - thread 2 links A2 -> B2
    - thread 3 links B2 -> A2

OUTPUT:

    A1 <--------------> B1
    A2 <--------------> B2
```

With this setup, a program reaches the peak parallelism when there are about 16k
active redexes, resulting in a maximum theoretical speedup of 65536x. Note that
even a "maximally sequential" interaction combinator program (i.e., one that, at
any point of the reduction, always has exactly 1 active pair) still benefits
from a "4x" degree of parallelism, due to squads having 4 threads; although,
obviously, at that point a CPU version would be much faster, as CPU cores are
way faster than 4 GPU cores.

## Interactions

The available interactions are the same described on the reference paper, namely
annihilation and commutation rules, plus a few additional interactions,
including primitive numeric operations, conditionals, and dereference. The core
interactions are:

```
()---()
~~~~~~~ ERA-ERA
nothing
```

```
A1 --|\
     |a|-- ()
A2 --|/
~~~~~~~~~~~~~ CTR-ERA
A1 ------- ()
A2 ------- ()
```

```
A1 --|\     /|-- B2
     |a|---|b|   
A2 --|/     \|-- B1
~~~~~~~~~~~~~~~~~~~ CTR-CTR (if a ~~ b)
A1 -----, ,----- B2
         X
A2 -----' '----- B1
```

```
A1 --|\         /|-- B2
     |a|-------|b|   
A2 --|/         \|-- B1
~~~~~~~~~~~~~~~~~~~~~~~ CTR-CTR (if a !~ b)
      /|-------|\
A1 --|b|       |a|-- B2
      \|--, ,--|/
           X
      /|--' '--|\
A2 --|b|       |a|-- B1
      \|-------|/
```

The dereference interactions happen when a @REF node interacts with another
node. When that node is a constructor, the dereference will be unrolled
efficiently. This makes HVM practical, because, without it, top-level
definitions would need to be implemented with DUP nodes. This would cause
considerable overhead when trying to implement functions, due to DUP nodes
incremental copying nature. When the other node is anything else, that implies
two closed nets got disconnected from the main graph, so both nodes are
collected, allowing recursive functions to halt without infinite expansions.

```
() -- @REF
~~~~~~~~~~ ERA-REF
nothing
```

```
A1 --|\
     | |-- @REF
A2 --|/
~~~~~~~~~~~~~~~~ CTR-REF
A1 --|\
     | |-- {val}
A2 --|/
```

Since Interaction Combinator nodes only have 1 active port, which is a property
that is essential for key computational characteristics such as strong
confluence, we can't have a binary numeric operation node. Instead, we split
numeric operations in two nodes: OP2, which processes the first operand and
returns an OP1 node, which then processes the second operand, performs the
computation, and connects the result to the return wire.

```
A1 --,
     [}-- #X
A2 --' 
~~~~~~~~~~~~~~ OP2-NUM
A2 --[#X}-- A1
```

```
A1 --[#X}-- #Y
~~~~~~~~~~~~~~ OP1-NUM
A1 -- #Z
```

Note that the OP2 operator doesn't store the operation type. Instead, it is
stored on 4 unused bits of the left operand. As such, an additional operation
called "load-op-type" is used to load the next operation on the left operand.
See the `/examples` directory for more info. Below is a table with all available
operations:

N   | operation
--- | ---------
  0 | load-op-type
  1 | addition
  2 | subtraction
  3 | multiplication
  4 | division
  5 | modulus
  6 | equal-to
  7 | not-equal-to
  8 | less-than
  9 | greater-than
  A | logical-and
  B | logical-or
  C | logical-xor
  D | logical-not
  E | left-shift
  F | right-shift

Since HVM already provides plenty of solutions for branching (global references,
lambda encoded booleans and pattern-matching, etc.), the pattern-match operation
is only necessary to read bits from numbers: otherwise, numbers would be "black
boxes" that can't interact with the rest of the program. The way it works is
simple: it receives a number, two branches (case-zero and case-succ, stored in a
CON node) and a return wire. If the number is 0, it erases the case-succ branch
and returns the case-zero branch. Otherwise, it erases the case-zero branch and
returns the case-succ branch applied to the predecessor of the number.

```
A1 --,
     (?)-- #X
A2 --' 
~~~~~~~~~~~~~~~~~~ MAT-NUM (#X > 0)
           /|-- A2
      /|--| |
A1 --| |   \|-- #(X-1)
      \|-- ()
```

```
A1 --,
     (?)-- #X
A2 --' 
~~~~~~~~~~~~~~~~~~ MAT-NUM (#X == 0)
      /|-- ()
A1 --| |   
      \|-- A2
```

Note that some interactions like NUM-ERA are omitted, but should logically
follow from the ones described above.

## Memory Layout

The memory layout is optimized for efficiency, and can be summarized as:

```rust
// A pointer is a 32-bit word
type Ptr = u32;

// A node stores its two aux ports
struct Node {
  p1: Ptr, // this node's fst aux port
  p2: Ptr, // this node's snd aux port 
}

// A redex links two main ports
struct Redex {
  a: Ptr, // main port of node A
  b: Ptr, // main port of node B
}

// A closed net
struct Net {
  root: Ptr,       // a free wire
  rdex: Vec<Redex> // a vector of redexes
  heap: Vec<Node>  // a vector of nodes
}
```

As you can see, the memory layout resembles the textual syntax, with nets being
represented as a vector of trees, with the 'redex' buffer storing the tree roots
(as active pairs), and the 'nodes' buffer storing all the nodes. Each node has
two 32-bit pointers and, thus, uses exactly 64 bits. Pointers include a 4-bit
tag and a 28-bit value, which allows addressing a 2 GB space per instance. There
are 16 pointer types:

```rust
VR1 = 0x0; // variable to aux port 1
VR2 = 0x1; // variable to aux port 2
RD1 = 0x2; // redirect to aux port 1
RD2 = 0x3; // redirect to aux port 2
REF = 0x4; // lazy closed net
ERA = 0x5; // unboxed eraser
NUM = 0x6; // unboxed number
OP2 = 0x7; // numeric operation (binary)
OP1 = 0x8; // numeric operation (unary)
ITE = 0x9; // numeric if-then-else
CT0 = 0xA; // main port of con
CT1 = 0xB; // main port of dup
CT2 = 0xC; // main port of extra node 
CT3 = 0xD; // main port of extra node
CT4 = 0xE; // main port of extra node
CT5 = 0xF; // main port of extra node
```

This memory-efficient format allows for a fast implementation in many
situations. For example, allocation can be performed by a single 64-bit atomic
CAS, and an annihilation interaction can be done with 2 calls to an
`atomic_link()` procedure, which, in most cases, only requires 2 atomic reads
and 2 atomic CAS's.

We also provide unboxed 24-bit unsigned integers, which allows HVMC to store raw
data with minimal loss. For example, to store a raw 3 KB buffer, one could use a
perfect binary tree of CON nodes with a depth of 9, as follows:

```javascript
@buff = ((((((((((X0 X1) (X2 X3)) ((X4 X5) (X6 X7))) ...)))))))
```

This would use a total of 1023 nodes, which takes a space of almost exactly 8 KB
on HVMC. As such, while buffers are not part of the spec, we can store raw data
with a 37.5% efficiency using pure interaction nets, which is a reasonable price
to pay to gain the ability of computing them with massive parallelism.

## Contributing

To verify if there's no performance regression:

```bash
git checkout main
cargo bench -- --save-baseline main # save the unchanged code as "main"
git checkout <your-branch>
cargo bench -- --baseline main      # compare your changes with the "main" branch
```

To verify if there's no correctness regression, run `cargo test`. You can add `cargo-insta` with `cargo install cargo-insta` and run `cargo insta test` to see if all test cases pass, if some don't and they need to be changed, you can run `cargo insta review` to review the snapshots and correct them if necessary.