hvm 2.0.22

A massively parallel, optimal functional runtime in Rust.
Documentation
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
1001
1002
1003
1004
1005
1006
1007
1008
1009
1010
1011
1012
1013
1014
1015
1016
1017
1018
1019
1020
1021
1022
1023
1024
1025
1026
1027
1028
1029
1030
1031
1032
1033
1034
1035
1036
1037
1038
1039
1040
1041
1042
1043
1044
1045
1046
1047
1048
1049
1050
1051
1052
1053
1054
1055
1056
1057
1058
1059
1060
1061
1062
1063
1064
1065
1066
1067
1068
1069
1070
1071
1072
1073
1074
1075
1076
1077
1078
1079
1080
1081
1082
1083
1084
1085
1086
1087
1088
1089
1090
1091
1092
1093
1094
1095
1096
1097
1098
1099
1100
1101
1102
1103
1104
1105
1106
1107
1108
1109
1110
1111
1112
1113
1114
1115
1116
1117
1118
1119
1120
1121
1122
1123
1124
1125
1126
1127
1128
1129
1130
1131
1132
1133
1134
1135
1136
1137
1138
1139
1140
1141
1142
1143
1144
1145
1146
1147
1148
1149
1150
1151
1152
1153
1154
1155
1156
1157
1158
1159
1160
1161
1162
1163
1164
1165
1166
1167
1168
1169
1170
1171
1172
1173
1174
1175
1176
1177
1178
1179
1180
1181
1182
1183
1184
1185
1186
1187
1188
1189
1190
1191
1192
1193
1194
1195
1196
1197
1198
1199
1200
1201
1202
1203
1204
1205
1206
1207
1208
1209
1210
1211
1212
1213
1214
1215
1216
1217
1218
1219
1220
1221
1222
1223
1224
1225
1226
1227
1228
1229
1230
1231
1232
1233
1234
1235
1236
1237
1238
1239
1240
1241
1242
#import "@preview/unequivocal-ams:0.1.0": ams-article, theorem, proof
#import "@preview/cetz:0.2.2": canvas, draw
#import "inet.typ"

#show link: underline
#set cite(form: "normal", style: "iso-690-author-date")


#show: ams-article.with(
  title: [HVM2: A Parallel Evaluator for Interaction Combinators],
  authors: ((name: "Victor Taelin", company: "Higher Order Company", email: "taelin@HigherOrderCO.com"),),
  abstract:[
    We present HVM2, an efficient, massively parallel evaluator for extended interaction combinators. When compiling non-sequential programs from a high-level programming language to C and CUDA, we achieved a near-ideal parallel speedup as a function of cores available, scaling from 400 million interactions per second (MIPS) (Apple M3 Max; single thread), to 5,200 MIPS (Apple M3 Max; 16 threads), to 74,000 MIPS (NVIDIA RTX 4090; 32,768 threads). In this paper we describe HVM2's architecture, present snippets of the reference implementation in Rust, share early benchmarks and experimental results, and discuss current limitations and future plans.
  ],
  bibliography: bibliography("refs.bib", style: "annual-reviews"),
)

*This paper is a work in progress. See the #link("https://github.com/HigherOrderCO/HVM")[HVM repo] for the  latest version.*

= Introduction

Interaction Nets (IN's) @Lafont_1990 and Interaction Combinators (IC's) @Lafont_1997 were introduced by Lafont as a minimal and concurrent model of computation. Lafont proved that IC's were not only Turing Complete, but that they also preserve the complexity class and degree of parallelism. Moreover, Lafont argued that while Turing Machines are a universal model of sequential computation, IC's are a universal model of #emph[distributed] computation. The locality and strong confluence of Lafont's ICs make it suitable for massive parallel computation. This heavily implied that IC's are an optimal model of computation, in a very fundamental sense. Yet, it remained to be seen if this system could be implemented efficiently in practice.

In this paper, we answer this question positively. By storing Interaction Combinator nodes in a memory-efficient format, we're able to implement its core operations (annihilation, commutation, and erasure) as lightweight C procedures and CUDA kernels. Furthermore, by representing wires as atomic variables, we're able to perform interactions atomically, in a lock-free fashion and with minimal synchronization. We also extend our system with global definitions (for fast function applications) and native numbers (for fast numeric operations). The result, HVM2, is an efficient, massively parallel evaluator for ICs that achieves near-ideal speedup, up to at least 16,384 concurrent cores, peaking at 74 billion interactions per second on an NVIDIA RTX 4090.

This level of performance makes it compelling to propose HVM2 as a general
framework for parallel computing. By translating constructs such as functions, algebraic data types, pattern matching, and recursion to HVM2, we see it as a potential compilation target for modern programming languages such as Python and Haskell. As a demonstration of this possibility, we also introduce #link("https://github.com/HigherOrderCO/bend")[Bend], a high-level programming language that compiles to HVM2. We explain how some of these translations work, and set up a general framework to translate arbitrary languages, procedural or functional, to HVM2.

#pagebreak()

= Similar Works

*Work In Progress*

#pagebreak()

= Syntax

HVM2's syntax consists of an Interaction Calculus system which textually represents an Interaction Combinator system @Calculus_1999. This textual system is capable of representing any arbitrary Interaction Net, and it is therefore possible to represent "vicious circles" @Lafont_1997. We only consider HVM2 programs which do not contain any vicious circles.

HVM2's syntax has seven different types of _agents_ (in Lafont's terms) or _Nodes_. Additionally, HVM2 also has _Variables_ to represent wires which connect ports across Nodes. _Trees_ are either Variables or Nodes. They are represented syntactically as:

#align(center)[
  ```
  <Node> ::=
      | "*"                    -- (ERA)ser
      | "@" <alphanumeric>     -- (REF)erence
      | <Numeric>              -- (NUM)eric
      | "(" <Tree> <Tree> ")"  -- (CON)structor
      | "{" <Tree> <Tree> "}"  -- (DUP)licator
      | "$(" <Tree> <Tree> ")" -- (OPE)rator
      | "?(" <Tree> <Tree> ")" -- (SWI)tch

  <Tree> ::=
      | <alphanumeric>         -- (VAR)iable
      | <Node>

  <alphanumeric> ::= [a-zA-Z0-9_.-/]+
  ``` \
  _(For details on the `<Numeric>` syntax, see Numbers (@numbers))_
] \
  \

Notice that Nodes form a tree-like structure, and throughout this document we will make systematic confusion between Nodes and Trees. For example, in Interactions (@interactions), it is critical to know when a Variable is permissible or not when referring to arbitrary Nodes. In Memory Layout (@memory), however, we purposefully blur the line between Nodes and Variables as the memory layout is greatly simplified by doing so.

The first three node types (`ERA`, `REF`, `NUM`) are nullary, and the last 4 types (`CON`, `DUP`, `OPE`, `SWI`) are binary. As implied above, `VAR` can be seen as an additional node type. As in Lafont's Interaction Nets, every node has an extra distinguished edge, called the main or principal port @Lafont_1997. Thus, nullary nodes have one port (one main and zero auxiliary), while binary nodes have three ports (one main and two auxiliary).

With the syntax above, the main port of the root node is "free", as it is not wired to another port. We can connect two main ports and form a reducible expression (redex), using the following syntax:

#align(center)[
  ```
  <Redex> ::= <Tree> "~" <Tree>
  ```
]

A Net consists of a root tree and a (possibly empty) list of `&`-separated redexes:

#align(center)[
  ```
  <Net> ::= <Tree> ("&" <Redex>)*
  ```
]

#pagebreak()

HVM2 Nets represent what are known as _configurations_ in the literature. A Net like
#align(center)[
  ```
  t1 & v1 ~ w1
     & ..
     & vn ~ wn
  ```
]

graphically represents a net like,

#pad(1em)[
  #figure(
    image("configuration.png", width: 50%),
    caption: [
      A configuration #footnote[This is a modified image
 of a configuration with multiple free main ports @Salikhmetov_2016.]
    ],
  )
]

where $omega$ is a wiring. Thus, HVM2 Nets contain only a single free main port. Wirings are possible between trees through pairs of `VAR` nodes with the same names. Note, however, that a variable can only occur twice. This aligns with Interaction Nets in that a wire can only connect two ports.

Lastly, an Book consists of a list of top-level definitions, or, "named" Nets:

#align(center)[
  ```
  <Book> ::= ("@" <name> "=" <Net>)*
  ```
]

Each `.hvm2` file contains a book, which is executed by HVM2. The entry point for HVM2 programs is the `@main` definition.

== An Example

The following definition:
#align(center)[
  ```
  @succ = ({(a b) (b R)} (a R))
  ```
]

Represents the HVM2 encoding of the $lambda$-calculus term `λs λz (s (s z))`, and can be drawn as the following Interaction Combinator net:
/*
          :
         /_\
      ...: :...
      :       :
     /#\     /_\
   ..: :..   a R
  /_\   /_\  : :
  a b...b R..:.:
  :..........:
*/
#figure(caption: [An example $lambda$-calculus term.], canvas({
  import draw: *

  // inet.con(name: "a", pos: (0, 2), rot: -90deg)
  // inet.con(name: "b", pos: (2, 0), rot: -180deg)
  // inet.con(name: "c", pos: (0, -2), rot: 90deg)
  inet.con(name: "a", pos: (0, 0), rot: 90deg)
  inet.dup(name: "b", pos: (1.5, -1), rot: 90deg)
  inet.con(name: "c", pos: (1.5, 1), rot: 90deg)
  inet.con(name: "d", pos: (3, -2), rot: 90deg)
  inet.con(name: "e", pos: (3, 0), rot: 90deg)
  inet.link("a.0", "a.0")

  inet.link("a.1", "b.0")
  inet.link("a.2", "c.0")
  inet.link("b.1", "d.0")
  inet.link("b.2", "e.0")

  inet.link("d.2", "e.1")
  content((3.5, -1), [`b`])

  // "c.2" -> "e.2"
  inet.port("R", (3.45, 1 + 1/6), -90deg)
  inet.port("R'", (3.45, 1 + 1/6), 90deg)
  inet.link("e.2", "R")
  inet.link("c.2", "R'")
  content((3, 1.5), [`R`])

  // "c.1" -> "d.1"
  inet.port("A1", (4.1, -1), -180deg)
  inet.port("A2", (4.1, 0), 0deg)
  inet.port("A3", (3.45, 1 - 1/6), 90deg)
  inet.port("A3'", (3.45, 1 - 1/6), -90deg)
  inet.link("d.1", "A1")
  inet.link("A1", "A2")
  inet.link("A2", "A3'")
  inet.link("c.1", "A3")
  content((4.4, -0.5), [`a`])
}))

Notice how `CON`/`DUP` nodes in HVM2 correspond directly to constructor and duplicator nodes in Lafont's Interaction Combinators @Lafont_1997. Aux-to-main wires are implicit through the tree-like structure of the syntax, while aux-to-aux wires are explicit through variable nodes, which are always paired.

Additionally, main ports being implicit is critical to storing nodes efficiently in memory. Nodes can be represented as just two ports (the HVM2 memory model for wires) rather than three. In HVM2, since every port is 32 bits, this allows us to store a single node in a 64-bit word. This compact representation lets us use built-in atomic operations in various parts of the code, which was key to making parallel C and CUDA versions efficient. For details on the precise memory representation, see @architecture.

== Interpretation of the Syntax

Semantically, `CON`, `DUP`, and `ERA` nodes correspond accordingly to Lafont's #emph[constructor], #emph[duplicator], and #emph[eraser] symbols @Lafont_1997, and behave like Mazza's Symmetric Interaction Combinators @Mazza_2007. The `VAR` node represents a wiring in the graph, connecting two ports of a Net. They are linear and paired in the sense that (except for the free main port) every port is connected to exactly one other port, and therefore each variable occurs exactly twice.

`REF` nodes are an extension to Lafont's IC's, and they represent an immutable net that is expanded in a single interaction. While not essential for the expressivity of the system, `REF` nodes are essential for performance, as they enable fast global functions, a degree of laziness in a strict setup (critical to making GPU implementations viable), and allow us to represent tail recursion in constant space.

`NUM`, `OPE` and `SWI` nodes are also not essential expressivity-wise, but are too important for performance reasons. Modern processors are equipped with native machine integer operations. Emulating these operations with IC constructs analogous to Church or Scott Numerals would be very inefficient. Thus, these numeric nodes are necessary for HVM2 to be efficient in practice.

#pagebreak()

= Interactions <interactions>

The AST above specifies HVM2's data format. As a virtual machine, it also provides a mechanism to compute with that data. In traditional VMs, these are called *instructions*. In term rewriting systems, there are usually *reductions*. In HVM2, the mechanism for computation is called *interactions*. There are ten of them. All interactions are listed below using Gentzen-style rules (redexes in the line above reduce to ones in the line below).

#v(1em)
$
  & "Arbitrary Nodes (including" #raw("VAR") ")" #h(2em) & #raw("A"), #raw("B"), #raw("C"), #raw("D") &:= #raw("<Tree>") \
  & "Binary Nodes" #footnote[In the interaction rules `()` and `{}` refer to arbitrary binary nodes, not just `CON` and `DUP`.] & #raw("()"), #raw("{}") &:= #raw("CON") | #raw("DUP") | #raw("OPA") | #raw("SWI") \
  & "Nullary Nodes" & circle.filled.small, circle.stroked.small &:= #raw("ERA") | #raw("REF") | #raw("NUM") \
  & "Numeric Nodes" & #raw("N"), #raw("M") &:= #raw("NUM") "(Numbers or Operations)"\
  & "Numeric Value Nodes (Not Operators)" & #raw("#n"), #raw("#m") &:= #raw("NUM") "where" #raw("n"), #raw("m") in bb(Q) \
  & "Erasure Nodes" & #raw("*") &:= #raw("ERA") \
  & "Variables" & #raw("x"), #raw("y"), #raw("z"), #raw("w") &:= #raw("VAR") \

$

#v(1em)
#v(1em)
#show math.equation: set text(15pt)
#grid(
  align: center,
  columns: (1fr, 1fr),
  gutter: 1pt,
  [
    #smallcaps[(#strong[link])]
    #math.frac(
      [
        `B` contains `x` \
        `x ~ A`
      ],
      [`B[x` $arrow.l$ `A]`]
    )
  ],
  [
    #smallcaps[(#strong[call])]
    #math.frac(
      [
      `A` is not a `VAR` node
      #v(1em) \
      `@foo ~ A`
    ],
    `expand(@foo) ~ A`)
  ],

)
#v(1em)
#grid(
  align: center,
  columns: (1fr, 1fr),
  gutter: 1pt,
  [
    #smallcaps[(#strong[void])]
    #math.frac([$circle.filled.small$ `~` $circle.stroked.small$], ``)
  ],
  [
    #smallcaps[(#strong[eras]e)]
    #math.frac([$circle.filled.small$ `~ (A B)`],
      [
        $circle.filled.small$ `~ A` \
        $circle.filled.small$ `~ B`
      ])
  ],
)
#v(1em)
#grid(
  align: center,
  columns: (1fr, 1fr),
  gutter: 1pt,
  [
    #smallcaps[(#strong[comm]ute)]
    #math.frac(`(A B) ~ {C D}`,
      ```
      {x y} ~ A
      {z w} ~ B
      (x z) ~ C
      (y w) ~ D
      ```
    )
  ],
  [
    #smallcaps[(#strong[anni]hilate)]
    #math.frac(
      `(A B) ~ (C D)`,
      ```
      A ~ C
      B ~ D
      ```
    )
  ],

)
#v(1em)
#grid(
  align: center,
  columns: (1fr, 1fr),
  gutter: 1pt,
[
    #smallcaps[(#strong[oper]ate 1)]
    #math.frac(`N ~ $(M A)`, `op(N, M) ~ A`)
  ],
  [
    #smallcaps[(#strong[swit]ch 1)]
    #math.frac(`#0 ~ ?(A B)`, `A ~ (B *)`)

  ],

)
#v(1em)
#grid(
  align: center,
  columns: (1fr, 1fr),
  gutter: 1pt,
  [
    #smallcaps[(#strong[oper]ate 2)]
    #math.frac(
      [
        `A` is not a `NUM` node
        #v(1em) \
        `N ~ $(A B)`
      ],
      `A ~ $(N B)`
    )
  ],
  [
    #smallcaps[(#strong[swit]ch 2)]
    #math.frac(`#n+1 ~ ?(A B)`, `A ~ (* (#n B))`)
  ],
)
#v(2em)
#show math.equation: set text(10pt)

Note that rules are #emph[symmetric]: if a rule applies to a redex `A ~ B` then it also applies to `B ~ A`. Explanations and implementation details of each of the rules follow.

== Link
"Links" two ports where at least one is a `VAR` Node. A *global substitution* is performed replacing the single other occurrence of `x` with `A`. Recall that there is _exactly_ one other occurrence of `x` in the net. In the graph-rewriting system of Interaction Nets, linking two ports isn't technically an interaction, as wires are not named. However, for the term-rewriting calculus, wires are named and must be substituted for. When applying this rule, the redex `x ~ A` is removed, and the occurrence of `x` in `B` node is replaced with `A`. This is the only rule where nodes "far apart" can affect each other. See <linker> for details on the `link` function.

== Call
Expands a `REF`, replacing it with its definition. The definition is essentially copied from the static Book to the global memory, allocating its nodes and creating fresh variables. This operation is key to enable fast function application, since, without it, one would need to use duplicator nodes for the same purpose, which brings considerable overhead. It also introduces some laziness in a strict evaluator, allowing for global recursive functions, and constant-space tail-calls.

== Void
Erases two nullary nodes connected to each other. The result is nothing: both nodes are consumed, fully cleared from memory. The `VOID` rule completes a garbage collection process.

== Erase
Erases a binary node `(A B)` connected to an nullary node, propagating the nullary node towards both nodes `A` and `B`. The rule performs a granular, parallel garbage collection of nets that go out of scope.

When the nullary node is a `NUM` or a `REF`, the #smallcaps[erase] rule actually behaves as a copy operation, cloning the `NUM` or `REF`, and connecting to both ports. #emph[However], when a copy operation is applied to a `REF` which contains `DUP` nodes, it instead is computed as a normal #smallcaps[call] operation. This allows us to perform fast copy of "function pointers", while still preserving Interaction Combinator semantics.

== Commute
Commutes two binary nodes of different types, essentially cloning them. The #smallcaps[commute] rule can be used to clone data and to perform loops and recursion, although these are preferably done via #smallcaps[call]s: Cloning large networks is faster through the #smallcaps[call] interaction, as it can be done in a single pass, as opposed to the incremental #smallcaps[commute] interactions that would have to propagate throughout the network.

== Annihilate
Annihilates two binary nodes of the same type connected to each-other, replacing them with two redexes. The #smallcaps[annihilate] rule is the most essential computation rule, and is used to implement beta-reduction and pattern-matching.

== Operate
Performs a numeric operation between two `NUM` nodes `N` and `M` connected by an `OPE` node. Note that `N` and `M` should not both be numeric values for this interaction to perform a sensible numeric operation. Dispatching to different native numeric operations depends on the `N` and `M` nodes themselves. See Numbers (@numbers) for details.

Note than when counting the number interactions, #smallcaps[operate 2] is not counted, as this would cause the number of interactions to be non-deterministic.

== Switch
Performs a switch on a `NUM` node `#n` connected to a `SWI` node, treating it like a `Nat ::= Zero | (Succ pred)`. Here, `A` is expected to be a #highlight[tuple (first reference of the term "tuple", it's unclear what the encoding is)] with both cases: `zero` and `succ`, and `B` is the return port. If `n` is 0, we return the `zero` case, and erase the `succ` case. Otherwise, we return the `succ` case applied to `n-1`, and erase the `zero` case.

#pagebreak()

== Interaction Table

Since there are eight node types, there is a total of 64 possible pairwise node interactions. The interaction rule for an active pair is _uniquely_ determined by the types of the two nodes in the pair. The table below shows which interaction rule is triggered for each possible pair of nodes that form a redex. Since the interaction rules are symmetric, this table is symmetric across the main diagonal. #highlight[`CON`-`SWI` being #smallcaps[comm] is problematic; should be removed or the reduction for #smallcaps[swit] should change].
#v(1em)

#align(center)[
  ```
  | A\B |  VAR |  REF |  ERA |  NUM |  CON |  DUP |  OPR |  SWI |
  |-----|------|------|------|------|------|------|------|------|
  | VAR | LINK | LINK | LINK | LINK | LINK | LINK | LINK | LINK |
  | REF | LNIK | VOID | VOID | VOID | CALL | ERAS | CALL | CALL |
  | ERA | LINK | VOID | VOID | VOID | ERAS | ERAS | ERAS | ERAS |
  | NUM | LINK | VOID | VOID | VOID | ERAS | ERAS | OPER | SWIT |
  | CON | LINK | CALL | ERAS | ERAS | ANNI | COMM | COMM | COMM |
  | DUP | LINK | ERAS | ERAS | ERAS | COMM | ANNI | COMM | COMM |
  | OPR | LINK | CALL | ERAS | OPER | COMM | COMM | ANNI | COMM |
  | SWI | LINK | CALL | ERAS | SWIT | COMM | COMM | COMM | ANNI |
  ```
]

\

Because for each active pair exactly one rule applies, HVM2 retains the same _strong confluence_ that Lafont's Interaction Combinators do @Lafont_1997. This implies that not only can HVM2 programs be reduced completely in parallel, but also that the number of reductions is invariant to the order in which interaction rules are applied. This ensures that HVM2 can reduce redexes in any order without any risk of complexity blowups.

#pagebreak()

= Substitution Map & Atomic Linker <linker>

While HVM2 retains the strong confluence property of Lafont's IC's, _locality_ is more difficult to obtain. This is due to HVM2's variables. Variables link two different parts of the program, and thus can cause interference when two threads attempt to reduce two redexes in parallel. For example, consider a subset of a Net:

#v(1em)
#align(center)[
  ```
  & (a b) ~ (d c)
  & (c d) ~ (f e)
  ```
]
#v(1em)

Two threads attempting to reduce this Net can be represented as follows:

#v(1em)
#align(center)[
  ```
       Thread_0     Thread_1
  --a--|\____/|--c--|\____/|--e--
  --b--|/    \|--d--|/    \|--f--
  ```
]
#v(1em)

Notice that in the reduction of these two redexes, both `Thread_0` and `Thread_1` will need to access variables `c` and `d`. This requires synchronization.

In HVM2, there is a global collection of redexes that is mutated in parallel by a variety of threads. See Architecture (@architecture) for details. Since every variable occurs exactly twice, the #smallcaps[link] interaction with a variable `x` will also occur twice, but _possibly at very different times_. The first time is when this variable is first encountered, and somehow a substitution must be "deferred" until the #smallcaps[link] interaction rule is applied to the second occurrence of `x`.

This is accomplished by a global, atomic substitution map, which tracks these deferred substitutions. When a variable is linked to a node, or to another variable, it is inserted into the substitution map. When that same variable is linked again, it will already have an entry in the substitution map, and then the proper redex will be constructed.

The substitution map can be represented efficiently with a flat buffer, where the index is the variable name, and the value is the node that has been substituted. This can be done atomically, via a simple lock-free linker. In pseudocode, this roughly looks like:

#pagebreak()

#align(center)[
```python
# Attempts to link A and B.
def link(subst: Dict[str, Node], A: Node, B: Node):
    while True:
        # If A is not a VAR: swap A and B, and continue.
        if type(A) != VAR:
            swap(A, B)

        # If A is not a VAR: both are non-vars. Create a new redex.
        if type(A) != VAR:
            push_redex(A, B)

        # Here, A is a VAR. Create a `A: B` entry in the map.
        got: Port = subst.set_atomic(A, B)

        # If there was no `A` entry, stop.
        if got is None:
            break

        # Otherwise, delete `A` and link `got` to `B`.
        del subst[A]
        A = got
```
]
#v(1em)

To see how this algorithm works, let's consider, again, the scenario above:

#v(1em)
#align(center)[
  ```
       Thread_0     Thread_1
  --a--|\____/|--c--|\____/|--e--
  --b--|/    \|--d--|/    \|--f--
  ```
]
#v(1em)

Assume we start with a substitution `a` $arrow.l$ `#42`, and let both threads reduce a redex in parallel. Each thread  are an `ANNI` rule, their effect is to link both ports; thus, the resulting wire connected to `#42` (with the wires `a`, `d`, and `e` labelled for clarity) should be

#v(1em)
#align(center)[
  ```
  #42 ---a---.         .---e---
              '---d---'
  ```
]
#v(1em)

That is, `e` must be directly linked to `#42`. Let's now evaluate the algorithm in an arbitrary order, step-by-step. Recall that the initial Net is:

#v(1em)
#align(center)[
  ```
  & (a b) ~ (d c)
  & (c d) ~ (f e)
  ```
]
#v(1em)

And for simplicity we're observing only ports `a`, `d`, and `e`. `Thread_0` will attempt to perform `link(a, d)` and `Thread_1` will attempt `link(d, e)`. There are many possible orders of execution:

#pagebreak()

== Possible Execution Order 1

#pad(left: 2em)[
  ```
  - a: #42
  ======= Thread_2: link(d, e)
  - a: #42
  - d: e
  ======= Thread_1: link(a, d)
  - a: d
  - d: e
  ======= Thread_1: got `a: #42`, thus, delete `a` and link(d, #42)
  - d: #42
  ======= Thread_1: got `d: e`, thus, delete `d` and link(e, #42)
  - e: #42
  ```
]

The resulting substitution map is linking `e` to `42`, as required.

== Possible Execution Order 2

#pad(left: 2em)[
  ```
  - a: #42
  ======= Thread_1: link(d, a)
  - a: #42
  - d: a
  ======= Thread_2: link(d, e)
  - a: #42
  - d: e
  ======= Thread_2: got `d: a`, thus, delete `d` and link(a, e)
  - a: e
  ======= Thread_2: got `a: #42`, thus, delete `a` and link(e, #42)
  - e: 42
  ```
]

The resulting substitution map is linking, again, `e` to `42`, as required.

== Possible Execution Order 3

#pad(left: 2em)[
  ```
  - a: #42
  ======= Thread_1: link(d, a)
  - a: #42
  - d: a
  ======= Thread_2: link(e, d)
  - a: #42
  - d: a
  - e: d
  ```
]

In this case, the result isn't directly linking `e` to `#42`. But it does link `e` to `d`, which links to `a`, which links to `#42`. Thus, `e` is, indirectly, linked to `#42`. While it does temporarily use more memory in this case, it is, semantically, the same result. Additionally, the indirect links will be cleared as soon as `e` is linked to something else. It is easy enough to see that this holds for all possible evaluation orders.

#pagebreak()

= Numbers <numbers>
HVM2 has a built-in support for 32-bit numerics and operations represented by the `NUM` node type. `NUM` nodes in HVM2 have a 5-bit tag and a 24-bit payload. Depending on the tag, numbers can represent unsigned integers (U24), signed integers (I24), IEEE 754 binary32 floats (F24), or (possibly partially applied) operators. These choices mean any numeric node can be represented in 29 bits, which can be unboxed in a 32-bit port with a 3 bit tag for the node type.

== Syntax

Numeric nodes can either be a number or a (possibly partially applied) operator. Syntactically,

#align(center)[
  ```
  <Numeric> ::=
      | <Number>
      | <Operator>

  <Number> ::=
      | <Nat>
      | <Int>
      | <Float>

  <Operator> ::=
      | "[" <Operation> "]"           -- (unapplied)
      | "[" <Operation> <Number> "]"  -- (partially applied)

  <Operation> ::=
      | "+"   -- (ADD)
      | "-"   -- (SUB)
      | "*"   -- (MUL)
      | "/"   -- (DIV)
      | "%"   -- (REM)
      | "="   -- (EQ)
      | "!"   -- (NEQ)
      | "<"   -- (LT)
      | ">"   -- (GT)
      | "&"   -- (AND)
      | "|"   -- (OR)
      | "^"   -- (XOR)
      | ">>"  -- (SHR)
      | "<<"  -- (SHL)
      | ":-"  -- (FP_SUB)
      | ":/"  -- (FP_DIV)
      | ":%"  -- (FP_REM)
      | ":>>" -- (FP_SHR)
      | ":<<" -- (FP_SHL)
  ```
]

where `<Int>` is disambiguated from `<Nat>` by requiring a sign prefix `+`/`-`, and flipped versions of non-commutative operators (`FP_*`) are provided for convenience.

#pagebreak()

== Numeric Node Memory Layout

In order to understand how numeric operations are derived from the numeric nodes, we must look at the memory layout of the different numeric values and operations. As stated above, numeric nodes fit into 29 bits, so they can be inlined into a 32-bit ports. Numeric nodes have the following layout

#align(center)[
  ```
  VVVVVVVVVVVVVVVVVVVVVVVVTTTTT
  ```
]

Where the 5-bit tag `T` has 19 possible values: one of the three number types (`U24`, `I24`, `F24`), one of the 15 operations, or a special value `SYM`. The 24-bit value `V` has a few possible interpretations:
- If $#raw("T") in {#raw("U24"), #raw("I24"), #raw("F24")}$, then `V` is interpreted as a number with the type `T`. For example: `123`, `-123`, `1.0`.
- If $#raw("T") in #raw("<Operation>")$, then `V` is an untyped number. This is a partially applied operation. The type of the second argument (when applied) will dictate the interpretation of `V`. For example: `[/123]`, `[*-123]`, `[%1.0]`.
- If $#raw("T") = #raw("SYM")$, then $#raw("V") in #raw("<Operator>")$. This is an unapplied operator, like `[+]`.

== U24 - Unsigned 24-bit Integer

U24 numbers represent unsigned integers from 0 to 16,777,215 ($2^24 - 1$).

The 24-bit payload directly encodes the integer value. For example:

#align(center)[
  ```
  0000 0000 0000 0000 0000 0001 = 1
  0000 0000 0000 0000 0000 0010 = 2
  1111 1111 1111 1111 1111 1111 = 16,777,215
  ```
]

== I24 - Signed 24-bit Integer

I24 numbers represent signed integers from -8,388,608 to 8,388,607.

The 24-bit payload uses two's complement encoding. For example:

#align(center)[
  ```
  0000 0000 0000 0000 0000 0000 = 0
  0000 0000 0000 0000 0000 0001 = 1
  0111 1111 1111 1111 1111 1111 = 8,388,607
  1000 0000 0000 0000 0000 0000 = -8,388,608
  1111 1111 1111 1111 1111 1111 = -1
  ```
]

== F24 - 24-bit IEEE 754 binary32 Float

F24 numbers represent a subset of IEEE 754 binary32 floating point numbers; it supports approximately the same range, but with less precision. The 24-bit payload is laid out as follows:

#align(center)[
  ```
  SEEE EEEE EMMM MMMM MMMM MMMM
  ```
]

Where:
- S is the sign bit (1 = negative, 0 = positive)
- E is the 8-bit exponent, with a bias of 127 (range of −126 to +127)
- M is the 15-bit significand (mantissa) precision

The value is calculated as:
- If E = 0 and M = 0, the value is signed zero
- If E = 0 and M $eq.not$ 0, the value is a subnormal number: \
  $(-1)^S times 2^(-126) times (0.M "in base 2")$
- If 0 < E < 255, the value is a normal number: \
  $(-1)^S times 2^(E-127) times (1.M "in base 2")$
- If E = 255 and M = 0, the value is signed infinity
- If E = 255 and M $eq.not$ 0, the value is NaN (Not-a-Number)

F24 supports a range of approximately $plus.minus 3.4 times 10^38$. The smallest positive normal number is $2^(-126) approx 1.2 times 10^(-38)$, while the smallest subnormal numbers go down to $2^(-141) approx 3.6 times 10^(-43)$.

== Numeric Operations

When two `NUM` nodes are connected by an `OPE` node, as shown in Interaction Rules (@interactions), a numeric operation is performed using the `op` function. The operation to be performed depends on the tags of each numeric node.

Some operations `op(N, M)` are invalid, and simply return `0`:
- If both numeric tags are types.
- If both numeric tags are operations.
- If both numeric tags are `SYM`.

Otherwise:
- If one of the tags is `SYM`, the output has the tag represented by the `SYM` numeric node and the payload of the other operand. For example,

#align(center)[
  ```
  OP([+], 10) = [+10]
  OP(-1, [*]) = [*0xffffff]
  ```
]

- If one of the tags is an operation, and the other is a type, a native
  operation is performed, according to the following table:

#align(center)[
  ```
  |   | U24 | I24 | F24   |
  |---|-----|-----|-------|
  |ADD| +   | +   | +     |
  |SUB| -   | -   | -     |
  |MUL| *   | *   | *     |
  |DIV| /   | /   | /     |
  |REM| %   | %   | %     |
  |EQ | ==  | ==  | ==    |
  |NEQ| !=  | !=  | !=    |
  |LT | <   | <   | <     |
  |GT | >   | >   | >     |
  |AND| &   | &   | atan2 |
  |OR | |   | |   | log   |
  |XOR| ^   | ^   | pow   |
  |SHR| >>  |     |       |
  |SHL| <<  |     |       |
  ```
]
Where empty cells are intentionally left unspecified.

The resulting number type is the same as the input type, except for comparison operators (`EQ`, `NEQ`, `LT`, `GT`) which always return `U24` `0` or `1`.

#pagebreak()

The number tagged with the operation is the left operand of the native operation, and the number tagged with the type is the right operand. For example,

#align(center)[
  ```
  op([/1.2], 3.0) = op(3.0, [/1.2]) = 1.2 / 3.0
  ```
]

Note that this means that the number type used in an operation is always determined by the right operand; if the left operand is of a different type, its bits will be reinterpreted.

Finally, flipped operations (such as `FP_SUB`) interpret their operands in the opposite order (e.g. `SUB` represents `a - b` whereas `FP_SUB` represents `b - a`). This allows representing e.g. both `1 - x` and `x - 1` with partially-applied operations (`[-1]` and `[:-1]` respectively).

#align(center)[
```
OP([-2], +1) = +1
OP([:-2], 1) = -1
```
]

Note that `op` is a symmetric function (since the order of the operands is determined by their tags). That is, `op(N, M) = op(M, N)`. This makes the "swap" interaction in #smallcaps[operate 2] valid.

#pagebreak()

= The 32-bit Architecture <architecture>

The initial version of HVM2, as implemented in this paper, is based on a 32-bit architecture. In this section, we'll use snippets of the Rust reference implementation of the interpreter, available at the #link("https://github.com/HigherOrderCO/hvm2")[HVM2 repo], to document this architecture.

== Memory Layout <memory>

Ports are 32-bit values that represent a wire connected to a main port. The low 3-bits are reserved to identify the type of the node (`VAR`, `REF`, `ERA`, etc) whose main port the wire is connected to. This is a port's _tag_. The upper 29-bits hold a port's _value_. The interpretation of the value is dependent on the tag. The value is either an address (for binary `CON`, `DUP`, `OPR`, and `SWI` nodes), a virtual function address (for `REF` nodes), an unboxed 29-bit number (for `NUM` nodes), a variable name (for `VAR` nodes), or `0` (for `ERA` nodes). Binary nodes are represented in memory as a pair of two ports. Notice that _ports store the kind of node they are connecting to_, nodes don't store their own type.

#align(center)[
```rust
pub type Tag = u8;  // 3 bits  (rounded up to u8)
pub type Val = u32; // 29 bits (rounded up to u32)

pub struct Port(pub Val); //  Tag + Val  (32 bits)
pub struct Pair(pub u64); // Port + Port (64 bits)
```
]

The Global Net structure includes three shared memory buffers, `node`: a node buffer where nodes are allocated, `vars`: a buffer representing the current substitution map (with the 29-bit key being the variable name, and the value being the current substitution), and `rbag`: a collection of active redexes. `APair` and `APort` are atomic variants of `Pair` and `Port`, respectively,

#align(center)[
```rust
pub struct GNet<'a> {
  pub node: &'a mut [APair],
  pub vars: &'a mut [APort],
  pub rbag: &'a mut [APair],
}
```
]

Since this 32-bit architecture has 29-bit values, that means we can address a total of $2^29$ nodes and variables, making the `node` buffer at most `4 GB` long, and the `vars` buffer at most `2 GB` long. A 64-bit architecture would increase this limit to match the overwhelming majority of use cases, and will be incorporated in a future revision of the HVM2 runtime.

Since top-level definitions are just static nets, they are stored in a similar structure, with the key difference that they include an explicit `root` port, which, is used to connect the expanded definition its target port in the graph. We also include a `safe` flag, which indicates whether this definition has duplicator nodes or not. This affects the `DUP-REF` interaction, which will just copy the `REF` port, rather than expanding the definition, when it is `safe`.

#align(center)[
```rust
pub struct Def {
  pub safe: bool,      // has no dups
  pub root: Port,      // root port
  pub rbag: Vec<Pair>, // def redex bag
  pub node: Vec<Pair>, // def node buffer
}
```
]

Finally, the global Book is just a map of names to `Def`s:

#align(center)[
```rust
pub struct Book {
  pub defs: Vec<Def>,
}
```
]

Notice that, like variables, names are simply indexes into the global `Book`.

This concludes HVM2's memory layout. For more details, check the reference Rust implementation in the #link("https://github.com/HigherOrderCO/hvm2")[HVM2 repo].

== Example Interaction Net Layout

Consider, again, the following net:

#align(center)[
```
(a b)
& (b a) ~ (x (y *))
& {y x} ~ @foo
```
]

In HVM2's memory, it would be represented as:

#align(center)[
```
RBAG | FST-TREE | SND-TREE
---- | -------- | --------
0800 | CON 0001 | CON 0002 // '& (b a) ~ (x (y *))'
1800 | DUP 0005 | REF 0000 // '& {x y} ~ @foo'
---- | -------- | --------
NODE | PORT-1   | PORT-2
---- | -------- | --------
0001 | VAR 0000 | VAR 0001 // '(a b)' node (root)
0002 | VAR 0001 | VAR 0000 // '(b a)' node
0003 | VAR 0002 | CON 0004 // '(x (y *))' node
0004 | VAR 0003 | DUP 0000 // '(y *)' node
0005 | VAR 0003 | VAR 0002 // '{y x}' node
---- | -------- | --------
VARS | VALUE    |
---- | -------- |
FFFF | CON 0001 | // points to root node
```
]

Note that the `VARS` buffers has only one entry, because there are no substitutions, but we always use the last variable to represent the root port, serving as an entry point to the graph.

== Example Interaction

Interactions can be implemented in five steps:
1. Allocate the needed resources.

2. Loads nodes from global memory to registers.

3. Initialize fresh variables on the substitution map.

4. Stores fresh nodes on the node buffer.

5. Atomically links outgoing wires.

For example, the #smallcaps[commute] interaction is implemented in Rust as:

#align(center)[
```rust
pub fn interact_comm(&mut self, net: &GNet, a: Port, b: Port) -> bool {
  // Allocates needed resources.
  if !self.get_resources(net, 4, 4, 4) {
    return false;
  }

  // Loads nodes from global memory.
  let a_ = net.node_take(a.get_val() as usize);
  let a1 = a_.get_fst();
  let a2 = a_.get_snd();
  let b_ = net.node_take(b.get_val() as usize);
  let b1 = b_.get_fst();
  let b2 = b_.get_snd();

  // Stores new vars.
  net.vars_create(self.v0, NONE);
  net.vars_create(self.v1, NONE);
  net.vars_create(self.v2, NONE);
  net.vars_create(self.v3, NONE);

  // Stores new nodes.
  net.node_create(self.n0, pair(port(VAR, self.v0), port(VAR, self.v1)));
  net.node_create(self.n1, pair(port(VAR, self.v2), port(VAR, self.v3)));
  net.node_create(self.n2, pair(port(VAR, self.v0), port(VAR, self.v2)));
  net.node_create(self.n3, pair(port(VAR, self.v1), port(VAR, self.v3)));

  // Links.
  self.link_pair(net, pair(port(b.get_tag(), self.n0), a1));
  self.link_pair(net, pair(port(b.get_tag(), self.n1), a2));
  self.link_pair(net, pair(port(a.get_tag(), self.n2), b1));
  self.link_pair(net, pair(port(a.get_tag(), self.n3), b2));

  return true;
}
```
]
#v(1em)

Note that, other than the linking, all operations here are local. Taking nodes
from global memory is safe, because the thread that holds a redex implicitly
owns both trees it contains, and storing vars and nodes is safe, because these
spaces have been allocated by the thread. A fast concurrent allocator for small
values is assumed. In HVM2, we just use a simple linear bump allocator, which is
fast and fragmentation-free in a context where all allocations are at most
64-bit values (the size of a single node).

= Massively Parallel Evaluation

Provided the architecture we just constructed, evaluating an HVM2 program in
parallel is surprisingly easy: *just compute global redexes concurrently, until
there is no more work to do*.

HVM2's local interactions exposes the original program's full degree of
parallelism, ensuring that every work that *can* be done in parallel *will* be
done in parallel. In other words, it maximizes the theoretical speedup, per
Amdahl's law. The atomic linking procedure ensures that points of
synchronization that emerge from the original program are solved safely and
efficiently, without no room for race conditions. Finally, the strong confluence
property ensures that the total work done is independent of the order that
redexes are computed, giving us freedom to evaluate in parallel without
generating extra work.

== Redex Sharing

An additional question, is, how do we actually distribute that workload through
all cores of a modern processor? The act of sharing a redex is, itself, a point
of synchronization. If this is done without enough caution, it can result in
contention, and slowing up execution. HVM2 solves this by two different
approaches:

On _CPUs_, a simple task-stealing queue is used, where each thread pushes and
pops from its own local redex bag, while a starving neighbor thread actively
attempt to steal a redex from it. Since a redex is just a 64-bit value, stealing
can be done with a single atomic_exchange operation, making it very lightweight.
To reduce contention, and to force threads to steal "old redexes", which are
more likely to produce long independent workloads, this stealing is done from
the other end of the bag. In our experiences, this works extremely well in
practice, achieving full CPU occupancy in all cases tested, with minimal
overhead, and low impact on non-parallelizable programs.

On _GPUs_, this matter is more complex in many ways. First, there are two scales
on which we want sharing to occur:

1. Within a running block, where stealing between local threads can be
   accomplished by fast shared-memory operations and warp-sync primitives.

2. Across global blocks, where sharing requires either a global synchronization
   (i.e., calling the kernel again) or direct communication via global memory.

Unfortunately, the cost of global synchronization (i.e., across blocks) is very
high, so, having a globally shared redex bag, as in the C version, and accessing
it within the context of a kernel, would greatly impact performance. To improve
this, we, initially, attempted to implement a fast block-wise scheduler, which
simply lets local threads pass redexes to starving ones with warp syncs:

```c
__device__ void share_redexes(TM* tm) {
  __shared__ Pair pool[TPB];
  Pair send, recv;
  u32*  ini = &tm->rbag.lo_ini;
  u32*  end = &tm->rbag.lo_end;
  Pair* bag = tm->rbag.lo_buf;
  for (u32 off = 1; off < 32; off *= 2) {
    send = (*end - *ini) > 1 ? bag[*ini%RLEN] : 0;
    recv = __shfl_xor_sync(__activemask(), send, off);
    if (!send &&  recv) bag[((*end)++)%RLEN] = recv;
    if ( send && !recv) ++(*ini);
  }
  for (u32 off = 32; off < TPB; off *= 2) {
    u32 a = TID();
    u32 b = a ^ off;
    send = (*end - *ini) > 1 ? bag[*ini%RLEN] : 0;
    pool[a] = send;
    __syncthreads();
    recv = pool[b];
    if (!send &&  recv) bag[((*end)++)%RLEN] = recv;
    if ( send && !recv) ++(*ini);
  }
}
```

Such procedure is efficient enough to be called between every few interactions,
allowing redexes to quickly fill the whole block. With that, all we had to do is
let the kernel perform a constant number of local interactions (usually in the
range of $2^9$ to $2^13$), and, once it completes, i.e., across kernel invocations,
the global redex bag was transposed (rows become columns), letting the entire
GPU to fill naturally by just the block-wise sharing function above, and nothing
else. This approach worked very well in practice, and let us achieve a peak of
74,000 MIPS in a shader-like functional program (i.e., a "tree-map" operation).

Unfortunately, it didn't work so well in cases where the implied communication
was more involved. For example, consider the following implementation of a
purely functional Bitonic Sort:

```javascript
data Tree = (Leaf val) | (Node fst snd)

// Swaps distant values in parallel; corresponds to a Red Box
(warp s (Leaf a)   (Leaf b))   = (U60.swap (^ (> a b) s) (Leaf a) (Leaf b))
(warp s (Node a b) (Node c d)) = (join (warp s a c) (warp s b d))

// Rebuilds the warped tree in the original order
(join (Node a b) (Node c d)) = (Node (Node a c) (Node b d))

// Recursively warps each sub-tree; corresponds to a Blue/Green Box
(flow s (Leaf a))   = (Leaf a)
(flow s (Node a b)) = (down s (warp s a b))

// Propagates Flow downwards
(down s (Leaf a))   = (Leaf a)
(down s (Node a b)) = (Node (flow s a) (flow s b))

// Bitonic Sort
(sort s (Leaf a))   = (Leaf a)
(sort s (Node a b)) = (flow s (Node (sort 0 a) (sort 1 b)))
```

Since this is an $O(n*log(n))$ algorithm, its recursive structure unfolds in
such a manner that is much less regular than a tree-map. As such, the naive task
sharing approach had a consequence that greatly impacted performance on GPUs:
threads would give and receive "misaligned" redexes, causing warp-local threads
to compute different calls at any given point. For example, a given warp thread
might be processing `(flow 5 (Node _ _))`, while another might be processing
`(down 0 (Leaf _))` instead. This divergence has the consequence of producing
sequentialism in the GPU architecture, where warp-local threads are in lockstep.

To improve this, a different task-sharing mechanism has been implemented, which
requires a minimal annotation: redexes corresponding to branching recursive
calls are flagged with a `!` on the global Book. With this annotation, the GPU
evaluator will then only share redexes from functions that recurse in a
parallelizable fashion. This is extremely effective, as it allows threads to
always get "equivalent" redexes in a regular recursive algorithm. For example,
if given thread is processing `(flow 5 (Node _ _))`, it is very likely that
another warp local thread is too. This minimizes warp divergence, and has a
profound impact in performance across many cases. On the `bitonic_sort` example,
this new policy alone resulted in a jump from 1,300 MIPS to 12,000 MIPS (9x).

== Optimization: Shared Memory Interactions

GPUs also have another particularity that, if exploited properly, can result in
significant speedups: shared memory. The NVIDIA RTX 4090, for example, includes
A L1 Cache memory space of about 128KB, and GPU languages like CUDA usually
allow a programmer to manually read and write from that cache, in a shared
memory buffer that is accessible by all threads of a block. Reading and writing
from shared memory can be up to 2 orders of magnitude faster than doing so from
global memory, so, using that space properly is essential to fully harness a
GPU's computing power.

On HVM32, we use that space to store a local node buffer, and a local subst map,
with 8192 nodes and variables. This occupies exactly 96KB, just enough to fit
most modern processors. When a thread allocates a fresh node or variable, that
allocation occurs in the shared memory, rather than the global memory. With a
configuration of 128 threads per block, each thread has a "scratchpad" of 64
nodes and vars to work locally, with no global memory access. This is often
enough to compute long-running tail-loops, which is what makes HVM2 so efficient
on shader-like programs.

There is one problem, though: what happens when an interaction links a locally
allocated node to a global variable? This would cause a pointer to a local node
to "leak" to another block, which would then be unable to retrieve its
information, causing a runtime error. To handle this situation, we extend the
LINK interaction with a "LEAK" sub-interaction, which is specific to GPUs only.
That interaction essentially allocates a "global view" of the local node, filled
with two placeholder variables, such that one copy is local, and the other copy
is global (remember: variables are always paired). That way, we can continue the
local reduction without interruptions. If another block does get this "leaked"
node, it will be filled with two variables, which, in this case, act as "future"
values which will be resolved when the local thread links it.

```
^A ~ (b1 b2)
------------- LEAK
^X ~ b1
^Y ~ b2
^A ~ ^(^X ^Y)
```

The LEAK interaction allows us to safely work locally for as long as desired,
which has great impact on performance. On the stress test benchmark, the
throughput jumps from about 13,000 MIPS to 54,000 MIPS by this change alone.

= Garbage Collection

Since HVM2 is based on Interaction Combinators, which are fully linear, there is
no global "garbage collection" pass required. By IC evaluation semantics alone,
data is granularly allocated as needed, and freed as soon as they become
unreachable. This is specifically accomplished by the ERAS and VOID
interactions, which consume sub-nets that go out of scope in parallel, clearing
them from memory. When HVM2 completes evaluation (i.e., after all redexes have
been processed), the memory should be left with just the final result of the
program, and no remaining garbage or byproduct. No further work is needed.

= IO

TODO: explain how we do IO

= Benchmarks

TODO: include some benchmarks

= Translations

TODO: include example translations from high-level languages to HVM2

= Limitations

The HVM2 architecture, as currently presented, is capable of evaluating modern,
high-level programs in massively parallel hardware with near-ideal speedup,
which is a remarkable feat. That said, it has severe impactful limitations that
must be understood. Below is a list of many of these limitations, and how they
can be addressed in the future.

== Only one Duplicator Node

Since HVM2 is affine, duplicator nodes are often used to copy non-linear
variables. For example, when translating the $lambda$-term `λx.x+x`, which is not
linear (because `x` occurs twice), one might use a `DUP` node to clone `x`.
Due to Interaction Net semantics, though, `DUP` nodes don't always match the
behavior of cloning a variable on traditional languages. This, if not handled
properly, can lead to *unsound reductions*. For example, the $lambda$-term:

```
C4 = (λf.λx.(f (f x)) λf.λx.(f (f x)))
```

Which computes the Church-encoded exponentiation `2^2`, can not be soundly reduced by HVM2. To handle this, a source language must use either a type system or similar mechanism to verify that the following invariant holds:
#v(-1em)
#quote(block: true)[
  _A higher-order lambda that clones its variable can not be cloned._
]

While restrictive, it is important to stress that this limitation only applies
to cloning higher-order functions. A language that targets the HVM can still
clone data types with no restrictions, and it is still able to perform loops,
recursion and pattern-matches with no limitations. In other words, HVM is Turing
Complete, and can evaluate procedural languages with no restrictions, and
functional languages with some restrictions. That said, this can be amended by:

1. Adding more duplicator nodes. This would allow "nested copies" of
   higher-order functions. With a proper type system (such as EAL inference),
   this can greatly reduce the set of practical programs that are affected by
   this restriction.

2. Adding bookkeeping nodes. These nodes, originally proposed by Lamping (1990),
   allow interaction systems to evaluate the full λ-calculus with no
   restrictions. Adding bookkeeping to HVM should be easy. Sadly, this has the
   consequence of bringing a constant-time overhead, decreasing performance by
   about 10x. Because of that, it wasn't included in this model.

Ideally, a combination of both approaches should be used: a type-checker that
flags safe programs, which can be evaluated safely on HVM2, and a fallback
bookkeeper, which ensures sound reductions of programs that do not. Implementing
such system is outside the scope of this work, and should be done as a future
extension. [cite: the optimizing optimal evaluation paper]

== Ultra-Eager Evaluation Only

In our first implementation, HVM1, we used a lazy evaluation model. This not
only ensured that no unnecessary work was done, but also allowed one to compute
with infinite structures, like lists. Since the implementation presented here
reduces *all* available redexes eagerly, that means neither of these hold. For
example, if you allocate a big structure, but only read one branch, HVM2 will
allocate the entire structure, while HVM1 wouldn't. And if you do have an
infinite structure, HVM2 will never halt (because the redex bag will never be
empty). This applies even to code that doesn't look like it is an infinite
structure. For example, consider the JavaScript function below:

    ```
    foo = x => x == 0 ? 0 : 1 + foo(x-1);
    ```

In JavaScript, this is a perfectly valid function. In HVM2, if called as-is,
this would hang, because `foo(x-1)` would unroll infinitely, as we do not
"detect" that it is in a branch. To make recursive functions computable, the
usual approach is to split it into multiple definitions, as in:

    ```
    foo   = x => x == 0 ? fooZ : foo_S(x - 1);
    foo_Z = 0;
    foo_S = x => 1 + foo(x-1);
    ```

Since REFs unfold lazily, the program above will properly erase the `foo_S`
branch when it reaches the base case, avoiding the infinite recursion.

Extending HVM2 with a full lazy mode would requires us to store uplinks,
allowing threads to navigate through the graph and only reduce redexes that are
reachable from the root port. While not technically hard to do, doing so would
make the task scheduler way more complex to implement efficiently, specially in
the GPU version. We reserve this for a future extension.

== Single Core Inefficiency

While HVM2 achieves near-linear speedup, allowing it to make programs run
arbitrarily faster by just using more cores (as long as there is sufficient
degree of parallelism), its compiler is still extremely immature, and not nearly
as fast as state-of-art alternatives like GCC of GHC. In single-thread CPU
evaluation, HVM2, is, baseline, still about 5x slower than GHC, and this number
can grow to 100x on programs that involve loops and mutable arrays, since HVM2
doesn't feature these yet.

For example, a single-core C program that adds numbers from 0 to a few billions
will easily outperform an HVM2 one that uses thousands of threads, given the C
version is doing no allocation, while C is allocating a tree-like recursive
stack. That said, not every program can be implemented as an allocation-free,
register-mutating loop. For real programs that allocate tons of short memory
objects, HVM2 is expected to perform extremely well.

Moreover, and unlike some might think, HVM2 is not incompatible with loops or
mutable types, because it isn't a functional runtime, but one based on
interaction combinators, which are fully linear. Extending HVM2 with arrays is
as easy as creating nodes for it, and implementing the interactions, and can be
done in a timely fashion as a fork of this repository. Similarly, loops can be
implemented by optimizing tail-calls. We plan to add such optimization soon.

Finally, there are many other low-hanging fruits that could improve HVM2's
performance considerably. For example, currently, we do not have native
constructors, which means that algebraic datatypes have to be λ-encoded, which
brings a 2x-5x memory overhead. Adding proper constructors and eliminating this
overhead would likely bring a proportional speedup. Similarly, adding more
numeric types like vectors would allow using more of the available GPU
instructions, and adding read-only types like immutable strings and textures
with 1-interaction reads would allow one to implement many algorithms that,
currently, wouldn't be practical, specially for graphics rendering.

== 32-bit Architecture Limitations

Since this architecture is 32-bit, and since 3 bits are reserved for a tag, that
leaves us with a 29-bit addressable space. That amounts for a total of about 500
million nodes, or about 4 GB. Modern GPUs come with as much as 256 GB integrated
memory, so, HVM2 isn't able to fully use the available space, due to addressing
constraints. Moreover, its 29-bit unboxed numbers only allow for 24-bit machine
ints and floats, which may not be enough for many applications.

All these problems should be solved by extending ports to 64-bit and nodes to
128-bits, but this requires some additional considerations, since modern GPUs
don't come with 128-bit atomic operations. We'll do this in a future extension.

== More

*Work In Progress*

= Conclusion

By starting from a solid, inherently concurrent model of computation,
Interaction Combinators, carefully designing an efficient memory format,
implementing lock-free concurrent interactions via lightweight atomic
primitives, and granularly distributing workload across all cores, we were able
to design a parallel compiler and evaluator for high-level programming languages
that achieves near-linear speedup as a function of core count. While the
resulting system still has many limitations, we proposed sensible plans to
address them in the future.

This work creates a solid foundation for parallel programming languages that are
able to harness the massively parallel capabilities of modern hardware, without
demanding explicit low-level management of threads, locks, mutexes and other
complex synchronization primitives by the programmer.