calyx 0.7.1

Compiler Infrastructure for Hardware Accelerator Generation
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
# Building a Frontend for Calyx

> This tutorial assumes that you have already worked through the [Calyx tutorial][calyx-tut]. You won't get very much out of it by itself!

In the [Calyx tutorial][calyx-tut] you wrote Calyx code by hand.
This is (probably) a good way to build character, but it's no way to live.
Indeed, Calyx was _designed_ to be a compiler IL, and not a human-facing language.

In this tutorial, we're going to learn all about this by building a DSL-to-hardware compiler for a toy language that we wish to accelerate.
We will compile the DSL to Calyx, and then let Calyx take us to hardware.

# MrXL Overview

Meet MrXL, our toy DSL.
MrXL lets you define arrays and registers and then perform `map` and `reduce` operations.


## Example: sum of squares

Here's a MrXL program that squares and then sums the values of an input array:
```
{{#include ../../frontends/mrxl/test/sos.mrxl}}
```

This short program shows off all of MrXL's features, so let's pick it apart line by line:
1. We specify an array, `avec`, which will have four integers. The `input` keyword means that an external harness will populate the array.
2. We specify `sos`, a register. The `output` keyword means that we will populate `sos` in our program.
3. The `map` operation gets the values of `avec` and raises each to the second power. We stash the result in a new array, `squares`. The number `1` denotes a *parallelism factor* of 1, meaning that the operation is performed sequentially. We will improve this shortly.
4. The `reduce` operation walks over `squares` and accumulates the result into a register. The parallelism factor is again `1`.


## Running our example

### Installing MrXL

> If you are going through this tutorial in the [Docker container][docker], you can skip these installation steps and jump to [*Running MrXL*][running-mrxl-example] just below.

First, install the [`builder` library][calyx-py-lib] by typing the following command from the repository root:
```
cd calyx-py && flit install -s && cd -
```
Next, install the `mrxl` binary:
```
cd frontends/mrxl && flit install -s && cd -
```
Register the `mrxl` binary with `fud`:
```
fud register mrxl -p frontends/mrxl/fud/mrxl.py
```
Now, running `fud check` should report that the `mrxl` binary is correctly installed.

### Running MrXL

Run:
```
mrxl frontends/mrxl/test/sos.mrxl --data frontends/mrxl/test/sos.mrxl.data --interpret
```

Why `42`? Because we populated `avec` with:
```json
{{#include ../../frontends/mrxl/test/sos.mrxl.data}}
```
and \\( 0^2 + 1^2 + 4^2 + 5^2 = 42 \\).


## Compiling our example into Calyx

Above, we merely _interpreted_ MrXL code in software using a simple, pre-written interpreter implemented in Python.
Our goal in this tutorial is to build a compiler from MrXL to hardware by translating it to the Calyx IL.
The Calyx IL code generated by compiling this program looks more like:

<details>
<summary>Click to expand 103 lines.</summary>

```C
{{#include ./sos.calyx}}
```
</details>

Generate it for yourself! Run:
```
mrxl frontends/mrxl/test/sos.mrxl
```

## Simulating our example with Verilog

Finally, let us go the whole hog: we compile our MrXL program to Calyx, compile it to Verilog, then simulate it using [Verilator][].

Run:
```
fud e --from mrxl frontends/mrxl/test/sos.mrxl --to dat --through verilog -s mrxl.data frontends/mrxl/test/sos.mrxl.data
```

The above command takes a MrXL program, `sos.mrxl`, and generates results with Verilator using the MrXL data file `sos.mrxl.data`.

# Compiling MrXL into Calyx

Calyx is an infrastructure for designing *domain-specific languages* (DSL) which can generate efficient hardware.
The rest of the tutorial will show you how to implement such a DSL by studying [the MrXL-to-Calyx compiler][impl], written in Python.

We have placed a few simplifying restrictions on MrXL programs:
1. Every array in a MrXL program has the same length.
2. Every integer in our generated hardware is 32 bits long.
3. The bodies of `map` and `reduce` operations must be binary `+` or `*` operations involving array elements or integers.
4. If repeated `map`/`reduce` operations are performed on the same array, each of those operations must have the same parallelism factor.
5. All `reduce` operations must be performed sequentially, i.e., with parallelism factor `1`.

These restrictions can be lifted or relaxed via commensurate changes to the compiler.

The compilation process breaks into two steps:
1. Parsing MrXL into a representation we can process in Python.
1. Generating Calyx code.


## Parsing MrXL into an AST

To start, we'll parse the MrXL program into a Python AST representation.
We choose to represent [AST][astcode] nodes with [Python dataclasses][].
A program is a sequence of array/register declarations followed by computation statements:
```python
{{#include ../../frontends/mrxl/mrxl/ast.py:prog}}
```

`Decl` nodes correspond to array declarations such as `input avec: int[4]`.
They carry information about whether the array is an `input` or `output`, the array's name, and the type of the array's elements:
```python
{{#include ../../frontends/mrxl/mrxl/ast.py:decl}}
```

`Stmt` nodes represent statements such as `sos := reduce 1 (acc, i <- squares) 0 { acc + i }`.
They contain further nested nodes representing the function-header and -body, and the type of operation:
```python
{{#include ../../frontends/mrxl/mrxl/ast.py:stmt}}
```

We elide further details, but point you to the [AST][mrxl-ast], which defines all the nodes we need to represent a MrXL program.


## Generating Calyx code

[As you know][calyx-tut], the skeleton of a Calyx program has three sections:
```
component main() -> {
  cells {}
  wires {}
  control {}
}
```

The [cells section][lf-cells] instantiates hardware units like adders, memories and registers.
The [wires section][lf-wires] contains [groups][lf-groups] that connect
hardware instances to perform some logical task (e.g, incrementing a register).
Finally, the [control section][lf-control] *schedules* the execution of groups using control operators such as `seq`, `par`, and `while`.

We perform syntax-directed compilation by walking over the nodes of the AST and generating `cells`, `wires`, and `control` operations.


### An Embedded DSL that Generates Calyx

To make it easy to generate hardware, we'll use Calyx's [`builder` module][calyx-py-lib] written in Python:
```python
import calyx.builder as cb

prog = cb.Builder() # A Calyx program
main = prog.component("main") # Create a component named "main"
```


### `Decl` nodes

`Decl` nodes instantiate new memories and registers.
We need these to be instantiated in the `cells` section of our Calyx output.
We use Calyx's [`std_reg`][reg-calyx] and [`comb_mem_d1`][mem-calyx] primitives to represent [registers][reg-verilog] and [memories][mem-verilog]:
```C
import "primitives/core.futil"; // Import standard library

component main() -> () {
  cells {
    // A memory with 4 32-bit elements. Indexed using a 6-bit value.
    avec = comb_mem_d1(32, 4, 6);
    // A register that contains a 32-bit value
    sos = std_reg(32);
  }
  ...
}
```

For each `Decl` node, we need to determine if we're instantiating a memory or a register, translate the node into a corresponding Calyx declaration, and place the declaration inside the `cells` section of our generated program.

If a memory is used in a parallel `map` or `reduce`, we might need to create different physical banks for it.
We explain why [below][banking-need].
We [define a function][compute-par] to walk over the AST and compute the parallelism factor for each memory:
```python
{{#include ../../frontends/mrxl/mrxl/gen_calyx.py:compute_par_factors}}
```

Using this information, we can instantiate registers and memories for our inputs and outputs:
```python
{{#include ../../frontends/mrxl/mrxl/gen_calyx.py:collect-decls}}
```
The `main.mem_d1` call is a function defined by the Calyx builder module to instantiate memories for a component.
By setting `is_external=True`, we're indicating that a memory declaration is a part of the component's input-output interface.


## Compiling `map` operations

For every `map` or `reduce` node, we need to generate Calyx code that iterates over an array, performs some kind of computation, and then stores the result of that computation.
For `map` operations, we'll perform a computation on every element of an input array, and then store the answers in a result array.
We can use Calyx's [while loops][lf-while] to do this.
At a high level, we want to generate the following pieces of hardware:
1. A register to store the current value of the loop index.
2. A comparator to check of the loop index is less than the array size.
3. An adder to increment the value of the index.
4. Whatever hardware is needed to implement the loop body computation.

We have implemented exactly this, and you have been using it thus far with the `fud` invocations that we have provided you.

However, it is time to get your hands dirty.
We provide a [stub implementation][mymap-stub] of `map` in `gen_calyx.py`:

```python
{{#include ../../frontends/mrxl/mrxl/gen_calyx.py:my_map_impl}}
```
You are invited to try implementing map yourself according to the outline given in the description by filling in the body of this function.

To run `mrxl` with `my_map_impl` instead of our implementation, pass the `--my-map` flag:
```sh
fud e --from mrxl test/sos.mrxl \
        --to dat --through verilog \
        -s mrxl.flags "--my-map "  \
        -s mrxl.data test/sos.mrxl.data
```

If you are feeling good about your implementation, skip to [the next section](#adding-parallelization)!
If you'd like to read through the details of our implementation – or build yours in tandem – continue on with the rest of this section.

### Loop condition

We define a [combinational group][lf-comb-group] to perform the comparison `idx < arr_size` that uses an `lt` cell:
```python
{{#include ../../frontends/mrxl/mrxl/gen_calyx.py:cond_group}}
```


### Index increment

The loop index increment is implemented using a [group][lf-groups] and an `adder`:
```python
{{#include ../../frontends/mrxl/mrxl/gen_calyx.py:incr_group}}
```

We provide the index's previous value and the constant `1` to `adder`, and write the adder's output into the register (`idx`).
Because we're performing a stateful update of the register, we must wait for the register to state that it has committed the write.
We do this by setting the group's `done` condition to track the register's `done` signal.


### Body computation

The final piece of the puzzle is the body's computation.
The corresponding group indexes into the input memories:
```python
{{#include ../../frontends/mrxl/mrxl/map.py:map_inputs}}
```
Because the builder module is an embedded DSL, we can simply use Python's `for` loop to generate all the required assignments for indexing.

This code instantiates an adder or a multiplier depending on the computation needed using the `expr_to_port` helper function:
```python
{{#include ../../frontends/mrxl/mrxl/map.py:map_op}}
```

and writes the value from the operation into the output memory:
```py
{{#include ../../frontends/mrxl/mrxl/map.py:map_write}}
```
This final operation is complicated because we must account for whether we're using an adder or a multiplier.
Adders are *combinational*–they produce their output immediately–while multipliers are *sequential* and require multiple cycles to produce its output.

When using a mutliplier, we need to explicitly set its `go` signal to one and only write the output from the multiplier into the memory when its `done` signal is asserted.
We do this by assigning the memory's `write_en` (write enable) signal to the multiplier's done signal.
Finally, the group's computation is done when the memory write is committed.


### Generating control

Once we have generated the hardware needed for our computation, we can schedule its computation using [control operators][lf-control]:
```py
{{#include ../../frontends/mrxl/mrxl/map.py:map_loop}}
```

We generate a while loop that checks that the index is less than the array size.
Then, it sequentially executes the computation for the body and increments the loop index.

> Take a breather! You now understand the basics of compiling DSLs to Calyx.
> However, we're not done yet; we build hardware because we want things to be fast and efficient, so the next part will teach you how to parallelize MrXL programs.

## Adding parallelization

MrXL allows us to parallelize our `map` operations.

Consider a variation on our prior example:
```
{{#include ../../frontends/mrxl/test/squares.mrxl}}
```

The noteworthy change is the parallelism factor of the `map` operation.
The parallelism factor `2` specifies that two copies of the loop bodies should be executed in parallel.

Translating this into a hardware implementation has a couple of associated challenges:
1. Our memories (`comb_mem_d1`) are *extremely* primitive and do not support parallel accesses; a program may only read or write to one memory location every cycle. In order to support parallel accesses, we need to create *multiple physical banks* that represent one logical memory and contain distinct elements.
2. We need to generate *multiple copies* of the hardware we generated above because, again, adders and multipliers are physical resources and can only support one computation at a time.

To produce the full Calyx program for the above example, run the following from the root of the Calyx repository:
```
fud e --from mrxl --to calyx frontends/mrxl/test/squares.mrxl
```


### Memory banking

<!-- Let us look back at our example featuring a parallel `map`:
```
{{#include ../../frontends/mrxl/test/squares.mrxl}}
```

Interpret this with:
```
mrxl frontends/mrxl/test/squares.mrxl --data frontends/mrxl/test/squares.mrxl.data --interpret
```

But more interestingly, compile this to Calyx IL with:
```
mrxl frontends/mrxl/test/squares.mrxl
``` -->

There's a lot going on in the Calyx code, but the thing to focus on is this.
To take advantage of the parallelism in the program, the MrXL compiler assumes that the input memory `avec` is split into two *banks*, `avec_b0` and `avec_b1`.
Look for these in the Calyx code.

Why do we do this? Memories in Calyx can only support one read/write per cycle of the clock, so if we keep `avec` around as one memory, our attempts at parallelization will be thwarted simply because we will be bottlenecked when accessing our data.
Splitting `avec` into two banks allows us to read/write into two logical spots of `avec` in parallel.

Banking comes with an additional responsibility.
When driving data to the memories, we can't just supply values for a memory `avec`.
The memory `avec` exists logically in our minds, but Calyx now only knows about `avec_b0` and `avec_b1`.
We must drive data to *these*.
Though nontrivial, this data-banking can also be [handled automatically](#aside-supplying-data-to-mrxl-programs); all the necessary information is in the MrXL source program.


### Parallel Control

Next, our compiler duplicates the computational resources, hooks them up to the right memories using groups, and generates a control program that looks like this:

```
par {
  while le_b0.out with cond_b0 { seq { eval_body_b0; incr_idx_b0; } }
  while le_b1.out with cond_b1 { seq { eval_body_b1; incr_idx_b1; } }
}
```

The [`par` operator][lf-par] executes all the loops in parallel which use distinct computational resources.
As specified by the language specification, [conflicting resource usage is undefined behavior][par-undef].

You can use `fud` to compile the MrXL program and run it with some data:
```
fud e --from mrxl --to dat \
      --through verilog \
      -s mrxl.data frontends/mrxl/test/squares.mrxl.data \
      frontends/mrxl/test/squares.mrxl
```

> The [complete implementation][impl] shows the necessary code to create physical memory banks and create an outer loop to generate distinct hardware for each copy of the loop.


# Further Steps

Congratulations, you know about as much about MrXL as we do!
The small size of the language makes it a nice sandbox for you to play in.
We mentioned that the restrictions placed on the language can be lifted by beefing up the compiler, and here's your chance to give it a whirl!

Here are some of those restrictions again, along with pointers about how to lift them.

1.
    > The bodies of `map` and `reduce` operations must be binary `+` or `*` operations involving array elements or integers.

    Say you wanted to add subtraction and division to the mix.
    We have set you up for success: the MrXL parser already parses `-` and `/` into `sub` and `div` respectively.
    Now, in `gen_calyx.py`, you need to check for "sub" and "div" as possible binary operations, and then invoke the appropriate cell-builders of the `builder` library.
    For reference, see how the `+` and `*` operations are handled at present.
    For "fun", take a look at how Calyx implements [multiplication][binary-mult], and how that maps to the existing invocation to create a 32-bit multiplication cell using the `builder`!

2.
    > All `reduce` operations must be performed sequentially, i.e., with parallelism factor `1`.

    This is a big gap!
    One way to perform reductions in parallel is using _reduction trees_.
    To get you started, we provide a toy implementation using the `builder` [here][builder-red-tree].
    That example is rather brittle: it takes exactly 16 inputs, banked into four arrays, and adds their values together.
    Try incorporating this brittle version into your MrXL-to-Calyx compiler at first, and you can later think about generalizing it to any (commutative) operation, memories of any length, and any parallelism factor.

    If you'd like to read more about reduction trees, you can check out [these slides][reduc-trees] from David Kirk and Wen-mei W. Hwu.

3.
    > If repeated `map`/`reduce` operations are performed on the same array, each of those operations must have the same parallelism factor.

    The heart of the issue is figuring out how to bank the underlying array.
    How do we bank it two different ways at the same time?
    The answer is to bank the array a little finer than you think, and to then use _arbitration logic_ to provide two fictional banking setups at the same time.
    We provide a toy implementation using the `builder` [here][builder-arb].
    There, a 24-cell array has been split into _six_ banks, but then we allow the user to pretend, simultaneously, that it is split into _two_ banks or _three_ banks.


# Aside: supplying data to MrXL programs

You may have noticed that the data files that we pass to MrXL programs are lighter-weight than those we pass to Calyx programs.
They are lighter in two ways.


## Boilerplate

Calyx requires cells to be allocated for the output cells.
Instead of asking the user to supply zeroed-out arrays and registers for output cells, we can infer the need for these output cells from the source code.
We can then add these on to the MrXL-native data files to make them amenable to Calyx.

Additionally, because MrXL supports only a few kinds of data, some interesting parameters of Calyx-native data files turn into "just boilerplate" in MrXL-native data.
We can keep MrXL-native data files relatively light and add this `format` information automatically.

## Example: `squares.mrxl.data`

Let us look at these changes in practice.

We write `avec`'s values as a straightforward array:
```json
{{#include ../../frontends/mrxl/test/squares.mrxl.data}}
```

but under the hood, the *Calyx IL version* of `squares.mrxl` comes to expect something of the form:
```json
{{#include ./squares.mrxl.banked.data}}
```

And we can generate this automatically. Try it yourself:
```
mrxl frontends/mrxl/test/squares.mrxl --data frontends/mrxl/test/squares.mrxl.data --convert
```

This transformation is achieved using a [`fud`][fud] pass that converts MrXL-native data files into Calyx-native data files.

[astcode]: https://github.com/calyxir/calyx/blob/mrxl/mrxl/mrxl/ast.py
[mrxldocs-install]: https://docs.calyxir.org/frontends/mrxl.html#install
[fud]: ../fud/index.md
[fud-data]: ../lang/data-format.md
[json]: https://www.json.org/json-en.html
[calyx-tut]: ./language-tut.md
[mrxl-ast]: https://github.com/calyxir/calyx/blob/master/frontends/mrxl/mrxl/ast.py
[lf-cells]: ../lang/ref.md#cells
[lf-wires]: ../lang/ref.md#the-wires-section
[lf-groups]: ../lang/ref.md#group-definitions
[lf-control]: ../lang/ref.md#the-control-operators
[lf-while]: ../lang/ref.md#while
[lf-comb-group]: ../lang/ref.md#comb-group-definitions
[lf-par]: ../lang/ref.md#par
[impl]: https://github.com/calyxir/calyx/blob/master/frontends/mrxl/mrxl/gen_calyx.py
[reduc-trees]: http://www.cs.ucr.edu/~nael/217-f15/lectures/217-lec10.pdf
[builder-arb]: https://github.com/calyxir/calyx/blob/master/calyx-py/test/arbiter_6.py
[builder-red-tree]: https://github.com/calyxir/calyx/blob/master/calyx-py/test/reduction_tree.py
[flit]: https://flit.readthedocs.io/en/latest/index.html
[calyx-py-lib]: ../calyx-py.md
[python dataclasses]: https://docs.python.org/3/library/dataclasses.html
[reg-calyx]: https://github.com/calyxir/calyx/blob/45075345ae2858b23a599d65d94b0ed7bf949a61/primitives/compile.futil#L22
[reg-verilog]: https://github.com/calyxir/calyx/blob/master/primitives/compile.futil#L31
[mem-calyx]: https://github.com/calyxir/calyx/blob/45075345ae2858b23a599d65d94b0ed7bf949a61/primitives/compile.futil#L22
[mem-verilog]: https://github.com/calyxir/calyx/blob/master/primitives/core.sv#L220
[compute-par]: https://github.com/calyxir/calyx/blob/45075345ae2858b23a599d65d94b0ed7bf949a61/frontends/mrxl/mrxl/gen_calyx.py#L312
[par-undef]: ../lang/ref.md#par
[binary-mult]: https://github.com/calyxir/calyx/blob/master/primitives/binary_operators.sv#L27-L45
[verilator]: https://www.veripool.org/wiki/verilator
[docker]: https://github.com/calyxir/calyx/pkgs/container/calyx
[running-mrxl-example]: #running-mrxl
[mymap-stub]: https://github.com/calyxir/calyx/blob/master/frontends/mrxl/mrxl/gen_calyx.py#L171-L191
[banking-need]: #memory-banking