bend-lang 0.2.6

A high-level, massively parallel programming language
Documentation
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
Bend in X minutes - the ultimate guide!
=======================================

Bend is a high-level, massively parallel programming language. That means it
feels like Python, but scales like CUDA. It runs on CPUs and GPUs, and you don't
have to do anything to make it parallel: as long as your code isn't "helplessly
sequential", it **will** use 1000's of threads!

While cool, Bend is far from perfect. In absolute terms it is still not so fast.
Compared to SOTA compilers like GCC or GHC, our code gen is still embarrassingly
bad, and there is a lot to improve. And, of course, in this beginning, there
will be tons of instability and bugs. That said, it does what it promises:
scaling horizontally with cores. And that's really cool! If you'd like to be an
early adopter of this interesting tech, this guide will teach you how to apply
Bend to build parallel programs in a new way!

For a more technical dive, check HVM2's
[paper](http://paper.HigherOrderCO.com/). For an entertaining, intuitive
explanation, see HVM1's classic
[HOW.md](https://github.com/HigherOrderCO/HVM/blob/master/guide/HOW.md). But if
you just want to dive straight into action - this guide is for you. Let's go!

Installation
------------

To use Bend, first, install [Rust nightly](https://www.oreilly.com/library/view/rust-programming-by/9781788390637/e07dc768-de29-482e-804b-0274b4bef418.xhtml). Then, install HVM2 and Bend itself with:

```
cargo +nightly install hvm
cargo +nightly install bend-lang
```

To test if it worked, type:

```
bend --help
```

For GPU support, you also need CUDA version `12.X`.

Hello, World!
-------------

As we said, Bend *feels* like Python - in some ways. It is high-level, you can
easily create objects and lists, there are ifs and loops. Yet, it is different:
there is some Haskell in it, in the sense algebraic datatypes, pattern-matching
and recursion play an important role. This is how its  `"Hello, world!"` looks:

```python
def main():
  return "Hello, world!"
```

Wait - there is something strange there. Why `return`, not `print`? Well, *for
now* (you'll read these words a lot), Bend doesn't have IO. We plan on
introducing it very soon! So, *for now*, all you can do is perform computations,
and see results. To run the program above, type:

```
bend run main.bend
```

If all goes well, you should see `"Hello, world!"`. The `bend run` command uses
the reference interpreter, which is slow. In a few moments, we'll teach you how
to run your code in parallel, on both CPUs and GPUs. For now, let's learn some
fundamentals!

Basic Functions and Datatypes
-----------------------------

In Bend, functions are pure: they receive something, and they return something.
That's all. Here is a function that tells you how old you are:

```python
def am_i_old(age):
  if age < 18:
    return "you're a kid"
  else:
    return "you're an adult"

def main():
  return am_i_old(32)
```

That is simple enough, isn't it? Here is one that returns the distance between
two points:

```python
def distance(ax, ay, bx, by):
  dx = bx - ax
  dy = by - ay
  return (dx * dx + dy * dy) ** 0.5

def main():
  return distance(10.0, 10.0, 20.0, 20.0)
```

This isn't so pretty. Could we use tuples instead? Yes:

```python
def distance(a, b):
  (ax, ay) = a
  (bx, by) = b
  dx = bx - ax
  dy = by - ay
  return (dx * dx + dy * dy) ** 0.5

def main():
  return distance((10.0, 10.0), (20.0, 20.0))
```

So far, this does look like Python, doesn't it? What about objects? Well - here,
there is a difference. In Python, we have classes. In Bend, we just have the
objects themselves. This is how we create a 2D vector:

```python
object V2 { x, y }

def distance(a, b):
  open V2: a
  open V2: b
  dx = b.x - a.x
  dy = b.y - a.y
  return (dx * dx + dy * dy) ** 0.5

def main():
  return distance(V2 { x: 10.0, y: 10.0 }, V2 { x: 20.0, y: 20.0 })
```

This doesn't look too different, does it? What is that `open` thing, though? It
just tells Bend to *consume* the vector, `a`, "splitting" it into its
components, `a.x` and `a.y`. Is that really necessary? Actually, no - not
really. But, *for now*, it is. This has to do with the fact Bend is an affine
language, which... well, let's not get into that. For now, just remember we
need `open` to access fields.

Bend comes with 3 built-in numeric types: `u24`, `i24`, `f24`. That's quite
small, we admit. Soon, we'll have larger types. For now, that's what we got.
The `u24` type is written like `123` or `0xF`. The `i24` type requires a sign,
as in, `+7` or `-7`. The `f24` type uses `.`, like `3.14`.

Other than tuples, Bend has another, very general, way to encode data:
datatypes! These are just "objects with tags". A classic example of this is a
"shape", which can be either a circle or a rectangle. It is defined like this:

```python
type Shape:
  Circle { radius }
  Rectangle { width, height }

def area(shape):
  match shape:
    case Shape/Circle:
      return 3.14 * shape.radius ** 2.0
    case Shape/Rectangle:
      return shape.width * shape.height

def main:
  return area(Shape/Circle { radius: 10.0 })
```

In this example, `Shape` is a datatype with two variants: `Circle` and
`Rectangle`.  The `area` function uses pattern matching to handle each variant
appropriately.  Just like objects need `open`, datatypes need `match`, which
give us access to fields in each respective case.

Datatypes are very general. From matrices, to JSON, to quadtrees, every type of
data can be represented as a datatype (I mean, that's the name!). In fact,
lists - which, on Python, are actually stored as arrays - are represented using
datatypes on Bend. Specifically, the type:

```python
type List:
  Nil
  Cons { head, tail }
```

Here, the `Nil` variant represents an empty list, and the `Cons` variant
represents a concatenation between an element (`head`) and another list
(`tail`). That way, the `[1,2,3]` list could be written as:

```python
def main:
  my_list = List/Cons { head: 1, tail: List/Cons { head: 2, tail: List/Cons { head: 3, tail: List/Nil }}}
  return my_list
```

Obviously - that's terrible. So, you can write just instead:

```python
def main:
  my_list = [1, 2, 3]
  return my_list
```

Which is decent. But while it is written the same as in Python, it is important
to understand it is just the `List` datatype, which means we can operate on it
using the `match` notation. For example:

```python
def main:
  my_list = [1, 2, 3]
  match my_list:
    case List/Cons:
      return my_list.head
    case List/Nil:
      return 0
```

Will return `1`, which is the first element.

We also have a syntax sugar for strings in Bend, which is just a List of `u24`
characters (UTF-16 encoded). The `"Hello, world!"` type we've seen used it!

Bend also has inline functions, which work just like Python:

```python
def main:
  mul_2 = lambda x: x * 2
  return mul_2(7)
```

Except without the annoying syntax restrictions. You can also shorten it as `λ`,
if you can somehow type that.

*note - some types, like tuples, aren't being pretty-printed correctly after
computation. this will be fixed in the next days (TM)*

The Dreaded Immutability
------------------------

Finally, let's get straight to the fun part: how do we implement parallel
algorithms with Bend? Just kidding. Before we get there, let's talk about loops.
You might have noticed we have avoided them so far. That wasn't by accident.
There is an important aspect on which Bend diverges from Python, and aligns with
Haskell: **variables are immutable**. Not "by default". They just **are**. For
example, in Bend, we're not allowed to write:

```python
def parity(x):
  result = "odd"
  if x % 2 == 0:
    result = "even"
  return result
```

... because that would mutate the `result` variable. Instead, we should write:

```python
def is_even(x):
  if x % 2 == 0:
    return "even"
  else:
    return "odd"

def main:
  return is_even(7)
```

Which is immutable. If that sounds annoying, that's because **it is**. Don't
let anyone tell you otherwise. We are aware of that, and we have many ideas on
how to improve this, making Bend feel even more Python-like. For now, we have to
live with it. But, wait... if variables are immutable... how do we even do
loops? For example:

```python
def sum(x):
  total = 0 
  for i in range(10)
    total += i
  return total
```

Here, the entire way the algorithm works is by mutating the `total` variable.
Without mutability, loops don't make sense. The good news is Bend has *something
else* that is equally as - actually, more - powerful. And learning it is really
worth your time. Let's do it!

Folds and Bends
---------------

### Recursive Datatypes

Let's start by implementing a recursive datatype in Bend:

```python
type Tree:
  Node { ~lft, ~rgt }
  Leaf { val }
```

This defines a binary tree, with elements on leaves. For example, the tree:

```
  __/\__
 /\     /\
1  2   3  4
```

Could be represented as:

```
tree = Tree/Node {
  lft: Tree/Node { lft: Tree/Leaf { val: 1 }, rgt: Tree/Leaf { val: 2 } },
  rgt: Tree/Node { lft: Tree/Leaf { val: 3 }, rgt: Tree/Leaf { val: 4 } }
}
```

That's ugly. Very soon, we'll add a syntax sugar to make this shorter:

```
tree = ![![1,2],![3,4]]
```

As usual, for now, we'll live with the longer version.

### Fold: consuming recursive datatypes

Now, here's a question: how do we *sum* the elements of a tree? In Python, we
could just use a loop. In Bend, we don't have loops. Fortunately, there is
another construct we can use: it's called `fold`, and it works like a *search
and replace* for datatypes. For example, consider the code below:


```python
type Tree:
  Node { ~lft, ~rgt }
  Leaf { val }

def sum(tree):
  fold tree:
    case Tree/Node:
      return tree.lft + tree.rgt
    case Tree/Leaf:
      return tree.val

def main:
  tree = Tree/Node {
    lft: Tree/Node { lft: Tree/Leaf { val: 1 }, rgt: Tree/Leaf { val: 2 } },
    rgt: Tree/Node { lft: Tree/Leaf { val: 3 }, rgt: Tree/Leaf { val: 4 } }
  }
  return sum(tree)
```

It accomplishes the task by replacing every `Tree/Node { lft, rgt }` by `lft +
rgt`, and replacing every `Tree/Leaf` by `val`. As a result, the entire "tree of
values" is turned into a "tree of additions", and it evaluates as follows:

```python
nums = ((1 + 2) + (3 + 4))
nums = (3 + 7)
nums = 10
```

Now, this may look limiting, but it actually isn't. Folds are known for being
universal: *any algorithm that can be implemented with a loop, can be
implemented with a fold*. So, we can do much more than just compute an
"aggregated value". Suppose we wanted, for example, to transform every element
into a tuple of `(index,value)`, returning a new tree. Here's how to do it:

```python
type Tree:
  Node { ~lft, ~rgt }
  Leaf { val }

def enum(tree):
  idx = 0
  fold tree with idx:
    case Tree/Node:
      return Tree/Node {
        lft: tree.lft(idx * 2 + 0),
        rgt: tree.rgt(idx * 2 + 1),
      }
    case Tree/Leaf:
      return (idx, tree.val)

def main:
  tree = Tree/Node {
    lft: Tree/Node { lft: Tree/Leaf { val: 1 }, rgt: Tree/Leaf { val: 2 }, },
    rgt: Tree/Node { lft: Tree/Leaf { val: 3 }, rgt: Tree/Leaf { val: 4 }, }
  }
  return sum(tree)
```

Compared to the `sum` algorithm, 3 important things changed:

1. We initialize a state, `idx`, as `0`.

2. We pass new states down as `tree.xyz(new_idx)`

3. The base case receives the final state: the element index

So, in the end, we'll have computed a copy of the original tree, except that
every element has now became a tuple of index and value.

Now, please take a moment to think about this fact: **everything can be computed
with a fold.** This idea often takes some time to get used to, but, once you do,
it is really liberating, and will let you write better algorithms. As an
exercise, use `fold` to implement a "reverse" algorithm for lists:

```python
def reverse(list):
  # exercise
  ?

def main:
  return reverse([1,2,3])
```

## Bend: generating recursive datatypes

Bending is the opposite of folding. Whatever `fold` consumes, `bend` creates.
The idea is that, by defining an *initial state* and a *halting condition*, we
can "grow" a recursive structure, layer by layer, until the condition is met. 
For example, consider the code below:

```python
type Tree:
  Node { ~lft, ~rgt }
  Leaf { val }

def main():
  bend x = 0:
    when x < 3:
      tree = Tree/Node { lft: fork(x + 1), rgt: fork(x + 1) }
    else:
      tree = Tree/Leaf { val: 7 }
  return tree
```

The program above will initialize a state (`x = 0`), and then, for as long as `x
< 3`, it will "fork" that state in two, creating a `Tree/Node`, and continuing
with `x + 1`. When `x >= 3`, it will halt and return a `Tree/Leaf` with `7`.
When all is done, the result will be assigned to the `tree` variable:

```python
tree = fork(0)
tree = ![fork(1), fork(1)]
tree = ![![fork(2),fork(2)], ![fork(2),fork(2)]]
tree = ![![![fork(3),fork(3)], ![fork(3),fork(3)]], ![![fork(3),fork(3)], ![fork(3),fork(3)]]]
tree = ![![![7,7], ![7,7]], ![![7,7], ![7,7]]]
```

With some imagination, we can easily see that, by recursively unrolling a state
this way, we can generate any structure we'd like. In fact, `bend` is so general
we can even use it to emulate a loop. For example, this Python program:

```python
sum = 0
idx = 0
while idx < 10:
  sum = idx + sum
  idx = idx + 1
```

Could be emulated in Bend with a "sequential bend":

```python
bend idx = 0:
  when idx < 10:
    sum = idx + fork(idx + 1)
  else:
    sum = 0
```

Of course, if you do it, Bend's devs will be very disappointed with you. Why?
Because everyone is here for one thing. Let's do it!

Parallel "Hello, World"
-----------------------

So, after all this learning, we're now ready to answer the ultimate question:

**How do we write parallel algorithms in Bend?**

At this point, you might have the idea: by using *folds* and *bends*, right?
Well... actually not! You do not need to use these constructs at all to make it
happen. Anything that *can* be parallelized *will* be parallelized on Bend. To
be more precise, this:

```
f(g(x))
```

Can NOT be parallelized, because `f` **depends** on the result of `g`. But this:

```
H(f(x), g(y))
```

Can be parallelized, because `f(x)` and `g(y)` are **independent**. Traditional
loops, on the other hands, are inherently sequential. A loop like:

```python
sum = 0
for i in range(8):
  sum += i
```

Is actually just a similar way to write:

```python
sum = (0 + (1 + (2 + (3 + (4 + (5 + (6 + 7)))))))
```

Which is *really bad* for parallelism, because the only way to compute this is
by evaluating the expressions one after the other, in order:

```python
sum = (0 + (1 + (2 + (3 + (4 + (5 + (6 + 7)))))))
sum = (0 + (1 + (2 + (3 + (4 + (5 + 13))))))
sum = (0 + (1 + (2 + (3 + (4 + 18)))))
sum = (0 + (1 + (2 + (3 + 22))))
sum = (0 + (1 + (2 + 25)))
sum = (0 + (1 + 27))
sum = (0 + 28)
sum = 28
```

There is nothing Bend could do to save this program: sequentialism is an
inherent part of its logic. Now, if we had written, instead:

```python
sum = (((0 + 1) + (2 + 3)) + ((4 + 5) + (6 + 7)))
```

Then, we'd have a much easier time evaluating that in parallel. Look at it:

```python
sum = (((0 + 1) + (2 + 3)) + ((4 + 5) + (6 + 7)))
sum = ((1 + 5) + (9 + 13))
sum = (6 + 22)
sum = 28
```

That's so much better that even the *line count* is shorter!

So, how do you write a parallel program in Bend?

**Just write algorithms that aren't helplessly sequential.**

That's all there is to it. As long as you write programs like that one, then
unlike the former one, they will run in parallel. And that's why `bend` and
`fold` are core features: they're, essentially, parallelizable loops. For
example, to add numbers in parallel, we can write:

```python
def main():
  bend d = 0, i = 0:
    when d < 28:
      sum = fork(d+1, i*2+0) + fork(d+1, i*2+1)
    else:
      sum = i
  return sum
```

And that's the parallel "Hello, world"! Now, let's finally run it. But first,
let's measure its single-core performance. Also, remember that, for now, Bend
only supports 24-bit numbers (`u24`), thus, the results will always be in `mod
16777216`.

```
bend run main.bend
```

On my machine (Apple M3 Max), it completes after `147s`, at `65 MIPS` (Million
Interactions Per Second - Bend's version of the FLOPS). That's too long. Let's
run it in parallel, by using the **C interpreter** instead:

```
bend run-c main.bend
```

And, just like that, the same program now runs in `8.49s`, at `1137 MIPS`.
That's **18x faster**! Can we do better? Sure: let's use the **C compiler** now:

```
bend gen-c main.bend >> main.c
```

This command converts your `bend` file into a small, dependency-free C file
that does the same computation much faster. You can compile it to an executable:

```
gcc main.c -o main -O2 -lm -lpthread # if you're on Linux
gcc main.c -o main -O2               # if you're on OSX
./main
```

Now, the same program runs in `5.81s`, at `1661.91 MIPS`. That's now **25x
faster** than the original! Can we do better? Let's now enter the unexplored
realms of arbitrary high-level programs on... GPUs. How hard that could be?
Well, for us... it was. A lot. For you... just call the **CUDA interpreter**:

```
bend run-cu main.bend
```

And, simply as that, the same program now runs in `0.82s`, at a blistering
`11803.24 MIPS`. That's **181x faster** than the original. Congratulations!
You're now a thread bender.

~

As a last note, you may have noticed that the compiled version isn't much faster
than the interpreted one. Our compiler is still on its infancy, and the assembly
generated is quite abysmal. Most of our effort went into setting up a foundation
for the parallel evaluator, which was no easy task. With that out of our way,
improving the compiler is a higher priority now. You can expect it to improve
continuously over time. For now, it is important to understand the state of
things, and set up reasonable expectations.

A Parallel Bitonic Sort
-----------------------

The bitonic sort is a popular algorithm that sorts a set of numbers by moving
them through a "circuit" (sorting network) and swapping as they pass through:

![bsort](https://upload.wikimedia.org/wikipedia/commons/thumb/b/bd/BitonicSort1.svg/1686px-BitonicSort1.svg.png)

In CUDA, this can be implemented by using mutable arrays and synchronization
primitives. This is well known. What is less known is that it can also be
implemented as a series of *immutable tree rotations*, with pattern-matching and
recursion. Don't bother trying to understand it, but, here's the code:

```python
def gen(d, x):
  switch d:
    case 0:
      return x
    case _:
      return (gen(d-1, x * 2 + 1), gen(d-1, x * 2))

def sum(d, t):
  switch d:
    case 0:
      return t
    case _:
      (t.a, t.b) = t
      return sum(d-1, t.a) + sum(d-1, t.b)

def swap(s, a, b):
  switch s:
    case 0:
      return (a,b)
    case _:
      return (b,a)

def warp(d, s, a, b):
  switch d:
    case 0:
      return swap(s + (a > b), a, b)
    case _:
      (a.a,a.b) = a
      (b.a,b.b) = b
      (A.a,A.b) = warp(d-1, s, a.a, b.a)
      (B.a,B.b) = warp(d-1, s, a.b, b.b)
      return ((A.a,B.a),(A.b,B.b))

def flow(d, s, t):
  switch d:
    case 0:
      return t
    case _:
      (t.a, t.b) = t
      return down(d, s, warp(d-1, s, t.a, t.b))

def down(d,s,t):
  switch d:
    case 0:
      return t
    case _:
      (t.a, t.b) = t
      return (flow(d-1, s, t.a), flow(d-1, s, t.b))

def sort(d, s, t):
  switch d:
    case 0:
      return t
    case _:
      (t.a, t.b) = t
      return flow(d, s, sort(d-1, 0, t.a), sort(d-1, 1, t.b))

def main:
  return sum(18, sort(18, 0, gen(18, 0)))
```

As a test of Bend's ability to parallelize the most insanely high-level
computations possible, let's benchmark this program. Here are the results:

- 12.33s / 102 MIPS (Apple M3 Max, 1 thread)

- 0.96s / 1315 MIPS (Apple M3 Max, 16 threads) - 12x speedup

- 0.24s / 5334 MIPS (NVIDIA RTX 4090, 16k threads) - 51x speedup

And, just like magic, it works! 51x faster on RTX. How cool is that?

Of course, you would absolutely **not** want to sort numbers like that,
specially when mutable arrays exist. But there are many algorithms that *can
not* be implemented easily with buffers. Evolutionary and genetic algorithms,
proof checkers, compilers, interpreters. For the first time ever, you can
implement these algorithms as high-level functions, in a language that runs on
GPUs. That's the magic of Bend!

Graphics Rendering
------------------

While the algorithm above does parallelize well, it is very memory-hungry. It is
a nice demo of Bend's potential, but isn't a great way to sort lists. Currently,
Bend has a 4GB memory limit (for being a 32-bit architecture). When the memory
is filled, its performance will degrade considerably. But we can do better.

Since Bend is GC-free, we can express low memory footprint programs using `bend`
or tail calls. For maximum possible performance, one should first create enough
"parallel room" to fill all available cores, and then spend some time doing
compute-heavy, but less memory-hungry, computations. For example, consider:

```python
# given a shader, returns a square image
def render(depth, shader):
  bend d = 0, i = 0:
    when d < depth:
      color = (fork(d+1, i*2+0), fork(d+1, i*2+1))
    else:
      width = depth / 2
      color = demo_shader(i % width, i / width)
  return color

# given a position, returns a color
# for this demo, it just busy loops
def demo_shader(x, y):
  bend i = 0:
    when i < 100000:
      color = fork(i + 1)
    else:
      color = 0x000001
  return color

# renders a 256x256 image using demo_shader
def main:
  return render(16, demo_shader)
```

It emulates an OpenGL fragment shader by building an "image" as a perfect binary
tree, and then calling the `demo_shader` function on each pixel. Since the tree
has a depth of 16, we have `2^16 = 65536 pixels`, which is enough to fill all
cores of an RTX 4090. Moreover, since `demo_shader` isn't doing many
allocations, it can operate entirely inside the GPU's "shared memory" (L1
cache). Each GPU thread has a local space of 64 IC nodes. Functions that don't
need more than that, like `demo_shader`, can run up to 5x faster!

On my GPU, it performs `22,000 MIPS` out of the box, and `40000+ MIPS` with a
tweak on the generated CUDA file (doubling the `TPC`, which doubles the number
of threads per block). In the near future, we plan to add immutable textures,
allowing for single-interaction sampling. With some napkin math, this should be
enough to render 3D games in real-time. Imagine a future where game engines are
written in Python-like languages? That's the future we're building, with Bend!

To be continued...
------------------

This guide isn't extensive, and there's a lot uncovered. For example, Bend also
has an entire "secret" Haskell-like syntax that is compatible with old HVM1.
[Here](https://gist.github.com/VictorTaelin/9cbb43e2b1f39006bae01238f99ff224) is
an implementation of the Bitonic Sort with Haskell-like equations. We'll
document its syntax here soon!

Community
---------

Remember: Bend is very new and experimental. Bugs and imperfections should be
expected. That said, [HOC](https://HigherOrderCO.com/) will provide long-term
support to Bend (and its runtime, HVM2). So, if you believe this paradigm will
be big someday, and want to be part of it in these early stages, join us on
[Discord](https://Discord.HigherOrderCO.com/). Report bugs, bring your
suggestions, and let's chat and build this future together!