bend-lang 0.2.6

Bend in X minutes - the ultimate guide!
=======================================

Bend is a high-level, massively parallel programming language. That means it
feels like Python, but scales like CUDA. It runs on CPUs and GPUs, and you don't
have to do anything to make it parallel: as long as your code isn't "helplessly
sequential", it **will** use thousands of threads!

While cool, Bend is far from perfect. In absolute terms, it is still not that
fast. Compared to SOTA compilers like GCC or GHC, our code gen is still
embarrassingly bad, and there is a lot to improve. And, of course, at this early
stage, there will be tons of instability and bugs. That said, it does what it promises:
scaling horizontally with cores. And that's really cool! If you'd like to be an
early adopter of this interesting tech, this guide will teach you how to apply
Bend to build parallel programs in a new way!

For a more technical dive, check HVM2's
[paper](http://paper.HigherOrderCO.com/). For an entertaining, intuitive
explanation, see HVM1's classic
[HOW.md](https://github.com/HigherOrderCO/HVM/blob/master/guide/HOW.md). But if
you just want to dive straight into action - this guide is for you. Let's go!

Installation
------------

To use Bend, first, install [Rust nightly](https://www.oreilly.com/library/view/rust-programming-by/9781788390637/e07dc768-de29-482e-804b-0274b4bef418.xhtml). Then, install HVM2 and Bend itself with:

```
cargo +nightly install hvm
cargo +nightly install bend-lang
```

To test if it worked, type:

```
bend --help
```

For GPU support, you also need CUDA version `12.X`.

Hello, World!
-------------

As we said, Bend *feels* like Python - in some ways. It is high-level, you can
easily create objects and lists, there are ifs and loops. Yet, it is different:
there is some Haskell in it, in the sense that algebraic datatypes,
pattern-matching and recursion play an important role. This is how its
`"Hello, world!"` looks:

```python
def main():
  return "Hello, world!"
```

Wait - there is something strange there. Why `return`, not `print`? Well, *for
now* (you'll read these words a lot), Bend doesn't have IO. We plan on
introducing it very soon! So, *for now*, all you can do is perform computations,
and see results. To run the program above, type:

```
bend run main.bend
```

If all goes well, you should see `"Hello, world!"`. The `bend run` command uses
the reference interpreter, which is slow. In a few moments, we'll teach you how
to run your code in parallel, on both CPUs and GPUs. For now, let's learn some
fundamentals!

Basic Functions and Datatypes
-----------------------------

In Bend, functions are pure: they receive something, and they return something.
That's all. Here is a function that tells you how old you are:

```python
def am_i_old(age):
  if age < 18:
    return "you're a kid"
  else:
    return "you're an adult"

def main():
  return am_i_old(32)
```

That is simple enough, isn't it? Here is one that returns the distance between
two points:

```python
def distance(ax, ay, bx, by):
  dx = bx - ax
  dy = by - ay
  return (dx * dx + dy * dy) ** 0.5

def main():
  return distance(10.0, 10.0, 20.0, 20.0)
```

This isn't so pretty. Could we use tuples instead? Yes:

```python
def distance(a, b):
  (ax, ay) = a
  (bx, by) = b
  dx = bx - ax
  dy = by - ay
  return (dx * dx + dy * dy) ** 0.5

def main():
  return distance((10.0, 10.0), (20.0, 20.0))
```

So far, this does look like Python, doesn't it? What about objects? Well - here,
there is a difference. In Python, we have classes. In Bend, we just have the
objects themselves. This is how we create a 2D vector:

```python
object V2 { x, y }

def distance(a, b):
  open V2: a
  open V2: b
  dx = b.x - a.x
  dy = b.y - a.y
  return (dx * dx + dy * dy) ** 0.5

def main():
  return distance(V2 { x: 10.0, y: 10.0 }, V2 { x: 20.0, y: 20.0 })
```

This doesn't look too different, does it? What is that `open` thing, though? It
just tells Bend to *consume* the vector, `a`, "splitting" it into its
components, `a.x` and `a.y`. Is that really necessary? Actually, no - not
really. But, *for now*, it is. This has to do with the fact Bend is an affine
language, which... well, let's not get into that. For now, just remember we
need `open` to access fields.

Bend comes with 3 built-in numeric types: `u24`, `i24`, `f24`. That's quite
small, we admit. Soon, we'll have larger types. For now, that's what we got.
The `u24` type is written like `123` or `0xF`. The `i24` type requires a sign,
as in, `+7` or `-7`. The `f24` type uses `.`, like `3.14`.

Other than tuples, Bend has another, very general, way to encode data:
datatypes! These are just "objects with tags". A classic example of this is a
"shape", which can be either a circle or a rectangle. It is defined like this:

```python
type Shape:
  Circle { radius }
  Rectangle { width, height }

def area(shape):
  match shape:
    case Shape/Circle:
      return 3.14 * shape.radius ** 2.0
    case Shape/Rectangle:
      return shape.width * shape.height

def main:
  return area(Shape/Circle { radius: 10.0 })
```

In this example, `Shape` is a datatype with two variants: `Circle` and
`Rectangle`. The `area` function uses pattern matching to handle each variant
appropriately. Just like objects need `open`, datatypes need `match`, which
gives us access to fields in each respective case.
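
For readers who think in Python, the same idea can be roughly modeled with plain classes: each variant becomes its own class, and `match` becomes a check on which class a value belongs to. This is only an analogy written in ordinary Python, not Bend code; the class names simply mirror the example above, and the `Rectangle` values are made up for illustration.

```python
from dataclasses import dataclass

@dataclass
class Circle:
    radius: float

@dataclass
class Rectangle:
    width: float
    height: float

def area(shape):
    # Dispatch on the variant, like `match shape:` does in Bend.
    if isinstance(shape, Circle):
        return 3.14 * shape.radius ** 2.0
    if isinstance(shape, Rectangle):
        return shape.width * shape.height
    raise TypeError("not a Shape")

circle_area = area(Circle(radius=10.0))
rect_area = area(Rectangle(width=2.0, height=3.0))  # illustrative values
```

The fields of a variant are only reachable inside the branch that has established which variant it is, which is the role `match` plays in Bend.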

Datatypes are very general. From matrices, to JSON, to quadtrees, every type of
data can be represented as a datatype (I mean, that's the name!). In fact,
lists - which, in Python, are actually stored as arrays - are represented using
datatypes in Bend. Specifically, the type:

```python
type List:
  Nil
  Cons { head, tail }
```

Here, the `Nil` variant represents an empty list, and the `Cons` variant
represents a concatenation between an element (`head`) and another list
(`tail`). That way, the `[1,2,3]` list could be written as:

```python
def main:
  my_list = List/Cons { head: 1, tail: List/Cons { head: 2, tail: List/Cons { head: 3, tail: List/Nil }}}
  return my_list
```

Obviously - that's terrible. So, instead, you can just write:

```python
def main:
  my_list = [1, 2, 3]
  return my_list
```

Which is decent. But while it is written the same as in Python, it is important
to understand it is just the `List` datatype, which means we can operate on it
using the `match` notation. For example:

```python
def main:
  my_list = [1, 2, 3]
  match my_list:
    case List/Cons:
      return my_list.head
    case List/Nil:
      return 0
```

Will return `1`, which is the first element.
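
To make the desugaring concrete, here is a small model in ordinary Python (not Bend) of how a list literal could expand into nested `Cons` cells ending in `Nil`. The helper names `from_py` and `first_or_zero` are hypothetical and exist only for this sketch.

```python
from dataclasses import dataclass

@dataclass
class Nil:
    pass

@dataclass
class Cons:
    head: object
    tail: object

def from_py(items):
    # Build the Cons chain from right to left, ending in Nil.
    result = Nil()
    for item in reversed(items):
        result = Cons(item, result)
    return result

def first_or_zero(lst):
    # Mirrors the `match my_list:` example above.
    if isinstance(lst, Cons):
        return lst.head
    return 0

my_list = from_py([1, 2, 3])
first = first_or_zero(my_list)
```

Note that reaching the last element means walking through every `Cons` cell, which is one way this representation differs from Python's array-backed lists.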

We also have a syntax sugar for strings in Bend, which is just a List of `u24`
characters (UTF-16 encoded). The `"Hello, world!"` string we've seen used it!

Bend also has inline functions, which work just like Python:

```python
def main:
  mul_2 = lambda x: x * 2
  return mul_2(7)
```

Except without the annoying syntax restrictions. You can also shorten it as `λ`,
if you can somehow type that.

*Note: some types, like tuples, aren't being pretty-printed correctly after
computation. This will be fixed in the next days (TM).*

The Dreaded Immutability
------------------------

Finally, let's get straight to the fun part: how do we implement parallel
algorithms with Bend? Just kidding. Before we get there, let's talk about loops.
You might have noticed we have avoided them so far. That wasn't by accident.
There is an important aspect in which Bend diverges from Python, and aligns with
Haskell: **variables are immutable**. Not "by default". They just **are**. For
example, in Bend, we're not allowed to write:

```python
def parity(x):
  result = "odd"
  if x % 2 == 0:
    result = "even"
  return result
```

... because that would mutate the `result` variable. Instead, we should write:

```python
def parity(x):
  if x % 2 == 0:
    return "even"
  else:
    return "odd"

def main:
  return parity(7)
```

Which is immutable. If that sounds annoying, that's because **it is**. Don't
let anyone tell you otherwise. We are aware of that, and we have many ideas on
how to improve this, making Bend feel even more Python-like. For now, we have to
live with it. But, wait... if variables are immutable... how do we even do
loops? For example:

```python
def sum(x):
  total = 0
  for i in range(10):
    total += i
  return total
```

Here, the entire way the algorithm works is by mutating the `total` variable.
Without mutability, loops don't make sense. The good news is Bend has *something
else* that is just as powerful - actually, more so. And learning it is really
worth your time. Let's do it!

Folds and Bends
---------------

### Recursive Datatypes

Let's start by implementing a recursive datatype in Bend:

```python
type Tree:
  Node { ~lft, ~rgt }
  Leaf { val }
```

This defines a binary tree, with elements on leaves. For example, the tree:

```
  __/\__
 /\     /\
1  2   3  4
```

Could be represented as:

```
tree = Tree/Node {
  lft: Tree/Node { lft: Tree/Leaf { val: 1 }, rgt: Tree/Leaf { val: 2 } },
  rgt: Tree/Node { lft: Tree/Leaf { val: 3 }, rgt: Tree/Leaf { val: 4 } }
}
```

That's ugly. Very soon, we'll add a syntax sugar to make this shorter:

```
tree = ![![1,2],![3,4]]
```

As usual, for now, we'll live with the longer version.

### Fold: consuming recursive datatypes

Now, here's a question: how do we *sum* the elements of a tree? In Python, we
could just use a loop. In Bend, we don't have loops. Fortunately, there is
another construct we can use: it's called `fold`, and it works like a *search
and replace* for datatypes. For example, consider the code below:


```python
type Tree:
  Node { ~lft, ~rgt }
  Leaf { val }

def sum(tree):
  fold tree:
    case Tree/Node:
      return tree.lft + tree.rgt
    case Tree/Leaf:
      return tree.val

def main:
  tree = Tree/Node {
    lft: Tree/Node { lft: Tree/Leaf { val: 1 }, rgt: Tree/Leaf { val: 2 } },
    rgt: Tree/Node { lft: Tree/Leaf { val: 3 }, rgt: Tree/Leaf { val: 4 } }
  }
  return sum(tree)
```

It accomplishes the task by replacing every `Tree/Node { lft, rgt }` by `lft +
rgt`, and replacing every `Tree/Leaf { val }` by `val`. As a result, the entire
"tree of values" is turned into a "tree of additions", and it evaluates as follows:

```python
nums = ((1 + 2) + (3 + 4))
nums = (3 + 7)
nums = 10
```
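
The "search and replace" reading of `fold` can be sketched in ordinary Python (not Bend) as a function that takes one replacement per variant. The names `fold`, `on_node` and `on_leaf` are chosen for this sketch only.

```python
from dataclasses import dataclass

@dataclass
class Leaf:
    val: object

@dataclass
class Node:
    lft: object
    rgt: object

def fold(tree, on_node, on_leaf):
    # Replace every Leaf by on_leaf(val) and every Node by
    # on_node(folded_left, folded_right), working bottom-up.
    if isinstance(tree, Leaf):
        return on_leaf(tree.val)
    return on_node(fold(tree.lft, on_node, on_leaf),
                   fold(tree.rgt, on_node, on_leaf))

tree = Node(Node(Leaf(1), Leaf(2)), Node(Leaf(3), Leaf(4)))

# Summing: a Node becomes an addition, a Leaf becomes its value.
total = fold(tree, lambda l, r: l + r, lambda v: v)

# Replacing each constructor by itself rebuilds the same tree.
copy = fold(tree, Node, Leaf)
```

Passing the constructors themselves as replacements gives back an identical tree, which is a handy way to see that a fold only decides what each constructor turns into.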

Now, this may look limiting, but it actually isn't. Folds are known for being
universal: *any algorithm that can be implemented with a loop, can be
implemented with a fold*. So, we can do much more than just compute an
"aggregated value". Suppose we wanted, for example, to transform every element
into a tuple of `(index,value)`, returning a new tree. Here's how to do it:

```python
type Tree:
  Node { ~lft, ~rgt }
  Leaf { val }

def enum(tree):
  idx = 0
  fold tree with idx:
    case Tree/Node:
      return Tree/Node {
        lft: tree.lft(idx * 2 + 0),
        rgt: tree.rgt(idx * 2 + 1),
      }
    case Tree/Leaf:
      return (idx, tree.val)

def main:
  tree = Tree/Node {
    lft: Tree/Node { lft: Tree/Leaf { val: 1 }, rgt: Tree/Leaf { val: 2 }, },
    rgt: Tree/Node { lft: Tree/Leaf { val: 3 }, rgt: Tree/Leaf { val: 4 }, }
  }
  return enum(tree)
```

Compared to the `sum` algorithm, 3 important things changed:

1. We initialize a state, `idx`, as `0`.

2. We pass new states down as `tree.xyz(new_idx)`

3. The base case receives the final state: the element index

So, in the end, we'll have computed a copy of the original tree, except that
every element has now become a tuple of index and value.
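
As a rough model in ordinary Python (not Bend), the state of a fold can be pictured as an extra argument that each recursive call receives. The sketch below assumes that reading; it mirrors the `enum` example, with the state passed as a default parameter.

```python
from dataclasses import dataclass

@dataclass
class Leaf:
    val: object

@dataclass
class Node:
    lft: object
    rgt: object

def enum(tree, idx=0):
    if isinstance(tree, Leaf):
        # Base case: receives the final state and pairs it with the value.
        return (idx, tree.val)
    # Recursive case: each child is folded with its own new state.
    return Node(enum(tree.lft, idx * 2 + 0),
                enum(tree.rgt, idx * 2 + 1))

tree = Node(Node(Leaf(1), Leaf(2)), Node(Leaf(3), Leaf(4)))
indexed = enum(tree)
```

In this model the four leaves end up paired with the indices 0, 1, 2 and 3, read from left to right.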

Now, please take a moment to think about this fact: **everything can be computed
with a fold.** This idea often takes some time to get used to, but, once you do,
it is really liberating, and will let you write better algorithms. As an
exercise, use `fold` to implement a "reverse" algorithm for lists:

```python
def reverse(list):
  # exercise
  ?

def main:
  return reverse([1,2,3])
```

### Bend: generating recursive datatypes

Bending is the opposite of folding. Whatever `fold` consumes, `bend` creates.
The idea is that, by defining an *initial state* and a *halting condition*, we
can "grow" a recursive structure, layer by layer, until the condition is met. 
For example, consider the code below:

```python
type Tree:
  Node { ~lft, ~rgt }
  Leaf { val }

def main():
  bend x = 0:
    when x < 3:
      tree = Tree/Node { lft: fork(x + 1), rgt: fork(x + 1) }
    else:
      tree = Tree/Leaf { val: 7 }
  return tree
```

The program above will initialize a state (`x = 0`), and then, for as long as `x
< 3`, it will "fork" that state in two, creating a `Tree/Node`, and continuing
with `x + 1`. When `x >= 3`, it will halt and return a `Tree/Leaf` with `7`.
When all is done, the result will be assigned to the `tree` variable:

```python
tree = fork(0)
tree = ![fork(1), fork(1)]
tree = ![![fork(2),fork(2)], ![fork(2),fork(2)]]
tree = ![![![fork(3),fork(3)], ![fork(3),fork(3)]], ![![fork(3),fork(3)], ![fork(3),fork(3)]]]
tree = ![![![7,7], ![7,7]], ![![7,7], ![7,7]]]
```
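
The same unrolling can be imitated in ordinary Python (not Bend) with a recursive function, where the argument plays the role of the state and each recursive call plays the role of `fork`. Nested tuples stand in for `Tree/Node` here, and the function name is hypothetical.

```python
def bend_tree(x=0):
    # `when x < 3`: fork the state in two, each side continuing with x + 1.
    if x < 3:
        return (bend_tree(x + 1), bend_tree(x + 1))
    # `else`: the halting case produces a leaf.
    return 7

def count_leaves(tree):
    # A tuple is a node with two sub-trees; anything else is a leaf.
    if isinstance(tree, tuple):
        return count_leaves(tree[0]) + count_leaves(tree[1])
    return 1

tree = bend_tree()
leaf_count = count_leaves(tree)
```

Three levels of forking give `2 ** 3 = 8` leaves, matching the last line of the trace above.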

With some imagination, we can easily see that, by recursively unrolling a state
this way, we can generate any structure we'd like. In fact, `bend` is so general
we can even use it to emulate a loop. For example, this Python program:

```python
sum = 0
idx = 0
while idx < 10:
  sum = idx + sum
  idx = idx + 1
```

Could be emulated in Bend with a "sequential bend":

```python
bend idx = 0:
  when idx < 10:
    sum = idx + fork(idx + 1)
  else:
    sum = 0
```

Of course, if you do it, Bend's devs will be very disappointed with you. Why?
Because everyone is here for one thing. Let's do it!

Parallel "Hello, World"
-----------------------

So, after all this learning, we're now ready to answer the ultimate question:

**How do we write parallel algorithms in Bend?**

At this point, you might have the idea: by using *folds* and *bends*, right?
Well... actually not! You do not need to use these constructs at all to make it
happen. Anything that *can* be parallelized *will* be parallelized on Bend. To
be more precise, this:

```
f(g(x))
```

Can NOT be parallelized, because `f` **depends** on the result of `g`. But this:

```
H(f(x), g(y))
```

Can be parallelized, because `f(x)` and `g(y)` are **independent**. Traditional
loops, on the other hand, are inherently sequential. A loop like:

```python
sum = 0
for i in range(8):
  sum += i
```

Is actually just another way to write:

```python
sum = (0 + (1 + (2 + (3 + (4 + (5 + (6 + 7)))))))
```

Which is *really bad* for parallelism, because the only way to compute this is
by evaluating the expressions one after the other, in order:

```python
sum = (0 + (1 + (2 + (3 + (4 + (5 + (6 + 7)))))))
sum = (0 + (1 + (2 + (3 + (4 + (5 + 13))))))
sum = (0 + (1 + (2 + (3 + (4 + 18)))))
sum = (0 + (1 + (2 + (3 + 22))))
sum = (0 + (1 + (2 + 25)))
sum = (0 + (1 + 27))
sum = (0 + 28)
sum = 28
```

There is nothing Bend could do to save this program: sequentialism is an
inherent part of its logic. Now, if we had written, instead:

```python
sum = (((0 + 1) + (2 + 3)) + ((4 + 5) + (6 + 7)))
```

Then, we'd have a much easier time evaluating that in parallel. Look at it:

```python
sum = (((0 + 1) + (2 + 3)) + ((4 + 5) + (6 + 7)))
sum = ((1 + 5) + (9 + 13))
sum = (6 + 22)
sum = 28
```

That's so much better that even the *line count* is shorter!
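
One way to quantify the difference is to count how many additions have to wait on each other. The ordinary Python sketch below (not Bend; function names are made up for this example) evaluates both shapes and returns, next to the value, the length of the longest chain of dependent additions.

```python
def chain(nums):
    # Right-nested shape: (n0 + (n1 + (... + n_last))).
    # Returns (value, depth); depth counts dependent addition steps.
    if len(nums) == 1:
        return nums[0], 0
    rest, depth = chain(nums[1:])
    return nums[0] + rest, depth + 1

def balanced(nums):
    # Balanced shape: split in half, add the two halves.
    if len(nums) == 1:
        return nums[0], 0
    mid = len(nums) // 2
    left, left_depth = balanced(nums[:mid])
    right, right_depth = balanced(nums[mid:])
    # The two halves are independent, so only the deeper one counts.
    return left + right, max(left_depth, right_depth) + 1

nums = list(range(8))
chain_value, chain_depth = chain(nums)
tree_value, tree_depth = balanced(nums)
```

Both shapes compute `28`, but the chain needs 7 steps in sequence while the balanced version needs only 3, and that gap grows from linear to logarithmic as the input gets larger.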

So, how do you write a parallel program in Bend?

**Just write algorithms that aren't helplessly sequential.**

That's all there is to it. As long as you write programs like the latter one,
and unlike the former, they will run in parallel. And that's why `bend` and
`fold` are core features: they're, essentially, parallelizable loops. For
example, to add numbers in parallel, we can write:

```python
def main():
  bend d = 0, i = 0:
    when d < 28:
      sum = fork(d+1, i*2+0) + fork(d+1, i*2+1)
    else:
      sum = i
  return sum
```

And that's the parallel "Hello, world"! Now, let's finally run it. But first,
let's measure its single-core performance. Also, remember that, for now, Bend
only supports 24-bit numbers (`u24`), thus, the results will always be in `mod
16777216`.

```
bend run main.bend
```
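
Before looking at timings, it can help to know what number to expect. The ordinary Python sketch below (not Bend) models the same recursion with every operation wrapped to 24 bits, assuming `u24` arithmetic simply wraps modulo `16777216`, and compares it with a closed form. The function names are hypothetical.

```python
MOD = 1 << 24  # assumed u24 wrap-around: 2**24 = 16777216

def bend_sum(depth, d=0, i=0):
    # Same shape as the `bend` above, with results reduced modulo 2**24.
    if d < depth:
        left = bend_sum(depth, d + 1, (i * 2 + 0) % MOD)
        right = bend_sum(depth, d + 1, (i * 2 + 1) % MOD)
        return (left + right) % MOD
    return i

def closed_form(depth):
    # The leaves are 0 .. n - 1 with n = 2**depth, so the sum is n * (n - 1) / 2.
    n = 1 << depth
    return (n * (n - 1) // 2) % MOD

small = bend_sum(3)         # the 8-number example from earlier
expected = closed_form(28)  # the depth used in the Bend program
```

Under this model the depth-28 sum reduces to `0`, because `2 ** 27 * (2 ** 28 - 1)` is a multiple of `2 ** 24`. Running `bend_sum(28)` directly in Python would be far too slow, which is why the closed form is used for the full depth.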

On my machine (Apple M3 Max), it completes after `147s`, at `65 MIPS` (Million
Interactions Per Second - Bend's version of FLOPS). That's too long. Let's
run it in parallel, by using the **C interpreter** instead:

```
bend run-c main.bend
```

And, just like that, the same program now runs in `8.49s`, at `1137 MIPS`.
That's **18x faster**! Can we do better? Sure: let's use the **C compiler** now:

```
bend gen-c main.bend > main.c
```

This command converts your `bend` file into a small, dependency-free C file
that does the same computation much faster. You can compile it to an executable:

```
gcc main.c -o main -O2 -lm -lpthread # if you're on Linux
gcc main.c -o main -O2               # if you're on OSX
./main
```

Now, the same program runs in `5.81s`, at `1661.91 MIPS`. That's now **25x
faster** than the original! Can we do better? Let's now enter the unexplored
realms of arbitrary high-level programs on... GPUs. How hard could that be?
Well, for us... it was. A lot. For you... just call the **CUDA interpreter**:

```
bend run-cu main.bend
```

And, as simple as that, the same program now runs in `0.82s`, at a blistering
`11803.24 MIPS`. That's **181x faster** than the original. Congratulations!
You're now a thread bender.

~

As a last note, you may have noticed that the compiled version isn't much faster
than the interpreted one. Our compiler is still in its infancy, and the assembly
generated is quite abysmal. Most of our effort went into setting up a foundation
for the parallel evaluator, which was no easy task. With that out of our way,
improving the compiler is a higher priority now. You can expect it to improve
continuously over time. For now, it is important to understand the state of
things, and set up reasonable expectations.

A Parallel Bitonic Sort
-----------------------

The bitonic sort is a popular algorithm that sorts a set of numbers by moving
them through a "circuit" (sorting network) and swapping as they pass through:

![bsort](https://upload.wikimedia.org/wikipedia/commons/thumb/b/bd/BitonicSort1.svg/1686px-BitonicSort1.svg.png)

In CUDA, this can be implemented by using mutable arrays and synchronization
primitives. This is well known. What is less known is that it can also be
implemented as a series of *immutable tree rotations*, with pattern-matching and
recursion. Don't bother trying to understand it, but, here's the code:

```python
def gen(d, x):
  switch d:
    case 0:
      return x
    case _:
      return (gen(d-1, x * 2 + 1), gen(d-1, x * 2))

def sum(d, t):
  switch d:
    case 0:
      return t
    case _:
      (t.a, t.b) = t
      return sum(d-1, t.a) + sum(d-1, t.b)

def swap(s, a, b):
  switch s:
    case 0:
      return (a,b)
    case _:
      return (b,a)

def warp(d, s, a, b):
  switch d:
    case 0:
      return swap(s ^ (a > b), a, b)
    case _:
      (a.a,a.b) = a
      (b.a,b.b) = b
      (A.a,A.b) = warp(d-1, s, a.a, b.a)
      (B.a,B.b) = warp(d-1, s, a.b, b.b)
      return ((A.a,B.a),(A.b,B.b))

def flow(d, s, t):
  switch d:
    case 0:
      return t
    case _:
      (t.a, t.b) = t
      return down(d, s, warp(d-1, s, t.a, t.b))

def down(d,s,t):
  switch d:
    case 0:
      return t
    case _:
      (t.a, t.b) = t
      return (flow(d-1, s, t.a), flow(d-1, s, t.b))

def sort(d, s, t):
  switch d:
    case 0:
      return t
    case _:
      (t.a, t.b) = t
      return flow(d, s, (sort(d-1, 0, t.a), sort(d-1, 1, t.b)))

def main:
  return sum(18, sort(18, 0, gen(18, 0)))
```
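
Because `main` only returns the sum of the result, the program above does not by itself show that the output is sorted. The following is a hedged port to ordinary Python (not Bend), using nested tuples for the tree, that makes this checkable on a small input. It assumes the comparator direction is computed with XOR and that `flow` receives the two sorted halves as a single pair; `flatten` and `build` are hypothetical helpers added for this sketch.

```python
def gen(d, x):
    # Perfect binary tree of depth d, as nested pairs.
    if d == 0:
        return x
    return (gen(d - 1, x * 2 + 1), gen(d - 1, x * 2))

def swap(s, a, b):
    return (a, b) if s == 0 else (b, a)

def warp(d, s, a, b):
    # Compare-and-swap matching leaves of two subtrees.
    if d == 0:
        return swap(s ^ int(a > b), a, b)
    (aa, ab), (ba, bb) = a, b
    (Aa, Ab) = warp(d - 1, s, aa, ba)
    (Ba, Bb) = warp(d - 1, s, ab, bb)
    return ((Aa, Ba), (Ab, Bb))

def flow(d, s, t):
    if d == 0:
        return t
    (ta, tb) = t
    return down(d, s, warp(d - 1, s, ta, tb))

def down(d, s, t):
    if d == 0:
        return t
    (ta, tb) = t
    return (flow(d - 1, s, ta), flow(d - 1, s, tb))

def sort(d, s, t):
    # s = 0 sorts ascending, s = 1 sorts descending.
    if d == 0:
        return t
    (ta, tb) = t
    return flow(d, s, (sort(d - 1, 0, ta), sort(d - 1, 1, tb)))

def flatten(d, t):
    # Hypothetical helper: read the leaves from left to right.
    if d == 0:
        return [t]
    return flatten(d - 1, t[0]) + flatten(d - 1, t[1])

def build(d, xs):
    # Hypothetical helper: pack 2**d numbers into a tree.
    if d == 0:
        return xs[0]
    mid = len(xs) // 2
    return (build(d - 1, xs[:mid]), build(d - 1, xs[mid:]))

depth = 4
ascending = flatten(depth, sort(depth, 0, gen(depth, 0)))
descending = flatten(depth, sort(depth, 1, gen(depth, 0)))
scrambled = flatten(3, sort(3, 0, build(3, [5, 1, 7, 3, 0, 6, 2, 4])))
```

Every call in `warp`, `down` and `sort` works on two independent subtrees, which is the property that lets Bend evaluate them in parallel.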

As a test of Bend's ability to parallelize the most insanely high-level
computations possible, let's benchmark this program. Here are the results:

- 12.33s / 102 MIPS (Apple M3 Max, 1 thread)

- 0.96s / 1315 MIPS (Apple M3 Max, 16 threads) - 12x speedup

- 0.24s / 5334 MIPS (NVIDIA RTX 4090, 16k threads) - 51x speedup

And, just like magic, it works! 51x faster on RTX. How cool is that?

Of course, you would absolutely **not** want to sort numbers like that,
especially when mutable arrays exist. But there are many algorithms that
*cannot* be implemented easily with buffers: evolutionary and genetic algorithms,
proof checkers, compilers, interpreters. For the first time ever, you can
implement these algorithms as high-level functions, in a language that runs on
GPUs. That's the magic of Bend!

Graphics Rendering
------------------

While the algorithm above does parallelize well, it is very memory-hungry. It is
a nice demo of Bend's potential, but isn't a great way to sort lists. Currently,
Bend has a 4GB memory limit (because it is a 32-bit architecture). When the memory
is filled, its performance will degrade considerably. But we can do better.

Since Bend is GC-free, we can express low memory footprint programs using `bend`
or tail calls. For maximum possible performance, one should first create enough
"parallel room" to fill all available cores, and then spend some time doing
compute-heavy, but less memory-hungry, computations. For example, consider:

```python
# given a shader, returns a square image
def render(depth, shader):
  bend d = 0, i = 0:
    when d < depth:
      color = (fork(d+1, i*2+0), fork(d+1, i*2+1))
    else:
      width = depth / 2
      color = shader(i % width, i / width)
  return color

# given a position, returns a color
# for this demo, it just busy loops
def demo_shader(x, y):
  bend i = 0:
    when i < 100000:
      color = fork(i + 1)
    else:
      color = 0x000001
  return color

# renders a 256x256 image using demo_shader
def main:
  return render(16, demo_shader)
```

It emulates an OpenGL fragment shader by building an "image" as a perfect binary
tree, and then calling the `demo_shader` function on each pixel. Since the tree
has a depth of 16, we have `2^16 = 65536 pixels`, which is enough to fill all
cores of an RTX 4090. Moreover, since `demo_shader` isn't doing many
allocations, it can operate entirely inside the GPU's "shared memory" (L1
cache). Each GPU thread has a local space of 64 IC nodes. Functions that don't
need more than that, like `demo_shader`, can run up to 5x faster!

On my GPU, it performs `22,000 MIPS` out of the box, and `40,000+ MIPS` with a
tweak on the generated CUDA file (doubling the `TPC`, which doubles the number
of threads per block). In the near future, we plan to add immutable textures,
allowing for single-interaction sampling. With some napkin math, this should be
enough to render 3D games in real-time. Imagine a future where game engines are
written in Python-like languages. That's the future we're building, with Bend!

To be continued...
------------------

This guide isn't exhaustive, and there's a lot left uncovered. For example, Bend also
has an entire "secret" Haskell-like syntax that is compatible with old HVM1.
[Here](https://gist.github.com/VictorTaelin/9cbb43e2b1f39006bae01238f99ff224) is
an implementation of the Bitonic Sort with Haskell-like equations. We'll
document its syntax here soon!

Community
---------

Remember: Bend is very new and experimental. Bugs and imperfections should be
expected. That said, [HOC](https://HigherOrderCO.com/) will provide long-term
support to Bend (and its runtime, HVM2). So, if you believe this paradigm will
be big someday, and want to be part of it in these early stages, join us on
[Discord](https://Discord.HigherOrderCO.com/). Report bugs, bring your
suggestions, and let's chat and build this future together!