Bend
Bend is a massively parallel, high-level programming language.
Unlike low-level alternatives like CUDA and Metal, Bend has the feeling and features of expressive languages like Python and Haskell, including fast object allocations, higher-order functions with full closure support, unrestricted recursion, even continuations. Yet, it runs on massively parallel hardware like GPUs, with near-linear speedup based on core count, and zero explicit parallel annotations: no thread spawning, no locks, mutexes, atomics. Bend is powered by the HVM2 runtime.
Using Bend
First, install Rust nightly. Then, install both HVM2 and Bend with:
Finally, write some Bend file, and run it with one of these commands:
You can also compile Bend to standalone C/CUDA files with gen-c and gen-cu, for maximum performance. But keep in mind that our code generation is still in its infancy, and is nowhere near as mature as SOTA compilers like GCC and GHC.
Parallel Programming in Bend
To write parallel programs in Bend, all you have to do is... nothing. Other than not making it inherently sequential! For example, the expression ((1 + 2) + 3) + 4 cannot run in parallel, because +4 depends on +3, which depends on (1+2). But the expression (1 + 2) + (3 + 4) can run in parallel, because (1+2) and (3+4) are independent; and it will, per Bend's fundamental pledge:
Everything that can run in parallel, will run in parallel.
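To make the dependency argument concrete, here is a minimal Python sketch (not Bend code) that measures two things for an addition tree: the total number of additions, and the longest chain of additions that must happen one after another. The tuple encoding and the names `work` and `span` are assumptions made for this illustration.

```python
# An expression is either an int leaf or a tuple ("+", left, right).

def work(expr):
    """Total number of additions in the expression."""
    if isinstance(expr, int):
        return 0
    _, left, right = expr
    return 1 + work(left) + work(right)

def span(expr):
    """Longest chain of additions that must run one after another."""
    if isinstance(expr, int):
        return 0
    _, left, right = expr
    return 1 + max(span(left), span(right))

sequential = ("+", ("+", ("+", 1, 2), 3), 4)  # ((1 + 2) + 3) + 4
parallel = ("+", ("+", 1, 2), ("+", 3, 4))    # (1 + 2) + (3 + 4)

print(work(sequential), span(sequential))  # 3 3
print(work(parallel), span(parallel))      # 3 2
```

Both expressions perform three additions, but the second has a dependency chain of only two steps, which is the room a parallel runtime can exploit.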
For a more complete example, consider:
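def sum(depth, x):
  switch depth:
    case 0:
      return x
    case _:
      fst = sum(depth-1, x*2+0) # adds the fst half
      snd = sum(depth-1, x*2+1) # adds the snd half
      return fst + snd

def main:
  return sum(30, 0)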
This code adds all numbers from 0 to 2^30, but, instead of a loop, we use a recursive divide-and-conquer approach. Since this approach is inherently parallel, Bend will run it multi-threaded. Some benchmarks:
- CPU, Apple M3 Max, 1 thread: 3.5 minutes
- CPU, Apple M3 Max, 16 threads: 10.26 seconds
- GPU, NVIDIA RTX 4090, 32k threads: 1.88 seconds
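As a sanity check, the following Python sketch (not Bend code) models the divide-and-conquer sum sequentially and computes the speedups implied by the timings listed above. The function name `tree_sum` and the index scheme (each call splits into halves indexed `x*2` and `x*2+1`) are assumptions for this illustration. Python integers are unbounded, so this models only the shape of the recursion.

```python
def tree_sum(depth, x=0):
    """Sequential model of a divide-and-conquer sum over 2**depth leaves."""
    if depth == 0:
        return x
    fst = tree_sum(depth - 1, x * 2 + 0)  # first half of the range
    snd = tree_sum(depth - 1, x * 2 + 1)  # second half of the range
    return fst + snd  # the two halves never depend on each other

# For a small depth the result matches a plain loop over 0 .. 2**depth - 1.
depth = 10
assert tree_sum(depth) == sum(range(2 ** depth))

# Speedups implied by the reported timings (3.5 minutes = 210 seconds).
single_thread = 3.5 * 60
speedup_16_threads = single_thread / 10.26
speedup_gpu = single_thread / 1.88
print(round(speedup_16_threads, 1), round(speedup_gpu, 1))  # 20.5 111.7
```

The two recursive calls share no data, which is why this shape of program leaves so much for a parallel runtime to pick up.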
That's a 111x speedup by doing nothing (3.5 minutes is 210 seconds, and 210 / 1.88 ≈ 111). No thread spawning, no explicit management of locks or mutexes. We just asked Bend to run our program on an RTX 4090, and it did. Simple as that.
Bend isn't limited to a specific paradigm, like tensors or actors. Any concurrent system, from shaders to Erlang-like actor models, can be emulated on Bend. For example, to render images in real time, we could simply allocate an immutable tree on each frame:
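# given a shader, returns a square image
def render(depth, shader):
  bend d = 0, i = 0:
    when d < depth:
      color = (fork(d+1, i*2+0), fork(d+1, i*2+1))
    else:
      width = depth / 2
      color = shader(i % width, i / width)
  return color

# given a position, returns a color
# for this demo, it just busy loops
def demo_shader(x, y):
  bend i = 0:
    when i < 100000:
      color = fork(i + 1)
    else:
      color = 0x000001
  return color

# renders a 256x256 image using demo_shader
def main:
  return render(16, demo_shader)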
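For readers more familiar with Python, the tree-per-frame idea can be sketched sequentially as follows. This is an illustrative model, not Bend code: the names `render`, `build`, `demo_shader` and `leaves` are hypothetical, the shader is a stand-in that encodes the pixel position instead of busy looping, and the width is assumed to be `2 ** (depth // 2)`, so that a depth of 16 would correspond to a 256x256 image.

```python
def render(depth, shader):
    """Build an immutable binary tree with 2**depth leaves, one per pixel."""
    width = 2 ** (depth // 2)  # assumption: square image, even depth

    def build(d, i):
        if d < depth:
            # The two subtrees are independent of each other.
            return (build(d + 1, i * 2 + 0), build(d + 1, i * 2 + 1))
        return shader(i % width, i // width)

    return build(0, 0)

def demo_shader(x, y):
    """Hypothetical shader: encodes the pixel position as a color value."""
    return (x << 8) | y

def leaves(tree):
    """Flatten the tree back into a row-major list of pixel colors."""
    if isinstance(tree, tuple):
        return leaves(tree[0]) + leaves(tree[1])
    return [tree]

image = leaves(render(4, demo_shader))  # 4x4 image, 16 pixels
print(len(image))  # 16
```

Each leaf is computed from its index alone, so no pixel waits on another; the tree only records where each result belongs.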
And it would actually work. Even involved algorithms, such as a Bitonic Sort using tree rotations, parallelize well on Bend. Long-distance communication is performed by global beta-reduction (as per the Interaction Calculus), and synchronized correctly and efficiently by HVM2's atomic linker.
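To show why bitonic sort suits this model, here is a minimal Python sketch of the classic recursive algorithm on lists whose length is a power of two. It is a plain list-based version, not the tree-rotation formulation mentioned above, and the function names are assumptions for illustration. Note that every function makes two recursive calls that are independent of each other.

```python
def bitonic_merge(xs, ascending):
    """Sort a bitonic sequence (length must be a power of two)."""
    n = len(xs)
    if n == 1:
        return list(xs)
    half = n // 2
    pairs = [(min(xs[i], xs[i + half]), max(xs[i], xs[i + half]))
             for i in range(half)]
    small = [lo for lo, _ in pairs]
    large = [hi for _, hi in pairs]
    first, second = (small, large) if ascending else (large, small)
    # The two halves can be merged independently.
    return bitonic_merge(first, ascending) + bitonic_merge(second, ascending)

def bitonic_sort(xs, ascending=True):
    """Sort a list whose length is a power of two."""
    n = len(xs)
    if n <= 1:
        return list(xs)
    half = n // 2
    # Sorting the halves in opposite directions yields a bitonic sequence.
    up = bitonic_sort(xs[:half], True)
    down = bitonic_sort(xs[half:], False)
    return bitonic_merge(up + down, ascending)

print(bitonic_sort([3, 7, 4, 8, 6, 2, 1, 5]))  # [1, 2, 3, 4, 5, 6, 7, 8]
```

The comparison pattern is fixed by the input length rather than by the data, so the independent halves can be processed side by side.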
- To jump straight into action, check Bend's GUIDE.md.
- For an extensive list of features, check FEATURES.md.
- To understand the tech behind Bend, check HVM2's paper.