Introduction
Bend offers the feel and features of expressive languages like Python and Haskell. This includes fast object allocations, full support for higher-order functions with closures, unrestricted recursion, and even continuations.
Bend scales like CUDA: it runs on massively parallel hardware like GPUs, with nearly linear speedup based on core count, and with no explicit parallelism annotations (no thread creation, locks, mutexes, or atomics).
Bend is powered by the HVM2 runtime.
Important Notes
- Bend is designed to scale performance with core count, supporting over 10,000 concurrent threads.
- The current version may have lower single-core performance.
- You can expect substantial performance improvements as we advance our code generation and optimization techniques.
- Windows is not supported yet. Use WSL2 as an alternative.
- Only NVIDIA GPUs are currently supported.
Install
Install dependencies
On Linux
# Install Rust if you haven't already.
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
# For the C version of Bend, use GCC. We recommend a version up to 12.x.
For the CUDA runtime, install the CUDA Toolkit for Linux, version 12.x.
On Mac
# Install Rust if you haven't already.
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
# For the C version of Bend, use GCC. We recommend a version up to 12.x.
Install Bend
- Install HVM2 by running:
# HVM2 is HOC's massively parallel Interaction Combinator evaluator.
# This ensures HVM is correctly installed and accessible.
- Install Bend by running:
# This command will install Bend
# This ensures Bend is correctly installed and accessible.
Getting Started
Running Bend Programs
# Notes
# You can also compile Bend to standalone C/CUDA files using gen-c and gen-cu for maximum performance.
# The code generator is still in its early stages and not as mature as compilers like GCC and GHC.
# Use the -s flag to get more information on:
# Reductions
# Time the code took to run
# Interactions per second (in millions)
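As a rough illustration of how the last statistic relates to the other two, the Python sketch below divides an interaction count by the elapsed time and scales the result to millions. The function name and the sample numbers are hypothetical, and it assumes the rate is a plain count-over-time ratio; it does not reproduce Bend's actual output format.

```python
def interactions_per_second_millions(interactions, seconds):
    """Return the interaction rate in millions per second.

    Assumption: the rate is simply interactions / elapsed seconds.
    """
    return interactions / seconds / 1_000_000


# Hypothetical run: 48 million interactions in half a second.
print(interactions_per_second_millions(48_000_000, 0.5))  # 96.0
```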
Testing Bend Programs
The example below sums all the numbers in the range from start to target. It can be written in two different ways: one that is inherently sequential (and thus cannot be parallelized), and another that is easily parallelizable. (We will use the -s flag in most examples, for the sake of visibility.)
Sequential version:
First, create a file named sequential_sum.bend
# Write this command on your terminal
Then, with your text editor, open the file sequential_sum.bend, copy the code below, and paste it into the file.
# Defines the function Sum with two parameters: start and target
# If the value of start is the same as target, returns start.
return
# If start is not equal to target, recursively call Sum with
# start incremented by 1, and add the result to start.
return +
# This translates to (1 + (2 + (3 + (...... + (999999 + 1000000)))))
# Note that this will overflow the maximum value of a number in Bend
return
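The logic described in the comments above can be sketched in Python, whose syntax is close to Bend's. This is an illustrative sketch, not the Bend program itself: the name sum_sequential is hypothetical, the range is kept small because Python limits recursion depth, and the 24-bit wrap-around in wrap_u24 is an assumption about Bend's unsigned numbers that this section does not state.

```python
def sum_sequential(start, target):
    """Sum the integers from start to target as one chain of additions."""
    # Base case: the range holds a single number.
    if start == target:
        return start
    # Every addition has to wait for the rest of the chain to finish,
    # so there is nothing to run in parallel.
    return start + sum_sequential(start + 1, target)


def wrap_u24(value):
    """Wrap a value as a 24-bit unsigned number would (assumed width)."""
    return value % 2**24


print(sum_sequential(1, 100))  # 5050

# The exact sum of 1..1_000_000 does not fit in 24 bits.
exact = 1_000_000 * 1_000_001 // 2
print(exact, wrap_u24(exact))
```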
Running the file
You can run it using the Rust interpreter (Sequential)
Or you can run it using the C interpreter (Sequential)
If you have an NVIDIA GPU, you can also run it in CUDA (Sequential)
In this version, the next value to be calculated depends on the previous sum, meaning that it cannot proceed until the current computation is complete. Now, let's look at the easily parallelizable version.
Parallelizable version:
First, close the old file, then go to your terminal and create parallel_sum.bend
# Write this command on your terminal
Then, with your text editor, open the file parallel_sum.bend, copy the code below, and paste it into the file.
# Defines the function Sum with two parameters: start and target
# If the value of start is the same as target, returns start.
return
# If start is not equal to target, calculate the midpoint (half),
# then recursively call Sum on both halves.
= / 2
= # (Start -> Half)
=
return +
# A parallelizable sum of numbers from 1 to 1000000
# This translates to (((1 + 2) + (3 + 4)) + ... (999999 + 1000000)...)
return
In this example, the (3 + 4) sum does not depend on the (1 + 2), meaning that it can run in parallel because both computations can happen at the same time.
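The divide-and-conquer shape can be sketched in Python as well. This is a minimal sketch with a hypothetical name, sum_parallel. Python runs it sequentially; the point is only that the two recursive calls share no data, which is the property a parallel runtime can exploit.

```python
def sum_parallel(start, target):
    """Sum the integers from start to target by splitting the range in half."""
    # Base case: the range holds a single number.
    if start == target:
        return start
    # Split at the midpoint. Neither half needs the other's result.
    half = (start + target) // 2
    left = sum_parallel(start, half)        # start -> half
    right = sum_parallel(half + 1, target)  # half + 1 -> target
    return left + right


print(sum_parallel(1, 1000))  # 500500
```

The recursion depth grows with the logarithm of the range size rather than with the size itself, so large ranges stay within Python's recursion limit.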
Running the file
You can run it using the Rust interpreter (Sequential)
Or you can run it using the C interpreter (Parallel)
If you have an NVIDIA GPU, you can also run it in CUDA (Massively parallel)
In Bend, parallelizing a program only requires changing the run command. If your code can run in parallel, it will run in parallel.
Speedup Examples
The code snippet below implements a bitonic sorter with immutable tree rotations. It's not the type of algorithm you would expect to run fast on GPUs. However, since it uses a divide-and-conquer approach, which is inherently parallel, Bend will execute it on multiple threads, with no thread creation and no explicit lock management.
Bitonic Sorter Benchmark
- bend run: CPU, Apple M3 Max: 12.15 seconds
- bend run-c: CPU, Apple M3 Max: 0.96 seconds
- bend run-cu: GPU, NVIDIA RTX 4090: 0.21 seconds
# Sorting Network = just rotate trees!
:
0:
return
:
=
=
=
return
# Rotates sub-trees (Blue/Green Box)
:
0:
return
:
=
return
# Swaps distant values (Red Box)
:
0:
return
:
=
=
=
=
return
# Propagates downwards
:
0:
return
:
=
return
# Swaps a single pair
:
0:
return
:
return
# Testing
# -------
# Generates a big tree
:
0:
return
:
return
# Sums a big tree
:
0:
return
:
=
return +
# Sorts a big tree
return
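For readers more familiar with Python, the textbook list-based bitonic sort can be sketched as follows. This is not the tree-rotation program above: it is the classic formulation, it assumes the input length is a power of two, and the names bitonic_sort and bitonic_merge are illustrative.

```python
def bitonic_merge(xs, ascending):
    """Sort a bitonic sequence (rising then falling, or the reverse)."""
    n = len(xs)
    if n <= 1:
        return list(xs)
    half = n // 2
    lo, hi = list(xs[:half]), list(xs[half:])
    # Compare-and-swap elements that sit half a sequence apart.
    for i in range(half):
        if (lo[i] > hi[i]) == ascending:
            lo[i], hi[i] = hi[i], lo[i]
    # The two halves are now independent bitonic sequences.
    return bitonic_merge(lo, ascending) + bitonic_merge(hi, ascending)


def bitonic_sort(xs, ascending=True):
    """Sort a list whose length is a power of two."""
    n = len(xs)
    if n <= 1:
        return list(xs)
    if n & (n - 1) != 0:
        raise ValueError("length must be a power of two")
    half = n // 2
    # Sort the halves in opposite directions to form a bitonic sequence.
    first = bitonic_sort(xs[:half], True)
    second = bitonic_sort(xs[half:], False)
    return bitonic_merge(first + second, ascending)


print(bitonic_sort([3, 7, 4, 8, 6, 2, 1, 5]))  # [1, 2, 3, 4, 5, 6, 7, 8]
```

In both functions, the two recursive calls work on disjoint data, which is the divide-and-conquer structure that lets a runtime evaluate them at the same time.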
If you are interested in other algorithms, you can check our examples folder.
Additional Resources
- To understand the technology behind Bend, check out the HVM2 paper.
- We are working on official documentation. In the meantime, for a more in-depth explanation, check GUIDE.md.
- Read about our features at FEATURES.md.
- Bend is developed by HigherOrderCO - join our Discord!