# typed-ski Rust (WASM arena)
This directory contains the Rust implementation of the arena allocator and
reducer used by typed-ski's WASM evaluators.
## Overview
**This Rust code targets `wasm32-unknown-unknown` and runs inside WebAssembly.**
It provides:
- **Hash-consing arena allocation** with structural sharing for memory
efficiency
- **Iterative reduction algorithm** to prevent stack overflow on deep
expressions
- **Preemptive multitasking** with cooperative yielding for fair worker
scheduling
- **Lock-free ring buffers** (io_uring-style) for efficient inter-thread
communication
- **[SharedArrayBuffer](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/SharedArrayBuffer)**
support for true parallel evaluation across Web Workers
- **Dynamic arena growth** with seqlock-based synchronization
## Architecture
The arena allocator supports two modes:
- **Single-threaded (Heap Mode)**: Uses local WASM memory with lazy allocation
- **Multi-threaded (SAB Mode)**: Uses
[WebAssembly shared memory][wasm-shared-memory] (SharedArrayBuffer) to enable
concurrent access from multiple Web Workers, with atomic operations and
lock-free synchronization for thread safety
Hash consing is implemented using a hash table with chaining, ensuring that
expressions with the same structure (same left and right children) are
automatically deduplicated. The arena can grow dynamically up to a fixed maximum
capacity.
### Thread Safety and Locking
In SAB (SharedArrayBuffer) mode, the arena relies on lock-free, atomic-based
concurrency control: optimistic hash-consing allocation, io_uring-style
lock-free rings for host ↔ worker communication, and seqlock-style
synchronization for arena growth. See
[Thread Safety & Concurrency](#thread-safety--concurrency) below for details.
[wasm-shared-memory]: https://webassembly.github.io/spec/core/syntax/modules.html#memories
[soa]: https://en.wikipedia.org/wiki/AoS_and_SoA
## Key Optimizations
### 1. Hash-Consing (Structural Sharing)
The arena uses [hash consing](https://en.wikipedia.org/wiki/Hash_consing) to
deduplicate identical subexpressions:
- **Avalanche hashing**: Fast, high-quality hash function for node pairs
- **[Separate chaining](https://en.wikipedia.org/wiki/Hash_table#Separate_chaining)**:
Collision resolution in hash table buckets
- **Memory efficiency**:
[DAG](https://en.wikipedia.org/wiki/Directed_acyclic_graph) representation
instead of tree (common subexpressions shared)
- **O(1) lookups**: Fast deduplication during expression construction
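As a rough illustration, a single-threaded sketch of this scheme might look like
the following. The names (`Arena`, `alloc_cons`), the `kind` encoding, and the
hash constants are illustrative assumptions, not the crate's actual API:

```rust
const EMPTY: u32 = u32::MAX; // sentinel for "no entry" in buckets/chains

struct Arena {
    kind: Vec<u8>,      // 2 = NonTerm in this sketch (illustrative)
    left_id: Vec<u32>,
    right_id: Vec<u32>,
    hash32: Vec<u32>,
    next_idx: Vec<u32>, // collision chain links
    buckets: Vec<u32>,  // bucket heads; length is a power of two
}

// Avalanche-style mix of the (left, right) pair (constants are illustrative).
fn hash_pair(left: u32, right: u32) -> u32 {
    let mut h = (left as u64).wrapping_mul(0x9E37_79B9_7F4A_7C15) ^ (right as u64);
    h ^= h >> 33;
    h = h.wrapping_mul(0xFF51_AFD7_ED55_8CCD);
    h ^= h >> 33;
    h as u32
}

impl Arena {
    /// Return the canonical node for `(left right)`, allocating only if no
    /// structurally identical node already exists (hash consing).
    fn alloc_cons(&mut self, left: u32, right: u32) -> u32 {
        let h = hash_pair(left, right);
        let bucket = (h as usize) & (self.buckets.len() - 1);
        // Walk the collision chain looking for an existing node.
        let mut cur = self.buckets[bucket];
        while cur != EMPTY {
            let i = cur as usize;
            if self.left_id[i] == left && self.right_id[i] == right {
                return cur; // structural hit: reuse the shared node
            }
            cur = self.next_idx[i];
        }
        // Miss: append a new NonTerm node and push it onto the bucket chain.
        let id = self.kind.len() as u32;
        self.kind.push(2);
        self.left_id.push(left);
        self.right_id.push(right);
        self.hash32.push(h);
        self.next_idx.push(self.buckets[bucket]);
        self.buckets[bucket] = id;
        id
    }
}
```

Because identical `(left, right)` pairs always resolve to the same node ID, the
expression graph is a DAG rather than a tree.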
### 2. Iterative Reduction Algorithm
Avoids recursion stack overflow on deep expressions:
- **Explicit stack management**: Uses Continuation nodes instead of call stack
- **Two-phase reduction**: Left child first, then right child
- **Memory recycling**: Dead continuation frames reused immediately
- **Scalable depth**: No theoretical limit on expression depth
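A minimal sketch of the idea, using a plain `Vec` as the explicit stack in place
of the arena's Continuation nodes and a boxed tree instead of node IDs (all
names here are illustrative, not the crate's actual algorithm):

```rust
#[derive(Clone, Debug, PartialEq)]
enum Expr {
    S, K, I,
    App(Box<Expr>, Box<Expr>),
}

fn reduce(mut expr: Expr, max_steps: usize) -> Expr {
    for _ in 0..max_steps {
        // Phase 1: walk down the left spine, pushing arguments onto an
        // explicit stack instead of recursing.
        let mut args: Vec<Expr> = Vec::new();
        let mut head = expr.clone();
        while let Expr::App(f, a) = head {
            args.push(*a);
            head = *f;
        }
        // Phase 2: apply a combinator rule to the head and its arguments.
        let rewritten = match head {
            Expr::I if !args.is_empty() => {
                let x = args.pop().unwrap();
                rebuild(x, args)                      // I x => x
            }
            Expr::K if args.len() >= 2 => {
                let x = args.pop().unwrap();
                args.pop();                           // K x y => x (y discarded)
                rebuild(x, args)
            }
            Expr::S if args.len() >= 3 => {
                let x = args.pop().unwrap();
                let y = args.pop().unwrap();
                let z = args.pop().unwrap();
                let xz = Expr::App(Box::new(x), Box::new(z.clone()));
                let yz = Expr::App(Box::new(y), Box::new(z));
                rebuild(Expr::App(Box::new(xz), Box::new(yz)), args) // S x y z => (x z)(y z)
            }
            _ => return expr, // head is already in weak normal form
        };
        expr = rewritten;
    }
    expr
}

// Re-apply the remaining (outer) arguments around the rewritten head.
fn rebuild(head: Expr, mut pending: Vec<Expr>) -> Expr {
    let mut e = head;
    while let Some(a) = pending.pop() {
        e = Expr::App(Box::new(e), Box::new(a));
    }
    e
}
```

The real reducer keeps the equivalent of `args` inside the arena as
Continuation nodes, so the stack depth is bounded only by arena capacity, not
by the native call stack.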
### 3. Preemptive Multitasking
Enables fair scheduling in parallel evaluation:
- **Gas-based yielding**: Workers yield control when traversal budget exhausted
- **Suspension nodes**: Capture complete evaluation state for resumption
- **Cooperative scheduling**: Prevents worker starvation
- **State serialization**: Evaluation context stored directly in arena
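A sketch of the yielding discipline, assuming a step function that either makes
progress or reports a normal form; `Outcome::Suspended` here stands in for the
arena's Suspension nodes (names are illustrative):

```rust
enum Outcome<E> {
    Done(E),
    // Resume later with fresh gas; the real code stores this state in the arena.
    Suspended { state: E, steps_used: u32 },
}

fn run_with_gas<E>(
    mut state: E,
    gas: u32,
    mut step: impl FnMut(E) -> Result<E, E>, // Ok = progressed, Err = normal form
) -> Outcome<E> {
    let mut remaining = gas;
    loop {
        if remaining == 0 {
            // Budget exhausted: yield to the scheduler instead of starving
            // other requests queued on this worker.
            return Outcome::Suspended { state, steps_used: gas };
        }
        match step(state) {
            Ok(next) => { state = next; remaining -= 1; }
            Err(normal_form) => return Outcome::Done(normal_form),
        }
    }
}
```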
### 4. Lock-Free Ring Buffers
High-performance inter-thread communication:
- **[io_uring](https://en.wikipedia.org/wiki/Io_uring)-style design**:
Submission and completion queues
- **Atomic operations**:
[Wait-free](https://en.wikipedia.org/wiki/Non-blocking_algorithm)
producer/consumer patterns
- **[Sequence numbers](https://en.wikipedia.org/wiki/ABA_problem)**: ABA
prevention in concurrent access
- **Blocking waits**: Efficient WASM atomic wait/notify
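A minimal sketch of a bounded ring with per-slot sequence numbers
(Vyukov-style), used single-producer/single-consumer as the SQ/CQ rings are.
Field names are illustrative; the real rings live in the SharedArrayBuffer and
block on wasm atomic wait/notify instead of returning `Err`/`None` when
full/empty:

```rust
use std::sync::atomic::{AtomicU32, Ordering};

const RING_SIZE: u32 = 64; // power of two

// One ring entry: payload plus a sequence number that distinguishes
// empty/full slots and prevents ABA across wrap-arounds.
struct Slot {
    seq: AtomicU32,
    node_id: AtomicU32,
    req_id: AtomicU32,
}

struct Ring {
    slots: Vec<Slot>,
    tail: AtomicU32, // producer cursor
    head: AtomicU32, // consumer cursor
}

impl Ring {
    fn new() -> Self {
        let slots = (0..RING_SIZE)
            .map(|i| Slot {
                seq: AtomicU32::new(i), // slot i is writable when seq == i
                node_id: AtomicU32::new(0),
                req_id: AtomicU32::new(0),
            })
            .collect();
        Ring { slots, tail: AtomicU32::new(0), head: AtomicU32::new(0) }
    }

    /// Producer side (e.g. the host pushing into the SQ).
    fn push(&self, node_id: u32, req_id: u32) -> Result<(), ()> {
        let pos = self.tail.load(Ordering::Relaxed);
        let slot = &self.slots[(pos & (RING_SIZE - 1)) as usize];
        if slot.seq.load(Ordering::Acquire) != pos {
            return Err(()); // ring full
        }
        slot.node_id.store(node_id, Ordering::Relaxed);
        slot.req_id.store(req_id, Ordering::Relaxed);
        // Publish: the consumer may read the payload only after seeing pos + 1.
        slot.seq.store(pos.wrapping_add(1), Ordering::Release);
        self.tail.store(pos.wrapping_add(1), Ordering::Relaxed);
        Ok(())
    }

    /// Consumer side (e.g. a worker draining the SQ).
    fn pop(&self) -> Option<(u32, u32)> {
        let pos = self.head.load(Ordering::Relaxed);
        let slot = &self.slots[(pos & (RING_SIZE - 1)) as usize];
        if slot.seq.load(Ordering::Acquire) != pos.wrapping_add(1) {
            return None; // ring empty
        }
        let entry = (
            slot.node_id.load(Ordering::Relaxed),
            slot.req_id.load(Ordering::Relaxed),
        );
        // Mark the slot writable again for the next lap around the ring.
        slot.seq.store(pos.wrapping_add(RING_SIZE), Ordering::Release);
        self.head.store(pos.wrapping_add(1), Ordering::Relaxed);
        Some(entry)
    }
}
```

With one dedicated producer and one dedicated consumer per ring, the sequence
check plus acquire/release ordering is enough to publish entries safely without
locks.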
## Arena Node Types
The arena supports four node types for different purposes:
- **`Terminal`**: SKI combinators (S, K, I) - leaf nodes
- **`NonTerm`**: Function application `(left right)` - core expressions
- **`Continuation`**: Stack frames for iterative reduction - optimization
- **`Suspension`**: Paused evaluation state for multitasking - optimization
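An illustrative encoding of these kinds as they might appear in the `kind`
column; the actual discriminant values are an assumption here, apart from
`0`, which the concurrency section below describes as the "hole" marker for a
lost allocation race:

```rust
#[repr(u8)]
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum NodeKind {
    Hole = 0,         // failed optimistic allocation; treated as a terminal
    Terminal = 1,     // S, K, I leaves (symbol stored in `sym`)
    NonTerm = 2,      // application `(left right)`
    Continuation = 3, // explicit stack frame for iterative reduction
    Suspension = 4,   // paused evaluation state (gas exhausted)
}
```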
## Execution Model (Host ↔ Workers)
- The host process (JS/TS) is responsible for **building expressions in the
arena** (via `allocTerminal`/`allocCons`).
- The host then submits work to workers using the SQ/CQ rings.
- Workers run an exported infinite loop (`workerLoop`) that:
- blocks waiting for SQ entries (`memory.atomic.wait32`)
- reduces the submitted root using the iterative algorithm with gas limits
- yields control via Suspensions when gas exhausted (preemptive multitasking)
- enqueues a CQ completion when reduction finishes or yields
### Correlation IDs
Work is correlated using a 32-bit `req_id` chosen by the host.
- SQ entry: `{ node_id: u32, req_id: u32 }`
- CQ entry: `{ node_id: u32, req_id: u32 }`
Completions may arrive out-of-order; `req_id` is how the host matches results to
callers.
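As an illustration of this correlation flow, here is a hedged Rust sketch of
what a host-side caller does (the real host is JS/TS; the extern signatures
mirror the exports documented under "WASM Exports" below, and the polling loop
is deliberately simplified):

```rust
#[allow(non_snake_case)]
extern "C" {
    fn hostSubmit(node_id: u32, req_id: u32, max_steps: u32) -> u32;
    fn hostPull() -> i64;
}

fn submit_and_poll(root: u32) -> Option<u32> {
    let req_id: u32 = 42; // any value the host can match on later
    unsafe {
        if hostSubmit(root, req_id, 1_000_000) != 0 {
            return None; // SQ full or not connected
        }
        loop {
            let packed = hostPull();
            if packed < 0 {
                continue; // CQ empty; a real host would await a notification
            }
            let got_req = (packed as u64 >> 32) as u32;        // high 32 bits: req_id
            let node_id = (packed as u64 & 0xFFFF_FFFF) as u32; // low 32 bits: node_id
            if got_req == req_id {
                return Some(node_id);
            }
            // Otherwise this completion belongs to another caller; a real
            // host routes it by req_id instead of dropping it.
        }
    }
}
```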
## Thread Safety & Concurrency
### Hash-Consing Under Concurrency
`allocCons` implements optimistic concurrency control:
1. **Optimistic read**: Traverse bucket chain without locking
2. **Pre-allocate ID**: Use `top.fetch_add()` to reserve slot
3. **[CAS](https://en.wikipedia.org/wiki/Compare-and-swap) insertion**: Attempt
to insert into hash table bucket
4. **Graceful failure**: If CAS fails, slot becomes a "hole" (`kind = 0`)
**Correctness guarantee**: Canonical node IDs are always returned. Readers treat
non-`NonTerm` kinds as terminal values, so holes don't affect evaluation.
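The following sketch shows the shape of that path, assuming `AtomicU32` views
over the shared `buckets`, `next_idx`, `left_id`/`right_id`, `kind` and `top`
fields; hashing, bounds checks and resize handling are omitted, and all names
are illustrative:

```rust
use std::sync::atomic::{AtomicU32, Ordering};

const EMPTY: u32 = u32::MAX;

struct SharedArena {
    top: AtomicU32,          // next free node index
    buckets: Vec<AtomicU32>, // bucket heads
    next_idx: Vec<AtomicU32>,
    left_id: Vec<AtomicU32>,
    right_id: Vec<AtomicU32>,
    kind: Vec<AtomicU32>,    // widened to u32 here for simplicity
}

impl SharedArena {
    fn alloc_cons(&self, left: u32, right: u32, bucket: usize) -> u32 {
        loop {
            // 1. Optimistic read: scan the chain without taking any lock.
            let head = self.buckets[bucket].load(Ordering::Acquire);
            let mut cur = head;
            while cur != EMPTY {
                let i = cur as usize;
                if self.left_id[i].load(Ordering::Relaxed) == left
                    && self.right_id[i].load(Ordering::Relaxed) == right
                {
                    return cur; // canonical node already exists
                }
                cur = self.next_idx[i].load(Ordering::Acquire);
            }
            // 2. Pre-allocate an ID; the slot is ours even if we lose the race.
            let id = self.top.fetch_add(1, Ordering::Relaxed);
            let i = id as usize;
            self.left_id[i].store(left, Ordering::Relaxed);
            self.right_id[i].store(right, Ordering::Relaxed);
            self.next_idx[i].store(head, Ordering::Relaxed);
            self.kind[i].store(2, Ordering::Release); // 2 = NonTerm (illustrative)
            // 3. CAS the new node in as the bucket head.
            if self.buckets[bucket]
                .compare_exchange(head, id, Ordering::AcqRel, Ordering::Acquire)
                .is_ok()
            {
                return id;
            }
            // 4. Lost the race: leave this slot as a hole (kind = 0) and
            //    retry; the rescan will normally find the winner's node.
            self.kind[i].store(0, Ordering::Relaxed);
        }
    }
}
```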
### Stop-the-World Resizing
Uses [seqlock](https://en.wikipedia.org/wiki/Seqlock)-style synchronization for
arena growth:
- **Odd sequence**: Resize in progress (writers spin, readers retry)
- **Array migration**: Reverse-order copying prevents overlap issues
- **Bucket rebuild**: Hash table reconstructed after resize
- **Even sequence**: Stable state (normal operation resumes)
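A sketch of the reader side of this protocol, assuming `resize_seq` is an
`AtomicU32` in the shared header (the names and the exact retry policy are
assumptions):

```rust
use std::sync::atomic::{AtomicU32, Ordering};

fn read_consistent<T>(resize_seq: &AtomicU32, read: impl Fn() -> T) -> T {
    loop {
        let before = resize_seq.load(Ordering::Acquire);
        if before & 1 == 1 {
            core::hint::spin_loop(); // odd: resize in progress, wait it out
            continue;
        }
        let value = read();
        // If the sequence is unchanged (and therefore still even), no resize
        // overlapped the read and `value` is consistent.
        if resize_seq.load(Ordering::Acquire) == before {
            return value;
        }
    }
}
```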
### Ring Buffer Concurrency
- **Single producer/consumer**: Each ring has dedicated thread roles
- **[ABA prevention](https://en.wikipedia.org/wiki/ABA_problem)**: Sequence
numbers prevent concurrent access issues
- **[Memory barriers](https://en.wikipedia.org/wiki/Memory_barrier)**: Proper
acquire/release ordering for visibility
- **[Wait/notify](https://en.wikipedia.org/wiki/Monitor_(synchronization))**:
Efficient blocking with WASM atomic operations
### Shared Memory Safety
- **Typed array views**: JS-side access to arena arrays
- **Cache coherence**: 64-byte alignment prevents false sharing
- **Poisoning on failure**: Unrecoverable errors set poison sequences
- **Cross-thread visibility**: SharedArrayBuffer enables true parallelism
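The cache-line alignment mentioned above can be pictured as wrapping hot
atomics in 64-byte-aligned cells so that, for example, the producer- and
consumer-owned ring cursors never share a cache line (field names are
illustrative):

```rust
use std::sync::atomic::AtomicU32;

#[repr(C, align(64))]
struct CacheAligned<T>(T);

struct RingCursors {
    sq_tail: CacheAligned<AtomicU32>, // written by the host (SQ producer)
    cq_head: CacheAligned<AtomicU32>, // written by the host (CQ consumer)
    sq_head: CacheAligned<AtomicU32>, // written by workers (SQ consumer)
    cq_tail: CacheAligned<AtomicU32>, // written by workers (CQ producer)
}
```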
## Performance Characteristics
### Time Complexity
- **Allocation**: O(1) amortized (hash table lookup + potential growth)
- **Hash-consing lookup**: O(1) average case (hash table with chaining)
- **Reduction step**: O(depth) with iterative algorithm (no stack overflow)
- **Ring operations**: O(1) (lock-free with atomic CAS)
### Space Complexity
- **Per node**: ~18 bytes across the SoA arrays (four `u32` columns plus the
`kind` and `sym` bytes), excluding the shared hash-table buckets
- **Hash table**: O(capacity) for buckets and collision chains
- **Structural sharing**: Significant savings for common subexpressions
- **Memory efficiency**: ~60% less memory than naive tree representation
### Scalability
- **Expression depth**: Unbounded (iterative algorithm; no recursion stack)
- **Parallel workers**: Linear scaling with CPU cores
- **Memory growth**: Dynamic resizing with seqlock synchronization
- **Concurrent allocations**: Lock-free with optimistic concurrency
## Memory Layout
The shared arena is laid out as **header + fixed rings + SoA arrays**.
### Header (`SabHeader`)
- Magic number, ring parameters, memory offsets
- Capacity, bucket masks, atomic counters
- `max_steps`, `resize_seq`, `top` for coordination
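An illustrative (not authoritative) shape for these header fields; the actual
field order, widths and padding are defined in `src/arena.rs`:

```rust
use std::sync::atomic::AtomicU32;

#[repr(C, align(64))]
struct SabHeader {
    magic: u32,             // sanity check when connecting another instance
    capacity: AtomicU32,    // current node capacity (grows up to a fixed max)
    bucket_mask: AtomicU32, // bucket count minus one (power of two)
    top: AtomicU32,         // next free node index (fetch_add allocation)
    resize_seq: AtomicU32,  // seqlock: odd = resize in progress
    max_steps: AtomicU32,   // default reduction budget
    sq_offset: u32,         // byte offsets of the rings and SoA arrays
    cq_offset: u32,
    nodes_offset: u32,
}
```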
### Rings (64-byte aligned)
- **SQ (Submission Queue)**: Host → Worker communication
- **CQ (Completion Queue)**: Worker → Host results
- Lock-free with sequence numbers for ABA prevention
### SoA Arrays ([Struct of Arrays][soa])
- `kind: u8[]` - Node type (Terminal/NonTerm/Continuation/Suspension)
- `sym: u8[]` - Symbol/mode data
- `left_id: u32[]` - Left child pointers
- `right_id: u32[]` - Right child pointers
- `hash32: u32[]` - Hash values for deduplication
- `next_idx: u32[]` - Hash table collision chains
- `buckets: u32[]` - Hash table bucket heads
- Terminal cache: Fast access to S/K/I nodes
The JS/TS side constructs typed array views over these arrays for fast read-only
decoding and direct memory access.
## Building
The build is managed by Nix. `Cargo.toml` is generated by the flake.
```bash
nix run .#generate-cargo
cargo build --target wasm32-unknown-unknown --release
```
## WASM Exports
### Arena / reducer
- `initArena(capacity: u32) -> u32`: allocate and initialize a shared arena
region (returns base pointer or error code).
- `connectArena(ptr: u32) -> u32`: connect an instance to an existing arena
(returns 1 on success).
- `reset()`
- `allocTerminal(sym: u32) -> u32`
- `allocCons(left: u32, right: u32) -> u32`
- `arenaKernelStep(expr: u32) -> u32`
- `reduce(expr: u32, max: u32) -> u32`
### SQ/CQ interface
- `hostSubmit(node_id: u32, req_id: u32, max_steps: u32) -> u32`
- returns `0` ok, `1` SQ full, `2` not connected
- `max_steps` is the reduction limit for this specific request; passing it per
request avoids racing on a single shared limit
- `hostPull() -> i64`
- returns `-1` if CQ empty, otherwise a packed value:
- high 32 bits: `req_id`
- low 32 bits: `node_id` (result root)
- `workerLoop() -> !`: blocking loop for worker instances (never returns).
### Debug
- `debugLockState() -> u32`: returns `resize_seq & 1` (1 if resizing).
- `getArenaMode() -> u32`: `0` heap mode, `1` SAB mode.
- `debugGetArenaBaseAddr() -> u32`
- `debugCalculateArenaSize(capacity: u32) -> u32`
## WASM Imports
- `memory: WebAssembly.Memory` (provided via `env.memory`) - The shared memory
instance (must be created with `shared: true` for SAB mode)
- `js_allow_block() -> i32` - Returns `0` for main thread (spin-only) or `1` for
worker threads (can block). This is instance-local and not stored in shared
memory to avoid state collisions.
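On the Rust side these imports might be declared roughly as follows (a sketch;
the real declarations live in the crate sources). `env.memory` itself is not
declared in Rust code: it is imported by building with the wasm linker's
`--import-memory`/`--shared-memory` options.

```rust
#[link(wasm_import_module = "env")]
extern "C" {
    /// 0 = main thread (must spin), 1 = worker thread (may block on
    /// `memory.atomic.wait32`).
    fn js_allow_block() -> i32;
}

fn wait_or_spin() {
    // Illustrative use: choose between blocking and spinning per instance.
    if unsafe { js_allow_block() } == 1 {
        // worker: block on a wasm atomic wait until notified
    } else {
        core::hint::spin_loop(); // main thread: never block the event loop
    }
}
```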
## Further Reading
For deeper understanding of the techniques used:
- **[Hash Consing](https://en.wikipedia.org/wiki/Hash_consing)**:
Memory-efficient structural sharing
- **[Seqlock](https://en.wikipedia.org/wiki/Seqlock)**: Reader-writer
synchronization
- **[Compare-and-Swap](https://en.wikipedia.org/wiki/Compare-and-swap)**: Atomic
operation primitive
- **[Memory Barrier](https://en.wikipedia.org/wiki/Memory_barrier)**: Memory
ordering guarantees
- **[io_uring](https://en.wikipedia.org/wiki/Io_uring)**: Asynchronous I/O
framework
- **[SharedArrayBuffer](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/SharedArrayBuffer)**:
Cross-thread memory sharing
- **[WebAssembly Atomics](https://webassembly.github.io/threads/js-api/)**:
Atomic operations in WASM
## Structure
- `src/lib.rs`: crate entry point
- `src/arena.rs`: arena allocator + rings + reducer