typed-ski 0.8.0

SKI calculus evaluator in Rust compiled to WASM

typed-ski Rust (WASM arena)

This directory contains the Rust implementation of the arena allocator and reducer used by typed-ski's WASM evaluators.

Overview

This Rust code targets wasm32-unknown-unknown and runs inside WebAssembly.

It provides:

  • Hash-consing arena allocation with structural sharing for memory efficiency
  • Iterative reduction algorithm to prevent stack overflow on deep expressions
  • Gas-based preemption via cooperative yield points for fair worker scheduling
  • Lock-free ring buffers (io_uring-style) for efficient inter-thread communication
  • SharedArrayBuffer support for true parallel evaluation across Web Workers
  • Dynamic arena growth with seqlock-based synchronization

Architecture

The arena allocator supports two modes:

  • Single-threaded (Heap Mode): Uses local WASM memory with lazy allocation
  • Multi-threaded (SAB Mode): Uses WebAssembly shared memory (SharedArrayBuffer) to enable concurrent access from multiple Web Workers, with atomic operations and lock-free synchronization for thread safety

Hash consing is implemented using a hash table with chaining, ensuring that expressions with the same structure (same left and right children) are automatically deduplicated. The arena can grow dynamically up to a fixed maximum capacity.
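As a sketch of the idea (field names and the hash mix here are illustrative, not the crate's actual layout), a single-threaded hash-consing arena with separate chaining might look like:

```rust
// Minimal single-threaded sketch of hash consing with separate chaining.
// Names are illustrative; the real arena lives in src/arena.rs.
struct Arena {
    left: Vec<u32>,
    right: Vec<u32>,
    next: Vec<u32>,    // per-node collision chain
    buckets: Vec<u32>, // bucket heads; EMPTY marks an empty bucket
}

const EMPTY: u32 = u32::MAX;

impl Arena {
    fn new(nbuckets: usize) -> Self {
        // nbuckets must be a power of two so `& (len - 1)` works as a mask
        Arena { left: vec![], right: vec![], next: vec![], buckets: vec![EMPTY; nbuckets] }
    }

    fn hash(l: u32, r: u32) -> u32 {
        // cheap avalanche-style mix of the child pair
        let mut h = ((l as u64) << 32) | r as u64;
        h ^= h >> 33;
        h = h.wrapping_mul(0xff51_afd7_ed55_8ccd);
        h ^= h >> 33;
        h as u32
    }

    /// Return the canonical node ID for (l, r), allocating only if unseen.
    fn alloc_cons(&mut self, l: u32, r: u32) -> u32 {
        let b = (Self::hash(l, r) as usize) & (self.buckets.len() - 1);
        // walk the chain: a structurally equal node is reused as-is
        let mut cur = self.buckets[b];
        while cur != EMPTY {
            let i = cur as usize;
            if self.left[i] == l && self.right[i] == r {
                return cur;
            }
            cur = self.next[i];
        }
        // miss: push a fresh node onto the front of the bucket chain
        let id = self.left.len() as u32;
        self.left.push(l);
        self.right.push(r);
        self.next.push(self.buckets[b]);
        self.buckets[b] = id;
        id
    }
}
```

Allocating the same (left, right) pair twice returns the same ID, which is what turns trees into shared DAGs.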

Thread Safety and Locking

In SAB (SharedArrayBuffer) mode, the arena uses lock-free and atomic-based concurrency control to enable high-performance concurrent access:

Optimistic Concurrency Control

Hash-consing allocation uses optimistic concurrency control:

  1. Optimistic read: Traverse bucket chain without locking
  2. Pre-allocate ID: Use top.fetch_add() to reserve slot atomically
  3. CAS insertion: Attempt to insert into hash table bucket
  4. Graceful failure: If CAS fails, slot becomes a "hole" (kind = 0)

Correctness guarantee: Canonical node IDs are always returned. Readers treat non-NonTerm kinds as terminal values, so holes don't affect evaluation.
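Assuming pre-sized atomic arrays and a single hash bucket (the real code in src/arena.rs uses u8 kind/sym arrays, many buckets, and handles growth), the four steps above can be sketched as:

```rust
use std::sync::atomic::{AtomicU32, Ordering};

const EMPTY: u32 = u32::MAX;
const NONTERM: u32 = 2; // illustrative kind value; 0 marks a "hole"

// Optimistic hash-consing insert: scan, reserve, CAS, abandon-on-failure.
fn alloc_cons(
    top: &AtomicU32,
    bucket: &AtomicU32,
    left: &[AtomicU32],
    right: &[AtomicU32],
    next: &[AtomicU32],
    kind: &[AtomicU32],
    l: u32,
    r: u32,
) -> u32 {
    loop {
        // 1. optimistic read: walk the chain without taking a lock
        let head = bucket.load(Ordering::Acquire);
        let mut cur = head;
        while cur != EMPTY {
            let i = cur as usize;
            if left[i].load(Ordering::Relaxed) == l && right[i].load(Ordering::Relaxed) == r {
                return cur; // canonical node already present
            }
            cur = next[i].load(Ordering::Acquire);
        }
        // 2. pre-allocate: reserve a fresh slot atomically
        let id = top.fetch_add(1, Ordering::AcqRel);
        let i = id as usize;
        left[i].store(l, Ordering::Relaxed);
        right[i].store(r, Ordering::Relaxed);
        next[i].store(head, Ordering::Relaxed);
        kind[i].store(NONTERM, Ordering::Release);
        // 3. CAS insertion into the bucket head
        if bucket
            .compare_exchange(head, id, Ordering::AcqRel, Ordering::Acquire)
            .is_ok()
        {
            return id;
        }
        // 4. graceful failure: another thread raced us; the slot becomes a
        // hole and we retry, since the winner may have inserted exactly (l, r)
        kind[i].store(0, Ordering::Release);
    }
}
```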

Lock-Free Ring Buffers

Inter-thread communication uses lock-free ring buffers (io_uring-style):

  • Submission Queue (SQ): Main thread → Worker communication
  • Completion Queue (CQ): Worker → Main thread results
  • Atomic operations: Lock-free producer/consumer patterns
  • Sequence numbers: ABA prevention in concurrent access
  • Blocking waits: Efficient WASM atomic wait/notify
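A minimal single-producer/single-consumer sketch of this scheme (Vyukov-style per-slot sequence numbers, as in io_uring; field names and layout are illustrative, and the blocking wait/notify path is omitted):

```rust
use std::sync::atomic::{AtomicU32, AtomicU64, Ordering};

// SPSC ring with per-slot sequence numbers for ABA prevention.
struct Ring {
    seq: Vec<AtomicU32>,  // per-slot sequence: tracks which "lap" owns the slot
    data: Vec<AtomicU64>, // payloads, e.g. packed (req_id, node_id)
    head: AtomicU32,      // consumer cursor
    tail: AtomicU32,      // producer cursor
    mask: u32,
}

impl Ring {
    fn new(cap: u32) -> Self {
        // cap must be a power of two so `& mask` wraps indices
        Ring {
            seq: (0..cap).map(AtomicU32::new).collect(),
            data: (0..cap).map(|_| AtomicU64::new(0)).collect(),
            head: AtomicU32::new(0),
            tail: AtomicU32::new(0),
            mask: cap - 1,
        }
    }

    /// Producer side; returns false when full (caller may spin or atomic-wait).
    fn push(&self, v: u64) -> bool {
        let t = self.tail.load(Ordering::Relaxed);
        let slot = (t & self.mask) as usize;
        if self.seq[slot].load(Ordering::Acquire) != t {
            return false; // consumer has not freed this slot yet
        }
        self.data[slot].store(v, Ordering::Relaxed);
        self.seq[slot].store(t.wrapping_add(1), Ordering::Release); // publish
        self.tail.store(t.wrapping_add(1), Ordering::Relaxed);
        true
    }

    /// Consumer side; returns None when empty.
    fn pop(&self) -> Option<u64> {
        let h = self.head.load(Ordering::Relaxed);
        let slot = (h & self.mask) as usize;
        if self.seq[slot].load(Ordering::Acquire) != h.wrapping_add(1) {
            return None; // nothing published here yet
        }
        let v = self.data[slot].load(Ordering::Relaxed);
        // mark the slot free for the producer's next lap (h + capacity)
        self.seq[slot].store(h.wrapping_add(self.mask).wrapping_add(1), Ordering::Release);
        self.head.store(h.wrapping_add(1), Ordering::Relaxed);
        Some(v)
    }
}
```

The acquire load on `seq` pairs with the release store on the other side, so a popped payload is always fully visible to the consumer.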

Resize Synchronization

Arena growth uses a seqlock-style approach:

  • Stop-the-world pauses: All threads spin during resize operations
  • Sequence lock: Odd values indicate resize in progress, even values indicate stable
  • Reverse-order copying: Prevents overlap issues during memory migration
  • Bucket rebuild: Hash table reconstructed after resize
  • Poisoning on failure: Unrecoverable errors set poison sequences
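The odd/even protocol can be sketched as follows (names are illustrative; the real counter is the `resize_seq` field in the arena header, and poisoning is omitted):

```rust
use std::hint;
use std::sync::atomic::{AtomicU32, Ordering};

// Seqlock-style resize gate: odd = resize in progress, even = stable.
struct ResizeSeq(AtomicU32);

impl ResizeSeq {
    /// Writer side: enter the resize window (sequence becomes odd).
    fn begin_resize(&self) -> u32 {
        self.0.fetch_add(1, Ordering::AcqRel) + 1
    }

    /// Writer side: leave the resize window (sequence becomes even again).
    fn end_resize(&self) {
        self.0.fetch_add(1, Ordering::Release);
    }

    /// Reader side: spin until no resize is running, return the stable sequence.
    fn wait_stable(&self) -> u32 {
        loop {
            let s = self.0.load(Ordering::Acquire);
            if s & 1 == 0 {
                return s;
            }
            hint::spin_loop();
        }
    }
}
```

A reader that sees the sequence change across an access knows a resize ran underneath it and retries.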

Key Optimizations

1. Hash-Consing (Structural Sharing)

The arena uses hash consing to deduplicate identical subexpressions:

  • Avalanche hashing: Fast, high-quality hash function for node pairs
  • Separate chaining: Collision resolution in hash table buckets
  • Memory efficiency: DAG representation instead of tree (common subexpressions shared)
  • O(1) lookups: Fast deduplication during expression construction

2. Iterative Reduction Algorithm

Avoids recursion stack overflow on deep expressions:

  • Explicit stack management: Uses Continuation nodes instead of call stack
  • Two-phase reduction: Left child first, then right child
  • Memory recycling: Dead continuation frames reused immediately
  • Scalable depth: No theoretical limit on expression depth
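An illustrative sketch of explicit-stack reduction over a plain enum tree (the actual reducer works over arena node IDs with recycled Continuation frames, which this sketch does not model):

```rust
use std::rc::Rc;

// Leftmost-outermost SKI reduction with an explicit spine stack: no recursion,
// so depth is bounded by heap, not the call stack.
#[derive(Debug, PartialEq)]
enum Term { S, K, I, App(Rc<Term>, Rc<Term>) }
use Term::*;

fn app(a: Rc<Term>, b: Rc<Term>) -> Rc<Term> { Rc::new(App(a, b)) }

/// Reduce with a gas budget; returns the (possibly partial) result.
fn reduce(mut t: Rc<Term>, mut gas: u32) -> Rc<Term> {
    let mut spine: Vec<Rc<Term>> = Vec::new(); // head's arguments, innermost last
    loop {
        match &*t {
            App(f, a) => { // unwind the left spine, pushing arguments
                let (f, a) = (f.clone(), a.clone());
                spine.push(a);
                t = f;
            }
            I if !spine.is_empty() => { // I x → x
                if gas == 0 { break; }
                gas -= 1;
                t = spine.pop().unwrap();
            }
            K if spine.len() >= 2 => { // K x y → x
                if gas == 0 { break; }
                gas -= 1;
                let x = spine.pop().unwrap();
                let _ = spine.pop();
                t = x;
            }
            S if spine.len() >= 3 => { // S x y z → x z (y z)
                if gas == 0 { break; }
                gas -= 1;
                let x = spine.pop().unwrap();
                let y = spine.pop().unwrap();
                let z = spine.pop().unwrap();
                t = app(app(x, z.clone()), app(y, z));
            }
            _ => break, // head normal form or under-applied combinator
        }
    }
    // reapply any remaining arguments (innermost first)
    for a in spine.into_iter().rev() {
        t = app(t, a);
    }
    t
}
```

For example, S K K I reduces to K I (K I) and then to I.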

3. Preemptive Multitasking

Enables fair scheduling in parallel evaluation:

  • Gas-based yielding: Workers yield control when traversal budget exhausted
  • Suspension nodes: Capture complete evaluation state for resumption
  • Cooperative scheduling: Prevents worker starvation
  • State serialization: Evaluation context stored directly in arena
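A schematic of the gas/yield protocol with a stand-in step function (the real reducer steps over arena nodes and stores its saved state in a Suspension node inside the arena):

```rust
// Gas-based yielding: run at most `budget` steps, then hand control back to
// the scheduler with a resumable handle instead of monopolizing the worker.
enum StepResult {
    Done(u32),                 // node ID of the reduced result root
    Suspended { resume: u32 }, // node ID of a Suspension to resubmit later
}

/// `step` performs one reduction step, returning None at normal form.
fn run_with_gas(mut root: u32, budget: u32, step: impl Fn(u32) -> Option<u32>) -> StepResult {
    for _ in 0..budget {
        match step(root) {
            Some(next) => root = next,             // one step performed
            None => return StepResult::Done(root), // normal form reached
        }
    }
    // gas exhausted: yield so other submissions get a fair share of the worker
    StepResult::Suspended { resume: root }
}
```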

4. Lock-Free Ring Buffers

High-performance inter-thread communication via the SQ/CQ rings, as described under "Lock-Free Ring Buffers" above.

Arena Node Types

The arena supports four node types for different purposes:

  • Terminal: SKI combinators (S, K, I) - leaf nodes
  • NonTerm: Function application (left right) - core expressions
  • Continuation: Stack frames for iterative reduction - optimization
  • Suspension: Paused evaluation state for multitasking - optimization
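A possible encoding of these kinds, plus the "hole" left by a lost allocation race (the actual discriminant values live in src/arena.rs and may differ):

```rust
// Illustrative node-kind encoding; kind = 0 as a "hole" follows the
// optimistic-allocation scheme described earlier in this document.
#[allow(dead_code)]
#[repr(u8)]
enum Kind {
    Hole = 0,         // abandoned slot from a failed CAS insert
    Terminal = 1,     // S, K, I leaf (symbol stored in `sym`)
    NonTerm = 2,      // application node: (left_id right_id)
    Continuation = 3, // explicit stack frame for the iterative reducer
    Suspension = 4,   // paused evaluation state for gas-based yielding
}
```

Readers treat anything other than NonTerm as terminal, which is why holes are harmless during evaluation.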

Execution Model (Host ↔ Workers)

  • The host process (JS/TS) is responsible for building expressions in the arena (via allocTerminal/allocCons).
  • The host then submits work to workers using the SQ/CQ rings.
  • Workers run an exported infinite loop (workerLoop) that:
    • blocks waiting for SQ entries (memory.atomic.wait32)
    • reduces the submitted root using the iterative algorithm with gas limits
    • yields control via Suspensions when gas exhausted (preemptive multitasking)
    • enqueues a CQ completion when reduction finishes or yields

Correlation IDs

Work is correlated using a 32-bit req_id chosen by the host.

  • SQ entry: { node_id: u32, req_id: u32 }
  • CQ entry: { node_id: u32, req_id: u32 }

Completions may arrive out-of-order; req_id is how the host matches results to callers.
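Under that layout (high 32 bits req_id, low 32 bits node_id, -1 for an empty CQ), an encode/decode pair on the host side might look like the following sketch; note that a completion whose fields are both all-ones would collide with the -1 sentinel, so all-ones IDs are best avoided:

```rust
// Pack/unpack the i64 returned by hostPull: (req_id << 32) | node_id.
fn pack(req_id: u32, node_id: u32) -> i64 {
    (((req_id as u64) << 32) | node_id as u64) as i64
}

fn unpack(v: i64) -> Option<(u32, u32)> {
    if v == -1 {
        return None; // CQ empty
    }
    let v = v as u64;
    Some(((v >> 32) as u32, v as u32)) // (req_id, node_id)
}
```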

Thread Safety & Concurrency

Ring Buffer Concurrency

  • Single producer/consumer: Each ring has dedicated thread roles
  • ABA prevention: Sequence numbers prevent concurrent access issues
  • Memory barriers: Proper acquire/release ordering for visibility
  • Wait/notify: Efficient blocking with WASM atomic operations

Shared Memory Safety

  • Typed array views: JS-side access to arena arrays
  • Cache coherence: 64-byte alignment prevents false sharing
  • Poisoning on failure: Unrecoverable errors set poison sequences
  • Cross-thread visibility: SharedArrayBuffer enables true parallelism

Performance Characteristics

Time Complexity

  • Allocation: O(1) amortized (hash table lookup + potential growth)
  • Hash-consing lookup: O(1) average case (hash table with chaining)
  • Reduction step: O(depth) with iterative algorithm (no stack overflow)
  • Ring operations: O(1) (lock-free with atomic CAS)

Space Complexity

  • Per node: ~18 bytes across the SoA arrays (two u8 fields + four u32 fields), plus amortized hash-bucket overhead
  • Hash table: O(capacity) for buckets and collision chains
  • Structural sharing: Significant savings for common subexpressions
  • Memory efficiency: ~60% less memory than naive tree representation

Scalability

  • Expression depth: Unlimited (iterative vs recursive)
  • Parallel workers: Near-linear scaling with CPU cores for independent reductions
  • Memory growth: Dynamic resizing with seqlock synchronization
  • Concurrent allocations: Lock-free with optimistic concurrency

Memory Layout

The shared arena is laid out as header + fixed rings + SoA arrays.

Header (SabHeader)

  • Magic number, ring parameters, memory offsets
  • Capacity, bucket masks, atomic counters
  • max_steps, resize_seq, top for coordination

Rings (64-byte aligned)

  • SQ (Submission Queue): Host → Worker communication
  • CQ (Completion Queue): Worker → Host results
  • Lock-free with sequence numbers for ABA prevention

SoA Arrays (Struct of Arrays)

  • kind: u8[] - Node type (Terminal/NonTerm/Continuation/Suspension)
  • sym: u8[] - Symbol/mode data
  • left_id: u32[] - Left child pointers
  • right_id: u32[] - Right child pointers
  • hash32: u32[] - Hash values for deduplication
  • next_idx: u32[] - Hash table collision chains
  • buckets: u32[] - Hash table bucket heads
  • Terminal cache: Fast access to S/K/I nodes

The JS/TS side constructs typed array views over these arrays for fast read-only decoding and direct memory access.

Building

The build is managed by Nix. Cargo.toml is generated by the flake.

nix run .#generate-cargo
cargo build --target wasm32-unknown-unknown --release

WASM Exports

Arena / reducer

  • initArena(capacity: u32) -> u32: allocate and initialize a shared arena region (returns base pointer or error code).
  • connectArena(ptr: u32) -> u32: connect an instance to an existing arena (returns 1 on success).
  • reset()
  • allocTerminal(sym: u32) -> u32
  • allocCons(left: u32, right: u32) -> u32
  • arenaKernelStep(expr: u32) -> u32
  • reduce(expr: u32, max: u32) -> u32

SQ/CQ interface

  • hostSubmit(node_id: u32, req_id: u32, max_steps: u32) -> u32
    • returns 0 ok, 1 SQ full, 2 not connected
    • max_steps is the reduction limit for this specific request (eliminates race conditions)
  • hostPull() -> i64
    • returns -1 if CQ empty, otherwise a packed value:
      • high 32 bits: req_id
      • low 32 bits: node_id (result root)
  • workerLoop() -> !: blocking loop for worker instances (never returns).

Debug

  • debugLockState() -> u32: returns resize_seq & 1 (1 if resizing).
  • getArenaMode() -> u32: 0 heap mode, 1 SAB mode.
  • debugGetArenaBaseAddr() -> u32
  • debugCalculateArenaSize(capacity: u32) -> u32

WASM Imports

  • memory: WebAssembly.Memory (provided via env.memory) - The shared memory instance (must be created with shared: true for SAB mode)
  • js_allow_block() -> i32 - Returns 0 for main thread (spin-only) or 1 for worker threads (can block). This is instance-local and not stored in shared memory to avoid state collisions.

Structure

  • src/lib.rs: crate entry point
  • src/arena.rs: arena allocator + rings + reducer