typed-ski Rust (WASM arena)
This directory contains the Rust implementation of the arena allocator and reducer used by typed-ski's WASM evaluators.
Overview
This Rust code targets wasm32-unknown-unknown and runs inside WebAssembly.
It provides:
- Hash-consing arena allocation with structural sharing for memory efficiency
- Iterative reduction algorithm to prevent stack overflow on deep expressions
- Preemptive multitasking with cooperative yielding for fair worker scheduling
- Lock-free ring buffers (io_uring-style) for efficient inter-thread communication
- SharedArrayBuffer support for true parallel evaluation across Web Workers
- Dynamic arena growth with seqlock-based synchronization
Architecture
The arena allocator supports two modes:
- Single-threaded (Heap Mode): Uses local WASM memory with lazy allocation
- Multi-threaded (SAB Mode): Uses WebAssembly shared memory (SharedArrayBuffer) to enable concurrent access from multiple Web Workers, with atomic operations and lock-free synchronization for thread safety
Hash consing is implemented using a hash table with chaining, ensuring that expressions with the same structure (same left and right children) are automatically deduplicated. The arena can grow dynamically up to a fixed maximum capacity.
Thread Safety and Locking
In SAB (SharedArrayBuffer) mode, the arena uses lock-free and atomic-based concurrency control to enable high-performance concurrent access:
Optimistic Concurrency Control
Hash-consing allocation uses optimistic concurrency control:
- Optimistic read: Traverse bucket chain without locking
- Pre-allocate ID: Use
top.fetch_add()to reserve slot atomically - CAS insertion: Attempt to insert into hash table bucket
- Graceful failure: If CAS fails, slot becomes a "hole" (
kind = 0)
Correctness guarantee: Canonical node IDs are always returned. Readers treat
non-NonTerm kinds as terminal values, so holes don't affect evaluation.
Lock-Free Ring Buffers
Inter-thread communication uses lock-free ring buffers (io_uring-style):
- Submission Queue (SQ): Main thread → Worker communication
- Completion Queue (CQ): Worker → Main thread results
- Atomic operations: Wait-free producer/consumer patterns
- Sequence numbers: ABA prevention in concurrent access
- Blocking waits: Efficient WASM atomic wait/notify
Resize Synchronization
Arena growth uses a seqlock-style approach:
- Stop-the-world pauses: All threads spin during resize operations
- Sequence lock: Odd values indicate resize in progress, even values indicate stable
- Reverse-order copying: Prevents overlap issues during memory migration
- Bucket rebuild: Hash table reconstructed after resize
- Poisoning on failure: Unrecoverable errors set poison sequences
Key Optimizations
1. Hash-Consing (Structural Sharing)
The arena uses hash consing to deduplicate identical subexpressions:
- Avalanche hashing: Fast, high-quality hash function for node pairs
- Separate chaining: Collision resolution in hash table buckets
- Memory efficiency: DAG representation instead of tree (common subexpressions shared)
- O(1) lookups: Fast deduplication during expression construction
2. Iterative Reduction Algorithm
Avoids recursion stack overflow on deep expressions:
- Explicit stack management: Uses Continuation nodes instead of call stack
- Two-phase reduction: Left child first, then right child
- Memory recycling: Dead continuation frames reused immediately
- Scalable depth: No theoretical limit on expression depth
3. Preemptive Multitasking
Enables fair scheduling in parallel evaluation:
- Gas-based yielding: Workers yield control when traversal budget exhausted
- Suspension nodes: Capture complete evaluation state for resumption
- Cooperative scheduling: Prevents worker starvation
- State serialization: Evaluation context stored directly in arena
4. Lock-Free Ring Buffers
High-performance inter-thread communication:
- io_uring-style design: Submission and completion queues
- Atomic operations: Wait-free producer/consumer patterns
- Sequence numbers: ABA prevention in concurrent access
- Blocking waits: Efficient WASM atomic wait/notify
Arena Node Types
The arena supports four node types for different purposes:
Terminal: SKI combinators (S, K, I) - leaf nodesNonTerm: Function application(left right)- core expressionsContinuation: Stack frames for iterative reduction - optimizationSuspension: Paused evaluation state for multitasking - optimization
Execution Model (Host ↔ Workers)
- The host process (JS/TS) is responsible for building expressions in the
arena (via
allocTerminal/allocCons). - The host then submits work to workers using the SQ/CQ rings.
- Workers run an exported infinite loop (
workerLoop) that:- blocks waiting for SQ entries (
memory.atomic.wait32) - reduces the submitted root using iterative algorithm with gas limits
- yields control via Suspensions when gas exhausted (preemptive multitasking)
- enqueues a CQ completion when reduction finishes or yields
- blocks waiting for SQ entries (
Correlation IDs
Work is correlated using a 32-bit req_id chosen by the host.
- SQ entry:
{ node_id: u32, req_id: u32 } - CQ entry:
{ node_id: u32, req_id: u32 }
Completions may arrive out-of-order; req_id is how the host matches results to
callers.
Thread Safety & Concurrency
Hash-Consing Under Concurrency
allocCons implements optimistic concurrency control:
- Optimistic read: Traverse bucket chain without locking
- Pre-allocate ID: Use
top.fetch_add()to reserve slot - CAS insertion: Attempt to insert into hash table bucket
- Graceful failure: If CAS fails, slot becomes a "hole" (
kind = 0)
Correctness guarantee: Canonical node IDs are always returned. Readers treat
non-NonTerm kinds as terminal values, so holes don't affect evaluation.
Stop-the-World Resizing
Uses seqlock-style synchronization for arena growth:
- Odd sequence: Resize in progress (writers spin, readers retry)
- Array migration: Reverse-order copying prevents overlap issues
- Bucket rebuild: Hash table reconstructed after resize
- Even sequence: Stable state (normal operation resumes)
Ring Buffer Concurrency
- Single producer/consumer: Each ring has dedicated thread roles
- ABA prevention: Sequence numbers prevent concurrent access issues
- Memory barriers: Proper acquire/release ordering for visibility
- Wait/notify: Efficient blocking with WASM atomic operations
Shared Memory Safety
- Typed array views: JS-side access to arena arrays
- Cache coherence: 64-byte alignment prevents false sharing
- Poisoning on failure: Unrecoverable errors set poison sequences
- Cross-thread visibility: SharedArrayBuffer enables true parallelism
Performance Characteristics
Time Complexity
- Allocation: O(1) amortized (hash table lookup + potential growth)
- Hash-consing lookup: O(1) average case (hash table with chaining)
- Reduction step: O(depth) with iterative algorithm (no stack overflow)
- Ring operations: O(1) (lock-free with atomic CAS)
Space Complexity
- Per node: ~16 bytes (4 arrays × 4 bytes + metadata)
- Hash table: O(capacity) for buckets and collision chains
- Structural sharing: Significant savings for common subexpressions
- Memory efficiency: ~60% less memory than naive tree representation
Scalability
- Expression depth: Unlimited (iterative vs recursive)
- Parallel workers: Linear scaling with CPU cores
- Memory growth: Dynamic resizing with seqlock synchronization
- Concurrent allocations: Lock-free with optimistic concurrency
Memory Layout
The shared arena is laid out as header + fixed rings + SoA arrays.
Header (SabHeader)
- Magic number, ring parameters, memory offsets
- Capacity, bucket masks, atomic counters
max_steps,resize_seq,topfor coordination
Rings (64-byte aligned)
- SQ (Submission Queue): Host → Worker communication
- CQ (Completion Queue): Worker → Host results
- Lock-free with sequence numbers for ABA prevention
SoA Arrays (Struct of Arrays)
kind: u8[]- Node type (Terminal/NonTerm/Continuation/Suspension)sym: u8[]- Symbol/mode dataleft_id: u32[]- Left child pointers- `right_id: u32[] - Right child pointers
hash32: u32[]- Hash values for deduplicationnext_idx: u32[]- Hash table collision chainsbuckets: u32[]- Hash table bucket heads- Terminal cache: Fast access to S/K/I nodes
The JS/TS side constructs typed array views over these arrays for fast read-only decoding and direct memory access.
Building
The build is managed by Nix. Cargo.toml is generated by the flake.
WASM Exports
Arena / reducer
initArena(capacity: u32) -> u32: allocate and initialize a shared arena region (returns base pointer or error code).connectArena(ptr: u32) -> u32: connect an instance to an existing arena (returns 1 on success).reset()allocTerminal(sym: u32) -> u32allocCons(left: u32, right: u32) -> u32arenaKernelStep(expr: u32) -> u32reduce(expr: u32, max: u32) -> u32
SQ/CQ interface
hostSubmit(node_id: u32, req_id: u32, max_steps: u32) -> u32- returns
0ok,1SQ full,2not connected max_stepsis the reduction limit for this specific request (eliminates race conditions)
- returns
hostPull() -> i64- returns
-1if CQ empty, otherwise a packed value:- high 32 bits:
req_id - low 32 bits:
node_id(result root)
- high 32 bits:
- returns
workerLoop() -> !: blocking loop for worker instances (never returns).
Debug
debugLockState() -> u32: returnsresize_seq & 1(1 if resizing).getArenaMode() -> u32:0heap mode,1SAB mode.debugGetArenaBaseAddr() -> u32debugCalculateArenaSize(capacity: u32) -> u32
WASM Imports
memory: WebAssembly.Memory(provided viaenv.memory) - The shared memory instance (must be created withshared: truefor SAB mode)js_allow_block() -> i32- Returns0for main thread (spin-only) or1for worker threads (can block). This is instance-local and not stored in shared memory to avoid state collisions.
Further Reading
For deeper understanding of the techniques used:
- Hash Consing: Memory-efficient structural sharing
- Seqlock: Reader-writer synchronization
- Compare-and-Swap: Atomic operation primitive
- Memory Barrier: Memory ordering guarantees
- io_uring: Asynchronous I/O framework
- SharedArrayBuffer: Cross-thread memory sharing
- WebAssembly Atomics: Atomic operations in WASM
Structure
src/lib.rs: crate entry pointsrc/arena.rs: arena allocator + rings + reducer