# RFC: Memory Allocation Strategy for Host Function Returns
## Context
**Composite** is a WebAssembly runtime designed to replace wasmtime's Component Model inside **Theater**. The key motivations are:
1. **Recursive data types**: Theater's message-passing system uses recursive types (S-expressions, ASTs, message trees). The Component Model can't represent these natively, forcing serialization workarounds.
2. **Barrier control**: Theater needs to capture and audit everything that crosses the WASM boundary. The Component Model abstracts this away, making enforcement difficult.
3. **Owned tooling**: Full control over the stack enables better tooling, integration, and developer experience.
Composite uses **Graph ABI** for encoding values (including recursive structures) and **WIT+** for type definitions that allow recursion.
## The Problem
When a host function needs to return data to a guest, it must write that data somewhere in the guest's linear memory. Currently, Composite uses a **fixed offset** (16KB):
```rust
// Current implementation in func_typed, func_async, etc.
let bytes = encode(&output_value);
let out_ptr = 16 * 1024; // Always writes to 16KB!
memory.write(out_ptr, &bytes);
return (out_ptr, bytes.len());
```
### Why This Is Problematic
1. **Nested calls**: If host function A triggers a call back into WASM, which then calls host function B, both write to offset 16KB, so B's output overwrites A's.
2. **Size collisions**: The output region (16KB-48KB) holds at most 32KB; a larger output would spill past 48KB into memory the guest heap may be using.
3. **Concurrent async**: Multiple async host functions can race to write to the same location.
4. **No coordination**: The guest has no control over where data lands in its memory.
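The overwrite hazard can be reproduced in a few lines. This is a minimal sketch that uses a `Vec<u8>` as a stand-in for guest linear memory; `LinearMemory`, `host_return`, and `nested_call_clobbers` are hypothetical names for illustration, not Composite APIs.

```rust
/// Fixed output offset used by the current implementation (16KB).
const FIXED_OUT_PTR: usize = 16 * 1024;

/// Hypothetical stand-in for guest linear memory (one 64KB WASM page).
pub struct LinearMemory {
    bytes: Vec<u8>,
}

impl LinearMemory {
    pub fn new() -> Self {
        Self { bytes: vec![0; 64 * 1024] }
    }

    /// Mimics a host function return: always writes at the fixed offset.
    pub fn host_return(&mut self, data: &[u8]) -> (usize, usize) {
        self.bytes[FIXED_OUT_PTR..FIXED_OUT_PTR + data.len()].copy_from_slice(data);
        (FIXED_OUT_PTR, data.len())
    }

    pub fn read(&self, ptr: usize, len: usize) -> &[u8] {
        &self.bytes[ptr..ptr + len]
    }
}

/// Two host returns land at the same offset before the first one has been
/// consumed. Returns what the guest finds when it finally reads A's pointer.
pub fn nested_call_clobbers() -> Vec<u8> {
    let mut mem = LinearMemory::new();
    let (a_ptr, a_len) = mem.host_return(b"result-A");
    let _ = mem.host_return(b"result-B"); // second return, same offset
    mem.read(a_ptr, a_len).to_vec()
}
```

Reading through A's pointer yields B's bytes, which is the corruption described in item 1.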
## Theater's Usage Pattern
Theater actors follow a message-handling pattern:
```
┌─────────────┐                      ┌─────────────┐
│   Runtime   │                      │    Actor    │
└──────┬──────┘                      └──────┬──────┘
       │                                    │
       │  "here's a message" (input)        │
       │───────────────────────────────────>│
       │                                    │
       │                actor processes...  │
       │                                    │
       │  "here's my response" (output)     │
       │<───────────────────────────────────│
       │                                    │
       │  runtime captures response,        │
       │  moves to next message             │
       │                                    │
```
Key characteristics:
- Request/response pattern
- Responses are immediately consumed by the runtime
- Actors don't need to retain response data after returning
- Runtime controls both sides (can define conventions)
## Options
### Option A: Guest Exports Realloc (Component Model Style)
The guest exports a `cabi_realloc` function. When the host needs to return data, it calls back into the guest to allocate memory.
```
┌──────────┐                           ┌──────────┐
│   Host   │                           │  Guest   │
└────┬─────┘                           └────┬─────┘
     │                                      │
     │  (computing result...)               │
     │                                      │
     │  cabi_realloc(0, 0, 8, 1024)         │
     │─────────────────────────────────────>│
     │                                      │
     │  returns ptr=65536                   │
     │<─────────────────────────────────────│
     │                                      │
     │  writes 1024 bytes @ 65536           │
     │                                      │
     │  returns (ptr=65536, len=1024)       │
     │─────────────────────────────────────>│
```
**Implementation:**
```rust
impl<T> Ctx<'_, T> {
    /// Ask the guest to allocate `size` bytes via its exported `cabi_realloc`.
    fn allocate_in_guest(&mut self, size: usize) -> Result<i32, Error> {
        let realloc = self.get_export("cabi_realloc")?;
        // Arguments: (old_ptr, old_size, align, new_size)
        realloc.call((0, 0, 8, size as i32))
    }

    fn write_value(&mut self, value: &Value) -> Result<(i32, i32), Error> {
        let bytes = encode(value)?;
        let ptr = self.allocate_in_guest(bytes.len())?;
        self.memory().write(ptr, &bytes)?;
        Ok((ptr, bytes.len() as i32))
    }
}
```
**Guest must export:**
```rust
#[no_mangle]
pub extern "C" fn cabi_realloc(
    old_ptr: i32, old_size: i32, align: i32, new_size: i32
) -> i32 {
    // Standard allocator implementation
    todo!()
}
```
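For reference, one way the body could be filled in is to delegate to Rust's global allocator. This is a sketch under stated assumptions, not the implementation Composite or `wit-bindgen` ships: it uses pointer-sized parameters (`*mut u8`, `usize`) so that it also runs natively, whereas on `wasm32` those map to the `i32` parameters in the signature above. The export attribute is omitted here and would be needed in a real guest build.

```rust
use std::alloc::{self, Layout};
use std::ptr;

/// Sketch of a guest-side `cabi_realloc` backed by Rust's global allocator.
/// A real guest build would also carry the `#[no_mangle]` export attribute.
///
/// # Safety
/// `old_ptr` must be null (with `old_size == 0`) or a live allocation of
/// exactly `old_size` bytes that was made with the same `align`.
pub unsafe extern "C" fn cabi_realloc(
    old_ptr: *mut u8,
    old_size: usize,
    align: usize,
    new_size: usize,
) -> *mut u8 {
    if old_size == 0 {
        if new_size == 0 {
            // Zero-sized request: any non-null, aligned address will do.
            return align as *mut u8;
        }
        return match Layout::from_size_align(new_size, align) {
            Ok(layout) => unsafe { alloc::alloc(layout) },
            Err(_) => ptr::null_mut(),
        };
    }
    let old_layout = match Layout::from_size_align(old_size, align) {
        Ok(layout) => layout,
        Err(_) => return ptr::null_mut(),
    };
    if new_size == 0 {
        // Shrinking to zero frees the block; `realloc` forbids a zero size.
        unsafe { alloc::dealloc(old_ptr, old_layout) };
        return align as *mut u8;
    }
    // Grow or shrink; contents up to min(old_size, new_size) are preserved.
    unsafe { alloc::realloc(old_ptr, old_layout, new_size) }
}
```

A null return signals allocation failure, so the host side would need to check for it before writing.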
**Pros:**
- Proven pattern (used by Component Model)
- Guest controls its own memory
- Handles arbitrary return sizes
- Clean ownership semantics
- Could leverage existing `wit-bindgen` patterns
**Cons:**
- Every actor needs allocator boilerplate (until tooling automates it)
- Function call overhead for every return
- Guest allocator failure mid-call needs a defined error path
- Possibly more complex than Theater's use case requires
---
### Option B: Caller Provides Output Buffer
The runtime provides both input AND output buffers. The guest writes its response to the provided location.
```
┌──────────┐                                   ┌──────────┐
│ Runtime  │                                   │  Actor   │
└────┬─────┘                                   └────┬─────┘
     │                                              │
     │  call(in_ptr, in_len,                        │
     │       out_ptr, out_cap)                      │
     │─────────────────────────────────────────────>│
     │                                              │
     │  actor writes response to out_ptr            │
     │                                              │
     │  returns out_len                             │
     │<─────────────────────────────────────────────│
     │                                              │
     │  runtime reads [out_ptr..out_ptr+out_len]    │
```
**Calling convention:**
```rust
// Current: (in_ptr, in_len) -> (out_ptr, out_len) packed as i64
// New: (in_ptr, in_len, out_ptr, out_cap) -> out_len
```
**Implementation:**
```rust
impl<T> Instance<T> {
    fn call_with_value(&mut self, name: &str, input: &Value) -> Result<Value, Error> {
        // Write input
        let in_len = self.write_value(INPUT_OFFSET, input)?;

        // Provide output buffer (runtime-managed region)
        let out_ptr = OUTPUT_BUFFER_OFFSET;
        let out_cap = OUTPUT_BUFFER_SIZE; // e.g., 1MB

        // The guest writes its response into the buffer and returns its length
        let out_len = self.call_func(name, (
            INPUT_OFFSET, in_len,
            out_ptr, out_cap,
        ))?;

        self.read_value(out_ptr, out_len)
    }
}
```
**Pros:**
- No guest allocator needed at all
- Zero callback overhead
- Runtime has full control over memory
- Simpler actors (no boilerplate)
- Perfect fit for request/response pattern
- Easy to capture/audit all data crossing boundary
**Cons:**
- What if response exceeds buffer capacity?
  - Option: Return error, runtime retries with larger buffer
  - Option: Two-phase (first call returns size, second call writes)
- Changes calling convention (breaking change)
- Less flexible than guest-controlled allocation
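The "return error, runtime retries with larger buffer" mitigation could look roughly like the following. This is a sketch only: `write_response`, `call_with_retry`, and the `ERR_TOO_SMALL` sentinel are hypothetical names, the guest call is modelled as a closure over a byte slice instead of a real WASM invocation, and the doubling policy is an assumption.

```rust
/// Hypothetical error sentinel: a negative length means "buffer too small".
pub const ERR_TOO_SMALL: i32 = -1;

/// Guest side: copy `response` into the caller-provided buffer.
/// `out` stands in for linear memory at `out_ptr..out_ptr + out_cap`.
pub fn write_response(out: &mut [u8], response: &[u8]) -> i32 {
    if response.len() > out.len() || response.len() > i32::MAX as usize {
        return ERR_TOO_SMALL;
    }
    out[..response.len()].copy_from_slice(response);
    response.len() as i32
}

/// Runtime side: call the guest, doubling the buffer on overflow up to `max_cap`.
pub fn call_with_retry<F>(mut guest: F, initial_cap: usize, max_cap: usize) -> Option<Vec<u8>>
where
    F: FnMut(&mut [u8]) -> i32,
{
    let mut cap = initial_cap.max(1);
    loop {
        let mut buf = vec![0u8; cap];
        let out_len = guest(&mut buf);
        if out_len >= 0 {
            let len = out_len as usize;
            if len > cap {
                return None; // guest claimed more bytes than the buffer holds
            }
            buf.truncate(len);
            return Some(buf);
        }
        if cap >= max_cap {
            return None; // give up: response does not fit in the largest buffer
        }
        cap = (cap * 2).min(max_cap);
    }
}
```

Note that each retry re-runs the guest handler, so this approach is only safe when the handler has no side effects that must not be repeated; the two-phase variant avoids that at the cost of an extra call.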
---
### Option C: Runtime-Managed Buffer Pool
The runtime maintains a pool of buffers, assigning one per call context.
```rust
struct CallContext {
    id: u64,
    output_buffer: Vec<u8>,
    output_offset: usize, // Location in linear memory
}

impl Runtime {
    fn begin_call(&mut self) -> CallContext {
        // Allocate or reuse a buffer
        // Map it into guest memory at a unique offset
        todo!()
    }

    fn end_call(&mut self, ctx: CallContext) {
        // Return buffer to pool
    }
}
```
**Pros:**
- Handles nested calls (each gets own buffer)
- No guest boilerplate
- Efficient buffer reuse
**Cons:**
- Complex runtime implementation
- Memory mapping complexity
- Buffer lifecycle management
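To make the trade-off concrete, here is a minimal sketch of the simplest form such a pool could take: a region of linear memory carved into fixed-size slots, one per in-flight call. `BufferPool` and its fields are hypothetical; it tracks offsets only and sidesteps the memory-mapping question entirely.

```rust
/// Hypothetical slot pool: carves a region of linear memory into
/// fixed-size output buffers and hands one to each in-flight call.
pub struct BufferPool {
    base: usize,
    slot_size: usize,
    free: Vec<usize>, // indices of unused slots
}

impl BufferPool {
    pub fn new(base: usize, slot_size: usize, slots: usize) -> Self {
        assert!(slot_size > 0, "slot_size must be non-zero");
        // Reverse so that slot 0 is handed out first.
        Self { base, slot_size, free: (0..slots).rev().collect() }
    }

    /// Reserve a slot; returns its offset in linear memory, or `None` if exhausted.
    pub fn begin_call(&mut self) -> Option<usize> {
        self.free.pop().map(|slot| self.base + slot * self.slot_size)
    }

    /// Return a slot previously obtained from `begin_call`.
    pub fn end_call(&mut self, offset: usize) {
        let slot = (offset - self.base) / self.slot_size;
        debug_assert!(!self.free.contains(&slot), "slot released twice");
        self.free.push(slot);
    }
}
```

With this shape, nesting depth is bounded by the number of slots, and a nested call receives a different offset from its parent.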
---
### Option D: Hybrid Approach
Use caller-provides-buffer as default, with fallback to guest realloc for oversized responses.
```rust
fn write_value(&mut self, value: &Value) -> Result<(i32, i32), Error> {
    let bytes = encode(value)?;
    if bytes.len() <= INLINE_BUFFER_SIZE {
        // Fast path: use pre-allocated buffer
        self.memory().write(INLINE_BUFFER_OFFSET, &bytes)?;
        Ok((INLINE_BUFFER_OFFSET, bytes.len() as i32))
    } else {
        // Slow path: ask guest to allocate
        let ptr = self.allocate_in_guest(bytes.len())?;
        self.memory().write(ptr, &bytes)?;
        Ok((ptr, bytes.len() as i32))
    }
}
```
**Pros:**
- Fast path for common case (small responses)
- Handles large responses gracefully
- Flexible
**Cons:**
- Two code paths to maintain
- Guest still needs realloc export (for large responses)
---
## Questions for Discussion
1. **How large are typical Theater messages?** If 99% fit in 1MB, Option B might be sufficient with a simple error for oversized responses.
2. **Are nested host calls common?** If actor A calls the host, which calls back into actor A, which calls the host again, how deep can this go?
3. **Is the calling convention change acceptable?** Option B changes from `(in_ptr, in_len) -> i64` to `(in_ptr, in_len, out_ptr, out_cap) -> i32`.
4. **How important is zero guest boilerplate?** Option A requires every actor to export `cabi_realloc`. Is that acceptable, or is simplicity paramount?
5. **Should Composite be Theater-specific or general?** A general-purpose runtime might favor Option A (Component Model style). A Theater-specific runtime might favor Option B (simpler, fits the use case).
## Decision: Option B Implemented
After gathering feedback, **Option B (caller provides output buffer)** was chosen and implemented. The new calling convention is:
```
Old: (in_ptr, in_len) -> i64 // packed (out_ptr, out_len)
New: (in_ptr, in_len, out_ptr, out_cap) -> i32 // returns out_len, or -1 on error
```
Key benefits:
- Zero boilerplate for actors
- Runtime controls the barrier completely
- Fits the consumption pattern (responses are immediately read by runtime)
- No guest allocator needed
- Clean error handling (-1 indicates failure)
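On the runtime side, the `i32` return value has to be interpreted before any bytes are read. The sketch below shows one way to do that; `ReturnError` and `decode_out_len` are hypothetical names, and treating negative values other than -1, or lengths above `out_cap`, as protocol violations is an assumption that the convention above does not spell out.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum ReturnError {
    /// Guest returned -1, the convention's failure value.
    GuestFailed,
    /// Guest returned a length outside `0..=out_cap`
    /// (treated as a protocol violation in this sketch).
    InvalidLength(i32),
}

/// Interpret the guest's `i32` return value under the new convention.
pub fn decode_out_len(ret: i32, out_cap: usize) -> Result<usize, ReturnError> {
    match ret {
        -1 => Err(ReturnError::GuestFailed),
        n if n < 0 => Err(ReturnError::InvalidLength(n)),
        n if n as usize > out_cap => Err(ReturnError::InvalidLength(n)),
        n => Ok(n as usize),
    }
}
```

Validating the length against `out_cap` before reading keeps a misbehaving guest from steering the runtime into reading outside the output buffer.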
---
## Appendix: Memory Layout (Implemented)
```
WASM Linear Memory:
┌─────────────────────────────────────────────┐
│  0-16KB      Default input buffer           │
├─────────────────────────────────────────────┤
│  16-48KB     Default output buffer (32KB)   │  ← Now caller-provided
├─────────────────────────────────────────────┤
│  48KB+       Bump allocator (host.alloc)    │
└─────────────────────────────────────────────┘
Constants defined in src/runtime/host.rs:
- INPUT_BUFFER_OFFSET = 0
- OUTPUT_BUFFER_OFFSET = 16 * 1024
- OUTPUT_BUFFER_CAPACITY = 32 * 1024
```
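The layout above can be expressed as constants plus a simple bump allocator over the 48KB+ region. This is a sketch, not the contents of `src/runtime/host.rs`: the three constant names and values come from the listing above, while their `usize` type, `BUMP_START`, and the `Bump` type are assumptions made for illustration.

```rust
pub const INPUT_BUFFER_OFFSET: usize = 0;
pub const OUTPUT_BUFFER_OFFSET: usize = 16 * 1024;
pub const OUTPUT_BUFFER_CAPACITY: usize = 32 * 1024;

/// Derived (hypothetical name): first byte after the output buffer, i.e. 48KB.
pub const BUMP_START: usize = OUTPUT_BUFFER_OFFSET + OUTPUT_BUFFER_CAPACITY;

/// Hypothetical bump allocator over `BUMP_START..limit`.
pub struct Bump {
    next: usize,
    limit: usize,
}

impl Bump {
    pub fn new(limit: usize) -> Self {
        Self { next: BUMP_START, limit }
    }

    /// Allocate `size` bytes aligned to `align` (which must be a power of two).
    /// Returns the offset in linear memory, or `None` if the region is exhausted.
    pub fn alloc(&mut self, size: usize, align: usize) -> Option<usize> {
        debug_assert!(align.is_power_of_two());
        // Round `next` up to the requested alignment.
        let start = self.next.checked_add(align - 1)? & !(align - 1);
        let end = start.checked_add(size)?;
        if end > self.limit {
            return None;
        }
        self.next = end;
        Some(start)
    }
}
```

Deriving `BUMP_START` from the output buffer constants keeps the regions from overlapping if the buffer capacity is changed later.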
## Appendix: Relevant Code
- `src/runtime/host.rs` - `Ctx::write_value()`, `func_typed()`, `func_async()`
- `src/runtime/mod.rs` - `Instance::call_with_value()`
- Fixed offset (pre-Option B): `let out_ptr = 16 * 1024;`