arena-terms 0.4.0

# Arena-Based vs. Direct Enum Term Representations in Rust

## Introduction  

When building systems that manipulate **tree-shaped data structures**—such as logic programming terms, symbolic expressions, or abstract syntax trees—the choice of representation has major consequences for performance and ergonomics. Two common strategies in Rust are **arena-based handles** and **direct enum variants**. Arena-based handles emphasize compact storage, implicit sharing, and strong memory locality, while direct enums provide idiomatic Rust-style pattern matching and simplicity.  

A key insight is that **arena-based APIs decouple storage layout from the matching API**. With arenas, you can expose precise and clean shapes like `View::Func(&str, &[Term])` regardless of how the data is stored internally. Direct enums, by contrast, tightly couple **pattern matching with memory representation**: to simplify storage, enums often collapse shapes (e.g., `Func(Vec<Term>)`), which can allow imprecise or nonsensical cases while making the API less expressive.  

In an arena-based design, all term data is **immutable** and stored in shared buffers, so multiple handles can point to the same structure without duplication. This makes sharing and reusing terms cheap—copying a term is just copying a small handle. In contrast, with a direct enum representation, each `Term` **owns its data** (`String`, `Vec`, etc.), which means creating variations, or transforming terms usually involves **deep copying** or destructuring and rebuilding the term tree. As a result, arenas favor efficient reuse and sharing, while direct enums trade simplicity for heavier cloning costs in real workloads.

## Memory Efficiency and Allocation

### Size and Layout
- **Arena-Based:** Each `Term` is a compact fixed-size handle (~16 bytes on 64-bit) storing a tag and an index/pointer into an arena. Thanks to Rust’s **niche optimization**, `Option<Term>` has the same size as `Term`.  
- **Direct Enum:** Each variant (`Atom(String)`, `List(Vec<Term>)`, etc.) must fit the largest payload plus a discriminant, often 24–32 bytes or more. `Option<Term>` may be larger than `Term` itself, unless layout niches happen to align.

### Heap Allocation and Reuse
- **Arena:** Variable-sized data is interned once in contiguous buffers owned by the arena. Large payloads (long strings, lists) can be **shared** across many handles. Copying a handle is O(1).  
- **Direct Enum:** Each variant stores its own heap allocation (`Vec`, `String`, `Box`), and reusing data requires deep cloning or explicit sharing (`Rc`/`Arc`).

### Fragmentation and Locality
- **Arena:** Bump allocation clusters related data, improving **cache locality**. Traversals over lists (`&[Term]`) are efficient.  
- **Direct Enum:** Allocations are scattered; traversals involve more pointer chasing and cache misses.

### Reclamation
- **Arena:** Memory can be freed/reset in bulk, matching Prolog-like workloads.  
- **Direct Enum:** Rust drops objects individually, offering finer control but more overhead.

**Summary Table**

| Aspect                 | Arena-Based Handle         | Direct Enum Term                      |
| ---------------------- | -------------------------- | ------------------------------------- |
| Size of `Term`         | ~16 bytes, fixed           | ≥24 bytes, depends on variant         |
| Size of `Option<Term>` | Same as `Term`             | Often larger                          |
| Allocation pattern     | Few, amortized bulk allocs | Many small allocations                |
| Data reuse             | Implicit, O(1)             | Needs deep clone or `Rc`/`Arc`        |
| Locality               | High, contiguous           | Lower, scattered                      |
| Reclamation            | Bulk free/reset            | Per-object drop                       |

## Performance Considerations

### Term Creation
- **Arena:** Constructing terms is a pointer bump + copy. Large compounds (lists, functions) are built with minimal allocations.  
- **Direct Enum:** Each nested payload (`Vec`, `String`) incurs separate allocations, unless pre-sized manually.

### Access and Traversal
- **Arena:** Access requires `term.view(&arena)` → returns a `View` enum. There’s a small cost (arena check + deref), but traversal benefits from contiguous slices.  
- **Direct Enum:** Direct `match` syntax is simple and idiomatic, but traversals suffer from poor locality on large/deep data and less precise pattern-matching shape.

### Reuse and Copying
- **Arena:** Handles are `Copy`; reusing structures is O(1).  
- **Direct Enum:** Cloning is deep by default. Adding `Rc`/`Arc` helps but adds refcount overhead.

### Cleanup
- **Arena:** Drop/reset wipes large amounts of data quickly.  
- **Direct Enum:** Rust’s granular drop provides flexibility at the cost of efficiency for bulk deletions.

## Serialization/Deserialization

* **Arena:** Enables fast and efficient serialization/deserialization, especially for binary formats.
* **Direct Enum:** Results in much slower serialization/deserialization.

## Ergonomics and API Design  

### Pattern Matching
- **Direct Enum:** Straightforward and idiomatic, but constrained by the representation:

```rust
match term {
    Term::Atom(name) => { /* ... */ }
    Term::List(elems) => { /* ... */ }
    Term::Func(elems) => {
        for elem in elems {
            // recurse with elem
        }
    }
    _ => { /* ... */ }
}
```

- **Arena:** Requires a `View` via the arena:

```rust
match term.view(&arena) {
    View::Atom(name) => { /* ... */ }
    View::List(arena, elems, tail) => { /* ... */ }
    View::Func(arena, functor, args) => {
        for arg in args {
            // recurse with arg.view(arena)
        }
    }
    _ => { /* ... */ }
}
```

⚡ **Key Insight:** View adds ceremony but also **decouples storage layout from API design**. The `View` abstraction makes it possible to present a more precise API—such as `View::Func(&str, &[Term])`—without being bound to low-level encoding details.  

### Construction

* **Arena:** Requires passing the arena around; however, builder macros (`atom!`, `list!`, `func!`) and `IntoTerm` conversions make term construction ergonomic.
* **Direct Enum:** Straightforward but verbose, also requiring helper traits, functions, and macros.

### Ownership and Borrowing
- **Arena:** `Term` handles are `Copy` and cheap to store or pass.  
- **Direct Enum:** Requires `ref` patterns or borrowing fields to avoid moves.

### Safety and Isolation
- **Arena:** Prone to mixing terms from different arenas (caught at runtime).  
- **Direct Enum:** Simpler, all data is self-contained.

## Conclusion  

Arena-based handles are best for workloads with many terms, heavy reuse, or deeply nested structures where performance and memory efficiency are critical. They offer small fixed-size handles, efficient reuse, cache-friendly locality, bulk memory management, and more precise pattern-matching API though they require carrying the arena reference.  

Direct enums remain idiomatic and simple, ideal for small projects, teaching, or prototypes. However, they tie **API expressiveness to storage layout**, leading to inefficiencies and less precise pattern matching. At scale, they incur scattered allocations and costly deep clones.

You can find a reference arena-based, Prolog-like term implementation—featuring a term parser with dynamic operators and a pretty-printer—at [https://github.com/ikhomyakov/arena-terms.git](https://github.com/ikhomyakov/arena-terms.git).