Small-string interning (Phase D.4).
Extracted from value_word.rs in Phase R6.3.
§Design rationale
True small-string optimization (SSO) in the sense of “pack the bytes inline
in the 8-byte ValueWord” is not feasible in the current layout: all 8
NaN-boxing tag values (0b000..0b111) are already consumed (see tag_bits)
and only 48 bits of payload are available. That is at most 6 bytes, too
few to be useful (strings of 6 bytes or fewer are a rounding error).
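The byte budget can be checked with a small sketch. The 48-bit payload figure comes from the text above; the try_pack_inline helper and the MAX_INLINE_BYTES constant are hypothetical names introduced only for illustration and are not part of this module.

```rust
// Figure taken from the text: 48 bits of payload remain after the NaN-boxing tag.
const PAYLOAD_BITS: u32 = 48;
// Hypothetical constant: how many whole bytes fit in that payload.
const MAX_INLINE_BYTES: usize = (PAYLOAD_BITS / 8) as usize;

/// Hypothetical packer: returns the payload bits if `s` fits inline, else None.
/// No length is stored here; a real encoding would have to spend payload bits
/// on the length (or reserve a terminator byte), shrinking the budget further.
fn try_pack_inline(s: &str) -> Option<u64> {
    let bytes = s.as_bytes();
    if bytes.len() > MAX_INLINE_BYTES {
        return None;
    }
    let mut payload = 0u64;
    for (i, b) in bytes.iter().enumerate() {
        // Byte i occupies bits [8*i, 8*i + 8); the highest shift used is 40.
        payload |= (*b as u64) << (8 * i);
    }
    Some(payload)
}

fn main() {
    assert_eq!(MAX_INLINE_BYTES, 6);
    assert!(try_pack_inline("ok").is_some());
    assert!(try_pack_inline("status").is_some()); // exactly 6 bytes
    assert!(try_pack_inline("message").is_none()); // 7 bytes, does not fit
}
```

Under these assumptions even a 7-byte field name falls outside the inline path, which is why the payload alone is too small to matter.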
Multi-slot SSO (spreading bytes across 2-3 adjacent stack slots) would require compiler support for multi-slot string bindings and invasive changes to the executor + JIT — not worth it as an isolated change.
Instead, we collapse the common case of repeated short strings via
a process-global intern pool. Programs allocate Arc<String> over and
over for the same content (field names, enum tags, short literals like
“ok”, “id”, “name”). With interning, N copies share a single allocation
and the Arc refcount does the rest.
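As a minimal illustration of "N copies share a single allocation", the sketch below uses only the standard Arc API. The share_copies helper is a hypothetical name for this example and does not exist in the module.

```rust
use std::sync::Arc;

/// Hypothetical helper: build one canonical Arc<String> and hand out `n` clones.
/// Each clone is a pointer copy plus a refcount increment, not a new allocation.
fn share_copies(content: &str, n: usize) -> (Arc<String>, Vec<Arc<String>>) {
    let canonical = Arc::new(content.to_string());
    let copies = (0..n).map(|_| Arc::clone(&canonical)).collect();
    (canonical, copies)
}

fn main() {
    let (canonical, copies) = share_copies("name", 1000);

    // Every copy points at the same heap allocation as the canonical Arc.
    assert!(copies.iter().all(|c| Arc::ptr_eq(c, &canonical)));

    // One strong reference for the canonical Arc plus one per copy.
    assert_eq!(Arc::strong_count(&canonical), 1001);

    // Dropping the copies only decrements the refcount; the string stays alive.
    drop(copies);
    assert_eq!(Arc::strong_count(&canonical), 1);
}
```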
§Behavioural contract
- ValueWord::from_string(s) still returns a ValueWord wrapping Arc<String>. Callers observe no change: as_string() / as_heap_ref() return the same &str content. Mutation is already impossible via Arc<String> (no Arc::make_mut is called on interned strings in the codebase; all string ops produce a new String).
- Long strings (len > INTERN_THRESHOLD) bypass the pool entirely: the hash/lookup cost isn't justified for long unique payloads, and the memory win would be marginal.
- The pool is bounded by INTERN_CAP entries. When full, new lookups fall through to the no-intern path. We never evict, keeping all live Arc<String> refs valid.
- The pool uses std::sync::LazyLock<Mutex<...>>. A HashMap<Arc<String>, ()> (set semantics keyed by the Arc's string content) would work, but using HashMap<Arc<String>, Arc<String>> lets us return the canonical Arc without rebuilding one.
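The contract above could be sketched roughly as follows. This is not the module's actual source: the constant values are placeholders (this page does not state them), the POOL static is a hypothetical name, and the by-value Arc<String> signature is an assumption inferred from the function description.

```rust
use std::collections::HashMap;
use std::sync::{Arc, LazyLock, Mutex};

// Placeholder values for illustration; the real constants may differ.
const INTERN_THRESHOLD: usize = 16;
const INTERN_CAP: usize = 4096;

// Hypothetical pool: key and value are the same canonical Arc, so a hit can
// return a clone of the stored Arc without building a new one.
static POOL: LazyLock<Mutex<HashMap<Arc<String>, Arc<String>>>> =
    LazyLock::new(|| Mutex::new(HashMap::new()));

/// Assumed signature: takes an Arc<String> and returns the canonical one.
fn intern_short_string(s: Arc<String>) -> Arc<String> {
    // Long strings bypass the pool entirely.
    if s.len() > INTERN_THRESHOLD {
        return s;
    }
    // Sketch only: a poisoned mutex panics here.
    let mut pool = POOL.lock().unwrap();
    // Arc<String> hashes and compares by string content, not by pointer.
    if let Some(canonical) = pool.get(&s) {
        return Arc::clone(canonical);
    }
    // Pool full: fall through to the no-intern path, never evict.
    if pool.len() >= INTERN_CAP {
        return s;
    }
    pool.insert(Arc::clone(&s), Arc::clone(&s));
    s
}

fn main() {
    let a = intern_short_string(Arc::new("ok".to_string()));
    let b = intern_short_string(Arc::new("ok".to_string()));
    assert!(Arc::ptr_eq(&a, &b)); // second call returns the pooled Arc

    let long = "x".repeat(INTERN_THRESHOLD + 1);
    let c = intern_short_string(Arc::new(long.clone()));
    let d = intern_short_string(Arc::new(long));
    assert!(!Arc::ptr_eq(&c, &d)); // long strings are never pooled
}
```

In this sketch the input Arc is simply dropped on a pool hit, which is why callers need to keep the returned value rather than the one they passed in.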
§Future work
A fully-inline SSO (store up to ~22 bytes inline across a 24-byte heap
object with its own refcount) would eliminate the outer Arc allocation
entirely for short strings. That’s a bigger change — it touches the
HeapValue representation, VM executor string reads, JIT FFI, and wire
serialization. Revisit once the StringObj / UnifiedString v2 paths
are the primary runtime representation.
Constants§
- INTERN_CAP - Hard cap on pool size. When reached, new strings bypass interning. Sized to comfortably fit all stdlib field names + enum tags + common literals across a large program. Entries are never evicted once inserted (the pool owns an Arc ref keeping the string alive).
- INTERN_THRESHOLD - Strings with byte length <= this value are candidates for interning. Chosen to cover common field names, enum tags, and short literals (e.g. “ok”, “err”, “id”, “name”, “type”, “value”) while excluding long user content where the hash cost dominates.
Functions§
- intern_short_string - Return the canonical Arc<String> for s if s is short enough to intern; otherwise return s unchanged. Callers should always use the returned Arc: it may be a different (shared) pointer than the input.