Module string_intern

Small-string interning (Phase D.4).

Extracted from value_word.rs in Phase R6.3.

§Design rationale

True small-string optimization (SSO), in the sense of packing the bytes inline in the 8-byte ValueWord, is not feasible in the current layout. All 8 NaN-boxing tag values (0b000..0b111) are already consumed (see tag_bits), and only 48 bits of payload are available. That is 6 bytes at most, and strings of 6 bytes or fewer are a rounding error among the strings a program actually handles.
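To make the payload arithmetic concrete, here is a minimal sketch of what inline packing into a 48-bit payload would allow. The names PAYLOAD_BITS, MAX_INLINE_BYTES, and pack_inline are hypothetical and exist only for this illustration; the byte order and the absence of a stored length are assumptions, not a description of ValueWord.

```rust
/// Assumed for illustration: 48 payload bits remain in a 64-bit NaN-boxed word.
const PAYLOAD_BITS: u32 = 48;

/// 48 bits / 8 bits per byte = 6 bytes of inline capacity.
const MAX_INLINE_BYTES: usize = (PAYLOAD_BITS / 8) as usize;

/// Hypothetical packer: place up to six bytes in the low 48 bits,
/// least-significant byte first. Returns None when the string does not fit.
/// Note that no length is stored here; a real encoding would have to spend
/// some of the 48 bits on one, shrinking the capacity further.
fn pack_inline(s: &str) -> Option<u64> {
    let bytes = s.as_bytes();
    if bytes.len() > MAX_INLINE_BYTES {
        return None;
    }
    let mut word = 0u64;
    for (i, b) in bytes.iter().enumerate() {
        word |= (*b as u64) << (8 * i);
    }
    Some(word)
}
```

Under these assumptions "id" and "name" would fit, but "message" (7 bytes) already would not, which is why inline packing covers so little.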

Multi-slot SSO (spreading bytes across 2-3 adjacent stack slots) would require compiler support for multi-slot string bindings and invasive changes to the executor + JIT — not worth it as an isolated change.

Instead, we collapse the common case of repeated short strings via a process-global intern pool. Programs allocate Arc<String> over and over for the same content (field names, enum tags, short literals like “ok”, “id”, “name”). With interning, N copies share a single allocation and the Arc refcount does the rest.
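The difference between repeated allocation and shared allocation can be sketched with plain Arc<String> values. The helpers without_interning and with_sharing are hypothetical names used only here; they do not involve the pool itself, they only show what "N copies share a single allocation" means in terms of pointers and refcounts.

```rust
use std::sync::Arc;

/// Each call allocates a fresh String and a fresh Arc, as un-interned code does.
fn without_interning(n: usize) -> Vec<Arc<String>> {
    (0..n).map(|_| Arc::new(String::from("name"))).collect()
}

/// One allocation; every further copy only bumps the Arc refcount.
fn with_sharing(n: usize) -> Vec<Arc<String>> {
    let canonical = Arc::new(String::from("name"));
    (0..n).map(|_| Arc::clone(&canonical)).collect()
}
```

With without_interning(3), Arc::ptr_eq on any two elements is false and each strong count is 1. With with_sharing(3), all three elements point at the same allocation and Arc::strong_count reports 3 once the function's local handle is dropped.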

§Behavioural contract

  • ValueWord::from_string(s) still returns a ValueWord wrapping Arc<String>. Callers observe no change: as_string() / as_heap_ref() return the same &str content. Mutation is already impossible via Arc<String> (no Arc::make_mut is called on interned strings in the codebase — all string ops produce a new String).
  • Long strings (len > INTERN_THRESHOLD) bypass the pool entirely: the hash/lookup cost isn’t justified for long unique payloads, and the memory win would be marginal.
  • The pool is bounded by INTERN_CAP entries. When full, strings not already in the pool fall through to the no-intern path. We never evict, so all live Arc<String> refs stay valid.
  • The pool uses std::sync::LazyLock<Mutex<...>>. A HashMap<Arc<String>, ()> (set semantics keyed by the Arc’s string content) would work, but using HashMap<Arc<String>, Arc<String>> lets us return the canonical Arc without rebuilding one.
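The contract above can be sketched as follows. This is a minimal illustration, not the module's actual code: intern_sketch, POOL, SKETCH_THRESHOLD, and SKETCH_CAP are hypothetical names, the two constant values are placeholders (the real INTERN_THRESHOLD and INTERN_CAP values are not stated here), and the handling of a poisoned mutex is an assumption.

```rust
use std::collections::HashMap;
use std::sync::{Arc, LazyLock, Mutex};

// Placeholder values for illustration only.
const SKETCH_THRESHOLD: usize = 32;
const SKETCH_CAP: usize = 4096;

// Process-global pool. Arc<String> hashes and compares by string content,
// so a lookup with any Arc holding equal text finds the canonical entry.
static POOL: LazyLock<Mutex<HashMap<Arc<String>, Arc<String>>>> =
    LazyLock::new(|| Mutex::new(HashMap::new()));

fn intern_sketch(s: Arc<String>) -> Arc<String> {
    // Long strings bypass the pool entirely.
    if s.len() > SKETCH_THRESHOLD {
        return s;
    }
    // Assumption: on a poisoned lock, keep using the map rather than panic.
    let mut pool = POOL.lock().unwrap_or_else(|e| e.into_inner());
    // Hit: hand back the canonical Arc; the caller's Arc is dropped.
    if let Some(canonical) = pool.get(&s) {
        return Arc::clone(canonical);
    }
    // Pool full: fall through to the no-intern path, never evict.
    if pool.len() >= SKETCH_CAP {
        return s;
    }
    // Miss with room left: the input becomes the canonical entry.
    pool.insert(Arc::clone(&s), Arc::clone(&s));
    s
}
```

Because a hit returns a clone of the pooled value, callers must keep the returned Arc rather than the one they passed in. Two calls with separately allocated "ok" strings come back pointer-equal, while two equal strings longer than the threshold keep their own allocations.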

§Future work

A fully-inline SSO (store up to ~22 bytes inline across a 24-byte heap object with its own refcount) would eliminate the outer Arc allocation entirely for short strings. That’s a bigger change — it touches the HeapValue representation, VM executor string reads, JIT FFI, and wire serialization. Revisit once the StringObj / UnifiedString v2 paths are the primary runtime representation.

Constants§

INTERN_CAP
Hard cap on pool size. When reached, new strings bypass interning. Sized to comfortably fit all stdlib field names + enum tags + common literals across a large program. Entries are never evicted once inserted (the pool owns an Arc ref keeping the string alive).
INTERN_THRESHOLD
Strings with byte length <= this value are candidates for interning. Chosen to cover common field names, enum tags, and short literals (e.g. “ok”, “err”, “id”, “name”, “type”, “value”) while excluding long user content where the hash cost dominates.

Functions§

intern_short_string
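Return the canonical Arc<String> for s if s is short enough to intern; otherwise return s unchanged. A short string that is not yet in the pool is also returned unchanged once the pool has reached INTERN_CAP. Callers should always use the returned Arc, which may be a different (shared) pointer than the input.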