Expand description
Symbolic encoder – Rust port of research/f2_selector_oracle.py::encode_symbolic.
Implements the Encoder trait. The selector currently returns
Format::Symbolic for every input that passes a minimum-length floor;
richer dispatch (JIT-progressive, fragment-prose, structured-delim) is
tracked separately as F1-followup. Anything below the floor falls
through as Format::Prose.
The substitution table and filler-word set are kept in sync with the Python reference. Whenever you change one, port the change to the other so research benchmarks and production stay consistent.
Performance contract (DoD §10): compress must run in <5ms p95.
Latency budget is dominated by Measurer::tokenize (the regex pass
itself is ~50us on a 4kB input – see tests::compress_meets_section_10).
Structs§
- Encoder
Trace - Per-rule fire counts emitted alongside an encode pass.
- RuleSet
- Hot-reloadable rule configuration. Carries an enable flag and a
soft weight in
[0.0, 1.0]for each of the 8 categorical rules the encoder knows about.weightsis stored for the closed-loop bandit / experiment generator; the encoder itself treatsweight < ENABLE_WEIGHT_THRESHOLDas “off” and otherwise consultsenabled. - Symbolic
Encoder - Production encoder backed by a
Measurerfor token accounting.
Constants§
- ENABLE_
WEIGHT_ THRESHOLD - Soft-gate threshold below which a rule’s sampled weight flips it
off. Originally 0.5 (midpoint of the Beta prior), revised to 0.05
on 2026-04-24 after a corpus-distribution simulation over 126
audit events showed 0 of 15 observed rules could cross 0.5 under
any attribution scheme — the bounded reward signal (p90=0.15,
median=0.01 on this corpus) makes a 0.5 threshold semantically
“rule must save 50% of bytes by itself to stay enabled”, which no
individual rule can achieve. 0.05 discriminates productive rules
(mean ≥ 0.09) from genuinely-dead rules (Beta converges to ≈ 0).
Must stay strictly greater than
pithy_controller::bandit::ZERO_FIRE_MAX_WEIGHTso the zero-fire clamp still disables rules that have never observed evidence. - MAX_
INPUT_ CHARS - Maximum input length (chars) the encoder will process. Inputs above
this fall through to Prose with
OversizedInputso a pathological caller cannot pin a worker on regex work. - MIN_
INPUT_ CHARS - Minimum input length (chars) below which encoder bypasses to Prose.
- RULE_
NAMES - Stable rule names emitted by
EncoderTrace::as_pairs. Used as the canonical keys forRuleSet::enabled/weightsso the closed-loop tuner attributes savings to the same identifier the encoder fires under.
Functions§
- encode_
symbolic - Run the symbolic-encoding pipeline. Pure function; deterministic.
- encode_
symbolic_ structural_ traced_ with - Segment-aware symbolic compression (Phase B of B8 fix).
- encode_
symbolic_ traced - Run the symbolic-encoding pipeline AND return a per-rule firing trace.
- encode_
symbolic_ traced_ with - Run the symbolic-encoding pipeline against
textunder the suppliedRuleSet, returning the compressed text plus a per-rule firing trace.