Module encoder

Symbolic encoder – Rust port of research/f2_selector_oracle.py::encode_symbolic.

Implements the Encoder trait. The selector currently returns Format::Symbolic for every input that passes a minimum-length floor; richer dispatch (JIT-progressive, fragment-prose, structured-delim) is tracked separately as F1-followup. Anything below the floor falls through as Format::Prose.
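The length-based dispatch described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the crate's code: the `Format` enum is reduced to the two variants the selector currently returns, `Bypass` and `select_format` are hypothetical names, and both constant values are placeholders, since the real values of `MIN_INPUT_CHARS` and `MAX_INPUT_CHARS` are not stated on this page. The upper bound is taken from the `MAX_INPUT_CHARS` description.

```rust
/// Hypothetical stand-in for the crate's `Format` enum; only the two
/// variants the current selector can return are modeled.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Format {
    Symbolic,
    Prose,
}

/// Hypothetical bypass reasons; `OversizedInput` mirrors the name used in
/// the `MAX_INPUT_CHARS` description, `TooShort` is invented for the sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Bypass {
    TooShort,
    OversizedInput,
}

/// Placeholder values; the real constants are not given in this document.
pub const MIN_INPUT_CHARS: usize = 32;
pub const MAX_INPUT_CHARS: usize = 65_536;

/// Length-gated dispatch: inputs outside [MIN, MAX] chars stay prose,
/// everything else is routed to the symbolic encoder.
pub fn select_format(input: &str) -> (Format, Option<Bypass>) {
    // Count chars, not bytes, because both limits are documented in chars.
    let n = input.chars().count();
    if n < MIN_INPUT_CHARS {
        (Format::Prose, Some(Bypass::TooShort))
    } else if n > MAX_INPUT_CHARS {
        (Format::Prose, Some(Bypass::OversizedInput))
    } else {
        (Format::Symbolic, None)
    }
}
```

An input of exactly `MIN_INPUT_CHARS` chars is treated as passing the floor here, matching the "below which" wording of the constant's description.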

The substitution table and filler-word set are kept in sync with the Python reference. Whenever you change one, port the change to the other so research benchmarks and production stay consistent.

Performance contract (DoD §10): compress must run in under 5 ms at p95. The latency budget is dominated by Measurer::tokenize; the regex pass itself takes ~50 µs on a 4 kB input (see tests::compress_meets_section_10).
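As a sketch of what checking the p95 budget involves, the helper below computes a nearest-rank 95th percentile over collected latency samples and compares it against the 5 ms limit. The names `p95`, `meets_budget`, and `BUDGET` are hypothetical, and this is not necessarily how tests::compress_meets_section_10 measures it.

```rust
use std::time::Duration;

/// The 5 ms budget from the performance contract.
pub const BUDGET: Duration = Duration::from_millis(5);

/// Nearest-rank p95 over a set of latency samples.
/// Returns `None` when there are no samples.
pub fn p95(samples: &[Duration]) -> Option<Duration> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort();
    // Nearest-rank: rank = ceil(0.95 * n), computed in integer arithmetic.
    let rank = (sorted.len() * 95 + 99) / 100;
    // Ranks are 1-based; convert to a zero-based index.
    Some(sorted[rank - 1])
}

/// True when the p95 latency is strictly below the budget.
/// An empty sample set is treated as not meeting the budget.
pub fn meets_budget(samples: &[Duration]) -> bool {
    matches!(p95(samples), Some(p) if p < BUDGET)
}
```

In practice the samples would come from timing repeated compress calls with `std::time::Instant`; the percentile step is kept separate so it can be checked deterministically.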

Structs

EncoderTrace
Per-rule fire counts emitted alongside an encode pass.
RuleSet
Hot-reloadable rule configuration. Carries an enable flag and a soft weight in [0.0, 1.0] for each of the 8 categorical rules the encoder knows about. The weights field is stored for the closed-loop bandit / experiment generator; the encoder itself treats a rule whose weight is below ENABLE_WEIGHT_THRESHOLD as “off” and otherwise consults enabled.
SymbolicEncoder
Production encoder backed by a Measurer for token accounting.
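The two-stage gate described for RuleSet (soft weight first, then the enable flag) might look like the sketch below. The field layout (fixed arrays indexed by rule position), the `f64` weight type, and the `is_active` method are assumptions made for illustration; only the 0.05 threshold and the count of 8 rules come from this page.

```rust
/// Number of categorical rules, per the RuleSet description.
pub const RULE_COUNT: usize = 8;

/// Soft-gate threshold, per the ENABLE_WEIGHT_THRESHOLD description.
pub const ENABLE_WEIGHT_THRESHOLD: f64 = 0.05;

/// Hypothetical layout: one flag and one weight per rule, by index.
pub struct RuleSet {
    pub enabled: [bool; RULE_COUNT],
    pub weights: [f64; RULE_COUNT],
}

impl RuleSet {
    /// Hypothetical helper: a rule fires only if its weight clears the
    /// soft gate AND its enable flag is set.
    pub fn is_active(&self, rule: usize) -> bool {
        // A weight strictly below the threshold switches the rule off,
        // regardless of the enable flag.
        if self.weights[rule] < ENABLE_WEIGHT_THRESHOLD {
            return false;
        }
        // Otherwise the explicit flag decides.
        self.enabled[rule]
    }
}
```

Note that a weight exactly equal to the threshold does not trip the soft gate in this sketch, since the description uses a strict "less than".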

Constants

ENABLE_WEIGHT_THRESHOLD
Soft-gate threshold below which a rule’s sampled weight flips it off. Originally 0.5 (the midpoint of the Beta prior); revised to 0.05 on 2026-04-24 after a corpus-distribution simulation over 126 audit events showed that 0 of 15 observed rules could cross 0.5 under any attribution scheme. Because the reward signal is bounded (p90 = 0.15, median = 0.01 on this corpus), a 0.5 threshold effectively means “a rule must save 50% of bytes by itself to stay enabled”, which no individual rule can achieve. A threshold of 0.05 separates productive rules (mean ≥ 0.09) from genuinely dead rules (Beta converges to ≈ 0). Must stay strictly greater than pithy_controller::bandit::ZERO_FIRE_MAX_WEIGHT so the zero-fire clamp still disables rules that have never observed evidence.
MAX_INPUT_CHARS
Maximum input length (chars) the encoder will process. Inputs above this fall through to Prose with OversizedInput so a pathological caller cannot pin a worker on regex work.
MIN_INPUT_CHARS
Minimum input length (chars) below which the encoder bypasses to Prose.
RULE_NAMES
Stable rule names emitted by EncoderTrace::as_pairs. Used as the canonical keys for RuleSet::enabled / weights so the closed-loop tuner attributes savings to the same identifier the encoder fires under.
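To illustrate why a single canonical name list matters, the sketch below pairs a fixed name table with per-rule fire counts, so that whatever consumes the trace sees the same keys the rule configuration uses. The rule names are placeholders (the real RULE_NAMES entries are not listed on this page, and only three are modeled), and the `counts` field, `record` method, and `as_pairs` return type are assumptions.

```rust
/// Placeholder subset; the real table is documented as covering 8 rules.
pub const RULE_COUNT: usize = 3;
pub const RULE_NAMES: [&str; RULE_COUNT] =
    ["placeholder_a", "placeholder_b", "placeholder_c"];

/// Hypothetical trace: one fire counter per rule, in RULE_NAMES order.
#[derive(Debug, Default)]
pub struct EncoderTrace {
    counts: [u32; RULE_COUNT],
}

impl EncoderTrace {
    /// Record one firing of the rule at `index` (hypothetical helper).
    pub fn record(&mut self, index: usize) {
        self.counts[index] += 1;
    }

    /// Emit (name, count) pairs keyed by the canonical rule names.
    pub fn as_pairs(&self) -> Vec<(&'static str, u32)> {
        RULE_NAMES
            .iter()
            .copied()
            .zip(self.counts.iter().copied())
            .collect()
    }
}
```

Because both the trace and the configuration index into the same table, a tuner reading these pairs can look up the matching enable flag or weight without a separate name mapping.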

Functions

encode_symbolic
Run the symbolic-encoding pipeline. Pure function; deterministic.
encode_symbolic_structural_traced_with
Segment-aware symbolic compression (Phase B of B8 fix).
encode_symbolic_traced
Run the symbolic-encoding pipeline AND return a per-rule firing trace.
encode_symbolic_traced_with
Run the symbolic-encoding pipeline against text under the supplied RuleSet, returning the compressed text plus a per-rule firing trace.