Crate muxer


muxer: deterministic, multi-objective routing primitives for piecewise-stationary multi-armed bandit problems.

Given K arms (model versions, inference endpoints, backends, or any discrete action set selected repeatedly), the agent observes vector-valued outcomes per call and selects the next arm. Reward distributions are piecewise-stationary: they may change at unknown times, and the agent must detect and adapt to these changes.

An Outcome carries four caller-defined quality fields:

  • ok: the call produced a usable result.
  • junk: quality was below your threshold (hard_junk=true implies junk=true).
  • hard_junk: the call failed entirely — a harsher subset of junk, penalized separately.
  • quality_score (Option<f64>): optional continuous quality signal in [0, 1] (higher = better). Supplements the binary flags without changing routing semantics.

Plus cost_units (caller-defined cost proxy) and elapsed_ms (wall-clock time).
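The documented fields can be sketched as a standalone struct. This mirrors the field names described above; the crate's actual `Outcome` may differ in construction details (derives, visibility, builder methods), so treat this as an illustration of the shape, not the real definition:

```rust
// Sketch of the documented Outcome fields. Field names follow the docs
// above; everything else (derives, construction) is illustrative.
#[derive(Debug, Clone)]
struct OutcomeSketch {
    ok: bool,                   // the call produced a usable result
    junk: bool,                 // quality below the caller's threshold
    hard_junk: bool,            // total failure; hard_junk implies junk
    quality_score: Option<f64>, // optional continuous signal in [0, 1]
    cost_units: f64,            // caller-defined cost proxy
    elapsed_ms: u64,            // wall-clock time
}

fn main() {
    // A degraded-but-usable call: ok, flagged junk, with a quality score.
    let o = OutcomeSketch {
        ok: true,
        junk: true,
        hard_junk: false,
        quality_score: Some(0.42),
        cost_units: 1.5,
        elapsed_ms: 180,
    };
    // The documented invariant: hard_junk = true implies junk = true.
    assert!(!o.hard_junk || o.junk);
    println!("{o:?}");
}
```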

Goals:

  • Deterministic by default: same stats + config → same choice.
  • Non-stationarity friendly: sliding-window summaries, not lifetime averages.
  • Multi-objective: Pareto frontier over ok/junk/cost/latency, then scalarization.
  • Small K: designed for 2–10 arms; not intended for K in the hundreds.

Selection policies:

  • select_mab / select_mab_explain / select_mab_monitored_explain: deterministic Pareto + scalarization. When MabConfig::quality_weight > 0, mean_quality_score is a separate Pareto dimension (6th/9th); scalarization adds quality_weight × mean_quality_score explicitly.
  • ThompsonSampling: seedable Thompson sampling for scalar rewards in [0, 1].
  • Exp3Ix: seedable EXP3-IX for adversarial / fast-shifting rewards.
  • BanditPolicy (feature stochastic): common decide/update_reward trait for ThompsonSampling and Exp3Ix — enables generic policy code.
  • softmax_map: stable score → probability helper for traffic splitting.
  • (feature contextual) LinUcb: linear contextual bandit.
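The `softmax_map` helper's role (stable score → probability for traffic splitting) can be sketched in a few lines. This is a standalone re-implementation of the idea, not the crate's actual signature: subtracting the max score before exponentiating avoids overflow, and a `BTreeMap` keeps iteration deterministic, in line with the crate's determinism goal:

```rust
use std::collections::BTreeMap;

// Illustrative stable softmax over arm scores, in the spirit of the crate's
// `softmax_map` helper (the real signature may differ). Max-subtraction
// prevents exp() overflow; BTreeMap gives deterministic iteration order.
fn softmax_sketch(scores: &BTreeMap<String, f64>) -> BTreeMap<String, f64> {
    let max = scores.values().cloned().fold(f64::NEG_INFINITY, f64::max);
    let exps: BTreeMap<String, f64> = scores
        .iter()
        .map(|(k, &v)| (k.clone(), (v - max).exp()))
        .collect();
    let z: f64 = exps.values().sum();
    exps.into_iter().map(|(k, e)| (k, e / z)).collect()
}

fn main() {
    let mut scores = BTreeMap::new();
    scores.insert("arm_a".to_string(), 1.0);
    scores.insert("arm_b".to_string(), 2.0);
    let probs = softmax_sketch(&scores);
    let total: f64 = probs.values().sum();
    assert!((total - 1.0).abs() < 1e-12); // probabilities sum to 1
    assert!(probs["arm_b"] > probs["arm_a"]); // higher score, more traffic
}
```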

Operational primitives:

Non-goals:

  • Not a full bandit platform (no storage, OPE pipelines, dashboards).
  • contextual is intentionally a small, pragmatic policy module.

§The Three Objectives and the Objective Manifold

Every routing decision simultaneously serves three purposes:

  1. Exploitation (regret minimization): route to the best arm now.
  2. Estimation (learning): understand how each arm performs across conditions.
  3. Detection/triage (monitoring): notice when an arm breaks, then investigate.

These map to the modules of this crate:

  • Selection (select_mab, Exp3Ix, ThompsonSampling): objectives 1+2.
  • Monitoring (monitor): objective 3 – drift/catKL/CUSUM are the detection surface of the design measure.
  • Triage (worst_first): active investigation after detection fires – prioritize historically broken arms to characterize the change.

§The non-contextual collapse (static schedules)

For fixed (non-adaptive) allocation schedules with K arms and Gaussian rewards, estimation error (MSE ~ 1/n) and average detection delay (D_avg ~ C*T / (n * delta^2)) are both monotone-decreasing in the suboptimal-arm pull count n, with proportional gradients everywhere:

  D_avg = (2 * C * sigma^2 * ln(1/alpha) / delta^2) * MSE

This proportionality is structural, not approximate: both functionals care only about “how many observations at this cell”, and their sensitivity functions are scalar multiples of each other. The three-way Pareto surface collapses to a one-dimensional curve parameterized by n. This yields the product identity R_T * D_avg = C * Delta * T / delta^2 for the restricted class of uniform schedules.
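A quick numeric check of the collapse: both MSE (~ sigma^2 / n) and average detection delay are O(1/n), so their ratio is independent of n. The constants below (sigma, delta, alpha, C) are illustrative values, not anything exported by the crate:

```rust
// Sanity check: with MSE = sigma^2 / n and D_avg proportional to MSE as in
// the formula above, the ratio D_avg / MSE is the same for every n, so the
// three-way Pareto surface collapses to a curve parameterized by n.
fn mse(sigma2: f64, n: f64) -> f64 {
    sigma2 / n
}

fn d_avg(c: f64, sigma2: f64, alpha: f64, delta: f64, n: f64) -> f64 {
    (2.0 * c * sigma2 * (1.0 / alpha).ln() / (delta * delta)) * mse(sigma2, n)
}

fn main() {
    let (c, sigma2, alpha, delta) = (1.0, 0.25, 0.01, 0.1);
    let r_small = d_avg(c, sigma2, alpha, delta, 10.0) / mse(sigma2, 10.0);
    let r_large = d_avg(c, sigma2, alpha, delta, 1000.0) / mse(sigma2, 1000.0);
    // Same ratio at n = 10 and n = 1000: the proportionality is structural.
    assert!((r_small - r_large).abs() < 1e-9);
}
```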

Categorical note. The formula above uses Gaussian notation (sigma^2, mean shift delta). muxer operates on categorical outcomes (ok/junk/hard_junk), where detection delay scales as h / KL(p1 || p0) in sample time rather than 2*b*sigma^2 / delta^2. The proportionality structure — MSE and average detection delay both O(1/n_k) — holds identically; only the constants change. See pare::sensitivity for the general form.
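The categorical constant can be computed directly. Below is an illustrative KL divergence over a 4-category outcome distribution, with a hypothetical CUSUM threshold `h`; the crate's own monitor module is the real detection surface, and these numbers are made up:

```rust
// Categorical detection-delay constant: delay in sample time ~ h / KL(p1 || p0).
// kl_categorical computes the KL divergence between two categorical
// distributions; zero-probability categories in p1 contribute nothing.
fn kl_categorical(p1: &[f64], p0: &[f64]) -> f64 {
    p1.iter()
        .zip(p0)
        .filter(|&(&a, _)| a > 0.0)
        .map(|(&a, &b)| a * (a / b).ln())
        .sum()
}

fn main() {
    // Illustrative pre-change vs post-change outcome rates
    // (ok, junk-only, hard_junk, other).
    let p0 = [0.90, 0.05, 0.02, 0.03];
    let p1 = [0.70, 0.15, 0.10, 0.05];
    let h = 5.0; // illustrative CUSUM threshold
    let delay_samples = h / kl_categorical(&p1, &p0);
    assert!(delay_samples > 0.0 && delay_samples.is_finite());
    // No change, no divergence: KL(p || p) = 0.
    assert!(kl_categorical(&p0, &p0).abs() < 1e-12);
}
```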

Caveat: adaptive policies. This clean proportionality holds exactly only for static (non-adaptive) schedules. Adaptive policies (UCB, Thompson Sampling) break it in at least two ways: (1) they allocate the suboptimal arm in bursts during exploration phases rather than uniformly, worsening worst-case detection delay relative to uniform spacing without changing total n; and (2) data-dependent allocation introduces selection bias, so the effective sample size for estimation is no longer simply n. For adaptive policies, the product identity becomes a lower bound (with constants that absorb regularity conditions), not an equality.

Practically, muxer operates at small K (2-10 arms) and moderate T (hundreds to low thousands of windowed observations). At these scales, the asymptotic impossibility results may not bind: all three objectives can often be simultaneously satisfied at acceptable levels. The sliding-window design further blunts the static/adaptive distinction, since the effective horizon is the window size, not T.

§The contextual revival – and its subtlety

In the contextual regime (per-request feature vectors via LinUcb), the design measure gains spatial dimensions, but objectives do not automatically diverge. The mechanism controlling collapse vs. revival is average-case vs. worst-case aggregation, not “contextual vs. non-contextual” per se:

  • Average detection delay has sensitivity s(x) ~ -1/p_a(x)^2, which is proportional to nonparametric IMSE sensitivity everywhere. Average detection is structurally redundant with estimation even in contextual settings. Adding contexts does not break this proportionality.

  • Worst-case detection delay (D_max = max_j D_j) concentrates its sensitivity on the bottleneck cell – the (arm, covariate) pair with the fewest observations. This is a point mass, linearly independent from both the regret sensitivity (a ramp near decision boundaries) and the estimation sensitivity (D-optimal / extremal). Worst-case detection is genuinely independent from regret and estimation.

In the non-contextual case (one cell), average and worst-case are identical, so the distinction is moot. In the contextual case (many cells), they diverge: average detection remains redundant with estimation, but worst-case detection introduces a genuinely new objective axis.

Concretely, each objective wants a different sampling distribution:

  • Regret-optimal: concentrate near decision boundaries.
  • Estimation-optimal: spread to extremes (D-optimal experimental design).
  • Detection-optimal (worst-case): ensure no cell is starved (space-filling).

LinUcb exists to break the non-contextual collapse: it learns per-request routing without maintaining separate per-facet histories, at the cost of requiring explicit monitoring budget beyond what regret-optimal sampling provides.

§Saturation principle

The number of genuinely independent objectives is bounded by the effective dimension of the design space:

  dim(Pareto front) <= min(m - 1, D_eff)

where m is the number of named objectives and D_eff is the design dimension (K-1 for non-contextual, ~K*M for M covariate cells, infinite for continuous covariates). Adding objectives beyond D_eff + 1 cannot create new tradeoffs.
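The bound is simple enough to state as a one-liner. The inputs below are illustrative scenarios, not values computed by the crate:

```rust
// The saturation bound: dim(Pareto front) <= min(m - 1, D_eff), where m is
// the number of named objectives and D_eff the design dimension.
fn pareto_dim_bound(m: usize, d_eff: usize) -> usize {
    (m - 1).min(d_eff)
}

fn main() {
    // Non-contextual, K = 3 arms (D_eff = K - 1 = 2), 8 named objectives:
    // only 2 independent tradeoff directions can exist.
    assert_eq!(pareto_dim_bound(8, 2), 2);
    // Contextual, K = 3 arms, M = 9 cells (D_eff ~ K*M = 27): the objective
    // count m - 1 = 7 becomes the binding constraint instead.
    assert_eq!(pareto_dim_bound(8, 27), 7);
}
```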

The formal algebraic rank can overstate the practical number of tradeoffs. The effective Pareto dimension is better measured by the singular value spectrum of the Jacobian of objectives with respect to design variables: a K=3, M=9 setup with 8 named objectives achieves formal rank 8 but effective dimension ~3-4 (the top 2 singular values carry >99% of the Frobenius norm). See pare::sensitivity for computational tools.

§Design measure perspective

The fundamental object is not “which objectives matter” but the design measure: the joint distribution over (arm, covariate, time) that the policy induces. Given the design measure, every objective is computable.

Note: the design measure in an adaptive setting is a random object (it depends on the realized trajectory), not a fixed distribution. Reasoning about it requires either working with the expected design measure (which loses adaptivity) or a conditional analysis that respects the filtration (which is harder and may break clean proportionality results). See Hadad et al. (arXiv:1911.02768) for the observed vs. expected Fisher information distinction in adaptive experiments.

Non-stationary bandits and sliding windows. Garivier & Moulines (2008, arXiv:0805.3415) established the Sliding-Window UCB (SW-UCB) algorithm, which achieves O(sqrt(Υ_T * K * T * log T)) regret for piecewise-stationary environments with Υ_T changepoints. This is the theoretical foundation for muxer’s sliding-window approach. Optimal window size is O(sqrt(T / Υ_T)); in practice muxer uses a fixed caller-chosen cap.
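The optimal-window scaling above can be sketched as a small heuristic. The crate exposes a related `suggested_window_cap` helper; the standalone function below only illustrates the sqrt(T / Υ_T) scaling and is not that function's actual signature:

```rust
// Illustrative SW-UCB window heuristic: window ~ sqrt(T / Y_T) for horizon T
// and expected changepoint count Y_T. Shorter windows adapt faster to change
// but average over fewer samples.
fn suggested_window(horizon_t: f64, changepoints: f64) -> usize {
    (horizon_t / changepoints.max(1.0)).sqrt().round() as usize
}

fn main() {
    // 10_000 windowed observations, ~4 expected changepoints -> window ~ 50.
    assert_eq!(suggested_window(10_000.0, 4.0), 50);
    // More changepoints -> shorter windows (adapt faster, estimate less).
    assert!(suggested_window(10_000.0, 16.0) < suggested_window(10_000.0, 4.0));
}
```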

Non-stationary bandits with explicit detection. Besson, Kaufmann, Maillard & Seznec (2019, arXiv:1902.01575, JMLR 2023) introduced GLR-klUCB: a parameter-free algorithm combining kl-UCB with a Bernoulli Generalized Likelihood Ratio Test. It is the closest published analog to muxer’s monitored selection path. Key difference: GLR-klUCB restarts all arm statistics on detection; muxer instead switches to worst-first routing to investigate the flagged arm, preserving history. See TriageSession and WorstFirstConfig.

Bandit Quickest Changepoint Detection (BQCD). Gopalan, Saligrama & Lakshminarayanan (2021, arXiv:2107.10492) established the BQCD lower bound: any algorithm with mean-time-to-false-alarm m must suffer expected detection delay Ω(log(m) / D*), where D* is the maximum KL divergence between post-change and pre-change distributions across arms. Their ε-GCD algorithm achieves this with exploration rate ε = Θ(1/√T). CoverageConfig’s minimum sampling-rate floor is the practical lever for approaching this bound.

Sampling-constrained detection. Zhang & Mei (2020, arXiv:2009.11891) directly analyze changepoint detection under sampling-rate constraints, confirming that detection delay in wall time scales as h / (KL(p1 || p0) * rate_k) — the formal basis for the two-clocks approximation.
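The two-clocks approximation is a one-line conversion: delay in sample time divided by the arm's sampling rate gives delay in wall time. All numbers below are illustrative:

```rust
// Two clocks: delay_samples = h / KL(p1 || p0) counts observations of the
// monitored arm; dividing by the arm's sampling rate converts to wall time.
fn delay_samples(h: f64, kl: f64) -> f64 {
    h / kl
}

fn delay_wall(h: f64, kl: f64, rate_k: f64) -> f64 {
    delay_samples(h, kl) / rate_k
}

fn main() {
    let (h, kl) = (5.0, 0.1);
    // An arm receiving 10% of traffic detects 10x slower in wall time.
    assert!((delay_wall(h, kl, 0.1) - 10.0 * delay_samples(h, kl)).abs() < 1e-9);
    // Raising the sampling-rate floor (CoverageConfig) bounds this inflation.
    assert!(delay_wall(h, kl, 0.2) < delay_wall(h, kl, 0.1));
}
```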

Regret–BAI Pareto frontier. Zhong, Cheung & Tan (2021, arXiv:2110.08627) formally prove the Pareto tradeoff between regret minimization (RM) and best-arm identification (BAI): achieving O(log T) regret and O(log T) BAI error simultaneously is impossible. The product-identity formulation in muxer’s docs is the static-schedule special case.

Piecewise-stationary multi-objective bandits. Rezaei Balef & Maghsudi (2023, arXiv:2302.05257) study the combination of non-stationarity and multi-objective rewards directly. muxer operates in this space — without claiming worst-case-optimal bounds — by combining sliding-window summaries with Pareto scalarization and explicit monitoring.

Window-limited CUSUM. Xie, Moustakides & Xie (2022, arXiv:2206.06777) connect windowed observation to CUSUM optimality, providing theoretical grounding for MonitoredWindow’s split between baseline and recent windows.

Multi-objective bandit frameworks.

  • Constrained / safe bandits (BwK): muxer’s max_junk_rate, max_drift, etc. are BwK-style anytime constraints. (Badanidiyuru, Kleinberg & Slivkins 2013, FOCS; arXiv:1305.2545)
  • Pareto bandits (Drugan & Nowé, IJCNN 2013): muxer’s pare-based frontier is the selection-time analogue.
  • Information-Directed Sampling (Russo & Van Roy 2014, arXiv:1403.5556): scalarizes regret/information via the information ratio. The three-objective extension is non-trivial only in the contextual worst-case-detection regime.
  • Multi-objective RL: muxer’s deterministic post-Pareto scalarization is a pragmatic instance of this literature.
  • Adaptive experiment design (Hadad et al. 2021, arXiv:1911.02768): the observed vs. expected Fisher information distinction applies to muxer’s adaptive design-measure analysis.

Re-exports§

pub use monitor::calibrate_cusum_threshold;
pub use monitor::simulate_cusum_null_max_scores;
pub use monitor::calibrate_threshold_from_max_scores;
pub use monitor::DriftConfig;
pub use monitor::DriftMetric;
pub use monitor::MonitoredWindow;
pub use monitor::RateBoundMode;
pub use monitor::ThresholdCalibration;
pub use monitor::UncertaintyConfig;

Modules§

monitor
Monitoring primitives: drift detection + uncertainty summaries.

Structs§

BetaStats
Beta posterior state for one arm.
CandidateDebug
Debug row for one candidate arm.
CatKlGuardDecision
Output of applying a categorical KL guard to a candidate set.
ContextBinConfig
Configuration for quantising a feature vector into a discrete context bin.
ContextualCell
An (arm, context-bin) pair identifying a specific covariate cell for triage.
ContextualCoverageTracker
Per-(arm, context-bin) observation tracker.
ContextualPolicyFill
Result of a contextual policy fill (LinUCB-based selection with context features).
ControlConfig
Configuration for control arm reservation.
CoverageConfig
Coverage / maintenance sampling configuration.
CusumGuardDecision
Output of applying a categorical CUSUM guard to a candidate set.
Decision
A single policy decision in a unified envelope.
DriftGuardDecision
Output of applying a drift guard to a candidate set.
Exp3Ix
Seedable EXP3-IX bandit.
Exp3IxConfig
Configuration for EXP3-IX.
Exp3IxState
Serializable EXP3-IX state snapshot (for persistence).
LatencyGuardrailConfig
Latency guardrail configuration.
LinUcb
Seedable linear contextual bandit (LinUCB).
LinUcbArmState
Serializable per-arm state for LinUCB persistence.
LinUcbConfig
Configuration for linear UCB.
LinUcbState
Serializable LinUCB state snapshot (for persistence across process restarts).
MabConfig
Configuration knobs for deterministic MAB-style selection.
MabSelectionDecision
Additional metadata for a deterministic select_mab decision.
MonitoredMabConfig
Extended configuration for monitored MAB selection.
Outcome
A single observed outcome for an arm.
OutcomeIdx
Categorical outcome index constants for the 4-category muxer outcome space.
PolicyFill
Result of filling up to k picks using the shared policy pipeline.
PolicyPlan
Planned policy result for a single selection step.
Router
Stateful routing session.
RouterConfig
Full configuration for a Router.
RouterDecision
Output of a single Router::select call.
RouterSnapshot
A serializable snapshot of Router state.
Selection
Output of select_mab (chosen arm + debugging context).
StickyConfig
Stickiness configuration.
StickyMab
Stateful stickiness wrapper for deterministic select_mab.
Summary
Aggregate counts/sums over a window of outcomes.
ThompsonConfig
Configuration for Thompson sampling.
ThompsonSampling
Seedable Thompson-sampling bandit.
ThompsonState
Serializable Thompson-sampling state snapshot (for persistence across process restarts).
TriageSession
Combined detect-then-triage session.
TriageSessionConfig
Configuration for a TriageSession.
Window
Sliding-window statistics for an arm.
WorstFirstConfig
Configuration for worst-first scoring.

Enums§

DecisionNote
Audit-friendly notes attached to a decision.
DecisionPolicy
Which policy produced a decision.
PipelineOrder
Pipeline ordering: whether novelty/coverage pre-picks run before or after the guardrail.
RouterMode
Current operating mode of the Router.

Traits§

BanditPolicy
Common interface for stateful stochastic bandit policies.

Functions§

apply_prior_counts_to_summary
Apply a pseudo-count prior to a Summary until it reaches target_calls.
context_bin
Compute a stable context bin ID from a feature vector.
contextual_worst_first_pick_k
Pick up to k (arm, context-bin) cells for contextual worst-first triage (without replacement).
contextual_worst_first_pick_one
Pick one (arm, context-bin) cell for contextual worst-first regression triage.
coverage_pick_under_sampled
Deterministically pick up to k arms that are under-sampled relative to a target quota.
coverage_pick_under_sampled_idx
Deterministically pick up to k arms (by index) that are under-sampled relative to a target quota.
guardrail_filter_observed
Apply the latency guardrail using observed (non-smoothed) stats.
guardrail_filter_observed_elapsed
Like guardrail_filter_observed, but the observation function returns the raw elapsed-ms sum.
novelty_pick_unseen
Deterministically pick up to k arms that are “unseen” under the provided observed_calls.
pick_control_arms
Pick up to cfg.control_k arms uniformly at random (deterministic, seed-based).
policy_fill_generic
Generic policy fill: compute a plan via policy_plan_generic, then fill remaining picks using pick_rest.
policy_fill_k_contextual
Contextual variant of policy_fill_k_observed_with: uses LinUCB for the algorithmic selection step instead of a flat MAB/EXP3-IX policy.
policy_fill_k_observed_guardrail_first_with_coverage
Delegates to policy_fill_generic with PipelineOrder::GuardrailFirst and caller-provided coverage.
policy_fill_k_observed_with_coverage
Delegates to policy_fill_generic with PipelineOrder::NoveltyFirst and caller-provided coverage.
policy_plan_generic
Generic policy plan: compute pre-picks (novelty + optional coverage) and guardrail filtering, with ordering controlled by order.
select_k_without_replacement_by
Convenience wrapper over select_k_without_replacement_by_with_meta when no meta is needed.
select_mab
Deterministic selection (Pareto filtering + scalarization).
select_mab_decide
Unified decision envelope for deterministic MAB selection.
select_mab_explain
Like select_mab, but also returns metadata about constraints and explore-first behavior.
select_mab_monitored_decide
Unified decision envelope for monitored deterministic MAB selection.
select_mab_monitored_explain
Monitored deterministic selection (baseline vs recent drift + uncertainty-aware rates).
select_mab_monitored_explain_with_summaries
Like select_mab_monitored_explain, but uses caller-provided summaries for the base objectives (e.g. prior-smoothed rates), while still computing monitoring scores from monitored.
softmax_map
Compute a stable softmax distribution over a map of scores.
split_control_budget
Split a k-pick budget into control and MAB allocations.
stable_hash64
Deterministic (non-crypto) stable hash used for “random” sampling / tie-breaking.
suggested_window_cap
Suggest a Window capacity based on expected throughput and changepoint rate.
worst_first_pick_k
Pick up to k arms for worst-first regression hunting (without replacement).
worst_first_pick_one
Pick one arm for “worst-first” regression hunting.

Type Aliases§

LinUcbScore
Per-arm score tuple: (ucb, mean, bonus).