muxer: deterministic, multi-objective routing primitives for
piecewise-stationary multi-armed bandit problems.
Given K arms (model versions, inference endpoints, backends, or any
discrete action set selected repeatedly), the agent observes
vector-valued outcomes per call and selects the next arm. Reward
distributions are piecewise-stationary: they may change at unknown
times, and the agent must detect and adapt to these changes.
An `Outcome` carries four caller-defined quality fields:

- `ok`: the call produced a usable result.
- `junk`: quality was below your threshold (`hard_junk = true` implies `junk = true`).
- `hard_junk`: the call failed entirely — a harsher subset of junk, penalized separately.
- `quality_score: Option<f64>`: optional continuous quality signal in `[0, 1]` (higher = better). Supplements the binary flags without changing routing semantics.

Plus `cost_units` (caller-defined cost proxy) and `elapsed_ms` (wall-clock time).
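The invariants above (hard_junk implies junk; quality in `[0, 1]`) can be captured in a small struct. This is an illustrative sketch following the field names in this doc, not the crate's actual `Outcome` definition, which may differ in detail:

```rust
/// Illustrative outcome record mirroring the fields named above.
/// Hypothetical sketch; the real `muxer::Outcome` may differ.
#[derive(Debug, Clone)]
pub struct Outcome {
    pub ok: bool,
    pub junk: bool,
    pub hard_junk: bool,            // harsher subset: hard_junk => junk
    pub quality_score: Option<f64>, // optional continuous signal in [0, 1]
    pub cost_units: f64,            // caller-defined cost proxy
    pub elapsed_ms: u64,            // wall-clock time
}

impl Outcome {
    /// Check the documented invariants: hard_junk implies junk, and
    /// quality_score (if present) lies in [0, 1].
    pub fn is_valid(&self) -> bool {
        (!self.hard_junk || self.junk)
            && self.quality_score.map_or(true, |q| (0.0..=1.0).contains(&q))
    }
}

fn main() {
    let good = Outcome { ok: true, junk: false, hard_junk: false,
        quality_score: Some(0.9), cost_units: 1.0, elapsed_ms: 120 };
    // Violates the hard_junk => junk invariant:
    let bad = Outcome { ok: false, junk: false, hard_junk: true,
        quality_score: None, cost_units: 1.0, elapsed_ms: 80 };
    println!("{} {}", good.is_valid(), bad.is_valid());
}
```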
Goals:
- Deterministic by default: same stats + config → same choice.
- Non-stationarity friendly: sliding-window summaries, not lifetime averages.
- Multi-objective: Pareto frontier over ok/junk/cost/latency, then scalarization.
- Small K: designed for 2–10 arms; not intended for K in the hundreds.
Selection policies:
- `select_mab` / `select_mab_explain` / `select_mab_monitored_explain`: deterministic Pareto + scalarization. When `MabConfig::quality_weight` > 0, `mean_quality_score` is a separate Pareto dimension (6th/9th); scalarization adds `quality_weight × mean_quality_score` explicitly.
- `ThompsonSampling`: seedable Thompson sampling for scalar rewards in `[0, 1]`.
- `Exp3Ix`: seedable EXP3-IX for adversarial / fast-shifting rewards.
- `BanditPolicy` (feature `stochastic`): common `decide` / `update_reward` trait for `ThompsonSampling` and `Exp3Ix` — enables generic policy code.
- `softmax_map`: stable score → probability helper for traffic splitting.
- `LinUcb` (feature `contextual`): linear contextual bandit.
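The role of a stable score → probability helper can be illustrated with a standalone softmax: subtracting the max score before exponentiating avoids overflow, and a `BTreeMap` keeps iteration order deterministic. This sketch only shows the technique; it is not the crate's `softmax_map` implementation:

```rust
use std::collections::BTreeMap;

/// Numerically stable softmax over named scores. Subtracting the max
/// score before exponentiating prevents overflow for large scores;
/// BTreeMap keeps iteration (and downstream tie-breaking) deterministic.
fn softmax_map(scores: &BTreeMap<String, f64>) -> BTreeMap<String, f64> {
    let max = scores.values().cloned().fold(f64::NEG_INFINITY, f64::max);
    let exps: BTreeMap<String, f64> = scores
        .iter()
        .map(|(k, v)| (k.clone(), (v - max).exp()))
        .collect();
    let z: f64 = exps.values().sum();
    exps.into_iter().map(|(k, v)| (k, v / z)).collect()
}

fn main() {
    let mut scores = BTreeMap::new();
    scores.insert("arm_a".to_string(), 2.0);
    scores.insert("arm_b".to_string(), 1.0);
    // Probabilities sum to 1; the higher-scoring arm gets more traffic.
    println!("{:?}", softmax_map(&scores));
}
```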
Operational primitives:
- `TriageSession`: detect-then-triage — per-arm CUSUM detection wired to per-cell (arm × context-bin) investigation.
- `WorstFirstConfig` / `worst_first_pick_k`: post-detection investigation routing.
- `CoverageConfig` / `coverage_pick_under_sampled`: maintenance sampling floor.
- `LatencyGuardrailConfig`: hard pre-filter by mean latency.
- `PipelineOrder` / `policy_plan_generic` / `policy_fill_generic`: harness glue.
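Worst-first investigation routing reduces to ranking arms by how broken they look and picking the worst. A minimal deterministic sketch (illustrative only; the real scorer is configurable and also weighs hard-junk, cost, etc.):

```rust
/// Minimal worst-first sketch: rank arms by windowed junk rate and pick
/// the worst, with ties resolved to the lowest index for determinism.
/// Hypothetical helper; the crate's `worst_first_pick_one` is richer.
fn worst_first_pick_one(junk: &[u64], calls: &[u64]) -> Option<usize> {
    (0..junk.len())
        .filter(|&i| calls[i] > 0) // skip arms with no observations
        .max_by(|&a, &b| {
            let ra = junk[a] as f64 / calls[a] as f64;
            let rb = junk[b] as f64 / calls[b] as f64;
            // On equal rates, prefer the lower index (deterministic).
            ra.partial_cmp(&rb).unwrap().then(b.cmp(&a))
        })
}

fn main() {
    // Arm 1 has the worst junk rate (0.4 vs 0.0 and 0.1).
    println!("{:?}", worst_first_pick_one(&[0, 4, 1], &[10, 10, 10]));
}
```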
Non-goals:
- Not a full bandit platform (no storage, OPE pipelines, dashboards).
- `contextual` is intentionally a small, pragmatic policy module.
§The Three Objectives and the Objective Manifold
Every routing decision simultaneously serves three purposes:
- Exploitation (regret minimization): route to the best arm now.
- Estimation (learning): understand how each arm performs across conditions.
- Detection/triage (monitoring): notice when an arm breaks, then investigate.
These map to the modules of this crate:
- Selection (`select_mab`, `Exp3Ix`, `ThompsonSampling`): objectives 1+2.
- Monitoring (`monitor`): objective 3 – drift/catKL/CUSUM are the detection surface of the design measure.
- Triage (`worst_first`): active investigation after detection fires – prioritize historically broken arms to characterize the change.
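The one-sided CUSUM recursion behind this kind of detection surface is short. A Bernoulli-outcome sketch of the standard statistic (illustrative, not the crate's `monitor` internals): the statistic accumulates log-likelihood ratios, floors at zero, and fires when it crosses a threshold.

```rust
/// One-sided CUSUM for a rate increase p0 -> p1 on Bernoulli outcomes
/// (e.g. junk events): S_t = max(0, S_{t-1} + llr(x_t)), firing when
/// S_t crosses the threshold h. Illustrative sketch only.
fn cusum_fires(xs: &[bool], p0: f64, p1: f64, h: f64) -> bool {
    let mut s = 0.0_f64;
    for &x in xs {
        // Log-likelihood ratio of this observation under p1 vs p0.
        let llr = if x {
            (p1 / p0).ln()
        } else {
            ((1.0 - p1) / (1.0 - p0)).ln()
        };
        s = (s + llr).max(0.0); // one-sided: never goes below zero
        if s >= h {
            return true;
        }
    }
    false
}

fn main() {
    // A burst of junk (true) events drives the statistic over h quickly.
    println!("{}", cusum_fires(&[true, true, true], 0.1, 0.5, 3.0));
}
```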
§The non-contextual collapse (static schedules)
For fixed (non-adaptive) allocation schedules with K arms and Gaussian rewards, estimation error (MSE ~ 1/n) and average detection delay (D_avg ~ C*T / (n * delta^2)) are both monotone-decreasing in the suboptimal-arm pull count n, with proportional gradients everywhere:
D_avg = (2 * C * sigma^2 * ln(1/alpha) / delta^2) * MSE

This proportionality is structural, not approximate: both functionals care only
about “how many observations at this cell”, and their sensitivity functions are
scalar multiples of each other. The three-way Pareto surface collapses to a
one-dimensional curve parameterized by n. This yields the product identity
R_T * D_avg = C * Delta * T / delta^2 for the restricted class of uniform
schedules.
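The proportionality is easy to check numerically: with MSE(n) = sigma^2 / n and D_avg(n) given by the formula above, the ratio D_avg / MSE is the same constant for every n. A sketch under that notation (the parameter values here are made up):

```rust
// Free parameters from the formula above (illustrative values only).
const C: f64 = 1.0;     // detection constant
const ALPHA: f64 = 0.05; // false-alarm level
const SIGMA: f64 = 1.0;  // noise scale
const DELTA: f64 = 0.5;  // mean shift

/// Estimation error of a Gaussian mean from n samples: MSE(n) = sigma^2 / n.
fn mse(n: f64) -> f64 {
    SIGMA * SIGMA / n
}

/// Average detection delay per the formula above: a constant times MSE(n).
fn d_avg(n: f64) -> f64 {
    2.0 * C * SIGMA * SIGMA * (1.0 / ALPHA).ln() / (DELTA * DELTA) * mse(n)
}

fn main() {
    // D_avg / MSE does not depend on n: both objectives move along a
    // single curve parameterized by the suboptimal-arm pull count.
    for n in [10.0, 100.0, 1000.0] {
        println!("n = {n:>6}: D_avg/MSE = {:.6}", d_avg(n) / mse(n));
    }
}
```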
Categorical note. The formula above uses Gaussian notation (sigma^2, mean
shift delta). muxer operates on categorical outcomes (ok/junk/hard_junk),
where detection delay scales as h / KL(p1 || p0) in sample time rather than
2*b*sigma^2 / delta^2. The proportionality structure — MSE and average
detection delay both O(1/n_k) — holds identically; only the constants change.
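The categorical h / KL(p1 || p0) scaling can be sketched directly: a KL divergence over the outcome simplex yields a sample-time delay estimate. The distributions below are made up for illustration:

```rust
/// KL divergence between two categorical distributions, in nats.
/// Categories with p[i] == 0 contribute nothing to the sum.
fn kl(p: &[f64], q: &[f64]) -> f64 {
    p.iter()
        .zip(q)
        .filter(|(pi, _)| **pi > 0.0)
        .map(|(pi, qi)| pi * (pi / qi).ln())
        .sum()
}

fn main() {
    // Pre-change vs post-change over (ok, junk, hard_junk) -- made up.
    let p0 = [0.90, 0.08, 0.02];
    let p1 = [0.70, 0.20, 0.10];
    let h = 4.0; // CUSUM threshold
    // Expected detection delay counted in samples of this arm.
    let delay_samples = h / kl(&p1, &p0);
    println!("{delay_samples:.1}");
}
```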
See pare::sensitivity for the general form.
Caveat: adaptive policies. This clean proportionality holds exactly only for static (non-adaptive) schedules. Adaptive policies (UCB, Thompson Sampling) break it in at least two ways: (1) they allocate the suboptimal arm in bursts during exploration phases rather than uniformly, worsening worst-case detection delay relative to uniform spacing without changing total n; and (2) data-dependent allocation introduces selection bias, so the effective sample size for estimation is no longer simply n. For adaptive policies, the product identity becomes a lower bound (with constants that absorb regularity conditions), not an equality.
Practically, muxer operates at small K (2-10 arms) and moderate T (hundreds to
low thousands of windowed observations). At these scales, the asymptotic
impossibility results may not bind: all three objectives can often be simultaneously
satisfied at acceptable levels. The sliding-window design further blunts the
static/adaptive distinction, since the effective horizon is the window size, not T.
§The contextual revival – and its subtlety
In the contextual regime (per-request feature vectors via LinUcb), the design
measure gains spatial dimensions, but objectives do not automatically diverge.
The mechanism controlling collapse vs. revival is average-case vs. worst-case
aggregation, not “contextual vs. non-contextual” per se:
- Average detection delay has sensitivity s(x) ~ -1/p_a(x)^2, which is proportional to nonparametric IMSE sensitivity everywhere. Average detection is structurally redundant with estimation even in contextual settings. Adding contexts does not break this proportionality.
- Worst-case detection delay (D_max = max_j D_j) concentrates its sensitivity on the bottleneck cell – the (arm, covariate) pair with the fewest observations. This is a point mass, linearly independent from both the regret sensitivity (a ramp near decision boundaries) and the estimation sensitivity (D-optimal / extremal). Worst-case detection is genuinely independent from regret and estimation.
In the non-contextual case (one cell), average and worst-case are identical, so the distinction is moot. In the contextual case (many cells), they diverge: average detection remains redundant with estimation, but worst-case detection introduces a genuinely new objective axis.
Concretely, each objective wants a different sampling distribution:
- Regret-optimal: concentrate near decision boundaries.
- Estimation-optimal: spread to extremes (D-optimal experimental design).
- Detection-optimal (worst-case): ensure no cell is starved (space-filling).
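The worst-case-detection (space-filling) objective reduces to "feed the most starved cell". A minimal sketch of that allocation rule (hypothetical helper; the crate's coverage machinery adds quotas and configuration):

```rust
/// Space-filling allocation sketch: always sample the (arm, context-bin)
/// cell with the fewest observations, so no cell's worst-case detection
/// delay blows up. Ties resolve to the lowest index for determinism.
fn most_starved_cell(counts: &[u64]) -> Option<usize> {
    counts
        .iter()
        .enumerate()
        .min_by_key(|&(i, &c)| (c, i)) // count first, then index
        .map(|(i, _)| i)
}

fn main() {
    // Cell 1 has only one observation, so it gets the next sample.
    println!("{:?}", most_starved_cell(&[5, 1, 3]));
}
```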
LinUcb exists to break the non-contextual collapse: it learns per-request routing
without maintaining separate per-facet histories, at the cost of requiring explicit
monitoring budget beyond what regret-optimal sampling provides.
§Saturation principle
The number of genuinely independent objectives is bounded by the effective dimension of the design space:
dim(Pareto front) <= min(m - 1, D_eff)

where m is the number of named objectives and D_eff is the design dimension
(K-1 for non-contextual, ~K*M for M covariate cells, infinite for continuous
covariates). Adding objectives beyond D_eff + 1 cannot create new tradeoffs.
The formal algebraic rank can overstate the practical number of tradeoffs. The
effective Pareto dimension is better measured by the singular value spectrum of
the Jacobian of objectives with respect to design variables: a K=3, M=9 setup with
8 named objectives achieves formal rank 8 but effective dimension ~3-4 (the top 2
singular values carry >99% of the Frobenius norm). See pare::sensitivity for
computational tools.
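The "effective dimension" reading can be made concrete: given the (descending) singular values of the objective Jacobian, count how many are needed to capture a target fraction of the squared Frobenius norm. A sketch with a made-up spectrum:

```rust
/// Smallest k such that the top-k singular values carry at least `frac`
/// of the squared Frobenius norm (sum of sigma_i^2). Assumes the
/// singular values are sorted in descending order. Illustrative sketch.
fn effective_dim(singular_values: &[f64], frac: f64) -> usize {
    let total: f64 = singular_values.iter().map(|s| s * s).sum();
    let mut acc = 0.0;
    for (k, s) in singular_values.iter().enumerate() {
        acc += s * s;
        if acc >= frac * total {
            return k + 1;
        }
    }
    singular_values.len()
}

fn main() {
    // A spectrum with two dominant directions: formal rank is 4, but the
    // top 2 singular values carry >99% of the squared Frobenius norm.
    println!("{}", effective_dim(&[10.0, 5.0, 0.1, 0.01], 0.99));
}
```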
§Design measure perspective
The fundamental object is not “which objectives matter” but the design measure: the joint distribution over (arm, covariate, time) that the policy induces. Given the design measure, every objective is computable.
Note: the design measure in an adaptive setting is a random object (it depends on the realized trajectory), not a fixed distribution. Reasoning about it requires either working with the expected design measure (which loses adaptivity) or a conditional analysis that respects the filtration (which is harder and may break clean proportionality results). See Hadad et al. (arXiv:1911.02768) for the observed vs. expected Fisher information distinction in adaptive experiments.
§Related work
Non-stationary bandits and sliding windows.
Garivier & Moulines (2008, arXiv:0805.3415) established the Sliding-Window UCB
(SW-UCB) algorithm, which achieves O(sqrt(Υ_T * K * T * log T)) regret for
piecewise-stationary environments with Υ_T changepoints. This is the theoretical
foundation for muxer’s sliding-window approach. Optimal window size is
O(sqrt(T / Υ_T)); in practice muxer uses a fixed caller-chosen cap.
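The O(sqrt(T / Υ_T)) rule of thumb translates into a simple window-cap heuristic. This sketch mirrors only the shape suggested by the doc; the crate's actual `suggested_window_cap` formula and clamping are assumptions here:

```rust
/// Window-cap heuristic from SW-UCB: w ~ sqrt(T / upsilon), where T is
/// the expected number of observations over the horizon and upsilon the
/// expected number of changepoints. Clamped to a small minimum so the
/// window is never degenerate. Hypothetical sketch, not the crate's code.
fn window_cap_heuristic(expected_obs: u64, expected_changepoints: u64) -> usize {
    let ups = expected_changepoints.max(1) as f64; // avoid divide-by-zero
    ((expected_obs as f64 / ups).sqrt().ceil() as usize).max(8)
}

fn main() {
    // 10_000 observations, ~4 changepoints -> sqrt(2500) = 50.
    println!("{}", window_cap_heuristic(10_000, 4));
}
```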
Non-stationary bandits with explicit detection.
Besson, Kaufmann, Maillard & Seznec (2019, arXiv:1902.01575, JMLR 2023) introduced
GLR-klUCB: a parameter-free algorithm combining kl-UCB with a Bernoulli Generalized
Likelihood Ratio Test. It is the closest published analog to muxer’s monitored
selection path. Key difference: GLR-klUCB restarts all arm statistics on
detection; muxer instead switches to worst-first routing to investigate the flagged
arm, preserving history. See TriageSession and WorstFirstConfig.
Bandit Quickest Changepoint Detection (BQCD).
Gopalan, Saligrama & Lakshminarayanan (2021, arXiv:2107.10492) established the BQCD
lower bound: any algorithm with mean-time-to-false-alarm m must suffer expected
detection delay Ω(log(m) / D*), where D* is the maximum KL divergence between
post-change and pre-change distributions across arms. Their ε-GCD algorithm achieves
this with exploration rate ε = Θ(1/√T). CoverageConfig’s minimum sampling-rate
floor is the practical lever for approaching this bound.
Sampling-constrained detection.
Zhang & Mei (2020, arXiv:2009.11891) directly analyze changepoint detection under
sampling-rate constraints, confirming that detection delay in wall time scales as
h / (KL(p1 || p0) * rate_k) — the formal basis for the two-clocks approximation.
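The "two clocks" point is a change of units: CUSUM delay is counted in samples of the monitored arm, and dividing by that arm's sampling rate converts it to wall time. A sketch with Bernoulli outcomes (h and the rates here are made up):

```rust
/// Bernoulli KL divergence KL(p || q), in nats.
fn kl_bern(p: f64, q: f64) -> f64 {
    p * (p / q).ln() + (1.0 - p) * ((1.0 - p) / (1.0 - q)).ln()
}

/// Two-clocks approximation: delay in *samples* of the arm is
/// h / KL(p1 || p0); delay in *wall time* divides by the arm's sampling
/// rate. Halving the rate doubles wall-clock detection delay.
fn wall_clock_delay(h: f64, p0: f64, p1: f64, rate: f64) -> f64 {
    h / (kl_bern(p1, p0) * rate)
}

fn main() {
    let fast = wall_clock_delay(4.0, 0.05, 0.25, 1.0); // sampled every step
    let slow = wall_clock_delay(4.0, 0.05, 0.25, 0.5); // sampled half the time
    println!("{fast:.1} vs {slow:.1}");
}
```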
Regret–BAI Pareto frontier.
Zhong, Cheung & Tan (2021, arXiv:2110.08627) formally prove the Pareto tradeoff
between regret minimization (RM) and best-arm identification (BAI): achieving
O(log T) regret and O(log T) BAI error simultaneously is impossible. The
product-identity formulation in muxer’s docs is the static-schedule special case.
Piecewise-stationary multi-objective bandits.
Rezaei Balef & Maghsudi (2023, arXiv:2302.05257) study the combination of
non-stationarity and multi-objective rewards directly. muxer operates in this
space — without claiming worst-case-optimal bounds — by combining sliding-window
summaries with Pareto scalarization and explicit monitoring.
Window-limited CUSUM.
Xie, Moustakides & Xie (2022, arXiv:2206.06777) connect windowed observation to
CUSUM optimality, providing theoretical grounding for MonitoredWindow’s split
between baseline and recent windows.
Multi-objective bandit frameworks.
- Constrained / safe bandits (BwK): muxer’s `max_junk_rate`, `max_drift`, etc. are BwK-style anytime constraints. (Badanidiyuru, Kleinberg & Slivkins 2013, FOCS; arXiv:1305.2545)
- Pareto bandits (Drugan & Nowé, IJCNN 2013): muxer’s `pare`-based frontier is the selection-time analogue.
- Information-Directed Sampling (Russo & Van Roy 2014, arXiv:1403.5556): scalarizes regret/information via the information ratio. The three-objective extension is non-trivial only in the contextual worst-case-detection regime.
- Multi-objective RL: muxer’s deterministic post-Pareto scalarization is a pragmatic instance of this literature.
- Adaptive experiment design (Hadad et al. 2021, arXiv:1911.02768): the observed vs. expected Fisher information distinction applies to muxer’s adaptive design-measure analysis.
Re-exports§
pub use monitor::calibrate_cusum_threshold;
pub use monitor::simulate_cusum_null_max_scores;
pub use monitor::calibrate_threshold_from_max_scores;
pub use monitor::DriftConfig;
pub use monitor::DriftMetric;
pub use monitor::MonitoredWindow;
pub use monitor::RateBoundMode;
pub use monitor::ThresholdCalibration;
pub use monitor::UncertaintyConfig;
Modules§
- `monitor` - Monitoring primitives: drift detection + uncertainty summaries.
Structs§
- `BetaStats` - Beta posterior state for one arm.
- `CandidateDebug` - Debug row for one candidate arm.
- `CatKlGuardDecision` - Output of applying a categorical KL guard to a candidate set.
- `ContextBinConfig` - Configuration for quantising a feature vector into a discrete context bin.
- `ContextualCell` - An (arm, context-bin) pair identifying a specific covariate cell for triage.
- `ContextualCoverageTracker` - Per-(arm, context-bin) observation tracker.
- `ContextualPolicyFill` - Result of a contextual policy fill (LinUCB-based selection with context features).
- `ControlConfig` - Configuration for control arm reservation.
- `CoverageConfig` - Coverage / maintenance sampling configuration.
- `CusumGuardDecision` - Output of applying a categorical CUSUM guard to a candidate set.
- `Decision` - A single policy decision in a unified envelope.
- `DriftGuardDecision` - Output of applying a drift guard to a candidate set.
- `Exp3Ix` - Seedable EXP3-IX bandit.
- `Exp3IxConfig` - Configuration for EXP3-IX.
- `Exp3IxState` - Serializable EXP3-IX state snapshot (for persistence).
- `LatencyGuardrailConfig` - Latency guardrail configuration.
- `LinUcb` - Seedable linear contextual bandit (LinUCB).
- `LinUcbArmState` - Serializable per-arm state for LinUCB persistence.
- `LinUcbConfig` - Configuration for linear UCB.
- `LinUcbState` - Serializable LinUCB state snapshot (for persistence across process restarts).
- `MabConfig` - Configuration knobs for deterministic MAB-style selection.
- `MabSelectionDecision` - Additional metadata for a deterministic `select_mab` decision.
- `MonitoredMabConfig` - Extended configuration for monitored MAB selection.
- `Outcome` - A single observed outcome for an arm.
- `OutcomeIdx` - Categorical outcome index constants for the 4-category muxer outcome space.
- `PolicyFill` - Result of filling up to `k` picks using the shared policy pipeline.
- `PolicyPlan` - Planned policy result for a single selection step.
- `Router` - Stateful routing session.
- `RouterConfig` - Full configuration for a `Router`.
- `RouterDecision` - Output of a single `Router::select` call.
- `RouterSnapshot` - A serializable snapshot of `Router` state.
- `Selection` - Output of `select_mab` (chosen arm + debugging context).
- `StickyConfig` - Stickiness configuration.
- `StickyMab` - Stateful stickiness wrapper for deterministic `select_mab`.
- `Summary` - Aggregate counts/sums over a window of outcomes.
- `ThompsonConfig` - Configuration for Thompson sampling.
- `ThompsonSampling` - Seedable Thompson-sampling bandit.
- `ThompsonState` - Serializable Thompson-sampling state snapshot (for persistence across process restarts).
- `TriageSession` - Combined detect-then-triage session.
- `TriageSessionConfig` - Configuration for a `TriageSession`.
- `Window` - Sliding-window statistics for an arm.
- `WorstFirstConfig` - Configuration for worst-first scoring.
Enums§
- `DecisionNote` - Audit-friendly notes attached to a decision.
- `DecisionPolicy` - Which policy produced a decision.
- `PipelineOrder` - Pipeline ordering: whether novelty/coverage pre-picks run before or after the guardrail.
- `RouterMode` - Current operating mode of the `Router`.
Traits§
- `BanditPolicy` - Common interface for stateful stochastic bandit policies.
Functions§
- `apply_prior_counts_to_summary` - Apply a pseudo-count prior to a `Summary` until it reaches `target_calls`.
- `context_bin` - Compute a stable context bin ID from a feature vector.
- `contextual_worst_first_pick_k` - Pick up to `k` (arm, context-bin) cells for contextual worst-first triage (without replacement).
- `contextual_worst_first_pick_one` - Pick one (arm, context-bin) cell for contextual worst-first regression triage.
- `coverage_pick_under_sampled` - Deterministically pick up to `k` arms that are under-sampled relative to a target quota.
- `coverage_pick_under_sampled_idx` - Deterministically pick up to `k` arms (by index) that are under-sampled relative to a target quota.
- `guardrail_filter_observed` - Apply the latency guardrail using observed (non-smoothed) stats.
- `guardrail_filter_observed_elapsed` - Like `guardrail_filter_observed`, but the observation function returns the raw elapsed-ms sum.
- `novelty_pick_unseen` - Deterministically pick up to `k` arms that are “unseen” under the provided `observed_calls`.
- `pick_control_arms` - Pick up to `cfg.control_k` arms uniformly at random (deterministic, seed-based).
- `policy_fill_generic` - Generic policy fill: compute a plan via `policy_plan_generic`, then fill remaining picks using `pick_rest`.
- `policy_fill_k_contextual` - Contextual variant of `policy_fill_k_observed_with`: uses LinUCB for the algorithmic selection step instead of a flat MAB/EXP3-IX policy.
- `policy_fill_k_observed_guardrail_first_with_coverage` - Delegates to `policy_fill_generic` with `PipelineOrder::GuardrailFirst` and caller-provided coverage.
- `policy_fill_k_observed_with_coverage` - Delegates to `policy_fill_generic` with `PipelineOrder::NoveltyFirst` and caller-provided coverage.
- `policy_plan_generic` - Generic policy plan: compute pre-picks (novelty + optional coverage) and guardrail filtering, with ordering controlled by `order`.
- `select_k_without_replacement_by` - Convenience wrapper over `select_k_without_replacement_by_with_meta` when no meta is needed.
- `select_mab` - Deterministic selection:
- `select_mab_decide` - Unified decision envelope for deterministic MAB selection.
- `select_mab_explain` - Like `select_mab`, but also returns metadata about constraints and explore-first behavior.
- `select_mab_monitored_decide` - Unified decision envelope for monitored deterministic MAB selection.
- `select_mab_monitored_explain` - Monitored deterministic selection (baseline vs recent drift + uncertainty-aware rates).
- `select_mab_monitored_explain_with_summaries` - Like `select_mab_monitored_explain`, but uses caller-provided summaries for the base objectives (e.g. prior-smoothed rates), while still computing monitoring scores from `monitored`.
- `softmax_map` - Compute a stable softmax distribution over a map of scores.
- `split_control_budget` - Split a k-pick budget into control and MAB allocations.
- `stable_hash64` - Deterministic (non-crypto) stable hash used for “random” sampling / tie-breaking.
- `suggested_window_cap` - Suggest a `Window` capacity based on expected throughput and changepoint rate.
- `worst_first_pick_k` - Pick up to `k` arms for worst-first regression hunting (without replacement).
- `worst_first_pick_one` - Pick one arm for “worst-first” regression hunting.
Type Aliases§
- `LinUcbScore` - Per-arm score tuple: `(ucb, mean, bonus)`.