muxer

Deterministic, multi-objective routing for piecewise-stationary multi-armed bandit problems with small action sets.

Problem setting

Given K arms (model versions, inference endpoints, backends -- any discrete action set selected repeatedly), an agent observes vector-valued outcomes per call and must choose the next arm. Each outcome carries binary quality signals (ok, junk, hard_junk), an optional continuous quality score in [0, 1], a cost proxy, and wall-clock latency. Reward distributions are piecewise-stationary: they may change at unknown times, and the agent must detect and adapt.

The setting is a multi-objective extension of the stochastic MAB [1]: the agent simultaneously minimizes cost, latency, and degradation rate while maximizing success rate. These objectives cannot in general be collapsed to a single scalar without losing information. The core selection algorithm:

Maintains a sliding window of recent Outcomes per arm [6].
Computes a Pareto frontier over the objective vector using standard Pareto dominance [7].
Selects deterministically from the frontier via linear weighted scalarization with configurable weights and a stable lexicographic tie-break.

Standard single-objective approaches fail in characteristic ways: pure exploitation prevents detection of regressions on non-selected arms; uniform allocation wastes budget on known-degraded arms; threshold-based cooldown misses gradual distribution drift. An effective policy must:

explore arms with high uncertainty or recent distributional changes
detect regressions by maintaining sufficient sampling rate on all arms
remain deterministic by default (identical statistics and configuration produce identical selections)

Outcome fields

ok: the call produced a usable result
junk: quality fell below the caller's threshold; hard_junk=true implies junk=true
hard_junk: complete failure (error, timeout, parse failure) -- strict subset of junk, penalized separately
quality_score: Option<f64>: continuous quality signal in [0, 1]. When MabConfig::quality_weight > 0, incorporated as a separate Pareto dimension.
cost_units: caller-defined cost proxy
elapsed_ms: wall-clock time

Designed for 2--10 arms and moderate window sizes (hundreds to low thousands of observations).

Capabilities

Category	Primitive	Description
Selection	`select_mab` / `select_mab_explain`	Deterministic Pareto + scalarization over sliding-window summaries
	`ThompsonSampling`	Seedable Thompson sampling [2] for scalar rewards in `[0, 1]`; optional decay
	`Exp3Ix`	Seedable EXP3-IX [3] for adversarial / non-stationary rewards
	`LinUcb` (feature `contextual`)	Linear contextual bandit [4] with per-request feature vectors
	`BanditPolicy` trait (feature `stochastic`)	Common `decide`/`update_reward` interface over `ThompsonSampling` and `Exp3Ix`
	`softmax_map`	Score-to-probability conversion for traffic splitting
Monitoring	`TriageSession`	Per-arm CUSUM [5] detection with per-`(arm, context-bin)` cell investigation
	`CusumCatBank`	Bank of CUSUM alternatives at multiple reference levels (GLR-inspired [8] robustification)
	`calibrate_cusum_threshold`	Monte Carlo: find threshold `h` such that `P[alarm within m rounds] <= alpha`
Triage	`worst_first_pick_k`	Post-detection: prioritize the most degraded arms for investigation
	`ContextualCoverageTracker`	Lift triage to `(arm, context-bin)` pairs for localized regressions
Coverage	`CoverageConfig`	Maintenance sampling floor: keep all arms measured above a quota
	`ControlConfig` / `pick_control_arms`	Reserve deterministic-random picks as a selection-bias anchor
Guardrails	`LatencyGuardrailConfig`	Hard pre-filter by mean latency with stop-early semantics
	Multi-pick (`select_mab_k_guardrailed_*`)	Select up to `k` unique arms per decision with per-round guardrails
Orchestration	`Router`	Stateful session: owns per-arm state, full select/observe/triage lifecycle, dynamic arm add/remove
	`suggested_window_cap`	SW-UCB-derived [6] window sizing: `sqrt(throughput / change_rate)`
	`PipelineOrder` / `PolicyPlan`	Harness glue for composing custom routing pipelines
Utilities	`Decision` envelope	Unified decision record (`chosen` + `probs` + typed audit `notes`) for logging/replay
	`novelty_pick_unseen` / `apply_prior_counts_to_summary`	Novelty helpers and prior smoothing

Which policy to use

select_mab: Multiple objectives with deterministic selection. O(sqrt(K T ln T)) pseudo-regret on the scalarized objective under stationarity [1].
ThompsonSampling: Single scalar reward in [0, 1]. O(sqrt(K T ln T)) Bayes regret [2]. Seedable; optional decay.
Exp3Ix: Adversarial rewards. O(sqrt(K T ln K)) expected regret [3]. Seedable; optional decay.
LinUcb (feature contextual): Per-request feature vector. O(d sqrt(T ln T)) regret under linear realizability [4].
BanditPolicy trait (feature stochastic): Generic fn run<P: BanditPolicy>(p: &mut P) over both stochastic policies.

Routing lifecycle

Normal (select_mab / ThompsonSampling / Exp3Ix): select the best arm while exploring.
Regression investigation (worst_first_pick_k): after CUSUM [5] fires, route additional traffic to characterize the change. TriageSession automates the handoff.
Control (pick_control_arms): uniform-random baseline to anchor quality estimates and detect selection bias.

CoverageConfig bridges 1 and 3: minimum per-arm sampling rate prevents starvation that would delay regression detection.

Three goals for sampling

Every routing decision serves three competing objectives:

Exploitation -- minimize cumulative regret.
Estimation -- reduce uncertainty about each arm's current performance.
Detection -- minimize delay between a distributional change and the alarm.

The two clocks. An arm is only observed when selected:

Wall time t: global decision steps.
Sample time n_k: observations for arm k.

Detection delay in wall time: delay_wall ~ delay_samples / rate_k. CoverageConfig sets a floor on rate_k.

The non-contextual collapse. Under fixed allocation, MSE and average CUSUM detection delay are both O(1/n_k). Their sensitivity functions are scalar multiples. The three-way Pareto surface collapses to a 1-D curve parameterized by n_k:

R_T * D_avg = C * Delta * T / delta^2

You cannot independently improve both regret and detection delay without increasing the total sample budget.

The contextual revival. With LinUcb, average detection delay stays proportional to estimation error. But worst-case detection delay -- concentrated on the covariate cell with fewest observations -- is genuinely independent. This is why ContextualCoverageTracker and TriageSession exist: localized regressions in sparse covariate regions need cell-level triage.

See the API docs and examples/EXPERIMENTS.md for derivations and failure modes.

Usage

[dependencies]
muxer = "0.3.9"

Deterministic core only (no stochastic bandits):

[dependencies]
muxer = { version = "0.3.9", default-features = false }

Quickstart

use muxer::{Router, RouterConfig, Outcome};

let arms = vec!["backend-a".to_string(), "backend-b".to_string()];
let mut router = Router::new(arms, RouterConfig::default()).unwrap();

loop {
    let d = router.select(1, 0);
    let arm = d.primary().unwrap().to_string();

    let outcome = Outcome { ok: true, junk: false, hard_junk: false,
                            cost_units: 5, elapsed_ms: 120, quality_score: None };
    router.observe(&arm, outcome);

    if router.mode().is_triage() {
        router.acknowledge_change(&arm);
    }
}

For larger arm counts, pass k > 1 to batch exploration:

let cfg = RouterConfig::default().with_coverage(0.02, 1);
let d = router.select(3, 0); // K=30, k=3 -> coverage in ~10 rounds

Examples

Deterministic multi-objective selection

use muxer::{select_mab, MabConfig, Summary};
use std::collections::BTreeMap;

let arms = vec!["a".to_string(), "b".to_string()];
let mut summaries = BTreeMap::new();
summaries.insert("a".to_string(), Summary { calls: 10, ok: 9, junk: 0, hard_junk: 0, cost_units: 10, elapsed_ms_sum: 900, mean_quality_score: None });
summaries.insert("b".to_string(), Summary { calls: 10, ok: 9, junk: 2, hard_junk: 0, cost_units: 10, elapsed_ms_sum: 900, mean_quality_score: None });

let sel = select_mab(&arms, &summaries, MabConfig::default());
assert_eq!(sel.chosen, "a"); // lower junk rate wins

Detect-then-triage

use muxer::{TriageSession, TriageSessionConfig, OutcomeIdx};

let arms = vec!["a".to_string(), "b".to_string()];
let mut session = TriageSession::new(&arms, TriageSessionConfig::default()).unwrap();

session.observe("a", OutcomeIdx::OK, &[0.2, 0.3]);
session.observe("b", OutcomeIdx::HARD_JUNK, &[0.8, 0.9]);

let alarmed = session.alarmed_arms();
let bins = session.tracker().active_bins();
let cells = session.top_alarmed_cells(&bins, 3);

Runnable examples

Getting started

getting_started.rs -- Minimal 3-backend routing loop in 60 lines. Shows the core lifecycle: select an arm, observe an outcome with delayed quality labeling, let the router adapt. Start here.

router_quickstart.rs -- Full routing lifecycle: two arms diverge in quality, CUSUM fires a triage alarm, the operator acknowledges and the router recovers. Also demonstrates large-K batch exploration.

router_production.rs -- Production configuration. CUSUM threshold calibration, monitoring windows, coverage enforcement, guardrails, control arms, and snapshot/restore for process restarts.

cargo run --example getting_started
cargo run --example router_quickstart
cargo run --example router_production --features stochastic

Algorithm variants

deterministic_router.rs -- Multi-objective MAB with hard constraints on junk rate and cost/latency weights. Deterministic tie-breaking for reproducible selections.

thompson_router.rs -- Thompson sampling with decay for non-stationary binary rewards. Two arms with drifting success probabilities.

exp3ix_router.rs -- Exp3-IX for adversarial settings where reward distributions change over time.

contextual_router.rs -- LinUCB with 2D context features (prompt length, difficulty). Arm performance depends on input characteristics.

sticky_mab_router.rs -- Dwell-time and margin-based switching. Prevents backend thrashing when arm estimates are noisy.

monitored_router.rs -- Drift detection via Hellinger distance between baseline and recent observation windows. Maintenance sampling prevents lock-on.

cargo run --example deterministic_router
cargo run --example thompson_router
cargo run --example exp3ix_router
cargo run --example contextual_router --features contextual
cargo run --example sticky_mab_router
cargo run --example monitored_router --features stochastic
cargo run --example end_to_end_router --features stochastic

Configuration patterns

end_to_end_router.rs -- Delayed quality labeling: push outcomes with unknown junk status, update later after validation. Combines stickiness and constraints.

mab_constraints_tuning.rs -- Constraints-first design. Hard-filter arms exceeding 10% junk, then tune cost/latency weights among the clean set.

coverage_autotune.rs -- Derive the coverage sampling rate from KL divergence and CUSUM theory to guarantee regression detection within a wall-time budget.

contextual_propensity_logging.rs -- Extract per-arm selection probabilities from LinUCB decisions for offline inverse-propensity-weighted analysis.

guardrail_semantics.rs -- Soft (novelty before guardrail) vs hard (guardrail-first) policy ordering. Shows how the order of filtering stages affects which arms get trialed.

decision_unified.rs -- The unified Decision type with stickiness. Shows how dwell-time prevents thrashing when the base policy flaps between arms.

window_delayed_junk_label.rs -- Minimal pattern for deferred quality labeling: push an outcome with junk=false, then update to true after downstream validation.

cargo run --example mab_constraints_tuning
cargo run --example coverage_autotune
cargo run --example contextual_propensity_logging --features contextual
cargo run --example guardrail_semantics
cargo run --example decision_unified
cargo run --example window_delayed_junk_label

Detection and calibration

detector_calibration.rs -- Monte Carlo threshold calibration: find the CUSUM threshold h such that P[false alarm within m rounds] <= alpha.

detector_inertia.rs -- Cumulative vs forgetful detectors. CatKL accumulates history and loses sensitivity over time; CUSUM forgets and stays responsive. Shows the effect of pre-change baseline length on detection delay.

bqcd_sampling.rs -- Partial-observation multi-armed detection. K arms are only observed when sampled, so the policy must balance exploration (find the changed arm), exploitation (focus on it), and coverage (don't starve others).

bqcd_calibrated.rs -- Global false-alarm calibration across arms with a CUSUM bank of alternative hypotheses.

cargo run --example detector_calibration
cargo run --example detector_inertia
cargo run --example bqcd_sampling
cargo run --example bqcd_calibrated

Domain harnesses

Each harness simulates a realistic routing scenario with multi-dimensional slicing (task x dataset x environment), per-slice arm eligibility, and injected drift at a known epoch.

matrix_harness.rs -- NLP evaluation: 5 task-dataset pairs (NER, relation extraction, coreference). A BERT-ONNX model drifts on social-media NER at epoch 48.

pcap_triage_harness.rs -- Network security: route packet-analysis backends (Suricata, Zeek, ML) per protocol/dataset. TLS evasion causes drift.

ad_auction_harness.rs -- Ad ranking: route auction models per objective/geo/device. A privacy policy change degrades the deep model on EU mobile traffic.

fraud_scoring_harness.rs -- Payments fraud: route fraud scorers per channel/region/segment. A rules model drifts on EU card-not-present after a fraud pattern shift.

medical_triage_harness.rs -- Clinical risk scoring: route triage models by setting/acuity/cohort. A protocol change breaks high-acuity pediatric ED scoring.

search_ranking_harness.rs -- Search ranking: route rankers by intent/locale/device. BM25 regresses on informational Spanish queries after a corpus shift.

synthetic_drift_harness.rs -- Controlled experiment with known drift injection point. Validates that the router detects and adapts correctly.

cargo run --example matrix_harness
cargo run --example pcap_triage_harness
cargo run --example ad_auction_harness
cargo run --example fraud_scoring_harness
cargo run --example medical_triage_harness
cargo run --example search_ranking_harness
cargo run --example synthetic_drift_harness

Investigation

free_lunch_investigation.rs -- When does change detection require coverage? Compares Thompson sampling, deterministic MAB, and MAB+coverage when one arm drifts. Shows that passive monitoring fails if the changed arm gets starved of observations.

cargo run --example free_lunch_investigation

Mini-experiments (trade-offs and failure modes): examples/EXPERIMENTS.md.

Documentation

Development

cargo test -p muxer
cargo bench -p muxer --bench coverage
cargo bench -p muxer --bench monitor
cargo fmt -p muxer --check
cargo clippy -p muxer --all-targets -- -D warnings

References

P. Auer, N. Cesa-Bianchi, and P. Fischer. "Finite-time analysis of the multiarmed bandit problem." Machine Learning, 47(2-3):235--256, 2002.
S. Agrawal and N. Goyal. "Analysis of Thompson Sampling for the Multi-armed Bandit Problem." COLT, 2012.
P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. "The Nonstochastic Multiarmed Bandit Problem." SIAM J. Comput., 32(1):48--77, 2002.
W. Chu, L. Li, L. Reyzin, and R. Schapire. "Contextual Bandits with Linear Payoff Functions." AISTATS, 2011.
E. S. Page. "Continuous inspection schemes." Biometrika, 41(1-2):100--115, 1954.
A. Garivier and E. Moulines. "On Upper-Confidence Bound Policies for Switching Bandit Problems." ALT, 2011.
R. R. Drugan and A. Nowe. "Designing multi-objective multi-armed bandits algorithms: A study." IJCNN, 2013.
L. Besson, E. Kaufmann, O.-A. Maillard, and J. Seznec. "Efficient Change-Point Detection for Tackling Piecewise-Stationary Bandits." arXiv:1902.01575, 2019.

License

Licensed under MIT or Apache-2.0 (LICENSE-MIT, LICENSE-APACHE).

muxer 0.3.11