rustmc

Bayesian inference engine written in Rust. Python API via PyO3.

rustmc runs the entire sampling loop in compiled Rust, with no Python in the inner loop. Chains are parallelized across threads using Rayon. The result is fast enough to fit thousands of independent Bayesian models in a single call.

Why rustmc

PyMC, Stan, and other Bayesian frameworks are built for single-model workflows. You define one model, fit it, and analyze it. This works well for research but falls apart when you need to fit the same model structure to thousands of datasets -- per-store demand models, per-SKU pricing models, per-patient dosing models.

rustmc is designed for that use case. It provides a batch inference API that runs 10,000 independent NUTS chains through a single Rayon thread pool, sharing compute across all available cores with zero serialization overhead.

10,000 Bayesian demand models in 70 seconds, with full posterior uncertainty.

Fitting those same 10,000 models sequentially with ARIMA takes ~160 seconds. With Prophet, ~28 minutes. Neither gives you credible intervals for free.

Benchmark

10 parameters, 100,000 observations, 8 chains, 2,000 draws:

Method	Time	Speedup
rustmc (NUTS)	72s	5.3x
PyMC (NUTS)	383s	1.0x

Batch inference, 10,000 independent 3-parameter models:

Method	Total time	Per model	Uncertainty
rustmc (batch NUTS)	70s	7ms	Yes (full posterior)
ARIMA (sequential)	160s	16ms	No
Prophet (sequential)	28min	170ms	Partial

Quick start

pip install maturin
git clone https://github.com/tbosier/rustmc.git
cd rustmc
python -m venv .venv && source .venv/bin/activate
pip install numpy maturin
maturin develop --manifest-path python_bindings/Cargo.toml --release

or if you prefer, I have made this publically downloadable via pip

pip install rustmc

Single model

import numpy as np
import rustmc as rmc

np.random.seed(42)
x = np.random.randn(1000)
y = 2.5 * x + np.random.randn(1000)

builder = rmc.ModelBuilder()
beta = builder.normal_prior("beta", mu=0.0, sigma=1.0)
mu_expr = beta * "x"
builder.normal_likelihood("obs", mu_expr=mu_expr, sigma=1.0, observed_key="y")
model = builder.build()

fit = rmc.sample(model_spec=model, data={"x": x, "y": y}, chains=4, draws=1000)
print(fit.summary())

Output:

4 chains x 1000 draws per chain

Parameter        mean      std     hdi_3%    hdi_97%   ess_bulk   ess_tail    r_hat  mcse_mean
-----------------------------------------------------------------------------------------------
beta           2.4575   0.0313     2.3982     2.5133       2638       2966   1.0055   0.000610
-----------------------------------------------------------------------------------------------
Mean accept rate: 0.94  |  Divergences: 0

Batch inference (10,000 models)

import rustmc as rmc
import numpy as np

models = []
for i in range(10_000):
    builder = rmc.ModelBuilder()
    intercept = builder.normal_prior("intercept", mu=0.0, sigma=200.0)
    trend = builder.normal_prior("trend", mu=0.0, sigma=20.0)
    mu_expr = intercept + trend * "t"
    builder.normal_likelihood("obs", mu_expr=mu_expr, sigma=5.0, observed_key="y")
    model = builder.build()

    t = np.arange(52, dtype=np.float64) / 52
    y = some_data[i]  # your per-SKU time series
    models.append((model, {"t": t, "y": y}))

results = rmc.batch_sample(models, draws=500, warmup=300)

# Each result has .mean(), .std(), .get_samples()
for r in results[:5]:
    print(r)

Vector parameter model (high-dimensional regression)

For models where the parameter count is large — e.g. a regression with thousands of features — use normal_prior with @ to dispatch X @ beta via faer. rustmc automatically detects that beta is used in a matrix multiply, infers the number of parameters from the matrix dimensions, and promotes it to a contiguous vector parameter block:

import numpy as np
import rustmc as rmc

N, P = 10_000, 500
X = np.random.randn(N, P)           # 2-D array → stored as faer matrix
beta_true = np.random.randn(P)
y = X @ beta_true + np.random.randn(N)

builder = rmc.ModelBuilder()
beta = builder.normal_prior("beta", mu=0.0, sigma=1.0)
mu_expr = beta @ "X"                # auto-promoted to faer GEMV
builder.normal_likelihood("obs", mu_expr=mu_expr, sigma=1.0, observed_key="y")
model = builder.build()

fit = rmc.sample(model_spec=model, data={"X": X, "y": y}, chains=4, draws=500)
print(fit.summary())

Instead of 500 separate scalar graph nodes (one per coefficient), rustmc allocates a single MatVecMul op backed by faer. The entire X @ beta forward pass and its gradient are computed with a single BLAS-level call, giving cache-efficient performance regardless of how many parameters are in the vector.

For explicit control over the vector size, vector_normal_prior("beta", n=P) is also available.

What is implemented

Sampling

NUTS (No-U-Turn Sampler) with multinomial candidate selection, generalized U-turn criterion, and divergence detection. Follows Hoffman and Gelman (2014) and Betancourt (2017).
HMC with fixed leapfrog steps, available as a fallback via sampler="hmc".
Diagonal mass matrix adaptation with 3-phase warmup (step-size only, mass matrix estimation, final step-size tuning).
Auto step-size initialization via binary search.
Deterministic per-chain RNG (ChaCha8) for reproducible results.
Multithreaded chains via Rayon. Batch inference shares the thread pool across all models.

Distributions

Distribution	Support	Transform	Status
Normal	(-inf, inf)	None	Working
StudentT	(-inf, inf)	None	Working
HalfNormal	(0, inf)	log	Working
Gamma	(0, inf)	log	Working
Beta	(0, 1)	logit	Working
Uniform	(a, b)	logit	Working
Bernoulli	{0, 1}	None	Discrete, limited
Poisson	{0, 1, 2, ...}	None	Discrete, limited

Constrained distributions are automatically sampled in unconstrained space via log/logit transforms with Jacobian corrections. Samples are back-transformed before being returned to the user.

Computation

Computational graph with reverse-mode automatic differentiation.
Fused linear combination op for regression models. Replaces N separate multiply-add passes with a single cache-friendly loop over the data.
Zero-allocation evaluator. All vector intermediates are pre-allocated in a flat buffer and reused across gradient evaluations. No heap allocation in the sampling loop.
faer-backed matrix-vector multiply (MatVecMul). When a normal_prior parameter is used with @ (e.g. beta @ "X"), rustmc automatically promotes it to a contiguous vector parameter block and dispatches the multiply to faer's GEMV routine. This replaces thousands of individual scalar multiply-add graph ops with a single BLAS-level call. Rayon threads are used for matrices above 100K elements. Explicit vector_normal_prior is also available for manual control.
Vectorized Normal prior (VectorNormalLogP). A single graph op evaluates the log-probability of an entire parameter vector under Normal(mu, sigma), replacing one graph node per parameter with a single tight loop. Gradients for all vector parameters accumulate directly into the gradient buffer in one backward pass.
2-D NumPy arrays in the data dict are automatically detected and stored as row-major matrices for use with MatVecMul.

Diagnostics

Split R-hat with rank normalization (Vehtari et al. 2021).
Bulk and tail effective sample size (ESS).
Monte Carlo standard error (MCSE).
94% highest density interval.
Per-chain acceptance rates, step sizes, and divergence counts.
Automatic warnings for convergence issues.

Available via fit.summary() for a formatted table or fit.diagnostics() for programmatic access.

Progress reporting

Live progress bar rendered from Rust at 10 Hz using atomic counters, with no GIL involvement:

Sampling 8 chains ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% | 24.0k/24.0k | 0 divergences | 384.0k grad evals | 6.7s

Architecture

Python (orchestration only)
  |
  v  GIL released
Rust Core
  +-- Graph         Computational DAG, nodes, ops, data + matrix storage
  +-- Autodiff      Forward evaluation + reverse-mode gradient
  +-- Distributions  8 distributions with automatic transforms
  +-- NUTS          Multinomial tree-building, U-turn detection
  +-- HMC           Fixed-step leapfrog (fallback)
  +-- Sampler       Multi-chain parallel runner, batch inference
  +-- Diagnostics   R-hat, ESS, MCSE, HDI
  +-- Progress      Atomic counters, background render thread
  +-- faer          BLAS-level MatVecMul for high-dimensional parameter vectors

Design principles:

Model graph is built once and shared read-only across chains.
Sampler accepts any log-probability + gradient function derived from a Graph.
No global state. All state is explicit and owned.
Deterministic RNG per chain (ChaCha8 seeded from base_seed + chain_index).
Parameter transforms and Jacobian corrections are handled in the graph, not the sampler.

Data structures (Rust vs JAX)

The hot path uses plain Rust types only: the graph is Vec<Node> and Vec<Op>, parameters and gradients are Vec<f64>, and the autodiff evaluator uses contiguous vec_buf / adj_vec_buf (flat Vec<f64>) for all vector intermediates. For high-dimensional parameter vectors, data matrices are stored row-major as Vec<f64> inside the graph and handed to faer's matmul kernel as zero-copy views. ndarray appears only in the Python bindings for converting incoming 2-D NumPy arrays; it is not present in the inner loop. Benefits of this layout:

Cache-friendly: One pass over the graph touches sequential memory; vector slots are in a single allocation.
Zero allocation in the loop: Buffers are allocated once per chain and reused for every gradient evaluation.
No Python or FFI in the inner loop: The entire NUTS/HMC step runs in Rust; Python is only used to build the model and consume results.
Fixed graph traversal: The same DAG is walked every time; there is no tracing or recompilation per model or per step.
BLAS-level throughput for large parameter vectors: MatVecMul calls faer's GEMV, which uses SIMD intrinsics and can optionally spawn Rayon threads for matrices above 100K elements. A 5,000-parameter vector prior that previously required 5,000 individual scalar multiply-add nodes in the graph is now a single op.

JAX, by contrast, traces Python and compiles to XLA. That gives flexibility and GPU support but adds per-model compilation and dispatch overhead. For many small, independent models (e.g. 10,000 SKUs), rustmc's "compile once, run fixed graph over contiguous buffers" approach often wins on CPU because there is no per-model JAX trace/compile and no Python in the inner loop. Nutpie (JAX-based) is faster than default PyMC for a single model; the batch example compares rustmc's batch NUTS against PyMC+nutpie run in a loop over the same number of models.

Roadmap

Near term:

Hierarchical priors (parameter as hyperparameter of another parameter's prior)
Link functions and GLMs
Custom likelihood functions
Prior and posterior predictive sampling
LOO-CV (Pareto-smoothed importance sampling)
Trace plots and visual diagnostics

Medium term:

MAP estimation (L-BFGS)
Laplace approximation
Sparse indicator variable support
Stochastic gradient MCMC (SGLD/SGHMC) for large datasets
Model serialization (compile once, deploy without Python)

Long term:

Variational inference (ADVI)
GPU-accelerated log-probability via wgpu
WASM compilation for browser/edge inference
Distributed posterior aggregation
Automatic reparameterization for funnel geometries
C FFI for embedding in non-Python systems

License

MIT

rustmc_core 0.6.0