atomr-infer-remote-core 0.8.0

Remote-provider infrastructure for atomr-infer — distributed rate limiter (CRDT), circuit breaker, retry/backoff, SSE parser, session lifecycle.

atomr-infer-remote-core

Shared infrastructure for every remote-network runtime. The seam where HTTP/2 connection pooling meets the actor system; where rate limits become a cluster-wide CRDT; where 5xx storms get isolated behind a circuit breaker.

What's in here

| Component | What it does |
| --- | --- |
| `RemoteEngineCoreActor` | Per-replica HTTP orchestrator: bounded priority queue + worker pool. |
| `RemoteWorkerActor` | One per concurrent slot; pulls from the queue, retries with backoff, emits `TokenChunk`s. |
| `RemoteSessionActor` | Credential lifecycle — the network-world analog of `ContextActor`. |
| `RateLimiterActor` | Approximate distributed token bucket via `atomr_distributed_data::GCounter`. |
| `StrictRateLimiterActor` | Cluster-singleton variant for premium API keys with hard caps. |
| `CircuitBreakerActor` + `CircuitBreakerHandle` | State machine (Closed → Open → Half-open) per (provider, endpoint). |
| `RetryEngine` | Per-call retry decisions with `Retry-After` parsing and jitter. |
| `decode_sse_stream` | Provider-agnostic SSE byte-stream → `SseChunk` decoder. |
| `classify_http_status` | Status code + `Retry-After` → typed `InferenceError`. |
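The last helper's mapping is mechanical enough to sketch. Below is a simplified stand-in with a hypothetical three-variant error enum; the real `InferenceError` carries more variants, and a production parser would also accept HTTP-date `Retry-After` values, not just seconds:

```rust
use std::time::Duration;

/// Hypothetical, stripped-down stand-in for the crate's typed error.
#[derive(Debug, PartialEq)]
enum InferenceError {
    RateLimited { retry_after: Option<Duration> },
    ProviderUnavailable { status: u16 },
    InvalidRequest { status: u16 },
}

/// Map an HTTP status plus an optional Retry-After header value
/// (delta-seconds form only) onto a typed error. None means success.
fn classify_http_status(status: u16, retry_after: Option<&str>) -> Option<InferenceError> {
    match status {
        200..=299 => None,
        // 429 is retryable, and the header tells us when.
        429 => Some(InferenceError::RateLimited {
            retry_after: retry_after
                .and_then(|s| s.trim().parse::<u64>().ok())
                .map(Duration::from_secs),
        }),
        // 5xx counts toward the circuit breaker; 4xx (other than 429) does not.
        500..=599 => Some(InferenceError::ProviderUnavailable { status }),
        _ => Some(InferenceError::InvalidRequest { status }),
    }
}
```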

Why all this lives in one crate

Every remote provider — OpenAI, Anthropic, Gemini, LiteLLM, Bedrock, Cohere, an internal LLM gateway — needs the same scaffolding: HTTP/2 client, rate limiting, retry-with-backoff, circuit breaker, SSE parsing, credential refresh. Dropping that scaffolding into one shared crate means each per-provider crate ships as a thin ModelRunner impl plus wire types, typically under 300 LOC.
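One representative piece of that scaffolding is the backoff policy. The sketch below shows capped exponential backoff with full jitter, the usual shape of such a policy; the randomness is injected as a parameter so it stays deterministic under test, and this is an illustration, not the crate's actual `RetryEngine` API:

```rust
use std::time::Duration;

/// Capped exponential backoff with full jitter: the delay for attempt `n`
/// is drawn uniformly from [0, min(cap, base * 2^n)]. `rand01` is a value
/// in [0, 1] supplied by the caller (e.g. from a thread-local RNG).
fn backoff_delay(attempt: u32, base_ms: u64, cap_ms: u64, rand01: f64) -> Duration {
    // Saturating math keeps large attempt numbers from overflowing.
    let exp = base_ms.saturating_mul(2u64.saturating_pow(attempt));
    let ceiling = exp.min(cap_ms);
    Duration::from_millis((ceiling as f64 * rand01) as u64)
}
```

A `Retry-After` header, when present, would normally override the computed delay.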

Distributed rate limiting in 30 seconds

```rust
use inference_remote_core::{RateLimiterActor, AcquirePermit};
use inference_core::deployment::RateLimits;

let mut rl = RateLimiterActor::new(
    "node-a",
    RateLimits {
        requests_per_minute: Some(10_000),
        tokens_per_minute: Some(10_000_000),
        ..Default::default()
    },
);
```

RateLimiterActor keeps a per-node GCounter of tokens spent in the current window. The replicator (when wired up via atomr-cluster) syncs deltas across nodes so the cluster collectively respects the provider's RPM/TPM budget. For premium API keys with hard caps, swap in StrictRateLimiterActor and register it as a cluster singleton — exact accounting at the cost of an extra mailbox hop per request.
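The mechanism is easy to picture with a toy model. The sketch below implements a grow-only counter with per-node max-merge (the defining `GCounter` operation) plus an approximate budget check; the real actor additionally handles window rollover, request counts, and delta replication, and all names here are illustrative stand-ins:

```rust
use std::collections::HashMap;

/// Toy stand-in for atomr_distributed_data::GCounter: one monotonically
/// increasing count per node id.
#[derive(Default, Clone)]
struct GCounter {
    counts: HashMap<String, u64>,
}

impl GCounter {
    fn increment(&mut self, node: &str, n: u64) {
        *self.counts.entry(node.to_string()).or_insert(0) += n;
    }

    /// CRDT merge: take the per-node max, so applying the same delta twice
    /// (or out of order) converges to the same state.
    fn merge(&mut self, other: &GCounter) {
        for (node, &n) in &other.counts {
            let e = self.counts.entry(node.clone()).or_insert(0);
            *e = (*e).max(n);
        }
    }

    fn total(&self) -> u64 {
        self.counts.values().sum()
    }
}

/// Approximate cluster-wide permit check: grant if the merged spend in the
/// current window still leaves room for this request's tokens.
fn try_acquire(spent: &GCounter, budget: u64, tokens: u64) -> bool {
    spent.total() + tokens <= budget
}
```

Because merges are max-based and commutative, replicas can gossip deltas in any order; the price is that the view is eventually consistent, which is exactly why the strict variant exists for hard-capped keys.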

Circuit breakers that don't lie

```rust
use inference_remote_core::CircuitBreakerHandle;
use inference_core::runtime::{CircuitBreakerConfig, ProviderKind};

let breaker = CircuitBreakerHandle::new(ProviderKind::OpenAi, CircuitBreakerConfig::default());
breaker.run(|| async { /* HTTP call */ Ok(()) }).await?;
```

When sustained failures (5xx, network errors, timeouts) trip the threshold, the breaker opens and breaker.check() fails fast with InferenceError::CircuitOpen { provider, opened_at_unix_ms, retry_at_unix_ms }. Upstream actors then decide: fall back to a different deployment, surface a 429 to the caller, or queue. 429s and content-filter refusals deliberately don't count toward the circuit; those are handled by the rate limiter and the per-provider error classifier.
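The Closed → Open → Half-open cycle itself is small enough to sketch. The toy state machine below follows the transition rules just described (a run of failures opens the circuit, a cooldown admits one half-open probe, and a successful probe closes it again); it is illustrative only, not the actor's real implementation:

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum BreakerState {
    Closed { failures: u32 },
    Open { opened_at_ms: u64 },
    HalfOpen,
}

struct Breaker {
    state: BreakerState,
    failure_threshold: u32,
    cooldown_ms: u64,
}

impl Breaker {
    fn new(failure_threshold: u32, cooldown_ms: u64) -> Self {
        Self { state: BreakerState::Closed { failures: 0 }, failure_threshold, cooldown_ms }
    }

    /// Is a call allowed right now? Open flips to Half-open once the
    /// cooldown has elapsed, letting a single probe through.
    fn check(&mut self, now_ms: u64) -> bool {
        match self.state {
            BreakerState::Closed { .. } | BreakerState::HalfOpen => true,
            BreakerState::Open { opened_at_ms } => {
                if now_ms >= opened_at_ms + self.cooldown_ms {
                    self.state = BreakerState::HalfOpen;
                    true
                } else {
                    false
                }
            }
        }
    }

    /// Record a 5xx/timeout-class failure. (429s and content-filter
    /// refusals would never be reported here.)
    fn on_failure(&mut self, now_ms: u64) {
        self.state = match self.state {
            BreakerState::Closed { failures } if failures + 1 >= self.failure_threshold => {
                BreakerState::Open { opened_at_ms: now_ms }
            }
            BreakerState::Closed { failures } => BreakerState::Closed { failures: failures + 1 },
            // A failed half-open probe (or a failure while open) re-opens.
            _ => BreakerState::Open { opened_at_ms: now_ms },
        };
    }

    fn on_success(&mut self) {
        self.state = BreakerState::Closed { failures: 0 };
    }
}
```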

Building a new provider on top

```rust
// crates/atomr-infer-runtime-bedrock/src/runner.rs (sketch)
use inference_core::runner::ModelRunner;
use inference_remote_core::sse::decode_sse_stream;
use inference_remote_core::session::SessionSnapshot;

pub struct BedrockRunner { /* SessionSnapshot + endpoint + region */ }

#[async_trait::async_trait]
impl ModelRunner for BedrockRunner {
    async fn execute(&mut self, batch: ExecuteBatch) -> InferenceResult<RunHandle> {
        // 1. Build the provider-specific request body from `batch`.
        // 2. POST it via the session's reqwest::Client.
        // 3. Wrap response.bytes_stream() with decode_sse_stream.
        // 4. Convert per-chunk JSON into TokenChunk.
        todo!()
    }
    /* ... */
}
```

The provider crate carries its own wire types and pricing tables; everything else (rate limiting, circuit breaker, retry, session) is reused from this crate.
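Step 3 of that sketch leans on SSE framing, which is worth seeing concretely. The toy decoder below splits a buffered chunk on blank lines, collects the `data:` fields of each event, and stops at the conventional `[DONE]` sentinel; the real `decode_sse_stream` is incremental and stream-based (built on eventsource-stream) rather than working on a complete buffer:

```rust
/// Minimal SSE framing over a fully buffered chunk: events are separated
/// by blank lines; each event's data is the concatenation of its `data:`
/// field values, rejoined with newlines per the SSE spec.
fn decode_sse(buf: &str) -> Vec<String> {
    let mut out = Vec::new();
    for event in buf.split("\n\n") {
        let data: Vec<&str> = event
            .lines()
            .filter_map(|l| l.strip_prefix("data:").map(str::trim_start))
            .collect();
        if data.is_empty() {
            continue; // comment-only or empty event
        }
        let payload = data.join("\n");
        if payload == "[DONE]" {
            break; // OpenAI-style end-of-stream sentinel
        }
        out.push(payload);
    }
    out
}
```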

Dependencies

Pure HTTP / async — no GPU, no Python:

  • reqwest (rustls + http2 + json + stream) for HTTP/2
  • eventsource-stream for SSE parsing
  • tower for middleware composition
  • atomr-core + atomr-distributed-data for actor + CRDT

Importantly, atomr-accel is not a dependency. This crate is the linchpin of the remote-only invariant: a build with --features remote-only reaches atomr-infer-remote-core and stops — no cudarc, no pyo3, no GPU code anywhere in the dep graph.