queue-runtime 0.2.0

# Queue Runtime - System Overview


**Version**: 1.0
**Status**: Active Development
**Last Updated**: January 2026

## Purpose


The `queue-runtime` library provides a **provider-agnostic abstraction** over cloud message queue services, enabling applications (particularly GitHub bots) to process events reliably and consistently regardless of whether the underlying infrastructure uses Azure Service Bus or AWS SQS.

## Problem Statement


Services need to:

- Process webhook events reliably with at-least-once delivery guarantees
- Maintain strict ordering for related events (e.g., all events for a single pull request)
- Handle failures gracefully with retry and dead letter queue support
- Deploy flexibly across different cloud providers without code changes
- Scale to handle high-throughput event streams (1,000+ messages/second)

Current solutions require:

- Writing provider-specific code for each cloud platform
- Managing different APIs, error handling patterns, and configuration approaches
- Implementing session/ordering semantics differently per provider
- Maintaining separate codebases or complex abstraction layers

## Solution Approach


`queue-runtime` solves these problems by:

1. **Unified API**: Single `QueueClient` trait that works identically across Azure and AWS
2. **Session Abstraction**: Consistent session-based ordering regardless of native provider support
3. **Error Normalization**: Common error types and retry strategies across providers
4. **Type Safety**: Strongly-typed message handling with compile-time guarantees
5. **Runtime Selection**: Provider choice configured at runtime, not compile time
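
As a rough illustration of the unified API and runtime selection points above, the sketch below defines a single trait and picks an implementation from a runtime string. It is not the library's actual API: the trait methods, `QueueError`, `FakeProvider`, `create_client`, and `round_trip` are hypothetical names, the trait is synchronous for brevity (the real library targets tokio), and an in-memory stand-in replaces the real Azure and AWS clients.

```rust
use std::collections::VecDeque;

/// Hypothetical, simplified error type shared by every provider.
#[derive(Debug, PartialEq)]
pub enum QueueError {
    UnknownProvider(String),
}

/// Hypothetical synchronous stand-in for the `QueueClient` trait.
pub trait QueueClient {
    fn provider_name(&self) -> &'static str;
    fn send(&mut self, body: Vec<u8>) -> Result<(), QueueError>;
    fn receive(&mut self) -> Result<Option<Vec<u8>>, QueueError>;
}

/// In-memory stand-in used here instead of real Azure or AWS clients.
pub struct FakeProvider {
    name: &'static str,
    messages: VecDeque<Vec<u8>>,
}

impl QueueClient for FakeProvider {
    fn provider_name(&self) -> &'static str {
        self.name
    }

    fn send(&mut self, body: Vec<u8>) -> Result<(), QueueError> {
        self.messages.push_back(body);
        Ok(())
    }

    fn receive(&mut self) -> Result<Option<Vec<u8>>, QueueError> {
        Ok(self.messages.pop_front())
    }
}

/// Chooses a provider from a runtime value and returns a trait object,
/// so calling code never names a concrete provider type.
pub fn create_client(provider: &str) -> Result<Box<dyn QueueClient>, QueueError> {
    let name = match provider.to_ascii_lowercase().as_str() {
        "azure" => "azure-service-bus",
        "aws" => "aws-sqs",
        other => return Err(QueueError::UnknownProvider(other.to_string())),
    };
    Ok(Box::new(FakeProvider { name, messages: VecDeque::new() }))
}

/// Application logic written once against the trait.
pub fn round_trip(
    client: &mut dyn QueueClient,
    body: &[u8],
) -> Result<Option<Vec<u8>>, QueueError> {
    client.send(body.to_vec())?;
    client.receive()
}
```

Returning `Box<dyn QueueClient>` is one way to defer the provider choice to runtime; a bot would pass the configured provider name to `create_client` and use only the trait from then on.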

## System Context


```mermaid
graph TB
    subgraph "GitHub"
        WEBHOOKS[Webhook Events]
    end

    subgraph "Bot Applications"
        TT[Task Tactician<br/>PR Assignment]
        MW[Merge Warden<br/>PR Validation]
        SS[Spec Sentinel<br/>Documentation]
    end

    subgraph "queue-runtime Library"
        API[QueueClient API]
        SESSION[Session Manager]
        RETRY[Retry Engine]
    end

    subgraph "Cloud Providers"
        AZURE[Azure Service Bus]
        AWS[AWS SQS]
    end

    WEBHOOKS --> TT
    WEBHOOKS --> MW
    WEBHOOKS --> SS

    TT --> API
    MW --> API
    SS --> API

    API --> SESSION
    API --> RETRY

    API -.Azure.-> AZURE
    API -.AWS.-> AWS

    classDef external fill:#f9f9f9,stroke:#333
    classDef app fill:#e8f5e8,stroke:#4caf50
    classDef lib fill:#e3f2fd,stroke:#2196f3
    classDef cloud fill:#f3e5f5,stroke:#9c27b0

    class WEBHOOKS external
    class TT,MW,SS app
    class API,SESSION,RETRY lib
    class AZURE,AWS cloud
```

## Key Stakeholders


### Primary Users


- **Bot Developers**: Engineers building GitHub automation bots
- **DevOps Engineers**: Teams deploying bots across different cloud environments
- **Platform Engineers**: Infrastructure teams managing message queue resources

### Secondary Users


- **Security Teams**: Requiring credential management and audit trails
- **Operations Teams**: Monitoring queue health and message flow
- **Support Engineers**: Diagnosing message processing issues

## Core Capabilities


### Message Operations


- **Send**: Enqueue messages with metadata, session IDs, and scheduling
- **Receive**: Dequeue messages with timeout and batch controls
- **Complete**: Acknowledge successful processing and remove from queue
- **Abandon**: Return message to queue for reprocessing
- **Dead Letter**: Move permanently failed messages to DLQ
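
The message lifecycle described by these operations can be sketched with a small in-memory queue. Everything here is an assumption made for illustration: `InMemoryQueue`, `ReceiptHandle`, `Outcome`, `process_next`, and the `max_deliveries` threshold are hypothetical and synchronous, and real providers add timeouts, batching, and lock expiry that this sketch leaves out.

```rust
use std::collections::{HashMap, VecDeque};

#[derive(Debug, Clone, PartialEq)]
pub struct Message {
    pub body: String,
    /// Number of times this message has been handed to a consumer.
    pub delivery_count: u32,
}

/// Opaque token proving ownership of a received message.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct ReceiptHandle(u64);

#[derive(Debug, PartialEq)]
pub enum Outcome {
    Completed,
    Abandoned,
    DeadLettered,
}

#[derive(Default)]
pub struct InMemoryQueue {
    ready: VecDeque<Message>,
    in_flight: HashMap<ReceiptHandle, Message>,
    dead_letters: Vec<Message>,
    next_handle: u64,
}

impl InMemoryQueue {
    pub fn send(&mut self, body: &str) {
        self.ready.push_back(Message { body: body.to_string(), delivery_count: 0 });
    }

    /// Hands out the oldest ready message and hides it from other receivers.
    pub fn receive(&mut self) -> Option<(ReceiptHandle, Message)> {
        let mut msg = self.ready.pop_front()?;
        msg.delivery_count += 1;
        let handle = ReceiptHandle(self.next_handle);
        self.next_handle += 1;
        self.in_flight.insert(handle, msg.clone());
        Some((handle, msg))
    }

    /// Acknowledges success and removes the message permanently.
    pub fn complete(&mut self, handle: ReceiptHandle) -> bool {
        self.in_flight.remove(&handle).is_some()
    }

    /// Puts the message back at the front of the queue for another attempt.
    pub fn abandon(&mut self, handle: ReceiptHandle) -> bool {
        match self.in_flight.remove(&handle) {
            Some(msg) => {
                self.ready.push_front(msg);
                true
            }
            None => false,
        }
    }

    /// Moves the message to the dead letter store.
    pub fn dead_letter(&mut self, handle: ReceiptHandle) -> bool {
        match self.in_flight.remove(&handle) {
            Some(msg) => {
                self.dead_letters.push(msg);
                true
            }
            None => false,
        }
    }

    pub fn dead_letter_count(&self) -> usize {
        self.dead_letters.len()
    }
}

/// Receives one message, runs the handler, and settles the message:
/// complete on success, dead-letter once `max_deliveries` is reached,
/// abandon otherwise. Returns `None` when the queue is empty.
pub fn process_next<F>(queue: &mut InMemoryQueue, max_deliveries: u32, handler: F) -> Option<Outcome>
where
    F: FnOnce(&Message) -> Result<(), String>,
{
    let (handle, msg) = queue.receive()?;
    let outcome = match handler(&msg) {
        Ok(()) => {
            queue.complete(handle);
            Outcome::Completed
        }
        Err(_) if msg.delivery_count >= max_deliveries => {
            queue.dead_letter(handle);
            Outcome::DeadLettered
        }
        Err(_) => {
            queue.abandon(handle);
            Outcome::Abandoned
        }
    };
    Some(outcome)
}
```

Every received message ends in exactly one of complete, abandon, or dead letter, which is what makes at-least-once delivery workable: an unacknowledged message is retried rather than silently dropped.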

### Session Management


- **Ordered Processing**: Guarantees FIFO delivery within a session
- **Session Locking**: Prevents concurrent processing of the same session
- **Session Strategies**: Configurable algorithms for session ID generation
- **Graceful Degradation**: Emulates sessions on providers without native support
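
To make the session concepts concrete, here is a minimal sketch of a session strategy plus an emulated session lock. The names (`Event`, `SessionStrategy`, `PerPullRequest`, `SessionQueue`) and the `repository#number` ID format are assumptions for illustration, not the library's real types, and the lock is a plain in-process set rather than anything distributed.

```rust
use std::collections::{BTreeMap, HashSet, VecDeque};

/// Hypothetical, trimmed-down view of a GitHub webhook event.
pub struct Event {
    pub repository: String,
    pub pull_request: Option<u64>,
}

/// Derives a session ID from message content; `None` means "unordered".
pub trait SessionStrategy {
    fn session_id(&self, event: &Event) -> Option<String>;
}

/// Groups all events for one pull request into one session.
pub struct PerPullRequest;

impl SessionStrategy for PerPullRequest {
    fn session_id(&self, event: &Event) -> Option<String> {
        event
            .pull_request
            .map(|number| format!("{}#{}", event.repository, number))
    }
}

/// Emulated sessions: one FIFO per session ID plus a set of locked sessions.
pub struct SessionQueue {
    sessions: BTreeMap<String, VecDeque<String>>,
    locked: HashSet<String>,
}

impl SessionQueue {
    pub fn new() -> Self {
        SessionQueue { sessions: BTreeMap::new(), locked: HashSet::new() }
    }

    pub fn send(&mut self, session_id: &str, body: &str) {
        self.sessions
            .entry(session_id.to_string())
            .or_default()
            .push_back(body.to_string());
    }

    /// Locks the first unlocked session that has pending messages.
    pub fn accept_session(&mut self) -> Option<String> {
        let id = self
            .sessions
            .iter()
            .find(|(id, pending)| !pending.is_empty() && !self.locked.contains(id.as_str()))
            .map(|(id, _)| id.clone())?;
        self.locked.insert(id.clone());
        Some(id)
    }

    /// Returns the next message in FIFO order, but only for a locked session.
    pub fn receive(&mut self, session_id: &str) -> Option<String> {
        if !self.locked.contains(session_id) {
            return None;
        }
        self.sessions.get_mut(session_id)?.pop_front()
    }

    /// Releases the lock so another consumer can accept the session.
    pub fn release(&mut self, session_id: &str) {
        self.locked.remove(session_id);
    }
}
```

Because a consumer must accept a session before receiving from it, two consumers can work on different pull requests in parallel while events for any single pull request stay in order.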

### Resilience Features


- **Automatic Retry**: Exponential backoff with jitter for transient failures
- **Circuit Breaking**: Stops processing when downstream systems fail
- **Poison Message Detection**: Identifies and quarantines problematic messages
- **DLQ Integration**: Preserves failed messages for analysis and replay
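
The automatic retry behavior can be sketched as capped exponential backoff with "full jitter", where each delay is a random fraction of `min(cap, base * 2^attempt)`. `RetryPolicy`, its fields, and `retry` are hypothetical names, not the library's API. The random source and the sleep function are passed in as closures so the sketch needs no external crates; a real caller would supply a random number generator and an async sleep.

```rust
use std::time::Duration;

/// Hypothetical retry settings; field names are illustrative.
pub struct RetryPolicy {
    pub base: Duration,
    pub cap: Duration,
    pub max_attempts: u32,
}

impl RetryPolicy {
    /// Upper bound on the delay after failed attempt `attempt` (0-based):
    /// min(cap, base * 2^attempt), saturating instead of overflowing.
    pub fn ceiling(&self, attempt: u32) -> Duration {
        let factor = 2u32.checked_pow(attempt).unwrap_or(u32::MAX);
        self.base
            .checked_mul(factor)
            .unwrap_or(self.cap)
            .min(self.cap)
    }

    /// Full jitter: scales the ceiling by a random fraction in [0, 1].
    pub fn delay(&self, attempt: u32, random_fraction: f64) -> Duration {
        self.ceiling(attempt).mul_f64(random_fraction.clamp(0.0, 1.0))
    }
}

/// Runs `op` until it succeeds or `max_attempts` is exhausted, sleeping
/// for a jittered delay between attempts. `op` receives the attempt index.
pub fn retry<T, E>(
    policy: &RetryPolicy,
    mut op: impl FnMut(u32) -> Result<T, E>,
    mut sleep: impl FnMut(Duration),
    mut random: impl FnMut() -> f64,
) -> Result<T, E> {
    let mut attempt = 0;
    loop {
        match op(attempt) {
            Ok(value) => return Ok(value),
            // Out of attempts: surface the last error to the caller.
            Err(err) if attempt + 1 >= policy.max_attempts => return Err(err),
            Err(_) => {
                sleep(policy.delay(attempt, random()));
                attempt += 1;
            }
        }
    }
}
```

With `base` at 100 ms and `cap` at 1 s, the ceilings are 100 ms, 200 ms, 400 ms, 800 ms, then 1 s for every later attempt. The jitter spreads retries from many consumers so they do not hit a recovering service at the same instant.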

### Configuration


- **Runtime Provider Selection**: Choose Azure or AWS via configuration
- **Environment-Based Config**: Load settings from environment variables or files
- **Credential Management**: Support managed identities, service principals, IAM roles
- **Sensible Defaults**: Works out of the box with minimal configuration
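
A minimal sketch of environment-based configuration with defaults might look like the following. The variable names `QUEUE_PROVIDER`, `QUEUE_NAME`, and `QUEUE_MAX_RETRIES`, the `QueueConfig` fields, and the default of 3 retries are all assumptions for illustration; the library's real settings may differ.

```rust
use std::str::FromStr;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Provider {
    Azure,
    Aws,
}

impl FromStr for Provider {
    type Err = String;

    fn from_str(value: &str) -> Result<Self, Self::Err> {
        match value.to_ascii_lowercase().as_str() {
            "azure" => Ok(Provider::Azure),
            "aws" => Ok(Provider::Aws),
            other => Err(format!("unknown provider: {other}")),
        }
    }
}

/// Hypothetical configuration; only a few illustrative settings.
#[derive(Debug, Clone, PartialEq)]
pub struct QueueConfig {
    pub provider: Provider,
    pub queue_name: String,
    pub max_retries: u32,
}

impl QueueConfig {
    /// Builds the config from any key lookup, which keeps the parsing
    /// logic testable without touching the process environment.
    pub fn from_lookup(lookup: impl Fn(&str) -> Option<String>) -> Result<Self, String> {
        let provider: Provider = lookup("QUEUE_PROVIDER")
            .ok_or("QUEUE_PROVIDER is not set")?
            .parse()?;
        let queue_name = lookup("QUEUE_NAME").ok_or("QUEUE_NAME is not set")?;
        // Optional setting: fall back to a default when it is absent.
        let max_retries = match lookup("QUEUE_MAX_RETRIES") {
            Some(raw) => raw
                .parse::<u32>()
                .map_err(|err| format!("QUEUE_MAX_RETRIES: {err}"))?,
            None => 3,
        };
        Ok(QueueConfig { provider, queue_name, max_retries })
    }

    /// Reads the same keys from the process environment.
    pub fn from_env() -> Result<Self, String> {
        Self::from_lookup(|key| std::env::var(key).ok())
    }
}
```

In this sketch the queue name is required rather than defaulted, since queues must exist before the client is used; only tuning values such as the retry count fall back to defaults.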

## Non-Goals


This library explicitly does **not**:

- Provide message routing or pub/sub capabilities (use provider features directly)
- Implement message transformation or content inspection (application responsibility)
- Manage queue provisioning or infrastructure deployment (use IaC tools)
- Support local/embedded queues for production use (in-memory provider is test-only)
- Abstract all provider-specific features (only common queue operations)

## Constraints and Assumptions


### Technical Constraints


- **Rust Ecosystem**: Library targets Rust applications using the tokio async runtime
- **Cloud-Only**: Designed for Azure and AWS, not on-premises message brokers
- **Queue Model**: Assumes point-to-point messaging, not topic/subscription patterns
- **Session Semantics**: Ordering limited to single-session scope, not global

### Operational Assumptions


- **Network Reliability**: Assumes stable network connectivity to cloud services
- **Credential Availability**: Assumes valid credentials configured before use
- **Queue Pre-existence**: Queues must be created before client usage (no auto-creation)
- **Single Region**: Does not handle cross-region replication or failover

### Performance Assumptions


- **Message Size**: Optimized for messages under 256 KB, which covers typical GitHub webhook payloads
- **Throughput**: Designed for 100-10,000 messages/second per queue
- **Latency**: Targets p95 latency under 200ms for send/receive operations
- **Concurrency**: Supports 10-100 concurrent consumers per queue

## Glossary


### Queue Terminology


**Queue**: A named message buffer that provides reliable, asynchronous delivery between producers and consumers.

**Message**: A unit of data sent through a queue, containing a body (bytes) and metadata (properties).

**Session**: A logical grouping of related messages that must be processed in FIFO order.

**Receipt Handle**: An opaque token that proves ownership of a received message and enables acknowledgment operations.

**Dead Letter Queue (DLQ)**: A special queue for messages that cannot be processed successfully after retry attempts.

### Provider Terminology


**Azure Service Bus**: Microsoft's fully managed enterprise message broker with native session support.

**AWS SQS**: Amazon's fully managed message queuing service with FIFO queue support.

**Message Group**: The AWS SQS FIFO queue concept (set through `MessageGroupId`) that corresponds to a session and is used for ordering guarantees.

**Visibility Timeout**: Duration a received message is hidden from other consumers (AWS SQS term).

**Lock Duration**: Duration a consumer holds exclusive access to a message (Azure Service Bus term).

### Domain Terminology


**Bot**: An automated application that responds to GitHub webhook events.

**Event**: A GitHub webhook payload representing an action in a repository.

**Envelope**: A standardized message wrapper containing event metadata and routing information.

**Provider**: A cloud platform implementation of the queue abstraction (Azure or AWS).

**Session Strategy**: An algorithm for generating session IDs from message content.

## Architecture References


This overview provides context for understanding the detailed architecture:

- [Architecture](./architecture.md) - Logical boundaries and dependencies
- [Vocabulary](./vocabulary.md) - Domain concepts and definitions
- [Responsibilities](./responsibilities.md) - Component roles and collaborations
- [Constraints](./constraints.md) - Implementation rules and boundaries
- [Assertions](./assertions.md) - Behavioral specifications

## Success Criteria


The library succeeds when:

1. **Portability**: Bots can switch between Azure and AWS with only configuration changes
2. **Reliability**: Messages are never lost and are processed at least once, with proper error handling
3. **Ordering**: Session-based messages are processed in strict FIFO order within each session
4. **Performance**: Meets throughput (1,000 messages/second) and latency (p95 < 200ms) targets
5. **Maintainability**: Adding new providers requires minimal changes to existing code
6. **Usability**: Developers can integrate the library in under 30 minutes