queue-runtime 0.2.0

# Queue Runtime - Edge Cases and Failure Modes


## Overview


This document catalogs non-standard flows, edge cases, and failure scenarios that queue-runtime must handle correctly. Each case gives a scenario, the expected behavior, implementation considerations, and a test strategy.

---

## Message Processing Edge Cases


### Edge Case 1: Empty Queue with Timeout


**Scenario**: Application calls `receive_message()` on an empty queue with a timeout.

**Expected Behavior**:

- Operation blocks for specified timeout duration
- Returns `Ok(None)` when timeout expires (not an error)
- Is cancellable via `CancellationToken` before timeout expires
- Does not consume excessive resources while waiting

**Implementation Considerations**:

- Azure: Uses `receive_messages()` with timeout parameter
- AWS: Uses long polling with `WaitTimeSeconds`
- In-memory: Uses condition variable with timeout

**Test Strategy**:

- Verify timeout accuracy (±100ms tolerance)
- Verify cancellation works mid-timeout
- Verify no memory leaks during long waits
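
**Illustrative Sketch**:

The in-memory behavior described above can be sketched with a standard-library condition variable. This is a minimal sketch, not the queue-runtime implementation: `InMemoryQueue`, `send`, and `receive` are hypothetical names, the error type is a plain `String`, and cancellation via `CancellationToken` is left out for brevity.

```rust
use std::collections::VecDeque;
use std::sync::{Condvar, Mutex};
use std::time::{Duration, Instant};

/// Hypothetical in-memory queue used only to illustrate timeout semantics.
pub struct InMemoryQueue<T> {
    items: Mutex<VecDeque<T>>,
    not_empty: Condvar,
}

impl<T> InMemoryQueue<T> {
    pub fn new() -> Self {
        Self {
            items: Mutex::new(VecDeque::new()),
            not_empty: Condvar::new(),
        }
    }

    pub fn send(&self, item: T) {
        self.items.lock().unwrap().push_back(item);
        self.not_empty.notify_one();
    }

    /// Blocks for at most `timeout`. An empty queue yields `Ok(None)`, not an error.
    pub fn receive(&self, timeout: Duration) -> Result<Option<T>, String> {
        let deadline = Instant::now() + timeout;
        let mut guard = self.items.lock().map_err(|e| e.to_string())?;
        loop {
            if let Some(item) = guard.pop_front() {
                return Ok(Some(item));
            }
            let remaining = deadline.saturating_duration_since(Instant::now());
            if remaining.is_zero() {
                return Ok(None);
            }
            // The thread sleeps here, so waiting does not spin the CPU.
            // Spurious wakeups are handled by looping and re-checking the deadline.
            let (next_guard, _timed_out) = self
                .not_empty
                .wait_timeout(guard, remaining)
                .map_err(|e| e.to_string())?;
            guard = next_guard;
        }
    }
}
```

Recomputing the remaining time on every loop iteration keeps the total wait bounded by the original timeout even when the condition variable wakes early.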

---

### Edge Case 2: Message Received Multiple Times


**Scenario**: Message delivered, consumer crashes before completing, message redelivered to different consumer.

**Expected Behavior**:

- `delivery_count` increments on each delivery
- Each consumer receives valid receipt handle
- Only one consumer can complete the message
- Other receipt handles become invalid after completion

**Implementation Considerations**:

- Idempotent message processing in applications recommended
- Library tracks delivery count accurately
- Completion with invalid receipt returns specific error

**Test Strategy**:

- Simulate consumer crash scenarios
- Verify delivery count increments correctly
- Verify only first completion succeeds
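
**Illustrative Sketch**:

The relationship between redelivery, `delivery_count`, and receipt handles can be modeled in a few lines. This is a simplified model under stated assumptions, not the library's data structures: `ModelQueue`, `Receipt`, and `deliver` are hypothetical, and the model assumes that a redelivery supersedes the earlier receipt, which may differ from what a given provider does.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
pub enum QueueError {
    InvalidReceipt,
}

/// Hypothetical receipt handle: unique per delivery, not per message.
#[derive(Debug, Clone, PartialEq)]
pub struct Receipt {
    message_id: u64,
    token: u64,
}

struct Entry {
    delivery_count: u32,
    /// Token of the delivery that currently holds the lock, if any.
    current_token: Option<u64>,
}

#[derive(Default)]
pub struct ModelQueue {
    entries: HashMap<u64, Entry>,
    next_token: u64,
}

impl ModelQueue {
    pub fn send(&mut self, message_id: u64) {
        self.entries
            .insert(message_id, Entry { delivery_count: 0, current_token: None });
    }

    /// Models a delivery or redelivery: bumps the count and issues a fresh receipt.
    pub fn deliver(&mut self, message_id: u64) -> Option<(Receipt, u32)> {
        let entry = self.entries.get_mut(&message_id)?;
        entry.delivery_count += 1;
        self.next_token += 1;
        entry.current_token = Some(self.next_token);
        let receipt = Receipt { message_id, token: self.next_token };
        Some((receipt, entry.delivery_count))
    }

    /// Succeeds only for the receipt of the most recent delivery.
    pub fn complete(&mut self, receipt: &Receipt) -> Result<(), QueueError> {
        let valid = self
            .entries
            .get(&receipt.message_id)
            .map_or(false, |e| e.current_token == Some(receipt.token));
        if !valid {
            return Err(QueueError::InvalidReceipt);
        }
        self.entries.remove(&receipt.message_id);
        Ok(())
    }
}
```

A test built on this model would deliver the same message twice, complete it with the second receipt, and assert that the first receipt and any repeated completion return `InvalidReceipt`.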

---

### Edge Case 3: Message Lock Expires During Processing


**Scenario**: Consumer receives message, processing takes longer than lock duration, lock expires before completion attempted.

**Expected Behavior**:

- Message becomes visible to other consumers after lock expires
- Original consumer's `complete_message()` returns `InvalidReceipt` error
- Lock expiration alone does not increment `delivery_count`; the count increments only when the message is next delivered
- Application should handle `InvalidReceipt` gracefully

**Implementation Considerations**:

- Applications should monitor processing time relative to lock duration
- Consider extending lock if long processing expected (future feature)
- Set lock duration appropriately for workload

**Test Strategy**:

- Introduce artificial delays in processing
- Verify lock expiration detected correctly
- Verify message redelivery after lock expiration
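
**Illustrative Sketch**:

One way for an application to monitor processing time relative to lock duration is to track a local budget from the moment the message was received. The helper below is a hypothetical sketch (`LockBudget`, `remaining`, and `should_stop` are not part of queue-runtime), and the local clock only approximates the provider's view of the lock, so a safety margin is assumed.

```rust
use std::time::{Duration, Instant};

/// Hypothetical helper that estimates how much of a message lock is left.
pub struct LockBudget {
    received_at: Instant,
    lock_duration: Duration,
    safety_margin: Duration,
}

impl LockBudget {
    pub fn new(received_at: Instant, lock_duration: Duration, safety_margin: Duration) -> Self {
        Self { received_at, lock_duration, safety_margin }
    }

    /// Estimated time left before the lock expires, as seen at `now`.
    pub fn remaining(&self, now: Instant) -> Duration {
        let expires_at = self.received_at + self.lock_duration;
        expires_at.saturating_duration_since(now)
    }

    /// True once the remaining time is within the safety margin, meaning
    /// a later `complete_message()` is likely to fail with `InvalidReceipt`.
    pub fn should_stop(&self, now: Instant) -> bool {
        self.remaining(now) <= self.safety_margin
    }
}
```

Passing `now` explicitly keeps the helper deterministic in tests; production code would pass `Instant::now()` at each checkpoint in a long handler.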

---

### Edge Case 4: Duplicate Message IDs


**Scenario**: Two messages sent with identical message IDs (possible with different sessions or timing).

**Expected Behavior**:

- Both messages accepted and delivered independently
- Each has unique receipt handle
- Applications responsible for deduplication if needed
- Library does not enforce message ID uniqueness

**Implementation Considerations**:

- Message IDs generated by sender (application or library)
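- Provider behavior varies: Azure Service Bus with duplicate detection enabled silently drops a repeated `MessageId` within the detection window; AWS SQS standard queues accept duplicates, while FIFO queues deduplicate on `MessageDeduplicationId`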
- Applications needing strict deduplication should track processed IDs

**Test Strategy**:

- Send messages with duplicate IDs
- Verify both delivered (or appropriate error returned)
- Document provider-specific behavior differences
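
**Illustrative Sketch**:

Applications that need strict deduplication can track processed IDs themselves. The sketch below keeps a bounded window of recent IDs; `RecentIds` and `first_time` are hypothetical names, and a production system would more likely use a durable store than process memory.

```rust
use std::collections::{HashSet, VecDeque};

/// Hypothetical application-side deduplicator that remembers the most
/// recent `capacity` message IDs.
pub struct RecentIds {
    capacity: usize,
    seen: HashSet<String>,
    order: VecDeque<String>,
}

impl RecentIds {
    pub fn new(capacity: usize) -> Self {
        Self {
            capacity: capacity.max(1), // a zero-sized window would never evict
            seen: HashSet::new(),
            order: VecDeque::new(),
        }
    }

    /// Returns true the first time an ID is observed, and false for a
    /// duplicate that is still inside the window.
    pub fn first_time(&mut self, id: &str) -> bool {
        if self.seen.contains(id) {
            return false;
        }
        if self.order.len() == self.capacity {
            // Evict the oldest ID to keep memory bounded.
            if let Some(oldest) = self.order.pop_front() {
                self.seen.remove(&oldest);
            }
        }
        self.seen.insert(id.to_owned());
        self.order.push_back(id.to_owned());
        true
    }
}
```

The bounded window trades completeness for memory: an ID that has been evicted is treated as new again, so the window should be sized to cover the longest expected redelivery gap.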

---

### Edge Case 5: Maximum Message Size Exceeded


**Scenario**: Application attempts to send message larger than provider limit.

**Expected Behavior**:

- `send_message()` returns `Err(QueueError::MessageTooLarge)`
- Error includes actual size and maximum allowed size
- No message sent to queue
- No partial sends or truncation

**Implementation Considerations**:

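- Azure: 256 KB limit (Standard); Premium defaults to 1 MB and can be configured for larger messages
- AWS: 256 KB limit historically; confirm the current SQS quota, since provider limits change over time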
- Library may pre-validate size before sending
- Consider compression for large payloads

**Test Strategy**:

- Send messages at and above size limits
- Verify appropriate error returned
- Verify no message fragments in queue
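
**Illustrative Sketch**:

Pre-validation before sending could look like the following. The error variant name comes from this document, but its fields (`size`, `max_size`) and the `validate_size` function are assumptions made for illustration. Providers typically count headers or attributes toward the limit as well, so a body-only check like this one is approximate.

```rust
#[derive(Debug, PartialEq)]
pub enum QueueError {
    /// Field names are hypothetical; the document only requires that the
    /// error carries the actual size and the maximum allowed size.
    MessageTooLarge { size: usize, max_size: usize },
}

/// Rejects an oversized body before any provider call is made, so nothing
/// is partially sent or truncated.
pub fn validate_size(body: &[u8], max_size: usize) -> Result<(), QueueError> {
    if body.len() > max_size {
        Err(QueueError::MessageTooLarge { size: body.len(), max_size })
    } else {
        Ok(())
    }
}
```

A boundary test would check a body of exactly `max_size` bytes (accepted) and one of `max_size + 1` bytes (rejected with both sizes reported).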

---

## Session Processing Edge Cases


### Edge Case 6: Session Lock Stolen by Another Consumer


**Scenario**: Consumer A holds session lock, lock expires, Consumer B acquires lock, Consumer A attempts operation.

**Expected Behavior**:

- Consumer A's operation returns `SessionLockLost` error
- Consumer B's operations succeed normally
- Messages not duplicated or lost
- Session state remains consistent

**Implementation Considerations**:

- Session lock has timeout (configurable, default 5 minutes)
- Lock not automatically renewed (future feature consideration)
- Applications must handle lock loss gracefully

**Test Strategy**:

- Simulate slow processing with lock timeout
- Verify lock transfer detected correctly
- Verify no message loss during transfer

---

### Edge Case 7: Session with No Messages


**Scenario**: Consumer calls `accept_session()` on queue with no messages in any session.

**Expected Behavior**:

- Operation times out based on configured timeout
- Returns `Ok(None)` indicating no session available
- Does not block indefinitely
- No error logged (this is normal)

**Implementation Considerations**:

- Azure: Returns no session after timeout
- AWS: May need to poll FIFO queues
- Applications should handle gracefully and retry

**Test Strategy**:

- Accept session on empty queue
- Verify timeout behavior
- Verify resources cleaned up

---

### Edge Case 8: Session with Many Messages


**Scenario**: Session contains thousands of messages (e.g., burst of GitHub events for single PR).

**Expected Behavior**:

- Consumer receives messages one at a time or in batches
- Session lock maintained throughout processing
- All messages processed before session completion
- No arbitrary limits on message count per session

**Implementation Considerations**:

- Applications should process session messages in loop
- May need to extend lock for very long sessions
- Consider session timeout for applications
- Monitor session processing duration

**Test Strategy**:

- Create session with 1000+ messages
- Verify all messages delivered
- Verify session lock maintained
- Measure processing time

---

### Edge Case 9: Concurrent Session Acceptance


**Scenario**: Multiple consumers attempt to accept same session simultaneously.

**Expected Behavior**:

- Only one consumer succeeds in accepting session
- Other consumers receive error or timeout
- No duplicate processing of session messages
- Fair distribution of sessions among consumers

**Implementation Considerations**:

- Provider ensures exclusive session locks
- Library propagates lock acquisition failures appropriately
- Applications should retry with different session or timeout

**Test Strategy**:

- Launch multiple consumers simultaneously
- Verify only one acquires each session
- Verify all messages processed exactly once
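
**Illustrative Sketch**:

The "launch multiple consumers simultaneously" test can be prototyped against an in-process stand-in for the provider's exclusive session locks. `SessionRegistry`, `try_accept`, and `race_for_session` are hypothetical names; real providers enforce exclusivity server-side.

```rust
use std::collections::HashSet;
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical registry standing in for provider-side session locks.
#[derive(Default)]
pub struct SessionRegistry {
    locked: Mutex<HashSet<String>>,
}

impl SessionRegistry {
    /// Returns true only for the caller that acquires the session lock.
    pub fn try_accept(&self, session_id: &str) -> bool {
        // `HashSet::insert` returns false if the ID was already present.
        self.locked.lock().unwrap().insert(session_id.to_owned())
    }

    pub fn release(&self, session_id: &str) {
        self.locked.lock().unwrap().remove(session_id);
    }
}

/// Races `consumers` threads for one session and returns how many won.
pub fn race_for_session(consumers: usize, session_id: &str) -> usize {
    let registry = Arc::new(SessionRegistry::default());
    let handles: Vec<_> = (0..consumers)
        .map(|_| {
            let registry = Arc::clone(&registry);
            let id = session_id.to_owned();
            thread::spawn(move || registry.try_accept(&id))
        })
        .collect();
    handles
        .into_iter()
        .map(|h| h.join().unwrap())
        .filter(|won| *won)
        .count()
}
```

Because the check and the insert happen under one mutex acquisition, exactly one thread can win regardless of scheduling order.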

---

## Connection and Network Edge Cases


### Edge Case 10: Network Partition During Send


**Scenario**: Network connection lost during `send_message()` operation.

**Expected Behavior**:

- Operation returns `Err(QueueError::ConnectionFailed)` or times out
- Delivery outcome is indeterminate: the provider may have stored the message even though the caller saw an error
- Application can retry the send, accepting that a retry may produce a duplicate (at-least-once semantics)
- Connection automatically re-established on next operation

**Implementation Considerations**:

- Retries handled by provider SDKs
- Library timeout prevents indefinite hang
- Applications must handle potential duplicates

**Test Strategy**:

- Simulate network failures mid-operation
- Verify appropriate errors returned
- Verify connection recovery on subsequent operations
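
**Illustrative Sketch**:

An application-level retry around a send might be shaped like this. `send_with_retry` is a hypothetical helper, the `send` closure stands in for the real call, and only `ConnectionFailed` is treated as retryable here; which errors are safe to retry is an assumption that each application should revisit.

```rust
#[derive(Debug, PartialEq)]
pub enum QueueError {
    ConnectionFailed,
    MessageTooLarge,
}

/// Calls `send` until it succeeds, retrying only on `ConnectionFailed`.
/// At least one call is always made; at most `max_attempts` calls are made
/// when `max_attempts` is one or more. Returns the number of calls used.
pub fn send_with_retry<F>(mut send: F, max_attempts: u32) -> Result<u32, QueueError>
where
    F: FnMut() -> Result<(), QueueError>,
{
    let mut attempt = 0;
    loop {
        attempt += 1;
        match send() {
            Ok(()) => return Ok(attempt),
            // The earlier attempt may still have reached the queue, so a
            // retry can create a duplicate that consumers must tolerate.
            Err(QueueError::ConnectionFailed) if attempt < max_attempts => continue,
            Err(e) => return Err(e),
        }
    }
}
```

Reusing the same message ID on every attempt gives consumers a key to deduplicate on if the first attempt was in fact stored.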

---

### Edge Case 11: Credential Expiration Mid-Session


**Scenario**: Using time-limited credentials (SAS token, temporary AWS credentials), credentials expire during processing.

**Expected Behavior**:

- Current operation may complete (if token valid when started)
- Subsequent operations return `AuthenticationFailed` error
- Library detects expired credentials and requests refresh (if possible)
- Connection re-established with new credentials

**Implementation Considerations**:

- Managed identities and IAM roles refresh automatically
- Connection strings and access keys do not expire
- Applications using temporary credentials must handle refresh

**Test Strategy**:

- Use credentials with short expiration
- Verify detection of expiration
- Verify automatic refresh where supported

---

### Edge Case 12: Provider Service Outage


**Scenario**: Azure Service Bus or AWS SQS experiences regional outage.

**Expected Behavior**:

- All operations return connection or timeout errors
- Circuit breaker pattern prevents retry storms
- Library logs errors but does not panic
- Applications can detect outage and fallback or alert

**Implementation Considerations**:

- Retry with exponential backoff
- Circuit breaker opens after N consecutive failures
- Applications should monitor provider status pages
- Consider cross-region failover (application responsibility)

**Test Strategy**:

- Simulate provider unavailability
- Verify circuit breaker behavior
- Verify no resource exhaustion from retries
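
**Illustrative Sketch**:

The exponential backoff mentioned above can be expressed as a pure function, which makes the schedule easy to test. `backoff_delay` and its parameters are hypothetical; jitter, which is commonly added to avoid synchronized retries, is omitted to keep the sketch dependency-free.

```rust
use std::time::Duration;

/// Delay before retry number `attempt` (0-based): `base * 2^attempt`,
/// capped at `max`. Overflow at any step falls back to the cap.
pub fn backoff_delay(attempt: u32, base: Duration, max: Duration) -> Duration {
    let factor = 2u32.checked_pow(attempt).unwrap_or(u32::MAX);
    base.checked_mul(factor).map_or(max, |delay| delay.min(max))
}
```

With a 100 ms base and a 5 s cap, the delays grow as 100 ms, 200 ms, 400 ms, 800 ms and so on until they reach the cap, which bounds the retry rate during a prolonged outage.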

---

## Error Handling Edge Cases


### Edge Case 13: Poison Message


**Scenario**: Message that consistently causes consumer to crash or throw error.

**Expected Behavior**:

- Message redelivered up to max attempts
- `delivery_count` increments on each attempt
- After max attempts, moved to DLQ automatically
- DLQ message includes error reason and context

**Implementation Considerations**:

- Max delivery count configurable (default 3-5)
- Applications should catch errors and abandon message
- Library moves to DLQ after max count exceeded
- DLQ should be monitored for poison messages

**Test Strategy**:

- Send message that triggers application error
- Verify redelivery with increasing count
- Verify automatic DLQ movement
- Verify error context preserved
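
**Illustrative Sketch**:

The redelivery and dead-letter flow can be simulated in memory to drive the tests listed above. Everything here is a hypothetical model (`Message`, `DeadLetter`, `process_all`); in the real system the provider or library performs the DLQ move, not application code.

```rust
use std::collections::VecDeque;

#[derive(Debug, Clone, PartialEq)]
pub struct Message {
    pub body: String,
    pub delivery_count: u32,
}

#[derive(Debug, PartialEq)]
pub struct DeadLetter {
    pub message: Message,
    /// Error context preserved alongside the dead-lettered message.
    pub reason: String,
}

/// Drains `queue`, redelivering failed messages until their delivery count
/// reaches `max_delivery_count`, then moves them to the returned DLQ.
pub fn process_all<F>(
    queue: &mut VecDeque<Message>,
    max_delivery_count: u32,
    mut handler: F,
) -> Vec<DeadLetter>
where
    F: FnMut(&Message) -> Result<(), String>,
{
    let mut dlq = Vec::new();
    while let Some(mut msg) = queue.pop_front() {
        msg.delivery_count += 1;
        match handler(&msg) {
            Ok(()) => {} // completed: the message is gone
            Err(reason) if msg.delivery_count >= max_delivery_count => {
                dlq.push(DeadLetter { message: msg, reason });
            }
            Err(_) => queue.push_back(msg), // abandoned: redeliver later
        }
    }
    dlq
}
```

The loop always terminates because every failing message either reaches the maximum count or is completed, so a poison message cannot block the healthy ones behind it.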

---

### Edge Case 14: DLQ Full or Unavailable


**Scenario**: Attempting to dead letter message, but DLQ is full, deleted, or inaccessible.

**Expected Behavior**:

- Dead letter operation returns error
- Original message NOT lost (remains in main queue or returned to sender)
- Error logged with context
- Application alerted to DLQ issue

**Implementation Considerations**:

- Rare scenario (DLQs have same capacity as main queue)
- May indicate configuration error (wrong DLQ name)
- Provider may have fallback behavior

**Test Strategy**:

- Attempt dead letter to non-existent queue
- Verify error returned
- Verify message not lost

---

### Edge Case 15: Rapid Repeated Failures


**Scenario**: All messages failing rapidly (e.g., downstream database offline), causing continuous errors.

**Expected Behavior**:

- Circuit breaker opens after threshold failures
- Further operations fail fast without attempting provider call
- Circuit breaker closes after timeout or manual reset
- Prevents overwhelming provider or downstream services

**Implementation Considerations**:

- Circuit breaker configured with failure threshold and timeout
- Failure categories (transient vs permanent) considered
- Applications should monitor circuit breaker state
- Consider exponential backoff between retries after the breaker closes again

**Test Strategy**:

- Simulate continuous failures
- Verify circuit breaker opens
- Verify automatic recovery after timeout
- Measure impact on provider API calls
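
**Illustrative Sketch**:

A circuit breaker with a failure threshold and timeout can be sketched as a small state machine. This is one common variant, not the library's implementation: the `HalfOpen` state (a trial call is allowed after the timeout, and the breaker closes only once that call succeeds) is an assumption that goes beyond what this document specifies, and all names are hypothetical.

```rust
use std::time::{Duration, Instant};

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum State {
    Closed,
    Open,
    HalfOpen,
}

pub struct CircuitBreaker {
    failure_threshold: u32,
    open_timeout: Duration,
    consecutive_failures: u32,
    opened_at: Option<Instant>,
}

impl CircuitBreaker {
    pub fn new(failure_threshold: u32, open_timeout: Duration) -> Self {
        Self { failure_threshold, open_timeout, consecutive_failures: 0, opened_at: None }
    }

    pub fn state(&self, now: Instant) -> State {
        match self.opened_at {
            None => State::Closed,
            Some(t) if now.saturating_duration_since(t) >= self.open_timeout => State::HalfOpen,
            Some(_) => State::Open,
        }
    }

    /// False means the caller should fail fast without a provider call.
    pub fn allow(&self, now: Instant) -> bool {
        self.state(now) != State::Open
    }

    /// Acts as both the normal success path and a manual reset.
    pub fn record_success(&mut self) {
        self.consecutive_failures = 0;
        self.opened_at = None;
    }

    pub fn record_failure(&mut self, now: Instant) {
        self.consecutive_failures += 1;
        if self.consecutive_failures >= self.failure_threshold {
            // Opens the breaker; a failed trial call restarts the timeout.
            self.opened_at = Some(now);
        }
    }
}
```

Taking `now` as a parameter lets tests step through the open timeout without sleeping.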

---

## Configuration Edge Cases


### Edge Case 16: Invalid Configuration


**Scenario**: Application provides invalid configuration (wrong queue name, malformed connection string).

**Expected Behavior**:

- Client creation fails during initialization
- Error message clearly indicates configuration problem
- No operations attempted with invalid config
- Application fails fast at startup (not during message processing)

**Implementation Considerations**:

- Validate configuration during client creation
- Test connection during initialization
- Provide detailed error messages for troubleshooting

**Test Strategy**:

- Provide various invalid configurations
- Verify clear error messages
- Verify no undefined behavior
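
**Illustrative Sketch**:

Validation during client creation might look like the following. The rules are purely illustrative assumptions (allowed queue-name characters, an `Endpoint` key in a `Key=Value;` connection string); `QueueConfig`, `ConfigError`, and `validate` are hypothetical names and real provider rules differ.

```rust
#[derive(Debug, PartialEq)]
pub enum ConfigError {
    EmptyQueueName,
    InvalidQueueName(String),
    MissingConnectionKey(&'static str),
}

/// Hypothetical configuration shape for illustration only.
pub struct QueueConfig {
    pub queue_name: String,
    pub connection_string: String,
}

/// Checks the configuration before any client is built, so that failures
/// surface at startup rather than during message processing.
pub fn validate(config: &QueueConfig) -> Result<(), ConfigError> {
    let name = config.queue_name.trim();
    if name.is_empty() {
        return Err(ConfigError::EmptyQueueName);
    }
    // Assumed naming rule: ASCII letters, digits, '-' and '_' only.
    if !name.chars().all(|c| c.is_ascii_alphanumeric() || c == '-' || c == '_') {
        return Err(ConfigError::InvalidQueueName(name.to_owned()));
    }
    // Assumed format: semicolon-separated `Key=Value` pairs with an `Endpoint`.
    let has_endpoint = config
        .connection_string
        .split(';')
        .filter_map(|pair| pair.split_once('='))
        .any(|(key, value)| key.trim() == "Endpoint" && !value.trim().is_empty());
    if !has_endpoint {
        return Err(ConfigError::MissingConnectionKey("Endpoint"));
    }
    Ok(())
}
```

Returning a specific error variant per failure, rather than a generic message, is what lets the caller print a clear configuration diagnostic.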

---

### Edge Case 17: Missing Permissions


**Scenario**: Credentials valid but lack required permissions (e.g., can send but not receive).

**Expected Behavior**:

- Operations succeed for allowed actions
- Operations fail with `AuthorizationFailed` for disallowed actions
- Error message indicates permission issue
- Applications can handle partial functionality

**Implementation Considerations**:

- Provider authorization happens per-operation
- Library propagates authorization errors clearly
- Applications should validate permissions at startup if possible

**Test Strategy**:

- Configure credentials with limited permissions
- Verify appropriate operations succeed/fail
- Verify error messages helpful

---

### Edge Case 18: Configuration Change During Runtime


**Scenario**: Application configuration updated while running (e.g., new connection string via secret rotation).

**Expected Behavior**:

- Existing clients do not pick up configuration changes; hot reload is not supported
- Current recommendation: Restart application after config change
- Future consideration: Support configuration refresh

**Implementation Considerations**:

- Current implementation does not support runtime config changes
- Applications should restart after credential rotation
- Future feature: Watch for config changes and recreate client

**Test Strategy**:

- Document current behavior (restart required)
- Consider future feature for hot reload

---

## Performance Edge Cases


### Edge Case 19: High Throughput Burst


**Scenario**: Sudden burst of messages (e.g., 10,000 messages in 1 second from GitHub webhook spike).

**Expected Behavior**:

- Messages accepted and queued by provider
- Consumers process at sustainable rate
- Queue depth grows temporarily
- No message loss or errors

**Implementation Considerations**:

- Provider handles burst (queues are buffer)
- Applications should scale consumers based on queue depth
- Consider rate limiting at webhook ingestion if needed

**Test Strategy**:

- Send large batch of messages rapidly
- Monitor queue depth and processing rate
- Verify no errors or throttling

---

### Edge Case 20: Long Message Processing Time


**Scenario**: Individual message takes 10+ minutes to process (e.g., long-running GitHub workflow).

**Expected Behavior**:

- Message lock maintained throughout processing (if lock extended)
- Message not redelivered to other consumers during processing
- Processing completes successfully
- If lock not extended, message may be redelivered (see Edge Case 3)

**Implementation Considerations**:

- Lock extension not currently implemented (future feature)
- Applications should either:
  - Process quickly (< lock duration)
  - Or accept possibility of redelivery and ensure idempotence
- Configure lock duration appropriately

**Test Strategy**:

- Introduce artificial processing delays
- Verify behavior relative to lock duration
- Document recommendations for long-running processing

---

## Testing Considerations


### Coverage Goals


- Each edge case should have at least one integration test
- Critical edge cases (poison messages, lock expiration) should have multiple test scenarios
- Tests should cover all supported providers (Azure, AWS, in-memory)

### Test Environment


- Use emulators/in-memory for fast iteration
- Use real services for contract tests
- Simulate failures via chaos engineering techniques

### Documentation


- Edge cases should be documented in API docs
- Examples provided for common patterns
- Guidance for applications on handling each scenario

---

## Unhandled Scenarios (Known Limitations)


### Currently Not Supported


1. **Lock Extension**: Cannot extend message lock during long processing
   - Workaround: Set lock duration longer than expected processing time
   - Future feature consideration

2. **Message Modification**: Cannot modify message content after send
   - Workaround: Send new message and dead letter old one
   - Not typically needed

3. **Cross-Region Failover**: No automatic failover between regions
   - Workaround: Application-level failover logic
   - Complex feature, rarely needed

4. **Message Prioritization**: No priority queue support
   - Workaround: Use separate queues for priority levels
   - Not supported by all providers

5. **Transaction Support**: No distributed transactions across messages
   - Workaround: Idempotent processing and eventual consistency
   - Not available uniformly: SQS has no transactions, and Azure Service Bus transactions are limited to a single namespace

### Explicitly Out of Scope


1. **Message Routing**: Not a message broker, just queue abstraction
2. **Content Transformation**: Application responsibility
3. **Schema Validation**: Application responsibility
4. **Queue Provisioning**: Must be done via IaC or provider tools

---

## Recommendations for Applications


### Design for Failure


- Expect messages to be delivered multiple times (at-least-once semantics)
- Implement idempotent message handlers
- Handle all error types appropriately
- Use correlation IDs for tracing

### Handle Edge Cases Gracefully


- Set appropriate timeouts for your workload
- Monitor delivery counts and DLQ
- Test with failure injection
- Have runbooks for common failure scenarios

### Monitor and Alert


- Track message processing latency
- Alert on high DLQ rates
- Monitor queue depth trends
- Set up circuit breaker alerts

### Test Thoroughly


- Include edge case testing in CI/CD
- Perform chaos engineering in staging
- Load test with realistic scenarios
- Validate error handling paths