# Queue Runtime - Edge Cases and Failure Modes
## Overview
This document catalogs non-standard flows, edge cases, and failure scenarios that queue-runtime must handle correctly. Each case describes the scenario, the expected behavior, implementation considerations, and a test strategy.
---
## Message Processing Edge Cases
### Edge Case 1: Empty Queue with Timeout
**Scenario**: Application calls `receive_message()` on an empty queue with a timeout.
**Expected Behavior**:
- Operation blocks for specified timeout duration
- Returns `Ok(None)` when timeout expires (not an error)
- Is cancellable via `CancellationToken` before timeout expires
- Does not consume excessive resources while waiting
**Implementation Considerations**:
- Azure: Uses `receive_messages()` with timeout parameter
- AWS: Uses long polling with `WaitTimeSeconds`
- In-memory: Uses condition variable with timeout
**Test Strategy**:
- Verify timeout accuracy (±100ms tolerance)
- Verify cancellation works mid-timeout
- Verify no memory leaks during long waits
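
The in-memory behavior described above can be sketched with the standard library's `Condvar::wait_timeout_while`. This is a minimal illustration, not the library's implementation: `InMemoryQueue` and its `send`/`receive` methods are hypothetical names, the API is synchronous, and cancellation via `CancellationToken` is left out. The point it shows is that an empty queue at the deadline produces `Ok(None)` rather than an error.

```rust
use std::collections::VecDeque;
use std::sync::{Condvar, Mutex};
use std::time::Duration;

/// Hypothetical in-memory queue used only for illustration.
pub struct InMemoryQueue {
    messages: Mutex<VecDeque<String>>,
    not_empty: Condvar,
}

impl InMemoryQueue {
    pub fn new() -> Self {
        Self {
            messages: Mutex::new(VecDeque::new()),
            not_empty: Condvar::new(),
        }
    }

    /// Enqueues a message and wakes one waiting receiver.
    pub fn send(&self, body: &str) {
        self.messages.lock().unwrap().push_back(body.to_string());
        self.not_empty.notify_one();
    }

    /// Waits up to `timeout` for a message. An empty queue at the deadline
    /// yields `Ok(None)`; only a poisoned lock is reported as an error.
    pub fn receive(&self, timeout: Duration) -> Result<Option<String>, String> {
        let guard = self.messages.lock().map_err(|e| e.to_string())?;
        let (mut guard, _timeout_result) = self
            .not_empty
            .wait_timeout_while(guard, timeout, |queue| queue.is_empty())
            .map_err(|e| e.to_string())?;
        Ok(guard.pop_front())
    }
}
```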
---
### Edge Case 2: Message Received Multiple Times
**Scenario**: Message delivered, consumer crashes before completing, message redelivered to different consumer.
**Expected Behavior**:
- `delivery_count` increments on each delivery
- Each consumer receives valid receipt handle
- Only one consumer can complete the message
- Other receipt handles become invalid after completion
**Implementation Considerations**:
- Idempotent message processing in applications recommended
- Library tracks delivery count accurately
- Completion with invalid receipt returns specific error
**Test Strategy**:
- Simulate consumer crash scenarios
- Verify delivery count increments correctly
- Verify only first completion succeeds
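
The redelivery rules above can be modeled in a few lines, which is also a convenient shape for a test double. The sketch below is an assumption-laden model, not the library's code: `RedeliveryModel`, `Delivery`, and the numeric receipt handles are hypothetical, and lock expiry is ignored so that both receipts stay valid until the first completion.

```rust
use std::collections::{HashMap, HashSet};

#[derive(Debug, PartialEq)]
pub enum QueueError {
    InvalidReceipt,
}

/// What a consumer gets back from a (hypothetical) receive call.
pub struct Delivery {
    pub receipt: u64,
    pub delivery_count: u32,
}

struct Tracked {
    delivery_count: u32,
    receipts: HashSet<u64>,
}

/// Hypothetical model of per-message delivery tracking.
pub struct RedeliveryModel {
    messages: HashMap<String, Tracked>,
    next_receipt: u64,
}

impl RedeliveryModel {
    pub fn new() -> Self {
        Self { messages: HashMap::new(), next_receipt: 0 }
    }

    pub fn enqueue(&mut self, id: &str) {
        self.messages.insert(
            id.to_string(),
            Tracked { delivery_count: 0, receipts: HashSet::new() },
        );
    }

    /// Every delivery increments the count and issues a fresh receipt.
    pub fn deliver(&mut self, id: &str) -> Option<Delivery> {
        let tracked = self.messages.get_mut(id)?;
        tracked.delivery_count += 1;
        self.next_receipt += 1;
        tracked.receipts.insert(self.next_receipt);
        Some(Delivery {
            receipt: self.next_receipt,
            delivery_count: tracked.delivery_count,
        })
    }

    /// The first completion removes the message, which invalidates every
    /// other receipt that was issued for it.
    pub fn complete(&mut self, id: &str, receipt: u64) -> Result<(), QueueError> {
        let valid = self
            .messages
            .get(id)
            .map_or(false, |tracked| tracked.receipts.contains(&receipt));
        if !valid {
            return Err(QueueError::InvalidReceipt);
        }
        self.messages.remove(id);
        Ok(())
    }
}
```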
---
### Edge Case 3: Message Lock Expires During Processing
**Scenario**: Consumer receives message, processing takes longer than lock duration, lock expires before completion attempted.
**Expected Behavior**:
- Message becomes visible to other consumers after lock expires
- Original consumer's `complete_message()` returns `InvalidReceipt` error
- Lock expiration alone does not increment `delivery_count`; the count increments only when the message is delivered again
- Application should handle `InvalidReceipt` gracefully
**Implementation Considerations**:
- Applications should monitor processing time relative to lock duration
- Consider extending lock if long processing expected (future feature)
- Set lock duration appropriately for workload
**Test Strategy**:
- Introduce artificial delays in processing
- Verify lock expiration detected correctly
- Verify message redelivery after lock expiration
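
One way an application can monitor processing time relative to the lock duration is to track the expected expiry itself. The helper below is a hypothetical sketch (`LockDeadline` is not part of the library); it takes `now` as a parameter so the logic can be tested without sleeping. It only estimates the deadline from the local clock, so a safety margin is advisable.

```rust
use std::time::{Duration, Instant};

/// Hypothetical helper that tracks when a message lock is expected to expire.
pub struct LockDeadline {
    expires_at: Instant,
}

impl LockDeadline {
    /// `received_at` is when the message was received; `lock_duration` is the
    /// lock duration configured on the queue.
    pub fn starting_at(received_at: Instant, lock_duration: Duration) -> Self {
        Self { expires_at: received_at + lock_duration }
    }

    /// Time left before the lock expires as seen at `now`, or zero if it has
    /// already expired.
    pub fn remaining_at(&self, now: Instant) -> Duration {
        self.expires_at.saturating_duration_since(now)
    }

    /// True when less than `margin` remains, meaning a later
    /// `complete_message()` call is likely to fail with `InvalidReceipt`.
    pub fn at_risk(&self, now: Instant, margin: Duration) -> bool {
        self.remaining_at(now) < margin
    }
}
```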
---
### Edge Case 4: Duplicate Message IDs
**Scenario**: Two messages sent with identical message IDs (possible with different sessions or timing).
**Expected Behavior**:
- Both messages accepted and delivered independently
- Each has unique receipt handle
- Applications responsible for deduplication if needed
- Library does not enforce message ID uniqueness
**Implementation Considerations**:
- Message IDs generated by sender (application or library)
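- Provider behavior varies: Azure Service Bus with duplicate detection enabled silently discards a repeated message ID within the detection window, while AWS SQS standard queues perform no deduplication (FIFO queues deduplicate on `MessageDeduplicationId`)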
- Applications needing strict deduplication should track processed IDs
**Test Strategy**:
- Send messages with duplicate IDs
- Verify both delivered (or appropriate error returned)
- Document provider-specific behavior differences
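
For applications that need strict deduplication, tracking processed IDs can be as simple as a bounded set. The sketch below is illustrative only: `ProcessedIds` is a hypothetical name, the state is in-process (a real deployment with several consumers would likely need a shared store), and the oldest-first eviction policy is an assumption chosen to keep memory bounded.

```rust
use std::collections::{HashSet, VecDeque};

/// Hypothetical bounded record of message IDs that were already processed.
pub struct ProcessedIds {
    seen: HashSet<String>,
    order: VecDeque<String>,
    capacity: usize,
}

impl ProcessedIds {
    /// `capacity` is the number of IDs remembered; at least one is kept.
    pub fn new(capacity: usize) -> Self {
        Self {
            seen: HashSet::new(),
            order: VecDeque::new(),
            capacity: capacity.max(1),
        }
    }

    /// Returns `true` the first time an ID is seen (process the message) and
    /// `false` for a remembered duplicate (skip it). When the record is full,
    /// the oldest ID is forgotten.
    pub fn first_seen(&mut self, id: &str) -> bool {
        if self.seen.contains(id) {
            return false;
        }
        if self.order.len() >= self.capacity {
            if let Some(oldest) = self.order.pop_front() {
                self.seen.remove(&oldest);
            }
        }
        self.seen.insert(id.to_string());
        self.order.push_back(id.to_string());
        true
    }
}
```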
---
### Edge Case 5: Maximum Message Size Exceeded
**Scenario**: Application attempts to send message larger than provider limit.
**Expected Behavior**:
- `send_message()` returns `Err(QueueError::MessageTooLarge)`
- Error includes actual size and maximum allowed size
- No message sent to queue
- No partial sends or truncation
**Implementation Considerations**:
- Azure: 256 KB limit (Standard), 1 MB (Premium)
- AWS: 256 KB limit
- Library may pre-validate size before sending
- Consider compression for large payloads
**Test Strategy**:
- Send messages at and above size limits
- Verify appropriate error returned
- Verify no message fragments in queue
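
Pre-validation before sending could look like the sketch below. The shape of the error is an assumption: the document only states that `QueueError::MessageTooLarge` carries the actual and maximum sizes, so the field names `size` and `max_size` and the `validate_size` function are hypothetical. Note also that providers may count headers or attributes toward the limit, which this body-only check does not model.

```rust
#[derive(Debug, PartialEq)]
pub enum QueueError {
    /// Field names are illustrative, not the library's definition.
    MessageTooLarge { size: usize, max_size: usize },
}

/// Rejects a payload before any provider call is made, so nothing is sent
/// and nothing is truncated.
pub fn validate_size(body: &[u8], max_size: usize) -> Result<(), QueueError> {
    if body.len() > max_size {
        Err(QueueError::MessageTooLarge { size: body.len(), max_size })
    } else {
        Ok(())
    }
}
```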
---
## Session Processing Edge Cases
### Edge Case 6: Session Lock Stolen by Another Consumer
**Scenario**: Consumer A holds session lock, lock expires, Consumer B acquires lock, Consumer A attempts operation.
**Expected Behavior**:
- Consumer A's operation returns `SessionLockLost` error
- Consumer B's operations succeed normally
- Messages not duplicated or lost
- Session state remains consistent
**Implementation Considerations**:
- Session lock has timeout (configurable, default 5 minutes)
- Lock not automatically renewed (future feature consideration)
- Applications must handle lock loss gracefully
**Test Strategy**:
- Simulate slow processing with lock timeout
- Verify lock transfer detected correctly
- Verify no message loss during transfer
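
The lock hand-over in this scenario can be modeled with an owner and an expiry time. The sketch below is a simplified, single-session model under stated assumptions: `SessionLock`, `acquire`, and `check` are hypothetical names, time is passed in explicitly to keep the logic testable, and a real provider enforces this on the server side.

```rust
use std::time::{Duration, Instant};

#[derive(Debug, PartialEq)]
pub enum QueueError {
    SessionLockLost,
}

/// Hypothetical model of an exclusive, expiring lock on one session.
pub struct SessionLock {
    owner: Option<(String, Instant)>,
    timeout: Duration,
}

impl SessionLock {
    pub fn new(timeout: Duration) -> Self {
        Self { owner: None, timeout }
    }

    /// Succeeds if the session is free or the previous lock expired at `now`.
    pub fn acquire(&mut self, consumer: &str, now: Instant) -> bool {
        let free = match &self.owner {
            Some((_, expires_at)) => now >= *expires_at,
            None => true,
        };
        if free {
            self.owner = Some((consumer.to_string(), now + self.timeout));
        }
        free
    }

    /// Check made before any session operation: only the current owner with
    /// an unexpired lock may proceed.
    pub fn check(&self, consumer: &str, now: Instant) -> Result<(), QueueError> {
        match &self.owner {
            Some((owner, expires_at)) if owner.as_str() == consumer && now < *expires_at => {
                Ok(())
            }
            _ => Err(QueueError::SessionLockLost),
        }
    }
}
```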
---
### Edge Case 7: Session with No Messages
**Scenario**: Consumer calls `accept_session()` on queue with no messages in any session.
**Expected Behavior**:
- Operation times out based on configured timeout
- Returns `Ok(None)` indicating no session available
- Does not block indefinitely
- No error logged (this is normal)
**Implementation Considerations**:
- Azure: Returns no session after timeout
- AWS: May need to poll FIFO queues
- Applications should handle gracefully and retry
**Test Strategy**:
- Accept session on empty queue
- Verify timeout behavior
- Verify resources cleaned up
---
### Edge Case 8: Session with Many Messages
**Scenario**: Session contains thousands of messages (e.g., burst of GitHub events for single PR).
**Expected Behavior**:
- Consumer receives messages one at a time or in batches
- Session lock maintained throughout processing
- All messages processed before session completion
- No arbitrary limits on message count per session
**Implementation Considerations**:
- Applications should process session messages in loop
- May need to extend lock for very long sessions
- Consider session timeout for applications
- Monitor session processing duration
**Test Strategy**:
- Create session with 1000+ messages
- Verify all messages delivered
- Verify session lock maintained
- Measure processing time
---
### Edge Case 9: Concurrent Session Acceptance
**Scenario**: Multiple consumers attempt to accept same session simultaneously.
**Expected Behavior**:
- Only one consumer succeeds in accepting session
- Other consumers receive error or timeout
- No duplicate processing of session messages
- Fair distribution of sessions among consumers
**Implementation Considerations**:
- Provider ensures exclusive session locks
- Library propagates lock acquisition failures appropriately
- Applications should retry with different session or timeout
**Test Strategy**:
- Launch multiple consumers simultaneously
- Verify only one acquires each session
- Verify all messages processed exactly once
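
A test for exclusive acceptance can be prototyped against a small in-process registry before running it against real providers. In the sketch below, `SessionLocks` and `try_accept` are hypothetical stand-ins for the provider's exclusive session lock; the mutex makes the check-and-insert step atomic, so exactly one caller wins per session.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Hypothetical registry mapping session IDs to the consumer holding them.
pub struct SessionLocks {
    owners: Mutex<HashMap<String, String>>,
}

impl SessionLocks {
    pub fn new() -> Self {
        Self { owners: Mutex::new(HashMap::new()) }
    }

    /// Returns `true` only for the consumer that wins the session. The lookup
    /// and the insert happen under one lock, so two callers cannot both win.
    pub fn try_accept(&self, session_id: &str, consumer: &str) -> bool {
        let mut owners = self.owners.lock().unwrap();
        if owners.contains_key(session_id) {
            return false;
        }
        owners.insert(session_id.to_string(), consumer.to_string());
        true
    }

    /// Name of the consumer currently holding the session, if any.
    pub fn owner_of(&self, session_id: &str) -> Option<String> {
        self.owners.lock().unwrap().get(session_id).cloned()
    }
}
```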
---
## Connection and Network Edge Cases
### Edge Case 10: Network Partition During Send
**Scenario**: Network connection lost during `send_message()` operation.
**Expected Behavior**:
- Operation returns `Err(QueueError::ConnectionFailed)` or times out
- Outcome is indeterminate: the message may or may not have been enqueued before the connection dropped
- Application can retry send (may result in duplicate message)
- Connection automatically re-established on next operation
**Implementation Considerations**:
- Retries handled by provider SDKs
- Library timeout prevents indefinite hang
- Applications must handle potential duplicates
**Test Strategy**:
- Simulate network failures mid-operation
- Verify appropriate errors returned
- Verify connection recovery on subsequent operations
---
### Edge Case 11: Credential Expiration Mid-Session
**Scenario**: Using time-limited credentials (SAS token, temporary AWS credentials), credentials expire during processing.
**Expected Behavior**:
- Current operation may complete (if token valid when started)
- Subsequent operations return `AuthenticationFailed` error
- Library detects expired credentials and requests refresh (if possible)
- Connection re-established with new credentials
**Implementation Considerations**:
- Managed identities and IAM roles refresh automatically
- Connection strings and access keys do not expire
- Applications using temporary credentials must handle refresh
**Test Strategy**:
- Use credentials with short expiration
- Verify detection of expiration
- Verify automatic refresh where supported
---
### Edge Case 12: Provider Service Outage
**Scenario**: Azure Service Bus or AWS SQS experiences regional outage.
**Expected Behavior**:
- All operations return connection or timeout errors
- Circuit breaker pattern prevents retry storms
- Library logs errors but does not panic
- Applications can detect outage and fallback or alert
**Implementation Considerations**:
- Retry with exponential backoff
- Circuit breaker opens after N consecutive failures
- Applications should monitor provider status pages
- Consider cross-region failover (application responsibility)
**Test Strategy**:
- Simulate provider unavailability
- Verify circuit breaker behavior
- Verify no resource exhaustion from retries
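
The retry delay for exponential backoff is a small calculation worth pinning down in a test. The function below is a sketch with hypothetical names and parameters (`backoff_delay`, `base`, `max`); it doubles the delay per attempt and caps it, using checked arithmetic so large attempt numbers cannot overflow. Jitter is commonly added on top and is omitted here to keep the example dependency-free.

```rust
use std::time::Duration;

/// Delay before retry number `attempt` (0-based): `base * 2^attempt`,
/// capped at `max`.
pub fn backoff_delay(attempt: u32, base: Duration, max: Duration) -> Duration {
    // Saturate the multiplier instead of overflowing for large attempts.
    let factor = 2u32.checked_pow(attempt).unwrap_or(u32::MAX);
    base.checked_mul(factor).unwrap_or(max).min(max)
}
```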
---
## Error Handling Edge Cases
### Edge Case 13: Poison Message
**Scenario**: Message that consistently causes consumer to crash or throw error.
**Expected Behavior**:
- Message redelivered up to max attempts
- `delivery_count` increments on each attempt
- After max attempts, moved to DLQ automatically
- DLQ message includes error reason and context
**Implementation Considerations**:
- Max delivery count configurable (default 3-5)
- Applications should catch errors and abandon message
- Library moves to DLQ after max count exceeded
- DLQ should be monitored for poison messages
**Test Strategy**:
- Send message that triggers application error
- Verify redelivery with increasing count
- Verify automatic DLQ movement
- Verify error context preserved
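
The decision made after each processing attempt can be written as a pure function, which makes the redelivery and DLQ rules easy to test. This is a sketch under assumptions: `Disposition` and `disposition_for` are hypothetical names, and dead-lettering once `delivery_count` reaches the maximum is one reasonable reading of "after max attempts"; the library's exact threshold may differ.

```rust
#[derive(Debug, PartialEq)]
pub enum Disposition {
    /// Processing succeeded; remove the message from the queue.
    Complete,
    /// Processing failed; release the message for redelivery.
    Abandon,
    /// Attempts exhausted; move to the DLQ, keeping the failure reason.
    DeadLetter { reason: String },
}

/// Decides what to do with a message after one processing attempt.
/// `delivery_count` is the count reported for this delivery (1 on the first).
pub fn disposition_for(
    result: Result<(), String>,
    delivery_count: u32,
    max_delivery_count: u32,
) -> Disposition {
    match result {
        Ok(()) => Disposition::Complete,
        Err(reason) if delivery_count >= max_delivery_count => {
            Disposition::DeadLetter { reason }
        }
        Err(_) => Disposition::Abandon,
    }
}
```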
---
### Edge Case 14: DLQ Full or Unavailable
**Scenario**: Attempting to dead letter message, but DLQ is full, deleted, or inaccessible.
**Expected Behavior**:
- Dead letter operation returns error
- Original message NOT lost (remains in main queue or returned to sender)
- Error logged with context
- Application alerted to DLQ issue
**Implementation Considerations**:
- Rare scenario (DLQs have same capacity as main queue)
- May indicate configuration error (wrong DLQ name)
- Provider may have fallback behavior
**Test Strategy**:
- Attempt dead letter to non-existent queue
- Verify error returned
- Verify message not lost
---
### Edge Case 15: Rapid Repeated Failures
**Scenario**: All messages failing rapidly (e.g., downstream database offline), causing continuous errors.
**Expected Behavior**:
- Circuit breaker opens after threshold failures
- Further operations fail fast without attempting provider call
- Circuit breaker closes after timeout or manual reset
- Prevents overwhelming provider or downstream services
**Implementation Considerations**:
- Circuit breaker configured with failure threshold and timeout
- Failure categories (transient vs permanent) considered
- Applications should monitor circuit breaker state
- Consider exponential backoff between retries while the circuit is closed
**Test Strategy**:
- Simulate continuous failures
- Verify circuit breaker opens
- Verify automatic recovery after timeout
- Measure impact on provider API calls
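
A circuit breaker with a failure threshold and a reset timeout can be sketched as a small state machine. The code below is illustrative, not the library's implementation: `CircuitBreaker`, `allow`, `record_success`, and `record_failure` are hypothetical names, and two design choices are assumptions. After the timeout a single trial call is let through, and one more failure at that point reopens the circuit immediately. Failure categories (transient vs permanent) are not modeled.

```rust
use std::time::{Duration, Instant};

#[derive(Debug, PartialEq)]
pub enum CircuitError {
    Open,
}

/// Hypothetical circuit breaker driven by consecutive failures.
pub struct CircuitBreaker {
    failure_threshold: u32,
    reset_timeout: Duration,
    consecutive_failures: u32,
    opened_at: Option<Instant>,
}

impl CircuitBreaker {
    pub fn new(failure_threshold: u32, reset_timeout: Duration) -> Self {
        Self {
            failure_threshold,
            reset_timeout,
            consecutive_failures: 0,
            opened_at: None,
        }
    }

    /// Call before each provider operation. While open, this fails fast
    /// without touching the provider.
    pub fn allow(&mut self, now: Instant) -> Result<(), CircuitError> {
        if let Some(opened) = self.opened_at {
            if now.saturating_duration_since(opened) < self.reset_timeout {
                return Err(CircuitError::Open);
            }
            // Timeout elapsed: close the circuit for a trial call, but keep
            // the failure count one below the threshold so a single failure
            // reopens it.
            self.opened_at = None;
            self.consecutive_failures = self.failure_threshold.saturating_sub(1);
        }
        Ok(())
    }

    pub fn record_success(&mut self) {
        self.consecutive_failures = 0;
    }

    pub fn record_failure(&mut self, now: Instant) {
        self.consecutive_failures += 1;
        if self.consecutive_failures >= self.failure_threshold {
            self.opened_at = Some(now);
        }
    }
}
```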
---
## Configuration Edge Cases
### Edge Case 16: Invalid Configuration
**Scenario**: Application provides invalid configuration (wrong queue name, malformed connection string).
**Expected Behavior**:
- Client creation fails during initialization
- Error message clearly indicates configuration problem
- No operations attempted with invalid config
- Application fails fast at startup (not during message processing)
**Implementation Considerations**:
- Validate configuration during client creation
- Test connection during initialization
- Provide detailed error messages for troubleshooting
**Test Strategy**:
- Provide various invalid configurations
- Verify clear error messages
- Verify no undefined behavior
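
Validation at client creation might be structured as below. Everything here is a hedged sketch: `QueueConfig`, `ConfigError`, and `validate` are hypothetical, the required keys follow the common Azure Service Bus shared-access connection string layout (`Endpoint=...;SharedAccessKeyName=...;SharedAccessKey=...`), and other credential types or providers would need different checks. It covers only the static part of validation; testing the connection still requires a provider call.

```rust
#[derive(Debug, PartialEq)]
pub enum ConfigError {
    EmptyQueueName,
    MissingConnectionField(&'static str),
}

/// Hypothetical configuration shape, for illustration only.
pub struct QueueConfig {
    pub queue_name: String,
    pub connection_string: String,
}

/// Keys expected in a shared-access connection string (assumed layout).
const REQUIRED_FIELDS: [&str; 3] = ["Endpoint", "SharedAccessKeyName", "SharedAccessKey"];

/// Checks the configuration before any client is built, so that problems
/// surface at startup with a specific error.
pub fn validate(config: &QueueConfig) -> Result<(), ConfigError> {
    if config.queue_name.trim().is_empty() {
        return Err(ConfigError::EmptyQueueName);
    }
    for field in REQUIRED_FIELDS {
        // A field counts as present only if it has a non-empty value.
        // `split_once` splits at the first '=', so values that contain '='
        // (such as base64 padding) are kept intact.
        let present = config.connection_string.split(';').any(|part| {
            part.split_once('=')
                .map_or(false, |(key, value)| key == field && !value.is_empty())
        });
        if !present {
            return Err(ConfigError::MissingConnectionField(field));
        }
    }
    Ok(())
}
```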
---
### Edge Case 17: Missing Permissions
**Scenario**: Credentials valid but lack required permissions (e.g., can send but not receive).
**Expected Behavior**:
- Operations succeed for allowed actions
- Operations fail with `AuthorizationFailed` for disallowed actions
- Error message indicates permission issue
- Applications can handle partial functionality
**Implementation Considerations**:
- Provider authorization happens per-operation
- Library propagates authorization errors clearly
- Applications should validate permissions at startup if possible
**Test Strategy**:
- Configure credentials with limited permissions
- Verify appropriate operations succeed/fail
- Verify error messages are helpful
---
### Edge Case 18: Configuration Change During Runtime
**Scenario**: Application configuration updated while running (e.g., new connection string via secret rotation).
**Expected Behavior**:
- Depends on implementation: May require restart or support hot reload
- Current recommendation: Restart application after config change
- Future consideration: Support configuration refresh
**Implementation Considerations**:
- Current implementation does not support runtime config changes
- Applications should restart after credential rotation
- Future feature: Watch for config changes and recreate client
**Test Strategy**:
- Document current behavior (restart required)
- Consider future feature for hot reload
---
## Performance Edge Cases
### Edge Case 19: High Throughput Burst
**Scenario**: Sudden burst of messages (e.g., 10,000 messages in 1 second from GitHub webhook spike).
**Expected Behavior**:
- Messages accepted and queued by provider
- Consumers process at sustainable rate
- Queue depth grows temporarily
- No message loss or errors
**Implementation Considerations**:
- Provider handles burst (queues are buffer)
- Applications should scale consumers based on queue depth
- Consider rate limiting at webhook ingestion if needed
**Test Strategy**:
- Send large batch of messages rapidly
- Monitor queue depth and processing rate
- Verify no errors or throttling
---
### Edge Case 20: Long Message Processing Time
**Scenario**: Individual message takes 10+ minutes to process (e.g., long-running GitHub workflow).
**Expected Behavior**:
- Message lock maintained throughout processing (if lock extended)
- Message not redelivered to other consumers during processing
- Processing completes successfully
- If lock not extended, message may be redelivered (see Edge Case 3)
**Implementation Considerations**:
- Lock extension not currently implemented (future feature)
- Applications should either:
  - Process quickly (within the lock duration), or
  - Accept the possibility of redelivery and ensure idempotence
- Configure lock duration appropriately
**Test Strategy**:
- Introduce artificial processing delays
- Verify behavior relative to lock duration
- Document recommendations for long-running processing
---
## Testing Considerations
### Coverage Goals
- Each edge case should have at least one integration test
- Critical edge cases (poison messages, lock expiration) should have multiple test scenarios
- Tests should cover all supported providers (Azure, AWS, in-memory)
### Test Environment
- Use emulators/in-memory for fast iteration
- Use real services for contract tests
- Simulate failures via chaos engineering techniques
### Documentation
- Edge cases should be documented in API docs
- Examples provided for common patterns
- Guidance for applications on handling each scenario
---
## Unhandled Scenarios (Known Limitations)
### Currently Not Supported
1. **Lock Extension**: Cannot extend message lock during long processing
   - Workaround: Set lock duration longer than expected processing time
   - Future feature consideration
2. **Message Modification**: Cannot modify message content after send
   - Workaround: Send new message and dead letter old one
   - Not typically needed
3. **Cross-Region Failover**: No automatic failover between regions
   - Workaround: Application-level failover logic
   - Complex feature, rarely needed
4. **Message Prioritization**: No priority queue support
   - Workaround: Use separate queues for priority levels
   - Not supported by all providers
5. **Transaction Support**: No distributed transactions across messages
   - Workaround: Idempotent processing and eventual consistency
   - Not uniformly supported across providers
### Explicitly Out of Scope
1. **Message Routing**: Not a message broker, just queue abstraction
2. **Content Transformation**: Application responsibility
3. **Schema Validation**: Application responsibility
4. **Queue Provisioning**: Must be done via IaC or provider tools
---
## Recommendations for Applications
### Design for Failure
- Expect messages to be delivered multiple times (at-least-once semantics)
- Implement idempotent message handlers
- Handle all error types appropriately
- Use correlation IDs for tracing
### Handle Edge Cases Gracefully
- Set appropriate timeouts for your workload
- Monitor delivery counts and DLQ
- Test with failure injection
- Have runbooks for common failure scenarios
### Monitor and Alert
- Track message processing latency
- Alert on high DLQ rates
- Monitor queue depth trends
- Set up circuit breaker alerts
### Test Thoroughly
- Include edge case testing in CI/CD
- Perform chaos engineering in staging
- Load test with realistic scenarios
- Validate error handling paths