queue-runtime 0.2.0

Multi-provider queue runtime for Queue-Keeper
# Queue Runtime - Edge Cases and Failure Modes


## Overview


This document catalogs non-standard flows, edge cases, and failure scenarios that queue-runtime must handle correctly. Each case includes description, expected behavior, and testing considerations.

---

## Message Processing Edge Cases


### Edge Case 1: Empty Queue with Timeout


**Scenario**: Application calls `receive_message()` on an empty queue with a timeout.

**Expected Behavior**:

- Operation blocks for specified timeout duration
- Returns `Ok(None)` when timeout expires (not an error)
- Is cancellable via `CancellationToken` before timeout expires
- Does not consume excessive resources while waiting

**Implementation Considerations**:

- Azure: Uses `receive_messages()` with timeout parameter
- AWS: Uses long polling with `WaitTimeSeconds`
- In-memory: Uses a condition variable with timeout
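
The in-memory behavior described above can be sketched with the standard library alone. This is a minimal illustration, not the crate's implementation: `InMemoryQueue`, `SketchError`, and the method signatures are assumptions made for the example, and cancellation via `CancellationToken` is left out.

```rust
use std::collections::VecDeque;
use std::sync::{Condvar, Mutex};
use std::time::Duration;

/// Local stand-in error type for this sketch (not the crate's error type).
#[derive(Debug, PartialEq)]
pub enum SketchError {
    LockPoisoned,
}

/// Hypothetical in-memory queue used only to illustrate a timed receive.
pub struct InMemoryQueue {
    messages: Mutex<VecDeque<String>>,
    not_empty: Condvar,
}

impl InMemoryQueue {
    pub fn new() -> Self {
        Self {
            messages: Mutex::new(VecDeque::new()),
            not_empty: Condvar::new(),
        }
    }

    pub fn send(&self, body: &str) -> Result<(), SketchError> {
        let mut queue = self.messages.lock().map_err(|_| SketchError::LockPoisoned)?;
        queue.push_back(body.to_string());
        // Wake one blocked receiver, if there is one.
        self.not_empty.notify_one();
        Ok(())
    }

    /// Blocks for at most `timeout`. An empty queue at the deadline yields
    /// `Ok(None)`, which is a normal outcome rather than an error.
    pub fn receive(&self, timeout: Duration) -> Result<Option<String>, SketchError> {
        let guard = self.messages.lock().map_err(|_| SketchError::LockPoisoned)?;
        // The predicate is re-checked after every wakeup, so a spurious
        // wakeup does not end the wait early. The thread sleeps while
        // waiting instead of spinning.
        let (mut guard, _timeout_result) = self
            .not_empty
            .wait_timeout_while(guard, timeout, |queue| queue.is_empty())
            .map_err(|_| SketchError::LockPoisoned)?;
        Ok(guard.pop_front())
    }
}
```

Returning `Result<Option<_>, _>` keeps "nothing arrived" separate from real failures, so callers can loop on `Ok(None)` without treating it as an error path.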

**Test Strategy**:

- Verify timeout accuracy (±100ms tolerance)
- Verify cancellation works mid-timeout
- Verify no memory leaks during long waits

---

### Edge Case 2: Message Received Multiple Times


**Scenario**: Message delivered, consumer crashes before completing, message redelivered to different consumer.

**Expected Behavior**:

- `delivery_count` increments on each delivery
- Each consumer receives valid receipt handle
- Only one consumer can complete the message
- Other receipt handles become invalid after completion

**Implementation Considerations**:

- Idempotent message processing in applications recommended
- Library tracks delivery count accurately
- Completion with invalid receipt returns specific error
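
To make the receipt rules concrete, here is a small single-threaded model. It is only a sketch under assumptions: `ReceiptQueue`, `Delivery`, `release_locks`, and the local `QueueError` enum are hypothetical names for this example, and real providers track locks with timers rather than an explicit release call.

```rust
/// Local stand-in for the error named in this document.
#[derive(Debug, PartialEq)]
pub enum QueueError {
    InvalidReceipt,
}

#[derive(Debug, PartialEq)]
pub struct Delivery {
    pub body: String,
    pub receipt: u64,
    pub delivery_count: u32,
}

struct Stored {
    body: String,
    delivery_count: u32,
    /// Receipt of the consumer currently holding the lock, if any.
    active_receipt: Option<u64>,
}

/// Hypothetical queue model; not the crate's implementation.
#[derive(Default)]
pub struct ReceiptQueue {
    messages: Vec<Stored>,
    next_receipt: u64,
}

impl ReceiptQueue {
    pub fn send(&mut self, body: &str) {
        self.messages.push(Stored {
            body: body.to_string(),
            delivery_count: 0,
            active_receipt: None,
        });
    }

    /// Delivers the first unlocked message with a fresh receipt and an
    /// incremented delivery count.
    pub fn receive(&mut self) -> Option<Delivery> {
        let receipt = self.next_receipt + 1;
        let msg = self.messages.iter_mut().find(|m| m.active_receipt.is_none())?;
        msg.delivery_count += 1;
        msg.active_receipt = Some(receipt);
        let delivery = Delivery {
            body: msg.body.clone(),
            receipt,
            delivery_count: msg.delivery_count,
        };
        self.next_receipt = receipt;
        Some(delivery)
    }

    /// Simulates a consumer crash or lock expiry: every lock is dropped,
    /// which invalidates the receipts handed out so far.
    pub fn release_locks(&mut self) {
        for msg in &mut self.messages {
            msg.active_receipt = None;
        }
    }

    /// Only the receipt that currently holds the lock can complete.
    pub fn complete(&mut self, receipt: u64) -> Result<(), QueueError> {
        let pos = self
            .messages
            .iter()
            .position(|m| m.active_receipt == Some(receipt))
            .ok_or(QueueError::InvalidReceipt)?;
        self.messages.remove(pos);
        Ok(())
    }
}
```

In this model a stale receipt fails both after redelivery and after someone else has completed the message, which is the behavior an integration test would assert against each provider.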

**Test Strategy**:

- Simulate consumer crash scenarios
- Verify delivery count increments correctly
- Verify only first completion succeeds

---

### Edge Case 3: Message Lock Expires During Processing


**Scenario**: Consumer receives message, processing takes longer than lock duration, lock expires before completion attempted.

**Expected Behavior**:

- Message becomes visible to other consumers after lock expires
- Original consumer's `complete_message()` returns `InvalidReceipt` error
- Message `delivery_count` does not increment from lock expiration alone; it increments when the message is delivered again
- Application should handle `InvalidReceipt` gracefully

**Implementation Considerations**:

- Applications should monitor processing time relative to lock duration
- Consider extending lock if long processing expected (future feature)
- Set lock duration appropriately for workload
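
One way an application might monitor processing time against the lock is a small budget helper. The sketch below is an assumption-laden example: `LockBudget` and its methods are hypothetical, and the lock duration is whatever the application configured, not a value read from the library.

```rust
use std::time::{Duration, Instant};

/// Hypothetical helper that tracks how much of a message lock is left.
pub struct LockBudget {
    received_at: Instant,
    lock_duration: Duration,
    /// Time reserved for the completion call itself.
    safety_margin: Duration,
}

impl LockBudget {
    pub fn new(received_at: Instant, lock_duration: Duration, safety_margin: Duration) -> Self {
        Self {
            received_at,
            lock_duration,
            safety_margin,
        }
    }

    /// Processing time still available at `now`, never negative.
    pub fn remaining_at(&self, now: Instant) -> Duration {
        let usable = self.lock_duration.saturating_sub(self.safety_margin);
        let elapsed = now.saturating_duration_since(self.received_at);
        usable.saturating_sub(elapsed)
    }

    /// True once the handler should stop and expect `InvalidReceipt`
    /// if it tries to complete the message anyway.
    pub fn should_stop_at(&self, now: Instant) -> bool {
        self.remaining_at(now).is_zero()
    }
}
```

A handler could call `should_stop_at(Instant::now())` between work steps and abandon the message early instead of finishing work whose completion is likely to be rejected. Passing `now` explicitly keeps the helper easy to test without sleeping.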

**Test Strategy**:

- Introduce artificial delays in processing
- Verify lock expiration detected correctly
- Verify message redelivery after lock expiration

---

### Edge Case 4: Duplicate Message IDs


**Scenario**: Two messages sent with identical message IDs (possible with different sessions or timing).

**Expected Behavior**:

- Both messages accepted and delivered independently
- Each has unique receipt handle
- Applications responsible for deduplication if needed
- Library does not enforce message ID uniqueness

**Implementation Considerations**:

- Message IDs generated by sender (application or library)
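- Provider behavior varies: Azure Service Bus with duplicate detection enabled accepts the send but silently discards a repeated message ID within the detection window; AWS SQS standard queues do not deduplicate, and FIFO queues deduplicate on `MessageDeduplicationId` rather than on an application message ID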
- Applications needing strict deduplication should track processed IDs
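
Application-side tracking of processed IDs could look like the following bounded set. This is a sketch, not library functionality: `ProcessedIds` is a hypothetical name, the capacity is an arbitrary choice, and a production system would more likely persist IDs in a shared store.

```rust
use std::collections::{HashSet, VecDeque};

/// Hypothetical bounded record of recently processed message IDs.
pub struct ProcessedIds {
    seen: HashSet<String>,
    /// Insertion order, used to evict the oldest ID when full.
    order: VecDeque<String>,
    capacity: usize,
}

impl ProcessedIds {
    pub fn new(capacity: usize) -> Self {
        Self {
            seen: HashSet::new(),
            order: VecDeque::new(),
            // A capacity of zero would make the tracker useless.
            capacity: capacity.max(1),
        }
    }

    /// Returns `true` the first time an ID is seen, `false` for a duplicate
    /// that is still remembered.
    pub fn first_time(&mut self, id: &str) -> bool {
        if self.seen.contains(id) {
            return false;
        }
        if self.order.len() == self.capacity {
            // Forget the oldest ID to keep memory bounded.
            if let Some(oldest) = self.order.pop_front() {
                self.seen.remove(&oldest);
            }
        }
        self.seen.insert(id.to_string());
        self.order.push_back(id.to_string());
        true
    }
}
```

Because old IDs are evicted, this only catches duplicates that arrive close together; size the capacity to cover the longest redelivery gap you expect.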

**Test Strategy**:

- Send messages with duplicate IDs
- Verify both delivered (or appropriate error returned)
- Document provider-specific behavior differences

---

### Edge Case 5: Maximum Message Size Exceeded


**Scenario**: Application attempts to send message larger than provider limit.

**Expected Behavior**:

- `send_message()` returns `Err(QueueError::MessageTooLarge)`
- Error includes actual size and maximum allowed size
- No message sent to queue
- No partial sends or truncation

**Implementation Considerations**:

- Azure: 256 KB limit (Standard), 1 MB (Premium)
- AWS: 256 KB limit
- Library may pre-validate size before sending
- Consider compression for large payloads
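
Pre-validation before the provider call can be as simple as a length check. The sketch below assumes a struct-like `MessageTooLarge` variant with `size` and `max_size` fields; the crate's actual variant shape is not shown in this document, so treat the enum and `validate_size` as illustrative.

```rust
/// Local stand-in; field names are assumptions for this example.
#[derive(Debug, PartialEq)]
pub enum QueueError {
    MessageTooLarge { size: usize, max_size: usize },
}

/// 256 KB, the limit quoted above for Azure Standard and AWS.
pub const STANDARD_LIMIT_BYTES: usize = 256 * 1024;

/// Rejects an oversized body before anything is sent, so no partial
/// send or truncation can happen.
pub fn validate_size(body: &[u8], max_size: usize) -> Result<(), QueueError> {
    if body.len() > max_size {
        return Err(QueueError::MessageTooLarge {
            size: body.len(),
            max_size,
        });
    }
    Ok(())
}
```

Providers generally count headers and custom properties toward the limit as well, so a body-only check like this one is a lower bound; leave headroom or include attribute sizes in the calculation.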

**Test Strategy**:

- Send messages at and above size limits
- Verify appropriate error returned
- Verify no message fragments in queue

---

## Session Processing Edge Cases


### Edge Case 6: Session Lock Stolen by Another Consumer


**Scenario**: Consumer A holds session lock, lock expires, Consumer B acquires lock, Consumer A attempts operation.

**Expected Behavior**:

- Consumer A's operation returns `SessionLockLost` error
- Consumer B's operations succeed normally
- Messages not duplicated or lost
- Session state remains consistent

**Implementation Considerations**:

- Session lock has timeout (configurable, default 5 minutes)
- Lock not automatically renewed (future feature consideration)
- Applications must handle lock loss gracefully

**Test Strategy**:

- Simulate slow processing with lock timeout
- Verify lock transfer detected correctly
- Verify no message loss during transfer

---

### Edge Case 7: Session with No Messages


**Scenario**: Consumer calls `accept_session()` on queue with no messages in any session.

**Expected Behavior**:

- Operation times out based on configured timeout
- Returns `Ok(None)` indicating no session available
- Does not block indefinitely
- No error logged (this is normal)

**Implementation Considerations**:

- Azure: Returns no session after timeout
- AWS: May need to poll FIFO queues
- Applications should handle gracefully and retry

**Test Strategy**:

- Accept session on empty queue
- Verify timeout behavior
- Verify resources cleaned up

---

### Edge Case 8: Session with Many Messages


**Scenario**: Session contains thousands of messages (e.g., burst of GitHub events for single PR).

**Expected Behavior**:

- Consumer receives messages one at a time or in batches
- Session lock maintained throughout processing
- All messages processed before session completion
- No arbitrary limits on message count per session

**Implementation Considerations**:

- Applications should process session messages in loop
- May need to extend lock for very long sessions
- Consider session timeout for applications
- Monitor session processing duration

**Test Strategy**:

- Create session with 1000+ messages
- Verify all messages delivered
- Verify session lock maintained
- Measure processing time

---

### Edge Case 9: Concurrent Session Acceptance


**Scenario**: Multiple consumers attempt to accept same session simultaneously.

**Expected Behavior**:

- Only one consumer succeeds in accepting session
- Other consumers receive error or timeout
- No duplicate processing of session messages
- Fair distribution of sessions among consumers

**Implementation Considerations**:

- Provider ensures exclusive session locks
- Library propagates lock acquisition failures appropriately
- Applications should retry with different session or timeout
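
The exclusivity guarantee can be modeled in memory to show what a concurrency test would check. Everything here is hypothetical (`SessionLocks`, `try_accept`, `race_for_session`); with real providers the lock lives in the service, not in the client process.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical in-memory table of session ID to owning consumer.
#[derive(Default)]
pub struct SessionLocks {
    owners: Mutex<HashMap<String, String>>,
}

impl SessionLocks {
    /// Returns `true` only for the consumer that acquires the session.
    pub fn try_accept(&self, session_id: &str, consumer: &str) -> bool {
        let mut owners = self.owners.lock().unwrap();
        if owners.contains_key(session_id) {
            return false;
        }
        owners.insert(session_id.to_string(), consumer.to_string());
        true
    }

    /// Releases the session, but only if `consumer` is the current owner.
    pub fn release(&self, session_id: &str, consumer: &str) -> bool {
        let mut owners = self.owners.lock().unwrap();
        if owners.get(session_id).map(String::as_str) == Some(consumer) {
            owners.remove(session_id);
            true
        } else {
            false
        }
    }
}

/// Spawns `consumers` threads that race for one session and returns how
/// many of them won.
pub fn race_for_session(locks: Arc<SessionLocks>, session_id: &str, consumers: usize) -> usize {
    let handles: Vec<_> = (0..consumers)
        .map(|i| {
            let locks = Arc::clone(&locks);
            let session_id = session_id.to_string();
            thread::spawn(move || locks.try_accept(&session_id, &format!("consumer-{}", i)))
        })
        .collect();
    handles
        .into_iter()
        .map(|handle| handle.join().unwrap())
        .filter(|won| *won)
        .count()
}
```

The check and the insert happen under one mutex guard, which is what makes acceptance atomic in this model; splitting them across two lock acquisitions would reintroduce the race.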

**Test Strategy**:

- Launch multiple consumers simultaneously
- Verify only one acquires each session
- Verify all messages processed exactly once

---

## Connection and Network Edge Cases


### Edge Case 10: Network Partition During Send


**Scenario**: Network connection lost during `send_message()` operation.

**Expected Behavior**:

- Operation returns `Err(QueueError::ConnectionFailed)` or times out
- Unclear if message was sent (at-least-once semantics may result in duplicate)
- Application can retry send (may result in duplicate message)
- Connection automatically re-established on next operation

**Implementation Considerations**:

- Retries handled by provider SDKs
- Library timeout prevents indefinite hang
- Applications must handle potential duplicates

**Test Strategy**:

- Simulate network failures mid-operation
- Verify appropriate errors returned
- Verify connection recovery on subsequent operations

---

### Edge Case 11: Credential Expiration Mid-Session


**Scenario**: Using time-limited credentials (SAS token, temporary AWS credentials), credentials expire during processing.

**Expected Behavior**:

- Current operation may complete (if token valid when started)
- Subsequent operations return `AuthenticationFailed` error
- Library detects expired credentials and requests refresh (if possible)
- Connection re-established with new credentials

**Implementation Considerations**:

- Managed identities and IAM roles refresh automatically
- Connection strings and access keys do not expire
- Applications using temporary credentials must handle refresh
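
For applications that manage temporary credentials themselves, refreshing slightly before expiry avoids most `AuthenticationFailed` errors. The following is a sketch only: `TemporaryCredential`, `CredentialCache`, and the refresh callback are hypothetical and not part of the library, and a real refresh would call the identity provider.

```rust
use std::time::{Duration, Instant};

/// Hypothetical time-limited credential.
pub struct TemporaryCredential {
    pub token: String,
    pub expires_at: Instant,
}

/// Hypothetical cache that refreshes a credential shortly before it expires.
pub struct CredentialCache<F>
where
    F: FnMut(Instant) -> TemporaryCredential,
{
    current: TemporaryCredential,
    refresh: F,
    /// How long before expiry the credential is treated as stale.
    skew: Duration,
}

impl<F> CredentialCache<F>
where
    F: FnMut(Instant) -> TemporaryCredential,
{
    /// Fetches the first credential immediately.
    pub fn new(mut refresh: F, skew: Duration, now: Instant) -> Self {
        let current = refresh(now);
        Self {
            current,
            refresh,
            skew,
        }
    }

    /// Returns a token that is valid for at least `skew` after `now`,
    /// refreshing first when the current one is too close to expiry.
    pub fn token_at(&mut self, now: Instant) -> &str {
        if now + self.skew >= self.current.expires_at {
            self.current = (self.refresh)(now);
        }
        &self.current.token
    }
}
```

The skew exists because an operation started with a nearly expired token can still fail mid-flight; choose it to be longer than the slowest operation you expect.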

**Test Strategy**:

- Use credentials with short expiration
- Verify detection of expiration
- Verify automatic refresh where supported

---

### Edge Case 12: Provider Service Outage


**Scenario**: Azure Service Bus or AWS SQS experiences regional outage.

**Expected Behavior**:

- All operations return connection or timeout errors
- Circuit breaker pattern prevents retry storms
- Library logs errors but does not panic
- Applications can detect outage and fallback or alert

**Implementation Considerations**:

- Retry with exponential backoff
- Circuit breaker opens after N consecutive failures
- Applications should monitor provider status pages
- Consider cross-region failover (application responsibility)
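
As an illustration of the backoff mentioned above, a capped exponential delay can be computed as follows. The function name and parameters are assumptions for this sketch, and jitter is deliberately omitted to keep the arithmetic visible; real retry loops usually add it.

```rust
use std::time::Duration;

/// Delay before retry number `attempt` (0-based): `base * 2^attempt`,
/// capped at `max`. Overflow in either step falls back to the cap.
pub fn backoff_delay(attempt: u32, base: Duration, max: Duration) -> Duration {
    // 2^attempt overflows u32 from attempt 32 onward; saturate instead.
    let factor = 2u32.checked_pow(attempt).unwrap_or(u32::MAX);
    base.checked_mul(factor).unwrap_or(max).min(max)
}
```

With a 100 ms base and a 5 s cap, attempts 0 through 5 wait 100, 200, 400, 800, 1600, and 3200 ms, and every later attempt waits the full 5 s.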

**Test Strategy**:

- Simulate provider unavailability
- Verify circuit breaker behavior
- Verify no resource exhaustion from retries

---

## Error Handling Edge Cases


### Edge Case 13: Poison Message


**Scenario**: Message that consistently causes consumer to crash or throw error.

**Expected Behavior**:

- Message redelivered up to max attempts
- `delivery_count` increments on each attempt
- After max attempts, moved to DLQ automatically
- DLQ message includes error reason and context

**Implementation Considerations**:

- Max delivery count configurable (default 3-5)
- Applications should catch errors and abandon message
- Library moves to DLQ after max count exceeded
- DLQ should be monitored for poison messages
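
The decision between abandoning and dead-lettering reduces to a comparison against the configured maximum. This sketch assumes hypothetical names (`FailureAction`, `on_processing_failure`) and treats the message as exhausted once `delivery_count` reaches the maximum; whether the library uses "reaches" or "exceeds" is not specified here.

```rust
/// Hypothetical outcome of a failed processing attempt.
#[derive(Debug, PartialEq)]
pub enum FailureAction {
    /// Return the message to the queue for another attempt.
    Abandon,
    /// Move the message to the DLQ, keeping the failure context.
    DeadLetter { reason: String },
}

/// Chooses what to do after a handler error, based on how many times the
/// message has already been delivered.
pub fn on_processing_failure(
    delivery_count: u32,
    max_delivery_count: u32,
    error: &str,
) -> FailureAction {
    if delivery_count >= max_delivery_count {
        FailureAction::DeadLetter {
            reason: format!("failed after {} deliveries: {}", delivery_count, error),
        }
    } else {
        FailureAction::Abandon
    }
}
```

Carrying the last error text into the dead-letter reason is what lets an operator triage the DLQ without reproducing the failure.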

**Test Strategy**:

- Send message that triggers application error
- Verify redelivery with increasing count
- Verify automatic DLQ movement
- Verify error context preserved

---

### Edge Case 14: DLQ Full or Unavailable


**Scenario**: Attempting to dead letter message, but DLQ is full, deleted, or inaccessible.

**Expected Behavior**:

- Dead letter operation returns error
- Original message NOT lost (remains in main queue or returned to sender)
- Error logged with context
- Application alerted to DLQ issue

**Implementation Considerations**:

- Rare scenario (DLQs have same capacity as main queue)
- May indicate configuration error (wrong DLQ name)
- Provider may have fallback behavior

**Test Strategy**:

- Attempt dead letter to non-existent queue
- Verify error returned
- Verify message not lost

---

### Edge Case 15: Rapid Repeated Failures


**Scenario**: All messages failing rapidly (e.g., downstream database offline), causing continuous errors.

**Expected Behavior**:

- Circuit breaker opens after threshold failures
- Further operations fail fast without attempting provider call
- Circuit breaker permits trial calls again after the timeout (or a manual reset) and closes once they succeed
- Prevents overwhelming provider or downstream services

**Implementation Considerations**:

- Circuit breaker configured with failure threshold and timeout
- Failure categories (transient vs permanent) considered
- Applications should monitor circuit breaker state
- Consider exponential backoff when closed
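
A minimal circuit breaker with a failure threshold and an open timeout might be structured like this. It is a sketch of the general pattern, not the library's breaker: the type, its method names, and the half-open probing behavior are assumptions, and transient versus permanent failure classification is left out.

```rust
use std::time::{Duration, Instant};

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum State {
    Closed,
    Open,
    /// The timeout has elapsed; a trial call is allowed through.
    HalfOpen,
}

/// Hypothetical breaker driven by consecutive failures.
pub struct CircuitBreaker {
    failure_threshold: u32,
    open_timeout: Duration,
    consecutive_failures: u32,
    opened_at: Option<Instant>,
}

impl CircuitBreaker {
    pub fn new(failure_threshold: u32, open_timeout: Duration) -> Self {
        Self {
            failure_threshold,
            open_timeout,
            consecutive_failures: 0,
            opened_at: None,
        }
    }

    pub fn state_at(&self, now: Instant) -> State {
        match self.opened_at {
            None => State::Closed,
            Some(opened) if now.saturating_duration_since(opened) >= self.open_timeout => {
                State::HalfOpen
            }
            Some(_) => State::Open,
        }
    }

    /// While open, callers fail fast without contacting the provider.
    pub fn allow_at(&self, now: Instant) -> bool {
        self.state_at(now) != State::Open
    }

    /// Any success closes the breaker and clears the failure streak.
    pub fn record_success(&mut self) {
        self.consecutive_failures = 0;
        self.opened_at = None;
    }

    /// Opens the breaker when the threshold is reached, and re-opens it
    /// (restarting the timeout) when a half-open trial call fails.
    pub fn record_failure_at(&mut self, now: Instant) {
        self.consecutive_failures = self.consecutive_failures.saturating_add(1);
        if self.opened_at.is_some() || self.consecutive_failures >= self.failure_threshold {
            self.opened_at = Some(now);
        }
    }
}
```

Taking `now` as a parameter rather than reading the clock internally lets tests step through open, half-open, and closed states without sleeping.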

**Test Strategy**:

- Simulate continuous failures
- Verify circuit breaker opens
- Verify automatic recovery after timeout
- Measure impact on provider API calls

---

## Configuration Edge Cases


### Edge Case 16: Invalid Configuration


**Scenario**: Application provides invalid configuration (wrong queue name, malformed connection string).

**Expected Behavior**:

- Client creation fails during initialization
- Error message clearly indicates configuration problem
- No operations attempted with invalid config
- Application fails fast at startup (not during message processing)

**Implementation Considerations**:

- Validate configuration during client creation
- Test connection during initialization
- Provide detailed error messages for troubleshooting
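
Validation at client creation could start with purely local checks before any connection test. The rules below are illustrative assumptions, not the library's: `validate_config` and `ConfigError` are hypothetical, and the required keys follow the common Azure Service Bus shared-access-key connection string layout, which is only one of several valid forms.

```rust
/// Hypothetical configuration errors for this sketch.
#[derive(Debug, PartialEq)]
pub enum ConfigError {
    EmptyQueueName,
    MalformedSegment(String),
    MissingConnectionStringKey(&'static str),
}

/// Checks a queue name and a `Key=Value;Key=Value` connection string
/// without touching the network.
pub fn validate_config(queue_name: &str, connection_string: &str) -> Result<(), ConfigError> {
    if queue_name.trim().is_empty() {
        return Err(ConfigError::EmptyQueueName);
    }

    let mut keys = Vec::new();
    for segment in connection_string.split(';').filter(|s| !s.is_empty()) {
        // Split on the first '=' only: base64 key values may end in '='.
        match segment.split_once('=') {
            Some((key, value)) if !key.is_empty() && !value.is_empty() => keys.push(key),
            _ => return Err(ConfigError::MalformedSegment(segment.to_string())),
        }
    }

    // Assumed required keys for a shared-access-key connection string.
    for required in ["Endpoint", "SharedAccessKeyName", "SharedAccessKey"] {
        if !keys.contains(&required) {
            return Err(ConfigError::MissingConnectionStringKey(required));
        }
    }
    Ok(())
}
```

Each error names the specific problem, so a startup failure points directly at the offending setting. The malformed segment is echoed back, so avoid logging it verbatim if it might contain a secret.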

**Test Strategy**:

- Provide various invalid configurations
- Verify clear error messages
- Verify no undefined behavior

---

### Edge Case 17: Missing Permissions


**Scenario**: Credentials valid but lack required permissions (e.g., can send but not receive).

**Expected Behavior**:

- Operations succeed for allowed actions
- Operations fail with `AuthorizationFailed` for disallowed actions
- Error message indicates permission issue
- Applications can handle partial functionality

**Implementation Considerations**:

- Provider authorization happens per-operation
- Library propagates authorization errors clearly
- Applications should validate permissions at startup if possible

**Test Strategy**:

- Configure credentials with limited permissions
- Verify appropriate operations succeed/fail
- Verify error messages helpful

---

### Edge Case 18: Configuration Change During Runtime


**Scenario**: Application configuration updated while running (e.g., new connection string via secret rotation).

**Expected Behavior**:

- Depends on implementation: May require restart or support hot reload
- Current recommendation: Restart application after config change
- Future consideration: Support configuration refresh

**Implementation Considerations**:

- Current implementation does not support runtime config changes
- Applications should restart after credential rotation
- Future feature: Watch for config changes and recreate client

**Test Strategy**:

- Document current behavior (restart required)
- Consider future feature for hot reload

---

## Performance Edge Cases


### Edge Case 19: High Throughput Burst


**Scenario**: Sudden burst of messages (e.g., 10,000 messages in 1 second from GitHub webhook spike).

**Expected Behavior**:

- Messages accepted and queued by provider
- Consumers process at sustainable rate
- Queue depth grows temporarily
- No message loss or errors

**Implementation Considerations**:

- Provider handles burst (queues are buffer)
- Applications should scale consumers based on queue depth
- Consider rate limiting at webhook ingestion if needed

**Test Strategy**:

- Send large batch of messages rapidly
- Monitor queue depth and processing rate
- Verify no errors or throttling

---

### Edge Case 20: Long Message Processing Time


**Scenario**: Individual message takes 10+ minutes to process (e.g., long-running GitHub workflow).

**Expected Behavior**:

- Message lock maintained throughout processing (if lock extended)
- Message not redelivered to other consumers during processing
- Processing completes successfully
- If lock not extended, message may be redelivered (see Edge Case 3)

**Implementation Considerations**:

- Lock extension not currently implemented (future feature)
- Applications should either:
  - Process quickly (< lock duration)
  - Or accept possibility of redelivery and ensure idempotence
- Configure lock duration appropriately

**Test Strategy**:

- Introduce artificial processing delays
- Verify behavior relative to lock duration
- Document recommendations for long-running processing

---

## Testing Considerations


### Coverage Goals


- Each edge case should have at least one integration test
- Critical edge cases (poison messages, lock expiration) should have multiple test scenarios
- Tests should cover all supported providers (Azure, AWS, in-memory)

### Test Environment


- Use emulators/in-memory for fast iteration
- Use real services for contract tests
- Simulate failures via chaos engineering techniques

### Documentation


- Edge cases should be documented in API docs
- Examples provided for common patterns
- Guidance for applications on handling each scenario

---

## Unhandled Scenarios (Known Limitations)


### Currently Not Supported


1. **Lock Extension**: Cannot extend message lock during long processing
   - Workaround: Set lock duration longer than expected processing time
   - Future feature consideration

2. **Message Modification**: Cannot modify message content after send
   - Workaround: Send new message and dead letter old one
   - Not typically needed

3. **Cross-Region Failover**: No automatic failover between regions
   - Workaround: Application-level failover logic
   - Complex feature, rarely needed

4. **Message Prioritization**: No priority queue support
   - Workaround: Use separate queues for priority levels
   - Not supported by all providers

5. **Transaction Support**: No distributed transactions across messages
   - Workaround: Idempotent processing and eventual consistency
   - Provider support is uneven (Azure Service Bus offers limited transactions, AWS SQS offers none)

### Explicitly Out of Scope


1. **Message Routing**: Not a message broker, just queue abstraction
2. **Content Transformation**: Application responsibility
3. **Schema Validation**: Application responsibility
4. **Queue Provisioning**: Must be done via IaC or provider tools

---

## Recommendations for Applications


### Design for Failure


- Expect messages to be delivered multiple times (at-least-once semantics)
- Implement idempotent message handlers
- Handle all error types appropriately
- Use correlation IDs for tracing

### Handle Edge Cases Gracefully


- Set appropriate timeouts for your workload
- Monitor delivery counts and DLQ
- Test with failure injection
- Have runbooks for common failure scenarios

### Monitor and Alert


- Track message processing latency
- Alert on high DLQ rates
- Monitor queue depth trends
- Set up circuit breaker alerts

### Test Thoroughly


- Include edge case testing in CI/CD
- Perform chaos engineering in staging
- Load test with realistic scenarios
- Validate error handling paths