AWS Durable Execution SDK for Lambda Rust Runtime
This SDK enables Rust developers to build reliable, long-running workflows in AWS Lambda with automatic checkpointing, replay, and state management.
Overview
The AWS Durable Execution SDK provides a framework for building workflows that can survive Lambda function restarts, timeouts, and failures. It automatically checkpoints the result of each completed operation, so after an interruption the workflow resumes from the last checkpoint instead of starting over.
Key Features
- Automatic Checkpointing: Every operation is automatically checkpointed, ensuring your workflow can resume from the last completed step.
- Replay Mechanism: When a function resumes, completed operations return their checkpointed results instantly without re-execution.
- Concurrent Operations: Process collections in parallel with configurable concurrency limits and failure tolerance.
- External Integration: Wait for callbacks from external systems with configurable timeouts.
- Type Safety: Full Rust type safety with generics, trait-based abstractions, and newtype wrappers for domain identifiers.
- Promise Combinators: Coordinate multiple durable promises with all, any, race, and all_settled.
- Replay-Safe Helpers: Generate deterministic UUIDs and timestamps that are safe for replay.
- Configurable Checkpointing: Choose between eager, batched, or optimistic checkpointing modes.
- Trait Aliases: Cleaner function signatures with [DurableValue] and [StepFn] trait aliases.
- Sealed Traits: Internal traits are sealed to allow API evolution without breaking changes.
Important Documentation
Before writing durable workflows, please read:
- [docs::determinism]: Critical - Understanding determinism requirements for replay-safe workflows
- [docs::limits]: Execution limits and constraints you need to know
Getting Started
Add the SDK to your Cargo.toml:
[]
= "0.1"
= { = "1.0", = ["full"] }
= { = "1.0", = ["derive"] }
Basic Workflow Example
Here's a simple workflow that processes an order:
use ;
use ;
async
Core Concepts
DurableContext
The [DurableContext] is the main interface for durable operations. It provides:
- step: Execute and checkpoint a unit of work
- wait: Pause execution for a specified duration
- create_callback: Wait for external systems to signal completion
- invoke: Call other durable Lambda functions
- map: Process collections in parallel
- parallel: Execute multiple operations concurrently
- run_in_child_context: Create isolated nested workflows
Steps
Steps are the fundamental unit of work in durable executions. Each step is automatically checkpointed, allowing the workflow to resume from the last completed step after interruptions.
// Simple step
let result: i32 = ctx.step.await?;
// Named step for better debugging
let result: String = ctx.step_named.await?;
// Step with custom configuration
use ;
let config = StepConfig ;
let result: i32 = ctx.step.await?;
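To make the checkpoint-and-replay idea concrete, here is a minimal, synchronous sketch that uses only the standard library. It is not the SDK's implementation: `Journal`, `SketchContext`, and `run_workflow` are hypothetical names, the store is an in-memory map rather than a durable service, and steps are identified simply by their position in the workflow.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory stand-in for the durable checkpoint store.
#[derive(Default)]
pub struct Journal {
    /// Completed step results, keyed by the step's position in the workflow.
    completed: HashMap<u64, String>,
}

/// Hypothetical workflow context that numbers steps in call order.
pub struct SketchContext<'a> {
    journal: &'a mut Journal,
    next_id: u64,
}

impl<'a> SketchContext<'a> {
    pub fn new(journal: &'a mut Journal) -> Self {
        Self { journal, next_id: 0 }
    }

    /// Runs `work` only if this step has no checkpoint yet; otherwise
    /// returns the recorded result without executing the closure.
    pub fn step<F>(&mut self, work: F) -> String
    where
        F: FnOnce() -> String,
    {
        let id = self.next_id;
        self.next_id += 1;
        if let Some(saved) = self.journal.completed.get(&id) {
            return saved.clone(); // replay path: the closure is skipped
        }
        let result = work();
        self.journal.completed.insert(id, result.clone()); // checkpoint
        result
    }
}

/// A two-step workflow that counts how many closures actually executed.
pub fn run_workflow(journal: &mut Journal, executions: &mut u32) -> String {
    let mut ctx = SketchContext::new(journal);
    let a = ctx.step(|| {
        *executions += 1;
        "reserved".to_string()
    });
    let b = ctx.step(|| {
        *executions += 1;
        "charged".to_string()
    });
    format!("{a}+{b}")
}
```

Running `run_workflow` twice against the same `Journal` yields the same output both times, while the execution counter stays at 2: the second run is served entirely from checkpoints.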
Step Semantics
The SDK supports two execution semantics for steps:
- AtLeastOncePerRetry (default): Checkpoint after execution. The step may execute multiple times if interrupted, but the result is always checkpointed.
- AtMostOncePerRetry: Checkpoint before execution. Guarantees the step executes at most once per retry, useful for non-idempotent operations.
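The difference between the two semantics comes down to where the checkpoint is written relative to the side effect. The sketch below simulates an interruption with a `crash` flag. All names (`Semantics`, `Record`, `StepOutcome`, `run_step`) are hypothetical, and how the real SDK surfaces an interrupted at-most-once attempt (for example by handing it to the retry strategy) is not modelled here.

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq, Debug)]
pub enum Semantics {
    AtLeastOncePerRetry,
    AtMostOncePerRetry,
}

/// What the hypothetical journal knows about a step.
#[derive(Clone, PartialEq, Debug)]
pub enum Record {
    Started,
    Completed(i32),
}

#[derive(Debug, PartialEq)]
pub enum StepOutcome {
    Done(i32),
    Interrupted,
    RefusedRerun,
}

/// Runs one step. `crash` simulates the function dying after the side
/// effect ran but before the result could be checkpointed.
pub fn run_step(
    journal: &mut HashMap<&'static str, Record>,
    name: &'static str,
    semantics: Semantics,
    side_effects: &mut u32,
    crash: bool,
) -> StepOutcome {
    match journal.get(name) {
        Some(Record::Completed(value)) => return StepOutcome::Done(*value),
        // A start marker without a result: an earlier attempt may already
        // have performed the side effect, so do not run it again.
        Some(Record::Started) => return StepOutcome::RefusedRerun,
        None => {}
    }
    if semantics == Semantics::AtMostOncePerRetry {
        journal.insert(name, Record::Started); // checkpoint BEFORE execution
    }
    *side_effects += 1; // the (possibly non-idempotent) work
    if crash {
        return StepOutcome::Interrupted;
    }
    journal.insert(name, Record::Completed(42)); // checkpoint AFTER execution
    StepOutcome::Done(42)
}
```

With `AtLeastOncePerRetry`, an interrupted step leaves no trace, so the next attempt runs the side effect a second time. With `AtMostOncePerRetry`, the start marker survives the interruption and the side-effect counter stays at 1.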
Wait Operations
Wait operations suspend the Lambda execution and resume it after the specified duration. This is efficient because the function is not kept running while it waits, so no Lambda resources are held for the duration of the wait.
use Duration;
// Wait for 5 seconds
ctx.wait.await?;
// Wait for 1 hour with a name
ctx.wait.await?;
Callbacks
Callbacks allow external systems to signal your workflow. Create a callback, share the callback ID with an external system, and wait for the result.
use CallbackConfig;
// Create a callback with 24-hour timeout
let callback = ctx..await?;
// Share callback.callback_id with external system
notify_approver.await?;
// Wait for the callback result (suspends until callback is received)
let approval = callback.result.await?;
Parallel Processing
Process collections in parallel with configurable concurrency and failure tolerance:
use ;
// Process items with max 5 concurrent executions
let results = ctx.map.await?;
// Get all successful results
let values = results.get_results?;
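As a rough illustration of what "max concurrent executions" and "failure tolerance" mean, the following standard-library sketch processes a slice in batches of at most `max_concurrency` scoped threads and keeps every outcome, successful or not. `bounded_map` is a hypothetical helper, not an SDK function. The SDK's map operation is durable and async; this sketch is neither, and fixed batches are a simplification of a real concurrency limiter.

```rust
use std::thread;

/// Applies `work` to every item, running at most `max_concurrency`
/// items at a time, and returns one outcome per item in input order.
pub fn bounded_map<T, R, F>(
    items: &[T],
    max_concurrency: usize,
    work: F,
) -> Vec<Result<R, String>>
where
    T: Sync,
    R: Send,
    F: Fn(&T) -> Result<R, String> + Sync,
{
    let work = &work; // share one reference across all worker threads
    let mut results = Vec::with_capacity(items.len());
    for batch in items.chunks(max_concurrency.max(1)) {
        thread::scope(|scope| {
            let handles: Vec<_> = batch
                .iter()
                .map(|item| scope.spawn(move || work(item)))
                .collect();
            for handle in handles {
                // A panicking worker is recorded as a failure instead of
                // aborting the whole batch.
                let outcome = handle
                    .join()
                    .unwrap_or_else(|_| Err("worker panicked".to_string()));
                results.push(outcome);
            }
        });
    }
    results
}
```

Because failures are collected rather than propagated, the caller can decide afterwards whether one failed item out of seven is acceptable, which is the role the completion criteria play in the SDK.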
Parallel Branches
Execute multiple independent operations concurrently:
use ParallelConfig;
let results = ctx.parallel.await?;
Promise Combinators
The SDK provides promise combinators for coordinating multiple durable operations:
use DurableContext;
async
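The four combinator names follow a familiar convention from other promise libraries. Assuming the SDK uses the conventional meanings (the text above only names the combinators), their semantics can be sketched over a list of outcomes given in the order the operations settled. These free functions are illustrative only and are not the SDK's API.

```rust
/// `all`: every value if all operations succeeded, otherwise the first error.
pub fn all<T: Clone>(settled: &[Result<T, String>]) -> Result<Vec<T>, String> {
    settled.iter().cloned().collect()
}

/// `any`: the first success, or every error if nothing succeeded.
pub fn any<T: Clone>(settled: &[Result<T, String>]) -> Result<T, Vec<String>> {
    let mut errors = Vec::new();
    for outcome in settled {
        match outcome {
            Ok(value) => return Ok(value.clone()),
            Err(error) => errors.push(error.clone()),
        }
    }
    Err(errors)
}

/// `race`: whichever operation settled first, success or failure.
/// Returns `None` when there were no operations at all.
pub fn race<T: Clone>(settled: &[Result<T, String>]) -> Option<Result<T, String>> {
    settled.first().cloned()
}

/// `all_settled`: every outcome, without short-circuiting on errors.
pub fn all_settled<T: Clone>(settled: &[Result<T, String>]) -> Vec<Result<T, String>> {
    settled.to_vec()
}
```

For outcomes that settled as error, success, success, `all` and `race` both report the error, `any` returns the first success, and `all_settled` returns all three outcomes.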
Accessing Original Input
Access the original input that started the execution:
use Deserialize;
async
Replay-Safe Helpers
Generate deterministic values that are safe for replay:
use ;
// Generate a deterministic UUID from an operation ID
let operation_id = "my-operation-123";
let uuid_bytes = uuid_from_operation;
let uuid_string = uuid_to_string;
// Or use the convenience function
let uuid = uuid_string_from_operation;
// Same inputs always produce the same UUID
let uuid2 = uuid_string_from_operation;
assert_eq!;
// Different seeds produce different UUIDs
let uuid3 = uuid_string_from_operation;
assert_ne!;
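One way to see why such UUIDs are replay-safe is to derive them purely from the inputs, with no randomness or clock involved. The sketch below does this with a small FNV-1a style hash. It is an assumption-laden illustration, not the algorithm behind `uuid_from_operation`: the function names, the `u64` seed parameter, and the hash choice are all hypothetical, and the output is UUID-shaped rather than a version-stamped RFC UUID.

```rust
/// FNV-1a style 64-bit hash: dependency-free and stable across runs.
/// The seed is mixed into the initial state (a deviation from plain FNV-1a).
fn fnv1a(seed: u64, data: &[u8]) -> u64 {
    let mut hash = 0xcbf29ce484222325u64 ^ seed;
    for byte in data {
        hash ^= u64::from(*byte);
        hash = hash.wrapping_mul(0x100000001b3);
    }
    hash
}

/// Derives 16 bytes from an operation ID and a caller-chosen seed.
pub fn sketch_uuid_bytes(operation_id: &str, seed: u64) -> [u8; 16] {
    let high = fnv1a(seed, operation_id.as_bytes());
    let low = fnv1a(high, operation_id.as_bytes());
    let mut bytes = [0u8; 16];
    bytes[..8].copy_from_slice(&high.to_be_bytes());
    bytes[8..].copy_from_slice(&low.to_be_bytes());
    bytes
}

/// Formats 16 bytes in the hyphenated 8-4-4-4-12 hexadecimal layout.
pub fn sketch_uuid_string(bytes: &[u8; 16]) -> String {
    let hex: String = bytes.iter().map(|b| format!("{b:02x}")).collect();
    format!(
        "{}-{}-{}-{}-{}",
        &hex[0..8],
        &hex[8..12],
        &hex[12..16],
        &hex[16..20],
        &hex[20..32]
    )
}
```

The same operation ID and seed always give the same string, so a value generated during the first run is regenerated identically during replay.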
For timestamps, use the execution start time instead of current time:
use ;
async
Important: See [docs::determinism] for detailed guidance on writing replay-safe code.
Wait Cancellation
Cancel an active wait operation:
async
Extended Duration Support
The Duration type supports extended time periods:
use Duration;
// Standard durations
let seconds = from_seconds;
let minutes = from_minutes;
let hours = from_hours;
let days = from_days;
// Extended durations
let weeks = from_weeks; // 14 days
let months = from_months; // 90 days (30 days per month)
let years = from_years; // 365 days
assert_eq!;
assert_eq!;
assert_eq!;
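The conversion rules stated in the comments above (7 days per week, 30 days per month, 365 days per year) can be written out as a small stand-in type. `SketchDuration` and its `as_seconds` accessor are hypothetical. Only the `from_*` constructor names mirror the ones listed above, and their real signatures may differ.

```rust
/// Hypothetical stand-in for the SDK's `Duration`, stored as whole seconds.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub struct SketchDuration {
    seconds: u64,
}

impl SketchDuration {
    pub const fn from_seconds(n: u64) -> Self {
        Self { seconds: n }
    }
    pub const fn from_minutes(n: u64) -> Self {
        Self::from_seconds(n * 60)
    }
    pub const fn from_hours(n: u64) -> Self {
        Self::from_minutes(n * 60)
    }
    pub const fn from_days(n: u64) -> Self {
        Self::from_hours(n * 24)
    }
    pub const fn from_weeks(n: u64) -> Self {
        Self::from_days(n * 7)
    }
    /// Months are a fixed 30 days, not calendar months.
    pub const fn from_months(n: u64) -> Self {
        Self::from_days(n * 30)
    }
    /// Years are a fixed 365 days; leap years are ignored.
    pub const fn from_years(n: u64) -> Self {
        Self::from_days(n * 365)
    }
    pub const fn as_seconds(self) -> u64 {
        self.seconds
    }
}
```

Under these rules two weeks equal 14 days and three months equal 90 days, which matches the comments in the example above. Note that "one month" here is never 28 or 31 days.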
Type-Safe Identifiers (Newtypes)
The SDK provides newtype wrappers for domain identifiers to prevent accidental
mixing of different ID types at compile time. These types are available in the
[types] module and re-exported at the crate root.
Available Newtypes
- [OperationId]: Unique identifier for an operation within a durable execution
- [ExecutionArn]: Amazon Resource Name identifying a durable execution
- [CallbackId]: Unique identifier for a callback operation
Creating Newtypes
use ;
// From String or &str (no validation, for backward compatibility)
let op_id = from;
let op_id2: OperationId = "op-456".into;
// With validation (rejects empty strings)
let op_id3 = new.unwrap;
assert!;
// ExecutionArn validates ARN format
let arn = new;
assert!;
// CallbackId for external system integration
let callback_id = from;
Using Newtypes as Strings
All newtypes implement Deref<Target=str> and AsRef<str> for convenient string access:
use OperationId;
let op_id = from;
// Use string methods directly via Deref
assert!;
assert_eq!;
// Use as &str via AsRef
let s: &str = op_id.as_ref;
assert_eq!;
Newtypes in Collections
All newtypes implement Hash and Eq, making them suitable for use in HashMap and HashSet:
use OperationId;
use HashMap;
let mut results: = new;
results.insert;
results.insert;
assert_eq!;
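For readers unfamiliar with the newtype pattern, here is a self-contained sketch of how a string wrapper can offer validation, `Deref<Target = str>`, `AsRef<str>`, and hashing at once. `SketchOperationId` and `record_result` are hypothetical. The SDK's own types may be implemented differently.

```rust
use std::collections::HashMap;
use std::ops::Deref;

/// Hypothetical newtype around `String`; deriving `Hash` and `Eq`
/// makes it usable as a `HashMap` key.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub struct SketchOperationId(String);

impl SketchOperationId {
    /// Validating constructor: rejects empty identifiers.
    pub fn new(id: &str) -> Result<Self, String> {
        if id.is_empty() {
            Err("operation id must not be empty".to_string())
        } else {
            Ok(Self(id.to_string()))
        }
    }
}

/// Non-validating conversion, mirroring the `from` style shown above.
impl From<&str> for SketchOperationId {
    fn from(id: &str) -> Self {
        Self(id.to_string())
    }
}

/// Lets callers use `str` methods such as `starts_with` directly.
impl Deref for SketchOperationId {
    type Target = str;
    fn deref(&self) -> &str {
        &self.0
    }
}

impl AsRef<str> for SketchOperationId {
    fn as_ref(&self) -> &str {
        &self.0
    }
}

/// Accepts only operation IDs; passing a raw `String` will not compile.
pub fn record_result(
    results: &mut HashMap<SketchOperationId, i32>,
    id: SketchOperationId,
    value: i32,
) {
    results.insert(id, value);
}
```

The compile-time benefit comes from `record_result`'s signature: a callback identifier wrapped in a different newtype, or a bare string, cannot be passed where an operation ID is expected.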
Serialization
All newtypes use #[serde(transparent)] for seamless JSON serialization:
use OperationId;
let op_id = from;
let json = to_string.unwrap;
assert_eq!; // Serializes as plain string
let restored: OperationId = from_str.unwrap;
assert_eq!;
Trait Aliases
The SDK provides trait aliases to simplify common trait bound combinations.
These are available in the [traits] module and re-exported at the crate root.
DurableValue
[DurableValue] is a trait alias for types that can be durably stored and retrieved:
use DurableValue;
use ;
// DurableValue is equivalent to: Serialize + DeserializeOwned + Send
// Any type implementing these traits automatically implements DurableValue
// Use in generic functions for cleaner signatures
StepFn
[StepFn] is a trait alias for step function closures:
use ;
use StepContext;
// StepFn<T> is equivalent to:
// FnOnce(StepContext) -> Result<T, Box<dyn Error + Send + Sync>> + Send
// Use in generic functions
// Works with closures
execute_step;
// Works with named functions
execute_step;
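Stable Rust has no built-in trait alias syntax, so aliases like these are usually emulated with an empty trait plus a blanket implementation. The sketch below shows that pattern under stated assumptions: `SketchValue` uses `Debug + Clone + Send` as a stand-in for the serde bounds to stay dependency-free, and `SketchStepContext`, `SketchStepFn`, `BoxError`, and this `execute_step` are hypothetical, not the SDK's definitions.

```rust
use std::error::Error;
use std::fmt::Debug;

pub type BoxError = Box<dyn Error + Send + Sync>;

/// Hypothetical stand-in for the context handed to a step closure.
pub struct SketchStepContext {
    pub attempt: u32,
}

/// "Alias" for a bundle of bounds: an empty trait whose blanket impl
/// covers every type that satisfies the bundle.
pub trait SketchValue: Debug + Clone + Send {}
impl<T: Debug + Clone + Send> SketchValue for T {}

/// "Alias" for the step closure shape described above.
pub trait SketchStepFn<T>:
    FnOnce(SketchStepContext) -> Result<T, BoxError> + Send
{
}
impl<T, F> SketchStepFn<T> for F where
    F: FnOnce(SketchStepContext) -> Result<T, BoxError> + Send
{
}

/// One short bound per parameter instead of the full list.
pub fn execute_step<T, F>(work: F) -> Result<T, BoxError>
where
    T: SketchValue,
    F: SketchStepFn<T>,
{
    work(SketchStepContext { attempt: 1 })
}

/// Named functions satisfy the alias just like closures do.
pub fn named_step(ctx: SketchStepContext) -> Result<u32, BoxError> {
    Ok(ctx.attempt)
}
```

Because the blanket impl applies automatically, callers never implement `SketchValue` or `SketchStepFn` by hand. Any closure or function with the right shape already qualifies.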
Sealed Traits and Factory Functions
Some SDK traits are "sealed" - they cannot be implemented outside this crate. This allows the SDK to evolve without breaking external code. Sealed traits include:
- [Logger]: For structured logging in durable executions
- [SerDes]: For custom serialization/deserialization
Custom Loggers
Instead of implementing Logger directly, use the factory functions:
use ;
// Simple custom logger with a single function for all levels
let logger = simple_custom_logger;
// Full custom logger with separate functions for each level
let logger = custom_logger;
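The sealed-trait-plus-factory arrangement can be sketched in a few lines. The pattern itself (a public trait with a supertrait that lives in a private module) is a common Rust idiom; everything specific here is hypothetical, including `SketchLogger`, `ClosureLogger`, `sketch_simple_logger`, and the `(level, message)` closure shape, which may not match the real `simple_custom_logger`.

```rust
mod sealed {
    /// Lives in a private module, so code outside the crate cannot
    /// name it and therefore cannot satisfy the supertrait bound.
    pub trait Sealed {}
}

/// Public to call, but impossible to implement downstream.
pub trait SketchLogger: sealed::Sealed {
    fn info(&self, message: &str);
}

/// Private adapter that turns a closure into a logger.
struct ClosureLogger<F>(F);

impl<F: Fn(&str, &str)> sealed::Sealed for ClosureLogger<F> {}

impl<F: Fn(&str, &str)> SketchLogger for ClosureLogger<F> {
    fn info(&self, message: &str) {
        (self.0)("INFO", message);
    }
}

/// Factory function: the supported way to obtain a custom logger.
/// The closure receives the level name and the message.
pub fn sketch_simple_logger<F>(log: F) -> impl SketchLogger
where
    F: Fn(&str, &str),
{
    ClosureLogger(log)
}
```

Since users only ever hold an `impl SketchLogger`, the crate can later add methods to the trait without breaking anyone, which is the API-evolution benefit described above.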
Custom Serializers
Instead of implementing SerDes directly, use the factory function:
use ;
// Create a custom serializer for a specific type
let serdes = ;
Configuration Types
The SDK provides type-safe configuration for all operations:
- [StepConfig]: Configure retry strategy, execution semantics, and serialization
- [CallbackConfig]: Configure timeout and heartbeat for callbacks
- [InvokeConfig]: Configure timeout and serialization for function invocations
- [MapConfig]: Configure concurrency, batching, and completion criteria for map operations
- [ParallelConfig]: Configure concurrency and completion criteria for parallel operations
- [CompletionConfig]: Define success/failure criteria for concurrent operations
Completion Configuration
Control when concurrent operations complete:
use CompletionConfig;
// Complete when first task succeeds
let first = first_successful;
// Wait for all tasks to complete (regardless of success/failure)
let all = all_completed;
// Require all tasks to succeed (zero failure tolerance)
let strict = all_successful;
// Custom: require at least 3 successes
let custom = with_min_successful;
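To show how criteria like these could be evaluated while tasks are still finishing, the sketch below models the four cases as an enum and decides, from running counts, whether a batch can stop early. `SketchCompletion`, `Verdict`, and `evaluate` are hypothetical, and the early-failure rules are assumptions rather than documented SDK behaviour.

```rust
/// Hypothetical mirror of the four criteria listed above.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum SketchCompletion {
    FirstSuccessful,
    AllCompleted,
    AllSuccessful,
    MinSuccessful(usize),
}

#[derive(Debug, PartialEq)]
pub enum Verdict {
    KeepGoing,
    Succeeded,
    Failed,
}

/// Decides whether a batch of `total` tasks can stop, given the counts so far.
pub fn evaluate(
    rule: SketchCompletion,
    total: usize,
    succeeded: usize,
    failed: usize,
) -> Verdict {
    let remaining = total.saturating_sub(succeeded + failed);
    match rule {
        SketchCompletion::FirstSuccessful => {
            if succeeded >= 1 {
                Verdict::Succeeded
            } else if remaining == 0 {
                Verdict::Failed
            } else {
                Verdict::KeepGoing
            }
        }
        SketchCompletion::AllCompleted => {
            if remaining == 0 {
                Verdict::Succeeded
            } else {
                Verdict::KeepGoing
            }
        }
        SketchCompletion::AllSuccessful => {
            if failed > 0 {
                Verdict::Failed // zero failure tolerance
            } else if remaining == 0 {
                Verdict::Succeeded
            } else {
                Verdict::KeepGoing
            }
        }
        SketchCompletion::MinSuccessful(needed) => {
            if succeeded >= needed {
                Verdict::Succeeded
            } else if succeeded + remaining < needed {
                Verdict::Failed // the target is no longer reachable
            } else {
                Verdict::KeepGoing
            }
        }
    }
}
```

For example, with five tasks and a minimum of three successes, one success and three failures is already a failure, because the single remaining task cannot close the gap.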
Error Handling
The SDK provides a comprehensive error hierarchy through [DurableError]:
- Execution: Errors that return FAILED status without Lambda retry
- Invocation: Errors that trigger Lambda retry
- Checkpoint: Checkpoint failures (retriable or non-retriable)
- Callback: Callback-specific failures
- NonDeterministic: Replay mismatches (operation type changed between runs)
- Validation: Invalid configuration or arguments
- SerDes: Serialization/deserialization failures
- Suspend: Signal to pause execution and return control to Lambda
use DurableError;
// Create specific error types
let exec_error = execution;
let validation_error = validation;
// Check error properties
if error.is_retriable
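The retry distinction between the error categories can be sketched with a plain enum. This is a simplified illustration covering four of the categories, not the definition of [DurableError]: `SketchError`, its variant payloads, and the treatment of validation errors as non-retriable are assumptions made for the example.

```rust
use std::fmt;

/// Hypothetical, trimmed-down error type for illustration.
#[derive(Debug)]
pub enum SketchError {
    /// Fails the workflow; retrying would not help.
    Execution(String),
    /// Asks the platform to invoke the function again.
    Invocation(String),
    /// Checkpoint failures carry their own retriable flag.
    Checkpoint { message: String, retriable: bool },
    /// Assumed non-retriable: bad input stays bad on retry.
    Validation(String),
}

impl SketchError {
    pub fn is_retriable(&self) -> bool {
        match self {
            SketchError::Invocation(_) => true,
            SketchError::Checkpoint { retriable, .. } => *retriable,
            SketchError::Execution(_) | SketchError::Validation(_) => false,
        }
    }
}

impl fmt::Display for SketchError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            SketchError::Execution(m) => write!(f, "execution failed: {m}"),
            SketchError::Invocation(m) => write!(f, "invocation failed: {m}"),
            SketchError::Checkpoint { message, .. } => {
                write!(f, "checkpoint failed: {message}")
            }
            SketchError::Validation(m) => write!(f, "invalid input: {m}"),
        }
    }
}

impl std::error::Error for SketchError {}
```

Keeping the retry decision in one `is_retriable` method means callers do not need to match on every variant to choose between failing the workflow and requesting another attempt.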
Custom Serialization
The SDK uses JSON serialization by default. Because the [SerDes] trait is sealed, you provide a custom
serializer through the [custom_serdes] factory function rather than by implementing the trait directly:
use ;
;
Logging and Tracing
The SDK integrates with the tracing crate for structured logging. All operations
automatically include execution context (ARN, operation ID, parent ID) in log messages.
For detailed guidance on configuring tracing for Lambda, log correlation, and best practices, see the TRACING.md documentation.
Simplified Logging API
The [DurableContext] provides convenience methods for logging with automatic context:
use ;
async
All logging methods automatically include:
- durable_execution_arn: The execution ARN for correlation
- parent_id: The parent operation ID (for nested operations)
- is_replay: Whether the operation is being replayed
Extra Fields in Log Output
Extra fields passed to log_*_with methods are included in the tracing output
as key-value pairs, making them queryable in log aggregation systems like CloudWatch:
// This log message...
ctx.log_info_with;
// ...produces JSON output like:
// {
// "message": "Order event",
// "durable_execution_arn": "arn:aws:...",
// "extra": "event_type=ORDER_CREATED, order_id=ORD-123",
// ...
// }
Replay-Aware Logging
The SDK supports replay-aware logging that can suppress or filter logs during replay. This is useful to reduce noise when replaying previously executed operations.
use ;
use Arc;
// Suppress all logs during replay (default)
let logger = suppress_replay;
// Allow only errors during replay
let logger_errors = new;
// Allow all logs during replay
let logger_all = allow_all;
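The three replay modes reduce to a small decision function. The sketch below captures that logic with hypothetical names (`Level`, `ReplayPolicy`, `should_emit`). It does not reproduce the SDK's logger types or how they wrap an underlying logger.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Level {
    Debug,
    Info,
    Warn,
    Error,
}

/// Hypothetical mirror of the three modes shown above.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ReplayPolicy {
    SuppressAll,
    ErrorsOnly,
    AllowAll,
}

/// Returns true when a log line should be written.
pub fn should_emit(policy: ReplayPolicy, level: Level, is_replay: bool) -> bool {
    if !is_replay {
        return true; // first execution: never filtered
    }
    match policy {
        ReplayPolicy::SuppressAll => false,
        ReplayPolicy::ErrorsOnly => level == Level::Error,
        ReplayPolicy::AllowAll => true,
    }
}
```

The policy only matters while `is_replay` is true. Logs from operations executing for the first time always pass through, so suppression removes duplicates rather than information.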
Custom Logger
You can also provide a custom logger using the factory functions:
use ;
// Simple custom logger with a single function for all levels
let logger = simple_custom_logger;
// Full custom logger with separate functions for each level
let logger = custom_logger;
Duration Type
The SDK provides a [Duration] type with convenient constructors:
use Duration;
let five_seconds = from_seconds;
let two_minutes = from_minutes;
let one_hour = from_hours;
let one_day = from_days;
assert_eq!;
assert_eq!;
assert_eq!;
assert_eq!;
Thread Safety
The SDK is designed for use in async Rust with Tokio. All core types are
Send + Sync and can be safely shared across async tasks:
- [DurableContext] uses Arc for shared state
- [ExecutionState] uses RwLock and atomic operations for thread-safe access
- Operation ID generation uses atomic counters
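As a small illustration of the "Arc plus atomic counter" approach, the sketch below hands out unique numeric IDs from several threads without a lock. `SketchIdGenerator` and `generate_concurrently` are hypothetical. The example demonstrates uniqueness only and says nothing about how the SDK keeps ID assignment deterministic across replays.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

/// Hypothetical lock-free ID source shared between threads.
#[derive(Default)]
pub struct SketchIdGenerator {
    next: AtomicU64,
}

impl SketchIdGenerator {
    /// Atomically reserves and returns the next ID.
    pub fn next_id(&self) -> u64 {
        self.next.fetch_add(1, Ordering::Relaxed)
    }
}

/// Spawns `workers` threads that each draw `per_worker` IDs, then
/// returns every ID drawn, sorted ascending.
pub fn generate_concurrently(workers: usize, per_worker: usize) -> Vec<u64> {
    let generator = Arc::new(SketchIdGenerator::default());
    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let generator = Arc::clone(&generator);
            thread::spawn(move || {
                (0..per_worker)
                    .map(|_| generator.next_id())
                    .collect::<Vec<u64>>()
            })
        })
        .collect();
    let mut ids: Vec<u64> = handles
        .into_iter()
        .flat_map(|handle| handle.join().expect("worker panicked"))
        .collect();
    ids.sort_unstable();
    ids
}
```

`fetch_add` guarantees that no two callers receive the same value, so four workers drawing 100 IDs each end up with exactly the numbers 0 through 399 between them.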
Best Practices
- Keep steps small and focused: Each step should do one thing well. This makes debugging easier and reduces the impact of failures.
- Use named operations: Named steps and waits make logs and debugging much easier to understand.
- Handle errors appropriately: Use DurableError::execution for errors that should fail the workflow, and DurableError::invocation for errors that should trigger a retry.
- Consider idempotency: For operations that may be retried, ensure they are idempotent or use AtMostOncePerRetry semantics.
- Use appropriate concurrency limits: When using map or parallel, set max_concurrency to avoid overwhelming downstream services.
- Set reasonable timeouts: Always configure timeouts for callbacks and invocations to prevent workflows from hanging indefinitely.
- Ensure determinism: Your workflow must execute the same sequence of operations on every run. Avoid using HashMap iteration, random numbers, or current time outside of steps. See [docs::determinism] for details.
- Use replay-safe helpers: When you need UUIDs or timestamps, use the helpers in [replay_safe] to ensure consistent values across replays.
- Use type-safe identifiers: Prefer [OperationId], [ExecutionArn], and [CallbackId] over raw strings to catch type mismatches at compile time.
- Use trait aliases: Use [DurableValue] and [StepFn] in your generic functions for cleaner, more maintainable signatures.
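The warning about HashMap iteration deserves a concrete example. Rust's default `HashMap` hasher is randomly seeded, so iteration order can differ between runs, and a workflow that issues one operation per entry could replay them in a different sequence. A common remedy, sketched here with a hypothetical `planned_steps` helper, is to impose a sorted order before deriving any operations.

```rust
use std::collections::{BTreeMap, HashMap};

/// Produces step names in a stable, sorted order regardless of how the
/// input `HashMap` happens to iterate on this particular run.
pub fn planned_steps(orders: &HashMap<String, u32>) -> Vec<String> {
    // Collecting into a BTreeMap sorts the entries by key.
    let sorted: BTreeMap<&String, &u32> = orders.iter().collect();
    sorted
        .into_iter()
        .map(|(id, quantity)| format!("ship-{id}-x{quantity}"))
        .collect()
}
```

Storing the data in a `BTreeMap` from the start, or sorting a `Vec` of keys, achieves the same effect. What matters is that the sequence of durable operations does not depend on hash ordering.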
Result Type Aliases
The SDK provides semantic result type aliases for cleaner function signatures:
- DurableResult<T>: Alias for Result<T, DurableError> - general durable operations
- StepResult<T>: Alias for Result<T, DurableError> - step operation results
- CheckpointResult<T>: Alias for Result<T, DurableError> - checkpoint operation results
use ;
// Use in function signatures for clarity
Module Organization
- [client]: Lambda service client for checkpoint operations
- [concurrency]: Concurrent execution types (BatchResult, ConcurrentExecutor)
- [config]: Configuration types for all operations
- [context]: DurableContext, operation identifiers, and logging (includes factory functions [custom_logger] and [simple_custom_logger] for creating custom loggers)
- [docs]: Documentation modules - determinism requirements and execution limits
  - [docs::determinism]: Understanding determinism for replay-safe workflows
  - [docs::limits]: Execution limits and constraints
- [duration]: Duration type with convenient constructors
- [error]: Error types, error handling, and result type aliases ([DurableResult], [StepResult], [CheckpointResult])
- [handlers]: Operation handlers (step, wait, callback, etc.)
- [lambda]: Lambda integration types (input/output)
- [operation]: Operation types and status enums (optimized with #[repr(u8)] for compact memory layout)
- [replay_safe]: Replay-safe helpers for deterministic UUIDs and timestamps
- [serdes]: Serialization/deserialization system (includes [custom_serdes] factory function)
- [state]: Execution state and checkpointing system
- [traits]: Trait aliases for common bounds ([DurableValue], [StepFn])
- [types]: Type-safe newtype wrappers for domain identifiers ([OperationId], [ExecutionArn], [CallbackId])