AWS Durable Execution SDK for Lambda Rust Runtime

This SDK enables Rust developers to build reliable, long-running workflows in AWS Lambda with automatic checkpointing, replay, and state management.

Overview

The AWS Durable Execution SDK provides a framework for building workflows that can survive Lambda function restarts, timeouts, and failures. It automatically checkpoints the state of your workflow, so that after an interruption it resumes from the last completed operation instead of starting over.

Key Features

Automatic Checkpointing: Every operation is automatically checkpointed, ensuring your workflow can resume from the last completed step.
Replay Mechanism: When a function resumes, completed operations return their checkpointed results instantly without re-execution.
Concurrent Operations: Process collections in parallel with configurable concurrency limits and failure tolerance.
External Integration: Wait for callbacks from external systems with configurable timeouts.
Type Safety: Full Rust type safety with generics, trait-based abstractions, and newtype wrappers for domain identifiers.
Promise Combinators: Coordinate multiple durable promises with all, any, race, and all_settled.
Replay-Safe Helpers: Generate deterministic UUIDs and timestamps that are safe for replay.
Configurable Checkpointing: Choose between eager, batched, or optimistic checkpointing modes.
Trait Aliases: Cleaner function signatures with [DurableValue] and [StepFn] trait aliases.
Sealed Traits: Internal traits are sealed to allow API evolution without breaking changes.
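
The checkpointing and replay features listed above can be pictured with a small in-memory model. This is only a sketch under simplifying assumptions (results are plain integers, steps are keyed by name, and the store lives in memory); Journal, step, and workflow are hypothetical names and not part of the SDK.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory stand-in for the checkpoint store.
#[derive(Default)]
struct Journal {
    // Results of completed steps, keyed by step name.
    completed: HashMap<String, i64>,
    // How many step bodies actually ran (kept only for illustration).
    executions: u32,
}

impl Journal {
    /// Run `body` only if `name` has no checkpoint yet; otherwise
    /// return the recorded result without re-executing.
    fn step<F>(&mut self, name: &str, body: F) -> i64
    where
        F: FnOnce() -> i64,
    {
        if let Some(&saved) = self.completed.get(name) {
            return saved; // replay: reuse the checkpointed result
        }
        let value = body();
        self.executions += 1;
        self.completed.insert(name.to_string(), value); // checkpoint
        value
    }
}

/// A two-step workflow. Calling it again with the same journal
/// models a resumed invocation.
fn workflow(journal: &mut Journal) -> i64 {
    let a = journal.step("load", || 40);
    let b = journal.step("adjust", || 2);
    a + b
}
```

Running workflow twice against the same Journal returns the same value both times, while the executions counter stays at 2: the second pass only reads checkpoints.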

Important Documentation

Before writing durable workflows, please read:

[docs::determinism]: Critical - Understanding determinism requirements for replay-safe workflows
[docs::limits]: Execution limits and constraints you need to know

Getting Started

Add the SDK to your Cargo.toml:

[dependencies]
durable-execution-sdk = "0.1.0-alpha1"
tokio = { version = "1.0", features = ["full"] }
serde = { version = "1.0", features = ["derive"] }

Basic Workflow Example

Here's a simple workflow that processes an order:

use durable_execution_sdk::{durable_execution, DurableContext, DurableError, Duration};
use serde::{Deserialize, Serialize};

#[derive(Deserialize)]
struct OrderEvent {
    order_id: String,
    amount: f64,
}

#[derive(Serialize)]
struct OrderResult {
    status: String,
    order_id: String,
}

#[durable_execution]
async fn process_order(event: OrderEvent, ctx: DurableContext) -> Result<OrderResult, DurableError> {
    // Step 1: Validate the order (checkpointed automatically)
    let is_valid: bool = ctx.step(|_step_ctx| {
        // Validation logic here
        Ok(true)
    }, None).await?;

    if !is_valid {
        return Err(DurableError::execution("Invalid order"));
    }

    // Step 2: Process payment (checkpointed automatically)
    let payment_id: String = ctx.step(|_step_ctx| {
        // Payment processing logic here
        Ok("pay_123".to_string())
    }, None).await?;

    // Step 3: Wait for payment confirmation (suspends Lambda, resumes later)
    ctx.wait(Duration::from_seconds(5), Some("payment_confirmation")).await?;

    // Step 4: Complete the order
    Ok(OrderResult {
        status: "completed".to_string(),
        order_id: event.order_id,
    })
}

Core Concepts

DurableContext

The [DurableContext] is the main interface for durable operations. It provides:

step: Execute and checkpoint a unit of work
wait: Pause execution for a specified duration
create_callback: Wait for external systems to signal completion
invoke: Call other durable Lambda functions
map: Process collections in parallel
parallel: Execute multiple operations concurrently
run_in_child_context: Create isolated nested workflows

Steps

Steps are the fundamental unit of work in durable executions. Each step is automatically checkpointed, allowing the workflow to resume from the last completed step after interruptions.

// Simple step
let result: i32 = ctx.step(|_| Ok(42), None).await?;

// Named step for better debugging
let result: String = ctx.step_named("fetch_data", |_| {
    Ok("data".to_string())
}, None).await?;

// Step with custom configuration
use durable_execution_sdk::{StepConfig, StepSemantics};

let config = StepConfig {
    step_semantics: StepSemantics::AtMostOncePerRetry,
    ..Default::default()
};
let result: i32 = ctx.step(|_| Ok(42), Some(config)).await?;

Step Semantics

The SDK supports two execution semantics for steps:

AtLeastOncePerRetry (default): Checkpoint after execution. If the function is interrupted before the checkpoint is written, the step runs again on resume, so the step body should be idempotent.
AtMostOncePerRetry: Checkpoint before execution. Guarantees the step executes at most once per retry, which suits non-idempotent operations.
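
The difference between the two semantics shows up when an interruption lands between the step body and the result checkpoint. The following is a minimal sketch of that case, assuming a simplified model with one checkpoint slot per step in which an interrupted at-most-once attempt is reported as unfinished rather than re-run. Checkpoint, Semantics, and attempt are hypothetical names, not SDK types.

```rust
/// Hypothetical checkpoint states for a single step.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Checkpoint {
    None,
    Started,   // written before the body runs (at-most-once only)
    Done(u32), // result recorded
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum Semantics {
    AtLeastOnce,
    AtMostOnce,
}

/// One attempt at running the step. `crash_after_body` models an
/// interruption after the side effect but before the result is saved.
/// Returns the result if this attempt produced or replayed one.
fn attempt(
    semantics: Semantics,
    checkpoint: &mut Checkpoint,
    side_effects: &mut u32,
    crash_after_body: bool,
) -> Option<u32> {
    match *checkpoint {
        Checkpoint::Done(value) => return Some(value), // replay
        Checkpoint::Started => return None,            // never re-run the body
        Checkpoint::None => {}
    }
    if semantics == Semantics::AtMostOnce {
        *checkpoint = Checkpoint::Started; // checkpoint before execution
    }
    *side_effects += 1; // the step body
    if crash_after_body {
        return None; // interrupted before the result checkpoint
    }
    *checkpoint = Checkpoint::Done(7); // checkpoint after execution
    Some(7)
}
```

With AtLeastOnce, a crashed attempt followed by a clean one leaves the side-effect counter at 2. With AtMostOnce, the counter stays at 1 and the second attempt returns None, leaving it to the caller's retry policy to decide what happens next.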

Wait Operations

Wait operations suspend the Lambda execution and resume after the specified duration. This is efficient because the function is not running, and so not occupying Lambda resources, while it waits.

use durable_execution_sdk::Duration;

// Wait for 5 seconds
ctx.wait(Duration::from_seconds(5), None).await?;

// Wait for 1 hour with a name
ctx.wait(Duration::from_hours(1), Some("wait_for_approval")).await?;

Callbacks

Callbacks allow external systems to signal your workflow. Create a callback, share the callback ID with an external system, and wait for the result.

use durable_execution_sdk::CallbackConfig;

// Create a callback with 24-hour timeout
let callback = ctx.create_callback::<ApprovalResponse>(Some(CallbackConfig {
    timeout: Duration::from_hours(24),
    ..Default::default()
})).await?;

// Share callback.callback_id with external system
notify_approver(&callback.callback_id).await?;

// Wait for the callback result (suspends until callback is received)
let approval = callback.result().await?;

Parallel Processing

Process collections in parallel with configurable concurrency and failure tolerance:

use durable_execution_sdk::{MapConfig, CompletionConfig};

// Process items with max 5 concurrent executions
let results = ctx.map(
    vec![1, 2, 3, 4, 5],
    |child_ctx, item, index| async move {
        child_ctx.step(|_| Ok(item * 2), None).await
    },
    Some(MapConfig {
        max_concurrency: Some(5),
        completion_config: CompletionConfig::all_successful(),
        ..Default::default()
    }),
).await?;

// Get all successful results
let values = results.get_results()?;
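
To illustrate what a concurrency limit does, here is a sketch using only standard-library threads. It assumes a simple batch-at-a-time strategy (each batch of at most max_concurrency items finishes before the next starts), which is not necessarily how the SDK schedules work. bounded_map is a hypothetical helper.

```rust
use std::thread;

/// Apply `f` to every item with at most `max_concurrency` worker
/// threads alive at once, preserving input order in the output.
fn bounded_map<T, R, F>(items: Vec<T>, max_concurrency: usize, f: F) -> Vec<R>
where
    T: Send,
    R: Send,
    F: Fn(T) -> R + Sync,
{
    let limit = max_concurrency.max(1); // treat 0 as 1
    let mut results = Vec::with_capacity(items.len());
    let mut pending = items.into_iter().peekable();
    while pending.peek().is_some() {
        // Take the next batch of at most `limit` items.
        let batch: Vec<T> = pending.by_ref().take(limit).collect();
        let f = &f;
        let batch_results: Vec<R> = thread::scope(|scope| {
            // Spawn one scoped thread per item in the batch.
            let handles: Vec<_> = batch
                .into_iter()
                .map(|item| scope.spawn(move || f(item)))
                .collect();
            // Join in spawn order so results keep the input order.
            handles
                .into_iter()
                .map(|handle| handle.join().expect("worker panicked"))
                .collect()
        });
        results.extend(batch_results);
    }
    results
}
```

For example, bounded_map(vec![1, 2, 3, 4, 5], 2, |x| x * 2) runs three batches (2, 2, and 1 items) and returns the doubled values in input order.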

Parallel Branches

Execute multiple independent operations concurrently:

use durable_execution_sdk::ParallelConfig;

let results = ctx.parallel(
    vec![
        |ctx| Box::pin(async move { ctx.step(|_| Ok("a"), None).await }),
        |ctx| Box::pin(async move { ctx.step(|_| Ok("b"), None).await }),
        |ctx| Box::pin(async move { ctx.step(|_| Ok("c"), None).await }),
    ],
    None,
).await?;

Promise Combinators

The SDK provides promise combinators for coordinating multiple durable operations:

use durable_execution_sdk::{DurableContext, DurableError};

async fn coordinate_operations(ctx: &DurableContext) -> Result<(), DurableError> {
    // Wait for ALL operations to complete successfully
    let results = ctx.all(vec![
        ctx.step(|_| Ok(1), None),
        ctx.step(|_| Ok(2), None),
        ctx.step(|_| Ok(3), None),
    ]).await?;
    // results = [1, 2, 3]

    // Wait for ALL operations to settle (success or failure)
    let batch_result = ctx.all_settled(vec![
        ctx.step(|_| Ok("success"), None),
        ctx.step(|_| Err("failure".into()), None),
    ]).await;
    // batch_result contains both success and failure outcomes

    // Return the FIRST operation to settle (success or failure)
    let first = ctx.race(vec![
        ctx.step(|_| Ok("fast"), None),
        ctx.step(|_| Ok("slow"), None),
    ]).await?;

    // Return the FIRST operation to succeed
    let first_success = ctx.any(vec![
        ctx.step(|_| Err("fail".into()), None),
        ctx.step(|_| Ok("success"), None),
    ]).await?;

    Ok(())
}
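
The four combinators differ only in how they fold a set of outcomes. The sketch below states those rules as plain functions over outcomes that have already settled, assuming the slice is given in settle order. The Outcome alias and the free functions are illustrative and do not mirror the SDK's signatures.

```rust
/// Outcomes are listed in the order they settled (an assumption of
/// this sketch; real operations settle asynchronously).
type Outcome = Result<i32, String>;

/// all: every value if all succeeded, otherwise the first error.
fn all(outcomes: &[Outcome]) -> Result<Vec<i32>, String> {
    outcomes.iter().cloned().collect()
}

/// all_settled: never fails; splits successes from failures.
fn all_settled(outcomes: &[Outcome]) -> (Vec<i32>, Vec<String>) {
    let mut succeeded = Vec::new();
    let mut failed = Vec::new();
    for outcome in outcomes {
        match outcome {
            Ok(value) => succeeded.push(*value),
            Err(error) => failed.push(error.clone()),
        }
    }
    (succeeded, failed)
}

/// race: the first outcome to settle, success or failure.
fn race(outcomes: &[Outcome]) -> Option<Outcome> {
    outcomes.first().cloned()
}

/// any: the first success; if there is none, every error.
fn any(outcomes: &[Outcome]) -> Result<i32, Vec<String>> {
    let mut errors = Vec::new();
    for outcome in outcomes {
        match outcome {
            Ok(value) => return Ok(*value),
            Err(error) => errors.push(error.clone()),
        }
    }
    Err(errors)
}
```

Given the settle order [Err("fail"), Ok(1), Ok(2)], all and race both report the failure, any returns 1, and all_settled returns both groups.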

Accessing Original Input

Access the original input that started the execution:

use durable_execution_sdk::{DurableContext, DurableError};
use serde::Deserialize;

#[derive(Deserialize)]
struct OrderEvent {
    order_id: String,
    amount: f64,
}

async fn my_workflow(ctx: DurableContext) -> Result<(), DurableError> {
    // Get the original input that started this execution
    let event: OrderEvent = ctx.get_original_input()?;
    println!("Processing order: {}", event.order_id);
    
    // Or get the raw JSON string
    if let Some(raw_input) = ctx.get_original_input_raw() {
        println!("Raw input: {}", raw_input);
    }
    
    Ok(())
}

Replay-Safe Helpers

Generate deterministic values that are safe for replay:

use durable_execution_sdk::replay_safe::{
    uuid_from_operation, uuid_to_string, uuid_string_from_operation,
};

// Generate a deterministic UUID from an operation ID
let operation_id = "my-operation-123";
let uuid_bytes = uuid_from_operation(operation_id, 0);
let uuid_string = uuid_to_string(&uuid_bytes);

// Or use the convenience function
let uuid = uuid_string_from_operation(operation_id, 0);

// Same inputs always produce the same UUID
let uuid2 = uuid_string_from_operation(operation_id, 0);
assert_eq!(uuid, uuid2);

// Different seeds produce different UUIDs
let uuid3 = uuid_string_from_operation(operation_id, 1);
assert_ne!(uuid, uuid3);
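
To show why such identifiers are replay-safe, the sketch below derives 16 bytes from an operation ID and a seed with a fixed hash (64-bit FNV-1a), so the output depends only on the inputs. This illustrates the idea and is not the SDK's algorithm; the result is UUID-shaped but does not set UUID version or variant bits. fnv1a, id_bytes, and id_string are hypothetical names.

```rust
const FNV_PRIME: u64 = 0x0000_0100_0000_01b3;
const FNV_OFFSET: u64 = 0xcbf2_9ce4_8422_2325;

/// 64-bit FNV-1a over `bytes`, continuing from `state`.
fn fnv1a(mut state: u64, bytes: &[u8]) -> u64 {
    for byte in bytes {
        state ^= u64::from(*byte);
        state = state.wrapping_mul(FNV_PRIME);
    }
    state
}

/// Derive 16 bytes from an operation ID and a seed. The same inputs
/// always give the same bytes, on any machine and on every replay.
fn id_bytes(operation_id: &str, seed: u64) -> [u8; 16] {
    let id_hash = fnv1a(FNV_OFFSET, operation_id.as_bytes());
    let high = fnv1a(id_hash, &seed.to_le_bytes());
    // Second half: chain from the first half with a fixed suffix.
    let low = fnv1a(high, b"low");
    let mut out = [0u8; 16];
    out[..8].copy_from_slice(&high.to_be_bytes());
    out[8..].copy_from_slice(&low.to_be_bytes());
    out
}

/// Format 16 bytes in the usual 8-4-4-4-12 hexadecimal layout.
fn id_string(bytes: &[u8; 16]) -> String {
    let hex: String = bytes.iter().map(|b| format!("{:02x}", b)).collect();
    format!(
        "{}-{}-{}-{}-{}",
        &hex[0..8],
        &hex[8..12],
        &hex[12..16],
        &hex[16..20],
        &hex[20..32]
    )
}
```

The sketch implements its own hash instead of using std's DefaultHasher because that hasher's algorithm is not guaranteed to stay the same across Rust releases, which would break the "same inputs, same output" property over time.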

For timestamps, use the execution start time instead of current time:

use durable_execution_sdk::{DurableContext, DurableError};
use durable_execution_sdk::replay_safe::{
    timestamp_from_execution, timestamp_seconds_from_execution,
};

async fn my_workflow(ctx: DurableContext) -> Result<(), DurableError> {
    // Get replay-safe timestamp (milliseconds since epoch)
    if let Some(timestamp_ms) = timestamp_from_execution(ctx.state()) {
        println!("Execution started at: {} ms", timestamp_ms);
    }
    
    // Or get seconds since epoch
    if let Some(timestamp_secs) = timestamp_seconds_from_execution(ctx.state()) {
        println!("Execution started at: {} seconds", timestamp_secs);
    }
    
    Ok(())
}

Important: See [docs::determinism] for detailed guidance on writing replay-safe code.

Wait Cancellation

Cancel an active wait operation:

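use durable_execution_sdk::{DurableContext, DurableError};

async fn cancellable_workflow(ctx: DurableContext) -> Result<(), DurableError> {
    // Obtain the operation ID of the wait to cancel
    // (starting the wait itself is omitted from this snippet)
    let wait_op_id = ctx.next_operation_id();
    
    // From another branch, cancel that wait by its operation ID
    ctx.cancel_wait(&wait_op_id).await?;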
    
    Ok(())
}

Extended Duration Support

The Duration type supports extended time periods:

use durable_execution_sdk::Duration;

// Standard durations
let seconds = Duration::from_seconds(30);
let minutes = Duration::from_minutes(5);
let hours = Duration::from_hours(2);
let days = Duration::from_days(7);

// Extended durations
let weeks = Duration::from_weeks(2);      // 14 days
let months = Duration::from_months(3);    // 90 days (30 days per month)
let years = Duration::from_years(1);      // 365 days

assert_eq!(weeks.to_seconds(), 14 * 24 * 60 * 60);
assert_eq!(months.to_seconds(), 90 * 24 * 60 * 60);
assert_eq!(years.to_seconds(), 365 * 24 * 60 * 60);

Type-Safe Identifiers (Newtypes)

The SDK provides newtype wrappers for domain identifiers to prevent accidental mixing of different ID types at compile time. These types are available in the [types] module and re-exported at the crate root.

Available Newtypes

[OperationId]: Unique identifier for an operation within a durable execution
[ExecutionArn]: Amazon Resource Name identifying a durable execution
[CallbackId]: Unique identifier for a callback operation

Creating Newtypes

use durable_execution_sdk::{OperationId, ExecutionArn, CallbackId};

// From String or &str (no validation, for backward compatibility)
let op_id = OperationId::from("op-123");
let op_id2: OperationId = "op-456".into();

// With validation (rejects empty strings)
let op_id3 = OperationId::new("op-789").unwrap();
assert!(OperationId::new("").is_err());

// ExecutionArn validates ARN format
let arn = ExecutionArn::new("arn:aws:lambda:us-east-1:123456789012:function:my-func:durable:abc123");
assert!(arn.is_ok());

// CallbackId for external system integration
let callback_id = CallbackId::from("callback-xyz");

Using Newtypes as Strings

All newtypes implement Deref<Target=str> and AsRef<str> for convenient string access:

use durable_execution_sdk::OperationId;

let op_id = OperationId::from("op-123");

// Use string methods directly via Deref
assert!(op_id.starts_with("op-"));
assert_eq!(op_id.len(), 6);

// Use as &str via AsRef
let s: &str = op_id.as_ref();
assert_eq!(s, "op-123");
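
For readers new to the pattern, this is roughly how a string newtype with these properties can be written using only the standard library. OpId, CbId, and describe are stand-in names for illustration; the SDK's own definitions may differ.

```rust
use std::ops::Deref;

/// Hypothetical newtypes mirroring the pattern described above.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct OpId(String);

#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct CbId(String);

impl OpId {
    /// Validating constructor: rejects empty strings.
    fn new(value: &str) -> Result<Self, String> {
        if value.is_empty() {
            Err("operation id must not be empty".to_string())
        } else {
            Ok(OpId(value.to_string()))
        }
    }
}

/// Non-validating conversion from a string slice.
impl From<&str> for OpId {
    fn from(value: &str) -> Self {
        OpId(value.to_string())
    }
}

/// Deref to `str` makes string methods available on the newtype.
impl Deref for OpId {
    type Target = str;
    fn deref(&self) -> &str {
        &self.0
    }
}

impl AsRef<str> for OpId {
    fn as_ref(&self) -> &str {
        &self.0
    }
}

/// Accepts only an operation ID; passing a `CbId` is a compile error.
fn describe(op: &OpId) -> String {
    format!("operation {}", op.as_ref())
}
```

Because OpId and CbId are distinct types, a call such as describe(&CbId("x".to_string())) is rejected by the compiler even though both wrap a String.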

Newtypes in Collections

All newtypes implement Hash and Eq, making them suitable for use in HashMap and HashSet:

use durable_execution_sdk::OperationId;
use std::collections::HashMap;

let mut results: HashMap<OperationId, String> = HashMap::new();
results.insert(OperationId::from("op-1"), "success".to_string());
results.insert(OperationId::from("op-2"), "pending".to_string());

assert_eq!(results.get(&OperationId::from("op-1")), Some(&"success".to_string()));

Serialization

All newtypes use #[serde(transparent)] for seamless JSON serialization:

use durable_execution_sdk::OperationId;

let op_id = OperationId::from("op-123");
let json = serde_json::to_string(&op_id).unwrap();
assert_eq!(json, "\"op-123\""); // Serializes as plain string

let restored: OperationId = serde_json::from_str(&json).unwrap();
assert_eq!(restored, op_id);

Trait Aliases

The SDK provides trait aliases to simplify common trait bound combinations. These are available in the [traits] module and re-exported at the crate root.

DurableValue

[DurableValue] is a trait alias for types that can be durably stored and retrieved:

use durable_execution_sdk::DurableValue;
use serde::{Deserialize, Serialize};

// DurableValue is equivalent to: Serialize + DeserializeOwned + Send

// Any type implementing these traits automatically implements DurableValue
#[derive(Debug, Clone, Serialize, Deserialize)]
struct OrderResult {
    order_id: String,
    status: String,
}

// Use in generic functions for cleaner signatures
fn process_result<T: DurableValue>(result: T) -> String {
    serde_json::to_string(&result).unwrap_or_default()
}
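
On stable Rust, a "trait alias" of this kind is usually built from an empty trait plus a blanket implementation. The sketch below shows the technique with standard-library bounds standing in for the serde ones. Storable and snapshot are hypothetical names, and whether the SDK defines DurableValue exactly this way is an assumption.

```rust
use std::fmt::Debug;

/// Stand-in alias: any type that is `Clone + Debug + Send`.
trait Storable: Clone + Debug + Send {}

/// Blanket impl: every type meeting the bounds gets the alias for free.
impl<T> Storable for T where T: Clone + Debug + Send {}

/// The single bound `Storable` replaces the three-part bound.
fn snapshot<T: Storable>(value: &T) -> (T, String) {
    (value.clone(), format!("{:?}", value))
}
```

No type ever implements Storable by hand; satisfying the underlying bounds is enough, which is why any type that derives the required traits can be used immediately.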

StepFn

[StepFn] is a trait alias for step function closures:

use durable_execution_sdk::{StepFn, DurableValue};
use durable_execution_sdk::handlers::StepContext;

// StepFn<T> is equivalent to:
// FnOnce(StepContext) -> Result<T, Box<dyn Error + Send + Sync>> + Send

// Use in generic functions
fn execute_step<T: DurableValue, F: StepFn<T>>(func: F) {
    // func can be called with a StepContext
}

// Works with closures
execute_step(|_ctx| Ok(42i32));

// Works with named functions
fn my_step(ctx: StepContext) -> Result<String, Box<dyn std::error::Error + Send + Sync>> {
    Ok(format!("Processed by {}", ctx.operation_id))
}
execute_step(my_step);

Sealed Traits and Factory Functions

Some SDK traits are "sealed" - they cannot be implemented outside this crate. This allows the SDK to evolve without breaking external code. Sealed traits include:

[Logger]: For structured logging in durable executions
[SerDes]: For custom serialization/deserialization
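
The sealing technique itself is a common Rust idiom: a public trait requires a supertrait that lives in a private module, and a factory function wraps user closures in a crate-owned type. The sketch below shows the idiom with a made-up Greeter trait. All names are hypothetical, and the SDK's internals may be organized differently.

```rust
mod sealed {
    /// Supertrait in a private module: other crates cannot name it,
    /// so they cannot implement any trait that requires it.
    pub trait Sealed {}
}

/// Public trait that downstream code can call but not implement.
pub trait Greeter: sealed::Sealed {
    fn greet(&self, name: &str) -> String;
}

/// Crate-owned adapter that turns a closure into a `Greeter`.
pub struct FnGreeter<F>(F);

impl<F> sealed::Sealed for FnGreeter<F> {}

impl<F> Greeter for FnGreeter<F>
where
    F: Fn(&str) -> String,
{
    fn greet(&self, name: &str) -> String {
        (self.0)(name)
    }
}

/// Factory function: the supported way to customise behaviour.
pub fn custom_greeter<F>(f: F) -> impl Greeter
where
    F: Fn(&str) -> String,
{
    FnGreeter(f)
}
```

Callers write custom_greeter(|name| format!("hello, {}", name)) and receive something that implements Greeter. The crate can later add methods to the trait without breaking them, since no outside impl can exist.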

Custom Loggers

Instead of implementing Logger directly, use the factory functions:

use durable_execution_sdk::{custom_logger, simple_custom_logger, LogInfo};

// Simple custom logger with a single function for all levels
let logger = simple_custom_logger(|level, msg, info| {
    println!("[{}] {}: {:?}", level, msg, info);
});

// Full custom logger with separate functions for each level
let logger = custom_logger(
    |msg, info| println!("[DEBUG] {}", msg),  // debug
    |msg, info| println!("[INFO] {}", msg),   // info
    |msg, info| println!("[WARN] {}", msg),   // warn
    |msg, info| println!("[ERROR] {}", msg),  // error
);

Custom Serializers

Instead of implementing SerDes directly, use the factory function:

use durable_execution_sdk::serdes::{custom_serdes, SerDesContext, SerDesError};

// Create a custom serializer for a specific type
let serdes = custom_serdes::<String, _, _>(
    |value, _ctx| Ok(format!("custom:{}", value)),  // serialize
    |data, _ctx| {                                   // deserialize
        data.strip_prefix("custom:")
            .map(|s| s.to_string())
            .ok_or_else(|| SerDesError::deserialization("Invalid format"))
    },
);

Configuration Types

The SDK provides type-safe configuration for all operations:

[StepConfig]: Configure retry strategy, execution semantics, and serialization
[CallbackConfig]: Configure timeout and heartbeat for callbacks
[InvokeConfig]: Configure timeout and serialization for function invocations
[MapConfig]: Configure concurrency, batching, and completion criteria for map operations
[ParallelConfig]: Configure concurrency and completion criteria for parallel operations
[CompletionConfig]: Define success/failure criteria for concurrent operations

Completion Configuration

Control when concurrent operations complete:

use durable_execution_sdk::CompletionConfig;

// Complete when first task succeeds
let first = CompletionConfig::first_successful();

// Wait for all tasks to complete (regardless of success/failure)
let all = CompletionConfig::all_completed();

// Require all tasks to succeed (zero failure tolerance)
let strict = CompletionConfig::all_successful();

// Custom: require at least 3 successes
let custom = CompletionConfig::with_min_successful(3);
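
One way to read these presets is as two optional thresholds checked after each task finishes. The sketch below encodes that reading. The Rule fields, the Decision enum, and the mapping of presets onto thresholds are assumptions made for illustration, not the SDK's actual representation.

```rust
/// Hypothetical completion rule; field names are illustrative.
#[derive(Debug, Clone, Copy)]
struct Rule {
    /// Finish as soon as this many tasks have succeeded, if set.
    min_successful: Option<usize>,
    /// Fail once more than this many tasks have failed, if set.
    tolerated_failures: Option<usize>,
}

#[derive(Debug, PartialEq)]
enum Decision {
    KeepGoing,
    Complete,
    Failed,
}

/// Decide whether a batch of `total` tasks is finished, given how
/// many tasks have succeeded and failed so far.
fn decide(rule: Rule, total: usize, succeeded: usize, failed: usize) -> Decision {
    if let Some(limit) = rule.tolerated_failures {
        if failed > limit {
            return Decision::Failed;
        }
    }
    if let Some(min) = rule.min_successful {
        if succeeded >= min {
            return Decision::Complete;
        }
        // Even if every remaining task succeeds, `min` is out of reach.
        if total.saturating_sub(failed) < min {
            return Decision::Failed;
        }
    }
    if succeeded + failed >= total {
        return Decision::Complete;
    }
    Decision::KeepGoing
}
```

Under this reading, first_successful corresponds to min_successful = Some(1), all_successful to tolerated_failures = Some(0), all_completed to both fields being None, and with_min_successful(3) to min_successful = Some(3).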

Error Handling

The SDK provides a comprehensive error hierarchy through [DurableError]:

Execution: Errors that return FAILED status without Lambda retry
Invocation: Errors that trigger Lambda retry
Checkpoint: Checkpoint failures (retriable or non-retriable)
Callback: Callback-specific failures
NonDeterministic: Replay mismatches (operation type changed between runs)
Validation: Invalid configuration or arguments
SerDes: Serialization/deserialization failures
Suspend: Signal to pause execution and return control to Lambda

use durable_execution_sdk::DurableError;

// Create specific error types
let exec_error = DurableError::execution("Something went wrong");
let validation_error = DurableError::validation("Invalid input");

// Check error properties
if exec_error.is_retriable() {
    // Handle retriable error
}

Custom Serialization

The SDK uses JSON serialization by default. A custom serializer supplies the two methods of the [SerDes] trait shown below. Because [SerDes] is sealed, code outside this crate cannot write this impl itself; pass the same two functions to custom_serdes instead (see Custom Serializers):

use durable_execution_sdk::serdes::{SerDes, SerDesContext, SerDesError};

// Placeholder type for illustration
#[derive(Debug)]
struct MyType {
    id: u32,
}

struct MyCustomSerDes;

impl SerDes<MyType> for MyCustomSerDes {
    fn serialize(&self, value: &MyType, _context: &SerDesContext) -> Result<String, SerDesError> {
        // Custom serialization logic
        Ok(format!("{:?}", value))
    }

    fn deserialize(&self, _data: &str, _context: &SerDesContext) -> Result<MyType, SerDesError> {
        // Custom deserialization logic
        todo!()
    }
}

Logging and Tracing

The SDK integrates with the tracing crate for structured logging. All operations automatically include execution context (ARN, operation ID, parent ID) in log messages.

For detailed guidance on configuring tracing for Lambda, log correlation, and best practices, see the TRACING.md documentation.

Simplified Logging API

The [DurableContext] provides convenience methods for logging with automatic context:

use durable_execution_sdk::{DurableContext, DurableError};

async fn my_workflow(ctx: DurableContext) -> Result<(), DurableError> {
    // Basic logging - context is automatically included
    ctx.log_info("Starting order processing");
    ctx.log_debug("Validating input parameters");
    ctx.log_warn("Retry attempt 2 of 5");
    ctx.log_error("Failed to process payment");
    
    // Logging with extra fields for filtering
    ctx.log_info_with("Processing order", &[
        ("order_id", "ORD-12345"),
        ("customer_id", "CUST-789"),
    ]);
    
    ctx.log_error_with("Payment failed", &[
        ("error_code", "INSUFFICIENT_FUNDS"),
        ("amount", "150.00"),
    ]);
    
    Ok(())
}

All logging methods automatically include:

durable_execution_arn: The execution ARN for correlation
parent_id: The parent operation ID (for nested operations)
is_replay: Whether the operation is being replayed

Extra Fields in Log Output

Extra fields passed to log_*_with methods are included in the tracing output as key-value pairs, making them queryable in log aggregation systems like CloudWatch:

// This log message...
ctx.log_info_with("Order event", &[
    ("event_type", "ORDER_CREATED"),
    ("order_id", "ORD-123"),
]);

// ...produces JSON output like:
// {
//   "message": "Order event",
//   "durable_execution_arn": "arn:aws:...",
//   "extra": "event_type=ORDER_CREATED, order_id=ORD-123",
//   ...
// }

Replay-Aware Logging

The SDK supports replay-aware logging that can suppress or filter logs during replay. This is useful to reduce noise when replaying previously executed operations.

use durable_execution_sdk::{TracingLogger, ReplayAwareLogger, ReplayLoggingConfig};
use std::sync::Arc;

// Suppress all logs during replay (default)
let logger = ReplayAwareLogger::suppress_replay(Arc::new(TracingLogger));

// Allow only errors during replay
let logger_errors = ReplayAwareLogger::new(
    Arc::new(TracingLogger),
    ReplayLoggingConfig::ErrorsOnly,
);

// Allow all logs during replay
let logger_all = ReplayAwareLogger::allow_all(Arc::new(TracingLogger));
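
The filtering rule behind these three modes is small enough to write out. The sketch below is a standalone model of it; Level, ReplayPolicy, and should_emit are hypothetical names, and the SDK's ReplayLoggingConfig may expose different variants.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Level {
    Debug,
    Info,
    Warn,
    Error,
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum ReplayPolicy {
    SuppressAll,
    ErrorsOnly,
    AllowAll,
}

/// Decide whether a log record should be written.
fn should_emit(policy: ReplayPolicy, level: Level, is_replay: bool) -> bool {
    if !is_replay {
        return true; // first execution: always log
    }
    match policy {
        ReplayPolicy::SuppressAll => false,
        ReplayPolicy::ErrorsOnly => level == Level::Error,
        ReplayPolicy::AllowAll => true,
    }
}
```

The policy only matters while is_replay is true. Once execution moves past the last checkpointed operation, every record is emitted again.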

Custom Logger

You can also provide a custom logger with the custom_logger and simple_custom_logger factory functions, as shown under Custom Loggers.

Duration Type

The SDK provides a [Duration] type with convenient constructors:

use durable_execution_sdk::Duration;

let five_seconds = Duration::from_seconds(5);
let two_minutes = Duration::from_minutes(2);
let one_hour = Duration::from_hours(1);
let one_day = Duration::from_days(1);

assert_eq!(five_seconds.to_seconds(), 5);
assert_eq!(two_minutes.to_seconds(), 120);
assert_eq!(one_hour.to_seconds(), 3600);
assert_eq!(one_day.to_seconds(), 86400);

Thread Safety

The SDK is designed for use in async Rust with Tokio. All core types are Send + Sync and can be safely shared across async tasks:

[DurableContext] uses Arc for shared state
[ExecutionState] uses RwLock and atomic operations for thread-safe access
Operation ID generation uses atomic counters
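
As an illustration of the last point, an atomic counter shared through an Arc hands out unique values without a lock. This is a generic standard-library sketch; IdGenerator and collect_ids are hypothetical and not the SDK's implementation.

```rust
use std::collections::HashSet;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

/// Hypothetical ID generator shared between threads.
#[derive(Default)]
struct IdGenerator {
    next: AtomicU64,
}

impl IdGenerator {
    /// Hand out the next ID. `fetch_add` is a single atomic
    /// read-modify-write, so no two callers see the same value.
    fn next_id(&self) -> u64 {
        self.next.fetch_add(1, Ordering::Relaxed)
    }
}

/// Draw `per_thread` IDs on each of `threads` threads and return
/// the set of distinct IDs observed.
fn collect_ids(threads: usize, per_thread: usize) -> HashSet<u64> {
    let generator = Arc::new(IdGenerator::default());
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let generator = Arc::clone(&generator);
            thread::spawn(move || {
                (0..per_thread)
                    .map(|_| generator.next_id())
                    .collect::<Vec<u64>>()
            })
        })
        .collect();
    handles
        .into_iter()
        .flat_map(|handle| handle.join().expect("worker panicked"))
        .collect()
}
```

Four threads drawing 100 IDs each produce exactly 400 distinct values. Which thread receives which value still depends on scheduling.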

Best Practices

Keep steps small and focused: Each step should do one thing well. This makes debugging easier and reduces the impact of failures.
Use named operations: Named steps and waits make logs and debugging much easier to understand.
Handle errors appropriately: Use DurableError::execution for errors that should fail the workflow, and DurableError::invocation for errors that should trigger a retry.
Consider idempotency: For operations that may be retried, ensure they are idempotent or use AtMostOncePerRetry semantics.
Use appropriate concurrency limits: When using map or parallel, set max_concurrency to avoid overwhelming downstream services.
Set reasonable timeouts: Always configure timeouts for callbacks and invocations to prevent workflows from hanging indefinitely.
Ensure determinism: Your workflow must execute the same sequence of operations on every run. Avoid using HashMap iteration, random numbers, or current time outside of steps. See [docs::determinism] for details.
Use replay-safe helpers: When you need UUIDs or timestamps, use the helpers in [replay_safe] to ensure consistent values across replays.
Use type-safe identifiers: Prefer [OperationId], [ExecutionArn], and [CallbackId] over raw strings to catch type mismatches at compile time.
Use trait aliases: Use [DurableValue] and [StepFn] in your generic functions for cleaner, more maintainable signatures.
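
The determinism practice above mentions HashMap iteration. A common remedy is to sort keys before deriving any sequence of operations from a map. The sketch below assumes a workflow that issues one operation per map entry; operation_order is a hypothetical helper.

```rust
use std::collections::{BTreeMap, HashMap};

/// Derive the order in which per-item operations would be issued.
/// Sorting the keys removes any dependence on `HashMap`'s
/// iteration order, which can differ from run to run.
fn operation_order(items: &HashMap<String, u32>) -> Vec<String> {
    // Collecting into a BTreeMap orders the entries by key.
    let sorted: BTreeMap<&String, &u32> = items.iter().collect();
    sorted.keys().map(|key| format!("process-{}", key)).collect()
}
```

Iterating the HashMap directly could yield a different order on the original run and on replay, so the operations issued during replay might no longer line up with the recorded ones.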

Result Type Aliases

The SDK provides semantic result type aliases for cleaner function signatures:

DurableResult<T>: Alias for Result<T, DurableError> - general durable operations
StepResult<T>: Alias for Result<T, DurableError> - step operation results
CheckpointResult<T>: Alias for Result<T, DurableError> - checkpoint operation results

use durable_execution_sdk::{DurableResult, StepResult, DurableError};

// Use in function signatures for clarity
fn process_order(order_id: &str) -> DurableResult<String> {
    Ok(format!("Processed: {}", order_id))
}

fn execute_step() -> StepResult<i32> {
    Ok(42)
}

Module Organization

[client]: Lambda service client for checkpoint operations
[concurrency]: Concurrent execution types (BatchResult, ConcurrentExecutor)
[config]: Configuration types for all operations
[context]: DurableContext, operation identifiers, and logging (includes factory functions [custom_logger] and [simple_custom_logger] for creating custom loggers)
[docs]: Documentation modules - determinism requirements and execution limits
- [docs::determinism]: Understanding determinism for replay-safe workflows
- [docs::limits]: Execution limits and constraints
[duration]: Duration type with convenient constructors
[error]: Error types, error handling, and result type aliases ([DurableResult], [StepResult], [CheckpointResult])
[handlers]: Operation handlers (step, wait, callback, etc.)
[lambda]: Lambda integration types (input/output)
[operation]: Operation types and status enums (optimized with #[repr(u8)] for compact memory layout)
[replay_safe]: Replay-safe helpers for deterministic UUIDs and timestamps
[serdes]: Serialization/deserialization system (includes [custom_serdes] factory function)
[state]: Execution state and checkpointing system
[traits]: Trait aliases for common bounds ([DurableValue], [StepFn])
[types]: Type-safe newtype wrappers for domain identifiers ([OperationId], [ExecutionArn], [CallbackId])

durable-execution-sdk 0.1.0-alpha1