Module error_handling


§Error Handling

d-engine returns errors through two channels depending on your integration mode:

  • Standalone (gRPC): ErrorCode in response
  • Embedded (Rust): Result<T, E> from API

Both modes use the same error categories defined in proto/error.proto.
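Because each category under Error Categories below owns a fixed numeric range, a client can classify an unfamiliar code by range alone. The following is a minimal sketch, assuming codes are handled as plain integers; `ErrorCategory` and `categorize` are hypothetical names used for illustration and are not part of the d-engine API.

```rust
/// Hypothetical category type; the ranges mirror the section headings
/// (Network 1000-1999, Business 4000-4999, Watch 5000-5999).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ErrorCategory {
    Network,
    Business,
    Watch,
    /// Any code outside the ranges documented on this page.
    Other,
}

/// Maps a numeric error code to its category by range.
pub fn categorize(code: u32) -> ErrorCategory {
    match code {
        1000..=1999 => ErrorCategory::Network,
        4000..=4999 => ErrorCategory::Business,
        5000..=5999 => ErrorCategory::Watch,
        _ => ErrorCategory::Other,
    }
}
```

A range-based fallback like this lets a client choose a sensible default (for example, retry with backoff for anything in the network range) when it meets a code it does not recognize by name.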


§Error Categories

§Network Errors (1000-1999)

Connection problems. Retry with backoff.

| Code | When | Action |
| --- | --- | --- |
| CONNECTION_TIMEOUT | Network slow/unreachable | Retry after 3-5s |
| INVALID_ADDRESS | Malformed URL | Fix address |
| LEADER_CHANGED | Leader re-elected | Retry with new leader (see metadata) |

§Business Errors (4000-4999)

Cluster state or request issues. Handle each according to its specific code.

| Code | When | Action |
| --- | --- | --- |
| NOT_LEADER | Write sent to follower | Redirect to leader |
| CLUSTER_UNAVAILABLE | < 2/3 nodes available | Wait and retry |
| INVALID_REQUEST | Bad request format | Fix request |
| STALE_OPERATION | Based on old cluster state | Refresh state and retry |

§Watch Errors (5000-5999)

Watch stream lifecycle errors. Re-sync required.

| Code | When | Action |
| --- | --- | --- |
| WATCH_BUFFER_OVERFLOW | Per-watcher channel full, watcher canceled | Re-sync via Read API, then re-register watch |

WATCH_BUFFER_OVERFLOW arrives as a WatchResponse with event_type = WATCH_EVENT_TYPE_CANCELED. After receiving it, no further events will be delivered on that stream. Example handler:

while let Some(event) = stream.next().await {
    match event {
        Ok(ev) if ev.event_type == WatchEventType::Canceled as i32 => {
            // Watcher forcibly canceled by server
            warn!("Watch canceled (buffer overflow); re-syncing key {:?}", ev.key);
            let current = client.read(ev.key.clone()).await?;
            process_current_value(current);
            stream = client.watch(ev.key).await?; // re-register
        }
        Ok(ev) => handle_event(ev),
        Err(e) => return Err(e),
    }
}

§Handling in Standalone Mode

Use Leader Hint to redirect to the leader:

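currentAddr := "127.0.0.1:9081" // Start with any node

for i := 0; i < maxRedirects; i++ {
    conn, err := grpc.NewClient(currentAddr, ...)
    if err != nil {
        return err // Invalid target or dial options
    }
    client := pb.NewRaftClientServiceClient(conn)

    resp, err := client.HandleClientWrite(ctx, req)
    conn.Close() // Release the connection before returning or redirecting
    if err != nil {
        return err // Network error
    }

    if resp.Error == error_pb.ErrorCode_SUCCESS {
        return nil // Success
    }

    // Got NOT_LEADER - follow leader hint
    if resp.Error == error_pb.ErrorCode_NOT_LEADER && resp.Metadata != nil {
        leaderAddr := resp.Metadata.LeaderAddress
        if leaderAddr != nil && *leaderAddr != "" {
            currentAddr = *leaderAddr // Redirect
            continue
        }
    }

    // Any other error, or NOT_LEADER without a hint: stop instead of retrying blindly
    return fmt.Errorf("write failed: %v", resp.Error)
}
return fmt.Errorf("gave up after %d redirects", maxRedirects)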

See Quick Start for a complete working example.


§Handling in Embedded Mode

Check Result:

match client.put(key, value).await {
    Ok(_) => { /* success */ }
    Err(e) => {
        // e is ClientApiError
        match e.code() {
            ErrorCode::NotLeader => { /* handle */ }
            ErrorCode::ClusterUnavailable => { /* retry */ }
            _ => { /* other errors */ }
        }
    }
}

§Handling in Standalone Mode (Rust gRPC Client)

§Leader Failover with refresh()

When the leader fails, use client.refresh() to rediscover the cluster before retrying:

// After leader failover: refresh blocks until new leader is ready
client.refresh(None).await?;

// Now safe to retry — connections point to new leader
client.put(key, value).await?;

refresh() respects ClientConfig::cluster_ready_timeout (default: 5s). Pass Some(endpoints) to update the bootstrap list for this and future refreshes.


§Leader Hint

When you get a NOT_LEADER error, check metadata.LeaderAddress:

if resp.Error == error_pb.ErrorCode_NOT_LEADER && resp.Metadata != nil {
    leaderAddr := resp.Metadata.LeaderAddress  // e.g., "0.0.0.0:9082"
    leaderId := resp.Metadata.LeaderId          // e.g., "2"
    // Reconnect to leaderAddr
}

This allows an immediate redirect to the leader instead of probing every node in turn.
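The same hint-following logic can be sketched in Rust without any network code. This is a simplified illustration under stated assumptions: `WriteOutcome` and `write_with_redirects` are hypothetical names, and the `send` closure stands in for the real write RPC, so none of this reflects the actual d-engine client types.

```rust
/// Hypothetical, simplified stand-in for a write response.
#[derive(Debug, Clone, PartialEq)]
pub enum WriteOutcome {
    Success,
    /// The node is not the leader; it may or may not supply a hint.
    NotLeader { leader_address: Option<String> },
}

/// Follows leader hints starting from `start` and returns the address that
/// accepted the write. Makes at most `max_redirects + 1` attempts.
pub fn write_with_redirects<F>(
    start: &str,
    max_redirects: usize,
    mut send: F,
) -> Result<String, String>
where
    F: FnMut(&str) -> WriteOutcome,
{
    let mut current = start.to_string();
    for _ in 0..=max_redirects {
        match send(&current) {
            WriteOutcome::Success => return Ok(current),
            // A non-empty hint: redirect to the reported leader.
            WriteOutcome::NotLeader { leader_address: Some(addr) } if !addr.is_empty() => {
                current = addr;
            }
            // No usable hint: report it rather than guessing another node.
            WriteOutcome::NotLeader { .. } => {
                return Err(format!("{current} is not the leader and gave no hint"));
            }
        }
    }
    Err(format!("gave up after {max_redirects} redirects"))
}
```

Capping the number of redirects guards against two nodes that keep pointing at each other while an election is still settling.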


§Retry Strategy

| Error | Retry? | How |
| --- | --- | --- |
| CONNECTION_TIMEOUT | Yes | Exponential backoff |
| LEADER_CHANGED | Yes | Immediate with new leader |
| NOT_LEADER | Yes | Redirect to leader |
| CLUSTER_UNAVAILABLE | Yes | Wait longer, don’t hammer |
| INVALID_REQUEST | No | Fix request first |
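The table above can be expressed as a small decision function. The sketch below is illustrative only: `Code`, `RetryAction`, `backoff_delay`, and `plan_retry` are hypothetical names, and the base delays and caps (3s/30s for timeouts, 5s/60s for an unavailable cluster) are assumptions chosen to match "Retry after 3-5s" and "Wait longer", not values defined by d-engine.

```rust
use std::time::Duration;

/// Hypothetical mirror of the codes listed in the retry table.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Code {
    ConnectionTimeout,
    LeaderChanged,
    NotLeader,
    ClusterUnavailable,
    InvalidRequest,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RetryAction {
    /// Wait for the given delay, then retry.
    Backoff(Duration),
    /// Retry immediately against the leader reported in the metadata.
    RetryWithNewLeader,
    /// The request itself must be fixed first.
    DoNotRetry,
}

/// Exponential backoff: `base * 2^attempt`, capped at `cap`.
/// Saturating arithmetic keeps large attempt counts from overflowing.
pub fn backoff_delay(attempt: u32, base: Duration, cap: Duration) -> Duration {
    let factor = 2u32.saturating_pow(attempt);
    base.saturating_mul(factor).min(cap)
}

/// Chooses a retry action for `code` on the given zero-based attempt.
pub fn plan_retry(code: Code, attempt: u32) -> RetryAction {
    match code {
        // Assumed: 3s base, 30s cap.
        Code::ConnectionTimeout => RetryAction::Backoff(backoff_delay(
            attempt,
            Duration::from_secs(3),
            Duration::from_secs(30),
        )),
        Code::LeaderChanged | Code::NotLeader => RetryAction::RetryWithNewLeader,
        // Assumed: longer 5s base and 60s cap so clients do not hammer the cluster.
        Code::ClusterUnavailable => RetryAction::Backoff(backoff_delay(
            attempt,
            Duration::from_secs(5),
            Duration::from_secs(60),
        )),
        Code::InvalidRequest => RetryAction::DoNotRetry,
    }
}
```

In practice a caller would usually add random jitter to each delay so that many clients recovering from the same outage do not retry in lockstep.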

§See Also