Module distributed_join

Distributed join execution: broadcast and shuffle joins across cluster nodes.

Broadcast join: the Control Plane serializes the small side (< 8 MiB) and sends it to all relevant nodes via QUIC transport. Each node then performs a local hash join between the broadcast data and its local large-side data.
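The local hash join that each node runs can be sketched as a build phase over the small side followed by a probe phase over the large side. This is a minimal illustration, not this module's implementation: the `Row` type and the `local_hash_join` function are hypothetical names, and real rows would arrive as Arrow record batches rather than tuples.

```rust
use std::collections::HashMap;

/// Hypothetical row shape for illustration: (join key, payload).
type Row = (u64, String);

/// Build a hash table from the broadcast (small) side, then probe it with
/// each local large-side row. Returns (key, small payload, large payload)
/// for every matching pair, in large-side order.
fn local_hash_join(small: &[Row], large: &[Row]) -> Vec<(u64, String, String)> {
    // Build phase: key -> all small-side payloads that share that key.
    let mut table: HashMap<u64, Vec<&str>> = HashMap::new();
    for (key, payload) in small {
        table.entry(*key).or_default().push(payload.as_str());
    }

    // Probe phase: emit one output row per matching pair.
    let mut out = Vec::new();
    for (key, large_payload) in large {
        if let Some(small_payloads) = table.get(key) {
            for small_payload in small_payloads {
                out.push((*key, small_payload.to_string(), large_payload.clone()));
            }
        }
    }
    out
}
```

Building the table on the small side keeps the per-node memory cost bounded by the broadcast payload, which is presumably why the strategy is limited to small inputs.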

Shuffle join: each node scans its local data, hashes each row on the join key, and routes the row to the owning node via QUIC transport. The target node then performs a local hash join on the repartitioned data.
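The routing step of a shuffle join can be sketched as hashing the join key and taking the result modulo the node count. The names `owner_node` and `route_rows` are hypothetical, and the use of the standard library's `DefaultHasher` is an assumption made for illustration; the hash function this module actually uses is not documented here.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical helper: map a join key to the index of the node that owns it.
/// Every node must use the same hash function, otherwise matching keys from
/// the two sides would land on different nodes. `DefaultHasher` is not
/// guaranteed stable across Rust releases, so a real cluster would likely
/// pin an explicit hash algorithm instead.
fn owner_node<K: Hash>(key: &K, node_count: usize) -> usize {
    assert!(node_count > 0, "need at least one node");
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    (hasher.finish() % node_count as u64) as usize
}

/// Group local rows into one outgoing bucket per target node.
/// Bucket `i` holds the rows that would be sent to node `i`.
fn route_rows<K: Hash, V>(rows: Vec<(K, V)>, node_count: usize) -> Vec<Vec<(K, V)>> {
    let mut buckets: Vec<Vec<(K, V)>> = (0..node_count).map(|_| Vec::new()).collect();
    for (key, value) in rows {
        let target = owner_node(&key, node_count);
        buckets[target].push((key, value));
    }
    buckets
}
```

Because both sides of the join are routed with the same function, all rows sharing a key end up on one node, where an ordinary local hash join can match them.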

Both strategies use Arrow IPC for zero-copy batched data movement.

Structs

BroadcastJoinRequest
A broadcast join request sent to each participating node.
ShufflePartition
A shuffle join partition assignment.

Enums

JoinSide
JoinStrategy
Selected distributed join strategy.

Constants

DEFAULT_BROADCAST_THRESHOLD_BYTES
Default maximum payload size for broadcast join strategy selection.

Functions

estimate_collection_bytes
Estimate the serialized size of a collection’s data.
partition_for_key
Compute which node owns a given partition (based on join key hash).
plan_shuffle_partitions
Plan the node assignments for a shuffle join.
select_strategy
Determine the join strategy based on estimated data sizes.
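Size-based strategy selection might look roughly like the following sketch. The signatures of the functions listed above are not shown on this page, so everything here is hypothetical: the `Side` and `Strategy` enums, the `choose_strategy` function, and the `BROADCAST_THRESHOLD_BYTES` constant are illustrative stand-ins. The 8 MiB value is assumed from the module description, and the strict less-than comparison follows the "< 8 MiB" wording there.

```rust
/// Assumed threshold of 8 MiB, taken from the module description.
const BROADCAST_THRESHOLD_BYTES: u64 = 8 * 1024 * 1024;

/// Hypothetical marker for which input of the join is meant.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Side {
    Left,
    Right,
}

/// Hypothetical strategy type for this sketch.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Strategy {
    /// Send the smaller input to every participating node.
    Broadcast { small_side: Side },
    /// Repartition both inputs by join key hash.
    Shuffle,
}

/// Pick broadcast when the smaller input is strictly below the threshold,
/// otherwise fall back to shuffle. Ties on size pick the left side.
fn choose_strategy(left_bytes: u64, right_bytes: u64, threshold: u64) -> Strategy {
    let (small_side, small_bytes) = if left_bytes <= right_bytes {
        (Side::Left, left_bytes)
    } else {
        (Side::Right, right_bytes)
    };
    if small_bytes < threshold {
        Strategy::Broadcast { small_side }
    } else {
        Strategy::Shuffle
    }
}
```

Passing the threshold as a parameter, rather than reading the constant inside the function, keeps the decision easy to test and to tune per deployment.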