PiecewiseMergeJoinExec

Struct PiecewiseMergeJoinExec 

pub struct PiecewiseMergeJoinExec {
    pub buffered: Arc<dyn ExecutionPlan>,
    pub streamed: Arc<dyn ExecutionPlan>,
    pub on: (Arc<dyn PhysicalExpr>, Arc<dyn PhysicalExpr>),
    pub operator: Operator,
    pub join_type: JoinType,
    /* private fields */
}

PiecewiseMergeJoinExec is a join execution plan that evaluates a single range filter and shows much better performance than NestedLoopJoin for these workloads.

The physical planner will choose this join when there is exactly one comparison filter: a binary expression whose operator is Operator::Lt, Operator::LtEq, Operator::Gt, or Operator::GtEq. Examples:

  • col0 < colb, col0 <= colb, col0 > colb, col0 >= colb

§Execution Plan Inputs

For PiecewiseMergeJoin we label the right input as the 'streamed' side and the left input as the 'buffered' side.

PiecewiseMergeJoin requires sorted input for the side to be buffered and sorts streamed record batches itself during processing. The buffered input must be sorted ascending or descending depending on the operator.

§Algorithms

Classic joins are processed differently compared to existence joins.

§Classic Joins (Inner, Full, Left, Right)

For classic joins we buffer the build side and stream the probe (streamed) side. Both sides are sorted so that we can iterate from index 0 to the end on each side. This ordering ensures that when we find the first matching pair of rows, we can emit the current stream row joined with all remaining buffered rows from the match position onward, without rescanning earlier buffered rows.

For < and <= operators, both inputs are sorted in ascending order, while for > and >= operators they are sorted in descending order (consistent with the requirement stated for both join types below). This choice ensures that the pointer on the buffered side can advance monotonically as we stream new batches from the stream side.

The streamed side may arrive unsorted, so this operator sorts each incoming batch in memory before processing. The buffered side is required to be globally sorted; the plan declares this requirement in requires_input_order, which allows the optimizer to automatically insert a SortExec on that side if needed. By the time this operator runs, the buffered side is guaranteed to be in the proper order.

The pseudocode for the algorithm looks like this:

for stream_row in stream_batch:
    for buffer_row in buffer_batch:
        if compare(stream_row, buffer_row):
            output stream_row X buffer_batch[buffer_row:]
            break

The algorithm uses the (larger) streamed side to drive the loop: each stream row scans the buffered side only until its first match. Once that match is found, the stream row is emitted against the entire remaining run of buffered rows at once, so each match produces more output rows and output handling can be better vectorized for performance.

Here is an example:

We perform a JoinType::Left with these two batches, the operator being Operator::Lt (<). For each row on the streamed side we advance a pointer on the buffered side until the condition is satisfied. Once we reach the first matching row (here, row 1 on the streamed side finds its first match at row 2 on the buffered side, since 100 < 200 is true), we can emit that row and every row after it. We can emit these rows because the batch is sorted in ascending order, so every subsequent row also satisfies the condition with an even larger value.

SQL statement:
SELECT *
FROM (VALUES (100), (200), (500)) AS streamed(a)
LEFT JOIN (VALUES (100), (200), (200), (300), (400)) AS buffered(b)
  ON streamed.a < buffered.b;

Processing Row 1:

      Sorted Buffered Side                                         Sorted Streamed Side          
      ┌──────────────────┐                                         ┌──────────────────┐         
    1 │       100        │                                       1 │       100        │        
      ├──────────────────┤                                         ├──────────────────┤         
    2 │       200        │ ─┐                                    2 │       200        │        
      ├──────────────────┤  │  For row 1 on streamed side with     ├──────────────────┤         
    3 │       200        │  │  value 100, we emit rows 2 - 5.    3 │       500        │       
      ├──────────────────┤  │  as matches when the operator is     └──────────────────┘
    4 │       300        │  │  `Operator::Lt` (<) Emitting all
      ├──────────────────┤  │  rows after the first match (row
    5 │       400        │ ─┘  2 buffered side; 100 < 200)
      └──────────────────┘     

Processing Row 2:
  Because the streamed side is sorted, the buffered pointer never needs to move backwards; probing for streamed row 2 resumes from the position of row 1's first match.

      Sorted Buffered Side                                         Sorted Streamed Side          
      ┌──────────────────┐                                         ┌──────────────────┐         
    1 │       100        │                                       1 │       100        │        
      ├──────────────────┤                                         ├──────────────────┤         
    2 │       200        │ <- Start here when probing for the    2 │       200        │        
      ├──────────────────┤    streamed side row 2.                 ├──────────────────┤         
    3 │       200        │                                       3 │       500        │       
      ├──────────────────┤                                         └──────────────────┘
    4 │       300        │  
      ├──────────────────┤  
    5 │       400        │
      └──────────────────┘     
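The walk-through above can be condensed into a small Rust sketch. This is a simplified single-batch model over plain `i64` slices rather than Arrow record batches, and `classic_pwmj_lt` is a hypothetical helper name, not DataFusion's API. It relies on the same two facts: with a `<` predicate both sides are sorted ascending, so the buffered pointer only ever advances, and every buffered row from the first match onward also matches:

```rust
/// Simplified piecewise merge join for a `streamed < buffered` predicate.
/// Both slices must be sorted ascending. Returns (stream_idx, buffer_idx)
/// index pairs for every matching row combination.
fn classic_pwmj_lt(streamed: &[i64], buffered: &[i64]) -> Vec<(usize, usize)> {
    let mut out = Vec::new();
    let mut start = 0; // buffered pointer: only ever advances
    for (si, &s) in streamed.iter().enumerate() {
        // Advance to the first buffered row where `s < b` holds.
        while start < buffered.len() && s >= buffered[start] {
            start += 1;
        }
        // Every buffered row from `start` onward also satisfies `s < b`,
        // because the buffered side is sorted ascending.
        for bi in start..buffered.len() {
            out.push((si, bi));
        }
    }
    out
}
```

On the example data (streamed [100, 200, 500], buffered [100, 200, 200, 300, 400]) this pairs streamed row 1 with buffered rows 2-5 and streamed row 2 with buffered rows 4-5, matching the diagrams; for the Left join, unmatched streamed rows would then be padded with nulls.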

§Existence Joins (Semi, Anti, Mark)

Existence joins are orders of magnitude faster with a PiecewiseMergeJoin because we only need to find the min/max value of the streamed side to emit all matches on the buffered side. By placing the side we need to mark on the sorted buffered side, we can emit all of these matches at once.

For less than operations (<) the buffered side is sorted ascending, and vice versa (descending) for greater than (>) operations. SortExec is used to enforce sorting on the buffered side; the streamed side does not need to be sorted because we only need its min/max.

For Left Semi, Anti, and Mark joins we swap the inputs so that the marked side is on the buffered side.

The pseudocode for the algorithm looks like this:

// Using the example of a less than `<` operation (streamed.a < buffered.b)
let min = min_batch(streamed_batch)

for buffer_row in buffer_batch:
    if min < buffer_row:
        output buffer_batch[buffer_row:]
        break

We only need to find the min/max value and then iterate through the buffered side once.

Here is an example: We perform a JoinType::LeftSemi with these two batches and the operator being Operator::Lt(<). Because the operator is Operator::Lt we can find the minimum value in the streamed side; in this case it is 200. We can then advance a pointer from the start of the buffer side until we find the first value that satisfies the predicate. All rows after that first matched value satisfy the condition 200 < x so we can mark all of those rows as matched.

SQL statement:
SELECT *
FROM (VALUES (500), (200), (300)) AS streamed(a)
LEFT SEMI JOIN (VALUES (100), (200), (200), (300), (400)) AS buffered(b)
  ON streamed.a < buffered.b;

         Sorted Buffered Side             Unsorted Streamed Side
           ┌──────────────────┐          ┌──────────────────┐
         1 │       100        │        1 │       500        │
           ├──────────────────┤          ├──────────────────┤
         2 │       200        │        2 │       200        │
           ├──────────────────┤          ├──────────────────┤    
         3 │       200        │        3 │       300        │
           ├──────────────────┤          └──────────────────┘
         4 │       300        │ ─┐       
           ├──────────────────┤  | We emit matches for row 4 - 5
         5 │       400        │ ─┘ on the buffered side.
           └──────────────────┘
            min value: 200

For both types of joins, the buffered side must be sorted ascending for Operator::Lt (<) or Operator::LtEq (<=) and descending for Operator::Gt (>) or Operator::GtEq (>=).
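The existence-join pass can be sketched the same way (again a simplified `i64` model; `semi_pwmj_lt` is a hypothetical name, not DataFusion's API). For `<` we take the minimum of the unsorted streamed side, find the first row of the ascending buffered side that the predicate accepts, and mark everything from there onward:

```rust
/// Simplified existence (semi/mark) pass for a `streamed < buffered`
/// predicate: buffered row `b` is matched iff `min(streamed) < b`.
/// `buffered` must be sorted ascending; `streamed` may be unsorted.
fn semi_pwmj_lt(streamed: &[i64], buffered: &[i64]) -> Vec<bool> {
    let mut marks = vec![false; buffered.len()];
    if let Some(min) = streamed.iter().copied().min() {
        // First buffered index where `min < b` holds; all later rows
        // are at least as large, so they match as well.
        let first = buffered.partition_point(|&b| b <= min);
        for mark in &mut marks[first..] {
            *mark = true;
        }
    }
    marks
}
```

With the example data (streamed [500, 200, 300], min 200) this marks buffered rows 4-5, exactly the rows the diagram emits.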

§Partitioning Logic

Piecewise Merge Join requires a single buffered-side partition plus a round-robin partitioned streamed side. A counter on the buffered side coordinates when all streamed partitions have finished execution, which allows processing the remaining unmatched rows for Left and Full joins. The last partition to finish execution is responsible for outputting the unmatched rows.

§Performance Explanation (cost)

Piecewise Merge Join is used over Nested Loop Join due to its superior performance. Here is the breakdown:

R: Buffered Side S: Streamed Side

§Piecewise Merge Join (PWMJ)

§Classic Join:

Requires sorting the probe side and, for each probe row, scanning the buffered side until the first match is found. Complexity: O(sort(S) + num_of_batches(|S|) * scan(R)).

§Mark Join:

Sorts the probe side, then computes the min/max range of the probe keys and scans the buffered side only within that range.
Complexity: O(|S| + scan(R[range])).

§Nested Loop Join

Compares every row from S with every row from R.
Complexity: O(|S| * |R|).

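To make the gap concrete, here is an illustrative (not measured) count of comparisons under an assumed workload; the function names are ours, not DataFusion's:

```rust
/// Comparisons a nested loop join performs: every (S, R) pair.
fn nlj_comparisons(s_rows: u64, r_rows: u64) -> u64 {
    s_rows * r_rows
}

/// Scan comparisons for the piecewise merge pass: within each streamed
/// batch the buffered pointer only advances, so the buffered side is
/// walked at most once per batch (sorting cost excluded).
fn pwmj_scan_comparisons(num_stream_batches: u64, r_rows: u64) -> u64 {
    num_stream_batches * r_rows
}
```

For one million rows per side split into, say, 128 streamed batches, the nested loop join performs 10^12 comparisons while the merge scan performs 1.28 * 10^8, which is the O(|S| * |R|) versus O(num_of_batches(|S|) * scan(R)) difference stated above.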

§Further Reference Material

DuckDB blog on Range Joins: Range Joins in DuckDB

Fields§

§buffered: Arc<dyn ExecutionPlan>

Left buffered execution plan

§streamed: Arc<dyn ExecutionPlan>

Right streamed execution plan

§on: (Arc<dyn PhysicalExpr>, Arc<dyn PhysicalExpr>)

The two expressions being compared

§operator: Operator

Comparison operator in the range predicate

§join_type: JoinType

How the join is performed

Implementations§

impl PiecewiseMergeJoinExec

pub fn try_new( buffered: Arc<dyn ExecutionPlan>, streamed: Arc<dyn ExecutionPlan>, on: (Arc<dyn PhysicalExpr>, Arc<dyn PhysicalExpr>), operator: Operator, join_type: JoinType, num_partitions: usize, ) -> Result<Self>

pub fn buffered(&self) -> &Arc<dyn ExecutionPlan>

Reference to the buffered side execution plan

pub fn streamed(&self) -> &Arc<dyn ExecutionPlan>

Reference to the streamed side execution plan

pub fn join_type(&self) -> JoinType

Join type

pub fn sort_options(&self) -> &SortOptions

Reference to the sort options

pub fn probe_side(join_type: &JoinType) -> JoinSide

Get the probe (streamed) side for the PiecewiseMergeJoin. In the current implementation, the probe side is determined by the join type.

pub fn compute_properties( buffered: &Arc<dyn ExecutionPlan>, streamed: &Arc<dyn ExecutionPlan>, schema: SchemaRef, join_type: JoinType, join_on: &(PhysicalExprRef, PhysicalExprRef), ) -> Result<PlanProperties>

pub fn swap_inputs(&self) -> Result<Arc<dyn ExecutionPlan>>

Trait Implementations§

impl Debug for PiecewiseMergeJoinExec

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Formats the value using the given formatter.

impl DisplayAs for PiecewiseMergeJoinExec

fn fmt_as(&self, t: DisplayFormatType, f: &mut Formatter<'_>) -> Result

Format according to DisplayFormatType, used when the verbose representation looks different from the default one

impl ExecutionPlan for PiecewiseMergeJoinExec

fn name(&self) -> &str

Short name for the ExecutionPlan, such as 'DataSourceExec'.

fn as_any(&self) -> &dyn Any

Returns the execution plan as Any so that it can be downcast to a specific implementation.

fn properties(&self) -> &PlanProperties

Return properties of the output of the ExecutionPlan, such as output ordering(s), partitioning information etc.

fn children(&self) -> Vec<&Arc<dyn ExecutionPlan>>

Get a list of children ExecutionPlans that act as inputs to this plan. The returned list will be empty for leaf nodes such as scans, will contain a single value for unary nodes, or two values for binary nodes (such as joins).

fn required_input_distribution(&self) -> Vec<Distribution>

Specifies the data distribution requirements for all the children of this ExecutionPlan. By default it is [Distribution::UnspecifiedDistribution] for each child.

fn required_input_ordering(&self) -> Vec<Option<OrderingRequirements>>

Specifies the ordering required for all of the children of this ExecutionPlan.

fn with_new_children( self: Arc<Self>, children: Vec<Arc<dyn ExecutionPlan>>, ) -> Result<Arc<dyn ExecutionPlan>>

Returns a new ExecutionPlan where all existing children were replaced by the children, in order

fn execute( &self, partition: usize, context: Arc<TaskContext>, ) -> Result<SendableRecordBatchStream>

Begin execution of partition, returning a Stream of RecordBatches.

fn static_name() -> &'static str
where Self: Sized,

Short name for the ExecutionPlan, such as 'DataSourceExec'. Like name but can be called without an instance.

fn schema(&self) -> SchemaRef

Get the schema for this execution plan

fn check_invariants(&self, check: InvariantLevel) -> Result<()>

Returns an error if this individual node does not conform to its invariants. These invariants are typically only checked in debug mode.

fn maintains_input_order(&self) -> Vec<bool>

Returns false if this ExecutionPlan's implementation may reorder rows within or between partitions.

fn benefits_from_input_partitioning(&self) -> Vec<bool>

Specifies whether the ExecutionPlan benefits from increased parallelization at its input for each child.

fn reset_state(self: Arc<Self>) -> Result<Arc<dyn ExecutionPlan>>

Reset any internal state within this ExecutionPlan.

fn repartitioned( &self, _target_partitions: usize, _config: &ConfigOptions, ) -> Result<Option<Arc<dyn ExecutionPlan>>>

If supported, attempt to increase the partitioning of this ExecutionPlan to produce target_partitions partitions.

fn metrics(&self) -> Option<MetricsSet>

Return a snapshot of the set of Metrics for this ExecutionPlan. If no Metrics are available, return None.

fn statistics(&self) -> Result<Statistics>

👎Deprecated since 48.0.0: Use partition_statistics method instead
Returns statistics for this ExecutionPlan node. If statistics are not available, should return Statistics::new_unknown (the default), not an error.

fn partition_statistics(&self, partition: Option<usize>) -> Result<Statistics>

Returns statistics for a specific partition of this ExecutionPlan node. If statistics are not available, should return Statistics::new_unknown (the default), not an error. If partition is None, it returns statistics for the entire plan.

fn supports_limit_pushdown(&self) -> bool

Returns true if a limit can be safely pushed down through this ExecutionPlan node.

fn with_fetch(&self, _limit: Option<usize>) -> Option<Arc<dyn ExecutionPlan>>

Returns a fetching variant of this ExecutionPlan node, if it supports fetch limits. Returns None otherwise.

fn fetch(&self) -> Option<usize>

Gets the fetch count for the operator, None means there is no fetch.

fn cardinality_effect(&self) -> CardinalityEffect

Gets the effect on cardinality, if known

fn try_swapping_with_projection( &self, _projection: &ProjectionExec, ) -> Result<Option<Arc<dyn ExecutionPlan>>>

Attempts to push down the given projection into the input of this ExecutionPlan.

fn gather_filters_for_pushdown( &self, _phase: FilterPushdownPhase, parent_filters: Vec<Arc<dyn PhysicalExpr>>, _config: &ConfigOptions, ) -> Result<FilterDescription>

Collect filters that this node can push down to its children. Filters that are being pushed down from parents are passed in, and the node may generate additional filters to push down. For example, given the plan FilterExec -> HashJoinExec -> DataSourceExec, we recurse down the plan calling ExecutionPlan::gather_filters_for_pushdown.

fn handle_child_pushdown_result( &self, _phase: FilterPushdownPhase, child_pushdown_result: ChildPushdownResult, _config: &ConfigOptions, ) -> Result<FilterPushdownPropagation<Arc<dyn ExecutionPlan>>>

Handle the result of a child pushdown. This method is called as we recurse back up the plan tree after pushing filters down to child nodes via ExecutionPlan::gather_filters_for_pushdown. It allows the current node to process the results of filter pushdown from its children, deciding whether to absorb filters, modify the plan, or pass filters back up to its parent.

fn with_new_state( &self, _state: Arc<dyn Any + Send + Sync>, ) -> Option<Arc<dyn ExecutionPlan>>

Injects arbitrary run-time state into this execution plan, returning a new plan instance that incorporates that state if it is relevant to the concrete node implementation.

Auto Trait Implementations§

Blanket Implementations§

impl<T> Any for T
where T: 'static + ?Sized,

fn type_id(&self) -> TypeId

Gets the TypeId of self.

impl<T> Borrow<T> for T
where T: ?Sized,

fn borrow(&self) -> &T

Immutably borrows from an owned value.

impl<T> BorrowMut<T> for T
where T: ?Sized,

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value.

impl<T> From<T> for T

fn from(t: T) -> T

Returns the argument unchanged.

impl<T, U> Into<U> for T
where U: From<T>,

fn into(self) -> U

Calls U::from(self). That is, this conversion is whatever the implementation of From<T> for U chooses to do.

impl<T> IntoEither for T

fn into_either(self, into_left: bool) -> Either<Self, Self>

Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.

fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
where F: FnOnce(&Self) -> bool,

Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.

impl<T, U> TryFrom<U> for T
where U: Into<T>,

type Error = Infallible

The type returned in the event of a conversion error.

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.

impl<V, T> VZip<V> for T
where V: MultiLane<T>,

fn vzip(self) -> V

impl<T> ErasedDestructor for T
where T: 'static,