Struct datafusion_physical_plan::TopK
source · pub struct TopK { /* private fields */ }Expand description
Global TopK
§Background
“Top K” is a common query optimization used for queries such as “find the top 3 customers by revenue”. The (simplified) SQL for such a query might be:
SELECT customer_id, revenue FROM 'sales.csv' ORDER BY revenue DESC limit 3;
The simple plan would be:
> explain SELECT customer_id, revenue FROM sales ORDER BY revenue DESC limit 3;
+--------------+----------------------------------------+
| plan_type | plan |
+--------------+----------------------------------------+
| logical_plan | Limit: 3 |
| | Sort: revenue DESC NULLS FIRST |
| | Projection: customer_id, revenue |
| | TableScan: sales |
+--------------+----------------------------------------+
While this plan produces the correct answer, it will fully sorts the input before discarding everything other than the top 3 elements.
The same answer can be produced by simply keeping track of the top K=3 elements, reducing the total amount of required buffer memory.
§Structure
This operator tracks the top K items using a TopKHeap.
Implementations§
source§impl TopK
impl TopK
sourcepub fn try_new(
partition_id: usize,
schema: SchemaRef,
expr: Vec<PhysicalSortExpr>,
k: usize,
batch_size: usize,
runtime: Arc<RuntimeEnv>,
metrics: &ExecutionPlanMetricsSet,
partition: usize
) -> Result<Self>
pub fn try_new( partition_id: usize, schema: SchemaRef, expr: Vec<PhysicalSortExpr>, k: usize, batch_size: usize, runtime: Arc<RuntimeEnv>, metrics: &ExecutionPlanMetricsSet, partition: usize ) -> Result<Self>
Create a new TopK that stores the top k values, as
defined by the sort expressions in expr.
sourcepub fn insert_batch(&mut self, batch: RecordBatch) -> Result<()>
pub fn insert_batch(&mut self, batch: RecordBatch) -> Result<()>
Insert batch, remembering if any of its values are among
the top k seen so far.
sourcepub fn emit(self) -> Result<SendableRecordBatchStream>
pub fn emit(self) -> Result<SendableRecordBatchStream>
Returns the top k results broken into batch_size RecordBatches, consuming the heap