Trait datafusion_expr::groups_accumulator::GroupsAccumulator

source ·

pub trait GroupsAccumulator: Send {
    // Required methods
    fn update_batch(
        &mut self,
        values: &[ArrayRef],
        group_indices: &[usize],
        opt_filter: Option<&BooleanArray>,
        total_num_groups: usize
    ) -> Result<()>;
    fn evaluate(&mut self, emit_to: EmitTo) -> Result<ArrayRef>;
    fn state(&mut self, emit_to: EmitTo) -> Result<Vec<ArrayRef>>;
    fn merge_batch(
        &mut self,
        values: &[ArrayRef],
        group_indices: &[usize],
        opt_filter: Option<&BooleanArray>,
        total_num_groups: usize
    ) -> Result<()>;
    fn size(&self) -> usize;
}

Expand description

GroupAccumulator implements a single aggregate (e.g. AVG) and stores the state for all groups internally.

§Notes on Implementing `GroupAccumulator`

All aggregates must first implement the simpler Accumulator trait, which handles state for a single group. Implementing GroupsAccumulator is optional and is harder to implement than Accumulator, but can be much faster for queries with many group values. See the Aggregating Millions of Groups Fast blog for more background.

§Details

Each group is assigned a group_index by the hash table and each accumulator manages the specific state, one per group_index.

group_indexes are contiguous (there aren’t gaps), and thus it is expected that each GroupAccumulator will use something like Vec<..> to store the group states.

Required Methods§

source

fn update_batch( &mut self, values: &[ArrayRef], group_indices: &[usize], opt_filter: Option<&BooleanArray>, total_num_groups: usize ) -> Result<()>

Updates the accumulator’s state from its arguments, encoded as a vector of ArrayRefs.

values: the input arguments to the accumulator
group_indices: To which groups do the rows in values belong, group id)
opt_filter: if present, only update aggregate state using values[i] if opt_filter[i] is true
total_num_groups: the number of groups (the largest group_index is thus total_num_groups - 1).

Note that subsequent calls to update_batch may have larger total_num_groups as new groups are seen.

source

fn evaluate(&mut self, emit_to: EmitTo) -> Result<ArrayRef>

Returns the final aggregate value for each group as a single RecordBatch, resetting the internal state.

The rows returned must be in group_index order: The value for group_index 0, followed by 1, etc. Any group_index that did not have values, should be null.

For example, a SUM accumulator maintains a running sum for each group, and evaluate will produce that running sum as its output for all groups, in group_index order

If emit_to`` is [EmitTo::All`], the accumulator should return all groups and release / reset its internal state equivalent to when it was first created.

If emit_to is EmitTo::First, only the first n groups should be emitted and the state for those first groups removed. State for the remaining groups must be retained for future use. The group_indices on subsequent calls to update_batch or merge_batch will be shifted down by n. See EmitTo::First for more details.

source

fn state(&mut self, emit_to: EmitTo) -> Result<Vec<ArrayRef>>

Returns the intermediate aggregate state for this accumulator, used for multi-phase grouping, resetting its internal state.

For example, AVG might return two arrays: SUM and COUNT but the MIN aggregate would just return a single array.

Note more sophisticated internal state can be passed as single StructArray rather than multiple arrays.

See Self::evaluate for details on the required output order and emit_to.

source

fn merge_batch( &mut self, values: &[ArrayRef], group_indices: &[usize], opt_filter: Option<&BooleanArray>, total_num_groups: usize ) -> Result<()>

Merges intermediate state (the output from Self::state) into this accumulator’s values.

For some aggregates (such as SUM), merge_batch is the same as update_batch, but for some aggregates (such as COUNT, where the partial counts must be summed) the operations differ. See Self::state for more details on how state is used and merged.

values: arrays produced from calling state previously to the accumulator

Other arguments are the same as for Self::update_batch;

source

fn size(&self) -> usize

Amount of memory used to store the state of this accumulator, in bytes. This function is called once per batch, so it should be O(n) to compute, not O(num_groups)

Trait datafusion_expr::groups_accumulator::GroupsAccumulator

§Notes on Implementing GroupAccumulator

§Details

Required Methods§

fn update_batch( &mut self, values: &[ArrayRef], group_indices: &[usize], opt_filter: Option<&BooleanArray>, total_num_groups: usize ) -> Result<()>

fn evaluate(&mut self, emit_to: EmitTo) -> Result<ArrayRef>

fn state(&mut self, emit_to: EmitTo) -> Result<Vec<ArrayRef>>

fn merge_batch( &mut self, values: &[ArrayRef], group_indices: &[usize], opt_filter: Option<&BooleanArray>, total_num_groups: usize ) -> Result<()>

fn size(&self) -> usize

Implementors§

§Notes on Implementing `GroupAccumulator`