pub trait PruningStatistics {
    // Required methods
    fn min_values(&self, column: &Column) -> Option<ArrayRef>;
    fn max_values(&self, column: &Column) -> Option<ArrayRef>;
    fn num_containers(&self) -> usize;
    fn null_counts(&self, column: &Column) -> Option<ArrayRef>;
    fn row_counts(&self, column: &Column) -> Option<ArrayRef>;
    fn contained(
        &self,
        column: &Column,
        values: &HashSet<ScalarValue>,
    ) -> Option<BooleanArray>;
}
A source of runtime statistical information for PruningPredicates.
§Supported Information
- Minimum and maximum values for columns
- Null counts and row counts for columns
- Whether the values in a column are contained in a set of literals
§Vectorized Interface
Information for containers (e.g. files) is returned as an Arrow ArrayRef
with one row per container, so the predicate is evaluated once over a single
RecordBatch, which amortizes the evaluation overhead. This matters when
pruning thousands of containers, which is common in analytic systems that
have thousands of candidate files to consider.
For example, for the following three files with a single column a:
file1: column a: min=5, max=10
file2: column a: No stats
file3: column a: min=20, max=30
PruningStatistics would return:
min_values("a") -> Some([5, Null, 20])
max_values("a") -> Some([10, Null, 30])
min_values("X") -> None
§Required Methods
fn min_values(&self, column: &Column) -> Option<ArrayRef>
Return the minimum values for the named column, if known.
If the minimum value for a particular container is not known, the
returned array should have null in that row. If the minimum value is
not known for any row, return None.
Note: the returned array must contain Self::num_containers rows
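The null and None conventions can be modeled without Arrow. The sketch below is a std-only stand-in, not the DataFusion API: `FileStats`, `ToyStatistics`, `example_statistics`, and the use of `Vec<Option<i64>>` in place of `ArrayRef` are assumptions made for illustration. It reproduces the three-file example from the trait description.

```rust
use std::collections::HashMap;

/// Hypothetical per-file statistics: column name -> (min, max).
/// A missing entry means the file has no statistics for that column.
struct FileStats {
    ranges: HashMap<String, (i64, i64)>,
}

/// Hypothetical stand-in for a type that would implement PruningStatistics.
struct ToyStatistics {
    files: Vec<FileStats>,
}

impl ToyStatistics {
    fn num_containers(&self) -> usize {
        self.files.len()
    }

    /// One entry per file; inner `None` entries play the role of Arrow nulls.
    /// Returns `None` overall when no file knows the column's minimum.
    fn min_values(&self, column: &str) -> Option<Vec<Option<i64>>> {
        let mins: Vec<Option<i64>> = self
            .files
            .iter()
            .map(|f| f.ranges.get(column).map(|&(min, _)| min))
            .collect();
        if mins.iter().all(|m| m.is_none()) {
            None
        } else {
            Some(mins)
        }
    }
}

/// Builds the three files from the example; the second has no statistics.
fn example_statistics() -> ToyStatistics {
    let with_range = |min, max| FileStats {
        ranges: HashMap::from([("a".to_string(), (min, max))]),
    };
    ToyStatistics {
        files: vec![
            with_range(5, 10),
            FileStats { ranges: HashMap::new() },
            with_range(20, 30),
        ],
    }
}
```

The result always has `num_containers` entries when it is `Some`, which is the invariant stated in the note above.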
fn max_values(&self, column: &Column) -> Option<ArrayRef>
Return the maximum values for the named column, if known.
See Self::min_values for when to return None and null values.
Note: the returned array must contain Self::num_containers rows
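To illustrate why both bounds are exposed, here is a simplified, std-only sketch of range pruning for an equality predicate `column = literal`. It does not show how PruningPredicate is implemented; `may_contain` and the `Option<i64>` slices standing in for the min and max arrays are hypothetical. A container is ruled out only when its statistics prove that no row can match, so unknown statistics always keep the container.

```rust
/// For each container, decide whether `column = literal` could match.
/// `mins` and `maxes` hold one entry per container; `None` models an
/// Arrow null, meaning the bound is unknown for that container.
fn may_contain(mins: &[Option<i64>], maxes: &[Option<i64>], literal: i64) -> Vec<bool> {
    assert_eq!(mins.len(), maxes.len(), "one row per container");
    mins.iter()
        .zip(maxes)
        .map(|(&min, &max)| {
            // A known bound that excludes the literal proves no row matches.
            let below_min = min.map_or(false, |m| literal < m);
            let above_max = max.map_or(false, |m| literal > m);
            !(below_min || above_max)
        })
        .collect()
}
```

With the three-file example, `may_contain(&mins, &maxes, 25)` keeps the file without statistics and the file whose range is 20 to 30, and rules out the file whose maximum is 10.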
fn num_containers(&self) -> usize
Return the number of containers (e.g. Row Groups) being pruned with these statistics.
This value corresponds to the size of the ArrayRef returned by
Self::min_values, Self::max_values, Self::null_counts,
and Self::row_counts.
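The length requirement is easy to check in tests of an implementation. The helper below is a hypothetical, std-only sketch: `mismatched_lengths` is not part of DataFusion, and each statistic is represented only by its name and the length of the array it returned (`None` when the method returned `None`).

```rust
/// Returns the names of statistics whose length differs from `num_containers`.
/// `None` (statistic not available at all) is always acceptable.
fn mismatched_lengths<'a>(
    num_containers: usize,
    stats: &[(&'a str, Option<usize>)],
) -> Vec<&'a str> {
    stats
        .iter()
        .filter(|(_, len)| len.map_or(false, |l| l != num_containers))
        .map(|(name, _)| *name)
        .collect()
}
```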
fn null_counts(&self, column: &Column) -> Option<ArrayRef>
Return the number of null values for the named column in each container
as a UInt64Array.
See Self::min_values for when to return None and null values.
Note: the returned array must contain Self::num_containers rows
fn row_counts(&self, column: &Column) -> Option<ArrayRef>
Return the number of rows for the named column in each container
as a UInt64Array.
See Self::min_values for when to return None and null values.
Note: the returned array must contain Self::num_containers rows
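Null counts and row counts are most useful together. As a simplified illustration (not DataFusion's actual predicate rewrite), the std-only sketch below decides per container whether `column IS NULL` or `column IS NOT NULL` could match any row. The function names and the `Option<u64>` slices standing in for UInt64Array are assumptions.

```rust
/// Could `column IS NULL` match any row, per container?
fn may_have_nulls(null_counts: &[Option<u64>]) -> Vec<bool> {
    // Only a known count of zero proves there are no nulls.
    null_counts.iter().map(|&n| n != Some(0)).collect()
}

/// Could `column IS NOT NULL` match any row, per container?
fn may_have_non_nulls(null_counts: &[Option<u64>], row_counts: &[Option<u64>]) -> Vec<bool> {
    assert_eq!(null_counts.len(), row_counts.len(), "one row per container");
    null_counts
        .iter()
        .zip(row_counts)
        .map(|(&nulls, &rows)| match (nulls, rows) {
            // If every row is null, no non-null value exists.
            (Some(n), Some(r)) => n < r,
            // Either count is unknown: keep the container.
            _ => true,
        })
        .collect()
}
```

In both functions an unknown count never rules a container out, which mirrors the null convention described for min_values.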
fn contained(
    &self,
    column: &Column,
    values: &HashSet<ScalarValue>,
) -> Option<BooleanArray>
Return a BooleanArray where each row represents what is known about
specific literal values in a column.
For example, Parquet Bloom Filters implement this API to communicate
that values are known not to be present in a Row Group.
The returned array has one row for each container, with the following meanings:
- true if the values in column ONLY contain values from values
- false if the values in column are NOT ANY of values
- null if neither of the above holds or is unknown.
If these statistics cannot determine column membership for any
container, return None (the default).
Note: the returned array must contain Self::num_containers rows
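The three-valued result can be sketched with std-only types. This is a hypothetical model, not the DataFusion signature: it assumes each container either exposes its complete set of distinct values or nothing at all, and uses `Vec<Option<bool>>` in place of BooleanArray and `i64` in place of ScalarValue. A Bloom filter would typically only ever produce false or null, since it cannot prove that every value is in the set.

```rust
use std::collections::HashSet;

/// `known[i]` is the complete set of distinct values in container `i`,
/// or `None` when that set is not known.
fn contained(
    known: &[Option<HashSet<i64>>],
    values: &HashSet<i64>,
) -> Option<Vec<Option<bool>>> {
    // Nothing is known for any container: report that with `None`.
    if known.iter().all(|k| k.is_none()) {
        return None;
    }
    Some(
        known
            .iter()
            .map(|k| {
                // Unknown container: null (modeled as `None`).
                let set = k.as_ref()?;
                if set.is_disjoint(values) {
                    // No value in the container is in `values`.
                    // An empty container also lands here.
                    Some(false)
                } else if set.is_subset(values) {
                    // Every value in the container is in `values`.
                    Some(true)
                } else {
                    // Partial overlap: neither statement holds.
                    None
                }
            })
            .collect(),
    )
}
```

For a predicate such as `a IN (1, 2, 3)`, a false row means the container can be skipped, while true and null rows both mean it must still be read.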