Struct datafusion::physical_optimizer::pruning::PruningPredicate

source ·

pub struct PruningPredicate { /* private fields */ }

Expand description

Used to prove that arbitrary predicates (boolean expression) can not possibly evaluate to true given information about a column provided by PruningStatistics.

§Introduction

PruningPredicate analyzes filter expressions using statistics such as min/max values and null counts, attempting to prove a “container” (e.g. Parquet Row Group) can be skipped without reading the actual data, potentially leading to significant performance improvements.

For example, PruningPredicates are used to prune Parquet Row Groups based on the min/max values found in the Parquet metadata. If the PruningPredicate can prove that the filter can never evaluate to true for any row in the Row Group, the entire Row Group is skipped during query execution.

The PruningPredicate API is general, and can be used for pruning other types of containers (e.g. files) based on statistics that may be known from external catalogs (e.g. Delta Lake) or other sources. How this works is a subtle topic. See the Background and Implementation section for details.

PruningPredicate supports:

Arbitrary expressions (including user defined functions)
Vectorized evaluation (provide more than one set of statistics at a time) so it is suitable for pruning 1000s of containers.
Any source of information that implements the PruningStatistics trait (not just Parquet metadata).

§Example

See the pruning.rs example in the datafusion-examples for a complete example of how to use PruningPredicate to prune files based on min/max values.

Given an expression like x = 5 and statistics for 3 containers (Row Groups, files, etc) A, B, and C:

  A: {x_min = 0, x_max = 4}
  B: {x_min = 2, x_max = 10}
  C: {x_min = 5, x_max = 8}

PruningPredicate will conclude that the rows in container A can never be true (as the maximum value is only 4), so it can be pruned:

A: false (no rows could possibly match x = 5)
B: true  (rows might match x = 5)
C: true  (rows might match x = 5)

See PruningPredicate::try_new and PruningPredicate::prune for more information.

§Background

§Boolean Tri-state logic

To understand the details of the rest of this documentation, it is important to understand how the tri-state boolean logic in SQL works. As this is somewhat esoteric, we review it here.

SQL has a notion of NULL that represents the value is “unknown” and this uncertainty propagates through expressions. SQL NULL behaves very differently than the NULL in most other languages where it is a special, sentinel value (e.g. 0 in C/C++). While representing uncertainty with NULL is powerful and elegant, SQL NULLs are often deeply confusing when first encountered as they behave differently than most programmers may expect.

In most other programming languages,

a == NULL evaluates to true if a also had the value NULL
a == NULL evaluates to false if a has any other value

However, in SQL a = NULL always evaluates to NULL (never true or false):

Expression	Result
`1 = NULL`	`NULL`
`NULL = NULL`	`NULL`

Also important is how AND and OR works with tri-state boolean logic as (perhaps counterintuitively) the result is not always NULL. While consistent with the notion of NULL representing “unknown”, this is again, often deeply confusing 🤯 when first encountered.

Expression	Result	Intuition
`NULL AND true`	`NULL`	The `NULL` stands for “unknown” and if it were `true` or `false` the overall expression value could change
`NULL AND false`	`false`	If the `NULL` was either `true` or `false` the overall expression is still `false`
`NULL AND NULL`	`NULL`

Expression	Result	Intuition
`NULL OR true`	`true`	If the `NULL` was either `true` or `false` the overall expression is still `true`
`NULL OR false`	`NULL`	The `NULL` stands for “unknown” and if it were `true` or `false` the overall expression value could change
`NULL OR NULL`	`NULL`

§SQL Filter Semantics

The SQL WHERE clause has a boolean expression, often called a filter or predicate. The semantics of this predicate are that the query evaluates the predicate for each row in the input tables and:

Rows that evaluate to true are returned in the query results
Rows that evaluate to false are not returned (“filtered out” or “pruned” or “skipped”).
Rows that evaluate to NULL are NOT returned (also “filtered out”). Note: this treatment of NULL is DIFFERENT than how NULL is treated in the rewritten predicate described below.

§`PruningPredicate` Implementation

Armed with the information in the Background section, we can now understand how the PruningPredicate logic works.

§Interface

Inputs

An input schema describing what columns exist
A predicate (expression that evaluates to a boolean)
PruningStatistics that provides information about columns in that schema, for multiple “containers”. For each column in each container, it provides optional information on contained values, min_values, max_values, null_counts counts, and row_counts counts.

Outputs: A (non null) boolean value for each container:

true: There MAY be rows that match the predicate
false: There are no rows that could possibly match the predicate (the predicate can never possibly be true). The container can be pruned (skipped) entirely.

Note that in order to be correct, PruningPredicate must return false only if it can determine that for all rows in the container, the predicate could never evaluate to true (always evaluates to either NULL or false).

§Contains Analysis and Min/Max Rewrite

PruningPredicate works by first analyzing the predicate to see what LiteralGuarantee must hold for the predicate to be true.

Then, the PruningPredicate rewrites the original predicate into an expression that references the min/max values of each column in the original predicate.

When the min/max values are actually substituted in to this expression and evaluated, the result means

true: there MAY be rows that pass the predicate, KEEPS the container
NULL: there MAY be rows that pass the predicate, KEEPS the container Note that rewritten predicate can evaluate to NULL when some of the min/max values are not known. Note that this is different than the SQL filter semantics where NULL means the row is filtered out.
false: there are no rows that could possibly match the predicate, PRUNES the container

For example, given a column x, the x_min, x_max, x_null_count, and x_row_count represent the minimum and maximum values, the null count of column x, and the row count of column x, provided by the PruningStatistics. x_null_count and x_row_count are used to handle the case where the column x is known to be all NULLs. Note this is different from knowing nothing about the column x, which confusingly is encoded by returning NULL for the min/max values from PruningStatistics::max_values and PruningStatistics::min_values.

Here are some examples of the rewritten predicates:

Original Predicate	Rewritten Predicate
`x = 5`	`CASE WHEN x_null_count = x_row_count THEN false ELSE x_min <= 5 AND 5 <= x_max END`
`x < 5`	`CASE WHEN x_null_count = x_row_count THEN false ELSE x_max < 5 END`
`x = 5 AND y = 10`	`CASE WHEN x_null_count = x_row_count THEN false ELSE x_min <= 5 AND 5 <= x_max END AND CASE WHEN y_null_count = y_row_count THEN false ELSE y_min <= 10 AND 10 <= y_max END`
`x IS NULL`	`CASE WHEN x_null_count = x_row_count THEN false ELSE x_null_count > 0 END`
`CAST(x as int) = 5`	`CASE WHEN x_null_count = x_row_count THEN false ELSE CAST(x_min as int) <= 5 AND 5 <= CAST(x_max as int) END`

§Predicate Evaluation

The PruningPredicate works in two passes

First pass: For each LiteralGuarantee calls PruningStatistics::contained and rules out containers where the LiteralGuarantees are not satisfied

Second Pass: Evaluates the rewritten expression using the min/max/null_counts/row_counts values for each column for each container. For any container that this expression evaluates to false, it rules out those containers.

§Example 1

Given the predicate, x = 5 AND y = 10, the rewritten predicate would look like:

CASE
    WHEN x_null_count = x_row_count THEN false
    ELSE x_min <= 5 AND 5 <= x_max
END
AND
CASE
    WHEN y_null_count = y_row_count THEN false
    ELSE y_min <= 10 AND 10 <= y_max
END

If we know that for a given container, x is between 1 and 100 and we know that y is between 4 and 7, we know nothing about the null count and row count of x and y, the input statistics might look like:

Column	Value
`x_min`	`1`
`x_max`	`100`
`x_null_count`	`null`
`x_row_count`	`null`
`y_min`	`4`
`y_max`	`7`
`y_null_count`	`null`
`y_row_count`	`null`

When these statistics values are substituted in to the rewritten predicate and simplified, the result is false:

CASE WHEN null = null THEN false ELSE 1 <= 5 AND 5 <= 100 END AND CASE WHEN null = null THEN false ELSE 4 <= 10 AND 10 <= 7 END
null = null is null which is not true, so the CASE expression will use the ELSE clause
1 <= 5 AND 5 <= 100 AND 4 <= 10 AND 10 <= 7
true AND true AND true AND false
false

Returning false means the container can be pruned, which matches the intuition that x = 5 AND y = 10 can’t be true for any row if all values of y are 7 or less.

If, for some other container, we knew y was between the values 4 and 15, then the rewritten predicate evaluates to true (verifying this is left as an exercise to the reader – are you still here?), and the container could not be pruned. The intuition is that there may be rows where the predicate might evaluate to true, and the only way to find out is to do more analysis, for example by actually reading the data and evaluating the predicate row by row.

§Example 2

Given the same predicate, x = 5 AND y = 10, the rewritten predicate would look like the same as example 1:

CASE
  WHEN x_null_count = x_row_count THEN false
  ELSE x_min <= 5 AND 5 <= x_max
END
AND
CASE
  WHEN y_null_count = y_row_count THEN false
 ELSE y_min <= 10 AND 10 <= y_max
END

If we know that for another given container, x_min is NULL and x_max is NULL (the min/max values are unknown), x_null_count is 100 and x_row_count is 100; we know that y is between 4 and 7, but we know nothing about the null count and row count of y. The input statistics might look like:

Column	Value
`x_min`	`null`
`x_max`	`null`
`x_null_count`	`100`
`x_row_count`	`100`
`y_min`	`4`
`y_max`	`7`
`y_null_count`	`null`
`y_row_count`	`null`

When these statistics values are substituted in to the rewritten predicate and simplified, the result is false:

CASE WHEN 100 = 100 THEN false ELSE null <= 5 AND 5 <= null END AND CASE WHEN null = null THEN false ELSE 4 <= 10 AND 10 <= 7 END
Since 100 = 100 is true, the CASE expression will use the THEN clause, i.e. false
The other CASE expression will use the ELSE clause, i.e. 4 <= 10 AND 10 <= 7
false AND true
false

Returning false means the container can be pruned, which matches the intuition that x = 5 AND y = 10 can’t be true for all values in x are known to be NULL.

PruningPredicate implements the type of min/max pruning described in Section 3.3.3 of the Snowflake SIGMOD Paper. The technique is described by various research such as small materialized aggregates, zone maps, and data skipping.

Struct datafusion::physical_optimizer::pruning::PruningPredicateCopy item path

§Introduction

§Example

§Background

§Boolean Tri-state logic

§SQL Filter Semantics

§PruningPredicate Implementation

§Interface

§Contains Analysis and Min/Max Rewrite

§Predicate Evaluation

§Example 1

§Example 2

§Related Work

Implementations§

impl PruningPredicate

pub fn try_new(expr: Arc<dyn PhysicalExpr>, schema: SchemaRef) -> Result<Self>

pub fn prune<S: PruningStatistics>(&self, statistics: &S) -> Result<Vec<bool>>

pub fn schema(&self) -> &SchemaRef

pub fn orig_expr(&self) -> &Arc<dyn PhysicalExpr>

pub fn predicate_expr(&self) -> &Arc<dyn PhysicalExpr>

pub fn literal_guarantees(&self) -> &[LiteralGuarantee]

pub fn always_true(&self) -> bool

pub fn literal_columns(&self) -> Vec<String>

Trait Implementations§

impl Clone for PruningPredicate

fn clone(&self) -> PruningPredicate

fn clone_from(&mut self, source: &Self)

impl Debug for PruningPredicate

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Auto Trait Implementations§

impl Freeze for PruningPredicate

impl !RefUnwindSafe for PruningPredicate

impl Send for PruningPredicate

impl Sync for PruningPredicate

impl Unpin for PruningPredicate

impl !UnwindSafe for PruningPredicate

Blanket Implementations§

impl<T> Any for Twhere T: 'static + ?Sized,

fn type_id(&self) -> TypeId

impl<T> Borrow<T> for Twhere T: ?Sized,

fn borrow(&self) -> &T

impl<T> BorrowMut<T> for Twhere T: ?Sized,

fn borrow_mut(&mut self) -> &mut T

impl<T> From<T> for T

fn from(t: T) -> T

impl<T, U> Into<U> for Twhere U: From<T>,

fn into(self) -> U

impl<T> IntoEither for T

fn into_either(self, into_left: bool) -> Either<Self, Self>

fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>where F: FnOnce(&Self) -> bool,

impl<T> Same for T

type Output = T

impl<T> ToOwned for Twhere T: Clone,

type Owned = T

fn to_owned(&self) -> T

fn clone_into(&self, target: &mut T)

impl<T, U> TryFrom<U> for Twhere U: Into<T>,

type Error = Infallible

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

impl<T, U> TryInto<U> for Twhere U: TryFrom<T>,

type Error = <U as TryFrom<T>>::Error

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

impl<V, T> VZip<V> for Twhere V: MultiLane<T>,

fn vzip(self) -> V

Struct datafusion::physical_optimizer::pruning::PruningPredicate

§`PruningPredicate` Implementation

impl<T> Any for T
where T: 'static + ?Sized,

impl<T> Borrow<T> for T
where T: ?Sized,

impl<T> BorrowMut<T> for T
where T: ?Sized,

impl<T, U> Into<U> for T
where U: From<T>,

fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
where F: FnOnce(&Self) -> bool,

impl<T> ToOwned for T
where T: Clone,

impl<T, U> TryFrom<U> for T
where U: Into<T>,

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

impl<V, T> VZip<V> for T
where V: MultiLane<T>,