pub struct PruningPredicate { /* private fields */ }
Expand description

Used to prove that arbitrary predicates (boolean expression) can not possibly evaluate to true given information about a column provided by PruningStatistics.

§Introduction

PruningPredicate analyzes filter expressions using statistics such as min/max values and null counts, attempting to prove a “container” (e.g. Parquet Row Group) can be skipped without reading the actual data, potentially leading to significant performance improvements.

For example, PruningPredicates are used to prune Parquet Row Groups based on the min/max values found in the Parquet metadata. If the PruningPredicate can prove that the filter can never evaluate to true for any row in the Row Group, the entire Row Group is skipped during query execution.

The PruningPredicate API is general, and can be used for pruning other types of containers (e.g. files) based on statistics that may be known from external catalogs (e.g. Delta Lake) or other sources. How this works is a subtle topic. See the Background and Implementation section for details.

PruningPredicate supports:

  1. Arbitrary expressions (including user defined functions)

  2. Vectorized evaluation (provide more than one set of statistics at a time) so it is suitable for pruning 1000s of containers.

  3. Any source of information that implements the PruningStatistics trait (not just Parquet metadata).

§Example

See the pruning.rs example in the datafusion-examples for a complete example of how to use PruningPredicate to prune files based on min/max values.

Given an expression like x = 5 and statistics for 3 containers (Row Groups, files, etc) A, B, and C:

  A: {x_min = 0, x_max = 4}
  B: {x_min = 2, x_max = 10}
  C: {x_min = 5, x_max = 8}

PruningPredicate will conclude that the rows in container A can never be true (as the maximum value is only 4), so it can be pruned:

A: false (no rows could possibly match x = 5)
B: true  (rows might match x = 5)
C: true  (rows might match x = 5)

See PruningPredicate::try_new and PruningPredicate::prune for more information.

§Background

§Boolean Tri-state logic

To understand the details of the rest of this documentation, it is important to understand how the tri-state boolean logic in SQL works. As this is somewhat esoteric, we review it here.

SQL has a notion of NULL that represents the value is “unknown” and this uncertainty propagates through expressions. SQL NULL behaves very differently than the NULL in most other languages where it is a special, sentinel value (e.g. 0 in C/C++). While representing uncertainty with NULL is powerful and elegant, SQL NULLs are often deeply confusing when first encountered as they behave differently than most programmers may expect.

In most other programming languages,

  • a == NULL evaluates to true if a also had the value NULL
  • a == NULL evaluates to false if a has any other value

However, in SQL a = NULL always evaluates to NULL (never true or false):

ExpressionResult
1 = NULLNULL
NULL = NULLNULL

Also important is how AND and OR works with tri-state boolean logic as (perhaps counterintuitively) the result is not always NULL. While consistent with the notion of NULL representing “unknown”, this is again, often deeply confusing 🤯 when first encountered.

ExpressionResultIntuition
NULL AND trueNULLThe NULL stands for “unknown” and if it were true or false the overall expression value could change
NULL AND falsefalseIf the NULL was either true or false the overall expression is still false
NULL AND NULLNULL
ExpressionResultIntuition
NULL OR truetrueIf the NULL was either true or false the overall expression is still true
NULL OR falseNULLThe NULL stands for “unknown” and if it were true or false the overall expression value could change
NULL OR NULLNULL

§SQL Filter Semantics

The SQL WHERE clause has a boolean expression, often called a filter or predicate. The semantics of this predicate are that the query evaluates the predicate for each row in the input tables and:

  • Rows that evaluate to true are returned in the query results

  • Rows that evaluate to false are not returned (“filtered out” or “pruned” or “skipped”).

  • Rows that evaluate to NULL are NOT returned (also “filtered out”). Note: this treatment of NULL is DIFFERENT than how NULL is treated in the rewritten predicate described below.

§PruningPredicate Implementation

Armed with the information in the Background section, we can now understand how the PruningPredicate logic works.

§Interface

Inputs

  1. An input schema describing what columns exist

  2. A predicate (expression that evaluates to a boolean)

  3. PruningStatistics that provides information about columns in that schema, for multiple “containers”. For each column in each container, it provides optional information on contained values, min_values, max_values, null_counts counts, and row_counts counts.

Outputs: A (non null) boolean value for each container:

  • true: There MAY be rows that match the predicate

  • false: There are no rows that could possibly match the predicate (the predicate can never possibly be true). The container can be pruned (skipped) entirely.

Note that in order to be correct, PruningPredicate must return false only if it can determine that for all rows in the container, the predicate could never evaluate to true (always evaluates to either NULL or false).

§Contains Analysis and Min/Max Rewrite

PruningPredicate works by first analyzing the predicate to see what LiteralGuarantee must hold for the predicate to be true.

Then, the PruningPredicate rewrites the original predicate into an expression that references the min/max values of each column in the original predicate.

When the min/max values are actually substituted in to this expression and evaluated, the result means

  • true: there MAY be rows that pass the predicate, KEEPS the container

  • NULL: there MAY be rows that pass the predicate, KEEPS the container Note that rewritten predicate can evaluate to NULL when some of the min/max values are not known. Note that this is different than the SQL filter semantics where NULL means the row is filtered out.

  • false: there are no rows that could possibly match the predicate, PRUNES the container

For example, given a column x, the x_min, x_max, x_null_count, and x_row_count represent the minimum and maximum values, the null count of column x, and the row count of column x, provided by the PruningStatistics. x_null_count and x_row_count are used to handle the case where the column x is known to be all NULLs. Note this is different from knowing nothing about the column x, which confusingly is encoded by returning NULL for the min/max values from PruningStatistics::max_values and PruningStatistics::min_values.

Here are some examples of the rewritten predicates:

Original PredicateRewritten Predicate
x = 5CASE WHEN x_null_count = x_row_count THEN false ELSE x_min <= 5 AND 5 <= x_max END
x < 5CASE WHEN x_null_count = x_row_count THEN false ELSE x_max < 5 END
x = 5 AND y = 10CASE WHEN x_null_count = x_row_count THEN false ELSE x_min <= 5 AND 5 <= x_max END AND CASE WHEN y_null_count = y_row_count THEN false ELSE y_min <= 10 AND 10 <= y_max END
x IS NULLCASE WHEN x_null_count = x_row_count THEN false ELSE x_null_count > 0 END
CAST(x as int) = 5CASE WHEN x_null_count = x_row_count THEN false ELSE CAST(x_min as int) <= 5 AND 5 <= CAST(x_max as int) END

§Predicate Evaluation

The PruningPredicate works in two passes

First pass: For each LiteralGuarantee calls PruningStatistics::contained and rules out containers where the LiteralGuarantees are not satisfied

Second Pass: Evaluates the rewritten expression using the min/max/null_counts/row_counts values for each column for each container. For any container that this expression evaluates to false, it rules out those containers.

§Example 1

Given the predicate, x = 5 AND y = 10, the rewritten predicate would look like:

CASE
    WHEN x_null_count = x_row_count THEN false
    ELSE x_min <= 5 AND 5 <= x_max
END
AND
CASE
    WHEN y_null_count = y_row_count THEN false
    ELSE y_min <= 10 AND 10 <= y_max
END

If we know that for a given container, x is between 1 and 100 and we know that y is between 4 and 7, we know nothing about the null count and row count of x and y, the input statistics might look like:

ColumnValue
x_min1
x_max100
x_null_countnull
x_row_countnull
y_min4
y_max7
y_null_countnull
y_row_countnull

When these statistics values are substituted in to the rewritten predicate and simplified, the result is false:

  • CASE WHEN null = null THEN false ELSE 1 <= 5 AND 5 <= 100 END AND CASE WHEN null = null THEN false ELSE 4 <= 10 AND 10 <= 7 END
  • null = null is null which is not true, so the CASE expression will use the ELSE clause
  • 1 <= 5 AND 5 <= 100 AND 4 <= 10 AND 10 <= 7
  • true AND true AND true AND false
  • false

Returning false means the container can be pruned, which matches the intuition that x = 5 AND y = 10 can’t be true for any row if all values of y are 7 or less.

If, for some other container, we knew y was between the values 4 and 15, then the rewritten predicate evaluates to true (verifying this is left as an exercise to the reader – are you still here?), and the container could not be pruned. The intuition is that there may be rows where the predicate might evaluate to true, and the only way to find out is to do more analysis, for example by actually reading the data and evaluating the predicate row by row.

§Example 2

Given the same predicate, x = 5 AND y = 10, the rewritten predicate would look like the same as example 1:

CASE
  WHEN x_null_count = x_row_count THEN false
  ELSE x_min <= 5 AND 5 <= x_max
END
AND
CASE
  WHEN y_null_count = y_row_count THEN false
 ELSE y_min <= 10 AND 10 <= y_max
END

If we know that for another given container, x_min is NULL and x_max is NULL (the min/max values are unknown), x_null_count is 100 and x_row_count is 100; we know that y is between 4 and 7, but we know nothing about the null count and row count of y. The input statistics might look like:

ColumnValue
x_minnull
x_maxnull
x_null_count100
x_row_count100
y_min4
y_max7
y_null_countnull
y_row_countnull

When these statistics values are substituted in to the rewritten predicate and simplified, the result is false:

  • CASE WHEN 100 = 100 THEN false ELSE null <= 5 AND 5 <= null END AND CASE WHEN null = null THEN false ELSE 4 <= 10 AND 10 <= 7 END
  • Since 100 = 100 is true, the CASE expression will use the THEN clause, i.e. false
  • The other CASE expression will use the ELSE clause, i.e. 4 <= 10 AND 10 <= 7
  • false AND true
  • false

Returning false means the container can be pruned, which matches the intuition that x = 5 AND y = 10 can’t be true for all values in x are known to be NULL.

PruningPredicate implements the type of min/max pruning described in Section 3.3.3 of the Snowflake SIGMOD Paper. The technique is described by various research such as small materialized aggregates, zone maps, and data skipping.

Implementations§

source§

impl PruningPredicate

source

pub fn try_new(expr: Arc<dyn PhysicalExpr>, schema: SchemaRef) -> Result<Self>

Try to create a new instance of PruningPredicate

This will translate the provided expr filter expression into a pruning predicate.

A pruning predicate is one that has been rewritten in terms of the min and max values of column references and that evaluates to FALSE if the filter predicate would evaluate FALSE for every row whose values fell within the min / max ranges (aka could be pruned).

The pruning predicate evaluates to TRUE or NULL if the filter predicate might evaluate to TRUE for at least one row whose values fell within the min/max ranges (in other words they might pass the predicate)

For example, the filter expression (column / 2) = 4 becomes the pruning predicate (column_min / 2) <= 4 && 4 <= (column_max / 2))

See the struct level documentation on PruningPredicate for more details.

source

pub fn prune<S: PruningStatistics>(&self, statistics: &S) -> Result<Vec<bool>>

For each set of statistics, evaluates the pruning predicate and returns a bool with the following meaning for a all rows whose values match the statistics:

true: There MAY be rows that match the predicate

false: There are no rows that could possibly match the predicate

Note: the predicate passed to prune should already be simplified as much as possible (e.g. this pass doesn’t handle some expressions like b = false, but it does handle the simplified version b. See ExprSimplifier to simplify expressions.

source

pub fn schema(&self) -> &SchemaRef

Return a reference to the input schema

source

pub fn orig_expr(&self) -> &Arc<dyn PhysicalExpr>

Returns a reference to the physical expr used to construct this pruning predicate

source

pub fn predicate_expr(&self) -> &Arc<dyn PhysicalExpr>

Returns a reference to the predicate expr

source

pub fn literal_guarantees(&self) -> &[LiteralGuarantee]

Returns a reference to the literal guarantees

source

pub fn always_true(&self) -> bool

Returns true if this pruning predicate can not prune anything.

This happens if the predicate is a literal true and literal_guarantees is empty.

source

pub fn literal_columns(&self) -> Vec<String>

Names of the columns that are known to be / not be in a set of literals (constants). These are the columns the that may be passed to PruningStatistics::contained during pruning.

This is useful to avoid fetching statistics for columns that will not be used in the predicate. For example, it can be used to avoid reading uneeded bloom filters (a non trivial operation).

Trait Implementations§

source§

impl Clone for PruningPredicate

source§

fn clone(&self) -> PruningPredicate

Returns a copy of the value. Read more
1.0.0 · source§

fn clone_from(&mut self, source: &Self)

Performs copy-assignment from source. Read more
source§

impl Debug for PruningPredicate

source§

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Formats the value using the given formatter. Read more

Auto Trait Implementations§

Blanket Implementations§

source§

impl<T> Any for T
where T: 'static + ?Sized,

source§

fn type_id(&self) -> TypeId

Gets the TypeId of self. Read more
source§

impl<T> Borrow<T> for T
where T: ?Sized,

source§

fn borrow(&self) -> &T

Immutably borrows from an owned value. Read more
source§

impl<T> BorrowMut<T> for T
where T: ?Sized,

source§

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value. Read more
source§

impl<T> From<T> for T

source§

fn from(t: T) -> T

Returns the argument unchanged.

source§

impl<T, U> Into<U> for T
where U: From<T>,

source§

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

source§

impl<T> IntoEither for T

source§

fn into_either(self, into_left: bool) -> Either<Self, Self>

Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise. Read more
source§

fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
where F: FnOnce(&Self) -> bool,

Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise. Read more
source§

impl<T> Same for T

§

type Output = T

Should always be Self
source§

impl<T> ToOwned for T
where T: Clone,

§

type Owned = T

The resulting type after obtaining ownership.
source§

fn to_owned(&self) -> T

Creates owned data from borrowed data, usually by cloning. Read more
source§

fn clone_into(&self, target: &mut T)

Uses borrowed data to replace owned data, usually by cloning. Read more
source§

impl<T, U> TryFrom<U> for T
where U: Into<T>,

§

type Error = Infallible

The type returned in the event of a conversion error.
source§

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.
source§

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

§

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.
source§

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.
source§

impl<V, T> VZip<V> for T
where V: MultiLane<T>,

source§

fn vzip(self) -> V