grafeo_core::execution::factorized_chunk

Struct FactorizedChunk

pub struct FactorizedChunk { /* private fields */ }

Expand description

A chunk that supports factorized representation across multiple levels.

Columns are organized in groups by their factorization level:

Level 0 (flat): Base columns, one value per logical row
Level 1 (unflat): First expansion, grouped by level 0
Level 2 (unflat): Second expansion, grouped by level 1
And so on…

§State Management

The chunk maintains a ChunkState that provides:

Cached multiplicities for O(1) aggregate access
Selection vector support for lazy filtering
Generation tracking for cache invalidation

Implementations§

Source §

impl FactorizedChunk

Source

pub fn empty() -> Self

Creates an empty factorized chunk.

Source

pub fn from_flat(chunk: &DataChunk, column_names: Vec<String>) -> Self

Creates a factorized chunk from a flat DataChunk.

The resulting chunk has a single level (level 0) with all columns flat.

Source

pub fn with_flat_level( columns: Vec<ValueVector>, column_names: Vec<String>, ) -> Self

Creates a factorized chunk with a single flat level.

Source

pub fn level_count(&self) -> usize

Returns the number of factorization levels.

Source

pub fn logical_row_count(&self) -> usize

Returns the logical row count (full Cartesian product size).

Source

pub fn physical_size(&self) -> usize

Returns the physical storage size (actual values stored).

Source

pub fn chunk_state(&self) -> &ChunkState

Returns the chunk state.

Source

pub fn chunk_state_mut(&mut self) -> &mut ChunkState

Returns mutable access to the chunk state.

Source

pub fn path_multiplicities_cached(&mut self) -> Arc<[usize]>

Returns path multiplicities, computing once and caching.

This is the key optimization for aggregation: multiplicities are computed once and reused for all aggregates (COUNT, SUM, AVG, etc.).

§Example

let mults = chunk.path_multiplicities_cached();
let sum = chunk.sum_deepest_with_multiplicities(0, &mults);
let avg = chunk.avg_deepest_with_multiplicities(0, &mults);

Source

pub fn level(&self, index: usize) -> Option<&FactorizationLevel>

Returns a level by index.

Source

pub fn level_mut(&mut self, index: usize) -> Option<&mut FactorizationLevel>

Returns a mutable level by index.

Source

pub fn add_level( &mut self, columns: Vec<ValueVector>, column_names: Vec<String>, offsets: &[u32], )

Adds a new factorization level for expansion results.

The new level’s multiplicities determine how many values each parent in the previous level expands to.

§Arguments

columns - Columns at the new level
column_names - Names for the new columns
offsets - Offset array where offsets[i] is the start index for parent i

Source

pub fn add_factorized_level(&mut self, level: FactorizationLevel)

Adds a level with pre-computed factorized vectors.

Source

pub fn flatten(&self) -> DataChunk

Flattens to a regular DataChunk (materializes the Cartesian product).

All levels are expanded into flat rows.

Source

pub fn logical_row_iter(&self) -> FactorizedRowIterator<'_> ⓘ

Returns an iterator over logical rows without materializing.

Each iteration yields a vector of physical indices, one per level.

Source

pub fn total_column_count(&self) -> usize

Gets the total number of columns across all levels.

Source

pub fn all_column_names(&self) -> Vec<String>

Gets all column names in order across all levels.

Source

pub fn filter_deepest<F>(&self, column_idx: usize, predicate: F) -> Option<Self>
where F: Fn(&Value) -> bool,

Filters the deepest level in-place using a predicate on column values.

This is the key optimization: instead of flattening and filtering all rows, we filter only at the deepest level and update parent multiplicities.

§Arguments

column_idx - Column index within the deepest level to filter on
predicate - Function that returns true for values to keep

§Returns

A new FactorizedChunk with filtered values, or None if all rows are filtered out.

Source

pub fn filter_deepest_multi<F>(&self, predicate: F) -> Option<Self>
where F: Fn(&[Value]) -> bool,

Filters the deepest level using a multi-column predicate.

This allows filtering based on values from multiple columns in the deepest level.

§Arguments

predicate - Function that takes a slice of values (one per column) and returns true to keep

Source

pub fn count_rows(&self) -> usize

Computes COUNT(*) without flattening - returns the logical row count.

This is O(n) where n is the number of physical values, instead of O(m) where m is the number of logical rows (which can be exponentially larger).

§Example

For a 3-level chunk:

Level 0: 100 sources
Level 1: 10 neighbors each = 1,000 physical
Level 2: 10 neighbors each = 10,000 physical
Logical rows = 100 * 10 * 10 = 10,000

count_rows() returns 10,000 by computing from multiplicities, not by iterating through all logical rows.

Source

pub fn compute_path_multiplicities(&self) -> Vec<usize>

Computes the effective multiplicity for each value at the deepest level.

This is how many times each value would appear in the flattened result. For example, if a source has 3 first-hop neighbors and each has 2 second-hop neighbors, each first-hop value has multiplicity 2 (appearing in 2 paths).

§Returns

A vector where result[i] is the multiplicity of physical value i at the deepest level. The sum of all multiplicities equals logical_row_count().

§Note

For repeated access (e.g., computing multiple aggregates), prefer using path_multiplicities_cached which caches the result and avoids O(levels) recomputation.

Source