samkhya-datafusion — DataFusion adapter for samkhya-core.
§Integration model
DataFusion 46.0 exposes two distinct statistics surfaces, but the mainline planner consults only the first for cardinality:
- ExecutionPlan::statistics() on each node of the physical plan. FilterExec, ProjectionExec, HashJoinExec, etc. derive their statistics from their children’s statistics(), so corrections placed at the leaf scan level reach the top of the plan.
- TableProvider::statistics() is not consulted by the mainline physical planner: it is reserved for downstream forks and custom optimizer rules. We still implement it for completeness, but we do not rely on it as the injection path.
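The upward propagation described above can be illustrated with a toy model. This is a minimal sketch, not DataFusion’s API: PlanNode, Scan, Filter, Projection, and plan_over are hypothetical stand-ins, and the fixed 0.5 selectivity is an assumption made only to keep the arithmetic visible.

```rust
// Toy model only: these types are hypothetical stand-ins, not DataFusion's API.

/// Minimal stand-in for a physical plan node that reports a row estimate.
trait PlanNode {
    fn estimated_rows(&self) -> Option<usize>;
}

/// Leaf scan: the only node that owns an estimate of its own.
struct Scan {
    rows: Option<usize>,
}

/// Filter: scales the child's estimate by an assumed, fixed selectivity.
struct Filter {
    child: Box<dyn PlanNode>,
    selectivity: f64,
}

/// Projection: row count is unchanged, so it forwards the child's estimate.
struct Projection {
    child: Box<dyn PlanNode>,
}

impl PlanNode for Scan {
    fn estimated_rows(&self) -> Option<usize> {
        self.rows
    }
}

impl PlanNode for Filter {
    fn estimated_rows(&self) -> Option<usize> {
        self.child
            .estimated_rows()
            .map(|n| (n as f64 * self.selectivity) as usize)
    }
}

impl PlanNode for Projection {
    fn estimated_rows(&self) -> Option<usize> {
        self.child.estimated_rows()
    }
}

/// Builds Projection(Filter(Scan)) over a leaf with the given estimate.
fn plan_over(leaf_rows: usize) -> Projection {
    Projection {
        child: Box::new(Filter {
            child: Box::new(Scan { rows: Some(leaf_rows) }),
            selectivity: 0.5,
        }),
    }
}

fn main() {
    // Changing only the leaf estimate changes what the plan root reports.
    println!("{:?}", plan_over(1_000).estimated_rows());
    println!("{:?}", plan_over(1_000_000).estimated_rows());
}
```

Nothing above the leaf stores an estimate of its own in this model, which is why a correction applied at the scan is enough to change the figure seen at the root.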
samkhya therefore wires corrections in at three layers, which together form the integration model:
- SamkhyaTableProvider: a TableProvider wrapper that delegates every method to an inner provider but overrides statistics() with samkhya-corrected ColumnStatistics and, critically, overrides scan() to return a physical SamkhyaStatsExec wrapping the inner provider’s exec. The exec wrapper is what makes physical.statistics()?.num_rows reflect samkhya’s corrections, because the mainline planner uses the exec’s stats, not the table provider’s.
- SamkhyaStatsExec: a passthrough ExecutionPlan that overrides statistics() to return a preset Statistics, delegating every other method to the inner exec. This is the physical-layer hook the planner actually consults.
- SamkhyaOptimizerRule: implements both OptimizerRule (logical, observe-only) and PhysicalOptimizerRule (physical, validates that the wrappers are in place and surfaces a diagnostic count of SamkhyaStatsExec leaves seen). Registering the rule is the explicit integration ceremony: operators audit the SessionState::physical_optimizers() slice to confirm samkhya is wired in.
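The passthrough wrapper and the diagnostic count can be sketched in the same toy style. Everything here is hypothetical (the Plan enum, rows, count_overrides, wrap); it models the shape of the idea, not the crate’s actual types or DataFusion’s ExecutionPlan trait, and the join estimate is a deliberately crude cross-product bound.

```rust
// Toy model only: a hypothetical plan enum, not DataFusion's ExecutionPlan trait.

enum Plan {
    /// Leaf scan with no statistics of its own.
    Scan,
    /// Passthrough wrapper that reports a preset row count for its inner node.
    StatsOverride { inner: Box<Plan>, preset_rows: usize },
    /// Two-input node; estimates with a crude cross-product upper bound.
    Join { left: Box<Plan>, right: Box<Plan> },
}

/// Row estimate as a planner would see it at this node.
fn rows(plan: &Plan) -> Option<usize> {
    match plan {
        Plan::Scan => None,
        // The wrapper answers from its preset and never asks the inner node.
        Plan::StatsOverride { preset_rows, .. } => Some(*preset_rows),
        Plan::Join { left, right } => match (rows(left), rows(right)) {
            (Some(l), Some(r)) => Some(l.saturating_mul(r)),
            _ => None,
        },
    }
}

/// Diagnostic pass: counts wrapper nodes without rewriting the plan.
fn count_overrides(plan: &Plan) -> usize {
    match plan {
        Plan::Scan => 0,
        Plan::StatsOverride { inner, .. } => 1 + count_overrides(inner),
        Plan::Join { left, right } => count_overrides(left) + count_overrides(right),
    }
}

/// Wraps a bare scan with a preset row count.
fn wrap(preset_rows: usize) -> Box<Plan> {
    Box::new(Plan::StatsOverride { inner: Box::new(Plan::Scan), preset_rows })
}

fn main() {
    let wired = Plan::Join { left: wrap(1_000), right: wrap(20) };
    let unwired = Plan::Join { left: wrap(1_000), right: Box::new(Plan::Scan) };
    println!("wired: rows={:?}, wrappers={}", rows(&wired), count_overrides(&wired));
    println!("unwired: rows={:?}, wrappers={}", rows(&unwired), count_overrides(&unwired));
}
```

In this model a single unwrapped leaf is enough to lose the estimate at the join, which is the kind of gap a count of wrapper leaves is meant to make visible.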
    use std::sync::Arc;

    use datafusion::execution::context::SessionContext;
    use datafusion::execution::session_state::SessionStateBuilder;
    use datafusion::prelude::SessionConfig;
    use samkhya_core::stats::ColumnStats;
    use samkhya_datafusion::{SamkhyaOptimizerRule, SamkhyaTableProvider};

    // Register the same rule on both the logical and the physical optimizer.
    let rule = Arc::new(SamkhyaOptimizerRule::new());
    let state = SessionStateBuilder::new()
        .with_config(SessionConfig::new())
        .with_default_features()
        .with_optimizer_rule(rule.clone())
        .with_physical_optimizer_rule(rule.clone())
        .build();
    let ctx = SessionContext::new_with_state(state);

    // `inner_provider` is the existing TableProvider to wrap, constructed elsewhere.
    let wrapped = SamkhyaTableProvider::new(inner_provider)
        .with_column_stats(0, ColumnStats::new().with_row_count(1_000_000));
    ctx.register_table("t", Arc::new(wrapped))?;

All values translated into DataFusion’s Precision<T> are marked
Precision::Inexact — samkhya’s corrections are feedback-driven,
clamped by the LpBound pessimistic ceiling, and never exact catalog
counts. This is the conservative posture the safety envelope requires.
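A minimal sketch of that posture follows. The local Precision enum mirrors the shape of DataFusion’s enum of the same name so the block stays self-contained; to_inexact_rows and its ceiling parameter are hypothetical, standing in for whatever bound the LpBound ceiling supplies.

```rust
// Toy model only. `Precision` mirrors the shape of DataFusion's enum of the
// same name; `to_inexact_rows` and its parameters are hypothetical.

#[allow(dead_code)] // `Exact` is listed for completeness; this sketch never produces it.
#[derive(Debug, PartialEq)]
enum Precision<T> {
    Exact(T),
    Inexact(T),
    Absent,
}

/// Clamps a feedback-corrected estimate to a pessimistic ceiling and marks
/// the result as inexact. A missing estimate stays absent.
fn to_inexact_rows(corrected: Option<usize>, ceiling: usize) -> Precision<usize> {
    match corrected {
        Some(n) => Precision::Inexact(n.min(ceiling)),
        None => Precision::Absent,
    }
}

fn main() {
    // An estimate above the ceiling is clamped down to it.
    println!("{:?}", to_inexact_rows(Some(5_000_000), 1_000_000));
    // An estimate under the ceiling passes through, still marked inexact.
    println!("{:?}", to_inexact_rows(Some(42), 1_000_000));
    // No estimate means no claim at all.
    println!("{:?}", to_inexact_rows(None, 1_000_000));
}
```

The function has no path that returns Exact, so a consumer that treats exact counts as authoritative can never be handed a corrected estimate under that label.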
§Compatibility
Compiled and tested against DataFusion 46.0.1 (released March 2025).
Version 46 is the first release with a stable OptimizerRule trait
surface (name, apply_order, supports_rewrite, rewrite), the
PhysicalOptimizerRule trait, and the Precision<T> /
ColumnStatistics / Statistics types we depend on for cardinality
correction. Newer versions should also work, with any signature drift
caught by the wrap_provider integration test and the
stats_propagation_demo example binary.
§Re-exports
- pub use optimizer_rule::SamkhyaOptimizerRule;
- pub use physical_plan::SamkhyaStatsExec;
- pub use stats_provider::to_datafusion_column_statistics;
- pub use table_provider::SamkhyaTableProvider;
§Modules
- optimizer_rule: SamkhyaOptimizerRule, the DataFusion integration point for samkhya’s cardinality corrections.
- physical_plan: SamkhyaStatsExec, the ExecutionPlan-layer wrapper that actually flows samkhya-corrected statistics into DataFusion 46’s physical plan.
- stats_provider: Conversion from samkhya_core::stats::ColumnStats to DataFusion’s ColumnStatistics.
- table_provider: SamkhyaTableProvider, the primary integration point for injecting samkhya-corrected column statistics into DataFusion’s query planning.