Expand description
SamkhyaTableProvider — the primary integration point for injecting
samkhya-corrected column statistics into DataFusion’s query planning.
§Wrapping point: TableProvider::statistics()
DataFusion attaches statistics to table providers, not to logical-plan
nodes. The TableProvider trait exposes a statistics() hook
(returning Option<Statistics>) that the planner consults when reasoning
about cardinality, join order, and filter selectivity. Rewriting a
LogicalPlan to “inject” stats is the wrong layer — that is observe-only
plumbing. The right layer is a TableProvider shim that delegates every
method to an inner provider except statistics(), where it folds in
samkhya’s feedback-driven corrections.
We considered three wrapping points and chose the first:
TableProvider::statistics()(this module). Clean, stable surface in DataFusion 46. The planner calls it during analysis. Every adapter (Parquet, CSV, MemTable, Iceberg) flows through the same hook, so the shim is provider-agnostic.ExecutionPlan::statistics(). Lower in the stack — would require wrapping the scan-sideExecutionPlanreturned fromscan(). Useful when the inner provider’s logical stats are absent but its physical plan has them; not our situation today.OptimizerRulerewritingTableScan::source. The original scaffold direction. The rewrite must construct a newTableSource(the logical counterpart ofTableProvider) — duplicate state, version-fragile, and never propagates into the physical layer where the planner actually consults stats. Kept around as observe-only telemetry (crate::SamkhyaOptimizerRule).
§LpBound posture
Every value translated into DataFusion’s Precision<T> is wrapped as
Precision::Inexact. samkhya’s corrections are feedback-driven
estimates clamped by the LpBound pessimistic ceiling; they are never
exact catalog counts. Inexact is the precision DataFusion’s
cost-based optimizer treats as “use this, but do not assume zero error”.
Structs§
- Samkhya
Table Provider - A
TableProviderwrapper that overridesstatistics()with samkhya-corrected column statistics while delegating every other method to the inner provider.