samkhya_datafusion/lib.rs
//! samkhya-datafusion: DataFusion adapter for samkhya-core.
//!
//! # Integration model
//!
//! DataFusion 46.0 exposes two distinct statistics surfaces, but the
//! mainline planner consults only the first of them for cardinality:
//!
//! * [`ExecutionPlan::statistics()`] on each node of the physical plan.
//!   `FilterExec`, `ProjectionExec`, `HashJoinExec` etc. derive their
//!   statistics from their children's `statistics()`, so corrections
//!   placed at the leaf scan level reach the top of the plan.
//! * [`TableProvider::statistics()`] is **not** consulted by the
//!   mainline physical planner: it is reserved for downstream forks /
//!   custom optimizer rules. We still implement it for completeness, but
//!   we do not rely on it as the injection path.
//!
//! samkhya therefore wires corrections in at three layers, which together
//! form the integration model:
//!
//! 1. [`SamkhyaTableProvider`]: a `TableProvider` wrapper that delegates
//!    every method to an inner provider but overrides `statistics()`
//!    with samkhya-corrected [`ColumnStatistics`], and, critically,
//!    overrides `scan()` to return a physical [`SamkhyaStatsExec`]
//!    wrapping the inner provider's exec. The exec wrapper is what
//!    makes `physical.statistics()?.num_rows` reflect samkhya's
//!    corrections, because the mainline planner uses the exec's stats,
//!    not the table provider's.
//! 2. [`SamkhyaStatsExec`]: a passthrough [`ExecutionPlan`] that
//!    overrides `statistics()` to return a preset `Statistics`,
//!    delegating every other method to the inner exec. This is the
//!    physical-layer hook the planner actually consults.
//! 3. [`SamkhyaOptimizerRule`]: implements both `OptimizerRule`
//!    (logical, observe-only) and `PhysicalOptimizerRule` (physical;
//!    validates that the wrappers are in place and surfaces a
//!    diagnostic count of `SamkhyaStatsExec` leaves seen). Registering
//!    the rule is the explicit integration step: operators can audit
//!    the `SessionState::physical_optimizers()` slice to confirm
//!    samkhya is wired in.
//!
//! ```ignore
//! use std::sync::Arc;
//! use datafusion::execution::session_state::SessionStateBuilder;
//! use datafusion::execution::context::SessionContext;
//! use datafusion::prelude::SessionConfig;
//! use samkhya_datafusion::{SamkhyaOptimizerRule, SamkhyaTableProvider};
//! use samkhya_core::stats::ColumnStats;
//!
//! // Register the same rule instance on both the logical and physical passes.
//! let rule = Arc::new(SamkhyaOptimizerRule::new());
//! let state = SessionStateBuilder::new()
//!     .with_config(SessionConfig::new())
//!     .with_default_features()
//!     .with_optimizer_rule(rule.clone())
//!     .with_physical_optimizer_rule(rule.clone())
//!     .build();
//! let ctx = SessionContext::new_with_state(state);
//!
//! // `inner_provider` is an existing table provider supplied by the caller.
//! let wrapped = SamkhyaTableProvider::new(inner_provider)
//!     .with_column_stats(0, ColumnStats::new().with_row_count(1_000_000));
//! ctx.register_table("t", Arc::new(wrapped))?;
//! ```
//!
//! All values translated into DataFusion's `Precision<T>` are marked
//! [`Precision::Inexact`]: samkhya's corrections are feedback-driven,
//! clamped by the LpBound pessimistic ceiling, and never exact catalog
//! counts. This is the conservative posture the safety envelope requires.
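//!
//! The snippet below is a minimal, self-contained sketch of that posture,
//! not the crate's actual implementation. `Estimate`, `corrected_rows`, and
//! the `lp_bound` parameter are hypothetical names introduced for
//! illustration (`Estimate` only mirrors the shape of DataFusion's
//! `Precision<T>`). It assumes a correction is capped at the ceiling and
//! then always tagged as inexact.
//!
//! ```rust
//! /// Hypothetical stand-in mirroring the shape of DataFusion's `Precision<T>`.
//! #[allow(dead_code)] // `Exact` is deliberately never constructed below.
//! #[derive(Debug, PartialEq)]
//! enum Estimate {
//!     Exact(usize),
//!     Inexact(usize),
//!     Absent,
//! }
//!
//! /// Clamp a feedback-driven row estimate to a pessimistic ceiling
//! /// (`lp_bound`, an assumed parameter) and tag the result as inexact.
//! /// With no feedback available, report the statistic as absent.
//! fn corrected_rows(feedback: Option<usize>, lp_bound: usize) -> Estimate {
//!     match feedback {
//!         // Never `Exact`: the value is a correction, not a catalog count.
//!         Some(rows) => Estimate::Inexact(rows.min(lp_bound)),
//!         None => Estimate::Absent,
//!     }
//! }
//! ```
//!
//! Under these assumptions, `corrected_rows(Some(5_000_000), 1_000_000)`
//! yields `Estimate::Inexact(1_000_000)`: the ceiling wins, and the value
//! is still flagged as an estimate.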
//!
//! # Compatibility
//!
//! Compiled and tested against **DataFusion 46.0.1** (released March 2025).
//! Version 46 provides the `OptimizerRule` trait surface (`name`,
//! `apply_order`, `supports_rewrite`, `rewrite`), the
//! `PhysicalOptimizerRule` trait, and the `Precision<T>` /
//! `ColumnStatistics` / `Statistics` types we depend on for cardinality
//! correction. Newer versions are expected to work as well; any signature
//! drift should be caught by the `wrap_provider` integration test and the
//! `stats_propagation_demo` example binary.
//!
//! [`OptimizerRule`]: datafusion::optimizer::OptimizerRule
//! [`PhysicalOptimizerRule`]: datafusion::physical_optimizer::PhysicalOptimizerRule
//! [`TableProvider`]: datafusion::datasource::TableProvider
//! [`TableProvider::statistics()`]: datafusion::datasource::TableProvider::statistics
//! [`ExecutionPlan`]: datafusion::physical_plan::ExecutionPlan
//! [`ExecutionPlan::statistics()`]: datafusion::physical_plan::ExecutionPlan::statistics
//! [`ColumnStatistics`]: datafusion::common::ColumnStatistics
//! [`Precision::Inexact`]: datafusion::common::stats::Precision::Inexact
#![deny(rustdoc::broken_intra_doc_links)]

pub mod optimizer_rule;
pub mod physical_plan;
pub mod stats_provider;
pub mod table_provider;

pub use optimizer_rule::SamkhyaOptimizerRule;
pub use physical_plan::SamkhyaStatsExec;
pub use stats_provider::to_datafusion_column_statistics;
pub use table_provider::SamkhyaTableProvider;