//! Dynamic-dispatch pipeline for composing transformers and estimators.
//!
//! A [`Pipeline`] chains zero or more transformer steps followed by a final
//! estimator step. Calling [`Fit::fit`] on a pipeline fits each step in
//! sequence, producing a [`FittedPipeline`] that implements [`Predict`].
//!
//! The pipeline is generic over the float type `F`, supporting both `f32`
//! and `f64` data. All steps in a pipeline must use the same float type.
//! The type parameter defaults to `f64` for backward compatibility.
//!
//! ## REQ status (per `.design/core/pipeline.md`, mirrors `sklearn/pipeline.py` @ 1.5.2)
//!
//! ferrolearn's `Pipeline` is a minimal subset of sklearn's: sequential
//! transformer fit→transform chaining + a single final estimator's fit/predict.
//!
//! | REQ | Status | Evidence |
//! |---|---|---|
//! | REQ-1 (fit→transform chaining + final predict) | SHIPPED | `Fit for Pipeline` (fit each transformer, transform, fit final estimator) mirrors `Pipeline._fit` (`pipeline.py:406`); `Predict for FittedPipeline` mirrors `Pipeline.predict` (`pipeline.py:599`). Non-test consumers: `impl PipelineEstimator for GaussianNB in gaussian.rs`, `impl PipelineEstimator for BernoulliNB in bernoulli.rs`, `impl PipelineTransformer for KernelPCA in kernel_pca.rs`. (critic: fit-then-transform ≡ sklearn fused fit_transform to ≤1.1e-14 on KernelPCA.) |
//! | REQ-2 (no-final-estimator rejected at fit) | SHIPPED | `Fit for Pipeline` returns `FerroError::InvalidParameter` when the estimator slot is unset; matches sklearn requiring a final predictor for `.predict` (`available_if` at `pipeline.py:549`). |
//! | REQ-3 (fit_transform/transform/predict_proba/decision_function/score) | SHIPPED | `Pipeline::fit_transform` (`Fit::fit` then `transform_through`) mirrors `Pipeline.fit_transform` (`pipeline.py:489`); `FittedPipeline::{transform, predict_proba, decision_function, score}` run the private `transform_through` loop (`pipeline.py:599-600`/`:719-720`/`:768-769`/`:999-1000`) then delegate to the final estimator. `predict_proba`/`decision_function`/`score` forward to the new default-`Err` trait methods `predict_proba_pipeline`/`decision_function_pipeline`/`score_pipeline` on `FittedPipelineEstimator` (the `available_if(_final_estimator_has(...))` analog, `pipeline.py:674`/`:731`/`:960`); `transform` returns the transformer-prefix output (sklearn raises `AttributeError` for a non-transformer-final `transform`, `:858`). Non-test consumer: `impl FittedPipelineEstimator for FittedGaussianNBPipeline in gaussian.rs` overrides `predict_proba_pipeline` (→ `predict_proba`) + `score_pipeline` (→ `score`). Live-oracle verification: `gaussian_pipeline_predict_proba_score_match_sklearn` (StandardScaler+GaussianNB pipeline matches sklearn `predict_proba`/`score`/`transform`) + core `test_pipeline_fit_transform_equals_transform`/`test_pipeline_predict_proba_and_score_override`/`test_pipeline_predict_proba_default_is_err`. |
//! | REQ-4a (named_steps / `__getitem__` int+str+slice) | SHIPPED | `Pipeline::{named_steps, get_step, get_step_by_name, named_step, into_slice}` + `FittedPipeline::{named_steps, get_step, get_step_by_name, named_step}` over the existing `transforms`/`estimator` storage; mirror `Pipeline.named_steps` (`pipeline.py:325` `return Bunch(**dict(self.steps))`), integer/string/slice `Pipeline.__getitem__` (`pipeline.py:298-318`). A step is returned as a `PipelineStepRef`/`FittedPipelineStepRef` enum (the heterogeneous-`(name, obj)`-list analog, since a ferrolearn step is EITHER a `PipelineTransformer` OR a `PipelineEstimator`). `into_slice` consumes `self` (the trait-object steps are not `Clone`, so the new sub-pipeline MOVES the contiguous range, vs sklearn's shallow object-sharing copy `:310`). Non-test consumer: pub API on the grandfathered `Pipeline`/`FittedPipeline` boundary types (S5). Live-oracle verification (R-CHAR-3, sklearn 1.5.2): `test_pipeline_named_steps_match_sklearn`, `test_pipeline_get_step_*`, `test_pipeline_into_slice_*`. |
//! | REQ-4b (get_params / set_params `<step>__<param>` nested protocol) | NOT-STARTED | blocker #362. The `PipelineTransformer`/`PipelineEstimator` trait objects expose NO `get_params`/reflection method, so the `_BaseComposition._get_params`/`_set_params` nested addressing (`pipeline.py:216`/`:237`) is not implementable without first adding a per-step reflection trait (e.g. `fn get_params(&self) -> BTreeMap<String, ParamValue>` on the step traits). Concrete blocker: a `get_params`/reflection method on the step traits. |
//! | REQ-5a (passthrough steps) | SHIPPED | `PassthroughTransformer`/`FittedPassthroughTransformer in pipeline.rs` are a reusable identity transformer (`impl PipelineTransformer` `fit_pipeline` → `Box::new(FittedPassthroughTransformer)`; `transform_pipeline(&self, x) → Ok(x.clone())`), the Rust analog of sklearn's `'passthrough'`/`None` step (`sklearn/pipeline.py:251`/`:266` `_validate_steps` allows it, `:275-290` `_iter(filter_passthrough=True)` skips it so `Xt` passes through, `:337` it stays visible in `named_steps`). ferrolearn types the transformer/estimator split, so the no-op IS a concrete identity transformer placed in the chain — no `filter_passthrough` loop branch needed (the step is genuinely identity). Non-test production consumer: the pub `Pipeline::passthrough_step` builder (the `('name','passthrough')` analog, delegating to `transform_step` with a `PassthroughTransformer`), plus the pub API on the `pub mod pipeline` surface (S5 — same boundary as `Pipeline`/`FeatureUnion`, not crate-root re-exported). Live-oracle (R-CHAR-3, sklearn 1.5.2): `Pipeline([('p','passthrough')]).fit(X).transform(X) == X`; passthrough before/after a transformer == the transformer alone; the step appears in `named_steps`/`steps`. Pinned by `test_passthrough_step_is_identity`, `test_passthrough_before_transformer_is_noop`, `test_passthrough_after_transformer_is_noop`, `test_passthrough_step_appears_in_step_names`, `test_passthrough_transformer_standalone_identity`, `test_passthrough_transformer_f32`. |
//! | REQ-5b (memory caching) | NOT-STARTED | blocker #363. No `memory=`/`check_memory`/`fit_transform_one_cached` transformer caching (`sklearn/pipeline.py:388-390`); requires a joblib disk-cache substrate with no ferrolearn analog yet. |
//! | REQ-6 (fit_params / metadata routing) | NOT-STARTED | blocker #364. |
//! | REQ-7 (make_pipeline auto-naming helper) | NOT-STARTED | blocker #365 (`pipeline.py:1220`). |
//! | REQ-8 (FeatureUnion) | SHIPPED | `FeatureUnion`/`FittedFeatureUnion` in `pipeline.rs`: `impl Fit<Array2<F>, ()> for FeatureUnion` fits each named sub-transformer on the SAME `x` (mirrors `FeatureUnion.fit` fitting every transformer on `X`, `pipeline.py:1643`/`:1681`) recording each output width; the fit also validates transformer-name uniqueness up front (mirrors `_validate_transformers` → `_validate_names`, `pipeline.py:1523-1525` → `sklearn/utils/metaestimators.py:81-83`): a duplicate name returns `FerroError::InvalidParameter` (sklearn's `ValueError: Names provided are not unique` analog) instead of fitting; `impl Transform<Array2<F>> for FittedFeatureUnion` transforms `x` through each and horizontally concatenates the column blocks left-to-right in list order (mirrors `FeatureUnion.transform` → `_hstack`, `pipeline.py:1770`/`:1812` `np.hstack(Xs)`); `FittedFeatureUnion::get_feature_names_out` prefixes each block's positional `x{j}` with `{name}__` (the `verbose_feature_names_out=True` default, `pipeline.py:1567`/`:1608-1616`). Non-test consumer: the pub API on the `pub mod pipeline` surface (S5 — the same boundary the grandfathered `Pipeline`/`FittedPipeline` types live on; neither is crate-root re-exported). Live-oracle (sklearn 1.5.2): `FeatureUnion([('ss',StandardScaler()),('mm',MinMaxScaler())])` on `[[1,2],[3,4],[5,6]]` → `(3,4)` with column blocks `[ss|mm]` and names `['ss__x0','ss__x1','mm__x0','mm__x1']`. NOT-STARTED (no ferrolearn analog yet): `transformer_weights` per-output scaling (`pipeline.py:1369`), the `'drop'`/`'passthrough'` sentinels (`:1530`/`:1563`), `n_jobs` parallelism (`:1360`), metadata routing (`:1859`), `verbose_feature_names_out=False` non-prefixed mode (`:1618-1641`), and the ferray substrate (typed on `ndarray::{Array1,Array2}`). |
//! | REQ-9 (ferray substrate) | NOT-STARTED | blocker #367 — data flow typed on `ndarray::{Array1,Array2}`; cascades (R-SUBSTRATE-4). |
//!
//! acto-critic verdict: NO DIVERGENCE FOUND in the implemented surface (chaining,
//! y-threading, estimator-only predict, and the REQ-3 apply methods
//! — `fit_transform`/`transform`/`predict_proba`/`decision_function`/`score` —
//! all match the live sklearn oracle; `transform` over a non-transformer-final
//! pipeline returns the transformer-prefix output, the structural analog of
//! sklearn's `available_if(_can_transform)` `AttributeError`). Two states only
//! per goal.md R-DEFER-2.
//!
//! # Examples
//!
//! ```
//! use ferrolearn_core::pipeline::{
//!     FittedPipelineEstimator, FittedPipelineTransformer, Pipeline, PipelineEstimator,
//!     PipelineTransformer,
//! };
//! use ferrolearn_core::{FerroError, Fit, Predict};
//! use ndarray::{Array1, Array2};
//!
//! // A trivial identity transformer for demonstration.
//! struct IdentityTransformer;
//!
//! impl PipelineTransformer<f64> for IdentityTransformer {
//!     fn fit_pipeline(
//!         &self,
//!         _x: &Array2<f64>,
//!         _y: &Array1<f64>,
//!     ) -> Result<Box<dyn FittedPipelineTransformer<f64>>, FerroError> {
//!         Ok(Box::new(FittedIdentity))
//!     }
//! }
//!
//! struct FittedIdentity;
//!
//! impl FittedPipelineTransformer<f64> for FittedIdentity {
//!     fn transform_pipeline(&self, x: &Array2<f64>) -> Result<Array2<f64>, FerroError> {
//!         Ok(x.clone())
//!     }
//! }
//!
//! // A trivial estimator that predicts the first column.
//! struct FirstColumnEstimator;
//!
//! impl PipelineEstimator<f64> for FirstColumnEstimator {
//!     fn fit_pipeline(
//!         &self,
//!         _x: &Array2<f64>,
//!         _y: &Array1<f64>,
//!     ) -> Result<Box<dyn FittedPipelineEstimator<f64>>, FerroError> {
//!         Ok(Box::new(FittedFirstColumn))
//!     }
//! }
//!
//! struct FittedFirstColumn;
//!
//! impl FittedPipelineEstimator<f64> for FittedFirstColumn {
//!     fn predict_pipeline(&self, x: &Array2<f64>) -> Result<Array1<f64>, FerroError> {
//!         Ok(x.column(0).to_owned())
//!     }
//! }
//!
//! // Build and use the pipeline.
//! let pipeline = Pipeline::new()
//!     .transform_step("scaler", Box::new(IdentityTransformer))
//!     .estimator_step("model", Box::new(FirstColumnEstimator));
//!
//! let x = Array2::<f64>::zeros((5, 3));
//! let y = Array1::<f64>::zeros(5);
//!
//! let fitted = pipeline.fit(&x, &y).unwrap();
//! let preds = fitted.predict(&x).unwrap();
//! assert_eq!(preds.len(), 5);
//! ```
use ndarray::{Array1, Array2};
use num_traits::Float;
use crate::dataset::check_consistent_length;
use crate::error::FerroError;
use crate::traits::{Fit, Predict, Transform};
// ---------------------------------------------------------------------------
// Trait-object interfaces for pipeline steps
// ---------------------------------------------------------------------------
/// An unfitted transformer step that can participate in a [`Pipeline`].
///
/// Implementors must be able to fit themselves on `Array2<F>` data and
/// return a boxed [`FittedPipelineTransformer`].
///
/// The type parameter `F` is the float type (`f32` or `f64`).
pub trait PipelineTransformer<F: Float + Send + Sync + 'static>: Send + Sync {
/// Fit this transformer on the given data.
///
/// # Errors
///
/// Returns a [`FerroError`] if fitting fails.
fn fit_pipeline(
&self,
x: &Array2<F>,
y: &Array1<F>,
) -> Result<Box<dyn FittedPipelineTransformer<F>>, FerroError>;
}
/// A fitted transformer step in a [`FittedPipeline`].
///
/// Transforms `Array2<F>` data, producing a new `Array2<F>`.
pub trait FittedPipelineTransformer<F: Float + Send + Sync + 'static>: Send + Sync {
/// Transform the input data.
///
/// # Errors
///
/// Returns a [`FerroError`] if the input shape is incompatible.
fn transform_pipeline(&self, x: &Array2<F>) -> Result<Array2<F>, FerroError>;
}
/// An unfitted estimator step that serves as the final step in a [`Pipeline`].
///
/// Implementors must be able to fit themselves on `Array2<F>` data and
/// return a boxed [`FittedPipelineEstimator`].
pub trait PipelineEstimator<F: Float + Send + Sync + 'static>: Send + Sync {
/// Fit this estimator on the given data.
///
/// # Errors
///
/// Returns a [`FerroError`] if fitting fails.
fn fit_pipeline(
&self,
x: &Array2<F>,
y: &Array1<F>,
) -> Result<Box<dyn FittedPipelineEstimator<F>>, FerroError>;
}
/// A fitted estimator step in a [`FittedPipeline`].
///
/// Produces `Array1<F>` predictions from `Array2<F>` input.
///
/// The three delegating methods below — `predict_proba_pipeline`,
/// `decision_function_pipeline`, `score_pipeline` — mirror the way sklearn's
/// `Pipeline` forwards to the final estimator's `predict_proba` /
/// `decision_function` / `score` (`sklearn/pipeline.py:675`, `:731`, `:961`).
/// scikit-learn gates each pipeline method on the final estimator actually
/// having the attribute via `available_if(_final_estimator_has(...))`
/// (`sklearn/pipeline.py:674`, `:731`, `:960`); a final estimator that lacks
/// the method raises `AttributeError`. ferrolearn cannot express
/// `available_if` over a trait object, so each method ships a DEFAULT impl that
/// returns [`FerroError::InvalidParameter`] (the closest analog of sklearn's
/// `AttributeError`). A concrete estimator that DOES support the operation
/// overrides the corresponding method.
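///
/// The default-`Err` pattern described above can be sketched in isolation. The
/// block below is a minimal standalone illustration, not this crate's code:
/// `FittedModel`, `Unsupported`, `ZeroModel`, and `HalfModel` are hypothetical
/// names, and plain slices stand in for `Array2<F>`.
///
/// ```rust
/// // Hypothetical error type standing in for `FerroError::InvalidParameter`.
/// #[derive(Debug, PartialEq)]
/// pub struct Unsupported(pub &'static str);
///
/// // Hypothetical fitted-model trait: `predict` is required, while
/// // `predict_proba` ships a default that reports "unsupported".
/// pub trait FittedModel {
///     fn predict(&self, x: &[f64]) -> Vec<f64>;
///
///     fn predict_proba(&self, _x: &[f64]) -> Result<Vec<f64>, Unsupported> {
///         Err(Unsupported("predict_proba"))
///     }
/// }
///
/// // A model that only predicts: it inherits the default `Err`.
/// pub struct ZeroModel;
///
/// impl FittedModel for ZeroModel {
///     fn predict(&self, x: &[f64]) -> Vec<f64> {
///         vec![0.0; x.len()]
///     }
/// }
///
/// // A model that also supports probabilities: it overrides the default.
/// pub struct HalfModel;
///
/// impl FittedModel for HalfModel {
///     fn predict(&self, x: &[f64]) -> Vec<f64> {
///         vec![1.0; x.len()]
///     }
///
///     fn predict_proba(&self, x: &[f64]) -> Result<Vec<f64>, Unsupported> {
///         Ok(vec![0.5; x.len()])
///     }
/// }
/// ```
///
/// In this sketch, calling `predict_proba` on `ZeroModel` yields the default
/// `Err`, while `HalfModel` returns its own probabilities. The caller sees one
/// uniform method and learns about missing support at run time.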
pub trait FittedPipelineEstimator<F: Float + Send + Sync + 'static>: Send + Sync {
/// Generate predictions for the input data.
///
/// # Errors
///
/// Returns a [`FerroError`] if the input shape is incompatible.
fn predict_pipeline(&self, x: &Array2<F>) -> Result<Array1<F>, FerroError>;
/// Class-probability estimates for the input data, shape
/// `(n_samples, n_classes)`.
///
/// Mirrors the final-estimator delegation of `Pipeline.predict_proba`
/// (`sklearn/pipeline.py:721`: `self.steps[-1][1].predict_proba(Xt)`).
///
/// # Errors
///
/// The default implementation returns [`FerroError::InvalidParameter`] —
/// the analog of sklearn raising `AttributeError` when the final estimator
/// has no `predict_proba`. Estimators that support probability estimates
/// override this method.
fn predict_proba_pipeline(&self, x: &Array2<F>) -> Result<Array2<F>, FerroError> {
let _ = x;
Err(FerroError::InvalidParameter {
name: "predict_proba".into(),
reason: "the final estimator of this pipeline does not support predict_proba".into(),
})
}
/// Confidence scores (decision function) for the input data, shape
/// `(n_samples, n_classes)` (or `(n_samples,)` for binary, per the
/// estimator's contract).
///
/// Mirrors the final-estimator delegation of `Pipeline.decision_function`
/// (`sklearn/pipeline.py:772`: `self.steps[-1][1].decision_function(Xt)`).
///
/// # Errors
///
/// The default implementation returns [`FerroError::InvalidParameter`] —
/// the analog of sklearn raising `AttributeError` when the final estimator
/// has no `decision_function`. Estimators that expose a decision function
/// override this method.
fn decision_function_pipeline(&self, x: &Array2<F>) -> Result<Array2<F>, FerroError> {
let _ = x;
Err(FerroError::InvalidParameter {
name: "decision_function".into(),
reason: "the final estimator of this pipeline does not support decision_function"
.into(),
})
}
/// Score the final estimator on `(x, y)`, returning a single scalar
/// (e.g. mean accuracy for a classifier, R² for a regressor).
///
/// Mirrors the final-estimator delegation of `Pipeline.score`
/// (`sklearn/pipeline.py:1004`: `self.steps[-1][1].score(Xt, y)`).
///
/// # Errors
///
/// The default implementation returns [`FerroError::InvalidParameter`] —
/// the analog of sklearn raising `AttributeError` when the final estimator
/// has no `score`. Estimators that support scoring override this method.
fn score_pipeline(&self, x: &Array2<F>, y: &Array1<F>) -> Result<F, FerroError> {
let _ = (x, y);
Err(FerroError::InvalidParameter {
name: "score".into(),
reason: "the final estimator of this pipeline does not support score".into(),
})
}
}
// ---------------------------------------------------------------------------
// Pipeline (unfitted)
// ---------------------------------------------------------------------------
/// A named transformer step in an unfitted pipeline.
struct TransformStep<F: Float + Send + Sync + 'static> {
/// Human-readable name for this step.
name: String,
/// The unfitted transformer.
step: Box<dyn PipelineTransformer<F>>,
}
/// A borrowed reference to a single step of an unfitted [`Pipeline`].
///
/// sklearn's `Pipeline.steps` is a flat list of `(name, obj)` tuples where
/// every `obj` is duck-typed; `Pipeline.__getitem__` with an integer or string
/// returns that single `obj` (`sklearn/pipeline.py:298-318`). ferrolearn encodes
/// the transformer/estimator distinction in the type system, so a "step" is
/// EITHER a [`PipelineTransformer`] OR a [`PipelineEstimator`]. This enum is the
/// heterogeneous-step analog: the variant tells the caller which kind of step
/// they reached, mirroring sklearn returning the underlying object.
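///
/// How a caller might branch on such an enum can be sketched with standalone
/// stand-ins. Everything in the block below (`Transformer`, `Estimator`,
/// `StepRef`, `describe`, `Scaler`, `Model`) is hypothetical and only
/// illustrates the borrowed-trait-object enum shape, not this crate's API.
///
/// ```rust
/// // Hypothetical stand-ins for the two step traits.
/// pub trait Transformer {
///     fn label(&self) -> &'static str;
/// }
///
/// pub trait Estimator {
///     fn label(&self) -> &'static str;
/// }
///
/// // A borrowed step is exactly one of the two kinds.
/// pub enum StepRef<'a> {
///     Transformer(&'a dyn Transformer),
///     Estimator(&'a dyn Estimator),
/// }
///
/// // The variant tells the caller which kind of step was reached.
/// pub fn describe(step: &StepRef<'_>) -> String {
///     match step {
///         StepRef::Transformer(t) => format!("transformer:{}", t.label()),
///         StepRef::Estimator(e) => format!("estimator:{}", e.label()),
///     }
/// }
///
/// pub struct Scaler;
///
/// impl Transformer for Scaler {
///     fn label(&self) -> &'static str {
///         "scaler"
///     }
/// }
///
/// pub struct Model;
///
/// impl Estimator for Model {
///     fn label(&self) -> &'static str {
///         "model"
///     }
/// }
/// ```
///
/// Because the match is exhaustive, adding a third step kind to such an enum
/// would force every consumer to handle it at compile time.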
pub enum PipelineStepRef<'a, F: Float + Send + Sync + 'static> {
/// A transformer step (an intermediate step of the pipeline).
Transformer(&'a dyn PipelineTransformer<F>),
/// The final estimator step.
Estimator(&'a dyn PipelineEstimator<F>),
}
/// A dynamic-dispatch pipeline that composes transformers and a final estimator.
///
/// Steps are added with [`transform_step`](Pipeline::transform_step) and the
/// final estimator is set with [`estimator_step`](Pipeline::estimator_step).
/// The pipeline implements [`Fit<Array2<F>, Array1<F>>`](Fit) and produces
/// a [`FittedPipeline`] that implements [`Predict<Array2<F>>`](Predict).
///
/// All intermediate data flows as `Array2<F>`. The type parameter defaults
/// to `f64` for backward compatibility.
pub struct Pipeline<F: Float + Send + Sync + 'static = f64> {
/// Ordered transformer steps.
transforms: Vec<TransformStep<F>>,
/// The final estimator step (name + estimator).
estimator: Option<(String, Box<dyn PipelineEstimator<F>>)>,
}
impl<F: Float + Send + Sync + 'static> Pipeline<F> {
/// Create a new empty pipeline.
///
/// # Examples
///
/// ```
/// use ferrolearn_core::pipeline::Pipeline;
/// let pipeline = Pipeline::<f64>::new();
/// ```
pub fn new() -> Self {
Self {
transforms: Vec::new(),
estimator: None,
}
}
/// Add a named transformer step to the pipeline.
///
/// Transformer steps are applied in the order they are added, before
/// the final estimator step.
#[must_use]
pub fn transform_step(mut self, name: &str, step: Box<dyn PipelineTransformer<F>>) -> Self {
self.transforms.push(TransformStep {
name: name.to_owned(),
step,
});
self
}
/// Add a named `'passthrough'` (identity no-op) transformer step.
///
/// This is the ergonomic analog of an sklearn `('name', 'passthrough')` step:
/// a transformer that leaves the running data unchanged but is still a real,
/// named step (visible in [`step_names`](Pipeline::step_names) /
/// [`named_steps`](Pipeline::named_steps)). It delegates to
/// [`transform_step`](Pipeline::transform_step) with a
/// [`PassthroughTransformer`], so a passthrough step placed anywhere in the
/// chain is a genuine no-op: fitting it learns nothing and transforming
/// returns the running data unchanged. The effect matches sklearn's
/// `_iter(filter_passthrough=True)` skipping `'passthrough'`
/// (`sklearn/pipeline.py:289`) while `named_steps`/`__getitem__` still show it
/// (`:337`).
#[must_use]
pub fn passthrough_step(self, name: &str) -> Self {
self.transform_step(name, Box::new(PassthroughTransformer::<F>::new()))
}
/// Set the final estimator step.
///
/// A pipeline must have exactly one estimator step. Setting a new
/// estimator replaces any previously set estimator.
#[must_use]
pub fn estimator_step(mut self, name: &str, estimator: Box<dyn PipelineEstimator<F>>) -> Self {
self.estimator = Some((name.to_owned(), estimator));
self
}
/// Add a named step to the pipeline using the builder pattern.
///
/// This is a convenience method that accepts either a transformer or
/// an estimator. If more than one estimator is added this way, the last
/// one becomes the pipeline's estimator. This provides the
/// `Pipeline::new().step("scaler", ...).step("clf", ...)` API.
#[must_use]
pub fn step(self, name: &str, step: Box<dyn PipelineStep<F>>) -> Self {
step.add_to_pipeline(self, name)
}
/// Fit the pipeline and return both the [`FittedPipeline`] and the data
/// after every transformer step has been applied.
///
/// This mirrors `Pipeline.fit_transform` (`sklearn/pipeline.py:489-547`):
/// `Xt = self._fit(X, y)` fits each transformer on the running `Xt` and
/// applies it, then the result is the transformed data. sklearn ALSO calls
/// the final estimator's `fit_transform`/`fit().transform()` when the final
/// step is itself a transformer (`:540-547`); ferrolearn's final slot is a
/// non-transformer estimator, so — like its [`FittedPipeline::transform`] —
/// `fit_transform` returns the data after the transformer prefix, with the
/// estimator still fit (as in `fit`). The returned `Array2<F>` equals
/// [`FittedPipeline::transform`] applied to the same `x` (fit-then-transform
/// ≡ sklearn's fused `fit_transform`, established for REQ-1).
///
/// # Errors
///
/// Returns [`FerroError::InvalidParameter`] if no estimator step was set
/// (delegates to [`Fit::fit`]). Propagates any errors from individual step
/// fitting or transforming.
pub fn fit_transform(
&self,
x: &Array2<F>,
y: &Array1<F>,
) -> Result<(FittedPipeline<F>, Array2<F>), FerroError> {
let fitted = self.fit(x, y)?;
let transformed = fitted.transform_through(x)?;
Ok((fitted, transformed))
}
/// Number of steps in the pipeline (transformer steps plus the final
/// estimator, if set).
///
/// Mirrors `Pipeline.__len__` (`sklearn/pipeline.py:292-296`:
/// `return len(self.steps)`).
#[must_use]
pub fn len(&self) -> usize {
self.transforms.len() + usize::from(self.estimator.is_some())
}
/// Returns `true` if the pipeline has no steps at all.
#[must_use]
pub fn is_empty(&self) -> bool {
self.len() == 0
}
/// Returns the names of all steps (transformers, then the estimator if set)
/// in pipeline order.
///
/// Mirrors the key ordering of `Pipeline.named_steps`
/// (`sklearn/pipeline.py:325`: `Bunch(**dict(self.steps))` keyed by step
/// name in `steps` order).
#[must_use]
pub fn step_names(&self) -> Vec<&str> {
let mut names: Vec<&str> = self.transforms.iter().map(|s| s.name.as_str()).collect();
if let Some((name, _)) = &self.estimator {
names.push(name.as_str());
}
names
}
/// Access every step by its name, in pipeline order, as a
/// `(name, step)` list.
///
/// This is the trait-object analog of sklearn's `Pipeline.named_steps`,
/// which returns a `Bunch(**dict(self.steps))` — a name→step mapping
/// (`sklearn/pipeline.py:325`). Every step (each transformer, then the final
/// estimator if set) is reachable by its construction name. ferrolearn
/// returns an ordered `Vec` of `(name, PipelineStepRef)` rather than a hash
/// map so the pipeline order is preserved and the heterogeneous
/// transformer/estimator kinds are distinguishable.
#[must_use]
pub fn named_steps(&self) -> Vec<(&str, PipelineStepRef<'_, F>)> {
let mut steps: Vec<(&str, PipelineStepRef<'_, F>)> = self
.transforms
.iter()
.map(|s| {
(
s.name.as_str(),
PipelineStepRef::Transformer(s.step.as_ref()),
)
})
.collect();
if let Some((name, est)) = &self.estimator {
steps.push((name.as_str(), PipelineStepRef::Estimator(est.as_ref())));
}
steps
}
/// Look up a single step by name.
///
/// This is the string-key arm of sklearn's `Pipeline.__getitem__`
/// (`sklearn/pipeline.py:317`: `return self.named_steps[ind]`), which raises
/// `KeyError` for an unknown name; ferrolearn returns `None` (R-CODE-2: no
/// panic).
#[must_use]
pub fn named_step(&self, name: &str) -> Option<PipelineStepRef<'_, F>> {
if let Some(ts) = self.transforms.iter().find(|s| s.name == name) {
return Some(PipelineStepRef::Transformer(ts.step.as_ref()));
}
match &self.estimator {
Some((est_name, est)) if est_name == name => {
Some(PipelineStepRef::Estimator(est.as_ref()))
}
_ => None,
}
}
/// Get the step at position `index` (0-based, transformer steps first then
/// the final estimator).
///
/// This is the integer arm of sklearn's `Pipeline.__getitem__`
/// (`sklearn/pipeline.py:313-318`: `name, est = self.steps[ind]; return
/// est`), which raises `IndexError` out of range; ferrolearn returns `None`
/// (R-CODE-2: no panic).
#[must_use]
pub fn get_step(&self, index: usize) -> Option<PipelineStepRef<'_, F>> {
let n_transforms = self.transforms.len();
if index < n_transforms {
return Some(PipelineStepRef::Transformer(
self.transforms[index].step.as_ref(),
));
}
if index == n_transforms
&& let Some((_, est)) = &self.estimator
{
return Some(PipelineStepRef::Estimator(est.as_ref()));
}
None
}
/// Look up a single step by name (alias of [`named_step`](Pipeline::named_step)).
///
/// Provided for symmetry with [`get_step`](Pipeline::get_step); mirrors the
/// string arm of `Pipeline.__getitem__` (`sklearn/pipeline.py:317`).
#[must_use]
pub fn get_step_by_name(&self, name: &str) -> Option<PipelineStepRef<'_, F>> {
self.named_step(name)
}
/// Build a sub-pipeline from the contiguous step range `[start, end)`,
/// consuming `self`.
///
/// This is the slice arm of sklearn's `Pipeline.__getitem__`
/// (`sklearn/pipeline.py:307-312`): `pipe[a:b]` returns
/// `Pipeline(self.steps[a:b], ...)` — a new pipeline over the contiguous
/// step range. sklearn slicing supports only a step of 1
/// (`:308-309`, otherwise `ValueError`); a contiguous Rust range is the step-1
/// analog by construction.
///
/// The sliced steps are addressed in the unified order
/// (transformer steps `0..n_transforms`, then the estimator at
/// `n_transforms` if set), matching [`get_step`](Pipeline::get_step). A slice
/// that includes the estimator index keeps it as the final estimator; a slice
/// of only transformer indices yields an estimator-less pipeline (valid to
/// build, errors only at `fit` — mirroring sklearn, where `pipe[:k]` for a
/// transformer-only range is a `Pipeline` that simply lacks `.predict`).
///
/// # Divergence from sklearn
///
/// sklearn's slice is a SHALLOW copy that shares the underlying estimator
/// objects with the original pipeline (`sklearn/pipeline.py:303-305`). The
/// ferrolearn step trait objects are not `Clone`, so this method MOVES the
/// selected boxed steps into the new pipeline and therefore consumes `self`.
/// Slicing a [`FittedPipeline`] is NOT implemented for the same reason (the
/// fitted step trait objects are not `Clone`); it is NOT-STARTED under
/// blocker #362.
///
/// Out-of-range bounds CLAMP and `start >= end` yields an empty pipeline.
/// These are Python list-slice semantics: the slice arm of sklearn's
/// `Pipeline.__getitem__` slices `self.steps[ind]` (`pipeline.py:307-312`), and
/// an ordinary Python slice never raises on out-of-range bounds (#2235). So
/// `into_slice(0, 100)` on 3 steps → all 3, `into_slice(5, 100)` → empty,
/// `into_slice(2, 1)` → empty. This is a TOTAL function (it cannot fail).
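///
/// The clamping rule can be sketched on bare indices. `clamp_range` below is a
/// hypothetical standalone helper, not part of this crate; it resolves the
/// bounds up front instead of filtering per index as `into_slice` does, under
/// the assumption that both approaches select the same steps.
///
/// ```rust
/// use std::ops::Range;
///
/// // Hypothetical helper: resolve `[start, end)` against `len` steps the way a
/// // Python list slice with non-negative bounds would.
/// pub fn clamp_range(len: usize, start: usize, end: usize) -> Range<usize> {
///     // An `end` past the length is clamped to the length.
///     let end = end.min(len);
///     // A `start` at or past `end` collapses to an empty range.
///     let start = start.min(end);
///     start..end
/// }
/// ```
///
/// With 3 steps, `clamp_range(3, 0, 100)` is `0..3`, while both
/// `clamp_range(3, 5, 100)` and `clamp_range(3, 2, 1)` are empty ranges.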
#[must_use]
pub fn into_slice(self, start: usize, end: usize) -> Pipeline<F> {
let n_steps = self.len();
// Python slice clamping: `end` past the length is clamped to the length;
// a `start >= end` (incl. start past the length) yields an empty range
// via the `idx >= start && idx < end` filter below.
let end = end.min(n_steps);
let Pipeline {
transforms,
estimator,
} = self;
let n_transforms = transforms.len();
let mut new_transforms = Vec::new();
let mut new_estimator = None;
for (idx, ts) in transforms.into_iter().enumerate() {
if idx >= start && idx < end {
new_transforms.push(ts);
}
}
// The estimator (if set) sits at unified index `n_transforms`.
if let Some(est) = estimator
&& n_transforms >= start
&& n_transforms < end
{
new_estimator = Some(est);
}
Pipeline {
transforms: new_transforms,
estimator: new_estimator,
}
}
}
impl<F: Float + Send + Sync + 'static> Default for Pipeline<F> {
fn default() -> Self {
Self::new()
}
}
impl<F: Float + Send + Sync + 'static> Fit<Array2<F>, Array1<F>> for Pipeline<F> {
type Fitted = FittedPipeline<F>;
type Error = FerroError;
/// Fit the pipeline by fitting each transformer step in order, then
/// fitting the final estimator on the transformed data.
///
/// Each transformer is fit on the current data, then the data is
/// transformed before being passed to the next step.
///
/// Before fitting any step, the pipeline validates that `x` and `y` have a
/// consistent number of samples via
/// [`check_consistent_length`](crate::dataset::check_consistent_length),
/// mirroring scikit-learn's `Pipeline.fit`, which runs every step through
/// input validation (`check_X_y` → `check_consistent_length`,
/// `sklearn/utils/validation.py:1320`) and rejects `X`/`y` with mismatched
/// `n_samples` before fitting (`sklearn/pipeline.py:406` `_fit`). A pipeline
/// therefore rejects inconsistent `X`/`y` up front rather than failing
/// inside a step's `fit_pipeline`.
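///
/// The fit-then-transform chaining can be sketched without the pipeline types.
/// The block below is a simplified standalone illustration: `Center`,
/// `FittedCenter`, and `fit_chain` are hypothetical names, a `Vec<f64>` stands
/// in for `Array2<F>`, and the final estimator and `y` are left out.
///
/// ```rust
/// // Hypothetical unfitted step: fitting learns the mean of the data it sees.
/// pub struct Center;
///
/// // Hypothetical fitted step: subtracts the learned mean.
/// pub struct FittedCenter {
///     pub mean: f64,
/// }
///
/// impl Center {
///     pub fn fit(&self, x: &[f64]) -> FittedCenter {
///         let mean = if x.is_empty() {
///             0.0
///         } else {
///             x.iter().sum::<f64>() / x.len() as f64
///         };
///         FittedCenter { mean }
///     }
/// }
///
/// impl FittedCenter {
///     pub fn transform(&self, x: &[f64]) -> Vec<f64> {
///         x.iter().map(|v| v - self.mean).collect()
///     }
/// }
///
/// // Fit each step on the running data, then transform that data before
/// // handing it to the next step.
/// pub fn fit_chain(steps: &[Center], x: &[f64]) -> (Vec<FittedCenter>, Vec<f64>) {
///     let mut current = x.to_vec();
///     let mut fitted = Vec::with_capacity(steps.len());
///     for step in steps {
///         let f = step.fit(&current);
///         current = f.transform(&current);
///         fitted.push(f);
///     }
///     (fitted, current)
/// }
/// ```
///
/// In this sketch the second `Center` is fit on already-centered data, so it
/// learns a mean of zero. That is the point of the running-data loop: each
/// step sees the output of the steps before it, not the raw input.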
///
/// # Errors
///
/// Returns [`FerroError::InvalidParameter`] if no estimator step was set, or
/// [`FerroError::ShapeMismatch`] if `x.nrows() != y.len()`. Propagates any
/// errors from individual step fitting or transforming.
fn fit(&self, x: &Array2<F>, y: &Array1<F>) -> Result<FittedPipeline<F>, FerroError> {
if self.estimator.is_none() {
return Err(FerroError::InvalidParameter {
name: "estimator".into(),
reason: "pipeline must have a final estimator step".into(),
});
}
// sklearn validates X/y sample-count consistency before fitting any
// step (`check_consistent_length`, `sklearn/utils/validation.py:1320`).
check_consistent_length(x.nrows(), y.len())?;
let mut current_x = x.clone();
let mut fitted_transforms = Vec::with_capacity(self.transforms.len());
// Fit and transform each transformer step.
for ts in &self.transforms {
let fitted = ts.step.fit_pipeline(&current_x, y)?;
current_x = fitted.transform_pipeline(&current_x)?;
fitted_transforms.push(FittedTransformStep {
name: ts.name.clone(),
step: fitted,
});
}
// Fit the final estimator on the transformed data.
let (est_name, est) = self.estimator.as_ref().unwrap();
let fitted_est = est.fit_pipeline(&current_x, y)?;
Ok(FittedPipeline {
transforms: fitted_transforms,
estimator: (est_name.clone(), fitted_est),
})
}
}
// ---------------------------------------------------------------------------
// FittedPipeline
// ---------------------------------------------------------------------------
/// A named fitted transformer step.
struct FittedTransformStep<F: Float + Send + Sync + 'static> {
/// Human-readable name for this step.
name: String,
/// The fitted transformer.
step: Box<dyn FittedPipelineTransformer<F>>,
}
/// A borrowed reference to a single step of a [`FittedPipeline`].
///
/// The fitted analog of [`PipelineStepRef`]: a fitted step is EITHER a
/// [`FittedPipelineTransformer`] (an intermediate step) OR the
/// [`FittedPipelineEstimator`] (the final step). Returned by the
/// `FittedPipeline` `named_steps` / `get_step` / `named_step` accessors, the
/// fitted analog of sklearn's `Pipeline.__getitem__` over a fitted pipeline
/// (`sklearn/pipeline.py:298-318`).
pub enum FittedPipelineStepRef<'a, F: Float + Send + Sync + 'static> {
/// A fitted transformer step.
Transformer(&'a dyn FittedPipelineTransformer<F>),
/// The fitted final estimator step.
Estimator(&'a dyn FittedPipelineEstimator<F>),
}
/// A fitted pipeline that chains fitted transformers and a fitted estimator.
///
/// Created by calling [`Fit::fit`] on a [`Pipeline`]. Implements
/// [`Predict<Array2<F>>`](Predict), producing `Array1<F>` predictions.
pub struct FittedPipeline<F: Float + Send + Sync + 'static = f64> {
/// Fitted transformer steps, in order.
transforms: Vec<FittedTransformStep<F>>,
/// The fitted estimator (name + estimator).
estimator: (String, Box<dyn FittedPipelineEstimator<F>>),
}
impl<F: Float + Send + Sync + 'static> FittedPipeline<F> {
/// Returns the names of all steps (transformers + estimator) in order.
pub fn step_names(&self) -> Vec<&str> {
let mut names: Vec<&str> = self.transforms.iter().map(|s| s.name.as_str()).collect();
names.push(&self.estimator.0);
names
}
/// Number of steps in the fitted pipeline (every transformer step plus the
/// final estimator).
///
/// Mirrors `Pipeline.__len__` (`sklearn/pipeline.py:292-296`). A
/// `FittedPipeline` always has exactly one final estimator (the type
/// guarantees it), so this is never zero.
#[must_use]
pub fn len(&self) -> usize {
self.transforms.len() + 1
}
/// Always `false`: a fitted pipeline always has at least its final
/// estimator step.
#[must_use]
pub fn is_empty(&self) -> bool {
false
}
/// Access every fitted step by its name, in pipeline order, as a
/// `(name, step)` list.
///
/// The fitted analog of sklearn's `Pipeline.named_steps`
/// (`sklearn/pipeline.py:325`: `Bunch(**dict(self.steps))`) — every fitted
/// step (each transformer, then the final estimator) is reachable by its
/// construction name, in pipeline order.
#[must_use]
pub fn named_steps(&self) -> Vec<(&str, FittedPipelineStepRef<'_, F>)> {
let mut steps: Vec<(&str, FittedPipelineStepRef<'_, F>)> = self
.transforms
.iter()
.map(|s| {
(
s.name.as_str(),
FittedPipelineStepRef::Transformer(s.step.as_ref()),
)
})
.collect();
steps.push((
self.estimator.0.as_str(),
FittedPipelineStepRef::Estimator(self.estimator.1.as_ref()),
));
steps
}
/// Look up a single fitted step by name.
///
/// The fitted analog of the string arm of `Pipeline.__getitem__`
/// (`sklearn/pipeline.py:317`); returns `None` for an unknown name (R-CODE-2:
/// no panic, vs sklearn's `KeyError`).
#[must_use]
pub fn named_step(&self, name: &str) -> Option<FittedPipelineStepRef<'_, F>> {
if let Some(ts) = self.transforms.iter().find(|s| s.name == name) {
return Some(FittedPipelineStepRef::Transformer(ts.step.as_ref()));
}
if self.estimator.0 == name {
return Some(FittedPipelineStepRef::Estimator(self.estimator.1.as_ref()));
}
None
}
/// Get the fitted step at position `index` (0-based, transformer steps
/// first then the final estimator).
///
/// The fitted analog of the integer arm of `Pipeline.__getitem__`
/// (`sklearn/pipeline.py:313-318`); returns `None` when `index` is out of
/// range (R-CODE-2: no panic, vs sklearn's `IndexError`).
#[must_use]
pub fn get_step(&self, index: usize) -> Option<FittedPipelineStepRef<'_, F>> {
let n_transforms = self.transforms.len();
if index < n_transforms {
return Some(FittedPipelineStepRef::Transformer(
self.transforms[index].step.as_ref(),
));
}
if index == n_transforms {
return Some(FittedPipelineStepRef::Estimator(self.estimator.1.as_ref()));
}
None
}
/// Look up a single fitted step by name (alias of
/// [`named_step`](FittedPipeline::named_step)).
///
/// Mirrors the string arm of `Pipeline.__getitem__`
/// (`sklearn/pipeline.py:317`).
#[must_use]
pub fn get_step_by_name(&self, name: &str) -> Option<FittedPipelineStepRef<'_, F>> {
self.named_step(name)
}
/// Run `x` through every fitted transformer step in order, returning the
/// fully transformed data (the data the final estimator sees).
///
/// This is the shared `for ...: Xt = transform.transform(Xt)` loop of
/// sklearn's `Pipeline.predict` / `predict_proba` / `decision_function` /
/// `score` (`sklearn/pipeline.py:599-600`, `:719-720`, `:768-769`,
/// `:999-1000`), which run the data through every non-final transformer
/// before delegating to the final estimator.
///
/// # Errors
///
/// Propagates any [`FerroError`] from an individual transformer step.
fn transform_through(&self, x: &Array2<F>) -> Result<Array2<F>, FerroError> {
let mut current_x = x.clone();
for ts in &self.transforms {
current_x = ts.step.transform_pipeline(&current_x)?;
}
Ok(current_x)
}
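/*
The chaining loop in `transform_through` can be illustrated without `ndarray`
or the crate's traits. The following is a minimal standalone sketch, not this
module's API: `FittedStep`, `Scale`, `AlwaysFails` and the `String` error type
are hypothetical stand-ins chosen for illustration only.

```rust
/// Hypothetical stand-in for a fitted transformer step.
trait FittedStep {
    fn transform(&self, x: &[Vec<f64>]) -> Result<Vec<Vec<f64>>, String>;
}

/// Example step: multiplies every value by a constant factor.
struct Scale(f64);

impl FittedStep for Scale {
    fn transform(&self, x: &[Vec<f64>]) -> Result<Vec<Vec<f64>>, String> {
        Ok(x.iter()
            .map(|row| row.iter().map(|v| v * self.0).collect())
            .collect())
    }
}

/// Example step: always fails, to show how an error short-circuits the chain.
struct AlwaysFails;

impl FittedStep for AlwaysFails {
    fn transform(&self, _x: &[Vec<f64>]) -> Result<Vec<Vec<f64>>, String> {
        Err("feature count mismatch".to_string())
    }
}

/// Feed each step the previous step's output, stopping at the first error.
fn transform_through(
    steps: &[Box<dyn FittedStep>],
    x: &[Vec<f64>],
) -> Result<Vec<Vec<f64>>, String> {
    let mut current = x.to_vec();
    for step in steps {
        current = step.transform(&current)?;
    }
    Ok(current)
}
```

With the steps `Scale(2.0)` then `Scale(10.0)`, the row `[1.0, 2.0]` becomes
`[20.0, 40.0]`; appending `AlwaysFails` turns the whole call into an `Err`,
just as `?` propagates a step's `FerroError` in the real loop.
*/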
/// Apply every fitted transformer step to `x`, returning the transformed
/// data without invoking the final estimator.
///
/// This mirrors `Pipeline.transform` (`sklearn/pipeline.py:863-904`) for the
/// *transformer-final* case. sklearn gates `transform` on
/// `_can_transform` (`:858`): it is only available when the final step is
/// itself a transformer, in which case it runs the data through ALL steps
/// including the last (`for _, name, transform in self._iter(): Xt =
/// transform.transform(Xt)`). When the final step is a non-transformer
/// estimator (e.g. `GaussianNB`), sklearn raises `AttributeError`
/// (`'Pipeline' has no attribute 'transform'`, verified against the live
/// 1.5.2 oracle).
///
/// ferrolearn's [`FittedPipeline`] structurally separates the transformer
/// steps from a single non-transformer estimator slot (the estimator is
/// reached via [`predict_pipeline`](FittedPipelineEstimator::predict_pipeline),
/// not `transform_pipeline`). Therefore `transform` applies exactly the
/// transformer steps and returns the data the final estimator would see —
/// equivalent to sklearn's transformer-final `transform` over the
/// transformer prefix. The estimator slot is never a transformer, so there
/// is no "transform the final step too" branch to mirror.
///
/// # Errors
///
/// Propagates any [`FerroError`] from a transformer step (e.g. a feature
/// count mismatch).
pub fn transform(&self, x: &Array2<F>) -> Result<Array2<F>, FerroError> {
self.transform_through(x)
}
/// Transform `x` through every fitted transformer step, then return the
/// final estimator's class-probability estimates, shape
/// `(n_samples, n_classes)`.
///
/// Mirrors `Pipeline.predict_proba` (`sklearn/pipeline.py:716-721`): run the
/// data through every non-final transformer, then
/// `self.steps[-1][1].predict_proba(Xt)`.
///
/// # Errors
///
/// Propagates transformer-step errors; returns [`FerroError::InvalidParameter`]
/// if the final estimator does not support `predict_proba` (sklearn's
/// `AttributeError` analog).
pub fn predict_proba(&self, x: &Array2<F>) -> Result<Array2<F>, FerroError> {
let xt = self.transform_through(x)?;
self.estimator.1.predict_proba_pipeline(&xt)
}
/// Transform `x` through every fitted transformer step, then return the
/// final estimator's decision-function scores.
///
/// Mirrors `Pipeline.decision_function` (`sklearn/pipeline.py:767-774`): run
/// the data through every non-final transformer, then
/// `self.steps[-1][1].decision_function(Xt)`.
///
/// # Errors
///
/// Propagates transformer-step errors; returns [`FerroError::InvalidParameter`]
/// if the final estimator does not support `decision_function` (sklearn's
/// `AttributeError` analog).
pub fn decision_function(&self, x: &Array2<F>) -> Result<Array2<F>, FerroError> {
let xt = self.transform_through(x)?;
self.estimator.1.decision_function_pipeline(&xt)
}
/// Transform `x` through every fitted transformer step, then return the
/// final estimator's score on `(Xt, y)` (e.g. mean accuracy for a
/// classifier).
///
/// Mirrors `Pipeline.score` (`sklearn/pipeline.py:997-1004`): run the data
/// through every non-final transformer, then
/// `self.steps[-1][1].score(Xt, y)`. ferrolearn does not yet thread
/// `sample_weight` (sklearn's optional third argument, `:961`); that is part
/// of the metadata-routing surface (REQ-6, blocker #364).
///
/// # Errors
///
/// Propagates transformer-step errors; returns [`FerroError::InvalidParameter`]
/// if the final estimator does not support `score` (sklearn's
/// `AttributeError` analog).
pub fn score(&self, x: &Array2<F>, y: &Array1<F>) -> Result<F, FerroError> {
let xt = self.transform_through(x)?;
self.estimator.1.score_pipeline(&xt, y)
}
}
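/*
`predict_proba`, `decision_function` and `score` above rely on optional
capabilities of the final estimator. One way to model an optional capability
in Rust is a trait method with a default body that returns `Err`, which
capable estimators override. The sketch below is standalone and hedged: the
`FinalEstimator` trait, both estimator types and the `String` error are
hypothetical and are not the crate's `FittedPipelineEstimator` definition.

```rust
/// Hypothetical final-estimator trait: `predict` is required, while
/// `predict_proba` reports "unsupported" unless an implementor overrides it.
trait FinalEstimator {
    fn predict(&self, x: &[f64]) -> Vec<f64>;

    fn predict_proba(&self, _x: &[f64]) -> Result<Vec<[f64; 2]>, String> {
        Err("final estimator does not support predict_proba".to_string())
    }
}

/// A regressor-like estimator: keeps the default (unsupported) `predict_proba`.
struct Regressor;

impl FinalEstimator for Regressor {
    fn predict(&self, x: &[f64]) -> Vec<f64> {
        x.to_vec()
    }
}

/// A classifier-like estimator: overrides `predict_proba`.
struct ThresholdClassifier;

impl FinalEstimator for ThresholdClassifier {
    fn predict(&self, x: &[f64]) -> Vec<f64> {
        x.iter().map(|&v| if v > 0.5 { 1.0 } else { 0.0 }).collect()
    }

    fn predict_proba(&self, x: &[f64]) -> Result<Vec<[f64; 2]>, String> {
        // Treat the clamped input as the positive-class probability.
        Ok(x.iter()
            .map(|&v| {
                let p = v.clamp(0.0, 1.0);
                [1.0 - p, p]
            })
            .collect())
    }
}
```

Calling `predict_proba` on `Regressor` yields an `Err` instead of a panic or a
compile error, which is the role sklearn's `AttributeError` plays for a final
step lacking the method.
*/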
impl<F: Float + Send + Sync + 'static> Predict<Array2<F>> for FittedPipeline<F> {
type Output = Array1<F>;
type Error = FerroError;
/// Generate predictions by transforming the input through each fitted
/// transformer step, then calling predict on the fitted estimator.
///
/// # Errors
///
/// Propagates any errors from transformer or estimator steps.
fn predict(&self, x: &Array2<F>) -> Result<Array1<F>, FerroError> {
let current_x = self.transform_through(x)?;
self.estimator.1.predict_pipeline(&current_x)
}
}
// ---------------------------------------------------------------------------
// PipelineStep: unified interface for the `.step()` builder method
// ---------------------------------------------------------------------------
/// A trait that unifies transformers and estimators for the
/// [`Pipeline::step`] builder method.
///
/// Types implementing [`PipelineTransformer`] or [`PipelineEstimator`] do not
/// implement this trait directly; they are adapted to it by the wrapper types
/// [`TransformerStepWrapper`] and [`EstimatorStepWrapper`].
///
/// Use [`as_transform_step`] and [`as_estimator_step`] to construct the
/// wrappers.
pub trait PipelineStep<F: Float + Send + Sync + 'static>: Send + Sync {
/// Add this step to the pipeline under the given name.
///
/// Transformer steps are added as intermediate transform steps.
/// Estimator steps are set as the final estimator.
fn add_to_pipeline(self: Box<Self>, pipeline: Pipeline<F>, name: &str) -> Pipeline<F>;
}
/// Wraps a [`PipelineTransformer`] to implement [`PipelineStep`].
///
/// Created by [`as_transform_step`].
pub struct TransformerStepWrapper<F: Float + Send + Sync + 'static>(
Box<dyn PipelineTransformer<F>>,
);
impl<F: Float + Send + Sync + 'static> PipelineStep<F> for TransformerStepWrapper<F> {
fn add_to_pipeline(self: Box<Self>, pipeline: Pipeline<F>, name: &str) -> Pipeline<F> {
pipeline.transform_step(name, self.0)
}
}
/// Wraps a [`PipelineEstimator`] to implement [`PipelineStep`].
///
/// Created by [`as_estimator_step`].
pub struct EstimatorStepWrapper<F: Float + Send + Sync + 'static>(Box<dyn PipelineEstimator<F>>);
impl<F: Float + Send + Sync + 'static> PipelineStep<F> for EstimatorStepWrapper<F> {
fn add_to_pipeline(self: Box<Self>, pipeline: Pipeline<F>, name: &str) -> Pipeline<F> {
pipeline.estimator_step(name, self.0)
}
}
/// Wrap a [`PipelineTransformer`] as a [`PipelineStep`] for use with
/// [`Pipeline::step`].
///
/// # Examples
///
/// ```
/// use ferrolearn_core::pipeline::{Pipeline, as_transform_step};
/// // Assuming `my_scaler` implements PipelineTransformer<f64>:
/// // let pipeline = Pipeline::new().step("scaler", as_transform_step(my_scaler));
/// ```
pub fn as_transform_step<F: Float + Send + Sync + 'static>(
t: impl PipelineTransformer<F> + 'static,
) -> Box<dyn PipelineStep<F>> {
Box::new(TransformerStepWrapper(Box::new(t)))
}
/// Wrap a [`PipelineEstimator`] as a [`PipelineStep`] for use with
/// [`Pipeline::step`].
///
/// # Examples
///
/// ```
/// use ferrolearn_core::pipeline::{Pipeline, as_estimator_step};
/// // Assuming `my_model` implements PipelineEstimator<f64>:
/// // let pipeline = Pipeline::new().step("model", as_estimator_step(my_model));
/// ```
pub fn as_estimator_step<F: Float + Send + Sync + 'static>(
e: impl PipelineEstimator<F> + 'static,
) -> Box<dyn PipelineStep<F>> {
Box::new(EstimatorStepWrapper(Box::new(e)))
}
// ---------------------------------------------------------------------------
// PassthroughTransformer: the `'passthrough'` step analog (identity no-op)
// ---------------------------------------------------------------------------
/// A no-op transformer step: fit does nothing and transform returns its input
/// unchanged.
///
/// This is the ferrolearn analog of scikit-learn's `'passthrough'` (and `None`)
/// pipeline step. In sklearn, a `Pipeline` step whose object is the string
/// `'passthrough'` (or `None`) is a transformer that is *skipped* during
/// fit/transform — `_iter(filter_passthrough=True)` drops it
/// (`sklearn/pipeline.py:275-290`), so the running `Xt` passes through unchanged
/// — yet the step is still visible in `named_steps` / `steps` / `__getitem__`
/// (`sklearn/pipeline.py:337`: `"passthrough" if estimator is None else
/// estimator`). The net behavior is identity: `Pipeline([('p','passthrough')])
/// .fit(X).transform(X) == X` (verified against the live 1.5.2 oracle).
///
/// ferrolearn encodes the transformer/estimator distinction in the type system
/// (there is no untyped `steps` list to hold a sentinel string), so rather than a
/// `filter_passthrough` branch in the fit/transform loop, the passthrough step is
/// a concrete, reusable *identity transformer*: its `fit_pipeline` is a no-op and
/// its [`FittedPassthroughTransformer::transform_pipeline`] returns `x.clone()`.
/// Placed anywhere in a [`Pipeline`] it leaves the running data unchanged and
/// still appears in [`Pipeline::step_names`] / [`Pipeline::named_steps`], exactly
/// matching sklearn's observable contract. The ergonomic builder
/// [`Pipeline::passthrough_step`] adds one under a given name (the `('name',
/// 'passthrough')` analog).
///
/// The type parameter `F` is the float type (`f32` or `f64`), defaulting to
/// `f64` to match the rest of this module.
///
/// # Examples
///
/// ```
/// use ferrolearn_core::pipeline::{
///     FittedPipelineTransformer, PassthroughTransformer, PipelineTransformer,
/// };
/// use ndarray::{Array1, Array2};
///
/// let p = PassthroughTransformer::<f64>::new();
/// let x = Array2::from_shape_vec((2, 2), vec![1.0, 2.0, 3.0, 4.0]).unwrap();
/// let y = Array1::<f64>::zeros(2);
/// let fitted = p.fit_pipeline(&x, &y).unwrap();
/// // Identity: transform returns the input unchanged.
/// assert_eq!(fitted.transform_pipeline(&x).unwrap(), x);
/// ```
pub struct PassthroughTransformer<F: Float + Send + Sync + 'static = f64> {
/// `PassthroughTransformer` holds no state; the marker ties the no-op to the
/// float type `F` so it slots into an `F`-typed [`Pipeline`].
_marker: core::marker::PhantomData<F>,
}
impl<F: Float + Send + Sync + 'static> PassthroughTransformer<F> {
/// Create a new passthrough (identity) transformer.
///
/// # Examples
///
/// ```
/// use ferrolearn_core::pipeline::PassthroughTransformer;
/// let p = PassthroughTransformer::<f64>::new();
/// ```
#[must_use]
pub fn new() -> Self {
Self {
_marker: core::marker::PhantomData,
}
}
}
impl<F: Float + Send + Sync + 'static> Default for PassthroughTransformer<F> {
fn default() -> Self {
Self::new()
}
}
impl<F: Float + Send + Sync + 'static> PipelineTransformer<F> for PassthroughTransformer<F> {
/// Fitting a passthrough step does nothing (there are no parameters to learn);
/// it yields a [`FittedPassthroughTransformer`] whose transform is the
/// identity. Mirrors sklearn skipping a `'passthrough'` step at fit
/// (`_iter(filter_passthrough=True)`, `sklearn/pipeline.py:289`), so the
/// running `Xt` is unaffected.
fn fit_pipeline(
&self,
_x: &Array2<F>,
_y: &Array1<F>,
) -> Result<Box<dyn FittedPipelineTransformer<F>>, FerroError> {
Ok(Box::new(FittedPassthroughTransformer::new()))
}
}
/// The fitted half of a [`PassthroughTransformer`]: an identity transform.
///
/// [`transform_pipeline`](FittedPassthroughTransformer::transform_pipeline)
/// returns its input unchanged, the fitted analog of sklearn's skipped
/// `'passthrough'` step leaving the running `Xt` unchanged
/// (`sklearn/pipeline.py:275-290`).
pub struct FittedPassthroughTransformer<F: Float + Send + Sync + 'static = f64> {
/// No fitted state; the marker ties the identity transform to `F`.
_marker: core::marker::PhantomData<F>,
}
impl<F: Float + Send + Sync + 'static> FittedPassthroughTransformer<F> {
/// Create a new fitted passthrough (identity) transformer.
#[must_use]
pub fn new() -> Self {
Self {
_marker: core::marker::PhantomData,
}
}
}
impl<F: Float + Send + Sync + 'static> Default for FittedPassthroughTransformer<F> {
fn default() -> Self {
Self::new()
}
}
impl<F: Float + Send + Sync + 'static> FittedPipelineTransformer<F>
for FittedPassthroughTransformer<F>
{
/// Return the input unchanged (identity).
///
/// This is the no-op that makes a passthrough step transparent: the data the
/// next step (or final estimator) sees is exactly what entered. Matches
/// sklearn's `'passthrough'` net behavior `Pipeline([('p','passthrough')])
/// .transform(X) == X` (live 1.5.2 oracle).
fn transform_pipeline(&self, x: &Array2<F>) -> Result<Array2<F>, FerroError> {
Ok(x.clone())
}
}
// ---------------------------------------------------------------------------
// FeatureUnion (unfitted)
// ---------------------------------------------------------------------------
/// A composite transformer that fits multiple named sub-transformers on the
/// SAME input and horizontally concatenates their outputs.
///
/// This is the ferrolearn analog of scikit-learn's `sklearn.pipeline.FeatureUnion`
/// (`sklearn/pipeline.py:1329`). Where a [`Pipeline`] chains transformers
/// *sequentially* (each transformer sees the previous one's output),
/// `FeatureUnion` applies every transformer *in parallel* to the same `X`, then
/// concatenates the results column-wise: the output width is the sum of each
/// sub-transformer's output width, and the columns appear left-to-right in the
/// order the transformers were added (mirrors `FeatureUnion.transform` →
/// `_hstack` `np.hstack(Xs)`, `sklearn/pipeline.py:1770`/`:1812`).
///
/// `FeatureUnion` reuses the [`PipelineTransformer`] / [`FittedPipelineTransformer`]
/// trait objects already used by [`Pipeline`], so any transformer usable in a
/// pipeline is usable in a feature union.
///
/// The type parameter `F` is the float type (`f32` or `f64`), defaulting to
/// `f64` to match the rest of this module.
///
/// # Divergence from scikit-learn
///
/// This is the core fit / transform / hstack / `get_feature_names_out` subset.
/// `transformer_weights` (per-transformer output scaling,
/// `sklearn/pipeline.py:1369`), the `'drop'` / `'passthrough'` sentinels
/// (`:1530`/`:1563`), `n_jobs` parallelism (`:1360`), metadata routing (`:1859`),
/// and `verbose_feature_names_out=False` (`:1618`) are NOT implemented (REQ-8
/// NOT-STARTED scope). The data substrate is `ndarray`, not yet ferray.
///
/// # Examples
///
/// ```
/// use ferrolearn_core::pipeline::{
/// FeatureUnion, PipelineTransformer, FittedPipelineTransformer,
/// };
/// use ferrolearn_core::{Transform, FerroError};
/// use ndarray::{Array1, Array2};
///
/// // A transformer that returns its input unchanged.
/// struct Identity;
/// impl PipelineTransformer<f64> for Identity {
/// fn fit_pipeline(
/// &self,
/// _x: &Array2<f64>,
/// _y: &Array1<f64>,
/// ) -> Result<Box<dyn FittedPipelineTransformer<f64>>, FerroError> {
/// Ok(Box::new(FittedIdentity))
/// }
/// }
/// struct FittedIdentity;
/// impl FittedPipelineTransformer<f64> for FittedIdentity {
/// fn transform_pipeline(&self, x: &Array2<f64>) -> Result<Array2<f64>, FerroError> {
/// Ok(x.clone())
/// }
/// }
///
/// use ferrolearn_core::Fit;
/// let union = FeatureUnion::<f64>::new()
/// .with_transformer("a", Box::new(Identity))
/// .with_transformer("b", Box::new(Identity));
/// let x = Array2::from_shape_vec((2, 2), vec![1.0, 2.0, 3.0, 4.0]).unwrap();
/// let fitted = union.fit(&x, &()).unwrap();
/// // Two identity transformers → output width 2 + 2 = 4.
/// let out = fitted.transform(&x).unwrap();
/// assert_eq!(out.dim(), (2, 4));
/// assert_eq!(fitted.get_feature_names_out(), vec!["a__x0", "a__x1", "b__x0", "b__x1"]);
/// ```
pub struct FeatureUnion<F: Float + Send + Sync + 'static = f64> {
/// Ordered named transformers, all fit on the same input.
transformer_list: Vec<(String, Box<dyn PipelineTransformer<F>>)>,
}
impl<F: Float + Send + Sync + 'static> FeatureUnion<F> {
/// Create a new empty feature union.
///
/// Sub-transformers are added with
/// [`with_transformer`](FeatureUnion::with_transformer). An empty union fits
/// successfully and transforms to a `(n_samples, 0)` matrix (the empty
/// `np.hstack` analog).
///
/// # Examples
///
/// ```
/// use ferrolearn_core::pipeline::FeatureUnion;
/// let union = FeatureUnion::<f64>::new();
/// assert_eq!(union.n_transformers(), 0);
/// ```
#[must_use]
pub fn new() -> Self {
Self {
transformer_list: Vec::new(),
}
}
/// Add a named transformer to the union using the builder pattern.
///
/// Mirrors an entry of sklearn's `transformer_list`
/// (`sklearn/pipeline.py:1348`). Transformers are applied in the order they
/// are added; their outputs are concatenated left-to-right.
#[must_use]
pub fn with_transformer(mut self, name: &str, t: Box<dyn PipelineTransformer<F>>) -> Self {
self.transformer_list.push((name.to_owned(), t));
self
}
/// Returns the names of all sub-transformers, in union order.
///
/// Mirrors the key order of sklearn's `named_transformers`
/// (`sklearn/pipeline.py:1478`: `Bunch(**dict(self.transformer_list))`).
#[must_use]
pub fn transformer_names(&self) -> Vec<&str> {
self.transformer_list
.iter()
.map(|(name, _)| name.as_str())
.collect()
}
/// Number of sub-transformers in the union.
#[must_use]
pub fn n_transformers(&self) -> usize {
self.transformer_list.len()
}
}
impl<F: Float + Send + Sync + 'static> Default for FeatureUnion<F> {
fn default() -> Self {
Self::new()
}
}
impl<F: Float + Send + Sync + 'static> Fit<Array2<F>, ()> for FeatureUnion<F> {
type Fitted = FittedFeatureUnion<F>;
type Error = FerroError;
/// Fit every sub-transformer on the SAME input `x`.
///
/// Mirrors `FeatureUnion.fit` (`sklearn/pipeline.py:1643`), which fits each
/// transformer in `transformer_list` independently on the full `X` (every
/// transformer sees the same data, unlike the sequential `Pipeline`). The
/// per-transformer output width is recorded at fit time (by transforming `x`
/// once) so that `get_feature_names_out` can size each column block.
///
/// # `y` handling
///
/// sklearn's `FeatureUnion` threads `y` to each sub-transformer's `fit`
/// (`sklearn/pipeline.py:1681`/`_fit_one`), but feature-union transformers are
/// unsupervised and ignore it. ferrolearn's [`PipelineTransformer::fit_pipeline`]
/// requires an `Array1<F>` target, so this impl passes an empty
/// `Array1::zeros(0)` — the union's own `Fit` target type is `()` (it takes no
/// supervised target), and the empty array is the no-target sentinel handed to
/// each unsupervised sub-transformer.
///
/// # Errors
///
/// Propagates any [`FerroError`] from an individual sub-transformer's
/// `fit_pipeline` or its width-probing `transform_pipeline`.
fn fit(&self, x: &Array2<F>, _y: &()) -> Result<FittedFeatureUnion<F>, FerroError> {
// Validate transformer-name uniqueness BEFORE fitting any sub-transformer,
// mirroring `FeatureUnion._validate_transformers` → `_validate_names`
// (`sklearn/pipeline.py:1523-1525` → `sklearn/utils/metaestimators.py:81-83`),
// which sklearn runs on every fit/fit_transform: `if len(set(names)) !=
// len(names): raise ValueError("Names provided are not unique: {names!r}")`.
// R-DEV-2 (user-API ABI / exception parity): a duplicate name is a
// deliberate `ValueError`, so ferrolearn rejects it at fit with the
// closest analog, `FerroError::InvalidParameter`.
let names: Vec<&str> = self
.transformer_list
.iter()
.map(|(name, _)| name.as_str())
.collect();
let mut seen = std::collections::HashSet::with_capacity(names.len());
if !names.iter().all(|name| seen.insert(*name)) {
return Err(FerroError::InvalidParameter {
name: "transformer_list".into(),
reason: format!("Names provided are not unique: {names:?}"),
});
}
// Reject any name containing the reserved `__` separator, mirroring the
// THIRD clause of `_validate_names`
// (`sklearn/utils/metaestimators.py:91-95`): `invalid_names = [name for
// name in names if "__" in name]; if invalid_names: raise
// ValueError("Estimator names must not contain __: got {0!r}")`. `__` is
// reserved for the nested-parameter addressing protocol
// (`<step>__<param>`), so it is forbidden anywhere in a step name (a
// single `_` is fine). sklearn runs this AFTER the uniqueness clause; we
// match that order. R-DEV-2 (exception parity): a deliberate `ValueError`,
// mapped to the closest analog `FerroError::InvalidParameter`. (The MIDDLE
// clause — names colliding with constructor-arg params,
// `metaestimators.py:84-90` — has no ferrolearn analog: `FeatureUnion`
// exposes no `get_params` params, so it is intentionally not mirrored.)
let invalid_names: Vec<&str> = names
.iter()
.copied()
.filter(|name| name.contains("__"))
.collect();
if !invalid_names.is_empty() {
return Err(FerroError::InvalidParameter {
name: "transformer_list".into(),
reason: format!("Estimator names must not contain __: got {invalid_names:?}"),
});
}
// FeatureUnion sub-transformers are unsupervised; sklearn passes `y`
// through but the transformers ignore it (`sklearn/pipeline.py:1681`).
// The empty target is the no-supervision sentinel for `fit_pipeline`.
let empty_y: Array1<F> = Array1::zeros(0);
let mut fitted = Vec::with_capacity(self.transformer_list.len());
let mut n_features_per = Vec::with_capacity(self.transformer_list.len());
for (name, transformer) in &self.transformer_list {
let fitted_t = transformer.fit_pipeline(x, &empty_y)?;
// Probe the output width once at fit so feature-name prefixing and
// the hstack column layout know each block's size.
let out = fitted_t.transform_pipeline(x)?;
n_features_per.push(out.ncols());
fitted.push((name.clone(), fitted_t));
}
Ok(FittedFeatureUnion {
fitted,
n_features_per,
})
}
}
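/*
The two name checks performed at the top of `FeatureUnion::fit` (uniqueness
first, then the reserved `__` separator) can be read in isolation. This is a
minimal standalone sketch under stated assumptions: `validate_names` is a
hypothetical free function and the `String` error stands in for
`FerroError::InvalidParameter`; only the check order and the messages are
taken from the code above.

```rust
use std::collections::HashSet;

/// Reject duplicate names first, then any name containing `__`.
fn validate_names(names: &[&str]) -> Result<(), String> {
    // `HashSet::insert` returns `false` for a value already present, so
    // `all` stops at the first duplicate.
    let mut seen = HashSet::with_capacity(names.len());
    if !names.iter().all(|name| seen.insert(*name)) {
        return Err(format!("Names provided are not unique: {names:?}"));
    }

    // `__` is reserved for `<step>__<param>` addressing; a single `_` is fine.
    let invalid: Vec<&str> = names
        .iter()
        .copied()
        .filter(|name| name.contains("__"))
        .collect();
    if !invalid.is_empty() {
        return Err(format!(
            "Estimator names must not contain __: got {invalid:?}"
        ));
    }
    Ok(())
}
```

Because uniqueness is checked first, the input `["a__b", "a__b"]` reports the
duplicate rather than the reserved separator.
*/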
// ---------------------------------------------------------------------------
// FittedFeatureUnion
// ---------------------------------------------------------------------------
/// A fitted [`FeatureUnion`]: each named sub-transformer is fitted, and the
/// per-transformer output width is recorded for feature-name prefixing and the
/// horizontal-concatenation column layout.
///
/// Created by calling [`Fit::fit`] on a [`FeatureUnion`]. Implements
/// [`Transform<Array2<F>>`](Transform) producing the horizontally concatenated
/// `Array2<F>`.
pub struct FittedFeatureUnion<F: Float + Send + Sync + 'static = f64> {
/// Fitted sub-transformers, in union order.
fitted: Vec<(String, Box<dyn FittedPipelineTransformer<F>>)>,
/// The output column count of each sub-transformer, in union order
/// (recorded at fit). The total output width is the sum of these.
n_features_per: Vec<usize>,
}
impl<F: Float + Send + Sync + 'static> FittedFeatureUnion<F> {
/// Returns the names of all fitted sub-transformers, in union order.
#[must_use]
pub fn transformer_names(&self) -> Vec<&str> {
self.fitted.iter().map(|(name, _)| name.as_str()).collect()
}
/// Number of fitted sub-transformers in the union.
#[must_use]
pub fn n_transformers(&self) -> usize {
self.fitted.len()
}
/// Total output width: the sum of every sub-transformer's output column
/// count. Equals the number of columns in [`Transform::transform`]'s output.
#[must_use]
pub fn n_features_out(&self) -> usize {
self.n_features_per.iter().sum()
}
/// Output feature names, one per output column, in concatenation order.
///
/// For each sub-transformer named `name` with output width `w`, this emits
/// `"{name}__x0" .. "{name}__x{w-1}"`, then moves on to the next
/// transformer's block. This mirrors `FeatureUnion.get_feature_names_out`
/// with the default `verbose_feature_names_out=True`
/// (`sklearn/pipeline.py:1567`/`:1608-1616`): sklearn prefixes each
/// sub-transformer's own feature name with `"{name}__"`.
///
/// ferrolearn's [`PipelineTransformer`] trait objects do not expose their own
/// per-output feature names, so the positional default `x{j}` is used as the
/// suffix — this is sklearn's `OneToOneFeatureMixin` positional default
/// (`['x0','x1',...]`), which is exactly what `StandardScaler` /
/// `MinMaxScaler` and other column-preserving transformers produce. So a union
/// of two such transformers named `ss`/`mm` over 2-column input yields
/// `['ss__x0','ss__x1','mm__x0','mm__x1']`, matching the live oracle.
#[must_use]
pub fn get_feature_names_out(&self) -> Vec<String> {
let mut names = Vec::with_capacity(self.n_features_out());
for ((name, _), &width) in self.fitted.iter().zip(self.n_features_per.iter()) {
for j in 0..width {
names.push(format!("{name}__x{j}"));
}
}
names
}
}
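/*
The naming scheme used by `get_feature_names_out` depends only on each
sub-transformer's name and output width. The standalone sketch below shows it
on plain tuples; `prefixed_feature_names` is a hypothetical helper written for
illustration and is not part of this module.

```rust
/// Build `"{name}__x{j}"` for every output column of every block, in order.
/// Each entry of `blocks` is `(transformer name, output width)`.
fn prefixed_feature_names(blocks: &[(&str, usize)]) -> Vec<String> {
    let total: usize = blocks.iter().map(|&(_, width)| width).sum();
    let mut names = Vec::with_capacity(total);
    for &(name, width) in blocks {
        for j in 0..width {
            names.push(format!("{name}__x{j}"));
        }
    }
    names
}
```

For `[("ss", 2), ("mm", 2)]` this produces `ss__x0`, `ss__x1`, `mm__x0`,
`mm__x1`; a block of width 0 contributes no names.
*/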
impl<F: Float + Send + Sync + 'static> Transform<Array2<F>> for FittedFeatureUnion<F> {
type Output = Array2<F>;
type Error = FerroError;
/// Transform `x` through every fitted sub-transformer and horizontally
/// concatenate the results.
///
/// Mirrors `FeatureUnion.transform` (`sklearn/pipeline.py:1770`): each
/// transformer transforms the same `x`, then `self._hstack(Xs)`
/// (`np.hstack`, `:1812`/`:1820`) concatenates the outputs column-wise. The
/// output has shape `(n_samples, sum_of_widths)` and the columns appear in
/// transformer order: block 0 is the first transformer's full output, block 1
/// the second's, and so on. An empty union transforms to a `(n_samples, 0)`
/// matrix (the empty-`np.hstack` analog).
///
/// # Errors
///
/// Propagates any [`FerroError`] from an individual sub-transformer. Returns
/// [`FerroError::ShapeMismatch`] if a sub-transformer's output does not have
/// `n_samples == x.nrows()` rows (the hstack requires row-aligned blocks).
fn transform(&self, x: &Array2<F>) -> Result<Array2<F>, FerroError> {
let n_rows = x.nrows();
// Transform `x` through each sub-transformer, collecting the blocks and
// their widths. Validate each block is row-aligned before any copy.
let mut blocks: Vec<Array2<F>> = Vec::with_capacity(self.fitted.len());
let mut total_width = 0usize;
for (name, transformer) in &self.fitted {
let block = transformer.transform_pipeline(x)?;
if block.nrows() != n_rows {
return Err(FerroError::ShapeMismatch {
expected: vec![n_rows, block.ncols()],
actual: vec![block.nrows(), block.ncols()],
context: format!(
"FeatureUnion transformer `{name}` produced {} rows, expected {n_rows} \
(every sub-transformer output must be row-aligned for hstack)",
block.nrows()
),
});
}
total_width += block.ncols();
blocks.push(block);
}
// Allocate the concatenated output and copy each block into its
// contiguous column range, left-to-right (bounds-safe: `col_offset` and
// each block width are derived from the blocks just collected).
let mut out = Array2::<F>::zeros((n_rows, total_width));
let mut col_offset = 0usize;
for block in &blocks {
let width = block.ncols();
for r in 0..n_rows {
for c in 0..width {
out[[r, col_offset + c]] = block[[r, c]];
}
}
col_offset += width;
}
Ok(out)
}
}
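/*
The horizontal concatenation performed by `transform` can be sketched on
nested `Vec`s instead of `ndarray` arrays. This is an illustrative standalone
version, not the implementation above: `hstack` is a hypothetical helper, the
`String` error stands in for `FerroError::ShapeMismatch`, and every row within
a block is assumed to have the same length.

```rust
/// Concatenate row-aligned blocks column-wise, left to right.
/// Each block is a row-major matrix with exactly `n_rows` rows.
fn hstack(n_rows: usize, blocks: &[Vec<Vec<f64>>]) -> Result<Vec<Vec<f64>>, String> {
    // Validate every block before copying anything.
    for (i, block) in blocks.iter().enumerate() {
        if block.len() != n_rows {
            return Err(format!(
                "block {i} has {} rows, expected {n_rows}",
                block.len()
            ));
        }
    }

    // Start from `n_rows` empty rows and append each block's row in turn.
    let mut out = vec![Vec::new(); n_rows];
    for block in blocks {
        for (out_row, block_row) in out.iter_mut().zip(block) {
            out_row.extend_from_slice(block_row);
        }
    }
    Ok(out)
}
```

An empty `blocks` slice yields `n_rows` empty rows, the nested-`Vec`
counterpart of the `(n_samples, 0)` matrix an empty union produces.
*/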
// ---------------------------------------------------------------------------
// Tests
// ---------------------------------------------------------------------------
#[cfg(test)]
mod tests {
use super::*;
// -- Test fixtures -------------------------------------------------------
/// A trivial transformer that doubles all values.
struct DoublingTransformer;
impl PipelineTransformer<f64> for DoublingTransformer {
fn fit_pipeline(
&self,
_x: &Array2<f64>,
_y: &Array1<f64>,
) -> Result<Box<dyn FittedPipelineTransformer<f64>>, FerroError> {
Ok(Box::new(FittedDoublingTransformer))
}
}
struct FittedDoublingTransformer;
impl FittedPipelineTransformer<f64> for FittedDoublingTransformer {
fn transform_pipeline(&self, x: &Array2<f64>) -> Result<Array2<f64>, FerroError> {
Ok(x.mapv(|v| v * 2.0))
}
}
/// A trivial estimator that sums each row.
struct SumEstimator;
impl PipelineEstimator<f64> for SumEstimator {
fn fit_pipeline(
&self,
_x: &Array2<f64>,
_y: &Array1<f64>,
) -> Result<Box<dyn FittedPipelineEstimator<f64>>, FerroError> {
Ok(Box::new(FittedSumEstimator))
}
}
struct FittedSumEstimator;
impl FittedPipelineEstimator<f64> for FittedSumEstimator {
fn predict_pipeline(&self, x: &Array2<f64>) -> Result<Array1<f64>, FerroError> {
let sums: Vec<f64> = x.rows().into_iter().map(|row| row.sum()).collect();
Ok(Array1::from_vec(sums))
}
}
// -- f32 test fixtures ---------------------------------------------------
/// A trivial f32 transformer that doubles all values.
struct DoublingTransformerF32;
impl PipelineTransformer<f32> for DoublingTransformerF32 {
fn fit_pipeline(
&self,
_x: &Array2<f32>,
_y: &Array1<f32>,
) -> Result<Box<dyn FittedPipelineTransformer<f32>>, FerroError> {
Ok(Box::new(FittedDoublingTransformerF32))
}
}
struct FittedDoublingTransformerF32;
impl FittedPipelineTransformer<f32> for FittedDoublingTransformerF32 {
fn transform_pipeline(&self, x: &Array2<f32>) -> Result<Array2<f32>, FerroError> {
Ok(x.mapv(|v| v * 2.0))
}
}
/// A trivial f32 estimator that sums each row.
struct SumEstimatorF32;
impl PipelineEstimator<f32> for SumEstimatorF32 {
fn fit_pipeline(
&self,
_x: &Array2<f32>,
_y: &Array1<f32>,
) -> Result<Box<dyn FittedPipelineEstimator<f32>>, FerroError> {
Ok(Box::new(FittedSumEstimatorF32))
}
}
struct FittedSumEstimatorF32;
impl FittedPipelineEstimator<f32> for FittedSumEstimatorF32 {
fn predict_pipeline(&self, x: &Array2<f32>) -> Result<Array1<f32>, FerroError> {
let sums: Vec<f32> = x.rows().into_iter().map(|row| row.sum()).collect();
Ok(Array1::from_vec(sums))
}
}
// -- Tests ---------------------------------------------------------------
#[test]
fn test_pipeline_fit_predict() {
let pipeline = Pipeline::new()
.transform_step("doubler", Box::new(DoublingTransformer))
.estimator_step("sum", Box::new(SumEstimator));
let x = Array2::from_shape_vec((2, 3), vec![1.0, 2.0, 3.0, 4.0, 5.0, 6.0]).unwrap();
let y = Array1::from_vec(vec![0.0, 1.0]);
let fitted = pipeline.fit(&x, &y).unwrap();
let preds = fitted.predict(&x).unwrap();
// After doubling: [[2,4,6],[8,10,12]], sums: [12, 30]
assert_eq!(preds.len(), 2);
assert!((preds[0] - 12.0).abs() < 1e-10);
assert!((preds[1] - 30.0).abs() < 1e-10);
}
#[test]
fn test_pipeline_f32_fit_predict() {
let pipeline = Pipeline::<f32>::new()
.transform_step("doubler", Box::new(DoublingTransformerF32))
.estimator_step("sum", Box::new(SumEstimatorF32));
let x = Array2::from_shape_vec((2, 3), vec![1.0f32, 2.0, 3.0, 4.0, 5.0, 6.0]).unwrap();
let y = Array1::from_vec(vec![0.0f32, 1.0]);
let fitted = pipeline.fit(&x, &y).unwrap();
let preds = fitted.predict(&x).unwrap();
assert_eq!(preds.len(), 2);
assert!((preds[0] - 12.0).abs() < 1e-5);
assert!((preds[1] - 30.0).abs() < 1e-5);
}
#[test]
fn test_pipeline_step_builder() {
let pipeline = Pipeline::new()
.step("doubler", as_transform_step(DoublingTransformer))
.step("sum", as_estimator_step(SumEstimator));
let x = Array2::from_shape_vec((2, 3), vec![1.0, 2.0, 3.0, 4.0, 5.0, 6.0]).unwrap();
let y = Array1::from_vec(vec![0.0, 1.0]);
let fitted = pipeline.fit(&x, &y).unwrap();
let preds = fitted.predict(&x).unwrap();
assert!((preds[0] - 12.0).abs() < 1e-10);
assert!((preds[1] - 30.0).abs() < 1e-10);
}
#[test]
fn test_pipeline_rejects_inconsistent_x_y() {
// sklearn's Pipeline.fit validates X/y consistency before fitting any
// step (check_consistent_length, validation.py:1320): a mismatched
// n_samples raises ValueError. Live oracle:
// from sklearn.pipeline import Pipeline
// from sklearn.preprocessing import StandardScaler
// from sklearn.naive_bayes import GaussianNB; import numpy as np
// p = Pipeline([("s", StandardScaler()), ("c", GaussianNB())])
// try: p.fit(np.zeros((3,2)), np.zeros(4)); print("OK")
// except ValueError: print("RAISE") # -> RAISE
let pipeline = Pipeline::new()
.transform_step("doubler", Box::new(DoublingTransformer))
.estimator_step("sum", Box::new(SumEstimator));
let x = Array2::<f64>::zeros((3, 2));
let y = Array1::from_vec(vec![0.0, 1.0]); // len 2 != 3 rows
let result = pipeline.fit(&x, &y);
assert!(matches!(result, Err(FerroError::ShapeMismatch { .. })));
}
#[test]
fn test_pipeline_accepts_consistent_x_y() -> Result<(), FerroError> {
// The guard must not reject well-formed X/y (live oracle: same Pipeline
// with matching shapes -> OK).
let pipeline = Pipeline::new()
.transform_step("doubler", Box::new(DoublingTransformer))
.estimator_step("sum", Box::new(SumEstimator));
let x =
Array2::from_shape_vec((2, 3), vec![1.0, 2.0, 3.0, 4.0, 5.0, 6.0]).map_err(|e| {
FerroError::InvalidParameter {
name: "x".into(),
reason: e.to_string(),
}
})?;
let y = Array1::from_vec(vec![0.0, 1.0]);
let fitted = pipeline.fit(&x, &y)?;
assert_eq!(fitted.predict(&x)?.len(), 2);
Ok(())
}
#[test]
fn test_pipeline_no_estimator_returns_error() {
let pipeline = Pipeline::new().transform_step("doubler", Box::new(DoublingTransformer));
let x = Array2::<f64>::zeros((2, 3));
let y = Array1::from_vec(vec![0.0, 1.0]);
let result = pipeline.fit(&x, &y);
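// REQ-2: a pipeline without a final estimator is rejected at fit.
assert!(matches!(result, Err(FerroError::InvalidParameter { .. })));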
}
#[test]
fn test_pipeline_estimator_only() {
let pipeline = Pipeline::new().estimator_step("sum", Box::new(SumEstimator));
let x = Array2::from_shape_vec((2, 3), vec![1.0, 2.0, 3.0, 4.0, 5.0, 6.0]).unwrap();
let y = Array1::from_vec(vec![0.0, 1.0]);
let fitted = pipeline.fit(&x, &y).unwrap();
let preds = fitted.predict(&x).unwrap();
// No transform, just sum: [6, 15]
assert!((preds[0] - 6.0).abs() < 1e-10);
assert!((preds[1] - 15.0).abs() < 1e-10);
}
#[test]
fn test_fitted_pipeline_step_names() {
let pipeline = Pipeline::new()
.transform_step("scaler", Box::new(DoublingTransformer))
.transform_step("normalizer", Box::new(DoublingTransformer))
.estimator_step("clf", Box::new(SumEstimator));
let x = Array2::<f64>::zeros((2, 3));
let y = Array1::from_vec(vec![0.0, 1.0]);
let fitted = pipeline.fit(&x, &y).unwrap();
let names = fitted.step_names();
assert_eq!(names, vec!["scaler", "normalizer", "clf"]);
}
#[test]
fn test_multiple_transform_steps() {
// Two doublers in sequence should quadruple values.
let pipeline = Pipeline::new()
.transform_step("double1", Box::new(DoublingTransformer))
.transform_step("double2", Box::new(DoublingTransformer))
.estimator_step("sum", Box::new(SumEstimator));
let x = Array2::from_shape_vec((1, 2), vec![1.0, 1.0]).unwrap();
let y = Array1::from_vec(vec![0.0]);
let fitted = pipeline.fit(&x, &y).unwrap();
let preds = fitted.predict(&x).unwrap();
// 1.0 * 2 * 2 = 4.0 per element, sum of 2 elements = 8.0
assert!((preds[0] - 8.0).abs() < 1e-10);
}
#[test]
fn test_pipeline_default() {
let pipeline = Pipeline::<f64>::default();
let x = Array2::<f64>::zeros((2, 3));
let y = Array1::from_vec(vec![0.0, 1.0]);
// A default pipeline has no estimator step, so fit must be rejected.
assert!(matches!(
pipeline.fit(&x, &y),
Err(FerroError::InvalidParameter { .. })
));
}
#[test]
fn test_pipeline_is_send_sync() {
fn assert_send_sync<T: Send + Sync>() {}
// Both Pipeline and FittedPipeline must be Send + Sync for f32 and f64;
// this holds because they only store Send + Sync trait objects.
assert_send_sync::<Pipeline<f64>>();
assert_send_sync::<Pipeline<f32>>();
assert_send_sync::<FittedPipeline<f64>>();
assert_send_sync::<FittedPipeline<f32>>();
}
// -- REQ-3: fit_transform / transform / predict_proba / decision_function /
// score ---------------------------------------------------------------
/// An estimator that overrides the probability and score delegations
/// (`predict_proba_pipeline` and `score_pipeline`) while leaving
/// `decision_function_pipeline` at its default, proving the default-Err
/// trait methods can be overridden by a real final estimator (mirrors how
/// `GaussianNB` does so in `gaussian.rs`).
struct ProbaEstimator;
impl PipelineEstimator<f64> for ProbaEstimator {
fn fit_pipeline(
&self,
_x: &Array2<f64>,
_y: &Array1<f64>,
) -> Result<Box<dyn FittedPipelineEstimator<f64>>, FerroError> {
Ok(Box::new(FittedProbaEstimator))
}
}
struct FittedProbaEstimator;
impl FittedPipelineEstimator<f64> for FittedProbaEstimator {
fn predict_pipeline(&self, x: &Array2<f64>) -> Result<Array1<f64>, FerroError> {
// Predict 1.0 when the row sum is positive, else 0.0.
Ok(Array1::from_iter(
x.rows()
.into_iter()
.map(|r| if r.sum() > 0.0 { 1.0 } else { 0.0 }),
))
}
fn predict_proba_pipeline(&self, x: &Array2<f64>) -> Result<Array2<f64>, FerroError> {
// A deterministic two-column "probability" (sigmoid of row sum).
let mut out = Array2::<f64>::zeros((x.nrows(), 2));
for (i, r) in x.rows().into_iter().enumerate() {
let p1 = 1.0 / (1.0 + (-r.sum()).exp());
out[[i, 0]] = 1.0 - p1;
out[[i, 1]] = p1;
}
Ok(out)
}
fn score_pipeline(&self, x: &Array2<f64>, y: &Array1<f64>) -> Result<f64, FerroError> {
let preds = self.predict_pipeline(x)?;
let n = y.len();
if n == 0 {
return Ok(0.0);
}
let correct = preds
.iter()
.zip(y.iter())
.filter(|(p, t)| (**p - **t).abs() < 1e-12)
.count();
Ok(correct as f64 / n as f64)
}
}
#[test]
fn test_pipeline_fit_transform_equals_transform() -> Result<(), FerroError> {
// fit_transform must return exactly what FittedPipeline::transform
// returns on the same input (fit-then-transform ≡ fused fit_transform).
let pipeline = Pipeline::new()
.transform_step("double1", Box::new(DoublingTransformer))
.transform_step("double2", Box::new(DoublingTransformer))
.estimator_step("sum", Box::new(SumEstimator));
let x = ndarray::array![[1.0, 2.0], [3.0, 4.0]];
let y = Array1::from_vec(vec![0.0, 1.0]);
let (fitted, xt) = pipeline.fit_transform(&x, &y)?;
// Two doublers quadruple the data.
let expected = x.mapv(|v| v * 4.0);
assert_eq!(xt, expected);
// transform() on the fitted pipeline matches fit_transform's output.
let xt2 = fitted.transform(&x)?;
assert_eq!(xt2, expected);
Ok(())
}
#[test]
fn test_pipeline_transform_applies_only_transformer_steps() -> Result<(), FerroError> {
// FittedPipeline::transform returns the data the estimator would see —
// i.e. only the transformer prefix is applied, not the estimator.
let pipeline = Pipeline::new()
.transform_step("doubler", Box::new(DoublingTransformer))
.estimator_step("sum", Box::new(SumEstimator));
let x = ndarray::array![[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]];
let y = Array1::from_vec(vec![0.0, 1.0]);
let fitted = pipeline.fit(&x, &y)?;
let xt = fitted.transform(&x)?;
assert_eq!(xt, x.mapv(|v| v * 2.0));
Ok(())
}
#[test]
fn test_pipeline_predict_proba_default_is_err() -> Result<(), FerroError> {
// SumEstimator overrides none of predict_proba_pipeline,
// decision_function_pipeline, or score_pipeline, so each call falls back
// to the default Err (the sklearn AttributeError analog).
let pipeline = Pipeline::new()
.transform_step("doubler", Box::new(DoublingTransformer))
.estimator_step("sum", Box::new(SumEstimator));
let x = ndarray::array![[1.0, 1.0]];
let y = Array1::from_vec(vec![0.0]);
let fitted = pipeline.fit(&x, &y)?;
assert!(matches!(
fitted.predict_proba(&x),
Err(FerroError::InvalidParameter { .. })
));
assert!(matches!(
fitted.decision_function(&x),
Err(FerroError::InvalidParameter { .. })
));
assert!(matches!(
fitted.score(&x, &y),
Err(FerroError::InvalidParameter { .. })
));
Ok(())
}
#[test]
fn test_pipeline_predict_proba_and_score_override() -> Result<(), FerroError> {
// ProbaEstimator overrides the delegations. The transformer doubles the
// data; the proba estimator sees the doubled rows.
let pipeline = Pipeline::new()
.transform_step("doubler", Box::new(DoublingTransformer))
.estimator_step("clf", Box::new(ProbaEstimator));
let x = ndarray::array![[1.0], [-2.0]];
let y = Array1::from_vec(vec![1.0, 0.0]);
let fitted = pipeline.fit(&x, &y)?;
// Doubled rows: [2.0], [-4.0]. p1 = sigmoid(row sum).
let proba = fitted.predict_proba(&x)?;
assert_eq!(proba.dim(), (2, 2));
for i in 0..2 {
assert!((proba.row(i).sum() - 1.0).abs() < 1e-12);
}
let p1_row0 = 1.0 / (1.0 + (-2.0f64).exp());
assert!((proba[[0, 1]] - p1_row0).abs() < 1e-12);
// Both rows predicted correctly → score 1.0.
let s = fitted.score(&x, &y)?;
assert!((s - 1.0).abs() < 1e-12);
Ok(())
}
// -- REQ-4a: named_steps / get_step / get_step_by_name / into_slice -------
fn is_transformer(r: &PipelineStepRef<'_, f64>) -> bool {
matches!(r, PipelineStepRef::Transformer(_))
}
fn is_estimator(r: &PipelineStepRef<'_, f64>) -> bool {
matches!(r, PipelineStepRef::Estimator(_))
}
#[test]
fn test_pipeline_named_steps_match_sklearn() {
// sklearn: Pipeline([('a',StandardScaler()),('b',MinMaxScaler()),
// ('c',GaussianNB())]).named_steps keys order
// == ['a', 'b', 'c'] (live oracle, sklearn 1.5.2;
// `named_steps = Bunch(**dict(self.steps))`, pipeline.py:325).
// Every step is reachable by its construction name, in order.
let pipeline = Pipeline::new()
.transform_step("a", Box::new(DoublingTransformer))
.transform_step("b", Box::new(DoublingTransformer))
.estimator_step("c", Box::new(SumEstimator));
let named = pipeline.named_steps();
let names: Vec<&str> = named.iter().map(|(n, _)| *n).collect();
assert_eq!(names, vec!["a", "b", "c"]);
// The two transformer steps are transformers; the final is the estimator.
assert!(is_transformer(&named[0].1));
assert!(is_transformer(&named[1].1));
assert!(is_estimator(&named[2].1));
// step_names() agrees with named_steps() key order.
assert_eq!(pipeline.step_names(), names);
// len() counts every step (3), matching sklearn len(pipe)==3.
assert_eq!(pipeline.len(), 3);
assert!(!pipeline.is_empty());
}
#[test]
fn test_pipeline_get_step_integer() {
// sklearn: p[0] -> first step object, p[2] -> last (estimator);
// p[10] -> IndexError (live oracle). ferrolearn returns None for an
// out-of-bounds index instead of raising.
let pipeline = Pipeline::new()
.transform_step("a", Box::new(DoublingTransformer))
.transform_step("b", Box::new(DoublingTransformer))
.estimator_step("c", Box::new(SumEstimator));
assert!(matches!(
pipeline.get_step(0),
Some(PipelineStepRef::Transformer(_))
));
assert!(matches!(
pipeline.get_step(1),
Some(PipelineStepRef::Transformer(_))
));
assert!(matches!(
pipeline.get_step(2),
Some(PipelineStepRef::Estimator(_))
));
// Out of range -> None (sklearn raises IndexError).
assert!(pipeline.get_step(3).is_none());
assert!(pipeline.get_step(10).is_none());
}
#[test]
fn test_pipeline_get_step_by_name() {
// sklearn: p['b'] -> the 'b' step; p['nope'] -> KeyError (live oracle).
let pipeline = Pipeline::new()
.transform_step("a", Box::new(DoublingTransformer))
.transform_step("b", Box::new(DoublingTransformer))
.estimator_step("c", Box::new(SumEstimator));
assert!(matches!(
pipeline.get_step_by_name("b"),
Some(PipelineStepRef::Transformer(_))
));
assert!(matches!(
pipeline.get_step_by_name("c"),
Some(PipelineStepRef::Estimator(_))
));
assert!(matches!(
pipeline.named_step("a"),
Some(PipelineStepRef::Transformer(_))
));
// Unknown name -> None (sklearn raises KeyError).
assert!(pipeline.get_step_by_name("nope").is_none());
assert!(pipeline.named_step("nope").is_none());
}
#[test]
fn test_pipeline_into_slice() -> Result<(), FerroError> {
// sklearn: p[0:2].steps names == ['a','b'] (a sub-Pipeline of the
// contiguous range; pipeline.py:310). p[:1] == ['a']. p[:] == all.
// p[1:1] == [] (empty). (live oracle, sklearn 1.5.2.)
let build = || {
Pipeline::new()
.transform_step("a", Box::new(DoublingTransformer))
.transform_step("b", Box::new(DoublingTransformer))
.estimator_step("c", Box::new(SumEstimator))
};
// [0, 2) -> first two transformer steps, no estimator.
let sub = build().into_slice(0, 2);
assert_eq!(sub.step_names(), vec!["a", "b"]);
assert_eq!(sub.len(), 2);
// [0, 1) -> just the first step.
let sub = build().into_slice(0, 1);
assert_eq!(sub.step_names(), vec!["a"]);
// [0, 3) -> the whole pipeline (full range), estimator preserved.
let sub = build().into_slice(0, 3);
assert_eq!(sub.step_names(), vec!["a", "b", "c"]);
// [2, 3) -> just the estimator step.
let sub = build().into_slice(2, 3);
assert_eq!(sub.step_names(), vec!["c"]);
// Empty range -> empty pipeline.
let sub = build().into_slice(1, 1);
assert!(sub.step_names().is_empty());
assert!(sub.is_empty());
Ok(())
}
#[test]
fn test_pipeline_into_slice_clamps_like_python() {
// sklearn `Pipeline.__getitem__` slices `self.steps[ind]` (Python list
// slice, `pipeline.py:310`): out-of-range bounds CLAMP, never raise
// (#2235). Live oracle (sklearn 1.5.2, 2-step pipeline):
// p[0:5].steps -> ['a','c'] (clamp); p[2:1] -> []; p[5:100] -> [].
let build = || {
Pipeline::new()
.transform_step("a", Box::new(DoublingTransformer))
.estimator_step("c", Box::new(SumEstimator))
};
// end past len (2) -> clamp to all.
assert_eq!(build().into_slice(0, 5).step_names(), vec!["a", "c"]);
// start > end -> empty.
assert!(build().into_slice(2, 1).is_empty());
// start past len -> empty.
assert!(build().into_slice(5, 100).is_empty());
}
#[test]
fn test_pipeline_into_slice_transformer_only_still_fits_estimatorless() -> Result<(), FerroError>
{
// A slice dropping the estimator yields an estimator-less pipeline that
// (like sklearn's transformer-only sub-pipeline) is valid to build but
// errors at fit (matches REQ-2's no-estimator rejection).
let pipeline = Pipeline::new()
.transform_step("a", Box::new(DoublingTransformer))
.estimator_step("c", Box::new(SumEstimator));
let sub = pipeline.into_slice(0, 1);
let x = Array2::<f64>::zeros((2, 2));
let y = Array1::from_vec(vec![0.0, 1.0]);
assert!(matches!(
sub.fit(&x, &y),
Err(FerroError::InvalidParameter { .. })
));
Ok(())
}
#[test]
fn test_fitted_pipeline_named_steps_and_get_step() -> Result<(), FerroError> {
// The accessors work on the FITTED pipeline too. Names match
// construction order (sklearn named_steps on a fitted Pipeline).
let pipeline = Pipeline::new()
.transform_step("scaler", Box::new(DoublingTransformer))
.transform_step("norm", Box::new(DoublingTransformer))
.estimator_step("clf", Box::new(SumEstimator));
let x = Array2::<f64>::zeros((2, 3));
let y = Array1::from_vec(vec![0.0, 1.0]);
let fitted = pipeline.fit(&x, &y)?;
let names: Vec<&str> = fitted.named_steps().iter().map(|(n, _)| *n).collect();
assert_eq!(names, vec!["scaler", "norm", "clf"]);
assert_eq!(fitted.len(), 3);
assert!(!fitted.is_empty());
// get_step by integer.
assert!(matches!(
fitted.get_step(0),
Some(FittedPipelineStepRef::Transformer(_))
));
assert!(matches!(
fitted.get_step(2),
Some(FittedPipelineStepRef::Estimator(_))
));
assert!(fitted.get_step(3).is_none());
// get_step_by_name / named_step.
assert!(matches!(
fitted.get_step_by_name("norm"),
Some(FittedPipelineStepRef::Transformer(_))
));
assert!(matches!(
fitted.named_step("clf"),
Some(FittedPipelineStepRef::Estimator(_))
));
assert!(fitted.named_step("nope").is_none());
Ok(())
}
// -- REQ-5a: passthrough steps -------------------------------------------
#[test]
fn test_passthrough_step_is_identity() -> Result<(), FerroError> {
// Live oracle (sklearn 1.5.2):
// from sklearn.pipeline import Pipeline; import numpy as np
// X = np.array([[1.,2.],[3.,4.],[5.,6.]])
// p = Pipeline([('p','passthrough')]).fit(X)
// np.array_equal(p.transform(X), X) -> True
// A pipeline whose only transformer is a passthrough step leaves X
// unchanged. ferrolearn needs a final estimator slot to fit, so we add a
// SumEstimator after; transform() (the transformer prefix) must equal X.
let pipeline = Pipeline::new()
.passthrough_step("p")
.estimator_step("sum", Box::new(SumEstimator));
let x = ndarray::array![[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]];
let y = Array1::from_vec(vec![0.0, 1.0, 2.0]);
let fitted = pipeline.fit(&x, &y)?;
// transform() applies only the (passthrough) transformer prefix -> X.
assert_eq!(fitted.transform(&x)?, x);
Ok(())
}
#[test]
fn test_passthrough_before_transformer_is_noop() -> Result<(), FerroError> {
// Live oracle (sklearn 1.5.2):
// Pipeline([('pass','passthrough'),('ss',StandardScaler())]).fit(X)
// .transform(X)
// == Pipeline([('ss',StandardScaler())]).fit(X).transform(X) -> True
// A passthrough BEFORE a real transformer is a no-op: the result equals
// the transformer alone. ferrolearn analog: a passthrough before a
// DoublingTransformer == the doubler alone.
let with_pass = Pipeline::new()
.passthrough_step("pass")
.transform_step("dbl", Box::new(DoublingTransformer))
.estimator_step("sum", Box::new(SumEstimator));
let without_pass = Pipeline::new()
.transform_step("dbl", Box::new(DoublingTransformer))
.estimator_step("sum", Box::new(SumEstimator));
let x = ndarray::array![[1.0, 2.0], [3.0, 4.0]];
let y = Array1::from_vec(vec![0.0, 1.0]);
let a = with_pass.fit(&x, &y)?.transform(&x)?;
let b = without_pass.fit(&x, &y)?.transform(&x)?;
assert_eq!(a, b);
// And it equals the doubler applied to X.
assert_eq!(a, x.mapv(|v| v * 2.0));
Ok(())
}
#[test]
fn test_passthrough_after_transformer_is_noop() -> Result<(), FerroError> {
// Live oracle (sklearn 1.5.2):
// Pipeline([('ss',StandardScaler()),('pass','passthrough')]).fit(X)
// .transform(X)
// == Pipeline([('ss',StandardScaler())]).fit(X).transform(X) -> True
// A passthrough AFTER a real transformer is a no-op. ferrolearn analog:
// doubler then passthrough == doubler alone.
let with_pass = Pipeline::new()
.transform_step("dbl", Box::new(DoublingTransformer))
.passthrough_step("pass")
.estimator_step("sum", Box::new(SumEstimator));
let without_pass = Pipeline::new()
.transform_step("dbl", Box::new(DoublingTransformer))
.estimator_step("sum", Box::new(SumEstimator));
let x = ndarray::array![[1.0, 2.0], [3.0, 4.0]];
let y = Array1::from_vec(vec![0.0, 1.0]);
let a = with_pass.fit(&x, &y)?.transform(&x)?;
let b = without_pass.fit(&x, &y)?.transform(&x)?;
assert_eq!(a, b);
assert_eq!(a, x.mapv(|v| v * 2.0));
Ok(())
}
#[test]
fn test_passthrough_step_appears_in_step_names() -> Result<(), FerroError> {
// Live oracle (sklearn 1.5.2):
// p = Pipeline([('p','passthrough'),('ss',StandardScaler())]).fit(X)
// list(p.named_steps.keys()) -> ['p', 'ss']
// p['p'] -> 'passthrough' (still visible)
// A passthrough step is a real, named step: it shows up in
// step_names()/named_steps() in order, exactly like sklearn.
let pipeline = Pipeline::new()
.passthrough_step("p")
.transform_step("dbl", Box::new(DoublingTransformer))
.estimator_step("clf", Box::new(SumEstimator));
assert_eq!(pipeline.step_names(), vec!["p", "dbl", "clf"]);
let named: Vec<&str> = pipeline.named_steps().iter().map(|(n, _)| *n).collect();
assert_eq!(named, vec!["p", "dbl", "clf"]);
// The passthrough step is a transformer-kind step (reachable by name).
assert!(matches!(
pipeline.named_step("p"),
Some(PipelineStepRef::Transformer(_))
));
// And it survives onto the fitted pipeline's introspection.
let x = Array2::<f64>::zeros((2, 2));
let y = Array1::from_vec(vec![0.0, 1.0]);
let fitted = pipeline.fit(&x, &y)?;
assert_eq!(fitted.step_names(), vec!["p", "dbl", "clf"]);
assert!(matches!(
fitted.named_step("p"),
Some(FittedPipelineStepRef::Transformer(_))
));
Ok(())
}
#[test]
fn test_passthrough_transformer_standalone_identity() -> Result<(), FerroError> {
// A standalone PassthroughTransformer: fit_pipeline + transform_pipeline
// is the identity (the building block the no-op step is made of). This is
// the pointwise restatement of sklearn's 'passthrough' == identity
// (Pipeline([('p','passthrough')]).fit(X).transform(X) == X, live 1.5.2).
let p = PassthroughTransformer::<f64>::new();
let x = ndarray::array![[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]];
let y = Array1::from_vec(vec![0.0, 1.0]);
let fitted = p.fit_pipeline(&x, &y)?;
assert_eq!(fitted.transform_pipeline(&x)?, x);
// Default constructs the same no-op.
let fitted2 = PassthroughTransformer::<f64>::default().fit_pipeline(&x, &y)?;
assert_eq!(fitted2.transform_pipeline(&x)?, x);
// The fitted half also has a public constructor/Default.
assert_eq!(
FittedPassthroughTransformer::<f64>::new().transform_pipeline(&x)?,
x
);
Ok(())
}
#[test]
fn test_passthrough_transformer_f32() -> Result<(), FerroError> {
// f32 generic support: the identity no-op for f32 data.
let pipeline = Pipeline::<f32>::new()
.passthrough_step("p")
.transform_step("dbl", Box::new(DoublingTransformerF32))
.estimator_step("sum", Box::new(SumEstimatorF32));
let x = ndarray::array![[1.0f32, 2.0], [3.0, 4.0]];
let y = Array1::from_vec(vec![0.0f32, 1.0]);
let fitted = pipeline.fit(&x, &y)?;
// passthrough then doubler == doubler alone.
assert_eq!(fitted.transform(&x)?, x.mapv(|v| v * 2.0));
Ok(())
}
#[test]
fn test_passthrough_transformer_is_send_sync() {
fn assert_send_sync<T: Send + Sync>() {}
assert_send_sync::<PassthroughTransformer<f64>>();
assert_send_sync::<PassthroughTransformer<f32>>();
assert_send_sync::<FittedPassthroughTransformer<f64>>();
assert_send_sync::<FittedPassthroughTransformer<f32>>();
}
// -- REQ-8: FeatureUnion -------------------------------------------------
/// A transformer that returns its input columns unchanged (width-preserving,
/// the OneToOneFeatureMixin shape — like sklearn's StandardScaler).
struct IdentityTransformer;
impl PipelineTransformer<f64> for IdentityTransformer {
fn fit_pipeline(
&self,
_x: &Array2<f64>,
_y: &Array1<f64>,
) -> Result<Box<dyn FittedPipelineTransformer<f64>>, FerroError> {
Ok(Box::new(FittedIdentityTransformer))
}
}
struct FittedIdentityTransformer;
impl FittedPipelineTransformer<f64> for FittedIdentityTransformer {
fn transform_pipeline(&self, x: &Array2<f64>) -> Result<Array2<f64>, FerroError> {
Ok(x.clone())
}
}
/// A transformer that emits a single column: the row sum (width 1, regardless
/// of input width). Used to exercise mixed-width hstack blocks.
struct RowSumTransformer;
impl PipelineTransformer<f64> for RowSumTransformer {
fn fit_pipeline(
&self,
_x: &Array2<f64>,
_y: &Array1<f64>,
) -> Result<Box<dyn FittedPipelineTransformer<f64>>, FerroError> {
Ok(Box::new(FittedRowSumTransformer))
}
}
struct FittedRowSumTransformer;
impl FittedPipelineTransformer<f64> for FittedRowSumTransformer {
fn transform_pipeline(&self, x: &Array2<f64>) -> Result<Array2<f64>, FerroError> {
let sums: Vec<f64> = x.rows().into_iter().map(|r| r.sum()).collect();
Array2::from_shape_vec((x.nrows(), 1), sums).map_err(|e| FerroError::InvalidParameter {
name: "x".into(),
reason: e.to_string(),
})
}
}
#[test]
fn test_feature_union_hstack_layout() -> Result<(), FerroError> {
// sklearn (live, 1.5.2):
// from sklearn.pipeline import FeatureUnion
// from sklearn.preprocessing import StandardScaler, MinMaxScaler
// import numpy as np
// X = np.array([[1.,2.],[3.,4.],[5.,6.]])
// fu = FeatureUnion([('ss',StandardScaler()),('mm',MinMaxScaler())]).fit(X)
// fu.transform(X).shape -> (3, 4)
// # columns = [ss_col0, ss_col1, mm_col0, mm_col1] (each transformer's
// # full output, concatenated left-to-right in transformer_list order)
// The hstack STRUCTURE is what's asserted here: two width-2 identity
// transformers → a width-4 output whose column blocks are each
// transformer's full output (here, the unchanged input twice). The block
// layout (transformer 0's cols, then transformer 1's cols) IS sklearn's
// _hstack ordering (pipeline.py:1812 np.hstack(Xs)).
let union = FeatureUnion::<f64>::new()
.with_transformer("a", Box::new(IdentityTransformer))
.with_transformer("b", Box::new(IdentityTransformer));
let x = ndarray::array![[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]];
let fitted = union.fit(&x, &())?;
let out = fitted.transform(&x)?;
// Width = sum of widths = 2 + 2 = 4; rows preserved.
assert_eq!(out.dim(), (3, 4));
// Block 0 (cols 0..2) = transformer "a"'s output (== x).
assert_eq!(out.slice(ndarray::s![.., 0..2]).to_owned(), x);
// Block 1 (cols 2..4) = transformer "b"'s output (== x).
assert_eq!(out.slice(ndarray::s![.., 2..4]).to_owned(), x);
Ok(())
}
#[test]
fn test_feature_union_get_feature_names_out() -> Result<(), FerroError> {
// sklearn (live, 1.5.2): the SAME union as above ->
// list(fu.get_feature_names_out())
// == ['ss__x0','ss__x1','mm__x0','mm__x1']
// i.e. each transformer's positional output names ('x0','x1' — the
// OneToOneFeatureMixin default for StandardScaler/MinMaxScaler) prefixed
// by '{name}__' (verbose_feature_names_out=True default, pipeline.py:1608).
// ferrolearn's identity transformers are the width-preserving analog, so
// the NAMING semantics (prefix + positional x{j}) match exactly.
let union = FeatureUnion::<f64>::new()
.with_transformer("ss", Box::new(IdentityTransformer))
.with_transformer("mm", Box::new(IdentityTransformer));
let x = ndarray::array![[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]];
let fitted = union.fit(&x, &())?;
assert_eq!(
fitted.get_feature_names_out(),
vec!["ss__x0", "ss__x1", "mm__x0", "mm__x1"]
);
// transformer_names() preserves union order; n_transformers() counts the
// transformers and n_features_out() the total output width.
assert_eq!(fitted.transformer_names(), vec!["ss", "mm"]);
assert_eq!(fitted.n_transformers(), 2);
assert_eq!(fitted.n_features_out(), 4);
Ok(())
}
#[test]
fn test_feature_union_single_transformer_width() -> Result<(), FerroError> {
// sklearn (live, 1.5.2):
// FeatureUnion([('ss',StandardScaler())]).fit(X).transform(X).shape
// -> (3, 2) (single block == that transformer's width)
// get_feature_names_out() -> ['ss__x0','ss__x1']
let union =
FeatureUnion::<f64>::new().with_transformer("ss", Box::new(IdentityTransformer));
let x = ndarray::array![[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]];
let fitted = union.fit(&x, &())?;
let out = fitted.transform(&x)?;
assert_eq!(out.dim(), (3, 2));
assert_eq!(out, x);
assert_eq!(fitted.get_feature_names_out(), vec!["ss__x0", "ss__x1"]);
Ok(())
}
#[test]
fn test_feature_union_mixed_widths() -> Result<(), FerroError> {
// sklearn (live, 1.5.2) — a union whose transformers emit DIFFERENT
// widths concatenates their blocks correctly. Oracle (StandardScaler
// keeps 3 cols, PCA(1) emits 1):
// X = np.array([[1.,2.,3.],[3.,4.,5.],[5.,6.,7.]])
// fu = FeatureUnion([('ss',StandardScaler()),('pca',PCA(1))]).fit(X)
// fu.transform(X).shape -> (3, 4) (3 + 1)
// list(fu.get_feature_names_out())
// -> ['ss__x0','ss__x1','ss__x2','pca__pca0']
// ferrolearn analog: a width-3 identity + a width-1 row-sum transformer.
// The STRUCTURE (block 0 width 3, block 1 width 1; total 4) is sklearn's.
// (Names: ferrolearn uses the positional x{j} suffix for both blocks —
// the documented OneToOneFeatureMixin default, since the trait objects
// expose no per-output names.)
let union = FeatureUnion::<f64>::new()
.with_transformer("ident", Box::new(IdentityTransformer))
.with_transformer("rowsum", Box::new(RowSumTransformer));
let x = ndarray::array![[1.0, 2.0, 3.0], [3.0, 4.0, 5.0], [5.0, 6.0, 7.0]];
let fitted = union.fit(&x, &())?;
let out = fitted.transform(&x)?;
// 3 (identity) + 1 (row sum) = 4 columns.
assert_eq!(out.dim(), (3, 4));
// Block 0 == x (identity).
assert_eq!(out.slice(ndarray::s![.., 0..3]).to_owned(), x);
// Block 1 == row sums.
let expected_sums = ndarray::array![[6.0], [12.0], [18.0]];
assert_eq!(out.slice(ndarray::s![.., 3..4]).to_owned(), expected_sums);
// Feature names reflect the per-block widths.
assert_eq!(
fitted.get_feature_names_out(),
vec!["ident__x0", "ident__x1", "ident__x2", "rowsum__x0"]
);
Ok(())
}
#[test]
fn test_feature_union_empty() -> Result<(), FerroError> {
// An empty union fits OK and transforms to a (n_samples, 0) matrix — the
// ferrolearn analog of sklearn's empty-hstack branch
// `if not Xs: return np.zeros((X.shape[0], 0))` (pipeline.py:1808).
// (sklearn's PUBLIC FeatureUnion([]).fit raises at _validate_transformers'
// `zip(*[])`, a Python-tuple-unpack artifact, not a numerical contract —
// R-DEV-4: ferrolearn has no such unpack, and the empty-hstack shape is
// the documented (n, 0) result.)
let union = FeatureUnion::<f64>::new();
let x = ndarray::array![[1.0, 2.0], [3.0, 4.0]];
let fitted = union.fit(&x, &())?;
let out = fitted.transform(&x)?;
assert_eq!(out.dim(), (2, 0));
assert!(fitted.get_feature_names_out().is_empty());
assert_eq!(fitted.n_features_out(), 0);
Ok(())
}
#[test]
fn test_feature_union_row_count_consistency() -> Result<(), FerroError> {
// Every sub-output has n_rows == X.nrows(); the hstacked result preserves
// the row count (live oracle: FeatureUnion outputs have X.shape[0] rows).
let union = FeatureUnion::<f64>::new()
.with_transformer("a", Box::new(IdentityTransformer))
.with_transformer("b", Box::new(RowSumTransformer));
let x = ndarray::array![[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]];
let fitted = union.fit(&x, &())?;
let out = fitted.transform(&x)?;
assert_eq!(out.nrows(), x.nrows());
Ok(())
}
#[test]
fn test_feature_union_f32() -> Result<(), FerroError> {
// f32 generic support: same hstack layout for f32 data.
let union = FeatureUnion::<f32>::new()
.with_transformer("a", Box::new(IdentityTransformerF32))
.with_transformer("b", Box::new(IdentityTransformerF32));
let x = ndarray::array![[1.0f32, 2.0], [3.0, 4.0]];
let fitted = union.fit(&x, &())?;
let out = fitted.transform(&x)?;
assert_eq!(out.dim(), (2, 4));
assert_eq!(out.slice(ndarray::s![.., 0..2]).to_owned(), x);
assert_eq!(out.slice(ndarray::s![.., 2..4]).to_owned(), x);
Ok(())
}
/// f32 identity transformer (width-preserving) for the f32 union test.
struct IdentityTransformerF32;
impl PipelineTransformer<f32> for IdentityTransformerF32 {
fn fit_pipeline(
&self,
_x: &Array2<f32>,
_y: &Array1<f32>,
) -> Result<Box<dyn FittedPipelineTransformer<f32>>, FerroError> {
Ok(Box::new(FittedIdentityTransformerF32))
}
}
struct FittedIdentityTransformerF32;
impl FittedPipelineTransformer<f32> for FittedIdentityTransformerF32 {
fn transform_pipeline(&self, x: &Array2<f32>) -> Result<Array2<f32>, FerroError> {
Ok(x.clone())
}
}
#[test]
fn test_feature_union_is_send_sync() {
fn assert_send_sync<T: Send + Sync>() {}
assert_send_sync::<FeatureUnion<f64>>();
assert_send_sync::<FeatureUnion<f32>>();
assert_send_sync::<FittedFeatureUnion<f64>>();
assert_send_sync::<FittedFeatureUnion<f32>>();
}
}