pub struct OneHotEncoder<F> { /* private fields */ }
An unfitted one-hot encoder for multi-column numeric categorical data.
Input: an Array2<F> in which each column holds the (finite) numeric category
values of one feature. Calling Fit::fit learns, per column, the sorted unique
set of values (categories_) and returns a FittedOneHotEncoder. The output of
Transform::transform is a dense binary matrix with one column per learned
category, with the per-feature blocks concatenated left to right.
§Examples
use ferrolearn_preprocess::OneHotEncoder;
use ferrolearn_core::traits::{Fit, Transform};
use ndarray::array;
let enc = OneHotEncoder::<f64>::new();
// Non-contiguous categories {2, 5, 9} in column 0, {0, 1} in column 1.
let x = array![[2.0_f64, 0.0], [5.0, 1.0], [9.0, 0.0], [5.0, 1.0]];
let fitted = enc.fit(&x, &()).unwrap();
assert_eq!(fitted.categories(), &[vec![2.0, 5.0, 9.0], vec![0.0, 1.0]]);
let encoded = fitted.transform(&x).unwrap();
assert_eq!(encoded.ncols(), 5); // 3 + 2 category columns
Unknown categories at transform time are, by default, rejected
(OneHotHandleUnknown::Error, scikit-learn’s handle_unknown='error').
Configuring with_handle_unknown with
OneHotHandleUnknown::Ignore instead encodes an unknown category as an
all-zero one-hot block, matching OneHotEncoder(handle_unknown='ignore').
Implementations§
impl<F: Float + Send + Sync + 'static> OneHotEncoder<F>
pub fn new() -> Self
Create a new OneHotEncoder with scikit-learn’s default
handle_unknown='error' (OneHotHandleUnknown::Error).
pub fn with_handle_unknown(self, handle_unknown: OneHotHandleUnknown) -> Self
Set the unknown-category strategy (handle_unknown).
With OneHotHandleUnknown::Ignore an unknown category at transform
time becomes an all-zero one-hot block for that feature instead of an
error, matching scikit-learn’s OneHotEncoder(handle_unknown='ignore')
(_encoders.py:215-240).
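The two strategies can be illustrated with a minimal, std-only sketch for a single feature. This is not the crate's implementation: `HandleUnknown` and `encode_feature` are hypothetical names introduced here, and the error type is simplified to a `String`.

```rust
/// Hypothetical stand-in for the crate's unknown-category strategy.
#[derive(Clone, Copy, Debug, PartialEq)]
enum HandleUnknown {
    Error,
    Ignore,
}

/// One-hot encode `value` against the learned `categories` of one feature.
/// An unknown value either fails (`Error`) or yields an all-zero block (`Ignore`).
fn encode_feature(
    value: f64,
    categories: &[f64],
    mode: HandleUnknown,
) -> Result<Vec<f64>, String> {
    let mut block = vec![0.0; categories.len()];
    match categories.iter().position(|&c| c == value) {
        // Known category: set its column to 1.
        Some(idx) => block[idx] = 1.0,
        // Unknown category, Ignore: leave the block all-zero.
        None if mode == HandleUnknown::Ignore => {}
        // Unknown category, Error: reject the input.
        None => return Err(format!("unknown category {value}")),
    }
    Ok(block)
}
```

With categories `[2.0, 5.0, 9.0]`, the value `7.0` is rejected under `HandleUnknown::Error` and encoded as `[0.0, 0.0, 0.0]` under `HandleUnknown::Ignore`.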
pub fn handle_unknown(&self) -> OneHotHandleUnknown
Return the configured unknown-category strategy (handle_unknown).
pub fn with_drop(self, drop: OneHotDrop) -> Self
Set the drop strategy (drop).
With OneHotDrop::First the first category of every feature is dropped
from the output; with OneHotDrop::IfBinary only binary (2-category)
features lose their first category. The dropped category produces an
all-zero one-hot block, matching scikit-learn’s OneHotEncoder(drop=...)
(_encoders.py:498-516).
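As a rough illustration of how dropping narrows a feature's block, the following std-only sketch encodes one value for one feature. It is an assumption-laden sketch rather than the crate's code: `DropMode`, `dropped_index`, and `encode_with_drop` are hypothetical names, and unknown values simply return `None`.

```rust
/// Hypothetical stand-in for the crate's drop strategy (variant names are assumptions).
#[derive(Clone, Copy, Debug, PartialEq)]
enum DropMode {
    KeepAll,
    First,
    IfBinary,
}

/// Index of the category removed from a feature's block, if any.
fn dropped_index(n_categories: usize, mode: DropMode) -> Option<usize> {
    match mode {
        DropMode::First if n_categories > 0 => Some(0),
        DropMode::IfBinary if n_categories == 2 => Some(0),
        _ => None,
    }
}

/// One-hot encode `value`; the dropped category maps to an all-zero block.
/// Returns `None` when `value` is not among `categories`.
fn encode_with_drop(value: f64, categories: &[f64], mode: DropMode) -> Option<Vec<f64>> {
    let idx = categories.iter().position(|&c| c == value)?;
    let dropped = dropped_index(categories.len(), mode);
    // The block is one column narrower when a category is dropped.
    let width = categories.len() - usize::from(dropped.is_some());
    let mut block = vec![0.0; width];
    match dropped {
        Some(d) if idx == d => {}                   // dropped category: all zeros
        Some(d) if idx > d => block[idx - 1] = 1.0, // shift left past the dropped column
        _ => block[idx] = 1.0,                      // nothing dropped before this index
    }
    Some(block)
}
```

In this sketch a three-category feature keeps all three columns under `DropMode::IfBinary`, while a two-category feature shrinks to a single column.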
pub fn drop(&self) -> OneHotDrop
Return the configured drop strategy (drop).
pub fn with_min_frequency(self, min_frequency: usize) -> Self
Set the minimum-frequency threshold for infrequent grouping
(min_frequency, integer count).
At fit time a category whose count in the training data is strictly
less than min_frequency is grouped into a single trailing
“infrequent” output column for that feature, matching scikit-learn’s
OneHotEncoder(min_frequency=...) integer form
(_encoders.py:566-577; _identify_infrequent, :295-296, tests
category_count < self.min_frequency).
Enabling infrequent grouping (min_frequency and/or max_categories)
requires drop == OneHotDrop::None_; combining it with drop is a
deferred interaction (REQ-5a×5b) and Fit::fit returns an error.
Scope (R-HONEST-3): only the integer-count form is supported. scikit-learn
also accepts a float min_frequency, interpreted as the fraction
min_frequency * n_samples (_encoders.py:573-575, :297-299); the
float-fraction form is not implemented here.
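The integer-count rule can be sketched in plain std Rust. This is a hedged illustration, not the crate's implementation: `split_by_min_frequency` and `block_width` are hypothetical helpers, and the sketch assumes finite input values.

```rust
/// Split one column's categories into (frequent, infrequent) by count.
/// A category is infrequent when its count is strictly less than `min_frequency`.
/// Assumes finite values; both returned lists are in ascending order.
fn split_by_min_frequency(column: &[f64], min_frequency: usize) -> (Vec<f64>, Vec<f64>) {
    let mut sorted = column.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));

    // Run-length count of the sorted values: (category, count).
    let mut counts: Vec<(f64, usize)> = Vec::new();
    for v in sorted {
        if let Some(last) = counts.last_mut() {
            if last.0 == v {
                last.1 += 1;
                continue;
            }
        }
        counts.push((v, 1));
    }

    let mut frequent = Vec::new();
    let mut infrequent = Vec::new();
    for (category, n) in counts {
        if n < min_frequency {
            infrequent.push(category);
        } else {
            frequent.push(category);
        }
    }
    (frequent, infrequent)
}

/// Output width of the feature: one column per frequent category plus a
/// single trailing "infrequent" column when any category was grouped.
fn block_width(frequent: &[f64], infrequent: &[f64]) -> usize {
    frequent.len() + usize::from(!infrequent.is_empty())
}
```

For the column `[1, 1, 1, 2, 3, 3, 4]` with `min_frequency = 2`, categories `2` and `4` each occur once and are grouped, so the block has two frequent columns plus one infrequent column.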
pub fn with_max_categories(self, max_categories: usize) -> Self
Set the maximum number of output columns per feature for infrequent
grouping (max_categories).
At fit time, if a feature would otherwise produce more than
max_categories output columns, the least-frequent categories are
grouped into a single trailing “infrequent” column so the block width is
at most max_categories (the infrequent column itself counts toward the
limit). Mirrors scikit-learn’s OneHotEncoder(max_categories=...)
(_encoders.py:579-587, _identify_infrequent :303-315).
Enabling infrequent grouping requires drop == OneHotDrop::None_ (see
Self::with_min_frequency).
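A minimal sketch of the max_categories selection, given per-category counts in sorted-category order, might look as follows. `infrequent_by_max_categories` is a hypothetical helper, `max_categories >= 1` is assumed, and the tie-breaking among equal counts (a stable ascending sort, so earlier categories are grouped first) is an assumption of this sketch rather than a documented guarantee of the crate.

```rust
/// Given `counts[i]` = training count of category `i` (categories in sorted
/// order), return the indices grouped into the trailing "infrequent" column.
/// Assumes `max_categories >= 1`; tie-breaking is an assumption of this sketch.
fn infrequent_by_max_categories(counts: &[usize], max_categories: usize) -> Vec<usize> {
    // Nothing to group if the block already fits within the limit.
    if counts.len() <= max_categories {
        return Vec::new();
    }
    // One output slot is reserved for the infrequent column itself.
    let keep = max_categories.saturating_sub(1);

    // Stable sort of category indices by ascending count.
    let mut order: Vec<usize> = (0..counts.len()).collect();
    order.sort_by_key(|&i| counts[i]);

    // Everything except the `keep` most frequent categories is grouped.
    let mut infrequent = order[..counts.len() - keep].to_vec();
    infrequent.sort_unstable();
    infrequent
}
```

With counts `[5, 1, 3, 1, 4]` and `max_categories = 3`, the two most frequent categories (indices 0 and 4) keep their own columns and indices 1, 2, and 3 share the infrequent column, giving a block width of 3.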
pub fn min_frequency(&self) -> Option<usize>
Return the configured minimum-frequency threshold (min_frequency), or
None if infrequent grouping by frequency is disabled.
pub fn max_categories(&self) -> Option<usize>
Return the configured maximum output-column limit (max_categories), or
None if no limit is imposed.
Trait Implementations§
impl<F: Clone> Clone for OneHotEncoder<F>
fn clone(&self) -> OneHotEncoder<F>
fn clone_from(&mut self, source: &Self)
Performs copy-assignment from source.
impl<F: Debug> Debug for OneHotEncoder<F>
impl<F: Float + Send + Sync + 'static> Fit<ArrayBase<OwnedRepr<F>, Dim<[usize; 2]>>, ()> for OneHotEncoder<F>
fn fit(
    &self,
    x: &Array2<F>,
    _y: &(),
) -> Result<FittedOneHotEncoder<F>, FerroError>
Fit the encoder by learning the sorted-unique category set per column.
For each input column j, categories_[j] is the distinct values of that
column, sorted ascending via partial_cmp and deduped by exact
equality — mirroring scikit-learn’s categories_ = _unique(Xi)
(sklearn/preprocessing/_encoders.py:99, np.unique per column).
The output-column layout (offsets, n_output) is precomputed as the
prefix sums / total of the per-column category counts.
Exact float equality is what np.unique does, so two values that differ
by an ULP are distinct categories here, exactly as in sklearn.
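The per-column learning step and the offset layout described above can be sketched with std only. This is an illustration under stated assumptions, not the crate's code: `sorted_unique` and `layout` are hypothetical helpers, columns are passed as plain slices instead of an Array2, and the input is assumed finite.

```rust
/// Sorted, exactly-deduplicated category set of one column (finite values assumed).
fn sorted_unique(column: &[f64]) -> Vec<f64> {
    let mut cats = column.to_vec();
    // partial_cmp is total on finite floats; Equal is only a fallback.
    cats.sort_by(|a, b| a.partial_cmp(b).unwrap_or(std::cmp::Ordering::Equal));
    // Exact equality: values one ULP apart remain distinct categories.
    cats.dedup();
    cats
}

/// Prefix sums of the per-column category counts: the starting output column
/// of each feature block, plus the total number of output columns.
fn layout(categories: &[Vec<f64>]) -> (Vec<usize>, usize) {
    let mut offsets = Vec::with_capacity(categories.len());
    let mut total = 0;
    for cats in categories {
        offsets.push(total);
        total += cats.len();
    }
    (offsets, total)
}
```

Applied to the columns of the example above, `[2, 5, 9, 5]` and `[0, 1, 0, 1]`, this yields the category sets `[2, 5, 9]` and `[0, 1]`, offsets `[0, 3]`, and 5 output columns.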
§NaN handling (#2223)
NaN is treated as a valid category, matching sklearn’s _unique_np
(_encode.py:70-74): the sort orders NaN after every finite value, and
dedup_by collapses a run of consecutive NaNs into a SINGLE sorted-last
category (plain equality would keep them apart, since NaN != NaN). A
NaN cell at transform time then one-hots that trailing category. fit never
panics (R-CODE-2).
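A std-only sketch of a NaN-last sort followed by a NaN-aware dedup is shown below. It is one plausible way to obtain the behaviour described here, not necessarily how the crate does it; `sorted_unique_nan_last` is a hypothetical name.

```rust
/// Sorted unique values of one column, with all NaNs collapsed into a single
/// trailing entry. Never panics on NaN input.
fn sorted_unique_nan_last(column: &[f64]) -> Vec<f64> {
    use std::cmp::Ordering;

    let mut cats = column.to_vec();
    cats.sort_by(|a, b| match (a.is_nan(), b.is_nan()) {
        (true, true) => Ordering::Equal,
        (true, false) => Ordering::Greater, // NaN sorts after every finite value
        (false, true) => Ordering::Less,
        (false, false) => a.partial_cmp(b).unwrap_or(Ordering::Equal),
    });
    // Plain `==` never merges NaNs (NaN != NaN), so they are matched explicitly.
    cats.dedup_by(|a, b| *a == *b || (a.is_nan() && b.is_nan()));
    cats
}
```

For the input `[NaN, 2.0, NaN, 1.0, 2.0]` the result has three entries: `1.0`, `2.0`, and a single trailing NaN.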
§Errors
Returns FerroError::InsufficientSamples if the input has zero rows
(matching sklearn’s check_array minimum-of-1-sample requirement).
type Fitted = FittedOneHotEncoder<F>
type Error = FerroError
impl<F: Float + Send + Sync + 'static> FitTransform<ArrayBase<OwnedRepr<F>, Dim<[usize; 2]>>> for OneHotEncoder<F>
fn fit_transform(&self, x: &Array2<F>) -> Result<Array2<F>, FerroError>
Fit the encoder on x and return the one-hot encoded output in one step.
§Errors
Returns an error if fitting or transformation fails.
type FitError = FerroError
impl<F: Float + Send + Sync + 'static> Transform<ArrayBase<OwnedRepr<F>, Dim<[usize; 2]>>> for OneHotEncoder<F>
Implement Transform on the unfitted encoder to satisfy the FitTransform: Transform
supertrait bound. Calling transform on an unfitted encoder always returns an error.
fn transform(&self, _x: &Array2<F>) -> Result<Array2<F>, FerroError>
Always returns an error — the encoder must be fitted first.
Use Fit::fit to produce a FittedOneHotEncoder, then call
Transform::transform on that.
type Error = FerroError
Auto Trait Implementations§
impl<F> Freeze for OneHotEncoder<F>
impl<F> RefUnwindSafe for OneHotEncoder<F>
where
    F: RefUnwindSafe,
impl<F> Send for OneHotEncoder<F>
where
    F: Send,
impl<F> Sync for OneHotEncoder<F>
where
    F: Sync,
impl<F> Unpin for OneHotEncoder<F>
where
    F: Unpin,
impl<F> UnsafeUnpin for OneHotEncoder<F>
impl<F> UnwindSafe for OneHotEncoder<F>
where
    F: UnwindSafe,
Blanket Implementations§
impl<T> BorrowMut<T> for T
where
    T: ?Sized,
fn borrow_mut(&mut self) -> &mut T
impl<T> CloneToUninit for T
where
    T: Clone,
impl<T> DistributionExt for T
where
    T: ?Sized,
impl<T, U> Imply<T> for U
impl<T> IntoEither for T
fn into_either(self, into_left: bool) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self>
if into_left is true.
Converts self into a Right variant of Either<Self, Self>
otherwise.
fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self>
if into_left(&self) returns true.
Converts self into a Right variant of Either<Self, Self>
otherwise.
impl<T> Pointable for T
impl<SS, SP> SupersetOf<SS> for SP
where
    SS: SubsetOf<SP>,
fn to_subset(&self) -> Option<SS>
The inverse inclusion map: attempts to construct self from the equivalent element of its
superset.
fn is_in_subset(&self) -> bool
Checks if self is actually part of its subset T (and can be converted to it).
fn to_subset_unchecked(&self) -> SS
Use with care! Same as self.to_subset but without any property checks. Always succeeds.
fn from_subset(element: &SS) -> SP
The inclusion map: converts self to the equivalent element of its superset.
impl<SS, SP> SupersetOf<SS> for SP
where
    SS: SubsetOf<SP>,
fn to_subset(&self) -> Option<SS>
The inverse inclusion map: attempts to construct self from the equivalent element of its
superset.
fn is_in_subset(&self) -> bool
Checks if self is actually part of its subset T (and can be converted to it).
unsafe fn to_subset_unchecked(&self) -> SS
Use with care! Same as self.to_subset but without any property checks. Always succeeds.
fn from_subset(element: &SS) -> SP
The inclusion map: converts self to the equivalent element of its superset.