Struct OneHotEncoder 

pub struct OneHotEncoder<F> { /* private fields */ }

An unfitted one-hot encoder for multi-column numeric categorical data.

Input: an Array2<F> in which each column holds numeric category values (typically finite; NaN is also accepted and treated as its own category, as described under Fit::fit). Calling Fit::fit learns, per column, the sorted set of unique values (categories_) and returns a FittedOneHotEncoder. Transform::transform then produces a dense binary matrix with one column per learned category, with the per-feature blocks concatenated left to right.

§Examples

use ferrolearn_preprocess::OneHotEncoder;
use ferrolearn_core::traits::{Fit, Transform};
use ndarray::array;

let enc = OneHotEncoder::<f64>::new();
// Non-contiguous categories {2, 5, 9} in column 0, {0, 1} in column 1.
let x = array![[2.0_f64, 0.0], [5.0, 1.0], [9.0, 0.0], [5.0, 1.0]];
let fitted = enc.fit(&x, &()).unwrap();
assert_eq!(fitted.categories(), &[vec![2.0, 5.0, 9.0], vec![0.0, 1.0]]);
let encoded = fitted.transform(&x).unwrap();
assert_eq!(encoded.ncols(), 5); // 3 + 2 category columns
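
The block layout used in the example above can also be sketched without the crate. The following is a minimal std-only illustration, not the crate's implementation: the function name one_hot and the Vec<Vec<f64>> representation are assumptions made for this sketch, standing in for Array2<F>.

```rust
/// Hypothetical helper: dense one-hot encoding of `rows`, given the sorted
/// categories learned for each feature.
/// Returns `None` if a value is missing from its feature's categories.
fn one_hot(rows: &[Vec<f64>], categories: &[Vec<f64>]) -> Option<Vec<Vec<f64>>> {
    // Total output width is the sum of the per-feature category counts.
    let width: usize = categories.iter().map(Vec::len).sum();
    let mut out = Vec::with_capacity(rows.len());
    for row in rows {
        let mut encoded = vec![0.0; width];
        let mut offset = 0; // start of the current feature's block
        for (value, cats) in row.iter().zip(categories) {
            let idx = cats.iter().position(|c| c == value)?;
            encoded[offset + idx] = 1.0;
            offset += cats.len();
        }
        out.push(encoded);
    }
    Some(out)
}
```

With the categories {2, 5, 9} and {0, 1} from the example, the row [5.0, 1.0] encodes to [0, 1, 0, 0, 1]: the first three columns belong to feature 0 and the last two to feature 1.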

Unknown categories at transform time are, by default, rejected (OneHotHandleUnknown::Error, scikit-learn’s handle_unknown='error'). Configuring with_handle_unknown with OneHotHandleUnknown::Ignore instead encodes an unknown category as an all-zero one-hot block, matching OneHotEncoder(handle_unknown='ignore').

Implementations§

Source§

impl<F: Float + Send + Sync + 'static> OneHotEncoder<F>

Source

pub fn new() -> Self

Create a new OneHotEncoder with scikit-learn’s default handle_unknown='error' (OneHotHandleUnknown::Error).

Source

pub fn with_handle_unknown(self, handle_unknown: OneHotHandleUnknown) -> Self

Set the unknown-category strategy (handle_unknown).

With OneHotHandleUnknown::Ignore an unknown category at transform time becomes an all-zero one-hot block for that feature instead of an error, matching scikit-learn’s OneHotEncoder(handle_unknown='ignore') (_encoders.py:215-240).
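
The difference between the two strategies can be illustrated with a std-only sketch of encoding a single cell. The enum HandleUnknown and the function encode_cell are hypothetical stand-ins for this illustration and are not the crate's types or internals.

```rust
/// Hypothetical stand-in for the crate's `OneHotHandleUnknown` enum.
#[derive(Clone, Copy, Debug, PartialEq)]
enum HandleUnknown {
    Error,
    Ignore,
}

/// Encode one cell against one feature's sorted categories.
/// Returns the one-hot block, or `Err` carrying the unknown value.
fn encode_cell(categories: &[f64], value: f64, mode: HandleUnknown) -> Result<Vec<f64>, f64> {
    let mut block = vec![0.0; categories.len()];
    match categories.iter().position(|&c| c == value) {
        Some(idx) => block[idx] = 1.0,
        // Unknown value under `Ignore`: leave the block all zeros.
        None if mode == HandleUnknown::Ignore => {}
        // Unknown value under `Error`: reject it.
        None => return Err(value),
    }
    Ok(block)
}
```

For categories {2, 5, 9}, the unseen value 7 yields the block [0, 0, 0] under Ignore and an error under Error.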

Source

pub fn handle_unknown(&self) -> OneHotHandleUnknown

Return the configured unknown-category strategy (handle_unknown).

Source

pub fn with_drop(self, drop: OneHotDrop) -> Self

Set the drop strategy (drop).

With OneHotDrop::First the first category of every feature is dropped from the output; with OneHotDrop::IfBinary only binary (2-category) features lose their first category. The dropped category produces an all-zero one-hot block, matching scikit-learn’s OneHotEncoder(drop=...) (_encoders.py:498-516).
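
A rough std-only sketch of how the two drop strategies shape a feature's block follows. DropMode, dropped_index and encode_with_drop are hypothetical names for this illustration; the crate's own enum is OneHotDrop and its internals may differ.

```rust
/// Hypothetical stand-in for the crate's `OneHotDrop` enum.
#[derive(Clone, Copy, Debug, PartialEq)]
enum DropMode {
    Keep,
    First,
    IfBinary,
}

/// Index of the category to drop for a feature with `n_categories` learned values.
fn dropped_index(n_categories: usize, mode: DropMode) -> Option<usize> {
    match mode {
        DropMode::Keep => None,
        DropMode::First if n_categories > 0 => Some(0),
        DropMode::IfBinary if n_categories == 2 => Some(0),
        _ => None,
    }
}

/// One-hot block for `value` with the dropped category's column removed.
/// Assumes `value` is one of `categories`.
fn encode_with_drop(categories: &[f64], value: f64, mode: DropMode) -> Vec<f64> {
    let dropped = dropped_index(categories.len(), mode);
    categories
        .iter()
        .enumerate()
        .filter(|(i, _)| Some(*i) != dropped)
        .map(|(_, &c)| if c == value { 1.0 } else { 0.0 })
        .collect()
}
```

With First, the categories {2, 5, 9} produce a two-column block, and the dropped value 2 encodes to [0, 0]. With IfBinary, the same feature keeps all three columns, while the binary feature {0, 1} shrinks to a single column.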

Source

pub fn drop(&self) -> OneHotDrop

Return the configured drop strategy (drop).

Source

pub fn with_min_frequency(self, min_frequency: usize) -> Self

Set the minimum-frequency threshold for infrequent grouping (min_frequency, integer count).

At fit time a category whose count in the training data is strictly less than min_frequency is grouped into a single trailing “infrequent” output column for that feature, matching scikit-learn’s OneHotEncoder(min_frequency=...) integer form (_encoders.py:566-577, _identify_infrequent :295-296 category_count < self.min_frequency).

Enabling infrequent grouping (min_frequency and/or max_categories) requires drop == OneHotDrop::None_. Combining it with any other drop setting is a deferred interaction (REQ-5a×5b) that is not yet supported, so Fit::fit returns an error in that case.

SCOPE (R-HONEST-3): only the integer-count form is supported. sklearn also accepts a float min_frequency, interpreted as the fraction min_frequency * n_samples (_encoders.py:573-575, :297-299); the float-fraction form is not implemented here (status: NOT-STARTED).
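
The integer-count rule can be sketched with std-only code. This is an illustration under stated assumptions, not the crate's implementation: split_by_min_frequency and block_width are hypothetical helpers, and the per-category counts are assumed to be available already.

```rust
/// Split one feature's categories into frequent ones (each keeps its own column)
/// and infrequent ones (which share a single trailing column).
/// `counts[i]` is the training count of `categories[i]`.
fn split_by_min_frequency(
    categories: &[f64],
    counts: &[usize],
    min_frequency: usize,
) -> (Vec<f64>, Vec<f64>) {
    let mut frequent = Vec::new();
    let mut infrequent = Vec::new();
    for (&cat, &count) in categories.iter().zip(counts) {
        if count < min_frequency {
            infrequent.push(cat); // strictly below the threshold
        } else {
            frequent.push(cat);
        }
    }
    (frequent, infrequent)
}

/// Width of the feature's output block: one column per frequent category,
/// plus one shared column if anything was grouped as infrequent.
fn block_width(frequent: &[f64], infrequent: &[f64]) -> usize {
    frequent.len() + usize::from(!infrequent.is_empty())
}
```

For categories {2, 5, 9, 11} with counts 5, 1, 3, 2 and min_frequency = 3, the categories 5 and 11 are grouped, so the block has three columns: one each for 2 and 9 and one trailing infrequent column. Category 9, whose count equals the threshold, stays frequent.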

Source

pub fn with_max_categories(self, max_categories: usize) -> Self

Set the maximum number of output columns per feature for infrequent grouping (max_categories).

At fit time, if a feature would otherwise produce more than max_categories output columns, the least-frequent categories are grouped into a single trailing “infrequent” column so the block width is at most max_categories (the infrequent column itself counts toward the limit). Mirrors scikit-learn’s OneHotEncoder(max_categories=...) (_encoders.py:579-587, _identify_infrequent :303-315).

Enabling infrequent grouping requires drop == OneHotDrop::None_ (see Self::with_min_frequency).
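
A std-only sketch of the max_categories selection follows. The function name is hypothetical, and the tie-breaking (a stable sort that keeps earlier categories when counts are equal) is an assumption modelled on scikit-learn's stable argsort, not something this page confirms for the crate.

```rust
/// Indices of the categories that would be grouped as infrequent so that the
/// feature's block has at most `max_categories` columns.
/// `counts[i]` is the training count of the i-th sorted category.
fn infrequent_by_max_categories(counts: &[usize], max_categories: usize) -> Vec<usize> {
    if counts.len() <= max_categories {
        return Vec::new(); // already within the limit, nothing is grouped
    }
    // One slot is reserved for the shared infrequent column.
    let keep = max_categories.saturating_sub(1);
    let mut order: Vec<usize> = (0..counts.len()).collect();
    // Stable sort, most frequent first; ties keep ascending category order (assumed).
    order.sort_by(|&a, &b| counts[b].cmp(&counts[a]));
    // Everything after the `keep` most frequent categories is grouped.
    let mut infrequent = order.split_off(keep);
    infrequent.sort_unstable();
    infrequent
}
```

With counts 5, 1, 3, 2 and max_categories = 3, the two most frequent categories (indices 0 and 2) keep their own columns and indices 1 and 3 share the infrequent column, giving a block width of exactly 3.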

Source

pub fn min_frequency(&self) -> Option<usize>

Return the configured minimum-frequency threshold (min_frequency), or None if infrequent grouping by frequency is disabled.

Source

pub fn max_categories(&self) -> Option<usize>

Return the configured maximum output-column limit (max_categories), or None if no limit is imposed.

Trait Implementations§

Source§

impl<F: Clone> Clone for OneHotEncoder<F>

Source§

fn clone(&self) -> OneHotEncoder<F>

Returns a duplicate of the value.
1.0.0 (const: unstable) · Source§

fn clone_from(&mut self, source: &Self)

Performs copy-assignment from source.
Source§

impl<F: Debug> Debug for OneHotEncoder<F>

Source§

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Formats the value using the given formatter.
Source§

impl<F: Float + Send + Sync + 'static> Default for OneHotEncoder<F>

Source§

fn default() -> Self

Returns the “default value” for a type.
Source§

impl<F: Float + Send + Sync + 'static> Fit<ArrayBase<OwnedRepr<F>, Dim<[usize; 2]>>, ()> for OneHotEncoder<F>

Source§

fn fit( &self, x: &Array2<F>, _y: &(), ) -> Result<FittedOneHotEncoder<F>, FerroError>

Fit the encoder by learning the sorted-unique category set per column.

For each input column j, categories_[j] is the distinct values of that column, sorted ascending via partial_cmp and deduped by exact equality — mirroring scikit-learn’s categories_ = _unique(Xi) (sklearn/preprocessing/_encoders.py:99, np.unique per column). The output-column layout (offsets, n_output) is precomputed as the prefix sums / total of the per-column category counts.

np.unique also compares floats by exact equality, so two values that differ by a single ULP are distinct categories here, exactly as in sklearn.
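
The fit step for finite inputs can be sketched with std-only code. This is a minimal illustration, not the crate's source: unique_sorted and layout are hypothetical helper names, and plain f64 slices stand in for the columns of an Array2<F>.

```rust
use std::cmp::Ordering;

/// Sorted, deduplicated values of one column (finite inputs assumed).
fn unique_sorted(column: &[f64]) -> Vec<f64> {
    let mut values = column.to_vec();
    values.sort_by(|a, b| a.partial_cmp(b).unwrap_or(Ordering::Equal));
    values.dedup(); // exact equality, so values one ULP apart stay distinct
    values
}

/// Start offset of each feature's block (prefix sums of the category counts)
/// and the total number of output columns.
fn layout(categories: &[Vec<f64>]) -> (Vec<usize>, usize) {
    let mut offsets = Vec::with_capacity(categories.len());
    let mut total = 0;
    for cats in categories {
        offsets.push(total);
        total += cats.len();
    }
    (offsets, total)
}
```

For the categories {2, 5, 9} and {0, 1}, the offsets are [0, 3] and the total width is 5. The values 1.0 and 1.0 + f64::EPSILON remain two separate categories.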

§NaN handling (#2223)

NaN is treated as a valid category, matching sklearn’s _unique_np (_encode.py:70-74): it sorts last, and a run of duplicate NaNs collapses to a single trailing category. The sort orders NaN after every finite value, and dedup_by collapses consecutive NaNs explicitly, because NaN != NaN means plain equality would never merge them. A NaN cell at transform time is then one-hot encoded into that trailing category. fit never panics (R-CODE-2).
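
One way to get this NaN-last behaviour with std-only code is sketched below. It is an assumption-laden illustration rather than the crate's code: the function name is hypothetical and the comparator is just one possible way to order NaN after all other values.

```rust
use std::cmp::Ordering;

/// Sorted unique values with NaN ordered last and collapsed to a single entry.
fn unique_sorted_nan_last(column: &[f64]) -> Vec<f64> {
    let mut values = column.to_vec();
    values.sort_by(|a, b| match (a.is_nan(), b.is_nan()) {
        (true, true) => Ordering::Equal,
        (true, false) => Ordering::Greater, // NaN goes after every number
        (false, true) => Ordering::Less,
        (false, false) => a.partial_cmp(b).unwrap_or(Ordering::Equal),
    });
    // Plain `==` never merges NaNs, so two NaNs are treated as duplicates explicitly.
    values.dedup_by(|a, b| *a == *b || (a.is_nan() && b.is_nan()));
    values
}
```

For the column [NaN, 2.0, NaN, 1.0, 2.0] this yields three categories: 1.0, 2.0 and a single trailing NaN.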

§Errors

Returns FerroError::InsufficientSamples if the input has zero rows (matching sklearn’s check_array minimum-of-1-sample requirement).

Source§

type Fitted = FittedOneHotEncoder<F>

The fitted model type returned by fit.
Source§

type Error = FerroError

The error type returned by fit.
Source§

impl<F: Float + Send + Sync + 'static> FitTransform<ArrayBase<OwnedRepr<F>, Dim<[usize; 2]>>> for OneHotEncoder<F>

Source§

fn fit_transform(&self, x: &Array2<F>) -> Result<Array2<F>, FerroError>

Fit the encoder on x and return the one-hot encoded output in one step.

§Errors

Returns an error if fitting or transformation fails.

Source§

type FitError = FerroError

The error type for the combined fit-transform operation.
Source§

impl<F: Float + Send + Sync + 'static> Transform<ArrayBase<OwnedRepr<F>, Dim<[usize; 2]>>> for OneHotEncoder<F>

Implement Transform on the unfitted encoder to satisfy the FitTransform: Transform supertrait bound. Calling transform on an unfitted encoder always returns an error.

Source§

fn transform(&self, _x: &Array2<F>) -> Result<Array2<F>, FerroError>

Always returns an error — the encoder must be fitted first.

Use Fit::fit to produce a FittedOneHotEncoder, then call Transform::transform on that.

Source§

type Output = ArrayBase<OwnedRepr<F>, Dim<[usize; 2]>>

The transformed output type.
Source§

type Error = FerroError

The error type returned by transform.

Auto Trait Implementations§

§

impl<F> Freeze for OneHotEncoder<F>

§

impl<F> RefUnwindSafe for OneHotEncoder<F>
where F: RefUnwindSafe,

§

impl<F> Send for OneHotEncoder<F>
where F: Send,

§

impl<F> Sync for OneHotEncoder<F>
where F: Sync,

§

impl<F> Unpin for OneHotEncoder<F>
where F: Unpin,

§

impl<F> UnsafeUnpin for OneHotEncoder<F>

§

impl<F> UnwindSafe for OneHotEncoder<F>
where F: UnwindSafe,

Blanket Implementations§

Source§

impl<T> Any for T
where T: 'static + ?Sized,

Source§

fn type_id(&self) -> TypeId

Gets the TypeId of self.
Source§

impl<T> Borrow<T> for T
where T: ?Sized,

Source§

fn borrow(&self) -> &T

Immutably borrows from an owned value.
Source§

impl<T> BorrowMut<T> for T
where T: ?Sized,

Source§

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value.
Source§

impl<T> ByRef<T> for T

Source§

fn by_ref(&self) -> &T

Source§

impl<T> CloneToUninit for T
where T: Clone,

Source§

unsafe fn clone_to_uninit(&self, dest: *mut u8)

🔬This is a nightly-only experimental API. (clone_to_uninit)
Performs copy-assignment from self to dest.
Source§

impl<T> DistributionExt for T
where T: ?Sized,

Source§

fn rand<T>(&self, rng: &mut (impl Rng + ?Sized)) -> T
where Self: Distribution<T>,

Source§

impl<T> From<T> for T

Source§

fn from(t: T) -> T

Returns the argument unchanged.

Source§

impl<T, U> Imply<T> for U
where T: ?Sized, U: ?Sized,

Source§

impl<T, U> Into<U> for T
where U: From<T>,

Source§

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

Source§

impl<T> IntoEither for T

Source§

fn into_either(self, into_left: bool) -> Either<Self, Self>

Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.
Source§

fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
where F: FnOnce(&Self) -> bool,

Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.
Source§

impl<T> Pointable for T

Source§

const ALIGN: usize

The alignment of pointer.
Source§

type Init = T

The type for initializers.
Source§

unsafe fn init(init: <T as Pointable>::Init) -> usize

Initializes a with the given initializer.
Source§

unsafe fn deref<'a>(ptr: usize) -> &'a T

Dereferences the given pointer.
Source§

unsafe fn deref_mut<'a>(ptr: usize) -> &'a mut T

Mutably dereferences the given pointer.
Source§

unsafe fn drop(ptr: usize)

Drops the object pointed to by the given pointer.
Source§

impl<T> Same for T

Source§

type Output = T

Should always be Self
Source§

impl<SS, SP> SupersetOf<SS> for SP
where SS: SubsetOf<SP>,

Source§

fn to_subset(&self) -> Option<SS>

The inverse inclusion map: attempts to construct self from the equivalent element of its superset.
Source§

fn is_in_subset(&self) -> bool

Checks if self is actually part of its subset T (and can be converted to it).
Source§

fn to_subset_unchecked(&self) -> SS

Use with care! Same as self.to_subset but without any property checks. Always succeeds.
Source§

fn from_subset(element: &SS) -> SP

The inclusion map: converts self to the equivalent element of its superset.
Source§

impl<SS, SP> SupersetOf<SS> for SP
where SS: SubsetOf<SP>,

Source§

fn to_subset(&self) -> Option<SS>

The inverse inclusion map: attempts to construct self from the equivalent element of its superset.
Source§

fn is_in_subset(&self) -> bool

Checks if self is actually part of its subset T (and can be converted to it).
Source§

unsafe fn to_subset_unchecked(&self) -> SS

Use with care! Same as self.to_subset but without any property checks. Always succeeds.
Source§

fn from_subset(element: &SS) -> SP

The inclusion map: converts self to the equivalent element of its superset.
Source§

impl<T> ToOwned for T
where T: Clone,

Source§

type Owned = T

The resulting type after obtaining ownership.
Source§

fn to_owned(&self) -> T

Creates owned data from borrowed data, usually by cloning.
Source§

fn clone_into(&self, target: &mut T)

Uses borrowed data to replace owned data, usually by cloning.
Source§

impl<T, U> TryFrom<U> for T
where U: Into<T>,

Source§

type Error = Infallible

The type returned in the event of a conversion error.
Source§

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.
Source§

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

Source§

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.
Source§

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.
Source§

impl<V, T> VZip<V> for T
where V: MultiLane<T>,

Source§

fn vzip(self) -> V