Struct TextFeatureSelector 

pub struct TextFeatureSelector<State = Untrained> { /* private fields */ }

Text feature selection using TF-IDF weights and linguistic analysis

This selector analyzes text features represented as term frequency matrices and applies various text-specific selection criteria:

  • Document frequency filtering (min_df, max_df)
  • TF-IDF scoring for term importance
  • Chi-squared statistical tests with target variables
  • N-gram analysis (when configured)
  • Part-of-speech and syntactic features (when enabled)
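To make the chi-squared criterion in the list above concrete, here is a minimal sketch. It is not the crate's implementation: it assumes a binary "term present" indicator per document and a binary class label, and `chi_squared_2x2` is a hypothetical helper written only for this illustration.

```rust
/// Chi-squared statistic for the 2x2 contingency table between
/// "term present in document" and a binary class label.
/// Returns 0.0 when a row or column total is zero (statistic undefined).
fn chi_squared_2x2(term_present: &[bool], label: &[bool]) -> f64 {
    assert_eq!(term_present.len(), label.len());
    // a: present & positive, b: present & negative,
    // c: absent & positive,  d: absent & negative
    let (mut a, mut b, mut c, mut d) = (0.0_f64, 0.0, 0.0, 0.0);
    for (&p, &l) in term_present.iter().zip(label) {
        match (p, l) {
            (true, true) => a += 1.0,
            (true, false) => b += 1.0,
            (false, true) => c += 1.0,
            (false, false) => d += 1.0,
        }
    }
    let n = a + b + c + d;
    let denom = (a + b) * (c + d) * (a + c) * (b + d);
    if denom == 0.0 {
        return 0.0;
    }
    n * (a * d - b * c).powi(2) / denom
}
```

A larger statistic indicates a stronger association between the term and the label. A term spread evenly across both classes scores 0.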

§Input Format

The input matrix X should be structured as:

  • Rows: Documents
  • Columns: Terms/features (e.g., from TF-IDF vectorization)
  • Values: Term frequencies or TF-IDF scores
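The layout described above (documents as rows, terms as columns) can be sketched with plain vectors. This is an illustration only: it assumes whitespace tokenization and lowercase folding, uses `Vec<Vec<f64>>` in place of an `Array2`, and `term_count_matrix` is a hypothetical helper, not part of the crate.

```rust
use std::collections::BTreeMap;

/// Builds a dense term-count matrix: one row per document, one column per term.
/// Columns follow alphabetical term order so the layout is deterministic.
fn term_count_matrix(docs: &[&str]) -> (Vec<String>, Vec<Vec<f64>>) {
    // Collect every distinct lowercase token.
    let mut vocab: BTreeMap<String, usize> = BTreeMap::new();
    for doc in docs {
        for tok in doc.split_whitespace() {
            vocab.entry(tok.to_lowercase()).or_insert(0);
        }
    }
    // Assign column indices in sorted key order.
    for (i, idx) in vocab.values_mut().enumerate() {
        *idx = i;
    }
    // Count occurrences of each term in each document.
    let mut x = vec![vec![0.0; vocab.len()]; docs.len()];
    for (row, doc) in x.iter_mut().zip(docs) {
        for tok in doc.split_whitespace() {
            row[vocab[&tok.to_lowercase()]] += 1.0;
        }
    }
    (vocab.into_keys().collect(), x)
}
```

For the documents "the cat sat" and "The cat the", the columns are `cat`, `sat`, `the` and the rows are `[1, 1, 1]` and `[1, 0, 2]`.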

§Examples

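use sklears_feature_selection::domain_specific::text_features::TextFeatureSelector;
use sklears_core::traits::{Fit, Transform};
use scirs2_core::ndarray::{Array1, Array2};

// The `?` operator needs an enclosing function that returns a `Result`.
// Assumes the crate's error type implements `std::error::Error`.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let selector = TextFeatureSelector::new()
        .min_df(0.02)              // Drop terms found in fewer than 2% of documents
        .max_df(0.90)              // Drop terms found in more than 90% of documents
        .max_features(Some(500))   // Keep at most the top 500 features
        .ngram_range((1, 2))       // Unigrams and bigrams (informational only)
        .include_pos(true);        // Prefer part-of-speech features

    // Placeholder data: substitute a real term-frequency or TF-IDF matrix.
    let x = Array2::<f64>::zeros((100, 1000)); // 100 documents, 1000 terms
    let y = Array1::<f64>::zeros(100);         // Document labels

    let fitted_selector = selector.fit(&x, &y)?;
    let transformed_x = fitted_selector.transform(&x)?;
    println!("selected shape: {:?}", transformed_x.dim());
    Ok(())
}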

Implementations§

impl TextFeatureSelector<Untrained>

pub fn new() -> Self

pub fn min_df(self, min_df: f64) -> Self

Set the minimum document frequency threshold

Terms that appear in less than a min_df fraction of documents are filtered out. This removes very rare terms, which may not be reliable predictors.

§Arguments
  • min_df - Fraction between 0.0 and 1.0

pub fn max_df(self, max_df: f64) -> Self

Set the maximum document frequency threshold

Terms that appear in more than a max_df fraction of documents are filtered out. This removes very common terms (such as stop words), which may not be discriminative.

§Arguments
  • max_df - Fraction between 0.0 and 1.0
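The two document frequency thresholds can be sketched as a single pass over the columns. This is a sketch under stated assumptions, not the crate's code: it treats any non-zero entry as "term occurs in document", keeps terms whose frequency lies in the closed interval [min_df, max_df], and uses a hypothetical helper `filter_by_document_frequency` over `Vec<Vec<f64>>` rows.

```rust
/// Returns the column indices whose document frequency (the fraction of rows
/// with a non-zero entry in that column) lies within [min_df, max_df].
fn filter_by_document_frequency(x: &[Vec<f64>], min_df: f64, max_df: f64) -> Vec<usize> {
    let n_docs = x.len();
    if n_docs == 0 {
        return Vec::new();
    }
    let n_terms = x[0].len();
    (0..n_terms)
        .filter(|&j| {
            // Count the documents in which term j occurs at least once.
            let docs_with_term = x.iter().filter(|row| row[j] != 0.0).count();
            let df = docs_with_term as f64 / n_docs as f64;
            df >= min_df && df <= max_df
        })
        .collect()
}
```

With four documents, a term present in all four (frequency 1.0) is dropped by `max_df = 0.9`, and a term present in one (frequency 0.25) is dropped by `min_df = 0.3`.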

pub fn max_features(self, max_features: Option<usize>) -> Self

Set the maximum number of features to select

When set to Some(n), selects the top n features by combined score. When set to None, uses document frequency filtering only.
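A top-n cut of this kind can be sketched as follows. How the combined score is computed is not specified here, so the sketch simply takes a precomputed score per feature. `select_top_features` is a hypothetical helper, and returning the indices in ascending order is an assumption.

```rust
/// Keeps the indices of the `n` highest-scoring features, returned in
/// ascending index order. `None` keeps every candidate.
fn select_top_features(scores: &[f64], max_features: Option<usize>) -> Vec<usize> {
    let mut order: Vec<usize> = (0..scores.len()).collect();
    // Sort by descending score; `total_cmp` gives a total order even with NaN.
    // The sort is stable, so ties keep the lower index first.
    order.sort_by(|&a, &b| scores[b].total_cmp(&scores[a]));
    if let Some(n) = max_features {
        order.truncate(n);
    }
    // Restore the original column order among the survivors.
    order.sort_unstable();
    order
}
```

Asking for more features than exist is harmless because `truncate` does nothing when `n` exceeds the length.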

pub fn ngram_range(self, ngram_range: (usize, usize)) -> Self

Set the n-gram range for feature extraction

  • (1, 1): Unigrams only
  • (1, 2): Unigrams and bigrams
  • (2, 3): Bigrams and trigrams
  • etc.

Note: This parameter is informational for compatibility; actual n-gram extraction should be done during preprocessing.
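Since n-gram extraction is left to preprocessing, the following sketch shows one way to produce word n-grams for a given range before building the feature matrix. It assumes the text is already tokenized; `word_ngrams` is a hypothetical helper and not part of the crate.

```rust
/// Produces all word n-grams with `min_n <= n <= max_n`, joined by single spaces.
fn word_ngrams(tokens: &[&str], ngram_range: (usize, usize)) -> Vec<String> {
    let (min_n, max_n) = ngram_range;
    let mut out = Vec::new();
    // `windows(0)` would panic, so the lower bound is clamped to 1.
    for n in min_n.max(1)..=max_n {
        // `windows(n)` yields nothing when n exceeds the token count.
        for window in tokens.windows(n) {
            out.push(window.join(" "));
        }
    }
    out
}
```

For the tokens `new`, `york`, `city` and the range `(1, 2)`, this yields the three unigrams followed by `new york` and `york city`.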

pub fn include_pos(self, include_pos: bool) -> Self

Enable or disable part-of-speech features

When enabled, the selector will give preference to features that represent important part-of-speech categories.

Note: This requires preprocessing to extract POS features.

pub fn include_syntax(self, include_syntax: bool) -> Self

Enable or disable syntactic features

When enabled, the selector will consider syntactic relationships and dependency parsing features.

Note: This requires preprocessing to extract syntactic features.

impl TextFeatureSelector<Trained>

pub fn vocabulary(&self) -> &HashMap<String, usize>

Get the vocabulary mapping from terms to feature indices

Returns a reference to the vocabulary map, whose keys are terms and whose values are the corresponding feature indices.

pub fn idf_scores(&self) -> &Array1<Float>

Get the IDF (Inverse Document Frequency) scores

Returns an array where each element is the IDF score for the corresponding selected feature.
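The exact IDF formula used by the selector is not stated here. As an illustration only, the sketch below uses the smoothed variant popularized by scikit-learn, idf = ln((1 + n) / (1 + df)) + 1. The formula choice and the helper name `smoothed_idf` are both assumptions.

```rust
/// Smoothed IDF per column: ln((1 + n_docs) / (1 + df_j)) + 1,
/// where df_j counts the documents with a non-zero entry in column j.
fn smoothed_idf(x: &[Vec<f64>]) -> Vec<f64> {
    let n_docs = x.len() as f64;
    let n_terms = x.first().map_or(0, |row| row.len());
    (0..n_terms)
        .map(|j| {
            let df = x.iter().filter(|row| row[j] != 0.0).count() as f64;
            ((1.0 + n_docs) / (1.0 + df)).ln() + 1.0
        })
        .collect()
}
```

Under this formula a term that occurs in every document gets the minimum score of 1.0, and rarer terms score higher.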

pub fn feature_names(&self) -> &[String]

Get the names of selected features

Returns a slice containing the names of the selected features.

pub fn selected_features(&self) -> &[usize]

Get the indices of selected features

Returns a slice of the original (pre-selection) feature indices chosen during fitting.

pub fn n_features_selected(&self) -> usize

Get the number of selected features

pub fn feature_summary(&self) -> Vec<(usize, &str, Float)>

Get feature information as a structured summary

Returns a vector of tuples containing (feature_index, feature_name, idf_score) for all selected features, sorted by feature index.
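A summary of this shape can be assembled from three parallel sequences. The sketch below assumes that the selected indices, names, and IDF scores are aligned position by position, which is not confirmed by this page. The free function `feature_summary` is a hypothetical stand-in for the method.

```rust
/// Zips parallel slices into (feature_index, feature_name, idf_score) tuples
/// and sorts them by feature index.
fn feature_summary<'a>(
    selected: &[usize],
    names: &'a [String],
    idf: &[f64],
) -> Vec<(usize, &'a str, f64)> {
    let mut rows: Vec<(usize, &'a str, f64)> = selected
        .iter()
        .zip(names)
        .zip(idf)
        .map(|((&index, name), &score)| (index, name.as_str(), score))
        .collect();
    rows.sort_by_key(|&(index, _, _)| index);
    rows
}
```

Because the names are borrowed as `&str`, the returned vector cannot outlive the slice of names it was built from.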

Trait Implementations§

impl<State: Clone> Clone for TextFeatureSelector<State>

fn clone(&self) -> TextFeatureSelector<State>

Returns a duplicate of the value.

fn clone_from(&mut self, source: &Self)

Performs copy-assignment from source. Stable since 1.0.0.

impl<State: Debug> Debug for TextFeatureSelector<State>

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Formats the value using the given formatter.

impl Default for TextFeatureSelector<Untrained>

fn default() -> Self

Returns the “default value” for a type.

impl Estimator for TextFeatureSelector<Untrained>

type Config = ()

Configuration type for the estimator

type Error = SklearsError

Error type for the estimator

type Float = f64

The numeric type used by this estimator

fn config(&self) -> &Self::Config

Get estimator configuration

fn validate_config(&self) -> Result<(), SklearsError>

Validate estimator configuration with detailed error context

fn check_compatibility(&self, n_samples: usize, n_features: usize) -> Result<(), SklearsError>

Check if estimator is compatible with given data dimensions

fn metadata(&self) -> EstimatorMetadata

Get estimator metadata

impl Fit<ArrayBase<OwnedRepr<f64>, Dim<[usize; 2]>>, ArrayBase<OwnedRepr<f64>, Dim<[usize; 1]>>> for TextFeatureSelector<Untrained>

type Fitted = TextFeatureSelector<Trained>

The fitted model type

fn fit(self, x: &Array2<Float>, y: &Array1<Float>) -> SklResult<Self::Fitted>

Fit the model to the provided data with validation
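Because fit takes self by value and returns a TextFeatureSelector<Trained>, the training state is tracked in the type. The sketch below shows this typestate pattern in isolation. `Selector` and its fields are hypothetical simplifications, and the fitting logic (a bare minimum document frequency check) is a stand-in, not the crate's algorithm.

```rust
use std::marker::PhantomData;

// Marker types for the two states (hypothetical stand-ins for the crate's own).
struct Untrained;
struct Trained;

// Hypothetical, heavily simplified selector; the real struct has private fields.
struct Selector<State = Untrained> {
    min_df: f64,
    selected: Vec<usize>,
    _state: PhantomData<State>,
}

impl Selector<Untrained> {
    fn new(min_df: f64) -> Self {
        Selector { min_df, selected: Vec::new(), _state: PhantomData }
    }

    // Takes `self` by value, so the untrained selector cannot be reused afterwards.
    fn fit(self, x: &[Vec<f64>]) -> Selector<Trained> {
        let n_docs = x.len().max(1) as f64;
        let n_terms = x.first().map_or(0, |row| row.len());
        // Keep the columns whose document frequency reaches `min_df`.
        let selected = (0..n_terms)
            .filter(|&j| {
                let df = x.iter().filter(|row| row[j] != 0.0).count() as f64 / n_docs;
                df >= self.min_df
            })
            .collect();
        Selector { min_df: self.min_df, selected, _state: PhantomData }
    }
}

impl Selector<Trained> {
    // Accessors exist only in the trained state.
    fn selected_features(&self) -> &[usize] {
        &self.selected
    }
}
```

With this layout, calling `selected_features` on an unfitted selector is a compile-time error rather than a runtime check.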

fn fit_with_validation(self, x: &X, y: &Y, _x_val: Option<&X>, _y_val: Option<&Y>) -> Result<(Self::Fitted, FitMetrics), SklearsError>
where Self: Sized,

Fit with custom validation and early stopping

impl SelectorMixin for TextFeatureSelector<Trained>

fn get_support(&self) -> SklResult<Array1<bool>>

Get support mask
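A support mask is a boolean vector with one entry per original feature, set to true where the feature was kept. As a rough sketch, such a mask can be derived from the selected indices as shown below. `support_mask` is a hypothetical helper, and `n_features` (the original column count) is assumed to be known.

```rust
/// Builds a boolean mask of length `n_features` that is `true` at every
/// selected index. Panics if an index is out of range.
fn support_mask(selected: &[usize], n_features: usize) -> Vec<bool> {
    let mut mask = vec![false; n_features];
    for &j in selected {
        mask[j] = true;
    }
    mask
}
```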

fn transform_features(&self, indices: &[usize]) -> SklResult<Vec<usize>>

Transform by selecting features

impl Transform<ArrayBase<OwnedRepr<f64>, Dim<[usize; 2]>>> for TextFeatureSelector<Trained>

fn transform(&self, x: &Array2<Float>) -> SklResult<Array2<Float>>

Transform the input data
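For a feature selector, transforming typically amounts to keeping the selected columns of each row. The following sketch illustrates that idea with plain vectors. `select_columns` is a hypothetical helper, and the assumption that the output columns follow the order of the selected indices is not confirmed by this page.

```rust
/// Returns a new matrix containing only the selected columns of `x`,
/// in the order given by `selected`. Panics if an index is out of range.
fn select_columns(x: &[Vec<f64>], selected: &[usize]) -> Vec<Vec<f64>> {
    x.iter()
        .map(|row| selected.iter().map(|&j| row[j]).collect())
        .collect()
}
```

The number of rows is unchanged, and each output row has as many entries as there are selected features.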

Auto Trait Implementations§

impl<State> Freeze for TextFeatureSelector<State>

impl<State> RefUnwindSafe for TextFeatureSelector<State>
where State: RefUnwindSafe,

impl<State> Send for TextFeatureSelector<State>
where State: Send,

impl<State> Sync for TextFeatureSelector<State>
where State: Sync,

impl<State> Unpin for TextFeatureSelector<State>
where State: Unpin,

impl<State> UnwindSafe for TextFeatureSelector<State>
where State: UnwindSafe,

Blanket Implementations§

impl<T> Any for T
where T: 'static + ?Sized,

fn type_id(&self) -> TypeId

Gets the TypeId of self.

impl<T> Borrow<T> for T
where T: ?Sized,

fn borrow(&self) -> &T

Immutably borrows from an owned value.

impl<T> BorrowMut<T> for T
where T: ?Sized,

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value.

impl<T> CloneToUninit for T
where T: Clone,

unsafe fn clone_to_uninit(&self, dest: *mut u8)

🔬This is a nightly-only experimental API. (clone_to_uninit)
Performs copy-assignment from self to dest.

impl<T> From<T> for T

fn from(t: T) -> T

Returns the argument unchanged.

impl<T, U> Into<U> for T
where U: From<T>,

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

impl<T> IntoEither for T

fn into_either(self, into_left: bool) -> Either<Self, Self>

Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.

fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
where F: FnOnce(&Self) -> bool,

Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.

impl<T> Pointable for T

const ALIGN: usize

The alignment of the pointer.

type Init = T

The type for initializers.

unsafe fn init(init: <T as Pointable>::Init) -> usize

Initializes a value with the given initializer.

unsafe fn deref<'a>(ptr: usize) -> &'a T

Dereferences the given pointer.

unsafe fn deref_mut<'a>(ptr: usize) -> &'a mut T

Mutably dereferences the given pointer.

unsafe fn drop(ptr: usize)

Drops the object pointed to by the given pointer.

impl<T> Same for T

type Output = T

Should always be Self

impl<T> StableApi for T
where T: Estimator,

const STABLE_SINCE: &'static str = "0.1.0"

API version this type was stabilized in

const HAS_EXPERIMENTAL_FEATURES: bool = false

Whether this API has any experimental features

impl<SS, SP> SupersetOf<SS> for SP
where SS: SubsetOf<SP>,

fn to_subset(&self) -> Option<SS>

The inverse inclusion map: attempts to construct self from the equivalent element of its superset.

fn is_in_subset(&self) -> bool

Checks if self is actually part of its subset T (and can be converted to it).

fn to_subset_unchecked(&self) -> SS

Use with care! Same as self.to_subset but without any property checks. Always succeeds.

fn from_subset(element: &SS) -> SP

The inclusion map: converts self to the equivalent element of its superset.

impl<T> ToOwned for T
where T: Clone,

type Owned = T

The resulting type after obtaining ownership.

fn to_owned(&self) -> T

Creates owned data from borrowed data, usually by cloning.

fn clone_into(&self, target: &mut T)

Uses borrowed data to replace owned data, usually by cloning.

impl<T, U> TryFrom<U> for T
where U: Into<T>,

type Error = Infallible

The type returned in the event of a conversion error.

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.

impl<V, T> VZip<V> for T
where V: MultiLane<T>,

fn vzip(self) -> V