pub struct TextFeatureSelector<State = Untrained> { /* private fields */ }
Text feature selection using TF-IDF weights and linguistic analysis
This selector analyzes text features represented as term frequency matrices and applies various text-specific selection criteria:
- Document frequency filtering (min_df, max_df)
- TF-IDF scoring for term importance
- Chi-squared statistical tests with target variables
- N-gram analysis (when configured)
- Part-of-speech and syntactic features (when enabled)
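As a rough illustration of the chi-squared criterion listed above, the sketch below scores a single term against a binary label using the standard 2x2 contingency formula. The function name `chi_squared_2x2` and its boolean inputs are assumptions made for this example; the source does not state which chi-squared variant the selector computes internally.

```rust
/// Chi-squared statistic for a 2x2 table of term presence vs. a binary label.
///
/// `present[i]` says whether the term occurs in document `i`;
/// `positive[i]` says whether document `i` belongs to the positive class.
/// Hypothetical helper for illustration, not part of the crate.
pub fn chi_squared_2x2(present: &[bool], positive: &[bool]) -> f64 {
    assert_eq!(present.len(), positive.len());
    // a = present & positive, b = present & negative,
    // c = absent & positive,  d = absent & negative.
    let (mut a, mut b, mut c, mut d) = (0.0_f64, 0.0_f64, 0.0_f64, 0.0_f64);
    for (&p, &y) in present.iter().zip(positive) {
        match (p, y) {
            (true, true) => a += 1.0,
            (true, false) => b += 1.0,
            (false, true) => c += 1.0,
            (false, false) => d += 1.0,
        }
    }
    let n = a + b + c + d;
    let denom = (a + b) * (c + d) * (a + c) * (b + d);
    if denom == 0.0 {
        // A constant term or a single-class label carries no association signal.
        return 0.0;
    }
    n * (a * d - b * c).powi(2) / denom
}
```

A term that occurs in exactly the positive documents gets the maximum score for that sample size, while a term spread evenly across both classes scores zero.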
§Input Format
The input matrix X should be structured as:
- Rows: Documents
- Columns: Terms/features (e.g., from TF-IDF vectorization)
- Values: Term frequencies or TF-IDF scores
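A minimal sketch of how a matrix with this layout might be assembled from tokenized documents, using only the standard library. `count_matrix` is a hypothetical helper, not part of this crate, and it uses nested `Vec`s rather than `Array2` so the example stays dependency-free.

```rust
use std::collections::BTreeMap;

/// Builds a document-term count matrix from pre-tokenized documents.
/// Returns (vocabulary, matrix), where matrix[doc][term_index] is a raw count.
/// Hypothetical helper for illustration only.
pub fn count_matrix(docs: &[Vec<&str>]) -> (BTreeMap<String, usize>, Vec<Vec<f64>>) {
    // Collect the distinct terms; a BTreeMap keeps them in sorted order,
    // which makes the column layout deterministic.
    let mut vocab: BTreeMap<String, usize> = BTreeMap::new();
    for doc in docs {
        for tok in doc {
            vocab.entry(tok.to_string()).or_insert(0);
        }
    }
    // Assign column indices in sorted term order.
    for (i, idx) in vocab.values_mut().enumerate() {
        *idx = i;
    }
    // Rows are documents, columns are terms.
    let mut x = vec![vec![0.0; vocab.len()]; docs.len()];
    for (row, doc) in docs.iter().enumerate() {
        for tok in doc {
            x[row][vocab[*tok]] += 1.0;
        }
    }
    (vocab, x)
}
```

The returned map has the same term-to-column shape as the `HashMap<String, usize>` exposed by `vocabulary()`; a `BTreeMap` is used here only to keep the column order reproducible.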
§Examples
use sklears_feature_selection::domain_specific::text_features::TextFeatureSelector;
use sklears_core::traits::{Fit, Transform};
use scirs2_core::ndarray::{Array1, Array2};
let selector = TextFeatureSelector::new()
    .min_df(0.02) // Minimum 2% document frequency
    .max_df(0.90) // Maximum 90% document frequency
    .max_features(Some(500)) // Select top 500 features
    .ngram_range((1, 2)) // Include unigrams and bigrams
    .include_pos(true); // Include part-of-speech features
let x = Array2::zeros((100, 1000)); // 100 documents, 1000 terms
let y = Array1::zeros(100); // Document labels
let fitted_selector = selector.fit(&x, &y)?;
let transformed_x = fitted_selector.transform(&x)?;
Implementations§
impl TextFeatureSelector<Untrained>
pub fn new() -> Self
pub fn min_df(self, min_df: f64) -> Self
Set the minimum document frequency threshold
Terms that appear in less than a min_df fraction of the documents are filtered out. This helps remove very rare terms that may not be reliable predictors.
§Arguments
min_df - Fraction between 0.0 and 1.0
pub fn max_df(self, max_df: f64) -> Self
Set the maximum document frequency threshold
Terms that appear in more than a max_df fraction of the documents are filtered out. This helps remove very common terms (like stop words) that may not be discriminative.
§Arguments
max_df - Fraction between 0.0 and 1.0
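The min_df and max_df thresholds can be pictured as a band filter over per-term document frequencies. The sketch below is one way to express that, assuming a rectangular matrix stored as nested `Vec`s and treating any non-zero value as an occurrence; `document_frequency` and `filter_by_df` are hypothetical names, and the crate's internal handling may differ.

```rust
/// Fraction of documents (rows) in which each term (column) is non-zero.
/// Assumes every row has the same length.
pub fn document_frequency(x: &[Vec<f64>]) -> Vec<f64> {
    let n_docs = x.len();
    let n_terms = x.first().map_or(0, |row| row.len());
    let mut df = vec![0.0; n_terms];
    for row in x {
        for (j, &v) in row.iter().enumerate() {
            if v != 0.0 {
                df[j] += 1.0;
            }
        }
    }
    if n_docs > 0 {
        for d in df.iter_mut() {
            *d /= n_docs as f64;
        }
    }
    df
}

/// Column indices whose document frequency lies in [min_df, max_df].
/// Terms below min_df or above max_df are dropped.
pub fn filter_by_df(x: &[Vec<f64>], min_df: f64, max_df: f64) -> Vec<usize> {
    document_frequency(x)
        .into_iter()
        .enumerate()
        .filter(|&(_, f)| f >= min_df && f <= max_df)
        .map(|(j, _)| j)
        .collect()
}
```

Both bounds are inclusive here, which matches the wording "fewer than" and "more than" in the method descriptions.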
pub fn max_features(self, max_features: Option<usize>) -> Self
Set the maximum number of features to select
When set to Some(n), selects the top n features by combined score.
When set to None, uses document frequency filtering only.
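A short sketch of top-n selection over a score vector. How the selector combines its scores is not specified in the source, so the `scores` slice here is simply assumed to hold one precomputed value per candidate feature, and `top_n` is a hypothetical helper.

```rust
/// Indices of the `n` highest-scoring features, returned in ascending
/// index order. `None` keeps every candidate.
pub fn top_n(scores: &[f64], max_features: Option<usize>) -> Vec<usize> {
    let mut order: Vec<usize> = (0..scores.len()).collect();
    // Sort by descending score. `total_cmp` defines a total order on f64,
    // so no unwrap on `partial_cmp` is needed. The sort is stable, so ties
    // keep the lower index first.
    order.sort_by(|&a, &b| scores[b].total_cmp(&scores[a]));
    if let Some(n) = max_features {
        order.truncate(n);
    }
    // Restore column order so downstream indexing stays monotonic.
    order.sort_unstable();
    order
}
```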
pub fn ngram_range(self, ngram_range: (usize, usize)) -> Self
Set the n-gram range for feature extraction
- (1, 1): Unigrams only
- (1, 2): Unigrams and bigrams
- (2, 3): Bigrams and trigrams
- etc.
Note: This parameter is informational for compatibility; actual n-gram extraction should be done during preprocessing.
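Because n-gram extraction is left to preprocessing, a step like the following could run before the term matrix is built. This is only a sketch: `ngrams` is a hypothetical helper, and joining tokens with a single space is an assumption about how the terms are named.

```rust
/// Generates every n-gram with min_n <= n <= max_n from a token sequence,
/// joining the tokens of each n-gram with a single space.
pub fn ngrams(tokens: &[&str], ngram_range: (usize, usize)) -> Vec<String> {
    let (min_n, max_n) = ngram_range;
    let mut out = Vec::new();
    // `windows(0)` would panic, so the lower bound is clamped to 1.
    for n in min_n.max(1)..=max_n {
        // `windows` yields nothing when n exceeds the token count.
        for w in tokens.windows(n) {
            out.push(w.join(" "));
        }
    }
    out
}
```

With the range (1, 2), the tokens "the", "cat", "sat" produce the three unigrams followed by "the cat" and "cat sat".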
pub fn include_pos(self, include_pos: bool) -> Self
Enable or disable part-of-speech features
When enabled, the selector will give preference to features that represent important part-of-speech categories.
Note: This requires preprocessing to extract POS features.
pub fn include_syntax(self, include_syntax: bool) -> Self
Enable or disable syntactic features
When enabled, the selector will consider syntactic relationships and dependency parsing features.
Note: This requires preprocessing to extract syntactic features.
impl TextFeatureSelector<Trained>
pub fn vocabulary(&self) -> &HashMap<String, usize>
Get the vocabulary mapping from terms to feature indices
Returns a reference to the vocabulary dictionary where keys are term names and values are their corresponding feature indices.
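A map of this shape goes from term to index; going the other way requires inverting it. The sketch below shows one way to do that with a plain `HashMap`. `terms_for_indices` is a hypothetical helper, and it assumes each index belongs to at most one term.

```rust
use std::collections::HashMap;

/// Looks up the term for each feature index by inverting a term -> index map.
/// Indices with no matching term yield `None`.
pub fn terms_for_indices(
    vocab: &HashMap<String, usize>,
    indices: &[usize],
) -> Vec<Option<String>> {
    // Build the reverse map once, then answer each lookup in O(1).
    let inverse: HashMap<usize, &str> =
        vocab.iter().map(|(term, &idx)| (idx, term.as_str())).collect();
    indices
        .iter()
        .map(|idx| inverse.get(idx).map(|term| term.to_string()))
        .collect()
}
```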
pub fn idf_scores(&self) -> &Array1<Float>
Get the IDF (Inverse Document Frequency) scores
Returns an array where each element is the IDF score for the corresponding selected feature.
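For orientation, one common way to compute IDF is the smoothed form ln((1 + n) / (1 + df)) + 1, where n is the number of documents and df is the number of documents containing the term. The source does not say which formula this crate uses, so the sketch below is an assumption, and `smoothed_idf` is a hypothetical name.

```rust
/// Smoothed inverse document frequency per term:
/// ln((1 + n_docs) / (1 + df)) + 1.
/// `doc_counts[j]` is the number of documents that contain term `j`.
pub fn smoothed_idf(doc_counts: &[usize], n_docs: usize) -> Vec<f64> {
    doc_counts
        .iter()
        .map(|&df| ((1.0 + n_docs as f64) / (1.0 + df as f64)).ln() + 1.0)
        .collect()
}
```

Under this formula a term present in every document gets an IDF of exactly 1.0, and rarer terms get progressively larger values.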
pub fn feature_names(&self) -> &[String]
Get the names of selected features
Returns a slice containing the names of the features that were selected.
pub fn selected_features(&self) -> &[usize]
Get the indices of selected features
Returns a slice containing the original feature indices that were selected during fitting.
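Given such a list of original column indices, reducing a matrix to the selected features is a column gather. The sketch below shows the idea on nested `Vec`s; `select_columns` is a hypothetical helper and is not claimed to be how the crate's `transform` is implemented.

```rust
/// Keeps only the columns listed in `selected`, in the order given.
/// Panics if an index is out of bounds for a row.
pub fn select_columns(x: &[Vec<f64>], selected: &[usize]) -> Vec<Vec<f64>> {
    x.iter()
        .map(|row| selected.iter().map(|&j| row[j]).collect())
        .collect()
}
```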
pub fn n_features_selected(&self) -> usize
Get the number of selected features
Trait Implementations§
impl<State: Clone> Clone for TextFeatureSelector<State>
fn clone(&self) -> TextFeatureSelector<State>
fn clone_from(&mut self, source: &Self)
Performs copy-assignment from source.
impl<State: Debug> Debug for TextFeatureSelector<State>
impl Default for TextFeatureSelector<Untrained>
impl Estimator for TextFeatureSelector<Untrained>
type Error = SklearsError
fn validate_config(&self) -> Result<(), SklearsError>
fn check_compatibility(&self, n_samples: usize, n_features: usize) -> Result<(), SklearsError>
fn metadata(&self) -> EstimatorMetadata
impl Fit<ArrayBase<OwnedRepr<f64>, Dim<[usize; 2]>>, ArrayBase<OwnedRepr<f64>, Dim<[usize; 1]>>> for TextFeatureSelector<Untrained>
type Fitted = TextFeatureSelector<Trained>
fn fit(self, x: &Array2<Float>, y: &Array1<Float>) -> SklResult<Self::Fitted>
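Since `fit` takes `self` by value and returns the `Trained` variant, accessors such as `selected_features` can only be reached after fitting. The sketch below shows the general typestate pattern with a stand-in type. `Selector`, its fields, and the simplified `fit(n_columns)` signature are all hypothetical and do not reflect the crate's private layout or selection logic.

```rust
use std::marker::PhantomData;

/// Marker types standing in for the crate's `Untrained` and `Trained` states.
pub struct Untrained;
pub struct Trained;

/// Hypothetical selector used only to illustrate the pattern.
pub struct Selector<State = Untrained> {
    max_features: Option<usize>,
    selected: Vec<usize>,
    _state: PhantomData<State>,
}

impl Selector<Untrained> {
    pub fn new() -> Self {
        Selector { max_features: None, selected: Vec::new(), _state: PhantomData }
    }

    /// Builder methods take and return `self`, so calls can be chained.
    pub fn max_features(mut self, n: Option<usize>) -> Self {
        self.max_features = n;
        self
    }

    /// Consumes the untrained value and returns the trained one.
    /// Placeholder logic: keeps the first `max_features` columns.
    pub fn fit(self, n_columns: usize) -> Selector<Trained> {
        let keep = self.max_features.map_or(n_columns, |n| n.min(n_columns));
        Selector {
            max_features: self.max_features,
            selected: (0..keep).collect(),
            _state: PhantomData,
        }
    }
}

impl Selector<Trained> {
    /// Only available once `fit` has produced a trained value.
    pub fn selected_features(&self) -> &[usize] {
        &self.selected
    }
}
```

With this layout, calling `selected_features` on an unfitted selector is a compile-time error rather than a runtime check.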
fn fit_with_validation(
    self,
    x: &X,
    y: &Y,
    _x_val: Option<&X>,
    _y_val: Option<&Y>,
) -> Result<(Self::Fitted, FitMetrics), SklearsError>
where
    Self: Sized,
impl SelectorMixin for TextFeatureSelector<Trained>
Auto Trait Implementations§
impl<State> Freeze for TextFeatureSelector<State>
impl<State> RefUnwindSafe for TextFeatureSelector<State>
where
    State: RefUnwindSafe,
impl<State> Send for TextFeatureSelector<State>
where
    State: Send,
impl<State> Sync for TextFeatureSelector<State>
where
    State: Sync,
impl<State> Unpin for TextFeatureSelector<State>
where
    State: Unpin,
impl<State> UnwindSafe for TextFeatureSelector<State>
where
    State: UnwindSafe,
Blanket Implementations§
impl<T> BorrowMut<T> for T
where
    T: ?Sized,
fn borrow_mut(&mut self) -> &mut T
impl<T> CloneToUninit for T
where
    T: Clone,
impl<T> IntoEither for T
fn into_either(self, into_left: bool) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.
fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.
impl<T> Pointable for T
impl<T> StableApi for T
where
    T: Estimator,
const STABLE_SINCE: &'static str = "0.1.0"
const HAS_EXPERIMENTAL_FEATURES: bool = false
impl<SS, SP> SupersetOf<SS> for SP
where
    SS: SubsetOf<SP>,
fn to_subset(&self) -> Option<SS>
The inverse inclusion map: attempts to construct self from the equivalent element of its superset.
fn is_in_subset(&self) -> bool
Checks if self is actually part of its subset T (and can be converted to it).
fn to_subset_unchecked(&self) -> SS
Use with care! Same as self.to_subset but without any property checks. Always succeeds.
fn from_subset(element: &SS) -> SP
The inclusion map: converts self to the equivalent element of its superset.