pub struct CorpusFeatures {
    pub n_items: usize,
    pub n_categories: usize,
    pub dim: usize,
    pub mean_members_per_category: f64,
    pub category_size_entropy: f64,
    pub mean_sparsity: f64,
    pub axis_utilization_entropy: f64,
    pub noise_estimate: f64,
    pub mean_intra_category_similarity: f64,
    pub mean_inter_category_similarity: f64,
    pub category_separation_ratio: f64,
}
Low-dimensional profile of a corpus. Computed once per corpus; fed
into any MetaModel to predict the
pipeline config that’s likely to work best on it.
Fields
n_items: usize
Total item count.
n_categories: usize
Unique category count.
dim: usize
Embedding dimensionality.
mean_members_per_category: f64
n_items / n_categories.
category_size_entropy: f64
Shannon entropy of the category-size distribution, normalized to
[0, 1] by dividing by log(n_categories). High = balanced
category sizes; low = heavily skewed.
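The normalization described above can be sketched as follows. This is a minimal illustration, not the crate's implementation: the function name is hypothetical, and returning 0.0 when there are fewer than two categories (where log(n_categories) is zero) is an assumption made here to avoid dividing by zero.

```rust
use std::collections::HashMap;

/// Normalized Shannon entropy of category sizes, in [0, 1].
/// Hypothetical helper. Assumption: fewer than two categories yields 0.0,
/// because log(n_categories) would be zero.
fn category_size_entropy(categories: &[String]) -> f64 {
    // Count how many items fall into each category.
    let mut counts: HashMap<&str, usize> = HashMap::new();
    for c in categories {
        *counts.entry(c.as_str()).or_insert(0) += 1;
    }
    let k = counts.len();
    if k < 2 {
        return 0.0;
    }
    let n = categories.len() as f64;
    // H = -sum(p * ln p) over the category-size distribution.
    let h: f64 = counts
        .values()
        .map(|&c| {
            let p = c as f64 / n;
            -p * p.ln()
        })
        .sum();
    // Divide by the maximum possible entropy, ln(k).
    h / (k as f64).ln()
}
```

With two categories of equal size the result is 1.0; as one category grows to dominate, the value falls toward 0.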
mean_sparsity: f64
Mean per-item active-axis fraction: |active axes| / dim, averaged
over items. An axis is active when |v_i| > active_threshold.
Despite the name, higher values mean denser vectors.
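A minimal sketch of the active-axis fraction, assuming rectangular embeddings. The function name is hypothetical, and treating an empty corpus or an empty row as contributing 0.0 is an assumption of this sketch rather than documented behavior.

```rust
/// Mean fraction of axes per item whose magnitude exceeds `active_threshold`.
/// Hypothetical helper; empty inputs yield 0.0 by assumption.
fn mean_active_fraction(embeddings: &[Vec<f64>], active_threshold: f64) -> f64 {
    if embeddings.is_empty() {
        return 0.0;
    }
    let total: f64 = embeddings
        .iter()
        .map(|v| {
            if v.is_empty() {
                return 0.0;
            }
            // An axis is active when |v_i| > active_threshold.
            let active = v.iter().filter(|x| x.abs() > active_threshold).count();
            active as f64 / v.len() as f64
        })
        .sum();
    total / embeddings.len() as f64
}
```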
axis_utilization_entropy: f64
Entropy of how often each axis is active across the corpus,
normalized by log(dim). High = all axes used similarly; low =
a few axes dominate.
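One way to read this definition is sketched below. The text does not say how per-axis activation counts become a probability distribution, so this sketch assumes each axis's share is its activation count divided by the total number of activations. The function name and the 0.0 fallbacks (dim < 2, or no active entries at all) are likewise assumptions.

```rust
/// Normalized entropy of per-axis activation counts.
/// Hypothetical helper. Assumption: p_j = count_j / sum(counts).
fn axis_utilization_entropy(embeddings: &[Vec<f64>], active_threshold: f64) -> f64 {
    let dim = embeddings.first().map_or(0, |v| v.len());
    if dim < 2 {
        return 0.0;
    }
    // Count, for every axis, how many items have it active.
    let mut counts = vec![0usize; dim];
    for v in embeddings {
        for (j, x) in v.iter().take(dim).enumerate() {
            if x.abs() > active_threshold {
                counts[j] += 1;
            }
        }
    }
    let total: usize = counts.iter().sum();
    if total == 0 {
        return 0.0;
    }
    // Axes that are never active contribute nothing to the entropy.
    let h: f64 = counts
        .iter()
        .filter(|&&c| c > 0)
        .map(|&c| {
            let p = c as f64 / total as f64;
            -p * p.ln()
        })
        .sum();
    h / (dim as f64).ln()
}
```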
noise_estimate: f64
Per-item median of |v_i| over inactive entries (|v_i| ≤ active_threshold),
averaged across items. A proxy for the noise floor.
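The noise-floor proxy can be sketched like this. The function name is hypothetical, and two details are assumptions not stated in the text: items with no inactive entries are skipped rather than counted as zero, and an even-length median is the mean of the two middle values.

```rust
/// Mean over items of the median |v_i| among inactive entries.
/// Hypothetical helper; items without inactive entries are skipped (assumption).
fn noise_estimate(embeddings: &[Vec<f64>], active_threshold: f64) -> f64 {
    let mut medians = Vec::new();
    for v in embeddings {
        // Keep only magnitudes at or below the threshold.
        let mut inactive: Vec<f64> = v
            .iter()
            .map(|x| x.abs())
            .filter(|a| *a <= active_threshold)
            .collect();
        if inactive.is_empty() {
            continue;
        }
        inactive.sort_by(|a, b| a.total_cmp(b));
        let m = inactive.len();
        let median = if m % 2 == 1 {
            inactive[m / 2]
        } else {
            0.5 * (inactive[m / 2 - 1] + inactive[m / 2])
        };
        medians.push(median);
    }
    if medians.is_empty() {
        0.0
    } else {
        medians.iter().sum::<f64>() / medians.len() as f64
    }
}
```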
mean_intra_category_similarity: f64
Mean intra-category cosine similarity in embedding space. High = items within a category are tightly clustered.
mean_inter_category_similarity: f64
Mean inter-category cosine similarity in embedding space. High = categories overlap heavily; low = categories are semantically distinct.
category_separation_ratio: f64
mean_intra_category_similarity / max(mean_inter_category_similarity, eps).
A ratio-based separation signal: values > 1 mean categories separate well
in embedding space; values near 1 mean the corpus is difficult to partition.
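The three similarity features fit together as in the sketch below. It is illustrative only: the function names are hypothetical, averaging over all unordered item pairs is an assumption (the crate may sample pairs or compare centroids), and so are the value eps = 1e-12 and the convention that a zero-norm vector has cosine similarity 0.

```rust
/// Cosine similarity; returns 0.0 if either vector has zero norm (assumption).
fn cosine(a: &[f64], b: &[f64]) -> f64 {
    let dot: f64 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f64>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f64>().sqrt();
    if na == 0.0 || nb == 0.0 {
        0.0
    } else {
        dot / (na * nb)
    }
}

/// Returns (mean_intra, mean_inter, separation_ratio).
/// Hypothetical helper that averages over all unordered item pairs.
fn separation(categories: &[String], embeddings: &[Vec<f64>]) -> (f64, f64, f64) {
    let (mut intra, mut n_intra) = (0.0_f64, 0_usize);
    let (mut inter, mut n_inter) = (0.0_f64, 0_usize);
    for i in 0..embeddings.len() {
        for j in (i + 1)..embeddings.len() {
            let s = cosine(&embeddings[i], &embeddings[j]);
            if categories[i] == categories[j] {
                intra += s;
                n_intra += 1;
            } else {
                inter += s;
                n_inter += 1;
            }
        }
    }
    let mean_intra = if n_intra > 0 { intra / n_intra as f64 } else { 0.0 };
    let mean_inter = if n_inter > 0 { inter / n_inter as f64 } else { 0.0 };
    // eps guards against division by zero; its value here is an assumption.
    let eps = 1e-12;
    (mean_intra, mean_inter, mean_intra / mean_inter.max(eps))
}
```

The all-pairs loop is quadratic in the number of items, which is acceptable for a sketch but may not reflect how a large corpus would be handled in practice.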
Implementations
impl CorpusFeatures
pub fn feature_names() -> [&'static str; 10]
Stable feature names aligned with Self::to_vec. Useful for
logging, feature importance reports, and CSV headers.
Note: category_separation_ratio is deliberately excluded. It is
derived from two features already named here
(mean_intra_category_similarity and mean_inter_category_similarity),
so including it would double-count under any distance metric. See
Self::to_vec.
pub fn to_vec(&self) -> [f64; 10]
Fixed-length flattened representation in the order declared by
Self::feature_names. Suitable as input to any nearest-neighbor
or regression meta-model. category_separation_ratio is
intentionally excluded because it is derived from two
features already in the vector, so keeping it would double-count.
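Because feature_names and to_vec are aligned, the two arrays can be zipped into a CSV header and row. The helper below is a hypothetical sketch: it takes the two arrays as plain arguments so that it stands alone, and it does no quoting or escaping, which assumes the names contain no commas.

```rust
/// Builds a CSV header line and a CSV data line from aligned arrays.
/// Hypothetical helper; assumes names need no quoting or escaping.
fn csv_lines(names: [&'static str; 10], values: [f64; 10]) -> (String, String) {
    let header = names.join(",");
    let row = values
        .iter()
        .map(|v| v.to_string())
        .collect::<Vec<_>>()
        .join(",");
    (header, row)
}
```

In real use the arguments would presumably be CorpusFeatures::feature_names() and features.to_vec(), where features is an extracted CorpusFeatures value.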
pub fn extract(
    categories: &[String],
    embeddings: &[Vec<f64>],
) -> Result<Self, String>
Extract features from a corpus, taking active_threshold (used in
sparsity and noise estimation) from the default Laplacian config.
Returns an error if the inputs are invalid: empty corpus, mismatched lengths, zero-dim embeddings, or ragged rows.
pub fn extract_with_threshold(
    categories: &[String],
    embeddings: &[Vec<f64>],
    active_threshold: f64,
) -> Result<Self, String>
Extract features with an explicit active-axis threshold. Use this when you want feature values comparable across different Laplacian configurations.
Returns an error if the inputs are invalid (empty corpus, mismatched lengths, zero-dim embeddings, or ragged rows).
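The four error conditions listed above could be checked along these lines. This is a sketch under assumptions, not the crate's code: the function name, the order of the checks, and the error messages are all hypothetical, and "mismatched lengths" is read here as categories and embeddings having different lengths.

```rust
/// Validates corpus inputs and returns the embedding dimensionality.
/// Hypothetical helper; check order and messages are assumptions.
fn validate_corpus(categories: &[String], embeddings: &[Vec<f64>]) -> Result<usize, String> {
    if categories.len() != embeddings.len() {
        return Err(format!(
            "mismatched lengths: {} categories vs {} embeddings",
            categories.len(),
            embeddings.len()
        ));
    }
    if embeddings.is_empty() {
        return Err("empty corpus".to_string());
    }
    let dim = embeddings[0].len();
    if dim == 0 {
        return Err("zero-dim embeddings".to_string());
    }
    // Every row must have the same length as the first one.
    if let Some((i, row)) = embeddings.iter().enumerate().find(|(_, r)| r.len() != dim) {
        return Err(format!(
            "ragged rows: row {} has length {}, expected {}",
            i,
            row.len(),
            dim
        ));
    }
    Ok(dim)
}
```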
Trait Implementations
impl Clone for CorpusFeatures
fn clone(&self) -> CorpusFeatures
fn clone_from(&mut self, source: &Self)
Performs copy-assignment from source.
impl Debug for CorpusFeatures
impl<'de> Deserialize<'de> for CorpusFeatures
fn deserialize<__D>(__deserializer: __D) -> Result<Self, __D::Error>
where
    __D: Deserializer<'de>,
Auto Trait Implementations
impl Freeze for CorpusFeatures
impl RefUnwindSafe for CorpusFeatures
impl Send for CorpusFeatures
impl Sync for CorpusFeatures
impl Unpin for CorpusFeatures
impl UnsafeUnpin for CorpusFeatures
impl UnwindSafe for CorpusFeatures
Blanket Implementations
impl<T> BorrowMut<T> for T
where
    T: ?Sized,
fn borrow_mut(&mut self) -> &mut T
impl<T> CloneToUninit for T
where
    T: Clone,
impl<T> DeserializeOwned for T
where
    T: for<'de> Deserialize<'de>,
impl<T> Instrument for T
fn instrument(self, span: Span) -> Instrumented<Self>
fn in_current_span(self) -> Instrumented<Self>
impl<T> IntoEither for T
fn into_either(self, into_left: bool) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self>
if into_left is true.
Converts self into a Right variant of Either<Self, Self>
otherwise.
fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self>
if into_left(&self) returns true.
Converts self into a Right variant of Either<Self, Self>
otherwise.