Module corpus_quality

Composite corpus-quality metric.

Combines four sub-scores, each in [0, 1], into one tuner objective:

  1. EVR (explained variance ratio): the fraction of variance explained by the projection. Pulled from SphereQLPipeline::explained_variance_ratio. Already in [0, 1].

  2. Bridge coherence: delegates to crate::quality_metric::BridgeCoherence, so the sub-score is bit-identical to the standalone metric, including its neutral floor (BRIDGE_COHERENCE_NEUTRAL), which applies when no bridge is classified Genuine. The floor matters here: under BridgeConfig::min_evr_for_classification, low-EVR corpora have zero Genuine bridges, so a raw genuine/total ratio would pin this 0.30-weighted term at 0. That would freeze the self-tune objective on exactly the bulk corpora it exists for.

  3. Curvature health: the mean of 1 - clamp(|mean_excess_z|, 0, 1) over the per-category curvature signatures returned by curvature_analysis. Categories whose centroids sit close to the corpus-wide spherical-excess regime score near 1; outliers drag the score toward 0.

  4. Category balance: Shannon entropy of category sizes, normalized to [0, 1] by dividing by log2(n_categories). Tracks how evenly concepts are distributed across categories.
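
Sub-scores 3 and 4 are simple enough to sketch in isolation. The following is an illustrative sketch under stated assumptions, not the crate's implementation: the function names, the plain-slice inputs, and the Option return for degenerate inputs are all hypothetical. In particular, how the real metric handles an empty corpus, empty categories, or a single-category corpus is not stated here.

```rust
/// Curvature health: mean over categories of 1 - clamp(|z|, 0, 1).
/// `mean_excess_z` is a hypothetical slice with one z-score per category.
fn curvature_health(mean_excess_z: &[f64]) -> Option<f64> {
    if mean_excess_z.is_empty() {
        // Assumption: no categories means no defined score.
        return None;
    }
    let sum: f64 = mean_excess_z
        .iter()
        .map(|z| 1.0 - z.abs().clamp(0.0, 1.0))
        .sum();
    Some(sum / mean_excess_z.len() as f64)
}

/// Category balance: base-2 Shannon entropy of the category sizes,
/// divided by log2(n_categories) so that a uniform split scores 1.
fn category_balance(sizes: &[usize]) -> Option<f64> {
    let n = sizes.len();
    let total: usize = sizes.iter().sum();
    if n < 2 || total == 0 {
        // Assumption: log2(1) = 0 makes the normalization undefined,
        // so fewer than two categories yields no score.
        return None;
    }
    let total = total as f64;
    let entropy: f64 = sizes
        .iter()
        .filter(|&&s| s > 0) // empty categories contribute 0 to the entropy
        .map(|&s| {
            let p = s as f64 / total;
            -p * p.log2()
        })
        .sum();
    Some(entropy / (n as f64).log2())
}
```

With these definitions, z-scores of 0.0, 0.5, and 2.0 contribute 1, 0.5, and 0 respectively, and four equally sized categories give a balance of 1.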

Default weights (sum = 1):

quality = 0.30 * EVR
        + 0.30 * bridge_coherence
        + 0.20 * curvature_health
        + 0.20 * category_balance

Weights are configurable via CorpusQualityWeights; the metric normalizes by their sum, so they do not need to total 1. The metric is deterministic for a given pipeline.
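
The normalization by the weight sum can be sketched as a plain weighted mean. This is a minimal illustration, not the crate's code: the function name `composite`, the constant `DEFAULT_WEIGHTS`, and the array layout are assumptions made for the example, and the weights are assumed to have already passed validation (so their sum is positive).

```rust
/// Default weights from the formula above, in the order
/// EVR, bridge coherence, curvature health, category balance.
const DEFAULT_WEIGHTS: [f64; 4] = [0.30, 0.30, 0.20, 0.20];

/// Weighted mean of the four sub-scores, normalized by the weight sum.
/// Both arrays use the same order as `DEFAULT_WEIGHTS`.
fn composite(scores: [f64; 4], weights: [f64; 4]) -> f64 {
    let weight_sum: f64 = weights.iter().sum();
    let weighted: f64 = scores
        .iter()
        .zip(weights.iter())
        .map(|(s, w)| s * w)
        .sum();
    weighted / weight_sum
}
```

Because of the division by the weight sum, scaling every weight by the same positive factor leaves the score unchanged: weights of [3, 3, 2, 2] give the same result as the defaults.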

Structs

CorpusQuality
Composite metric: a single tuner-friendly score that fuses EVR, bridge coherence, curvature health, and category balance.
CorpusQualityBreakdown
Per-axis sub-scores for one CorpusQuality::score call. Returned via CorpusQuality::last_breakdown so tuner reports and dashboards can attribute the composite to its components.
CorpusQualityWeights
Weights for the four sub-scores. Must be finite, non-negative, and not all zero. They do NOT need to sum to 1 — CorpusQuality normalizes by their sum at score time.
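
The three constraints on the weights (finite, non-negative, not all zero) can be checked as follows. This is a hedged sketch: the `Weights` struct, its field names, and the `validate` method are hypothetical stand-ins for illustration and are not taken from CorpusQualityWeights.

```rust
/// Hypothetical stand-in for a four-weight configuration.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Weights {
    evr: f64,
    bridge_coherence: f64,
    curvature_health: f64,
    category_balance: f64,
}

impl Weights {
    /// Checks the documented constraints: finite, non-negative, not all zero.
    fn validate(&self) -> Result<(), String> {
        let all = [
            self.evr,
            self.bridge_coherence,
            self.curvature_health,
            self.category_balance,
        ];
        if all.iter().any(|w| !w.is_finite()) {
            return Err("weights must be finite".to_string());
        }
        if all.iter().any(|w| *w < 0.0) {
            return Err("weights must be non-negative".to_string());
        }
        if all.iter().all(|w| *w == 0.0) {
            return Err("at least one weight must be non-zero".to_string());
        }
        Ok(())
    }
}
```

Rejecting the all-zero case up front matters because the score divides by the weight sum, which would otherwise be zero.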