Function predict_quality 

pub fn predict_quality(features: &[f64]) -> f64

Predict extraction quality (estimated F1 score) from post-extraction features.

Returns a value in [0.0, 1.0] estimating how well the extraction captured the page’s main content. Scores below 0.80 suggest a poor extraction; such pages should be routed to an LLM fallback.
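
The routing rule described above can be sketched as a small helper. This is only an illustration: the `Route` enum, the `route_for_score` function, and the `LOW_QUALITY_THRESHOLD` constant are hypothetical names, not part of the documented API. Treating a NaN score as low quality is likewise an assumption made for the sketch.

```rust
/// Threshold taken from the description above: scores below it suggest a poor extraction.
const LOW_QUALITY_THRESHOLD: f64 = 0.80;

/// Hypothetical routing decision (not part of the documented API).
#[derive(Debug, PartialEq)]
enum Route {
    /// Keep the heuristic extraction as is.
    AcceptExtraction,
    /// Send the page to an LLM fallback.
    LlmFallback,
}

/// Map a score returned by `predict_quality` to a routing decision.
fn route_for_score(score: f64) -> Route {
    // Assumption: a NaN score is treated as low quality, so it never
    // slips through as an accepted extraction.
    if score.is_nan() || score < LOW_QUALITY_THRESHOLD {
        Route::LlmFallback
    } else {
        Route::AcceptExtraction
    }
}
```

A caller would typically write `route_for_score(predict_quality(&features))`. A score of exactly 0.80 is accepted here because the description uses a strict `< 0.80` comparison.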

§Arguments

  • features - Raw (unscaled) quality features. Must have length N_QUALITY_FEATURES. Features include content statistics, page type indicators, and HTML-level signals.

§Feature order (27 features)

  • 0: heuristic_conf
  • 1: content_len
  • 2: word_count
  • 3: vocab_ratio
  • 4: avg_word_len
  • 5: sentence_count
  • 6: avg_sentence_len
  • 7: sentence_uniqueness
  • 8: paragraph_count
  • 9: avg_paragraph_len
  • 10: link_count_in_content
  • 11: link_density
  • 12: boilerplate_keywords
  • 13-19: is_article..is_service (one-hot page type)
  • 20: length_ratio
  • 21: html_size
  • 22: extraction_ratio
  • 23: og_overlap
  • 24: script_count
  • 25: has_jsonld
  • 26: top_bigram_freq
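
Because the features are positional, naming the indices helps avoid off-by-one mistakes when filling the slice. The sketch below shows one way to do that. The constant names and both helper functions are hypothetical, the numeric values are made up for illustration, and encoding the one-hot and boolean features as 1.0/0.0 is an assumption. `N_FEATURES` is assumed to equal the crate's N_QUALITY_FEATURES, based on the "27 features" heading.

```rust
/// Assumed to match the crate's N_QUALITY_FEATURES (27 per the heading above).
const N_FEATURES: usize = 27;

// Index constants copied from the feature order above (hypothetical names).
const HEURISTIC_CONF: usize = 0;
const WORD_COUNT: usize = 2;
const LINK_DENSITY: usize = 11;
const PAGE_TYPE_START: usize = 13; // is_article
const PAGE_TYPE_END: usize = 19; // is_service (inclusive)
const HAS_JSONLD: usize = 25;

/// Set exactly one slot of the one-hot page-type block (indices 13-19).
/// Assumption: the active slot is 1.0 and every other slot is 0.0.
fn set_page_type(features: &mut [f64; N_FEATURES], slot: usize) {
    assert!(
        (PAGE_TYPE_START..=PAGE_TYPE_END).contains(&slot),
        "page-type slot must be in 13..=19"
    );
    features[PAGE_TYPE_START..=PAGE_TYPE_END].fill(0.0);
    features[slot] = 1.0;
}

/// Build a raw (unscaled) feature array with a few illustrative values.
/// Features that are not set explicitly stay at 0.0.
fn example_features() -> [f64; N_FEATURES] {
    let mut f = [0.0; N_FEATURES];
    f[HEURISTIC_CONF] = 0.9; // made-up value
    f[WORD_COUNT] = 850.0; // made-up value
    f[LINK_DENSITY] = 0.05; // made-up value
    f[HAS_JSONLD] = 1.0; // assumption: booleans are encoded as 1.0 / 0.0
    set_page_type(&mut f, PAGE_TYPE_START); // mark the page as an article
    f
}
```

Using a fixed-size array rather than a `Vec<f64>` means the length is checked at compile time; `&example_features()` coerces to the `&[f64]` that `predict_quality` expects.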

§Panics

Panics if features.len() != N_QUALITY_FEATURES.
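
Callers that build the feature slice dynamically may prefer to check the length up front rather than risk the panic. A minimal sketch of such a guard follows. `try_predict` and `EXPECTED_LEN` are hypothetical names, and `EXPECTED_LEN` is assumed to equal N_QUALITY_FEATURES (27 per the feature list); real code should use the crate's constant instead. The predictor is passed in as a parameter so the sketch compiles on its own.

```rust
/// Assumed value of N_QUALITY_FEATURES; prefer the crate's constant in real code.
const EXPECTED_LEN: usize = 27;

/// Call `predict` only when the slice has the expected length.
/// Returns `None` instead of panicking on a length mismatch.
fn try_predict<F>(features: &[f64], predict: F) -> Option<f64>
where
    F: Fn(&[f64]) -> f64,
{
    if features.len() != EXPECTED_LEN {
        return None;
    }
    Some(predict(features))
}
```

With the real function in scope, the call would look like `try_predict(&features, predict_quality)`.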