Crate web_page_classifier

Fast web page type classification.

Classifies each web page as one of seven types (Article, Forum, Product, Collection, Listing, Documentation, Service) using a compact XGBoost model.

§Quick Start

use web_page_classifier::{PageType, classify_url};

let page_type = classify_url("https://docs.example.com/api/reference");
assert_eq!(page_type, PageType::Documentation);
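To give a feel for what URL-based classification by pattern matching can look like, here is a minimal, self-contained sketch. It is not the crate's implementation: `SketchPageType`, `sketch_classify_url`, the substring rules, and the `Article` fallback are all assumptions made for illustration. Only the seven variant names are taken from the description above.

```rust
/// Hypothetical stand-in for the crate's `PageType`.
/// Only the variant names come from the crate description.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SketchPageType {
    Article,
    Forum,
    Product,
    Collection,
    Listing,
    Documentation,
    Service,
}

/// Illustrative URL heuristic: lowercases the URL and returns the type of
/// the first rule whose substring occurs in it.
/// The rules and the `Article` fallback are assumptions, not the crate's logic.
pub fn sketch_classify_url(url: &str) -> SketchPageType {
    // (substring, page type) pairs, checked in order; the first match wins.
    const RULES: &[(&str, SketchPageType)] = &[
        ("docs.", SketchPageType::Documentation),
        ("/docs/", SketchPageType::Documentation),
        ("/api/", SketchPageType::Documentation),
        ("/forum", SketchPageType::Forum),
        ("/thread", SketchPageType::Forum),
        ("/product", SketchPageType::Product),
        ("/blog/", SketchPageType::Article),
    ];

    let url = url.to_ascii_lowercase();
    RULES
        .iter()
        .find(|rule| url.contains(rule.0))
        .map(|rule| rule.1)
        .unwrap_or(SketchPageType::Article)
}
```

With these assumed rules, `sketch_classify_url("https://docs.example.com/api/reference")` matches the `"docs."` rule first and yields `SketchPageType::Documentation`. An ordered rule table keeps precedence explicit, which matters when a URL matches more than one pattern.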

§ML Classification

For higher accuracy, extract numeric features from the HTML DOM and pass them along with title/description text:

use web_page_classifier::{classify_ml, N_NUMERIC_FEATURES};

let features = vec![0.0f64; N_NUMERIC_FEATURES]; // your extracted features
let (page_type, confidence) = classify_ml(&features, "Example Article Title");
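The feature vector in the example above is a zero-filled placeholder. The sketch below shows one way to pack DOM statistics into a fixed-length `f64` vector and check its length before handing it to a model. Everything in it is hypothetical: `SKETCH_N_FEATURES`, `DomStats`, its fields, and their order are illustrative assumptions, since the crate's actual feature set and the value of `N_NUMERIC_FEATURES` are not stated here.

```rust
/// Hypothetical feature count for this sketch.
/// In real code, use the crate's `N_NUMERIC_FEATURES` instead.
pub const SKETCH_N_FEATURES: usize = 4;

/// Hypothetical DOM statistics. The crate's real features and their
/// order are not documented in this section.
pub struct DomStats {
    pub link_count: usize,
    pub paragraph_count: usize,
    pub image_count: usize,
    pub text_len: usize,
}

/// Packs the statistics into a `Vec<f64>` in a fixed (assumed) order.
pub fn to_feature_vector(stats: &DomStats) -> Vec<f64> {
    vec![
        stats.link_count as f64,
        stats.paragraph_count as f64,
        stats.image_count as f64,
        stats.text_len as f64,
    ]
}

/// Returns an error when the slice length differs from what the model expects,
/// so a mis-sized vector is caught before classification.
pub fn check_feature_len(features: &[f64], expected: usize) -> Result<(), String> {
    if features.len() == expected {
        Ok(())
    } else {
        Err(format!(
            "expected {} features, got {}",
            expected,
            features.len()
        ))
    }
}
```

A caller would build the vector with `to_feature_vector`, validate it with `check_feature_len(&features, SKETCH_N_FEATURES)`, and only then pass `&features` on. Checking the length up front turns a silent mismatch into an explicit error.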

Re-exports§

pub use url_heuristics::classify_url;

Modules§

url_heuristics
URL-based page type classification using pattern matching.

Enums§

PageType
Web page type classification.

Constants§

N_NUMERIC_FEATURES
Number of numeric features expected by the ML model.
N_QUALITY_FEATURES
Number of features expected by the quality predictor.

Functions§

classify_ml
Classify a web page using the ML model.
predict_quality
Predict extraction quality (estimated F1 score) from post-extraction features.
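
For reference, the F1 score that `predict_quality` is described as estimating is the harmonic mean of precision and recall. The sketch below computes it directly from known values. It only illustrates the metric and is not how the crate's predictor works; the function names and the convention of returning 0.0 for undefined cases are assumptions.

```rust
/// F1 score: harmonic mean of precision and recall.
/// Returns 0.0 when both inputs are zero (a common convention, assumed here).
pub fn f1_score(precision: f64, recall: f64) -> f64 {
    let sum = precision + recall;
    if sum == 0.0 {
        0.0
    } else {
        2.0 * precision * recall / sum
    }
}

/// F1 from raw counts of true positives, false positives, and false negatives.
/// Precision or recall is treated as 0.0 when its denominator is zero.
pub fn f1_from_counts(true_pos: u64, false_pos: u64, false_neg: u64) -> f64 {
    let tp = true_pos as f64;
    let predicted = tp + false_pos as f64; // everything the extractor returned
    let relevant = tp + false_neg as f64; // everything it should have returned
    let precision = if predicted == 0.0 { 0.0 } else { tp / predicted };
    let recall = if relevant == 0.0 { 0.0 } else { tp / relevant };
    f1_score(precision, recall)
}
```

For instance, an extraction with precision 0.5 and recall 1.0 has an F1 of 2/3, which shows how the harmonic mean penalizes an imbalance between the two more than an arithmetic mean would.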