Fast web page type classification.
Classifies web pages into 7 types using a compact XGBoost model: Article, Forum, Product, Collection, Listing, Documentation, Service.
§Quick Start
use web_page_classifier::{PageType, classify_url};
let page_type = classify_url("https://docs.example.com/api/reference");
assert_eq!(page_type, PageType::Documentation);
§ML Classification
For higher accuracy, extract numeric features from the HTML DOM and pass them along with title/description text:
use web_page_classifier::{classify_ml, N_NUMERIC_FEATURES};
let features = vec![0.0f64; N_NUMERIC_FEATURES]; // your extracted features
let (page_type, confidence) = classify_ml(&features, "Example Article Title");
Re-exports§
pub use url_heuristics::classify_url;
Modules§
- url_heuristics - URL-based page type classification using pattern matching.
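To make "pattern matching" on URLs concrete, here is a minimal, self-contained sketch. It does not reproduce the crate's actual rules: the local `PageType` enum only mirrors the seven variant names listed in the crate description, and `PATH_RULES`, `sketch_classify_url`, and every pattern in the table are hypothetical. It returns `Option` because this page does not say what the real `classify_url` returns when nothing matches.

```rust
/// Local stand-in for the crate's `PageType`; variant names come from the crate description.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[allow(dead_code)]
enum PageType {
    Article,
    Forum,
    Product,
    Collection,
    Listing,
    Documentation,
    Service,
}

/// Hypothetical rule table: (substring to look for in the path, resulting type).
const PATH_RULES: &[(&str, PageType)] = &[
    ("/docs/", PageType::Documentation),
    ("/api/", PageType::Documentation),
    ("/forum/", PageType::Forum),
    ("/thread/", PageType::Forum),
    ("/product/", PageType::Product),
    ("/category/", PageType::Collection),
    ("/search", PageType::Listing),
    ("/blog/", PageType::Article),
];

/// Illustrative URL heuristic (an assumption, not the crate's implementation).
fn sketch_classify_url(url: &str) -> Option<PageType> {
    let lower = url.to_ascii_lowercase();

    // Drop the scheme, if there is one.
    let rest = match lower.split_once("://") {
        Some((_, r)) => r,
        None => lower.as_str(),
    };

    // Split into host and path; the path keeps its leading '/'.
    let split_at = rest.find('/').unwrap_or(rest.len());
    let (host, path) = rest.split_at(split_at);

    // Host-based hints are checked before path-based ones.
    if host.starts_with("docs.") {
        return Some(PageType::Documentation);
    }
    if host.starts_with("forum.") {
        return Some(PageType::Forum);
    }

    // The first matching path rule wins; no match yields None.
    PATH_RULES
        .iter()
        .find(|(needle, _)| path.contains(*needle))
        .map(|(_, page_type)| *page_type)
}
```

A table of rules keeps the heuristic cheap to evaluate and easy to audit, which fits the "fast" goal of a URL-only path; the order of the rules decides ties when several patterns match.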
Enums§
- PageType - Web page type classification.
Constants§
- N_NUMERIC_FEATURES - Number of numeric features expected by the ML model.
- N_QUALITY_FEATURES - Number of features expected by the quality predictor.
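Because both predictors expect a fixed number of features, it can help to check the length of a feature vector before passing it on. The helper below is a hypothetical sketch (`checked_features` is not part of the crate); this page does not say how the crate itself handles a length mismatch, so the check is purely defensive. In real use, `expected_len` would be `N_NUMERIC_FEATURES` or `N_QUALITY_FEATURES`.

```rust
/// Hypothetical helper: returns the vector unchanged if its length matches
/// `expected_len`, and a descriptive error otherwise.
fn checked_features(raw: Vec<f64>, expected_len: usize) -> Result<Vec<f64>, String> {
    if raw.len() != expected_len {
        return Err(format!(
            "expected {} features, got {}",
            expected_len,
            raw.len()
        ));
    }
    Ok(raw)
}
```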
Functions§
- classify_ml - Classify a web page using the ML model.
- predict_quality - Predict extraction quality (estimated F1 score) from post-extraction features.
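Since `classify_ml` returns a confidence alongside the page type, one possible way to combine it with the URL heuristic is to keep the ML label only when the confidence clears a threshold. The sketch below is an assumption about usage, not a documented crate feature: `pick_label` and `min_confidence` are hypothetical, the confidence is assumed to be an `f64`, and the function is generic so it does not depend on the crate's types.

```rust
/// Hypothetical policy: prefer the ML label when its confidence is at least
/// `min_confidence`, otherwise fall back to the URL-based guess.
/// A non-finite confidence (NaN or infinity) also triggers the fallback.
fn pick_label<T>(ml_result: (T, f64), url_fallback: T, min_confidence: f64) -> T {
    let (ml_label, confidence) = ml_result;
    if confidence.is_finite() && confidence >= min_confidence {
        ml_label
    } else {
        url_fallback
    }
}
```

With the crate in scope, a call might look like `pick_label(classify_ml(&features, title), classify_url(url), 0.6)`, where `0.6` is an arbitrary illustrative threshold and the exact argument types depend on the real signatures.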