web-page-classifier
Fast web page type classification using an XGBoost model stored in a compact binary format.
Classifies web pages into 7 types: Article, Forum, Product, Collection, Listing, Documentation, Service.
Features
- Three-stage classification: URL heuristics → HTML signals → ML model
- Compact embedded model: ~1.1MB XGBoost binary (200 trees, 181 features)
- Zero dependencies: Pure Rust, no ML frameworks required
- Fast: Classification in <1ms per page
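The three-stage flow can be pictured as a cascade in which each stage either returns an answer or defers to the next one. The sketch below is illustrative only: `classify_cascade`, `url_stage`, `html_stage`, `model_stage`, and the specific URL and HTML rules are hypothetical and not part of this crate's API. The model stage is a stub standing in for the embedded XGBoost model.

```rust
/// The seven page types named in this README.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PageType {
    Article,
    Forum,
    Product,
    Collection,
    Listing,
    Documentation,
    Service,
}

/// Stage 1 (hypothetical rules): decide from the URL alone when a path
/// segment is a strong hint, otherwise defer.
fn url_stage(url: &str) -> Option<PageType> {
    let lower = url.to_ascii_lowercase();
    if lower.contains("/docs/") {
        Some(PageType::Documentation)
    } else if lower.contains("/forum/") {
        Some(PageType::Forum)
    } else if lower.contains("/product/") {
        Some(PageType::Product)
    } else {
        None
    }
}

/// Stage 2 (hypothetical rules): look for cheap markers in the raw HTML.
fn html_stage(html: &str) -> Option<PageType> {
    if html.contains("<article") {
        Some(PageType::Article)
    } else if html.contains("add-to-cart") {
        Some(PageType::Product)
    } else {
        None
    }
}

/// Stage 3 stand-in: a real implementation would evaluate the embedded
/// model on the feature vector. This stub always answers `Service`.
fn model_stage(_features: &[f32]) -> PageType {
    PageType::Service
}

/// Run the stages in order and stop at the first one that answers.
pub fn classify_cascade(url: &str, html: &str, features: &[f32]) -> PageType {
    url_stage(url)
        .or_else(|| html_stage(html))
        .unwrap_or_else(|| model_stage(features))
}
```

Ordering the stages from cheapest to most expensive means many pages never reach the model, which is one plausible way to keep the average cost per page low.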
Quick Start
use ;
// Stage 1: URL-only classification (fast, no HTML needed)
let page_type = classify_url;
assert_eq!;
// Stage 2: ML classification (higher accuracy, needs extracted features)
let features = vec!;
let = classify_ml;
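The ML stage needs a flat feature vector. As a rough sketch of how one might be assembled, the helper below concatenates 81 numeric features and 100 TF-IDF weights into a single 181-element vector. The name `build_feature_vector`, the `f32` element type, and the numeric-first ordering are assumptions made for illustration; this README does not specify the layout the model expects.

```rust
/// Counts taken from this README: 81 numeric features plus 100 TF-IDF terms.
pub const NUMERIC_FEATURES: usize = 81;
pub const TFIDF_FEATURES: usize = 100;
pub const TOTAL_FEATURES: usize = NUMERIC_FEATURES + TFIDF_FEATURES;

/// Concatenate the two feature groups into one flat vector.
/// Assumption: numeric features come first, TF-IDF weights second.
pub fn build_feature_vector(numeric: &[f32], tfidf: &[f32]) -> Result<Vec<f32>, String> {
    if numeric.len() != NUMERIC_FEATURES {
        return Err(format!(
            "expected {} numeric features, got {}",
            NUMERIC_FEATURES,
            numeric.len()
        ));
    }
    if tfidf.len() != TFIDF_FEATURES {
        return Err(format!(
            "expected {} TF-IDF features, got {}",
            TFIDF_FEATURES,
            tfidf.len()
        ));
    }
    let mut features = Vec::with_capacity(TOTAL_FEATURES);
    features.extend_from_slice(numeric);
    features.extend_from_slice(tfidf);
    Ok(features)
}
```

Checking the lengths up front turns a silently misaligned vector into an explicit error, which matters because a tree model will happily score features that are in the wrong positions.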
Model Details
- Algorithm: XGBoost (200 estimators, max depth 8)
- Features: 81 numeric (URL patterns, HTML structure, DOM signals) + 100 TF-IDF
- Training: 1,497 pages across 7 types with SMOTE oversampling
- Accuracy: 87.3% (macro F1: 0.824)
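Accuracy and macro F1 measure different things: accuracy counts correct pages overall, while macro F1 averages the per-type F1 scores so that each type counts equally regardless of how many pages it has. The sketch below computes both from a confusion matrix. The function names are hypothetical, and this is not the evaluation code used to produce the figures above.

```rust
/// Fraction of pages on the diagonal of a square confusion matrix,
/// where `confusion[actual][predicted]` is a page count.
pub fn accuracy(confusion: &[Vec<u32>]) -> f64 {
    let total: f64 = confusion.iter().flatten().map(|&c| c as f64).sum();
    let correct: f64 = (0..confusion.len()).map(|i| confusion[i][i] as f64).sum();
    if total > 0.0 { correct / total } else { 0.0 }
}

/// Macro-averaged F1: the unweighted mean of the per-class F1 scores.
pub fn macro_f1(confusion: &[Vec<u32>]) -> f64 {
    let n = confusion.len();
    if n == 0 {
        return 0.0;
    }
    let mut sum = 0.0;
    for class in 0..n {
        let tp = confusion[class][class] as f64;
        // Row sum = TP + FN (pages that truly belong to this class).
        let actual: f64 = confusion[class].iter().map(|&c| c as f64).sum();
        // Column sum = TP + FP (pages predicted as this class).
        let predicted: f64 = confusion.iter().map(|row| row[class] as f64).sum();
        // F1 = 2*TP / (2*TP + FP + FN); treated as 0 when the class never appears.
        let denom = actual + predicted;
        sum += if denom > 0.0 { 2.0 * tp / denom } else { 0.0 };
    }
    sum / n as f64
}
```

With an invented two-type matrix such as `[[90, 0], [5, 5]]`, accuracy is 0.95 but macro F1 is lower, because the smaller type is only half recovered and its F1 carries the same weight as the larger type's.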
Note on Binary Size
Embedding the model adds ~1.1MB to the compiled binary. That is the cost of shipping a production ML model with zero runtime dependencies.
License
Licensed under either of Apache License, Version 2.0 or MIT license at your option.