# web-page-classifier
Fast web page type classification using an XGBoost model embedded in a compact binary format.
Classifies web pages into 7 types: **Article**, **Forum**, **Product**, **Collection**, **Listing**, **Documentation**, **Service**.
## Features
- **Three-stage classification**: URL heuristics → HTML signals → ML model
- **Compact embedded model**: ~1.1MB XGBoost binary (200 trees, 181 features)
- **Zero dependencies**: Pure Rust, no ML frameworks required
- **Fast**: Classification in <1ms per page
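
The three-stage flow listed above can be pictured as a cascade in which each cheap stage either answers or defers to the next. The sketch below is illustrative only: `url_stage`, `html_stage`, `model_stage`, and `classify_cascade` are hypothetical names with toy rules, not this crate's API or its actual heuristics.

```rust
/// The seven page types named in this README.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PageType {
    Article,
    Forum,
    Product,
    Collection,
    Listing,
    Documentation,
    Service,
}

/// Hypothetical stage 1: cheap URL heuristics. Returns `None` when inconclusive.
fn url_stage(url: &str) -> Option<PageType> {
    if url.contains("://docs.") || url.contains("/docs/") {
        Some(PageType::Documentation)
    } else if url.contains("/forum/") {
        Some(PageType::Forum)
    } else {
        None
    }
}

/// Hypothetical stage 2: simple HTML signals. Returns `None` when inconclusive.
fn html_stage(html: &str) -> Option<PageType> {
    if html.contains("<article") {
        Some(PageType::Article)
    } else if html.contains("schema.org/Product") {
        Some(PageType::Product)
    } else {
        None
    }
}

/// Stand-in for stage 3 (the ML model). A real model would score features;
/// this placeholder always returns a fixed answer.
fn model_stage(_html: &str) -> PageType {
    PageType::Service
}

/// Try each stage in order and stop at the first one that produces an answer.
fn classify_cascade(url: &str, html: &str) -> PageType {
    url_stage(url)
        .or_else(|| html_stage(html))
        .unwrap_or_else(|| model_stage(html))
}
```

Because `or_else` and `unwrap_or_else` evaluate lazily, the later and more expensive stages in this sketch run only when the earlier ones return `None`.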
## Quick Start
```rust
use web_page_classifier::{classify_url, classify_ml, PageType, N_NUMERIC_FEATURES};
// Stage 1: URL-only classification (fast, no HTML needed)
let page_type = classify_url("https://docs.example.com/api/reference");
assert_eq!(page_type, PageType::Documentation);
// Final stage: ML classification (higher accuracy, needs extracted features)
// Placeholder input: replace the zeros with real extracted feature values
let features = vec![0.0f64; N_NUMERIC_FEATURES];
let (page_type, confidence) = classify_ml(&features, "Article about technology");
```
## Model Details
- **Algorithm**: XGBoost (200 estimators, max depth 8)
- **Features**: 81 numeric (URL patterns, HTML structure, DOM signals) + 100 TF-IDF
- **Training**: 1,497 pages across 7 types with SMOTE oversampling
- **Accuracy**: 87.3% (macro F1: 0.824)
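
The 81 + 100 feature split implies a 181-element model input. As a rough sketch, assuming the numeric features come first and the TF-IDF weights follow (the ordering, the constant names, and the helper `assemble_features` are assumptions, not confirmed by the crate), the vector could be assembled like this:

```rust
/// Feature counts taken from the list above.
const N_NUMERIC: usize = 81;
const N_TFIDF: usize = 100;
const N_TOTAL: usize = N_NUMERIC + N_TFIDF; // 181

/// Hypothetical helper: concatenate numeric and TF-IDF features into one
/// model input. Returns `None` if either slice has the wrong length.
fn assemble_features(numeric: &[f64], tfidf: &[f64]) -> Option<Vec<f64>> {
    if numeric.len() != N_NUMERIC || tfidf.len() != N_TFIDF {
        return None;
    }
    let mut input = Vec::with_capacity(N_TOTAL);
    input.extend_from_slice(numeric); // assumed to occupy indices 0..81
    input.extend_from_slice(tfidf); // assumed to occupy indices 81..181
    Some(input)
}
```

Checking lengths up front keeps a misaligned vector from silently shifting every feature index the trees split on.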
## Note on Binary Size
The embedded model adds ~1.1MB to the compiled binary. This is the cost of shipping a production ML model with zero runtime dependencies.
## License
Licensed under either of Apache License, Version 2.0 or MIT license at your option.