web-page-classifier 0.1.0

Repository: Murrough-Foley/web-page-classifier

Fast web page type classification using an XGBoost model stored in a compact binary format.

Classifies web pages into 7 types: Article, Forum, Product, Collection, Listing, Documentation, Service.

Features

  • Three-stage classification: URL heuristics → HTML signals → ML model
  • Compact embedded model: ~1.1 MB XGBoost binary (200 trees, 181 features)
  • Zero dependencies: Pure Rust, no ML frameworks required
  • Fast: Classification in under 1 ms per page
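
The three-stage flow listed above can be pictured as a cascade that tries the cheapest check first and falls through only when a stage cannot decide. The following is a minimal, self-contained sketch of that idea, not the crate's implementation: the `Unknown` variant, the function names, and every rule inside `url_stage` and `html_stage` are hypothetical, and `ml_stage` is a stand-in that returns a fixed value where a real model would score features.

```rust
/// Page types listed in this README, plus a hypothetical `Unknown`
/// used in this sketch to signal "this stage could not decide".
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PageType {
    Article,
    Forum,
    Product,
    Collection,
    Listing,
    Documentation,
    Service,
    Unknown,
}

/// Stage 1 (hypothetical rules): cheap substring checks on the URL.
fn url_stage(url: &str) -> PageType {
    let u = url.to_ascii_lowercase();
    if u.contains("://docs.") || u.contains("/docs/") {
        PageType::Documentation
    } else if u.contains("/forum") || u.contains("/thread") {
        PageType::Forum
    } else if u.contains("/product/") {
        PageType::Product
    } else {
        PageType::Unknown
    }
}

/// Stage 2 (hypothetical rules): structural hints in the HTML.
fn html_stage(html: &str) -> PageType {
    if html.contains("<article") {
        PageType::Article
    } else if html.contains("add-to-cart") {
        PageType::Product
    } else {
        PageType::Unknown
    }
}

/// Stage 3 stand-in: a real model would score a feature vector here.
fn ml_stage(_html: &str) -> PageType {
    PageType::Service
}

/// Try each stage in order of cost and stop at the first decisive answer.
fn classify_cascade(url: &str, html: &str) -> PageType {
    let from_url = url_stage(url);
    if from_url != PageType::Unknown {
        return from_url;
    }
    let from_html = html_stage(html);
    if from_html != PageType::Unknown {
        return from_html;
    }
    ml_stage(html)
}
```

Ordering the stages by cost means many pages never reach the model, which is one plausible way to keep average latency low.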

Quick Start

use web_page_classifier::{classify_url, classify_ml, PageType, N_NUMERIC_FEATURES};

// Stage 1: URL-only classification (fast, no HTML needed)
let page_type = classify_url("https://docs.example.com/api/reference");
assert_eq!(page_type, PageType::Documentation);

// ML stage (higher accuracy, needs extracted features).
// The all-zero vector is only a placeholder; real input comes from feature extraction.
let features = vec![0.0f64; N_NUMERIC_FEATURES];
let (page_type, confidence) = classify_ml(&features, "Article about technology");
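
This README does not say how the `confidence` value is computed. A common approach for multi-class gradient-boosted models is to apply a softmax to the per-class scores and report the largest probability. The sketch below illustrates that general technique under that assumption; `softmax` and `top_class` are hypothetical helpers, not part of this crate's API.

```rust
/// Convert raw per-class scores (margins) into probabilities with a
/// numerically stable softmax (the maximum is subtracted before `exp`).
fn softmax(scores: &[f64]) -> Vec<f64> {
    let max = scores.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let exps: Vec<f64> = scores.iter().map(|s| (s - max).exp()).collect();
    let sum: f64 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

/// Return the index of the most probable class and its probability.
/// Returns `None` for an empty slice.
fn top_class(scores: &[f64]) -> Option<(usize, f64)> {
    softmax(scores)
        .into_iter()
        .enumerate()
        .max_by(|a, b| a.1.total_cmp(&b.1))
}
```

With a value like this, a caller could treat predictions below some chosen threshold as uncertain and fall back to another signal.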

Model Details

  • Algorithm: XGBoost (200 estimators, max depth 8)
  • Features: 81 numeric (URL patterns, HTML structure, DOM signals) + 100 TF-IDF
  • Training: 1,497 pages across 7 types with SMOTE oversampling
  • Accuracy: 87.3% (macro F1: 0.824)
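
Accuracy and macro F1 answer different questions: accuracy counts correct pages overall, while macro F1 averages the per-class F1 scores so that small classes weigh as much as large ones. The sketch below shows how the two are computed from a confusion matrix. The function name and the matrix layout (`cm[actual][predicted]`) are assumptions made for this illustration, and the numbers in the usage note are invented, not the crate's evaluation data.

```rust
/// Compute accuracy and macro-averaged F1 from a square confusion matrix,
/// where `cm[actual][predicted]` holds page counts.
/// Assumes at least one class and at least one sample.
fn accuracy_and_macro_f1(cm: &[Vec<u32>]) -> (f64, f64) {
    let n = cm.len();
    let total: u32 = cm.iter().flatten().sum();
    let correct: u32 = (0..n).map(|i| cm[i][i]).sum();

    let mut f1_sum = 0.0;
    for c in 0..n {
        let tp = cm[c][c] as f64;
        // Row sum: true positives + false negatives.
        let actual: u32 = cm[c].iter().sum();
        // Column sum: true positives + false positives.
        let predicted: u32 = cm.iter().map(|row| row[c]).sum();
        // F1 = 2tp / (2tp + fp + fn); treated as 0 if the class never occurs.
        let denom = (actual + predicted) as f64;
        f1_sum += if denom > 0.0 { 2.0 * tp / denom } else { 0.0 };
    }
    (correct as f64 / total as f64, f1_sum / n as f64)
}
```

For an invented two-class matrix `[[90, 0], [8, 2]]`, accuracy is 0.92 while macro F1 is about 0.645, because the second class is recovered only 2 times out of 10. A gap between the two metrics, as in the figures above, typically points to weaker performance on some classes.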

Note on Binary Size

The embedded model adds ~1.1 MB to the binary size. This is the cost of shipping a production ML model with zero runtime dependencies.
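
To illustrate why tree-ensemble inference needs no ML framework, here is a minimal sketch of evaluating decision trees stored as flat arrays of nodes. The `Node` layout, the `LEAF` marker, and both function names are hypothetical. The crate's actual binary format is not described in this README, and this sketch ignores missing-value handling.

```rust
/// One node of a decision tree stored in a flat array (hypothetical layout).
/// A leaf is marked by `feature == LEAF`; its score is stored in `value`.
#[derive(Clone, Copy)]
struct Node {
    feature: u16, // index into the feature vector, or LEAF
    value: f32,   // split threshold for internal nodes, score for leaves
    left: u16,    // child index taken when x[feature] < value
    right: u16,   // child index taken otherwise
}

const LEAF: u16 = u16::MAX;

/// Walk one tree from the root (index 0) to a leaf and return its score.
fn predict_tree(nodes: &[Node], x: &[f32]) -> f32 {
    let mut i = 0usize;
    loop {
        let n = nodes[i];
        if n.feature == LEAF {
            return n.value;
        }
        i = if x[n.feature as usize] < n.value {
            n.left as usize
        } else {
            n.right as usize
        };
    }
}

/// Sum the leaf scores of every tree, as boosted ensembles do.
fn predict_ensemble(trees: &[Vec<Node>], x: &[f32]) -> f32 {
    trees.iter().map(|t| predict_tree(t, x)).sum()
}
```

Inference here is only array indexing and float comparisons, so a model of this kind can be embedded as plain data and evaluated with the standard library alone.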

License

Licensed under either of Apache License, Version 2.0 or MIT license at your option.