Module node_classifier

Supervised content-node classifier (the “hybrid” half of the system).

Content extraction is, at heart, a supervised node-classification problem: given the candidate DOM nodes of a page and labelled ground-truth article text, learn which candidate is the article body. This is far more sample-efficient and stable than asking RL to discover node selection from a sparse reward. The division of labour is therefore:

this classifier picks which node is the content root (supervised);
the RL policy tunes the continuous extraction params within it.

Labels come for free from the data: the candidate whose extracted text has the highest token-F1 against the ground-truth article is the positive example, and all other candidates are negatives (see label_from_f1).
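The labelling rule can be sketched as follows. This is a minimal illustration, not the crate's implementation: token_f1 and one_hot_labels are hypothetical names, tokenization is assumed to be lowercased whitespace splitting, ties are assumed to go to the first candidate, and a page where no candidate overlaps the ground truth is assumed to yield all-zero labels. The real label_from_f1 may differ on each of these points.

```rust
use std::collections::HashMap;

/// Multiset of lowercased, whitespace-separated tokens (assumed tokenization).
fn token_counts(text: &str) -> HashMap<String, usize> {
    let mut counts = HashMap::new();
    for tok in text.split_whitespace() {
        *counts.entry(tok.to_lowercase()).or_insert(0) += 1;
    }
    counts
}

/// Token-level F1 between a candidate's extracted text and the ground truth.
fn token_f1(candidate: &str, truth: &str) -> f64 {
    let cand = token_counts(candidate);
    let gold = token_counts(truth);
    let cand_total: usize = cand.values().sum();
    let gold_total: usize = gold.values().sum();
    if cand_total == 0 || gold_total == 0 {
        return 0.0;
    }
    // Overlap counts each shared token at most as often as it appears on both sides.
    let overlap: usize = cand
        .iter()
        .map(|(tok, &n)| n.min(gold.get(tok).copied().unwrap_or(0)))
        .sum();
    if overlap == 0 {
        return 0.0;
    }
    let precision = overlap as f64 / cand_total as f64;
    let recall = overlap as f64 / gold_total as f64;
    2.0 * precision * recall / (precision + recall)
}

/// Hypothetical stand-in for label_from_f1: 1.0 for the best-matching
/// candidate, 0.0 for the rest. All zeros if nothing overlaps (assumption).
fn one_hot_labels(candidate_texts: &[&str], truth: &str) -> Vec<f32> {
    let scores: Vec<f64> = candidate_texts.iter().map(|t| token_f1(t, truth)).collect();
    let mut best: Option<usize> = None;
    for (i, &s) in scores.iter().enumerate() {
        // Strict comparison keeps the first candidate on ties.
        if s > 0.0 && best.map_or(true, |b| s > scores[b]) {
            best = Some(i);
        }
    }
    let mut labels = vec![0.0; candidate_texts.len()];
    if let Some(b) = best {
        labels[b] = 1.0;
    }
    labels
}
```

Because F1 balances precision and recall, a candidate that contains the whole article plus a little page chrome can still beat a tighter candidate that misses half the article.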

Structs§

HybridExtraction: Result of a hybrid extraction.
HybridExtractor: End-to-end hybrid extractor: the classifier (supervised) picks the content node, then the RL-tuned ExtractionParams drive block-level extraction within it. When no trained classifier is supplied it falls back to the Readability-style NodeFeatures::heuristic_content_score, so it is useful even before any training has happened.
NodeClassifier: A small MLP that maps NodeFeatures to a content-probability.
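The fallback behaviour described for HybridExtractor can be illustrated with a small sketch. Everything here is assumed for illustration: ToyFeatures stands in for NodeFeatures (whose real fields are not listed on this page), heuristic_score is a toy formula rather than the actual heuristic_content_score, and pick_content_root is a hypothetical helper. The point is only the selection logic: score every candidate with the classifier when one is supplied, otherwise with the heuristic, and take the highest-scoring node.

```rust
/// Hypothetical stand-in for NodeFeatures; the real fields are not shown here.
#[derive(Debug, Clone, Copy)]
struct ToyFeatures {
    text_len: f32,
    /// Fraction of the node's text that sits inside links, in [0, 1].
    link_density: f32,
}

impl ToyFeatures {
    /// Toy Readability-flavoured score: long text with few links scores high.
    fn heuristic_score(&self) -> f32 {
        self.text_len * (1.0 - self.link_density)
    }
}

/// Pick the index of the content root. Uses the classifier if one is
/// supplied, otherwise falls back to the heuristic score.
fn pick_content_root(
    candidates: &[ToyFeatures],
    classifier: Option<&dyn Fn(&ToyFeatures) -> f32>,
) -> Option<usize> {
    let score = |f: &ToyFeatures| match classifier {
        Some(model) => model(f),
        None => f.heuristic_score(),
    };
    let mut best: Option<(usize, f32)> = None;
    for (i, f) in candidates.iter().enumerate() {
        let s = score(f);
        // Strict comparison keeps the first candidate on ties.
        if best.map_or(true, |(_, b)| s > b) {
            best = Some((i, s));
        }
    }
    best.map(|(i, _)| i)
}
```

Keeping both scorers behind the same selection step means the rest of the pipeline does not need to know whether a trained model is present.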

Functions§

build_classifier_dataset: Build a pointwise training set for the classifier from labelled samples.
label_from_f1: Derive the supervised label vector for one page: 1.0 for the candidate whose extracted text best matches the ground truth (token F1), 0.0 for the rest.
train_classifier: Train a NodeClassifier on labelled samples for the given number of epochs, each a single full-batch step.
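To make "full-batch steps" concrete, here is a minimal sketch of pointwise training in which every epoch performs exactly one parameter update computed from all samples. For brevity it substitutes a single logistic unit for the MLP, and it assumes binary cross-entropy loss and plain gradient descent. LogisticUnit, train_full_batch, and the lr parameter are hypothetical; the actual signature, loss, and optimizer of train_classifier are not shown on this page.

```rust
fn sigmoid(z: f32) -> f32 {
    1.0 / (1.0 + (-z).exp())
}

/// Hypothetical single-unit model standing in for the MLP.
struct LogisticUnit {
    weights: Vec<f32>,
    bias: f32,
}

impl LogisticUnit {
    fn new(dim: usize) -> Self {
        Self { weights: vec![0.0; dim], bias: 0.0 }
    }

    /// Content probability for one feature vector.
    fn predict(&self, x: &[f32]) -> f32 {
        let z = self.weights.iter().zip(x).map(|(w, xi)| w * xi).sum::<f32>() + self.bias;
        sigmoid(z)
    }
}

/// Run `epochs` full-batch gradient steps on binary cross-entropy.
fn train_full_batch(
    model: &mut LogisticUnit,
    xs: &[Vec<f32>],
    ys: &[f32],
    epochs: usize,
    lr: f32,
) {
    if xs.is_empty() {
        return;
    }
    let n = xs.len() as f32;
    for _ in 0..epochs {
        let mut grad_w = vec![0.0f32; model.weights.len()];
        let mut grad_b = 0.0f32;
        // Accumulate the gradient over the whole dataset before updating.
        for (x, &y) in xs.iter().zip(ys) {
            let err = model.predict(x) - y; // dLoss/dz for sigmoid + BCE
            for (g, xi) in grad_w.iter_mut().zip(x) {
                *g += err * xi;
            }
            grad_b += err;
        }
        // One update per epoch, using the mean gradient.
        for (w, g) in model.weights.iter_mut().zip(&grad_w) {
            *w -= lr * g / n;
        }
        model.bias -= lr * grad_b / n;
    }
}
```

In a pointwise setup like this, each (features, label) pair is an independent sample, so candidates from different pages can simply be concatenated into one dataset.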