Supervised content-node classifier (the “hybrid” half of the system).
Content extraction is, at heart, a supervised node-classification problem: given the candidate DOM nodes of a page and labelled ground-truth article text, learn which candidate is the article body. This is far more sample-efficient and stable than asking RL to discover node selection from a sparse reward. The division of labour is therefore:
- this classifier picks which node is the content root (supervised);
- the RL policy tunes the continuous extraction params within it.
Labels come for free from the data: the candidate whose extracted text has
the highest token-F1 against the ground-truth article is the positive
example, the rest are negatives (label_from_f1).
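
The labelling rule above can be sketched in a few lines. This is a minimal illustration, not the crate's implementation: the name `label_from_f1_sketch`, its signature, and the whitespace-plus-lowercase tokenisation are all assumptions, as is the choice to emit no positive label when every candidate scores an F1 of zero.

```rust
use std::collections::HashMap;

/// Count whitespace-separated, lowercased tokens (assumed tokenisation).
fn token_counts(text: &str) -> HashMap<String, usize> {
    let mut counts = HashMap::new();
    for tok in text.split_whitespace() {
        *counts.entry(tok.to_lowercase()).or_insert(0) += 1;
    }
    counts
}

/// Bag-of-tokens F1 between a candidate's text and the ground truth.
fn token_f1(candidate: &str, truth: &str) -> f64 {
    let cand = token_counts(candidate);
    let gold = token_counts(truth);
    let cand_total: usize = cand.values().sum();
    let gold_total: usize = gold.values().sum();
    if cand_total == 0 || gold_total == 0 {
        return 0.0;
    }
    // Overlap is the multiset intersection of the two token bags.
    let overlap: usize = cand
        .iter()
        .map(|(tok, &n)| n.min(*gold.get(tok).unwrap_or(&0)))
        .sum();
    if overlap == 0 {
        return 0.0;
    }
    let precision = overlap as f64 / cand_total as f64;
    let recall = overlap as f64 / gold_total as f64;
    2.0 * precision * recall / (precision + recall)
}

/// Hypothetical stand-in for `label_from_f1`: 1.0 for the candidate with
/// the highest token F1, 0.0 for every other candidate.
fn label_from_f1_sketch(candidate_texts: &[&str], truth: &str) -> Vec<f32> {
    let scores: Vec<f64> = candidate_texts
        .iter()
        .map(|text| token_f1(text, truth))
        .collect();
    let mut labels = vec![0.0_f32; scores.len()];
    // Assumption: the first maximum wins ties, and a page where nothing
    // overlaps the ground truth gets no positive label at all.
    let mut best: Option<usize> = None;
    for (i, &score) in scores.iter().enumerate() {
        if score > 0.0 && best.map_or(true, |b| score > scores[b]) {
            best = Some(i);
        }
    }
    if let Some(i) = best {
        labels[i] = 1.0;
    }
    labels
}
```

Given a navigation block, the full article text, and a one-word fragment as candidates, only the article candidate would receive the positive label under this sketch.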
Structs
- HybridExtraction - Result of a hybrid extraction.
- HybridExtractor - End-to-end hybrid extractor: the classifier (supervised) picks the content node, then the RL-tuned `ExtractionParams` drive block-level extraction within it. When no trained classifier is supplied it falls back to the Readability-style `NodeFeatures::heuristic_content_score`, so it is useful even before any training has happened.
- NodeClassifier - A small MLP that maps `NodeFeatures` to a content-probability.
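
The node-selection step with its heuristic fallback might look roughly like the sketch below. Everything in it is hypothetical: `Features` is a three-field stand-in for `NodeFeatures` (whose real fields are not shown here), `heuristic_score` is an invented formula in the spirit of a Readability-style score, and `TinyMlp` is an assumed one-hidden-layer network rather than the actual `NodeClassifier` architecture.

```rust
/// Hypothetical, reduced stand-in for `NodeFeatures`.
#[derive(Clone, Copy)]
struct Features {
    text_density: f32,
    link_density: f32,
    paragraph_count: f32,
}

impl Features {
    fn as_array(&self) -> [f32; 3] {
        [self.text_density, self.link_density, self.paragraph_count]
    }

    /// Assumed fallback score: reward text and paragraphs, penalise links.
    fn heuristic_score(&self) -> f32 {
        (self.text_density + 0.1 * self.paragraph_count) * (1.0 - self.link_density)
    }
}

/// Assumed architecture: one ReLU hidden layer and a sigmoid output.
struct TinyMlp {
    w1: Vec<[f32; 3]>, // one weight row per hidden unit
    b1: Vec<f32>,      // one bias per hidden unit
    w2: Vec<f32>,      // hidden-to-output weights
    b2: f32,           // output bias
}

impl TinyMlp {
    /// Map a feature vector to a probability in (0, 1).
    fn predict(&self, f: &Features) -> f32 {
        let x = f.as_array();
        let mut logit = self.b2;
        for ((row, bias), out_w) in self.w1.iter().zip(&self.b1).zip(&self.w2) {
            let pre: f32 = row.iter().zip(&x).map(|(w, v)| w * v).sum::<f32>() + bias;
            logit += out_w * pre.max(0.0); // ReLU, then weighted sum
        }
        1.0 / (1.0 + (-logit).exp())
    }
}

/// Return the index of the highest-scoring candidate, using the classifier
/// when one is supplied and the heuristic score otherwise.
fn pick_content_node(candidates: &[Features], classifier: Option<&TinyMlp>) -> Option<usize> {
    let score = |f: &Features| match classifier {
        Some(c) => c.predict(f),
        None => f.heuristic_score(),
    };
    let mut best: Option<(usize, f32)> = None;
    for (i, f) in candidates.iter().enumerate() {
        let s = score(f);
        if best.map_or(true, |(_, b)| s > b) {
            best = Some((i, s));
        }
    }
    best.map(|(i, _)| i)
}
```

Passing `None` as the classifier exercises the untrained path, so the same call site works before and after training; an empty candidate list yields `None` rather than a panic.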
Functions
- build_classifier_dataset - Build a pointwise training set for the classifier from labelled samples.
- label_from_f1 - Derive the supervised label vector for one page: 1.0 for the candidate whose extracted text best matches the ground truth (token F1), 0.0 for the rest.
- train_classifier - Train a `NodeClassifier` on labelled samples for `epochs` full-batch steps.
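
To illustrate what a pointwise dataset and full-batch training mean here, the sketch below flattens labelled pages into independent (features, label) rows and runs one gradient step over the whole set per epoch. It deliberately swaps the MLP for a plain logistic model to stay short; `LabelledPage`, `build_pointwise_dataset`, `train_full_batch`, the two-element feature vector, and the learning-rate parameter are all assumptions made for the example, not the crate's API.

```rust
/// Hypothetical labelled page: per-candidate features plus 0/1 labels.
struct LabelledPage {
    features: Vec<[f32; 2]>,
    labels: Vec<f32>,
}

/// Pointwise dataset: every candidate becomes an independent row, so the
/// page grouping is discarded.
fn build_pointwise_dataset(pages: &[LabelledPage]) -> Vec<([f32; 2], f32)> {
    pages
        .iter()
        .flat_map(|p| p.features.iter().copied().zip(p.labels.iter().copied()))
        .collect()
}

/// Logistic model used here as a simplified stand-in for the MLP.
struct LogisticModel {
    w: [f32; 2],
    b: f32,
}

impl LogisticModel {
    fn predict(&self, x: &[f32; 2]) -> f32 {
        let z = self.w[0] * x[0] + self.w[1] * x[1] + self.b;
        1.0 / (1.0 + (-z).exp())
    }
}

/// Each epoch is exactly one gradient step on the mean binary
/// cross-entropy over the full dataset (no mini-batches).
fn train_full_batch(data: &[([f32; 2], f32)], epochs: usize, lr: f32) -> LogisticModel {
    let mut model = LogisticModel { w: [0.0; 2], b: 0.0 };
    if data.is_empty() {
        return model;
    }
    let n = data.len() as f32;
    for _ in 0..epochs {
        let mut grad_w = [0.0_f32; 2];
        let mut grad_b = 0.0_f32;
        for (x, y) in data {
            // For a sigmoid output, d(BCE)/d(logit) = prediction - label.
            let err = model.predict(x) - y;
            grad_w[0] += err * x[0];
            grad_w[1] += err * x[1];
            grad_b += err;
        }
        model.w[0] -= lr * grad_w[0] / n;
        model.w[1] -= lr * grad_w[1] / n;
        model.b -= lr * grad_b / n;
    }
    model
}
```

Because each page contributes one positive and several negatives, a pointwise set built this way is imbalanced towards the negative class; the sketch makes no attempt to reweight for that.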