Module node_features

Real, content-aware features for candidate DOM nodes.

These features replace the placeholder constant state vector that previously left the RL agent blind to the document. Every feature is derived from the actual DOM subtree of a candidate node, so two different candidates produce two different feature vectors, which is a precondition for the agent to learn anything.

The same features power the supervised node classifier (hybrid mode), so the representation is defined in one place and shared by both.

Structs§

CandidateContent
Self-contained, owned snapshot of a candidate node’s extractable text.
ExtractionParams
Continuous extraction parameters that the RL policy tunes. These parameters determine which text blocks are kept, so the policy’s continuous head has a real effect on the extracted text (and therefore on the reward).
NodeFeatures
Structural / textual features for a single candidate node.
TextBlock
A single block of text (one <p>) with the stats needed to filter it.
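
The interaction between ExtractionParams and TextBlock described above can be pictured with a minimal sketch. This page does not show the real field names, so everything below is an assumption made for illustration: `SketchParams`, `SketchBlock`, `min_block_chars`, `max_link_density`, and `keep_blocks` are hypothetical stand-ins, not the crate's actual API.

```rust
/// Hypothetical stand-in for `ExtractionParams`; the real fields are not
/// shown on this page, so both thresholds here are assumptions.
#[derive(Debug, Clone, Copy)]
struct SketchParams {
    /// Blocks shorter than this many characters are dropped.
    min_block_chars: usize,
    /// Blocks whose link density exceeds this value are dropped.
    max_link_density: f32,
}

/// Hypothetical stand-in for `TextBlock`: the text of one paragraph plus
/// the one statistic this sketch needs to filter it.
#[derive(Debug, Clone)]
struct SketchBlock {
    text: String,
    /// Number of characters of `text` that sit inside links.
    link_chars: usize,
}

impl SketchBlock {
    fn char_count(&self) -> usize {
        self.text.chars().count()
    }

    /// Linked characters divided by total characters; 0.0 for an empty block.
    fn link_density(&self) -> f32 {
        let total = self.char_count();
        if total == 0 {
            0.0
        } else {
            self.link_chars as f32 / total as f32
        }
    }
}

/// Keep only the blocks that pass both thresholds. Changing the params
/// changes which blocks survive, and therefore the extracted text.
fn keep_blocks(blocks: &[SketchBlock], params: SketchParams) -> Vec<&SketchBlock> {
    blocks
        .iter()
        .filter(|b| {
            b.char_count() >= params.min_block_chars
                && b.link_density() <= params.max_link_density
        })
        .collect()
}
```

Under these assumptions, a strict setting drops short or link-heavy blocks such as navigation text, while a permissive setting keeps every block. This is the sense in which a continuous policy output can change the extracted text and hence the reward.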

Functions§

extract_features
Compute the full feature set for a candidate node.
extract_node_text
Extract article text from a node, honoring the policy’s extraction params.
link_density
Fraction of characters inside <a> descendants relative to total text.
node_content
Build the owned CandidateContent snapshot for a node.
node_text
Concatenate the text content of an element, collapsing whitespace.
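
The two text helpers listed above, node_text and link_density, can be illustrated without a DOM library. The sketch below works on plain strings rather than DOM nodes, so the function names and signatures (`collapse_whitespace`, `link_density_from_text`) are hypothetical and only mirror the descriptions on this page. Clamping the ratio to 1.0 is likewise an assumption of the sketch.

```rust
/// Collapse every run of whitespace (spaces, tabs, newlines) into a single
/// space and trim both ends. This mirrors the "collapsing whitespace"
/// behaviour described for `node_text`, applied to a plain string.
fn collapse_whitespace(raw: &str) -> String {
    raw.split_whitespace().collect::<Vec<_>>().join(" ")
}

/// Fraction of characters that belong to link text, relative to the total
/// text. `anchor_texts` holds the text of each `<a>` descendant and
/// `all_text` the full text of the node; both are assumed to have been
/// gathered by the caller.
fn link_density_from_text(anchor_texts: &[&str], all_text: &str) -> f32 {
    let total = collapse_whitespace(all_text).chars().count();
    if total == 0 {
        // An empty node has no text to be linked; report zero density.
        return 0.0;
    }
    let linked: usize = anchor_texts
        .iter()
        .map(|t| collapse_whitespace(t).chars().count())
        .sum();
    // Clamp in case the anchor texts overlap or were counted twice.
    (linked as f32 / total as f32).min(1.0)
}
```

A node whose text is mostly links (a navigation bar, for example) scores close to 1.0, while body prose with an occasional link scores close to 0.0, which is what makes this ratio useful as a feature.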