Module node_features

Real, content-aware features for candidate DOM nodes.

These replace the placeholder constant state vector, which previously left the RL agent blind to the document. Every feature is derived from the actual DOM subtree of a candidate node, so two different candidates produce two different feature vectors. Without that, the agent has nothing to learn from.

The same features power the supervised node classifier (hybrid mode), so the representation is defined once, in this module, and shared by both.

Structs§

CandidateContent: Self-contained, owned snapshot of a candidate node’s extractable text.
ExtractionParams: Continuous extraction parameters tuned by the RL policy. They determine which text blocks are kept, so the policy’s continuous head changes the extracted text and therefore the reward.
NodeFeatures: Structural / textual features for a single candidate node.
TextBlock: A single block of text (one <p>) with the stats needed to filter it.
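
As a rough illustration of how ExtractionParams can change which TextBlock values survive, here is a minimal sketch. The field names (`min_block_chars`, `max_link_density`, `text`, `link_density`) and the helpers `keep_block` and `extract_text` are hypothetical and chosen for this example only; the real structs may carry different fields and use a different filtering rule.

```rust
/// Hypothetical shape of a text block: its text plus one filter statistic.
/// The real `TextBlock` may store different stats.
#[derive(Debug, Clone, PartialEq)]
pub struct TextBlock {
    pub text: String,
    pub link_density: f64,
}

/// Hypothetical continuous parameters. These field names are assumptions
/// made for this sketch, not the crate's actual fields.
#[derive(Debug, Clone, Copy)]
pub struct ExtractionParams {
    pub min_block_chars: usize,
    pub max_link_density: f64,
}

/// Keep a block only if it is long enough and not dominated by link text.
pub fn keep_block(block: &TextBlock, params: &ExtractionParams) -> bool {
    block.text.chars().count() >= params.min_block_chars
        && block.link_density <= params.max_link_density
}

/// Join the surviving blocks, separated by blank lines.
pub fn extract_text(blocks: &[TextBlock], params: &ExtractionParams) -> String {
    blocks
        .iter()
        .filter(|b| keep_block(b, params))
        .map(|b| b.text.as_str())
        .collect::<Vec<_>>()
        .join("\n\n")
}
```

Under these assumptions, two different parameter settings yield two different extracted texts for the same blocks, which is what lets the reward respond to the policy's continuous output.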

Functions§

extract_features: Compute the full feature set for a candidate node.
extract_node_text: Extract article text from a node, honoring the policy’s extraction params.
link_density: Fraction of a node’s text characters that fall inside <a> descendants.
node_content: Build the owned CandidateContent snapshot for a node.
node_text: Concatenate the text content of an element, collapsing whitespace.
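
The descriptions of node_text and link_density can be sketched on a toy tree. The `Node` enum and the `text` and `el` constructors below are hypothetical stand-ins for whatever DOM type the module actually operates on, and the signatures are assumptions; only the behavior described above (whitespace collapsing, link characters over total characters) is taken from the listing.

```rust
/// Toy stand-in for a DOM node (hypothetical; not the crate's real DOM type).
#[derive(Debug, Clone)]
pub enum Node {
    Text(String),
    Element { tag: String, children: Vec<Node> },
}

/// Convenience constructor for a text node.
pub fn text(s: &str) -> Node {
    Node::Text(s.to_string())
}

/// Convenience constructor for an element node.
pub fn el(tag: &str, children: Vec<Node>) -> Node {
    Node::Element { tag: tag.to_string(), children }
}

/// Append the raw text of `node` and all its descendants to `out`.
fn collect_text(node: &Node, out: &mut String) {
    match node {
        Node::Text(t) => out.push_str(t),
        Node::Element { children, .. } => {
            for child in children {
                collect_text(child, out);
            }
        }
    }
}

/// Concatenate the text content of a node, collapsing runs of whitespace
/// into single spaces and trimming both ends.
pub fn node_text(node: &Node) -> String {
    let mut raw = String::new();
    collect_text(node, &mut raw);
    raw.split_whitespace().collect::<Vec<_>>().join(" ")
}

/// Count the characters of collapsed text inside `<a>` descendants.
fn link_chars(node: &Node) -> usize {
    match node {
        Node::Text(_) => 0,
        Node::Element { tag, children } => {
            if tag.eq_ignore_ascii_case("a") {
                node_text(node).chars().count()
            } else {
                children.iter().map(link_chars).sum()
            }
        }
    }
}

/// Fraction of text characters inside `<a>` descendants.
/// Returns 0.0 for a node with no text, to avoid dividing by zero.
pub fn link_density(node: &Node) -> f64 {
    let total = node_text(node).chars().count();
    if total == 0 {
        return 0.0;
    }
    link_chars(node) as f64 / total as f64
}
```

In this sketch a navigation bar made almost entirely of links scores close to 1.0, while an article paragraph with one inline link scores close to 0.0, which is why the ratio is a useful signal for separating boilerplate from content.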
