Real, content-aware features for candidate DOM nodes.
These replace the placeholder constant state vector that previously made the RL agent blind to the document. Every feature is derived from the actual DOM subtree of a candidate node, so different candidates produce different feature vectors, which is a precondition for the agent to learn anything.
The same features power the supervised node classifier (hybrid mode), so the representation is defined in one place and shared by both.
Structs§
- CandidateContent - Self-contained, owned snapshot of a candidate node’s extractable text.
- ExtractionParams - Continuous extraction parameters that the RL policy tunes. They actually affect which text blocks are kept, so the policy’s continuous head has a real effect on the extracted text (and therefore on the reward).
- NodeFeatures - Structural / textual features for a single candidate node.
- TextBlock - A single block of text (one `<p>`) with the stats needed to filter it.
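To make the link between `ExtractionParams` and the kept text concrete, here is a minimal sketch of parameter-driven block filtering. The field names (`min_block_chars`, `max_link_density`, `link_density`) and the helper `filter_blocks` are assumptions for illustration only; this page does not list the actual fields, and the real structs may differ.

```rust
/// Hypothetical stand-in for the crate's `TextBlock`; the fields are assumed.
#[derive(Debug, Clone, PartialEq)]
pub struct TextBlock {
    /// Text of one `<p>` element.
    pub text: String,
    /// Share of the block's characters that sit inside links, in [0, 1].
    pub link_density: f32,
}

/// Hypothetical stand-in for `ExtractionParams`; the fields are assumed.
#[derive(Debug, Clone, Copy)]
pub struct ExtractionParams {
    /// Blocks shorter than this many characters are dropped.
    pub min_block_chars: usize,
    /// Blocks with a higher link density than this are dropped.
    pub max_link_density: f32,
}

/// Keep only the blocks that pass both thresholds and join them with blank lines.
/// (Illustrative helper, not part of the documented API.)
pub fn filter_blocks(blocks: &[TextBlock], params: ExtractionParams) -> String {
    blocks
        .iter()
        .filter(|b| b.text.chars().count() >= params.min_block_chars)
        .filter(|b| b.link_density <= params.max_link_density)
        .map(|b| b.text.as_str())
        .collect::<Vec<_>>()
        .join("\n\n")
}
```

Under these assumptions, changing either threshold changes which blocks survive, which is the sense in which the policy's continuous output can influence the extracted text and the reward.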
Functions§
- extract_features - Compute the full feature set for a candidate node.
- extract_node_text - Extract article text from a node, honoring the policy’s extraction params.
- link_density - Fraction of characters inside `<a>` descendants relative to total text.
- node_content - Build the owned `CandidateContent` snapshot for a node.
- node_text - Concatenate the text content of an element, collapsing whitespace.
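The descriptions of `node_text` and `link_density` can be sketched on a toy tree. The `Node` enum below is a hypothetical stand-in for whatever DOM type the module actually uses, and the signatures are assumed since this page does not show them. Counting only non-whitespace characters is also an assumption, made so that indentation does not skew the ratio.

```rust
/// Toy DOM used only for this sketch; not the crate's real node type.
pub enum Node {
    Text(String),
    Element { tag: String, children: Vec<Node> },
}

/// Append the raw text of every text node in the subtree, in document order.
fn collect_text(node: &Node, out: &mut String) {
    match node {
        Node::Text(t) => out.push_str(t),
        Node::Element { children, .. } => {
            for child in children {
                collect_text(child, out);
            }
        }
    }
}

/// Concatenate the subtree's text and collapse each whitespace run to one space.
pub fn node_text(node: &Node) -> String {
    let mut raw = String::new();
    collect_text(node, &mut raw);
    raw.split_whitespace().collect::<Vec<_>>().join(" ")
}

/// Count non-whitespace characters overall and those under an `<a>` ancestor.
fn count_chars(node: &Node, inside_link: bool, total: &mut usize, linked: &mut usize) {
    match node {
        Node::Text(t) => {
            let n = t.chars().filter(|c| !c.is_whitespace()).count();
            *total += n;
            if inside_link {
                *linked += n;
            }
        }
        Node::Element { tag, children } => {
            let inside = inside_link || tag.eq_ignore_ascii_case("a");
            for child in children {
                count_chars(child, inside, total, linked);
            }
        }
    }
}

/// Fraction of characters inside `<a>` descendants; 0.0 for a subtree with no text.
pub fn link_density(node: &Node) -> f32 {
    let (mut total, mut linked) = (0usize, 0usize);
    count_chars(node, false, &mut total, &mut linked);
    if total == 0 {
        0.0
    } else {
        linked as f32 / total as f32
    }
}
```

Returning 0.0 for an empty subtree avoids a division by zero; whether the real function makes the same choice is not stated here.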