Content Extractor RL
A high-performance Rust library for extracting article content from HTML pages. It combines a supervised content-node classifier (the "hybrid" selector) with Deep Reinforcement Learning (Dueling DQN with prioritized experience replay; experimental PPO/SAC) that tunes extraction parameters, scored by token F1 against ground-truth article text. A zero-dependency heuristic baseline fallback, site-specific profile memory, and curriculum learning round it out.
Features
- DQN-based extraction — Dueling DQN with prioritized experience replay navigates the DOM tree to select the best content node, observing real per-candidate DOM features (word/link density, tag type, depth, Readability-style class/id signals)
- Hybrid node classification — a supervised content-node classifier selects the article node (labels derived automatically from ground-truth text), while RL tunes the continuous extraction parameters; falls back to a Readability-style heuristic when untrained
- Ground-truth reward — training scores extractions by token F1 against the labelled article text, not a self-referential proxy
- Baseline fallback — stopword-density heuristic runs with zero dependencies on a trained model
- Site profile memory — per-domain XPath patterns learned and reused across sessions
- Curriculum learning — training progresses from simple to complex HTML layouts automatically
- Hyperparameter optimization — grid search and Tree-structured Parzen Estimator (TPE) Bayesian optimization
- Multiple RL algorithms — DuelingDQN (production-ready), PPO and SAC (experimental)
- CUDA acceleration — optional GPU support via the cuda feature flag
- SafeTensors + ONNX serialization — trained models saved in portable formats
- MLflow integration — optional experiment tracking via the mlflow-rs feature
- Python bindings — PyO3-based bindings for Python consumers (content-extractor-rl-py)
- CLI tool — full-featured content-extractor-rl binary for training and extraction
Table of Contents
- Installation
- Quick Start
- Architecture
- CLI Tool
- Training Custom Models
- Pre-trained Models
- API Reference
- Feature Flags
- Performance Notes
- Contributing
- License
Installation
Add to your Cargo.toml:
[dependencies]
content-extractor-rl = "0.1"
With CUDA support:
[dependencies]
content-extractor-rl = { version = "0.1", features = ["cuda"] }
With MLflow experiment tracking:
[dependencies]
content-extractor-rl = { version = "0.1", features = ["mlflow-rs"] }
System Requirements
| Requirement | Version |
|---|---|
| Rust | 1.74+ |
| CUDA (optional) | 11.8+ (for cuda feature) |
| Python (optional) | 3.8+ (for Python bindings) |
On Ubuntu/Debian, install HTML parsing dependencies:
Quick Start
Baseline extraction (no trained model required)
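As a rough illustration of what a stopword-density heuristic can look like, the sketch below scores candidate text blocks by how many common function words they contain. It is not the crate's implementation: `stopword_stats`, `pick_content_block`, the `min_density` threshold and the caller-supplied stopword list are all hypothetical names chosen for this example.

```rust
use std::collections::HashSet;

/// Returns (stopword hits, total words) for a block of text.
/// Hypothetical helper, not part of the crate's API.
fn stopword_stats(text: &str, stopwords: &HashSet<&str>) -> (usize, usize) {
    let mut hits = 0;
    let mut total = 0;
    for raw in text.split_whitespace() {
        // Strip surrounding punctuation and normalise case before lookup.
        let word = raw
            .trim_matches(|c: char| !c.is_alphanumeric())
            .to_lowercase();
        if word.is_empty() {
            continue;
        }
        total += 1;
        if stopwords.contains(word.as_str()) {
            hits += 1;
        }
    }
    (hits, total)
}

/// Picks the block with the most stopwords among blocks whose stopword
/// density is at least `min_density` (an assumed threshold).
fn pick_content_block<'a>(
    blocks: &[&'a str],
    stopwords: &HashSet<&str>,
    min_density: f64,
) -> Option<&'a str> {
    blocks
        .iter()
        .copied()
        .filter_map(|block| {
            let (hits, total) = stopword_stats(block, stopwords);
            if total == 0 {
                return None;
            }
            let density = hits as f64 / total as f64;
            (density >= min_density).then_some((block, hits))
        })
        .max_by_key(|&(_, hits)| hits)
        .map(|(block, _)| block)
}
```

Running prose is full of words like "the" and "and", while navigation labels and buttons contain almost none, which is why a simple count can separate the two without a trained model.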
Hybrid extraction (no trained model required)
extract_article is the one-call entry point. With agent = None it uses the
supervised/heuristic hybrid node selector (Readability-style features) plus
block-level filtering. This is strictly better than the raw baseline and
needs no model.
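To make the block-level filtering step concrete, here is a minimal sketch under stated assumptions. The threshold names `min_block_words` and `max_block_link_density` are the parameters this README mentions; the `Block` and `BlockFilter` types, and the idea of counting words inside links per block, are hypothetical and only illustrate the general technique.

```rust
/// A text block plus the count needed for link density (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
struct Block {
    text: String,
    /// Number of words that sit inside <a> elements in this block.
    link_words: usize,
}

/// Thresholds for block filtering; the values are supplied by the caller.
struct BlockFilter {
    min_block_words: usize,
    max_block_link_density: f64,
}

impl BlockFilter {
    /// Keeps a block only if it is long enough and not dominated by links.
    fn keep(&self, block: &Block) -> bool {
        let words = block.text.split_whitespace().count();
        if words == 0 || words < self.min_block_words {
            return false;
        }
        let link_density = block.link_words as f64 / words as f64;
        link_density <= self.max_block_link_density
    }

    /// Joins the surviving blocks into the extracted text.
    fn apply(&self, blocks: &[Block]) -> String {
        blocks
            .iter()
            .filter(|b| self.keep(b))
            .map(|b| b.text.as_str())
            .collect::<Vec<_>>()
            .join("\n\n")
    }
}
```

With `min_block_words = 5` and `max_block_link_density = 0.3`, a one-word "Share" button and a fully linked "related stories" teaser would both be dropped, while an ordinary sentence with a single inline link would be kept.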
RL-based extraction with a trained model
Pass a loaded agent to extract_article; it runs the agent greedily through the
extraction environment and returns the best result. AgentFactory::load
auto-detects the algorithm (DQN/PPO/SAC) from the model file.
Train a model on your own data
Training rewards extractions by token F1 against the ground-truth article
text. Provide it via TrainingSample::with_ground_truth; if you only have
(html, url) pairs, TrainingSample::from((html, url)) trains against a
self-supervised quality proxy instead.
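For readers unfamiliar with token F1, the following standalone sketch shows one common way to compute it: the harmonic mean of token precision and recall, with overlap counted over token multisets. It is an illustration only and may differ from the crate's own TextUtils::token_f1 in tokenisation and edge cases; the free functions `tokenize` and `token_f1` here are written for this example.

```rust
use std::collections::HashMap;

/// Lowercased alphanumeric tokens (an assumed tokenisation).
fn tokenize(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(|t| t.to_lowercase())
        .collect()
}

/// Token-level F1 between an extraction and the reference text.
fn token_f1(predicted: &str, reference: &str) -> f64 {
    let pred = tokenize(predicted);
    let gold = tokenize(reference);
    if pred.is_empty() || gold.is_empty() {
        // Two empty texts match perfectly; one empty text matches nothing.
        return if pred.is_empty() && gold.is_empty() { 1.0 } else { 0.0 };
    }

    // Count reference tokens so repeated words are matched at most
    // as many times as they occur in the reference.
    let mut gold_counts: HashMap<&str, usize> = HashMap::new();
    for t in &gold {
        *gold_counts.entry(t.as_str()).or_insert(0) += 1;
    }
    let mut overlap = 0usize;
    for t in &pred {
        if let Some(count) = gold_counts.get_mut(t.as_str()) {
            if *count > 0 {
                *count -= 1;
                overlap += 1;
            }
        }
    }
    if overlap == 0 {
        return 0.0;
    }

    let precision = overlap as f64 / pred.len() as f64;
    let recall = overlap as f64 / gold.len() as f64;
    2.0 * precision * recall / (precision + recall)
}
```

An extraction that contains the whole article plus the same amount of boilerplate has recall 1.0 but precision 0.5, so its F1 is about 0.67. This is what makes the metric a useful reward: it penalises both missing text and extra text.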
Train & use the hybrid node classifier
The supervised classifier learns which DOM node is the article body (labels are derived automatically from ground-truth text). Train it, save it, then load it for extraction.
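The automatic labelling can be pictured as follows: score every candidate node's text against the ground truth and mark the best match as the positive example. The sketch below is an assumption-laden illustration, not the crate's code. It uses Jaccard overlap of token sets as a stand-in match score, and `token_set`, `jaccard` and `derive_labels` are hypothetical names.

```rust
use std::collections::HashSet;

/// Lowercased alphanumeric tokens as a set (assumed tokenisation).
fn token_set(text: &str) -> HashSet<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(|t| t.to_lowercase())
        .collect()
}

/// Jaccard overlap of two texts' token sets; a stand-in match score.
fn jaccard(a: &str, b: &str) -> f64 {
    let (a, b) = (token_set(a), token_set(b));
    let union = a.union(&b).count();
    if union == 0 {
        return 0.0;
    }
    a.intersection(&b).count() as f64 / union as f64
}

/// One label per candidate: 1.0 for the best-matching candidate,
/// 0.0 for the rest. If nothing overlaps, every label is 0.0.
fn derive_labels(candidates: &[&str], ground_truth: &str) -> Vec<f32> {
    let scores: Vec<f64> = candidates
        .iter()
        .map(|c| jaccard(c, ground_truth))
        .collect();
    let best = scores
        .iter()
        .enumerate()
        .filter(|(_, s)| **s > 0.0)
        .max_by(|a, b| a.1.total_cmp(b.1))
        .map(|(i, _)| i);
    (0..candidates.len())
        .map(|i| if Some(i) == best { 1.0 } else { 0.0 })
        .collect()
}
```

Because the labels come from the ground-truth text alone, no one has to annotate DOM nodes by hand.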
Architecture
content-extractor-rl (workspace)
├── crates/content-extractor-rl ← Rust library (this crate)
│ ├── src/
│ │ ├── lib.rs ← Public API & re-exports
│ │ ├── config.rs ← Configuration & env vars
│ │ ├── baseline_extractor.rs ← Heuristic extraction
│ │ ├── html_parser.rs ← DOM traversal, candidate extraction
│ │ ├── text_utils.rs ← Tokenisation, quality metrics, token F1
│ │ ├── node_features.rs ← Real per-candidate DOM features + extraction params
│ │ ├── node_classifier.rs ← Supervised content-node classifier + HybridExtractor
│ │ ├── environment.rs ← RL environment (real state/action/reward MDP)
│ │ ├── replay_buffer.rs ← Prioritised experience replay
│ │ ├── reward.rs ← Legacy multi-component reward calculator
│ │ ├── curriculum.rs ← Curriculum learning manager
│ │ ├── models.rs ← Dueling DQN network (Candle)
│ │ ├── agents/
│ │ │ ├── mod.rs ← RLAgent trait & AgentFactory
│ │ │ ├── dqn_agent.rs ← Dueling DQN (production-ready)
│ │ │ ├── ppo_agent.rs ← PPO actor-critic (experimental)
│ │ │ └── sac_agent.rs ← SAC twin-Q (experimental)
│ │ ├── training.rs ← Training loops
│ │ ├── hyperparameter.rs ← Grid search
│ │ ├── hyperparameter_tuner.rs ← TPE Bayesian optimisation
│ │ ├── site_profile.rs ← Per-domain pattern memory
│ │ ├── checkpoint.rs ← Save/resume checkpoints
│ │ ├── evaluation/ ← Ground-truth & algorithm comparison
│ │ └── plotting.rs ← Training visualisation
│ └── tests/ ← Integration tests
├── crates/content-extractor-rl-cli ← CLI binary
└── crates/content-extractor-rl-py ← Python bindings (PyO3/Maturin)
RL Environment
| Detail | Value |
|---|---|
| State space | 300-dimensional float vector — real per-candidate DOM features (word/link density, stopword ratio, tag type, depth, Readability class/id signals) + global document + selection state |
| Action space | 16 discrete actions (select candidate 0-9, navigate parent/siblings, expand/contract, terminate) + 6 continuous parameters (block filtering) |
| Reward | Token F1 of the extracted text vs. the ground-truth article (falls back to a text-quality proxy when no ground truth is supplied) |
| Episode length | Up to max_steps_per_episode (default 20) |
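One way to picture the 16 discrete actions is as an index that decodes into an enum. The sketch below assumes that "siblings" means two actions (previous and next), which makes the count come out to 16, and it assumes an index order. Neither assumption is confirmed by this README, and the `Action` enum and `decode_action` function are hypothetical.

```rust
/// Hypothetical view of the discrete action space.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Action {
    /// Select one of the ten ranked candidates (0..=9).
    SelectCandidate(u8),
    Parent,
    PrevSibling,
    NextSibling,
    Expand,
    Contract,
    Terminate,
}

/// Total number of discrete actions stated in the table above.
const NUM_ACTIONS: usize = 16;

/// Maps a network output index to an action.
/// The ordering used here is an assumption made for illustration.
fn decode_action(index: usize) -> Option<Action> {
    match index {
        0..=9 => Some(Action::SelectCandidate(index as u8)),
        10 => Some(Action::Parent),
        11 => Some(Action::PrevSibling),
        12 => Some(Action::NextSibling),
        13 => Some(Action::Expand),
        14 => Some(Action::Contract),
        15 => Some(Action::Terminate),
        _ => None,
    }
}
```

Returning `Option` keeps an out-of-range index from silently turning into a valid action.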
Neural Network
The Dueling DQN network architecture:
Input (300) → FC(512) → LN → ReLU → FC(256) → LN → ReLU → FC(128) → LN → ReLU
│
┌────────────────────────────────────────────┤
│ │
Value stream Advantage stream
FC(64) → FC(1) FC(64) → FC(16)
│ │
└──────── Q(s,a) = V(s) + A(s,a) - mean(A) ┘
│
Continuous params
FC(128) → FC(6) → tanh
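The aggregation step in the diagram, Q(s,a) = V(s) + A(s,a) - mean(A), can be written out on plain slices. This is a sketch of the formula only. The real network runs on Candle tensors, and the function names `dueling_q_values` and `greedy_action` are made up for this example.

```rust
/// Combines the value stream output with the advantage stream outputs:
/// Q(s, a) = V(s) + A(s, a) - mean(A).
fn dueling_q_values(value: f32, advantages: &[f32]) -> Vec<f32> {
    if advantages.is_empty() {
        return Vec::new();
    }
    let mean = advantages.iter().sum::<f32>() / advantages.len() as f32;
    advantages.iter().map(|a| value + a - mean).collect()
}

/// Index of the largest Q-value, as used for greedy action selection.
fn greedy_action(q_values: &[f32]) -> Option<usize> {
    q_values
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.total_cmp(b.1))
        .map(|(i, _)| i)
}
```

Subtracting the mean advantage makes the decomposition identifiable: the Q-values average to V(s), so the value stream learns how good the state is and the advantage stream only learns how actions compare with each other.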
CLI Tool
The content-extractor-rl-cli crate installs as the content-extractor-rl binary.
Commands
Extract a single article
--model is optional. Without it, extraction uses the hybrid heuristic node
selector (no model required); with it, the trained RL agent drives selection.
# Hybrid heuristic (no model)
# With a trained RL model (DQN/PPO/SAC auto-detected)
# With a trained node classifier (hybrid selector)
Batch extract from a directory
Train a model
# Standard training (DQN, 5000 episodes)
# Improved training with curriculum learning
# Auto-hyperparameter search before training
Train the node classifier (hybrid selector)
Trains the supervised content-node classifier from HTML + paired ground-truth
JSON, and writes a .safetensors file usable via extract --classifier:
Hyperparameter tuning (TPE Bayesian optimisation)
# Resume an interrupted tuning run
Evaluate extraction quality against ground truth
Compare multiple algorithms
Training Custom Models
Preparing training data
Collect raw HTML pages from the websites you care about. Place them in a flat directory — the filename should contain the domain for site-profile tracking:
training_data/
├── reuters_com_article_001.html
├── reuters_com_article_002.html
├── bbc_co_uk_article_001.html
├── techcrunch_com_post_001.html
└── ...
Recommended minimum: 100 HTML files per domain, 500+ total.
Training from Rust code
Training for specific news/article websites
To customise the model for specific websites, the key levers are:
- Site profiles — the library automatically builds per-domain XPath profiles as it trains. After training, save the site profile directory:
# site_profiles/ now contains per-domain learned patterns
- Ground-truth reward — when a sample carries ground-truth text, the reward is the token F1 of the extraction against it (TextUtils::token_f1), so the agent is optimised directly toward the labelled article. Supply it via TrainingSample::with_ground_truth (the CLI reads it from the JSON text field automatically). The continuous action params (min_block_words, max_block_link_density) control block-level filtering and are tuned by the policy.
- Curriculum difficulty — for sites with complex layouts (heavy JavaScript-rendered content, infinite scroll), start with simpler pages. The CurriculumManager handles this automatically when you use train_with_improvements.
- Pre-training workflow for a set of target sites:
# Step 1: Tune hyperparameters on a representative sample
# Step 2: Train with the best hyperparameters
# Step 3: Verify quality
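The curriculum idea from the list above (simple pages first, complex ones later) can be sketched in a few lines. This is not how CurriculumManager is implemented. The difficulty proxy (number of opening tags), the linear unlocking schedule, and the names `difficulty`, `Curriculum`, `start_fraction` and `available` are all assumptions made for illustration.

```rust
/// Rough difficulty proxy: the number of opening tags in the page.
/// This measure is an assumption, not the crate's definition.
fn difficulty(html: &str) -> usize {
    html.match_indices('<')
        .filter(|(i, _)| html[*i + 1..].starts_with(|c: char| c.is_ascii_alphabetic()))
        .count()
}

/// Training pages sorted from easiest to hardest (hypothetical type).
struct Curriculum {
    samples: Vec<String>,
    /// Fraction of the samples available at the start of training.
    start_fraction: f64,
}

impl Curriculum {
    fn new(mut samples: Vec<String>, start_fraction: f64) -> Self {
        samples.sort_by_key(|s| difficulty(s));
        Self {
            samples,
            start_fraction: start_fraction.clamp(0.0, 1.0),
        }
    }

    /// Samples unlocked at `progress` in [0, 1]; the pool grows linearly
    /// from `start_fraction` of the data to all of it.
    fn available(&self, progress: f64) -> &[String] {
        let p = progress.clamp(0.0, 1.0);
        let fraction = self.start_fraction + (1.0 - self.start_fraction) * p;
        let n = (self.samples.len() as f64 * fraction).ceil() as usize;
        &self.samples[..n.min(self.samples.len())]
    }
}
```

A training loop would call `available(episode / total_episodes)` and sample pages from the returned slice, so early episodes only see the simplest layouts.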
Training data format for ground-truth evaluation
To use evaluate and measure accuracy against known-good extractions, provide JSON files alongside HTML:
Pre-trained Models
Three models are included in the models/ directory of this repository, each trained for 10,000 episodes on a corpus of 15,000 HTML pages from diverse news and article domains.
Available ONNX Models
| File | Algorithm | Episodes | Best Quality | File Size | Trained On | Notes |
|---|---|---|---|---|---|---|
| DuelingDQN.onnx | Dueling DQN | 10,000 | 0.8255 | 1.29 MB | CPU | Production-ready, stable training |
| PPO.onnx | PPO (Actor-Critic) | 10,000 | 0.8445 | 1.26 MB | GPU (CUDA) | Experimental; 36 h training run |
| SAC.onnx | SAC (Twin-Q) | 10,000 | 0.8445 | 3.51 MB | CPU | Experimental; see algorithm notes |
All three files are also available in SafeTensors format alongside best-hyperparameter JSON files for each algorithm.
Hyperparameters used for training
| Hyperparameter | DuelingDQN | PPO | SAC |
|---|---|---|---|
| learning_rate | 0.002526 | 0.008220 | 0.005867 |
| batch_size | 2048 | 512 | 8192 |
| gamma | 0.856 | 0.858 | 0.988 |
| epsilon_decay | 0.9851 | 0.9859 | 0.9959 |
| hidden_layers | [512, 512, 256, 128] | [512, 512, 256, 128] | [1024, 512, 256] |
| layer_norm | no | yes | yes |
Hyperparameters were found via TPE Bayesian optimisation (content-extractor-rl tune). The full search results are in output/.
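To get a feel for what the epsilon_decay values in the table mean, the sketch below works out how quickly exploration fades under a multiplicative schedule. It assumes epsilon starts at 1.0 and is multiplied by the decay factor once per step. Whether the crate decays per episode or per environment step, and what floor it uses, is not stated here, so treat `epsilon_after` and `steps_to_reach` as hypothetical helpers.

```rust
/// Epsilon after `n` multiplicative decay steps, floored at `eps_min`.
fn epsilon_after(eps_start: f64, decay: f64, eps_min: f64, n: u32) -> f64 {
    (eps_start * decay.powi(n as i32)).max(eps_min)
}

/// Number of decay steps until epsilon first drops to or below `target`.
/// Returns None when the inputs cannot produce a decaying schedule.
fn steps_to_reach(eps_start: f64, decay: f64, target: f64) -> Option<u32> {
    if !(decay > 0.0 && decay < 1.0) || target <= 0.0 || eps_start <= 0.0 {
        return None;
    }
    if eps_start <= target {
        return Some(0);
    }
    // Solve eps_start * decay^n <= target for the smallest integer n.
    Some(((target / eps_start).ln() / decay.ln()).ceil() as u32)
}
```

Under these assumptions, a decay of 0.9851 brings epsilon from 1.0 down to 0.01 in roughly 300 steps, while 0.9959 takes roughly 1,100, so the SAC configuration explores for several times longer than the DuelingDQN one.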
Using a pre-trained model
Use DuelingDQN.onnx for all production workloads. The PPO and SAC models are experimental and provided for research comparison only.
Algorithm notes
- DuelingDQN — fully stable training run; no warnings. Best choice for production inference.
- PPO — stable training run on CUDA GPU (36.2 hours). Quality plateaus around episode 7,500.
- SAC — the automatic entropy temperature (log_alpha) was not receiving gradient updates in earlier code due to a disconnected computation graph (constant tensor vs. the Var leaf). This has been fixed in v0.1.3. The included SAC.onnx was trained prior to the fix and should be considered a baseline rather than a tuned model.
Downloading via the CLI
# Download the latest general-purpose DQN model
# List available models
API Reference
Core types
// Main configuration (selected fields)
// Extraction result
// One-call extraction (RL agent when Some, else hybrid/heuristic, baseline fallback)
Training
// A training example; ground truth enables token-F1 reward.
// Standard training loop
// Training with curriculum learning + ground-truth reward
Agent interface
// Create an agent from scratch
Hybrid node classifier
// Supervised content-node classifier (MLP over NodeFeatures).
// Labels for free: 1.0 for the candidate whose text best matches the ground truth.
// End-to-end: classifier (or heuristic) picks the node, params drive extraction.
Baseline extractor
Hyperparameter optimisation
// TPE Bayesian optimisation
Evaluation
Environment variables
| Variable | Default | Description |
|---|---|---|
| ARTICLE_EXTRACTOR_MODEL_PATH | — | Path to a saved model file |
| ARTICLE_EXTRACTOR_SITE_PROFILES | ./site_profiles | Directory for per-domain profiles |
| ARTICLE_EXTRACTOR_OUTPUT_DIR | ./output | Directory for extraction outputs |
| ARTICLE_EXTRACTOR_DATA_DIR | — | Training data directory |
Feature Flags
| Flag | Default | Description |
|---|---|---|
| cuda | off | Enable CUDA GPU acceleration via candle-core/cuda |
| mlflow-rs | off | Enable MLflow experiment tracking |
Algorithm Status
| Algorithm | Status | Notes |
|---|---|---|
| DuelingDQN | Production-ready | Fully tested, checkpoint resume, prioritised replay, gradient clipping |
| PPO | Experimental | Actor-critic structure working; GAE not fully verified |
| SAC | Experimental | Twin-Q + automatic entropy tuning; now has gradient clipping (fixes the earlier NaN divergence under high learning rates) |
| TD3 | Not implemented | Placeholder in AlgorithmType enum |
| Rainbow | Not implemented | Placeholder in AlgorithmType enum |
Use AlgorithmType::DuelingDQN for all production workloads.
Performance Notes
- Baseline extraction runs in < 5 ms per page on any hardware.
- DQN inference (model loaded) runs in 10–30 ms per page on CPU; < 5 ms on GPU.
- Training throughput on CPU: ~200–500 episodes/min depending on HTML complexity.
- Training throughput on A100 GPU: ~2000–5000 episodes/min with --features cuda.
- The replay buffer holds 100,000 experiences by default (adjust Config::replay_buffer_size for memory-constrained environments).
- rayon-based parallel extraction is available for batch workloads via extract-batch.
Python Bindings
The content-extractor-rl-py crate compiles to a native Python module named
content_extractor_rl_rs, built with Maturin.
Building / installing the wheel
# Option A — develop install into the current venv (fastest for iterating)
# Option B — build a distributable wheel, then pip install it
# With CUDA support:
Requires a Rust toolchain and Python 3.8+. On aarch64 (e.g. Raspberry Pi 5), the candle dependency needs FP16 SIMD; the repo's .cargo/config.toml sets target-cpu=native to enable it.
Usage
# No model -> hybrid heuristic selection (no training required).
# ...or load a trained RL model (DQN/PPO/SAC auto-detected):
# "rl" | "hybrid" | "baseline"
# Batch extraction: list of (html, url) tuples -> {"articles": [...]}
# Check hardware
Training from Python
# html_samples is a list of (html, url) pairs.
Note: the Python train accepts (html, url) pairs and trains against the self-supervised quality proxy. For token-F1-against-ground-truth training, use the Rust API (TrainingSample::with_ground_truth) or the CLI train command, which reads the ground-truth text from the paired JSON files.
Contributing
Contributions are welcome. Areas where help is most needed:
- Completing PPO and SAC agent training loops
- Expanding the test suite (especially for environment.rs)
- Ground-truth datasets for news domains
- ONNX export improvements
Please open an issue before submitting a large PR.
License
Licensed under either of:
at your option.