Content Extractor RL
A high-performance Rust library for extracting article content from HTML pages. Uses Deep Reinforcement Learning (Dueling DQN with prioritized experience replay) with a heuristic baseline fallback, site-specific profile memory, and curriculum learning.
Features
- DQN-based extraction — Dueling DQN with prioritized experience replay navigates the DOM tree to select the best content node
- Baseline fallback — stopword-density heuristic that requires no trained model
- Site profile memory — per-domain XPath patterns learned and reused across sessions
- Curriculum learning — training progresses from simple to complex HTML layouts automatically
- Hyperparameter optimization — grid search and Tree-structured Parzen Estimator (TPE) Bayesian optimization
- Multiple RL algorithms — DuelingDQN (production-ready), PPO and SAC (experimental)
- CUDA acceleration — optional GPU support via the `cuda` feature flag
- SafeTensors + ONNX serialization — trained models saved in portable formats
- MLflow integration — optional experiment tracking via the `mlflow-rs` feature
- Python bindings — PyO3-based bindings for Python consumers (`content-extractor-rl-py`)
- CLI tool — full-featured `content-extractor-rl` binary for training and extraction
Table of Contents
- Installation
- Quick Start
- Architecture
- CLI Tool
- Training Custom Models
- Downloading Pre-trained Weights
- API Reference
- Feature Flags
- Performance Notes
- Contributing
- License
Installation
Add to your Cargo.toml:
```toml
[dependencies]
content-extractor-rl = "0.1"
```
With CUDA support:
```toml
[dependencies]
content-extractor-rl = { version = "0.1", features = ["cuda"] }
```
With MLflow experiment tracking:
```toml
[dependencies]
content-extractor-rl = { version = "0.1", features = ["mlflow-rs"] }
```
System Requirements
| Requirement | Version |
|---|---|
| Rust | 1.74+ |
| CUDA (optional) | 11.8+ (for cuda feature) |
| Python (optional) | 3.8+ (for Python bindings) |
On Ubuntu/Debian, make sure the usual build prerequisites for native crates (a C toolchain and `pkg-config`) are installed before building.
Quick Start
Baseline extraction (no trained model required)
A minimal sketch (the type and method names are assumptions based on the module layout, not a verified API):

```rust
use content_extractor_rl::BaselineExtractor;

fn main() {
    let html = std::fs::read_to_string("article.html").expect("read HTML");
    let extractor = BaselineExtractor::new();
    let article = extractor.extract(&html);
    println!("{}", article.text);
}
```
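The stopword-density idea behind the baseline can be sketched in a few lines (the word list and scoring are illustrative, not the crate's actual implementation):

```rust
/// Fraction of words that are common stopwords — article prose scores
/// high, navigation/boilerplate scores low. (Illustrative word list.)
fn stopword_density(text: &str) -> f32 {
    const STOPWORDS: &[&str] = &["the", "a", "an", "and", "of", "to", "in", "is", "it", "that"];
    let words: Vec<String> = text
        .split_whitespace()
        .map(|w| w.to_lowercase())
        .collect();
    if words.is_empty() {
        return 0.0;
    }
    let hits = words.iter().filter(|w| STOPWORDS.contains(&w.as_str())).count();
    hits as f32 / words.len() as f32
}

fn main() {
    // "the" appears twice among six words → density ≈ 0.33
    println!("{:.2}", stopword_density("The cat sat on the mat"));
}
```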
RL-based extraction with a trained model
A hedged sketch (agent type and `load`/`extract` methods are assumptions):

```rust
use content_extractor_rl::{Config, DqnAgent};
use std::path::Path;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let config = Config::default();
    let agent = DqnAgent::load(Path::new("dqn_general_v1.safetensors"), &config)?;
    let html = std::fs::read_to_string("article.html")?;
    let article = agent.extract(&html)?;
    println!("{}", article.text);
    Ok(())
}
```
Train a model on your own data
A hedged sketch (`AgentFactory`, `AlgorithmType::DuelingDQN`, and `train_with_improvements` are named elsewhere in this README; the exact signatures are assumptions):

```rust
use content_extractor_rl::{AgentFactory, AlgorithmType, Config};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut config = Config::default();
    config.data_dir = "./training_data".into();
    let mut agent = AgentFactory::create(AlgorithmType::DuelingDQN, &config)?;
    // Curriculum learning + improved rewards
    content_extractor_rl::train_with_improvements(&mut agent, &config)?;
    Ok(())
}
```
Architecture
content-extractor-rl (workspace)
├── crates/content-extractor-rl ← Rust library (this crate)
│ ├── src/
│ │ ├── lib.rs ← Public API & re-exports
│ │ ├── config.rs ← Configuration & env vars
│ │ ├── baseline_extractor.rs ← Heuristic extraction
│ │ ├── html_parser.rs ← DOM traversal, candidate extraction
│ │ ├── text_utils.rs ← Tokenisation, quality metrics
│ │ ├── environment.rs ← RL environment (state/action/reward)
│ │ ├── replay_buffer.rs ← Prioritised experience replay
│ │ ├── reward.rs ← Multi-component reward calculator
│ │ ├── curriculum.rs ← Curriculum learning manager
│ │ ├── models.rs ← Dueling DQN network (Candle)
│ │ ├── agents/
│ │ │ ├── mod.rs ← RLAgent trait & AgentFactory
│ │ │ ├── dqn_agent.rs ← Dueling DQN (production-ready)
│ │ │ ├── ppo_agent.rs ← PPO actor-critic (experimental)
│ │ │ └── sac_agent.rs ← SAC twin-Q (experimental)
│ │ ├── training.rs ← Training loops
│ │ ├── hyperparameter.rs ← Grid search
│ │ ├── hyperparameter_tuner.rs ← TPE Bayesian optimisation
│ │ ├── site_profile.rs ← Per-domain pattern memory
│ │ ├── checkpoint.rs ← Save/resume checkpoints
│ │ ├── evaluation/ ← Ground-truth & algorithm comparison
│ │ └── plotting.rs ← Training visualisation
│ └── tests/ ← Integration tests
├── crates/content-extractor-rl-cli ← CLI binary
└── crates/content-extractor-rl-py ← Python bindings (PyO3/Maturin)
RL Environment
| Detail | |
|---|---|
| State space | 300-dimensional float vector (document features + candidate node features + domain history) |
| Action space | 16 discrete actions (select candidate 0-9, navigate parent/siblings, terminate) + 6 continuous parameters |
| Reward | Multi-component: text quality (50%), length bonus, structure bonus, improvement over baseline |
| Episode length | Up to 50 steps |
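The reward row above can be written as a weighted sum. Only the 50% text-quality share is stated in the table; the remaining weights below are assumptions for illustration:

```rust
/// Hedged sketch of the multi-component reward. Text quality carries the
/// stated 50% weight; the other weights are illustrative assumptions.
fn compute_reward(
    text_quality: f32,         // 0.0..=1.0 quality score
    length_bonus: f32,         // bonus for plausible article length
    structure_bonus: f32,      // bonus for clean paragraph structure
    baseline_improvement: f32, // gain over the heuristic baseline
) -> f32 {
    0.5 * text_quality
        + 0.2 * length_bonus
        + 0.1 * structure_bonus
        + 0.2 * baseline_improvement
}

fn main() {
    println!("{}", compute_reward(1.0, 1.0, 1.0, 1.0));
}
```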
Neural Network
The Dueling DQN network architecture:
Input (300) → FC(512) → LN → ReLU → FC(256) → LN → ReLU → FC(128) → LN → ReLU
│
┌────────────────────────────────────────────┤
│ │
Value stream Advantage stream
FC(64) → FC(1) FC(64) → FC(16)
│ │
└──────── Q(s,a) = V(s) + A(s,a) - mean(A) ┘
│
Continuous params
FC(128) → FC(6) → tanh
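The dueling combination step in the diagram can be written directly:

```rust
/// Dueling aggregation: Q(s,a) = V(s) + A(s,a) - mean(A).
/// Subtracting the mean advantage keeps V and A identifiable.
fn dueling_q(value: f32, advantages: &[f32]) -> Vec<f32> {
    let mean = advantages.iter().sum::<f32>() / advantages.len() as f32;
    advantages.iter().map(|a| value + a - mean).collect()
}

fn main() {
    let q = dueling_q(1.0, &[1.0, 3.0]);
    println!("{q:?}"); // mean(A) = 2 → [0.0, 2.0]
}
```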
CLI Tool
The `content-extractor-rl-cli` crate installs as the `content-extractor-rl` binary.
Commands
Extract a single article

```shell
# (illustrative — run content-extractor-rl --help for the actual flags)
content-extractor-rl extract --input article.html --output article.txt
```

Batch extract from a directory

```shell
content-extractor-rl extract-batch --input ./pages --output ./extracted
```

Train a model

```shell
# Standard training (DQN, 5000 episodes)
content-extractor-rl train --data ./training_data --episodes 5000

# Improved training with curriculum learning
content-extractor-rl train --data ./training_data --curriculum

# Auto-hyperparameter search before training
content-extractor-rl train --data ./training_data --auto-tune
```

Hyperparameter tuning (TPE Bayesian optimisation)

```shell
content-extractor-rl tune --data ./training_data

# Resume an interrupted tuning run
content-extractor-rl tune --resume
```

Evaluate extraction quality against ground truth

```shell
content-extractor-rl evaluate --data ./eval_data
```

Compare multiple algorithms

```shell
content-extractor-rl compare --data ./eval_data
```
Training Custom Models
Preparing training data
Collect raw HTML pages from the websites you care about. Place them in a flat directory — the filename should contain the domain for site-profile tracking:
training_data/
├── reuters_com_article_001.html
├── reuters_com_article_002.html
├── bbc_co_uk_article_001.html
├── techcrunch_com_post_001.html
└── ...
Recommended minimum: 100 HTML files per domain, 500+ total.
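Since the filename encodes the domain, a small helper along these lines can recover it for site-profile tracking (the recognised suffix tokens are an illustrative assumption, not the crate's actual logic):

```rust
/// Recover "reuters.com" from "reuters_com_article_001.html".
/// The domain-suffix token list is an illustrative assumption.
fn domain_from_filename(name: &str) -> Option<String> {
    const SUFFIX_TOKENS: &[&str] = &["com", "org", "net", "co", "uk", "de", "fr"];
    let stem = name.strip_suffix(".html")?;
    let tokens: Vec<&str> = stem.split('_').collect();
    // Domain = leading tokens up to and including the last suffix-like token.
    let end = tokens.iter().rposition(|t| SUFFIX_TOKENS.contains(t))? + 1;
    Some(tokens[..end].join("."))
}

fn main() {
    println!("{:?}", domain_from_filename("reuters_com_article_001.html"));
    println!("{:?}", domain_from_filename("bbc_co_uk_article_001.html"));
}
```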
Training from Rust code
use ;
use Path;
Training for specific news/article websites
To customise the model for specific websites, the key levers are:

- Site profiles — the library automatically builds per-domain XPath profiles as it trains. After training, save the site profile directory (the flag shown is illustrative):

  ```shell
  content-extractor-rl train --data ./training_data --site-profiles ./site_profiles
  # site_profiles/ now contains per-domain learned patterns
  ```

- Reward shaping — the `ImprovedRewardCalculator` scores extractions on text quality. If a site uses non-standard markup, you can adjust quality thresholds in the `Config`:

  ```rust
  config.min_word_threshold = 50; // minimum words to count as an article
  config.stopword_weight = 2.5;   // reward stopword-rich paragraphs more
  ```

- Curriculum difficulty — for sites with complex layouts (heavy JavaScript-rendered content, infinite scroll), start with simpler pages. The `CurriculumManager` handles this automatically when you use `train_with_improvements`.

- Pre-training workflow for a set of target sites (subcommands and flags are illustrative):

  ```shell
  # Step 1: Tune hyperparameters on a representative sample
  content-extractor-rl tune --data ./sample_pages

  # Step 2: Train with the best hyperparameters
  content-extractor-rl train --data ./training_data --params best_params.json

  # Step 3: Verify quality
  content-extractor-rl evaluate --data ./eval_data
  ```
Training data format for ground-truth evaluation
To use the `evaluate` command and measure accuracy against known-good extractions, provide JSON files alongside the HTML:
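A plausible shape for such a ground-truth file (the field names are assumptions; check the evaluation module for the actual schema):

```json
{
  "title": "Example article title",
  "text": "The full expected article body text…",
  "url": "https://example.com/article"
}
```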
Downloading Pre-trained Weights
Pre-trained model weights are provided as GitHub Release attachments. These are general-purpose models trained on a diverse corpus of news and blog articles.
Downloading via the CLI
```shell
# (illustrative — check --help for the actual subcommand)
# Download the latest general-purpose DQN model
content-extractor-rl download --model dqn_general_v1

# List available models
content-extractor-rl download --list
```
Manual download
Go to GitHub Releases and download:
| File | Description | Size |
|---|---|---|
| `dqn_general_v1.safetensors` | General news/blog articles | ~15 MB |
| `dqn_news_v1.safetensors` | Tuned for news sites (Reuters, BBC, etc.) | ~15 MB |
| `site_profiles_v1.tar.gz` | Site-specific XPath profiles | ~1 MB |
Using a downloaded model
A hedged sketch (agent type and `load` method are assumptions):

```rust
use content_extractor_rl::{Config, DqnAgent};
use std::path::Path;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let config = Config::default();
    let agent = DqnAgent::load(Path::new("dqn_general_v1.safetensors"), &config)?;
    let html = std::fs::read_to_string("article.html")?;
    println!("{}", agent.extract(&html)?.text);
    Ok(())
}
```
API Reference

The snippets below are illustrative sketches; identifiers not used elsewhere in this README are assumptions, not verified signatures.

Core types

```rust
// Main configuration
let mut config = Config::default();

// Extraction result (field names are assumptions)
pub struct ExtractedArticle {
    pub title: Option<String>,
    pub text: String,
    pub node_xpath: String,
}
```

Training

```rust
// Standard training loop
let stats = train(&mut agent, &mut env, &config)?;

// Training with curriculum learning + improved rewards
let stats = train_with_improvements(&mut agent, &mut env, &config)?;
```

Agent interface

```rust
// Create an agent from scratch
let mut agent = AgentFactory::create(AlgorithmType::DuelingDQN, &config)?;
```

Baseline extractor

```rust
let article = BaselineExtractor::new().extract(&html);
```

Hyperparameter optimisation

```rust
// TPE Bayesian optimisation
let best = HyperparameterTuner::new(&config).tune(num_trials)?;
```

Evaluation

```rust
// Compare against ground-truth JSON files
let report = evaluate_against_ground_truth(&agent, "./eval_data")?;
```
Environment variables
| Variable | Default | Description |
|---|---|---|
| `ARTICLE_EXTRACTOR_MODEL_PATH` | — | Path to a saved model file |
| `ARTICLE_EXTRACTOR_SITE_PROFILES` | `./site_profiles` | Directory for per-domain profiles |
| `ARTICLE_EXTRACTOR_OUTPUT_DIR` | `./output` | Directory for extraction outputs |
| `ARTICLE_EXTRACTOR_DATA_DIR` | — | Training data directory |
Feature Flags
| Flag | Default | Description |
|---|---|---|
| `cuda` | off | Enable CUDA GPU acceleration via `candle-core/cuda` |
| `mlflow-rs` | off | Enable MLflow experiment tracking |
Algorithm Status
| Algorithm | Status | Notes |
|---|---|---|
| `DuelingDQN` | Production-ready | Fully tested, checkpoint resume, prioritised replay |
| `PPO` | Experimental | Actor-critic structure working; GAE not fully verified |
| `SAC` | Experimental | Twin-Q networks present; entropy tuning needs testing |
| `TD3` | Not implemented | Placeholder in `AlgorithmType` enum |
| `Rainbow` | Not implemented | Placeholder in `AlgorithmType` enum |
Use AlgorithmType::DuelingDQN for all production workloads.
Performance Notes
- Baseline extraction runs in < 5 ms per page on any hardware.
- DQN inference (model loaded) runs in 10–30 ms per page on CPU; < 5 ms on GPU.
- Training throughput on CPU: ~200–500 episodes/min depending on HTML complexity.
- Training throughput on an A100 GPU: ~2000–5000 episodes/min with `--features cuda`.
- The replay buffer holds 100,000 experiences by default (adjust `Config::replay_buffer_size` for memory-constrained environments).
- `rayon`-based parallel extraction is available for batch workloads via `extract-batch`.
Python Bindings
The `content-extractor-rl-py` crate provides a Python package built with Maturin (the PyPI package name below is an assumption):

```shell
pip install content-extractor-rl

# or build from source:
cd crates/content-extractor-rl-py
maturin develop --release
```
Contributing
Contributions are welcome. Areas where help is most needed:
- Completing PPO and SAC agent training loops
- Expanding the test suite (especially for
environment.rs) - Ground-truth datasets for news domains
- ONNX export improvements
Please open an issue before submitting a large PR.
License
Licensed under either of:
at your option.