content-extractor-rl-cli 1.0.0

RL-based article extraction from HTML using Deep Q-Networks and heuristic fallback

Content Extractor RL

License: MIT OR Apache-2.0

A high-performance Rust library for extracting article content from HTML pages. It combines a supervised content-node classifier (the "hybrid" selector) with Deep Reinforcement Learning (Dueling DQN with prioritized experience replay; experimental PPO/SAC) that tunes extraction parameters, scored by token F1 against ground-truth article text. A zero-dependency heuristic baseline fallback, site-specific profile memory, and curriculum learning round it out.

Features

  • DQN-based extraction — Dueling DQN with prioritized experience replay navigates the DOM tree to select the best content node, observing real per-candidate DOM features (word/link density, tag type, depth, Readability-style class/id signals)
  • Hybrid node classification — a supervised content-node classifier selects the article node (labels derived automatically from ground-truth text), while RL tunes the continuous extraction parameters; falls back to a Readability-style heuristic when untrained
  • Ground-truth reward — training scores extractions by token F1 against the labelled article text, not a self-referential proxy
  • Baseline fallback — stopword-density heuristic runs with zero dependencies on a trained model
  • Site profile memory — per-domain XPath patterns learned and reused across sessions
  • Curriculum learning — training progresses from simple to complex HTML layouts automatically
  • Hyperparameter optimization — grid search and Tree-structured Parzen Estimator (TPE) Bayesian optimization
  • Multiple RL algorithms — DuelingDQN (production-ready), PPO and SAC (experimental)
  • CUDA acceleration — optional GPU support via the cuda feature flag
  • SafeTensors + ONNX serialization — trained models saved in portable formats
  • MLflow integration — optional experiment tracking via the mlflow-rs feature
  • Python bindings — PyO3-based bindings for Python consumers (content-extractor-rl-py)
  • CLI tool — full-featured content-extractor-rl binary for training and extraction

Installation

Add to your Cargo.toml:

[dependencies]
content-extractor-rl = "0.1"

With CUDA support:

[dependencies]
content-extractor-rl = { version = "0.1", features = ["cuda"] }

With MLflow experiment tracking:

[dependencies]
content-extractor-rl = { version = "0.1", features = ["mlflow-rs"] }

System Requirements

Requirement        Version
Rust               1.74+
CUDA (optional)    11.8+ (for cuda feature)
Python (optional)  3.8+ (for Python bindings)

On Ubuntu/Debian, install the native build dependencies (OpenSSL development headers and pkg-config):

sudo apt-get install libssl-dev pkg-config

Quick Start

Baseline extraction (no trained model required)

use content_extractor_rl::{Config, BaselineExtractor, Result};

fn main() -> Result<()> {
    let config = Config::default();
    let extractor = BaselineExtractor::new(config.stopwords.clone());

    let html = std::fs::read_to_string("article.html")?;
    let article = extractor.extract(&html)?;

    println!("Title:   {}", article.title.unwrap_or_default());
    println!("Quality: {:.3}", article.quality_score);
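    // Preview the first 200 characters. Byte-slicing with `[..200]` would panic
    // on shorter content or when byte 200 falls inside a multi-byte character.
    println!("Content: {}", article.content.chars().take(200).collect::<String>());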
    Ok(())
}

Hybrid extraction (no trained model required)

extract_article is the one-call entry point. With agent = None it uses the supervised/heuristic hybrid node selector (Readability-style features) plus block-level filtering — strictly better than the raw baseline, and needs no model:

use content_extractor_rl::{Config, extract_article, Result};

fn main() -> Result<()> {
    let config = Config::default();
    let html = std::fs::read_to_string("article.html")?;

    let article = extract_article(&html, "https://example.com/article", &config, None)?;

    println!("Method:  {}", article.method);          // "hybrid" (or "baseline" fallback)
    println!("Title:   {}", article.title.unwrap_or_default());
    println!("Quality: {:.3}", article.quality_score);
    println!("Content: {}", article.content);
    Ok(())
}
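The block-level filtering step mentioned above can be pictured with a small standalone filter. This is only a sketch, not the crate's implementation: it assumes each text block knows how many of its words sit inside links, and it borrows the parameter names min_block_words and max_block_link_density from the training notes in this README. The Block and FilterParams types, the helper functions, and the threshold values are all hypothetical.

```rust
/// Hypothetical text block: its text and how many of its words are link text.
#[derive(Debug, Clone)]
struct Block {
    text: String,
    link_words: usize,
}

/// Illustrative thresholds; the names mirror the README, the values are assumptions.
struct FilterParams {
    min_block_words: usize,
    max_block_link_density: f32,
}

/// Fraction of a block's words that are link text (0.0 for an empty block).
fn link_density(block: &Block) -> f32 {
    let words = block.text.split_whitespace().count();
    if words == 0 {
        0.0
    } else {
        block.link_words as f32 / words as f32
    }
}

/// Keep a block only if it is long enough and not dominated by links.
fn keep_block(block: &Block, params: &FilterParams) -> bool {
    let words = block.text.split_whitespace().count();
    words >= params.min_block_words && link_density(block) <= params.max_block_link_density
}

/// Return the blocks that pass the filter, in their original order.
fn filter_blocks(blocks: &[Block], params: &FilterParams) -> Vec<Block> {
    blocks.iter().filter(|b| keep_block(b, params)).cloned().collect()
}

fn main() {
    let params = FilterParams { min_block_words: 5, max_block_link_density: 0.5 };
    let blocks = vec![
        Block { text: "Home News Sport Weather Travel".into(), link_words: 5 },
        Block { text: "The council approved the new budget on Tuesday evening.".into(), link_words: 0 },
        Block { text: "Read more".into(), link_words: 2 },
    ];
    for b in filter_blocks(&blocks, &params) {
        println!("{}", b.text);
    }
}
```

In this toy input the navigation row fails the link-density test and the "Read more" stub fails the length test, so only the sentence of body text survives.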

RL-based extraction with a trained model

Pass a loaded agent to extract_article; it runs the agent greedily through the extraction environment and returns the best result. AgentFactory::load auto-detects the algorithm (DQN/PPO/SAC) from the model file:

use content_extractor_rl::{Config, AgentFactory, extract_article, get_device, Result};
use std::path::Path;

fn extract_with_model(html: &str, url: &str, model_path: &str) -> Result<String> {
    let config = Config::default();
    let device = get_device(); // CPU, or CUDA when built with --features cuda

    let agent = AgentFactory::load(
        Path::new(model_path),
        config.state_dim,
        config.num_discrete_actions,
        config.num_continuous_params,
        &device,
    )?;

    let article = extract_article(html, url, &config, Some(agent.as_ref()))?;
    Ok(article.content)
}

Train a model on your own data

Training rewards extractions by token F1 against the ground-truth article text. Provide it via TrainingSample::with_ground_truth; if you only have (html, url) pairs, TrainingSample::from((html, url)) trains against a self-supervised quality proxy instead.

use content_extractor_rl::{Config, TrainingSample, train_with_improvements, Result};
use std::path::Path;

fn main() -> Result<()> {
    let config = Config::default();

    let samples: Vec<TrainingSample> = vec![
        TrainingSample::with_ground_truth(
            std::fs::read_to_string("page1.html")?,
            "https://example.com/1".to_string(),
            "The known-good article body text…".to_string(),
        ),
        // …or without ground truth:
        TrainingSample::from((
            std::fs::read_to_string("page2.html")?,
            "https://example.com/2".to_string(),
        )),
    ];

    let (agent, metrics) = train_with_improvements(&config, samples)?;

    println!("Episodes:     {}", metrics.episode_rewards.len());
    println!("Best quality: {:.3}", metrics.best_avg_quality);

    agent.save(Path::new("models/my_model.safetensors"))?;
    Ok(())
}

Train & use the hybrid node classifier

The supervised classifier learns which DOM node is the article body (labels are derived automatically from ground-truth text). Train it, save it, then load it for extraction:

use content_extractor_rl::{
    Config, TrainingSample, train_classifier, NodeClassifier, HybridExtractor,
    extract_article_hybrid, get_device, Result,
};
use std::path::Path;

fn main() -> Result<()> {
    let config = Config::default();
    let device = get_device();

    // Samples must carry ground-truth text for the classifier to learn from.
    // load_labelled_samples is not among the imports above; supply your own loader.
    let samples: Vec<TrainingSample> = load_labelled_samples();

    let (classifier, loss) = train_classifier(&samples, &config, 300, 1e-2, &device)?;
    println!("Final BCE loss: {loss:.4}");
    classifier.save(Path::new("models/node_classifier.safetensors"))?;

    // …later, load and extract:
    let clf = NodeClassifier::load(Path::new("models/node_classifier.safetensors"), &device, 1e-3)?;
    let hybrid = HybridExtractor::with_classifier(clf, config.stopwords.clone());
    let html = std::fs::read_to_string("article.html")?;
    let article = extract_article_hybrid(&html, "https://example.com/post", &config, &hybrid)?;
    println!("{}", article.content);
    Ok(())
}

Architecture

content-extractor-rl (workspace)
├── crates/content-extractor-rl        ← Rust library (this crate)
│   ├── src/
│   │   ├── lib.rs                  ← Public API & re-exports
│   │   ├── config.rs               ← Configuration & env vars
│   │   ├── baseline_extractor.rs   ← Heuristic extraction
│   │   ├── html_parser.rs          ← DOM traversal, candidate extraction
│   │   ├── text_utils.rs           ← Tokenisation, quality metrics, token F1
│   │   ├── node_features.rs        ← Real per-candidate DOM features + extraction params
│   │   ├── node_classifier.rs      ← Supervised content-node classifier + HybridExtractor
│   │   ├── environment.rs          ← RL environment (real state/action/reward MDP)
│   │   ├── replay_buffer.rs        ← Prioritised experience replay
│   │   ├── reward.rs               ← Legacy multi-component reward calculator
│   │   ├── curriculum.rs           ← Curriculum learning manager
│   │   ├── models.rs               ← Dueling DQN network (Candle)
│   │   ├── agents/
│   │   │   ├── mod.rs              ← RLAgent trait & AgentFactory
│   │   │   ├── dqn_agent.rs        ← Dueling DQN (production-ready)
│   │   │   ├── ppo_agent.rs        ← PPO actor-critic (experimental)
│   │   │   └── sac_agent.rs        ← SAC twin-Q (experimental)
│   │   ├── training.rs             ← Training loops
│   │   ├── hyperparameter.rs       ← Grid search
│   │   ├── hyperparameter_tuner.rs ← TPE Bayesian optimisation
│   │   ├── site_profile.rs         ← Per-domain pattern memory
│   │   ├── checkpoint.rs           ← Save/resume checkpoints
│   │   ├── evaluation/             ← Ground-truth & algorithm comparison
│   │   └── plotting.rs             ← Training visualisation
│   └── tests/                      ← Integration tests
├── crates/content-extractor-rl-cli    ← CLI binary
└── crates/content-extractor-rl-py     ← Python bindings (PyO3/Maturin)

RL Environment

Property        Detail
State space     300-dimensional float vector: real per-candidate DOM features (word/link density, stopword ratio, tag type, depth, Readability class/id signals) + global document + selection state
Action space    16 discrete actions (select candidate 0-9, navigate parent/siblings, expand/contract, terminate) + 6 continuous parameters (block filtering)
Reward          Token F1 of the extracted text vs. the ground-truth article (falls back to a text-quality proxy when no ground truth is supplied)
Episode length  Up to max_steps_per_episode (default 20)
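To make the reward concrete, here is a minimal sketch of token-level F1 with multiset overlap, the usual definition of the metric. It is not the crate's TextUtils::token_f1, which may tokenise or normalise differently; the tokenize and token_prf functions below are hypothetical, and the tokenizer (lowercase, split on non-alphanumeric characters) is an assumption.

```rust
use std::collections::HashMap;

/// Lowercase and split on non-alphanumeric characters (an assumed tokenizer).
fn tokenize(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(|t| t.to_lowercase())
        .collect()
}

/// Token-level (precision, recall, F1) using multiset overlap.
fn token_prf(extracted: &str, ground_truth: &str) -> (f32, f32, f32) {
    let pred = tokenize(extracted);
    let gold = tokenize(ground_truth);
    if pred.is_empty() || gold.is_empty() {
        return (0.0, 0.0, 0.0);
    }
    // Count the gold tokens, then consume one count per matching predicted token.
    let mut remaining: HashMap<&str, usize> = HashMap::new();
    for t in &gold {
        *remaining.entry(t.as_str()).or_insert(0) += 1;
    }
    let mut overlap = 0usize;
    for t in &pred {
        if let Some(n) = remaining.get_mut(t.as_str()) {
            if *n > 0 {
                *n -= 1;
                overlap += 1;
            }
        }
    }
    if overlap == 0 {
        return (0.0, 0.0, 0.0);
    }
    let precision = overlap as f32 / pred.len() as f32;
    let recall = overlap as f32 / gold.len() as f32;
    let f1 = 2.0 * precision * recall / (precision + recall);
    (precision, recall, f1)
}

fn main() {
    let gold = "the cat sat on the mat";
    let pred = "the cat sat on a sofa";
    let (p, r, f1) = token_prf(pred, gold);
    println!("precision={p:.3} recall={r:.3} f1={f1:.3}");
}
```

Because the overlap is counted per occurrence, an extraction that repeats a common word many times gets no extra credit, and one that drags in navigation or footer text loses precision.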

Neural Network

The Dueling DQN network architecture:

Input (300) → FC(512) → LN → ReLU → FC(256) → LN → ReLU → FC(128) → LN → ReLU
                                                                           │
                              ┌────────────────────────────────────────────┤
                              │                                            │
                       Value stream                               Advantage stream
                       FC(64) → FC(1)                            FC(64) → FC(16)
                              │                                            │
                              └──────── Q(s,a) = V(s) + A(s,a) - mean(A) ┘
                                                         │
                                               Continuous params
                                               FC(128) → FC(6) → tanh
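The aggregation step in the diagram, Q(s,a) = V(s) + A(s,a) - mean(A), can be sketched on plain slices. This is an illustration of the formula only; the crate computes it on Candle tensors, and the function names here are hypothetical.

```rust
/// Combine the value and advantage streams: Q(s,a) = V(s) + A(s,a) - mean(A).
fn dueling_q_values(value: f32, advantages: &[f32]) -> Vec<f32> {
    if advantages.is_empty() {
        return Vec::new();
    }
    let mean = advantages.iter().sum::<f32>() / advantages.len() as f32;
    advantages.iter().map(|a| value + a - mean).collect()
}

/// Greedy action: index of the largest Q-value (None for an empty slice).
fn argmax(q: &[f32]) -> Option<usize> {
    q.iter()
        .enumerate()
        .max_by(|a, b| a.1.total_cmp(b.1))
        .map(|(i, _)| i)
}

fn main() {
    let value = 0.5;
    let advantages = [1.0, 3.0, 2.0];
    let q = dueling_q_values(value, &advantages);
    println!("{:?} -> greedy action {:?}", q, argmax(&q));
}
```

Subtracting the mean advantage pins the average of the Q-values to V(s). Without it, any constant could be shifted between the two streams without changing Q, so neither stream would have a well-defined meaning.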

CLI Tool

The content-extractor-rl-cli crate installs as the content-extractor-rl binary.

cargo install content-extractor-rl-cli

Commands

Extract a single article

--model is optional. Without it, extraction uses the hybrid heuristic node selector (no model required); with it, the trained RL agent drives selection.

# Hybrid heuristic (no model)
content-extractor-rl extract \
    --html-file article.html \
    --url https://example.com/article \
    --output result.json

# With a trained RL model (DQN/PPO/SAC auto-detected)
content-extractor-rl extract \
    --html-file article.html \
    --url https://example.com/article \
    --model models/DuelingDQN.safetensors \
    --output result.json

# With a trained node classifier (hybrid selector)
content-extractor-rl extract \
    --html-file article.html \
    --url https://example.com/article \
    --classifier models/node_classifier.safetensors \
    --output result.json

Batch extract from a directory

content-extractor-rl extract-batch \
    --archive-dir ./html_archive \
    --model models/dqn_model.safetensors \
    --output-dir ./extracted \
    --max-files 1000

Train a model

# Standard training (DQN, 5000 episodes)
content-extractor-rl train \
    --data-dir ./training_html \
    --episodes 5000 \
    --algorithm dqn

# Improved training with curriculum learning
content-extractor-rl train \
    --data-dir ./training_html \
    --episodes 10000 \
    --improved \
    --algorithm dqn \
    --models-dir ./models

# Auto-hyperparameter search before training
content-extractor-rl train \
    --data-dir ./training_html \
    --episodes 10000 \
    --improved \
    --auto-hyperparams

Train the node classifier (hybrid selector)

Trains the supervised content-node classifier from HTML + paired ground-truth JSON, and writes a .safetensors file usable via extract --classifier:

content-extractor-rl train-classifier \
    --data-dir ./training_html \
    --output models/node_classifier.safetensors \
    --epochs 300 \
    --learning-rate 0.01

Hyperparameter tuning (TPE Bayesian optimisation)

content-extractor-rl tune \
    --data-dir ./training_html \
    --trials 50 \
    --episodes-per-trial 500 \
    --algorithm dqn \
    --output-dir ./tuning_results

# Resume an interrupted tuning run
content-extractor-rl tune \
    --data-dir ./training_html \
    --trials 50 \
    --resume \
    --output-dir ./tuning_results

Evaluate extraction quality against ground truth

content-extractor-rl evaluate \
    --data-dir ./ground_truth_json \
    --model models/dqn_model.safetensors

Compare multiple algorithms

content-extractor-rl compare \
    --data-dir ./test_html \
    --algorithms dqn,ppo,sac

Training Custom Models

Preparing training data

Collect raw HTML pages from the websites you care about. Place them in a flat directory — the filename should contain the domain for site-profile tracking:

training_data/
├── reuters_com_article_001.html
├── reuters_com_article_002.html
├── bbc_co_uk_article_001.html
├── techcrunch_com_post_001.html
└── ...

Recommended minimum: 100 HTML files per domain, 500+ total.

Training from Rust code

use content_extractor_rl::{Config, TrainingSample, train_with_improvements, Result};
use std::path::Path;

fn main() -> Result<()> {
    let mut config = Config::default();
    // Increase batch size for better stability
    config.batch_size = 1024;
    config.learning_rate = 3e-4;   // f64; ~3e-4 is a safe default
    config.gamma = 0.95;

    // load_html_dir is not among the imports above; supply your own loader.
    // It should return Vec<TrainingSample>, reading the JSON `text` field for
    // ground truth (see "Training data format" below).
    let samples: Vec<TrainingSample> = load_html_dir("./training_data")?;

    let (agent, metrics) = train_with_improvements(&config, samples)?;
    println!("Training complete. Best quality: {:.3}", metrics.best_avg_quality);

    // Save model
    let model_path = Path::new("models/my_model.safetensors");
    agent.save(model_path)?;

    // Save with full metadata
    agent.save_with_metadata(
        model_path,
        config.num_episodes,
        std::collections::HashMap::from([
            ("learning_rate".to_string(), config.learning_rate),
            ("batch_size".to_string(), config.batch_size as f64),
        ])
    )?;
    Ok(())
}

Training for specific news/article websites

To customise the model for specific websites, the key levers are:

  1. Site profiles: the library automatically builds per-domain XPath profiles as it trains. Set ARTICLE_EXTRACTOR_SITE_PROFILES before training, then keep the resulting directory:
export ARTICLE_EXTRACTOR_SITE_PROFILES=./site_profiles
content-extractor-rl train --data-dir ./training_data --episodes 5000 --improved
# site_profiles/ now contains per-domain learned patterns

  2. Ground-truth reward: when a sample carries ground-truth text, the reward is the token F1 of the extraction against it (TextUtils::token_f1), so the agent is optimised directly toward the labelled article. Supply it via TrainingSample::with_ground_truth (the CLI reads it from the JSON text field automatically). The continuous action params (min_block_words, max_block_link_density) control block-level filtering and are tuned by the policy.

  3. Curriculum difficulty: for sites with complex layouts (heavy JavaScript-rendered content, infinite scroll), start with simpler pages. The CurriculumManager handles this automatically when you use train_with_improvements.

  4. Pre-training workflow for a set of target sites:

# Step 1: Tune hyperparameters on a representative sample
content-extractor-rl tune \
    --data-dir ./training_sample \
    --trials 30 \
    --episodes-per-trial 300 \
    --output-dir ./tuning

# Step 2: Train with the best hyperparameters
content-extractor-rl train \
    --data-dir ./full_training_data \
    --episodes 15000 \
    --improved \
    --hyperparams ./tuning/best_hyperparams_dqn.json \
    --models-dir ./models

# Step 3: Verify quality
content-extractor-rl evaluate \
    --data-dir ./validation_data \
    --model ./models/best_model.safetensors

Training data format for ground-truth evaluation

To use evaluate and measure accuracy against known-good extractions, provide JSON files alongside HTML:

{
  "url": "https://example.com/article",
  "title": "Article headline here",
  "text": "Full article body text goes here...",
  "author": "Author Name",
  "pubDate": "2024-01-15"
}

Pre-trained Models

Three models are included in the models/ directory of this repository, each trained for 10,000 episodes on a corpus of 15,000 HTML pages from diverse news and article domains.

Available ONNX Models

File             Algorithm           Episodes  Best Quality  File Size  Trained On  Notes
DuelingDQN.onnx  Dueling DQN         10,000    0.8255        1.29 MB    CPU         Production-ready, stable training
PPO.onnx         PPO (Actor-Critic)  10,000    0.8445        1.26 MB    GPU (CUDA)  Experimental; 36 h training run
SAC.onnx         SAC (Twin-Q)        10,000    0.8445        3.51 MB    CPU         Experimental; see algorithm notes

All three files are also available in SafeTensors format alongside best-hyperparameter JSON files for each algorithm.

Hyperparameters used for training

Hyperparameter  DuelingDQN            PPO                   SAC
learning_rate   0.002526              0.008220              0.005867
batch_size      2048                  512                   8192
gamma           0.856                 0.858                 0.988
epsilon_decay   0.9851                0.9859                0.9959
hidden_layers   [512, 512, 256, 128]  [512, 512, 256, 128]  [1024, 512, 256]
layer_norm      no                    yes                   yes

Hyperparameters were found via TPE Bayesian optimisation (content-extractor-rl tune). The full search results are in output/.

Using a pre-trained model

use content_extractor_rl::{AgentFactory, Config, extract_article, get_device, Result};
use std::path::Path;

fn main() -> Result<()> {
    let config = Config::default();
    let device = get_device(); // CPU, or CUDA when built with --features cuda

    // Algorithm (DQN/PPO/SAC) is auto-detected from the model metadata.
    let agent = AgentFactory::load(
        Path::new("models/DuelingDQN.safetensors"),
        config.state_dim,
        config.num_discrete_actions,
        config.num_continuous_params,
        &device,
    )?;

    let html = std::fs::read_to_string("page.html")?;
    let article = extract_article(&html, "https://example.com/post", &config, Some(agent.as_ref()))?;
    println!("{}", article.content);
    Ok(())
}

Use the DuelingDQN model (DuelingDQN.onnx, or DuelingDQN.safetensors as loaded above) for all production workloads. The PPO and SAC models are experimental and provided for research comparison only.

Algorithm notes

  • DuelingDQN — fully stable training run; no warnings. Best choice for production inference.
  • PPO — stable training run on CUDA GPU (36.2 hours). Quality plateaus around episode 7,500.
  • SAC — the automatic entropy temperature (log_alpha) was not receiving gradient updates in earlier code due to a disconnected computation graph (constant tensor vs. the Var leaf). This has been fixed in v0.1.3. The included SAC.onnx was trained prior to the fix and should be considered a baseline rather than a tuned model.

Downloading via the CLI

# Download the latest general-purpose DQN model
content-extractor-rl download-model --output models/

# List available models
content-extractor-rl download-model --list

API Reference

Core types

// Main configuration (selected fields)
pub struct Config {
    pub state_dim: usize,              // 300 — state vector dimension
    pub num_discrete_actions: usize,   // 16 — discrete action count
    pub num_continuous_params: usize,  // 6  — continuous parameter count
    pub num_candidate_nodes: usize,    // 10 — candidate DOM nodes scored
    pub learning_rate: f64,            // default: 3e-4
    pub batch_size: usize,             // default: 512
    pub gamma: f64,                    // default: 0.95
    pub epsilon_start: f64,            // default: 1.0
    pub epsilon_end: f64,              // default: 0.05
    pub epsilon_decay: f64,            // default: 0.995
    pub replay_buffer_size: usize,     // default: 100_000
    pub target_update_freq: usize,     // default: 500
    pub max_steps_per_episode: usize,  // default: 20
    pub num_episodes: usize,           // default: 10_000
    // ...
}

// Extraction result
pub struct ExtractedArticle {
    pub url: String,
    pub title: Option<String>,
    pub date: Option<String>,
    pub content: String,
    pub quality_score: f32,
    pub method: String,   // "rl" | "hybrid" | "baseline" (+ "+profile" in batch)
    pub xpath: Option<String>,
}

// One-call extraction (RL agent when Some, else hybrid/heuristic, baseline fallback)
pub fn extract_article(
    html: &str, url: &str, config: &Config, agent: Option<&dyn RLAgent>,
) -> Result<ExtractedArticle>;
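The epsilon_start, epsilon_end and epsilon_decay fields in Config describe an epsilon-greedy exploration schedule. The sketch below assumes the common form, multiplicative decay once per episode with a floor at epsilon_end; the README does not say whether the crate decays per episode or per step, so treat the schedule shape as an assumption. Both function names are hypothetical.

```rust
/// Epsilon after `episode` episodes, assuming multiplicative decay with a floor.
fn epsilon_at(episode: u32, start: f64, end: f64, decay: f64) -> f64 {
    (start * decay.powi(episode as i32)).max(end)
}

/// First episode at which the floor is reached (None if epsilon never shrinks to it).
fn episodes_to_floor(start: f64, end: f64, decay: f64) -> Option<u32> {
    if !(0.0 < decay && decay < 1.0) || end <= 0.0 || start <= end {
        return None;
    }
    // Solve start * decay^n <= end for the smallest integer n.
    Some(((end / start).ln() / decay.ln()).ceil() as u32)
}

fn main() {
    // Defaults listed in Config above.
    let (start, end, decay) = (1.0, 0.05, 0.995);
    for ep in [0, 100, 500, 1000] {
        println!("episode {ep:>4}: epsilon = {:.3}", epsilon_at(ep, start, end, decay));
    }
    println!("floor reached at episode {:?}", episodes_to_floor(start, end, decay));
}
```

Under this assumption the default values spend roughly the first 600 of the 10,000 default episodes exploring, after which the agent acts greedily 95% of the time.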

Training

// A training example; ground truth enables token-F1 reward.
pub struct TrainingSample { pub html: String, pub url: String, pub ground_truth_text: Option<String> }
impl TrainingSample {
    pub fn with_ground_truth(html: String, url: String, ground_truth_text: String) -> Self;
}
impl From<(String, String)> for TrainingSample { /* ground truth = None */ }

// Standard training loop
pub fn train_standard(
    config: &Config,
    samples: Vec<TrainingSample>,
) -> Result<(Box<dyn RLAgent>, TrainingMetrics)>;

// Training with curriculum learning + ground-truth reward
pub fn train_with_improvements(
    config: &Config,
    samples: Vec<TrainingSample>,
) -> Result<(Box<dyn RLAgent>, TrainingMetrics)>;

pub struct TrainingMetrics {
    pub episode_rewards: Vec<f32>,
    pub episode_qualities: Vec<f32>,   // per-episode token F1 (or quality proxy)
    pub episode_losses: Vec<f32>,
    pub best_avg_quality: f32,
}

Agent interface

pub trait RLAgent: Send + Sync {
    fn select_action(&self, state: &[f32], epsilon: f32) -> Result<(usize, Vec<f32>)>;
    fn train_step(&mut self, replay_buffer: &mut PrioritizedReplayBuffer, batch_size: usize) -> Result<f32>;
    fn update_target_network(&mut self);
    fn save(&self, path: &Path) -> Result<()>;
    fn save_with_metadata(&self, path: &Path, episodes: usize, hyperparams: HashMap<String, f64>) -> Result<()>;
    fn algorithm_type(&self) -> AlgorithmType;
    fn get_info(&self) -> AgentInfo;
}

// Create an agent from scratch
pub struct AgentFactory;
impl AgentFactory {
    pub fn create(algo: AlgorithmType, state_dim: usize, num_actions: usize,
                  num_params: usize, gamma: f32, lr: f64, device: &Device) -> Result<Box<dyn RLAgent>>;
    // Algorithm is auto-detected from the saved model's metadata.
    pub fn load(path: &Path, state_dim: usize, num_actions: usize,
                num_params: usize, device: &Device) -> Result<Box<dyn RLAgent>>;
}

Hybrid node classifier

use std::collections::HashSet;

// Supervised content-node classifier (MLP over NodeFeatures).
pub struct NodeClassifier { /* ... */ }
impl NodeClassifier {
    pub fn new(device: &Device, lr: f64) -> Result<Self>;
    pub fn train_batch(&mut self, features: &[NodeFeatures], labels: &[f32]) -> Result<f32>;
    pub fn score_batch(&self, features: &[NodeFeatures]) -> Result<Vec<f32>>;
    pub fn select_best(&self, features: &[NodeFeatures]) -> Result<Option<usize>>;
}

// Labels for free: 1.0 for the candidate whose text best matches the ground truth.
pub fn label_from_f1(contents: &[CandidateContent], gt: &str, stopwords: &HashSet<String>) -> Option<Vec<f32>>;

// End-to-end: classifier (or heuristic) picks the node, params drive extraction.
pub struct HybridExtractor { /* ... */ }
impl HybridExtractor {
    pub fn heuristic(stopwords: HashSet<String>) -> Self;               // no model
    pub fn with_classifier(clf: NodeClassifier, stopwords: HashSet<String>) -> Self;
    pub fn extract(&self, html: &str, num_candidates: usize, params: &ExtractionParams)
        -> Result<Option<HybridExtraction>>;
}

Baseline extractor

use std::collections::HashSet;

pub struct BaselineExtractor {
    // stopword-density based heuristic, no neural network required
}

impl BaselineExtractor {
    pub fn new(stopwords: HashSet<String>) -> Self;
    pub fn extract(&self, html: &str) -> Result<ExtractionResult>;
}
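The idea behind a stopword-density heuristic is that running prose is full of function words while menus and link lists are not. The sketch below illustrates that idea only; it is not BaselineExtractor's algorithm, the function names are hypothetical, and the tiny stopword list is made up for the example.

```rust
use std::collections::HashSet;

/// Fraction of tokens in `text` that are stopwords (0.0 for empty text).
fn stopword_density(text: &str, stopwords: &HashSet<String>) -> f32 {
    let mut total = 0usize;
    let mut hits = 0usize;
    for token in text.split(|c: char| !c.is_alphanumeric()).filter(|t| !t.is_empty()) {
        total += 1;
        if stopwords.contains(&token.to_lowercase()) {
            hits += 1;
        }
    }
    if total == 0 {
        0.0
    } else {
        hits as f32 / total as f32
    }
}

/// Index of the candidate text with the highest stopword density.
fn best_candidate(candidates: &[&str], stopwords: &HashSet<String>) -> Option<usize> {
    candidates
        .iter()
        .enumerate()
        .map(|(i, text)| (i, stopword_density(text, stopwords)))
        .max_by(|a, b| a.1.total_cmp(&b.1))
        .map(|(i, _)| i)
}

fn main() {
    // Toy stopword list; a real one would be much longer.
    let stopwords: HashSet<String> = ["the", "a", "of", "on", "and", "to", "in", "was"]
        .iter()
        .map(|s| s.to_string())
        .collect();
    let candidates = [
        "Home | World | Business | Sport | Contact",
        "The mayor said the bridge was closed to traffic on Monday.",
    ];
    for c in &candidates {
        println!("{:.2}  {}", stopword_density(c, &stopwords), c);
    }
    println!("best = {:?}", best_candidate(&candidates, &stopwords));
}
```

A production heuristic would normally combine density with text length, since a two-word fragment such as "to the" scores a perfect density without being an article.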

Hyperparameter optimisation

// TPE Bayesian optimisation
pub struct TPEOptimizer { /* ... */ }

impl TPEOptimizer {
    pub fn new(space: HyperparameterSpace) -> Self;
    pub fn optimize(
        &mut self,
        base_config: Config,
        samples: Vec<(String, String)>,
        n_trials: usize,
    ) -> Result<Hyperparameters>;
    pub fn save_state(&self, path: &Path) -> Result<()>;
    pub fn load_state(path: &Path) -> Result<Self>;
}

pub struct HyperparameterSpace {
    pub learning_rate: (f64, f64),       // (min, max)
    pub batch_size: Vec<usize>,
    pub gamma: (f64, f64),
    pub epsilon_decay: (f64, f64),
    pub priority_alpha: (f64, f64),
    pub priority_beta: (f64, f64),
    pub hidden_layer_sizes: Vec<Vec<usize>>,
    pub use_layer_norm: Vec<bool>,
    pub dropout: (f32, f32),
}
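The priority_alpha and priority_beta ranges are the two standard knobs of prioritised experience replay: alpha controls how strongly sampling favours high-priority transitions, and beta controls how much the resulting bias is corrected. The sketch below uses the textbook formulation and assumes the crate follows it; the function names are hypothetical and every priority is assumed to be strictly positive.

```rust
/// Sampling probabilities P(i) = p_i^alpha / sum_k p_k^alpha.
fn sampling_probabilities(priorities: &[f64], alpha: f64) -> Vec<f64> {
    let scaled: Vec<f64> = priorities.iter().map(|p| p.powf(alpha)).collect();
    let total: f64 = scaled.iter().sum();
    if total <= 0.0 {
        return vec![0.0; priorities.len()];
    }
    scaled.iter().map(|s| s / total).collect()
}

/// Importance-sampling weights w_i = (N * P(i))^(-beta), scaled so the largest is 1.
/// Assumes every probability is > 0.
fn importance_weights(probs: &[f64], beta: f64) -> Vec<f64> {
    let n = probs.len() as f64;
    let raw: Vec<f64> = probs.iter().map(|p| (n * p).powf(-beta)).collect();
    let max = raw.iter().cloned().fold(f64::MIN, f64::max);
    raw.iter().map(|w| w / max).collect()
}

fn main() {
    let priorities = [1.0, 2.0, 4.0];
    let probs = sampling_probabilities(&priorities, 0.6);
    let weights = importance_weights(&probs, 0.4);
    println!("probs   = {:?}", probs);
    println!("weights = {:?}", weights);
}
```

With alpha = 0 the sampling is uniform, and with beta = 0 every weight is 1, so the two extremes recover an ordinary replay buffer. Transitions that are sampled more often receive smaller weights, which keeps their gradient contribution in check.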

Evaluation

pub struct GroundTruthEvaluator { /* ... */ }

impl GroundTruthEvaluator {
    pub fn evaluate(&self, extracted: &ExtractedArticle, ground_truth: &GroundTruthData) -> Result<EvaluationMetrics>;
}

pub struct EvaluationMetrics {
    pub text_f1: f32,
    pub text_precision: f32,
    pub text_recall: f32,
    pub title_match: f32,
    pub combined_quality: f32,
}

Environment variables

Variable                         Default          Description
ARTICLE_EXTRACTOR_MODEL_PATH     (none)           Path to a saved model file
ARTICLE_EXTRACTOR_SITE_PROFILES  ./site_profiles  Directory for per-domain profiles
ARTICLE_EXTRACTOR_OUTPUT_DIR     ./output         Directory for extraction outputs
ARTICLE_EXTRACTOR_DATA_DIR       (none)           Training data directory
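If your own tooling needs to agree with the library on these locations, a small resolver that applies the same documented defaults can help. This is a caller-side sketch, not the crate's config code: resolve_dir is a hypothetical helper, and treating an empty value as unset is a design choice made here, not documented behaviour.

```rust
use std::path::PathBuf;

/// Resolve a directory setting: use the variable's value if it is set and
/// non-blank, otherwise fall back to the documented default.
fn resolve_dir<F>(lookup: F, name: &str, default: &str) -> PathBuf
where
    F: Fn(&str) -> Option<String>,
{
    match lookup(name) {
        Some(value) if !value.trim().is_empty() => PathBuf::from(value),
        _ => PathBuf::from(default),
    }
}

fn main() {
    // Read from the real process environment.
    let from_env = |name: &str| std::env::var(name).ok();
    let profiles = resolve_dir(from_env, "ARTICLE_EXTRACTOR_SITE_PROFILES", "./site_profiles");
    let output = resolve_dir(from_env, "ARTICLE_EXTRACTOR_OUTPUT_DIR", "./output");
    println!("site profiles: {}", profiles.display());
    println!("output dir:    {}", output.display());
}
```

Passing the lookup as a closure keeps the resolver testable without mutating the process environment.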

Feature Flags

Flag       Default  Description
cuda       off      Enable CUDA GPU acceleration via candle-core/cuda
mlflow-rs  off      Enable MLflow experiment tracking

Algorithm Status

Algorithm   Status           Notes
DuelingDQN  Production-ready Fully tested, checkpoint resume, prioritised replay, gradient clipping
PPO         Experimental     Actor-critic structure working; GAE not fully verified
SAC         Experimental     Twin-Q + automatic entropy tuning; now has gradient clipping (fixes the earlier NaN divergence under high learning rates)
TD3         Not implemented  Placeholder in AlgorithmType enum
Rainbow     Not implemented  Placeholder in AlgorithmType enum
Rainbow Not implemented Placeholder in AlgorithmType enum

Use AlgorithmType::DuelingDQN for all production workloads.


Performance Notes

  • Baseline extraction runs in < 5 ms per page on any hardware.
  • DQN inference (model loaded) runs in 10–30 ms per page on CPU; < 5 ms on GPU.
  • Training throughput on CPU: ~200–500 episodes/min depending on HTML complexity.
  • Training throughput on A100 GPU: ~2000–5000 episodes/min with --features cuda.
  • The replay buffer holds 100,000 experiences by default (adjust Config::replay_buffer_size for memory-constrained environments).
  • rayon-based parallel extraction is available for batch workloads via extract-batch.

Python Bindings

The content-extractor-rl-py crate compiles to a native Python module named content_extractor_rl_rs, built with Maturin.

Building / installing the wheel

cd crates/content-extractor-rl-py

# Option A — develop install into the current venv (fastest for iterating)
maturin develop --release

# Option B — build a distributable wheel, then pip install it
maturin build --release
pip install ../../target/wheels/content_extractor_rl_rs-*.whl

# With CUDA support:
maturin build --release --features cuda

Requires a Rust toolchain and Python 3.8+. On aarch64 (e.g. Raspberry Pi 5), the candle dependency needs FP16 SIMD — the repo's .cargo/config.toml sets target-cpu=native to enable it.

Usage

from content_extractor_rl_rs import RustArticleExtractor

# No model -> hybrid heuristic selection (no training required).
extractor = RustArticleExtractor()

# ...or load a trained RL model (DQN/PPO/SAC auto-detected):
extractor = RustArticleExtractor(model="models/DuelingDQN.safetensors")

html = open("article.html").read()
result = extractor.extract(html, "https://example.com/article")
print(result["method"])         # "rl" | "hybrid" | "baseline"
print(result["title"])
print(result["quality_score"])
print(result["content"])

# Batch extraction: list of (html, url) tuples -> {"articles": [...]}
batch = extractor.extract_batch([(html, "https://example.com/article")])
for art in batch["articles"]:
    print(art["url"], art["quality_score"])

# Check hardware
print("CUDA:", extractor.check_cuda_available())

Training from Python

from content_extractor_rl_rs import RustArticleExtractor

extractor = RustArticleExtractor()

# samples is a list of (html, url) pairs; my_dataset is your own iterable of (path, url).
samples = [(open(p).read(), url) for p, url in my_dataset]

metrics = extractor.train(samples, episodes=5000, improved=True)
print("Best quality:", metrics["best_avg_quality"])

Note: the Python train accepts (html, url) pairs and trains against the self-supervised quality proxy. For token-F1-against-ground-truth training, use the Rust API (TrainingSample::with_ground_truth) or the CLI train command, which reads the ground-truth text from the paired JSON files.


Contributing

Contributions are welcome. Areas where help is most needed:

  • Completing PPO and SAC agent training loops
  • Expanding the test suite (especially for environment.rs)
  • Ground-truth datasets for news domains
  • ONNX export improvements

Please open an issue before submitting a large PR.

git clone https://github.com/sandeepsandhu/content-extractor-rl
cd content-extractor-rl
cargo test --all

License

Licensed under either of:

  • MIT license
  • Apache License, Version 2.0

at your option.