libreadability
Extract the main article content from web pages.
A Rust port of readability by readeck, which is a Go port of Mozilla's Readability.js.
Usage
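Add to your `Cargo.toml`:

    [dependencies]
    libreadability = "0.1"

Then parse a document:

    use libreadability::Parser;

    // Placeholder path: point this at your own HTML file.
    let html = include_str!("article.html");
    let mut parser = Parser::new();
    // The argument list of `parse` is an assumption; check the crate docs for the exact signature.
    let article = parser.parse(html).unwrap();

    // `{:?}` works whether a field is a `String` or an `Option`.
    println!("{:?}", article.title);
    println!("{:?}", article.byline);
    println!("{:?}", article.excerpt);
    println!("{:?}", article.length);
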
What it returns
The `Article` struct contains:

| Field | Description |
|---|---|
| `title` | Article title |
| `byline` | Author attribution |
| `excerpt` | Short description or first paragraph |
| `content` | Cleaned article HTML |
| `text_content` | Plain text (via InnerText algorithm) |
| `length` | Character count of text content |
| `site_name` | Publisher name |
| `image` | Lead image URL |
| `language` | Detected language |
| `published_time` | Publication timestamp |
| `modified_time` | Last modified timestamp |
| `dir` | Text direction (`ltr` or `rtl`) |

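As a rough mental model, the table above could map onto a struct like the one sketched here. `ArticleSketch` is a hypothetical stand-in, not the crate's actual `Article`: the field names come from the table, but every field type (in particular which fields are `Option`) is an assumption, as is counting `length` in Unicode scalar values.

```rust
/// Hypothetical stand-in for illustration; all field types are assumptions.
#[derive(Debug, Default)]
pub struct ArticleSketch {
    pub title: String,
    pub byline: Option<String>,
    pub excerpt: Option<String>,
    pub content: String,      // cleaned article HTML
    pub text_content: String, // plain text
    pub length: usize,        // character count of `text_content`
    pub site_name: Option<String>,
    pub image: Option<String>,
    pub language: Option<String>,
    pub published_time: Option<String>,
    pub modified_time: Option<String>,
    pub dir: Option<String>, // "ltr" or "rtl"
}

/// Builds a sketch article from plain text, deriving `length` from it.
pub fn from_text(title: &str, text: &str) -> ArticleSketch {
    ArticleSketch {
        title: title.to_string(),
        text_content: text.to_string(),
        // Counts characters, not bytes, so non-ASCII text is not over-counted.
        length: text.chars().count(),
        ..Default::default()
    }
}

fn main() {
    let article = from_text("Example", "Caf\u{e9}");
    // Four characters, but five bytes in UTF-8.
    assert_eq!(article.length, 4);
    assert_eq!(article.text_content.len(), 5);
    println!("{:?}", article);
}
```

A character count and a byte length differ for non-ASCII text, which matters if you use `length` for truncation or reading-time estimates.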
Configuration
Configure the parser either through its public fields or through chainable builder methods:

    use libreadability::Parser;

    // Builder style (argument values are illustrative)
    let mut parser = Parser::new()
        .with_char_threshold(200)
        .with_keep_classes(true)
        .with_disable_jsonld(true);

    // Or set fields directly
    let mut parser = Parser::new();
    parser.char_threshold = 200;
    parser.keep_classes = true;

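The two styles can coexist because a builder method of this kind typically just sets a public field and returns `self`. The sketch below shows that pattern with a stand-in `ParserConfig` type, which is hypothetical and not the crate's real `Parser`; the field types and the `disable_jsonld` field name are assumptions inferred from the method names above.

```rust
/// Stand-in for illustration only; not the crate's real `Parser`.
#[derive(Debug, Default, PartialEq)]
pub struct ParserConfig {
    pub char_threshold: usize, // assumed type
    pub keep_classes: bool,    // assumed type
    pub disable_jsonld: bool,  // assumed field name and type
}

impl ParserConfig {
    pub fn new() -> Self {
        Self::default()
    }

    /// Each `with_*` method takes `self` by value, sets one field, and returns it,
    /// which is what makes the calls chainable.
    pub fn with_char_threshold(mut self, n: usize) -> Self {
        self.char_threshold = n;
        self
    }

    pub fn with_keep_classes(mut self, keep: bool) -> Self {
        self.keep_classes = keep;
        self
    }

    pub fn with_disable_jsonld(mut self, disable: bool) -> Self {
        self.disable_jsonld = disable;
        self
    }
}

fn main() {
    // Builder style
    let built = ParserConfig::new()
        .with_char_threshold(200)
        .with_keep_classes(true)
        .with_disable_jsonld(true);

    // Direct field access produces the same configuration.
    let mut direct = ParserConfig::new();
    direct.char_threshold = 200;
    direct.keep_classes = true;
    direct.disable_jsonld = true;

    assert_eq!(built, direct);
    println!("{:?}", built);
}
```

Builder style reads well for one-off construction; direct field access is convenient when a setting is changed conditionally after the parser already exists.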
Optional features
| Feature | Description |
|---|---|
| `tracing` | Enable debug/trace logging at key algorithm points (zero-cost when disabled) |

    libreadability = { version = "0.1", features = ["tracing"] }

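"Zero-cost when disabled" usually means the logging calls are removed at compile time rather than skipped at runtime. A minimal sketch of that pattern follows; the `trace_point!` macro and the `count_long_paragraphs` function are hypothetical names for illustration, and the crate's internal mechanism may differ.

```rust
// With the feature enabled, forward to the `tracing` crate's macro.
#[cfg(feature = "tracing")]
macro_rules! trace_point {
    ($($arg:tt)*) => { tracing::trace!($($arg)*) };
}

// With the feature disabled, expand to nothing, so no logging code is compiled in.
#[cfg(not(feature = "tracing"))]
macro_rules! trace_point {
    ($($arg:tt)*) => {};
}

/// Hypothetical algorithm step: counts paragraphs that meet a character threshold.
fn count_long_paragraphs(paragraphs: &[&str], char_threshold: usize) -> usize {
    let mut kept = 0;
    for p in paragraphs {
        let len = p.chars().count();
        if len >= char_threshold {
            kept += 1;
        } else {
            trace_point!("skipping short paragraph ({} chars)", len);
        }
    }
    kept
}

fn main() {
    let paragraphs = ["short", "a considerably longer paragraph of text"];
    let kept = count_long_paragraphs(&paragraphs, 20);
    assert_eq!(kept, 1);
    println!("kept {} paragraph(s)", kept);
}
```

Compiled without the feature, the `else` branch above is empty, so the function behaves identically with or without logging.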
License
MIT