§Uninews - Universal News Scraper
A powerful Rust library for scraping news articles from various websites and converting them to Markdown format using AI.
§Features
- Intelligent HTML Parsing: Extracts article content from complex HTML structures
- Smart Content Cleaning: Automatically removes ads, scripts, navigation, and other noise
- AI-Powered Formatting: Converts raw HTML to near-lossless Markdown using OpenAI’s GPT models
- Metadata Extraction: Captures title, author, publication date, and featured images
- Multilingual Support: Translates content to any language during processing
- Async/Await: Built with Tokio for efficient async operations
§Quick Start
```rust
use uninews::universal_scrape;

#[tokio::main]
async fn main() {
    // Make sure the OPEN_AI_SECRET environment variable is set.
    let post = universal_scrape(
        "https://example.com/article",
        "english",
        None,
    ).await;

    if post.error.is_empty() {
        println!("Title: {}", post.title);
        println!("Author: {:?}", post.author);
        println!("Published: {:?}", post.publication_date);
        println!("\n{}", post.content); // Already formatted as Markdown
    } else {
        eprintln!("Error: {}", post.error);
    }
}
```
§Requirements
- Set the `OPEN_AI_SECRET` environment variable with your OpenAI API key
- The website must provide proper HTML structure and meta tags for best results
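Before calling `universal_scrape`, it can help to fail fast when the key is missing. A minimal sketch (`check_api_key` is a hypothetical helper, not part of the uninews API):

```rust
/// Hypothetical pre-flight helper (not part of the uninews API):
/// returns the key only if it is present and non-empty.
fn check_api_key(value: Option<String>) -> Result<String, &'static str> {
    value
        .filter(|v| !v.is_empty())
        .ok_or("OPEN_AI_SECRET is not set")
}
```

Typical usage would be `check_api_key(std::env::var("OPEN_AI_SECRET").ok())` at startup, so a missing key surfaces immediately instead of as a scrape-time error.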
§Supported Metadata
The scraper automatically extracts:
- Title: from the `<title>` tag or the `og:title` meta tag
- Featured Image: from the `og:image` meta property
- Publication Date: from the `article:published_time` meta property
- Author: from the `author` meta tag
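To illustrate where these values come from, here is a naive string-based sketch of reading an Open Graph meta property. This is an illustration only (the crate parses the HTML properly rather than scanning strings), and `meta_content` is a hypothetical helper:

```rust
/// Naive sketch: find `<meta property="…" content="…">` and return the
/// content value. Assumes double-quoted attributes and `content` after
/// `property`; a real implementation should use an HTML parser.
fn meta_content<'a>(html: &'a str, property: &str) -> Option<&'a str> {
    let needle = format!("property=\"{}\"", property);
    let pos = html.find(&needle)?;
    // Walk back to the start of the enclosing <meta …> tag.
    let tag_start = html[..pos].rfind("<meta")?;
    let tag_end = html[tag_start..].find('>')? + tag_start;
    let tag = &html[tag_start..tag_end];
    // Pull out the content="…" value.
    let c = tag.find("content=\"")? + tag_start + "content=\"".len();
    let end = html[c..].find('"')? + c;
    Some(&html[c..end])
}
```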
§Content Extraction Strategy
The library uses a multi-step approach:
- Downloads HTML content from the provided URL
- Attempts to locate the main content in `<article>` tags (priority), falling back to `<body>`
- Removes 17 types of unwanted elements (scripts, styles, ads, navigation, etc.)
- Cleans empty nodes and whitespace
- Converts remaining HTML to Markdown using AI while preserving article wording and structure
- Optionally translates to the requested language
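The `<article>`-first priority in step 2 can be sketched with plain string searches. This is a simplification (the library walks the parsed DOM rather than scanning strings), and both helpers below are hypothetical:

```rust
/// Prefer the <article> element; fall back to <body>. A naive sketch
/// of the priority order only — not how the crate actually parses.
fn locate_main_content(html: &str) -> Option<&str> {
    extract_between(html, "<article", "</article>")
        .or_else(|| extract_between(html, "<body", "</body>"))
}

/// Return the text between an opening tag and its closing tag,
/// assuming at most one such element and no nesting.
fn extract_between<'a>(html: &'a str, open: &str, close: &str) -> Option<&'a str> {
    let start = html.find(open)?;
    // Skip past the end of the opening tag, including any attributes.
    let tag_end = html[start..].find('>')? + start + 1;
    let end = html[tag_end..].find(close)? + tag_end;
    Some(&html[tag_end..end])
}
```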
§Error Handling
Errors are non-fatal and returned in the `Post::error` field. Always check this field:
```rust
let post = universal_scrape("https://invalid-url-example", "english", None).await;

if !post.error.is_empty() {
    match post.error.as_str() {
        e if e.contains("Failed to fetch") => println!("Network error"),
        e if e.contains("Could not extract meaningful content") => println!("Page structure not supported"),
        e if e.contains("LLM Error") => println!("AI processing error"),
        e => println!("Unknown error: {}", e),
    }
}
```
§Structs
- `Post`: Represents a scraped news post with all extracted metadata.
§Functions
- `convert_content_to_markdown`: Converts raw HTML content to Markdown using OpenAI's GPT models.
- `universal_scrape`: The main API function; scrapes a URL and returns structured article data.