Crate uninews

§Uninews - Universal News Scraper

A powerful Rust library for scraping news articles from various websites and converting them to Markdown format using AI.

§Features

  • Intelligent HTML Parsing: Extracts article content from complex HTML structures
  • Smart Content Cleaning: Automatically removes ads, scripts, navigation, and other noise
  • AI-Powered Formatting: Converts raw HTML to near-lossless Markdown using OpenAI’s GPT models
  • Metadata Extraction: Captures title, author, publication date, and featured images
  • Multilingual Support: Translates content to any language during processing
  • Async/Await: Built with Tokio for efficient async operations

§Quick Start

use uninews::universal_scrape;

#[tokio::main]
async fn main() {
    // Make sure OPEN_AI_SECRET environment variable is set
    let post = universal_scrape(
        "https://example.com/article",
        "english",
        None
    ).await;

    if post.error.is_empty() {
        println!("Title: {}", post.title);
        println!("Author: {:?}", post.author);
        println!("Published: {:?}", post.publication_date);
        println!("\n{}", post.content); // Already formatted in Markdown
    } else {
        eprintln!("Error: {}", post.error);
    }
}

§Requirements

  • Set the OPEN_AI_SECRET environment variable with your OpenAI API key
  • For best results, the target website should serve well-formed HTML with standard meta tags
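Since the AI formatting step depends on the key being present, it can help to fail fast before scraping. The sketch below is a hypothetical pre-flight helper, not part of uninews; `require_key` and its error message are illustrative only.

```rust
use std::env;

// Hypothetical pre-flight check (not part of uninews): surface a clear
// error up front if the OPEN_AI_SECRET key mentioned above is missing.
fn require_key(key: Option<String>) -> Result<String, String> {
    key.filter(|k| !k.is_empty())
        .ok_or_else(|| "OPEN_AI_SECRET is not set; AI formatting will fail".to_string())
}

fn main() {
    match require_key(env::var("OPEN_AI_SECRET").ok()) {
        Ok(_) => println!("API key present"),
        Err(e) => eprintln!("{}", e),
    }
}
```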

§Supported Metadata

The scraper automatically extracts:

  • Title: From <title> tag or og:title meta tag
  • Featured Image: From og:image meta property
  • Publication Date: From article:published_time meta property
  • Author: From author meta tag
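To illustrate the kind of meta-tag lookup described above, here is a minimal string-based sketch. It is not how uninews itself parses pages (a real HTML parser is needed for production use), and `extract_meta_content` is a hypothetical helper; it also assumes the `content` attribute appears after `property` in the tag.

```rust
// Hypothetical sketch of an og:* meta-tag lookup using plain string
// search; not the library's actual parsing code.
fn extract_meta_content(html: &str, property: &str) -> Option<String> {
    // Locate a meta tag carrying the requested property, e.g. property="og:image".
    let needle = format!("property=\"{}\"", property);
    let tag_start = html.find(&needle)?;
    // Assume the content attribute follows the property attribute.
    let rest = &html[tag_start..];
    let content_idx = rest.find("content=\"")? + "content=\"".len();
    let after = &rest[content_idx..];
    let end = after.find('"')?;
    Some(after[..end].to_string())
}

fn main() {
    let html = r#"<head><meta property="og:title" content="Example Article"/></head>"#;
    assert_eq!(
        extract_meta_content(html, "og:title").as_deref(),
        Some("Example Article")
    );
}
```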

§Content Extraction Strategy

The library uses a multi-step approach:

  1. Downloads HTML content from the provided URL
  2. Attempts to locate the main content in <article> tags, falling back to <body>
  3. Removes 17 types of unwanted elements (scripts, styles, ads, navigation, etc.)
  4. Cleans empty nodes and whitespace
  5. Converts remaining HTML to Markdown using AI while preserving article wording and structure
  6. Optionally translates to the requested language
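Step 2 above can be sketched as a simple priority order. The version below uses plain string search purely for illustration; uninews's real extraction works on a parsed HTML tree, and `pick_main_content` is a hypothetical helper.

```rust
// Hypothetical sketch of step 2: prefer <article>, fall back to <body>.
// String-based for brevity; real extraction requires an HTML parser.
fn pick_main_content(html: &str) -> Option<&str> {
    for tag in ["article", "body"] {
        let open = format!("<{}>", tag);
        let close = format!("</{}>", tag);
        if let (Some(start), Some(end)) = (html.find(&open), html.find(&close)) {
            let start = start + open.len();
            if start <= end {
                return Some(&html[start..end]);
            }
        }
    }
    None
}

fn main() {
    let html = "<body>nav<article>Real story</article>footer</body>";
    assert_eq!(pick_main_content(html), Some("Real story"));
}
```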

§Error Handling

Errors are non-fatal and returned in the Post::error field. Always check this field:

let post = universal_scrape("https://invalid-url-example", "english", None).await;

if !post.error.is_empty() {
    match post.error.as_str() {
        e if e.contains("Failed to fetch") => println!("Network error"),
        e if e.contains("Could not extract meaningful content") => println!("Page structure not supported"),
        e if e.contains("LLM Error") => println!("AI processing error"),
        e => println!("Unknown error: {}", e),
    }
}

Structs§

Post
Represents a scraped news post with all extracted metadata.

Functions§

convert_content_to_markdown
Converts raw HTML content to Markdown using OpenAI’s GPT models.
universal_scrape
The main API function: scrapes a URL and returns structured article data.