Crate markdown_harvest

§markdown-harvest

A Rust crate that extracts web content from URLs found in text messages, cleans it, and converts it to Markdown. It was originally created as an auxiliary component for Retrieval-Augmented Generation (RAG) solutions, to process URLs submitted by users.

§Overview

This crate provides functionality to:

  • Extract URLs from text input
  • Fetch web content from those URLs
  • Clean and convert HTML content to readable Markdown format
  • Remove unwanted elements like navigation, advertisements, and scripts
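
To make the first step above concrete, here is a minimal std-only sketch of pulling HTTP/HTTPS URLs out of free text. The helper name `extract_urls` and the token-trimming rules are assumptions made for illustration; the crate itself exposes a `URL_REGEX` static, and its actual matching rules may differ.

```rust
/// Hypothetical helper (not part of markdown_harvest): collects HTTP/HTTPS
/// URLs from free text by scanning whitespace-separated tokens.
fn extract_urls(text: &str) -> Vec<String> {
    text.split_whitespace()
        // Strip punctuation that commonly wraps a URL in prose.
        .map(|token| {
            token
                .trim_start_matches(|c: char| matches!(c, '(' | '<' | '"' | '\''))
                .trim_end_matches(|c: char| {
                    matches!(c, '.' | ',' | ';' | ':' | '!' | '?' | ')' | '>' | '"' | '\'')
                })
        })
        // Keep only tokens that look like web links.
        .filter(|token| token.starts_with("http://") || token.starts_with("https://"))
        .map(str::to_string)
        .collect()
}
```

A token scan like this is easy to follow but cruder than a regular expression: it misses URLs glued to other text without whitespace and would strip a trailing `)` that genuinely belongs to a URL.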

§Quick Start

use markdown_harvest::{MarkdownHarvester, HttpConfig};

// Any text containing one or more HTTP/HTTPS URLs.
let text = "Check out this article: https://example.com/article";
let config = HttpConfig::default();

// Yields (URL, Markdown) pairs for the links found in the text.
let results = MarkdownHarvester::get_hyperlinks_content(text.to_string(), config);

for (url, markdown_content) in results {
    println!("URL: {}", url);
    println!("Content: {}", markdown_content);
}

§Features

  • URL Detection: Automatically extracts HTTP/HTTPS URLs from text
  • Content Extraction: Fetches and processes web content
  • HTML Cleaning: Removes scripts, styles, navigation, and advertisements
  • Markdown Conversion: Converts cleaned HTML to readable Markdown
  • User Agent Rotation: Picks a random user agent for requests to reduce the chance of being blocked
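
As a rough illustration of the HTML cleaning feature, the sketch below removes whole `<script>`, `<style>`, and `<nav>` elements using only the standard library. The names `strip_element` and `clean_html` and the tag list are hypothetical; per the function descriptions on this page, the crate works from regex patterns and CSS selectors, so treat this as a simplified stand-in rather than its implementation.

```rust
/// Hypothetical helper: removes every `<tag ...>...</tag>` element from `html`.
/// Matching is case-sensitive and does not handle nested elements of the same tag.
fn strip_element(html: &str, tag: &str) -> String {
    let open = format!("<{}", tag);
    let close = format!("</{}>", tag);
    let mut out = String::with_capacity(html.len());
    let mut rest = html;
    while let Some(start) = rest.find(&open) {
        let after = &rest[start + open.len()..];
        // Require a tag boundary so `<nav` does not match `<navbar>`.
        let is_tag = after.starts_with(|c: char| c == '>' || c == '/' || c.is_whitespace());
        if !is_tag {
            out.push_str(&rest[..start + open.len()]);
            rest = after;
            continue;
        }
        out.push_str(&rest[..start]);
        rest = match after.find(&close) {
            // Skip past the closing tag and keep scanning.
            Some(end) => &after[end + close.len()..],
            // Unclosed element: drop the remainder of the input.
            None => "",
        };
    }
    out.push_str(rest);
    out
}

/// Applies `strip_element` for an assumed list of unwanted tags.
fn clean_html(html: &str) -> String {
    ["script", "style", "nav"]
        .iter()
        .fold(html.to_string(), |acc, tag| strip_element(&acc, tag))
}
```

Substring scanning keeps the example dependency-free; real-world HTML (mixed-case tags, nesting, comments) is better handled by a proper parser.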

§Main Components

  • MarkdownHarvester: The main struct for processing URLs and extracting content
  • UserAgent: Enum providing various browser user agent strings
  • Pattern functions: Helper functions that define cleaning patterns for HTML processing

Structs§

ContentProcessor
Component responsible for HTML cleaning and Markdown conversion.
HttpClient
Component responsible for handling HTTP requests and URL processing.
HttpConfig
HttpConfigBuilder
MarkdownHarvester
Main struct for extracting and converting web content from URLs to Markdown.

Enums§

UserAgent
Represents different browser user agent strings for web scraping.
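
For readers curious how an enum of user agents can support rotation, here is a small std-only sketch. The type `SketchUserAgent`, its variants, its strings, and the clock-based `pick` method are all assumptions for illustration and do not reflect the actual definition of `UserAgent` in this crate.

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Hypothetical stand-in for a user-agent enum; variants and strings are
/// illustrative only, not the crate's actual `UserAgent` definition.
#[derive(Debug, Clone, Copy, PartialEq)]
enum SketchUserAgent {
    Chrome,
    Firefox,
    Safari,
}

impl SketchUserAgent {
    const ALL: [SketchUserAgent; 3] = [Self::Chrome, Self::Firefox, Self::Safari];

    /// Returns the header value to send for this variant.
    fn as_str(self) -> &'static str {
        match self {
            Self::Chrome => "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
            Self::Firefox => "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
            Self::Safari => "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
        }
    }

    /// Picks a variant from the clock's sub-second nanoseconds. This is a
    /// cheap, non-cryptographic choice used here to avoid extra dependencies.
    fn pick() -> Self {
        let nanos = SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .map(|d| d.subsec_nanos())
            .unwrap_or(0);
        Self::ALL[nanos as usize % Self::ALL.len()]
    }
}
```

In production code a proper random number generator would be the more usual choice; the clock trick only keeps this sketch self-contained.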

Statics§

URL_REGEX

Functions§

additional_cleanup
Returns regex patterns for final cleanup of HTML elements before Markdown conversion.
content_selectors
Returns CSS selectors for identifying main content areas in HTML.
media_elements
Returns regex patterns for removing media elements from HTML content.
text_selectors
Returns CSS selectors for identifying text content elements in HTML.
unwanted_elements
Returns regex patterns for removing unwanted structural elements from HTML.
unwanted_text_patterns
Returns regex patterns for removing unwanted text patterns from content.
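
The pattern functions above each return a list that the cleaning stage can iterate over. The sketch below shows that general shape with plain substrings instead of regexes, so it runs on the standard library alone. `sketch_unwanted_text`, `drop_unwanted_lines`, and the sample phrases are hypothetical and are not taken from the crate.

```rust
/// Hypothetical pattern table in the spirit of `unwanted_text_patterns`;
/// plain lowercase substrings are used instead of regexes to stay std-only.
fn sketch_unwanted_text() -> Vec<&'static str> {
    vec!["cookie", "subscribe to our newsletter", "all rights reserved"]
}

/// Drops every line that contains one of the unwanted phrases
/// (case-insensitive) and rejoins the remaining lines.
fn drop_unwanted_lines(content: &str) -> String {
    let patterns = sketch_unwanted_text();
    content
        .lines()
        .filter(|line| {
            let lower = line.to_lowercase();
            !patterns.iter().any(|p| lower.contains(*p))
        })
        .collect::<Vec<_>>()
        .join("\n")
}
```

Keeping the patterns in their own function, separate from the code that applies them, makes the list easy to extend or test in isolation, which appears to be the motivation for exposing these helpers as standalone functions.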