§markdown-harvest
A Rust crate that extracts web content from URLs found in text messages, cleans it, and converts it to Markdown. It was originally created as an auxiliary component for Retrieval-Augmented Generation (RAG) solutions to process URLs submitted by users.
§Overview
This crate provides functionality to:
- Extract URLs from text input
- Fetch web content from those URLs
- Clean and convert HTML content to readable Markdown format
- Remove unwanted elements like navigation, advertisements, and scripts
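The first step in the list above, pulling URLs out of free-form text, can be sketched with the standard library alone. This is only an illustration of the idea, not the crate's implementation: `extract_urls` is a hypothetical helper, and it assumes URLs are separated from other words by whitespace.

```rust
/// Hypothetical helper (not part of markdown_harvest): collects HTTP/HTTPS
/// URLs from free-form text, assuming whitespace separates URLs from words.
fn extract_urls(text: &str) -> Vec<String> {
    text.split_whitespace()
        // Drop punctuation that commonly wraps or trails a URL in prose.
        .map(|token| {
            token.trim_matches(|c: char| {
                matches!(c, '.' | ',' | ';' | '!' | '?' | '(' | ')' | '"' | '\'' | '<' | '>')
            })
        })
        // Keep only tokens that look like HTTP or HTTPS URLs.
        .filter(|token| token.starts_with("http://") || token.starts_with("https://"))
        .map(str::to_string)
        .collect()
}
```

Calling `extract_urls("Check out this article: https://example.com/article.")` yields a single entry without the trailing period. A real extractor would likely need a stricter grammar, since this sketch also trims a legitimate trailing `?` or `)` from a URL.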
§Quick Start
use markdown_harvest::{MarkdownHarvester, HttpConfig};
let text = "Check out this article: https://example.com/article";
let config = HttpConfig::default();
let results = MarkdownHarvester::get_hyperlinks_content(text.to_string(), config);
for (url, markdown_content) in results {
    println!("URL: {}", url);
    println!("Content: {}", markdown_content);
}
§Features
- URL Detection: Automatically extracts HTTP/HTTPS URLs from text
- Content Extraction: Fetches and processes web content
- HTML Cleaning: Removes scripts, styles, navigation, and advertisements
- Markdown Conversion: Converts cleaned HTML to readable Markdown
- User Agent Rotation: Uses random user agents to avoid blocking
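To make the user agent rotation feature concrete, here is a minimal std-only sketch of picking one user agent string per request. Everything in it is an assumption for illustration: the `USER_AGENTS` list, `pick_user_agent`, and `clock_seed` are hypothetical names, and the crate's own `UserAgent` enum defines its own set of strings and selection logic.

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Illustrative user agent strings (assumed values, not taken from the crate).
const USER_AGENTS: [&str; 3] = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
];

/// Hypothetical helper: maps a seed onto one of the user agent strings.
fn pick_user_agent(seed: u64) -> &'static str {
    USER_AGENTS[(seed % USER_AGENTS.len() as u64) as usize]
}

/// Hypothetical helper: derives a cheap, non-cryptographic seed from the clock.
fn clock_seed() -> u64 {
    SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|elapsed| u64::from(elapsed.subsec_nanos()))
        .unwrap_or(0)
}
```

A caller would use `pick_user_agent(clock_seed())` before each request. Taking the seed as a parameter keeps the selection deterministic under test; a production implementation would more likely use a proper random number generator.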
§Main Components
- MarkdownHarvester: The main struct for processing URLs and extracting content
- UserAgent: Enum providing various browser user agent strings
- Pattern functions: Helper functions that define cleaning patterns for HTML processing
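As a rough illustration of what a cleaning pattern does, the sketch below removes whole elements such as `<script>` or `<nav>` from an HTML string. It is an assumption-laden stand-in: the crate is described as using regex patterns and CSS selectors, whereas `strip_element` and `strip_unwanted` are hypothetical std-only helpers that assume lowercase tag names and no nesting of the same tag.

```rust
/// Hypothetical helper (not the crate's API): removes every `<tag ...>...</tag>`
/// block from `html`. Assumes lowercase tag names and no same-tag nesting.
fn strip_element(html: &str, tag: &str) -> String {
    let open = format!("<{tag}");
    let close = format!("</{tag}>");
    let mut out = String::with_capacity(html.len());
    let mut rest = html;
    while let Some(start) = rest.find(&open) {
        let after_open = &rest[start + open.len()..];
        // Only accept the match if the tag name ends here, so that
        // searching for `nav` does not match `<navy>`.
        let is_tag = after_open.starts_with(|c: char| c == '>' || c == '/' || c.is_whitespace());
        if !is_tag {
            // Not the tag we want: keep the text and search past it.
            out.push_str(&rest[..start + open.len()]);
            rest = after_open;
            continue;
        }
        // Keep everything before the element, then skip to after its close tag.
        out.push_str(&rest[..start]);
        rest = match after_open.find(&close) {
            Some(end) => &after_open[end + close.len()..],
            None => "", // Unclosed element: drop the remainder.
        };
    }
    out.push_str(rest);
    out
}

/// Hypothetical helper: applies `strip_element` for a few unwanted tags in turn.
fn strip_unwanted(html: &str) -> String {
    ["script", "style", "nav"]
        .iter()
        .fold(html.to_string(), |acc, tag| strip_element(&acc, tag))
}
```

String searching like this is fragile on real-world HTML (comments, uppercase tags, nested elements), which is one reason a library would reach for regex patterns or an HTML parser instead.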
Structs§
- ContentProcessor - Component responsible for HTML cleaning and Markdown conversion.
- HttpClient - Component responsible for handling HTTP requests and URL processing.
- HttpConfig
- HttpConfigBuilder
- MarkdownHarvester - Main struct for extracting and converting web content from URLs to Markdown.
Enums§
- UserAgent - Represents different browser user agent strings for web scraping.
Statics§
Functions§
- additional_cleanup - Returns regex patterns for final cleanup of HTML elements before Markdown conversion.
- content_selectors - Returns CSS selectors for identifying main content areas in HTML.
- media_elements - Returns regex patterns for removing media elements from HTML content.
- text_selectors - Returns CSS selectors for identifying text content elements in HTML.
- unwanted_elements - Returns regex patterns for removing unwanted structural elements from HTML.
- unwanted_text_patterns - Returns regex patterns for removing unwanted text patterns from content.