Crate web_parser

This web page parser library provides asynchronous fetching and extraction of data from web pages in multiple formats.

  • Asynchronous web search using the search engines [Google, Bing, Duck, Ecosia, Yahoo, Wiki] with domain blacklisting (feature search).
  • You can also create a custom search engine by using the SearchEngine trait (feature search).
  • Reading an HTML document from a URL with a randomized user-agent (User::random()).
  • Selecting elements by CSS selectors and retrieving their attributes and content.
  • Fetching the full page as plain text.
  • Fetching and parsing page content as JSON with serde_json support.

This tool is well-suited for web scraping and data extraction tasks, offering flexible parsing of HTML, plain text, and JSON to enable comprehensive data gathering from various web sources.
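
The web search functionality mentioned above is gated behind the search Cargo feature. A minimal sketch of the dependency entry (the version string here is only a placeholder; check crates.io for the current release):

[dependencies]
web_parser = { version = "*", features = ["search"] }  # "*" is a placeholder version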

§Examples:

§Web Search:

Requires the chromedriver tool to be installed!

use web_parser::prelude::*;
use macron::path;

#[tokio::main]
async fn main() -> Result<()> {
    // WEB SEARCH:

    let chrome_path = path!("bin/chromedriver/chromedriver.exe");
    let session_path = path!("%/ChromeDriver/WebSearch");
    
    // start search engine:
    let mut engine = SearchEngine::<Duck>::new(
        chrome_path,
        Some(session_path),
        false,
    ).await?;

    println!("Searching results..");

    // send search query:
    let results = engine.search(
        "Rust (programming language)",  // query
        &["support.google.com", "youtube.com"],  // black list
        1000  // sleep in millis
    ).await;
    
    // handle search results:
    match results {
        Ok(cites) => {
            println!("Result cites list: {:#?}", cites.get_urls());

            /*
            println!("Reading result pages..");
            let contents = cites.read(
                5,  // number of cites to read
                &[  // tag-name blacklist
                    "header", "footer", "style", "script", "noscript",
                    "iframe", "button", "img", "svg"
                ]
            ).await?;

            println!("Results: {contents:#?}");
            */
        }
        Err(e) => eprintln!("Search error: {e}")
    }

    // stop search engine:
    engine.stop().await?;

    Ok(())
}
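
The example above drives the Duck engine; the other engines listed at the top of this page are selected through the same generic parameter. Below is a minimal sketch of the swap, assuming each listed engine (for example Google) is exposed as a type just like Duck (an assumption; check the crate's API docs for the exact type names):

use web_parser::prelude::*;
use macron::path;

#[tokio::main]
async fn main() -> Result<()> {
    let chrome_path = path!("bin/chromedriver/chromedriver.exe");
    let session_path = path!("%/ChromeDriver/WebSearch");

    // identical to the example above, only the engine type parameter changes:
    let mut engine = SearchEngine::<Google>::new(
        chrome_path,
        Some(session_path),
        false,
    ).await?;

    // ...send search queries exactly as shown above...

    engine.stop().await?;
    Ok(())
}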

§Web Parsing:

use web_parser::prelude::*;

#[tokio::main]
async fn main() -> Result<()> {
    // READ PAGE AS HTML DOCUMENT:
    
    // read website page:
    let mut doc = Document::read("https://example.com/", User::random()).await?;

    // select title:
    let title = doc.select("h1")?.expect("No elements found");
    println!("Title: '{}'", title.text());

    // select descriptions:
    let mut descrs = doc.select_all("p")?.expect("No elements found");
    
    while let Some(descr) = descrs.next() {
        println!("Description: '{}'", descr.text())
    }

    // READ PAGE AS PLAIN TEXT:

    let text: String = Document::text("https://example.com/", User::random()).await?;
    println!("Text: {text}");

    // READ PAGE AS JSON:

    let json: serde_json::Value = Document::json("https://example.com/", User::random())
        .await?
        .expect("Failed to parse JSON");
    println!("Json: {json}");

    Ok(())
}
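
The expect calls in the example above panic when nothing matches. Since select returns an Option inside the Result (as those expect calls suggest), a missing element can also be handled gracefully. A minimal sketch using the same API:

use web_parser::prelude::*;

#[tokio::main]
async fn main() -> Result<()> {
    let mut doc = Document::read("https://example.com/", User::random()).await?;

    // handle a missing element without panicking:
    match doc.select("h1")? {
        Some(title) => println!("Title: '{}'", title.text()),
        None => eprintln!("No <h1> element found"),
    }

    Ok(())
}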

§Licensing:

Distributed under the MIT license.

§Feedback:

You can find me here; also see my channel. I welcome your suggestions and feedback!

Copyright (c) 2025 Bulat Sh. (fuderis)

§Re-exports

pub use error::Result;
pub use error::Error;
pub use document::User;
pub use document::Document;
pub use document::Node;
pub use document::Nodes;

§Modules

document
error
prelude