Crate web_parser

§WebSites Parser

This library asynchronously fetches web pages and extracts data from them in several formats.

§Key features include:

  • Reading an HTML document from a given URL with a randomized user agent (User::random()).
  • Selecting elements via CSS selectors and retrieving their attributes and contents.
  • Fetching the entire page as plain text.
  • Fetching page content and parsing it as JSON, with the result handled via serde_json.

The library is well suited to web scraping and data extraction tasks, since a single API covers HTML, plain text, and JSON sources.

§Examples:

use web_parser::{ prelude::*, User, Document };

#[tokio::main]
async fn main() -> Result<()> {
    // _____ READ PAGE AS HTML DOCUMENT: _____
    
    // read website page:
    let mut doc = Document::read("https://example.com/", User::random()).await?;

    // select 'lang' attribute:
    let html = doc.select("html")?.expect("No elements found");
    let lang = html.attr("lang").unwrap_or("en");
    println!("Language: {lang}");

    // select title:
    let title = doc.select("h1")?.expect("No elements found");
    println!("Title: '{}'", title.text());

    // select descriptions:
    let mut descrs = doc.select_all("p")?.expect("No elements found");
    while let Some(descr) = descrs.next() {
        println!("Description: '{}'", descr.text());
    }


    // _____ READ PAGE AS SIMPLE TEXT: _______

    let text: String = Document::text("https://example.com/", User::random()).await?;
    println!("Text: {text}");


    // _____ READ PAGE AS JSON: ______________

    let json: serde_json::Value = Document::json("https://example.com/", User::random())
        .await?
        .expect("Failed to parse JSON");
    println!("Json: {json}");

    Ok(())
}
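
The example above calls `?.expect(..)` on lookups, which panics when nothing matches. Below is a minimal, standard-library-only sketch of handling the same `Result<Option<_>>` shape without panicking. Note that `find_attr`, `lang_or_default`, and `LookupError` are hypothetical stand-ins invented for this illustration. They are not part of web_parser, and the crate's real types may differ.

```rust
// Hypothetical error type, used only for this sketch.
#[derive(Debug, PartialEq)]
struct LookupError(String);

// Hypothetical stand-in for a lookup that can fail (Err) or find nothing (Ok(None)).
// It searches a plain list of (name, value) pairs instead of a real HTML node.
fn find_attr<'a>(
    attrs: &[(&'a str, &'a str)],
    name: &str,
) -> Result<Option<&'a str>, LookupError> {
    if name.is_empty() {
        return Err(LookupError("empty attribute name".to_string()));
    }
    Ok(attrs
        .iter()
        .copied()
        .find(|(key, _)| *key == name)
        .map(|(_, value)| value))
}

// `?` propagates a real error to the caller, while `unwrap_or` supplies a
// fallback for the "nothing found" case instead of panicking.
fn lang_or_default(attrs: &[(&str, &str)]) -> Result<String, LookupError> {
    let lang = find_attr(attrs, "lang")?.unwrap_or("en");
    Ok(lang.to_string())
}

fn main() -> Result<(), LookupError> {
    let attrs = [("lang", "de"), ("dir", "ltr")];
    println!("Language: {}", lang_or_default(&attrs)?);

    // A missing value is an ordinary branch, not a crash.
    match find_attr(&attrs, "class")? {
        Some(class) => println!("Class: {class}"),
        None => println!("No class attribute found"),
    }
    Ok(())
}
```

The same pattern (`?` for errors, then `match`, `if let`, or `unwrap_or` for the `Option`) should carry over to any call that returns a `Result` wrapping an `Option`, which is the shape the `?.expect(..)` calls in the example suggest.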

§Licensing:

Distributed under the MIT license.

§Feedback:

You can contact me via GitHub or message me on Telegram at @fuderis.

This library is constantly evolving, and I welcome your suggestions and feedback.

Re-exports§

pub use error::Result;
pub use error::Error;
pub use parser::User;
pub use parser::Document;
pub use parser::Node;
pub use parser::Nodes;

Modules§

error
parser
prelude