Expand description
§WebSites Parser
This website parser library allows asynchronous fetching and extracting data from web pages in multiple formats.
§Key features include:
- Reading an HTML document from a given URL with a randomized user agent (User::random()).
- Selecting elements via CSS selectors and retrieving their attributes and contents.
- Fetching the entire page as plain text.
- Fetching and parsing page content as JSON, with integration for handling it via serde_json.
This tool is well-suited for web scraping and data extraction tasks, supporting flexible parsing of HTML, plain text, and JSON, thereby enabling comprehensive data retrieval from various web sources.
§Examples:
use web_parser::{ prelude::*, User, Document };
#[tokio::main]
async fn main() -> Result<()> {
// _____ READ PAGE AS HTML DOCUMENT: _____
// read website page:
let mut doc = Document::read("https://example.com/", User::random()).await?;
// select 'lang' attribute:
let html = doc.select("html")?.expect("No elements found");
let lang = html.attr("lang").unwrap_or("en");
println!("Language: {lang}");
// select title:
let title = doc.select("h1")?.expect("No elements found");
println!("Title: '{}'", title.text());
// select descriptions:
let mut descrs = doc.select_all("p")?.expect("No elements found");
while let Some(descr) = descrs.next() {
println!("Description: '{}'", descr.text())
}
// _____ READ PAGE AS SIMPLE TEXT: _______
let text: String = Document::text("https://example.com/", User::random()).await?;
println!("Text: {text}");
// _____ READ PAGE AS JSON: ______________
let json: serde_json::Value = Document::json("https://example.com/", User::random()).await?.expect("Failed to parse JSON");
println!("Json: {json}");
Ok(())
}
§Licensing:
Distributed under the MIT license.
§Feedback:
You can contact me via GitHub or send a message to my Telegram @fuderis.
This library is constantly evolving, and I welcome your suggestions and feedback.
Re-exports§
pub use error::Result;
pub use error::Error;
pub use parser::User;
pub use parser::Document;
pub use parser::Node;
pub use parser::Nodes;