Crate sitescraper


§Scraping Websites in Rust!

Sitescraper is a library for scraping websites and extracting their content. It parses HTML into a DOM that you can filter to extract data.

See examples below:

§Get InnerHTML:

let html = "<html><body><div>Hello World!</div></body></html>";
    
let dom = sitescraper::parse_html(html).unwrap();
     
let filtered_dom = dom.filter("body");
      
println!("{}", filtered_dom.get_inner_html());
//Output: <div>Hello World!</div>

§Get Text:

let html = "<html><body><div>Hello World!</div></body></html>";

let dom = sitescraper::parse_html(html).unwrap();
 
let filtered_dom = dom.filter("body");
 
println!("{}", filtered_dom.get_text());
//Output: Hello World!

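Build in release mode (for example with cargo build --release) so that compiler optimizations such as loop unrolling are applied; unoptimized debug builds can run noticeably slower.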

§Get Text from single Tags:

use sitescraper;

let html = "<html><body><div>Hello World!</div></body></html>";
 
let dom = sitescraper::parse_html(html).unwrap();
 
let filtered_dom = dom.filter("div");
 
println!("{}", filtered_dom.tag[0].get_text());
//Output: Hello World!

The same approach works with get_inner_html().
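Indexing with tag[0] panics in Rust when the collection is empty, which would happen if the filter matched nothing. The sketch below shows a non-panicking alternative. It uses a hypothetical stand-in struct (FilteredDom with a Vec field named tag) rather than the crate's own types, because this page does not confirm how Dom stores its tags; only the Vec behaviour is certain.

```rust
/// Hypothetical stand-in for a filtered result. The real `Dom` is assumed,
/// not confirmed, to expose its matches as a `Vec`-like `tag` field.
struct FilteredDom {
    tag: Vec<String>,
}

/// Returns the text of the first match, or `None` when nothing matched.
fn first_text(dom: &FilteredDom) -> Option<&str> {
    // `first()` yields an Option instead of panicking like `dom.tag[0]`.
    dom.tag.first().map(|t| t.as_str())
}

fn main() {
    let hit = FilteredDom { tag: vec!["Hello World!".to_string()] };
    let miss = FilteredDom { tag: Vec::new() };

    // A filter that matched something: print the first tag's text.
    match first_text(&hit) {
        Some(text) => println!("{}", text),
        None => println!("no matching tag"),
    }

    // A filter that matched nothing: handled without a panic.
    if first_text(&miss).is_none() {
        println!("no matching tag");
    }
}
```

If the crate's tag field is indeed a Vec, the same pattern would apply as filtered_dom.tag.first(), but that is an assumption to check against the Dom documentation.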

§Filter by tag-name, attribute-name and attribute-value using a tuple:

use sitescraper;
 
let html = "<html><body><div id='hello'>Hello World!</div></body></html>";
 
let dom = sitescraper::parse_html(html).unwrap();
 
let filtered_dom = dom.filter(("div", "id", "hello"));
 
println!("{}", filtered_dom.tag[0].get_text());
//Output: Hello World!

A tuple of two string literals (tag name and attribute name) also works:

let filtered_dom = dom.filter(("div", "id"));

You can also filter only by attribute value by writing the following:

use sitescraper;
 
let html = "<html><body><div id='hello'>Hello World!</div></body></html>";
 
let dom = sitescraper::parse_html(html).unwrap();
 
let filtered_dom = dom.filter(("", "", "hello"));
 
println!("{}", filtered_dom.tag[0].get_text());
//Output: Hello World!
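The tuple examples above suggest that an empty string in any position acts as a wildcard. The following self-contained sketch illustrates that matching rule on a hypothetical Element type. It is not the crate's implementation: the type, the filter_matches function, and the exact semantics are assumptions made for illustration.

```rust
/// Hypothetical stand-in for one parsed element; not the crate's `Tag` type.
struct Element {
    name: &'static str,
    attrs: Vec<(&'static str, &'static str)>,
    text: &'static str,
}

/// Returns true when `el` satisfies a (tag-name, attribute-name,
/// attribute-value) filter. An empty string means "match anything".
fn filter_matches(el: &Element, filter: (&str, &str, &str)) -> bool {
    let (tag, attr_name, attr_value) = filter;
    let tag_ok = tag.is_empty() || el.name == tag;
    // With no attribute constraint, the tag name alone decides.
    if attr_name.is_empty() && attr_value.is_empty() {
        return tag_ok;
    }
    // Otherwise one attribute must satisfy every non-empty part.
    let attr_ok = el.attrs.iter().any(|(n, v)| {
        (attr_name.is_empty() || *n == attr_name)
            && (attr_value.is_empty() || *v == attr_value)
    });
    tag_ok && attr_ok
}

fn main() {
    let elements = vec![
        Element { name: "div", attrs: vec![("id", "hello")], text: "Hello World!" },
        Element { name: "p", attrs: vec![("class", "note")], text: "Other" },
    ];
    // Full filter, attribute-name only, and attribute-value only.
    for filter in [("div", "id", "hello"), ("div", "id", ""), ("", "", "hello")] {
        let hits: Vec<&str> = elements
            .iter()
            .filter(|e| filter_matches(e, filter))
            .map(|e| e.text)
            .collect();
        println!("{:?} -> {:?}", filter, hits);
    }
}
```

In this sketch all three filters select only the div element, mirroring the three filter calls shown above.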

Check out more examples of how to use the filter method.

§Get Website-Content:

use sitescraper;
 
// .await requires an async context, such as an async fn.
let html = sitescraper::http::get("http://example.com/").await.unwrap();
 
let dom = sitescraper::parse_html(html).unwrap();
 
let filtered_dom = dom.filter("div");
 
println!("{}", filtered_dom.get_inner_html());
 

Modules§

http

Structs§

Dom
A Dom is returned when an HTML string is parsed with parse_html; it can be filtered with filter.
Tag
A Dom is made up of many Tags.

Functions§

parse_html
This function parses a &str into a Dom. It returns a Result that can be unwrapped to a Dom if parsing succeeded.