§Scraping Websites in Rust!
Sitescraper is a library for scraping websites and extracting their content. It parses HTML into a DOM that you can filter and read data from.
See the examples below:
§Get InnerHTML:
let html = "<html><body><div>Hello World!</div></body></html>";
let dom = sitescraper::parse_html(html).unwrap();
let filtered_dom = dom.filter("body");
println!("{}", filtered_dom.get_inner_html());
//Output: <div>Hello World!</div>
§Get Text:
let html = "<html><body><div>Hello World!</div></body></html>";
let dom = sitescraper::parse_html(html).unwrap();
let filtered_dom = dom.filter("body");
println!("{}", filtered_dom.get_text());
//Output: Hello World!
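To make the difference between get_inner_html() and get_text() concrete, here is a minimal sketch of how text can be derived from inner HTML by dropping the markup. The `strip_tags` helper is hypothetical and written only for illustration; it is not sitescraper's implementation and ignores details such as comments, entities and `>` inside attribute values.

```rust
/// Hypothetical helper, not part of sitescraper: removes everything between
/// '<' and '>' and keeps the remaining characters.
fn strip_tags(html: &str) -> String {
    let mut text = String::new();
    let mut inside_tag = false;
    for c in html.chars() {
        match c {
            '<' => inside_tag = true,
            '>' => inside_tag = false,
            // Keep characters only while we are outside of a tag.
            _ if !inside_tag => text.push(c),
            _ => {}
        }
    }
    text
}

fn main() {
    // The inner HTML of <body> from the example above.
    let inner_html = "<div>Hello World!</div>";
    println!("{}", strip_tags(inner_html));
    // Output: Hello World!
}
```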
Build in release mode (cargo build --release) so that compiler optimizations such as loop unrolling are enabled; unoptimized builds may run noticeably slower.
§Get Text from single Tags:
use sitescraper;
let html = "<html><body><div>Hello World!</div></body></html>";
let dom = sitescraper::parse_html(html).unwrap();
let filtered_dom = dom.filter("div");
println!("{}", filtered_dom.tag[0].get_text());
//Output: Hello World!
This also works with get_inner_html().
§Filter by tag-name, attribute-name and attribute-value using a tuple:
use sitescraper;
let html = "<html><body><div id='hello'>Hello World!</div></body></html>";
let dom = sitescraper::parse_html(html).unwrap();
let filtered_dom = dom.filter(("div", "id", "hello"));
println!("{}", filtered_dom.tag[0].get_text());
//Output: Hello World!
A tuple of two string literals (tag name and attribute name) also works:
let filtered_dom = dom.filter(("div", "id"));
You can also filter only by attribute value by writing the following:
use sitescraper;
let html = "<html><body><div id='hello'>Hello World!</div></body></html>";
let dom = sitescraper::parse_html(html).unwrap();
let filtered_dom = dom.filter(("", "", "hello"));
println!("{}", filtered_dom.tag[0].get_text());
//Output: Hello World!
See the documentation of the filter method for more examples.
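The tuple filters above can be read as (tag name, attribute name, attribute value), with an empty string meaning "match anything". The following sketch models that reading on a toy type. `ToyTag` and `matches` are hypothetical names invented for this illustration and are not sitescraper's API or implementation. Treating the two-element form `("div", "id")` as equivalent to `("div", "id", "")` is also an assumption.

```rust
/// Hypothetical stand-in for a parsed tag; not sitescraper's `Tag` type.
struct ToyTag {
    name: &'static str,
    attributes: Vec<(&'static str, &'static str)>,
}

/// Returns true if `tag` satisfies the (tag name, attribute name, attribute value)
/// filter, where an empty string places no constraint on that component.
fn matches(tag: &ToyTag, filter: (&str, &str, &str)) -> bool {
    let (name, attr_name, attr_value) = filter;
    let name_ok = name.is_empty() || tag.name == name;
    // With no attribute constraint at all, the tag name alone decides.
    if attr_name.is_empty() && attr_value.is_empty() {
        return name_ok;
    }
    // Otherwise at least one attribute must satisfy both attribute components.
    let attr_ok = tag.attributes.iter().any(|(k, v)| {
        (attr_name.is_empty() || *k == attr_name) && (attr_value.is_empty() || *v == attr_value)
    });
    name_ok && attr_ok
}

fn main() {
    let div = ToyTag { name: "div", attributes: vec![("id", "hello")] };
    assert!(matches(&div, ("div", "id", "hello"))); // tag, attribute and value
    assert!(matches(&div, ("div", "id", "")));      // tag and attribute only
    assert!(matches(&div, ("", "", "hello")));      // attribute value only
    assert!(!matches(&div, ("span", "", "")));      // different tag name
    println!("all filters behaved as expected");
}
```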
§Get Website-Content:
use sitescraper;
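// `.await` requires an async context, e.g. the body of an async fn.
let html = sitescraper::http::get("http://example.com/").await.unwrap();
// parse_html takes a &str, so pass the fetched content by reference.
let dom = sitescraper::parse_html(&html).unwrap();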
let filtered_dom = dom.filter("div");
println!("{}", filtered_dom.get_inner_html());
Modules§
Structs§
- Dom: A Dom is returned when an HTML string is parsed with parse_html. It can be filtered with filter.
- Tag: A Dom consists of many Tags.
Functions§
- parse_html: This function parses a &str into a Dom. It returns a Result that can be unwrapped to a Dom if parsing was successful.