//! Tool for scraping structured data from webpages automatically.
//!
//! This project is inspired by the Python package [mlscraper](https://github.com/lorey/mlscraper).
//! See README.md for a comparison with the Python version and example code.
//!
//! Quick example:
//!
//! ```
//! # use mlscraper_rust::search::AttributeBuilder;
//! let html = reqwest::blocking::get("http://quotes.toscrape.com/author/Albert-Einstein/")
//!     .expect("request") // Scrappy error handling for demonstration purposes
//!     .text()
//!     .expect("text");
//!
//! let result = mlscraper_rust::train(
//!     vec![html.as_str()],
//!     vec![
//!         AttributeBuilder::new("name")
//!             .values(&[Some("Albert Einstein")])
//!             .build(),
//!
//!         AttributeBuilder::new("born")
//!             .values(&[Some("March 14, 1879")])
//!             .build(),
//!     ],
//!     Default::default(),
//!     1
//! ).expect("training");
//!
//! // Prints `{"born": .author-born-date, "name": h3}`
//! println!("{:?}", result.selectors());
//! ```
extern crate tl;
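use crate::search::*; // NOTE: module path assumed from the doc example above
use anyhow::Result; // NOTE: source crate assumed; the original path is missing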
use rand::rngs::SmallRng;
use rand::Rng;
use rand::SeedableRng;
/// Find suitable selectors for `attributes` in HTML documents `documents`.
///
/// `iterations` is the number of generations the fuzzing algorithm
/// produces. In our experience, a very low number (1-3) is sufficient
/// for most input HTML documents. A document with a very deep, nested
/// structure may need a higher number of iterations.
///
/// Further settings can be adjusted with [`FuzzerSettings`]. If the generated
/// selectors are not satisfactory, you can experiment with increasing the
/// `random_generation_count`, `random_generation_retries` and other settings.
/// Note that larger values may increase the training time.
///
/// The returned `TrainingResult` can be used to retrieve the generated
/// selectors or to automatically extract information from previously
/// unseen documents.
/// Same as [`train`], but with a custom random number generator ([`Rng`]).