//! dyer is designed for reliable, flexible and fast web crawling, providing high-level, comprehensive features without compromising speed.
//!
//! Built on the event-driven, non-blocking I/O of [tokio] and the powerful, reliable [Rust programming
//! language], and inspired by [scrapy], `dyer` provides some high-level features:
//!
//! * asynchronous, concurrent streaming and I/O, making the best of thread-pool, network, and system
//! resources.
//! * event-driven: once you set the initial `Task`s and their recursive generator, `dyer` handles
//! the rest.
//! * user-friendly: in keeping with the philosophy of the Rust programming language, writing more
//! source code and a proper architecture yourself can leave you in a dilemma between efficiency and
//! learning cost. `dyer` offers high-level, flexible wrappers and APIs that do a lot of that work for you.
//!
//! [tokio]: https://docs.rs/tokio
//! [scrapy]: https://scrapy.org
//! [Rust programming language]: https://www.rust-lang.org
//!
//! # Main Functionalities
//!
//! `Dyer` is a newly developed Rust library that provides the basic functionality for building a
//! crawler, a web service, or a data-processing pipeline. Even so, it can tackle most of the common problems you will meet.
//!
//! ## Real-browser Customization
//!
//! For each fed `Task`, `dyer` fakes a `Profile` and combines the two into a `Request` that
//! satisfies the requirements of the target site. Thanks to Rust's powerful `ffi` interface and
//! WebAssembly support, integration is not hard.
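//!
//! A minimal sketch of the idea, borrowing the types and the `req.task` field from the
//! walk-through example below (`req.profile` and the `Default` constructor of `Profile` are
//! assumptions here, check the `component` module for the real API):
//!
//! ```ignore
//! // a `Task` describes *what* to fetch and how to parse it
//! let mut task = Task::<Targ>::default();
//! task.uri = "https://quotes.toscrape.com".to_string();
//! task.parser = "parse_quote".to_string();
//! // a `Profile` describes *who* appears to be fetching it (headers, cookies, timing)
//! let profile = Profile::<Parg>::default();
//! // the engine pairs the two into a `Request` before polling the target site
//! let mut req = Request::<Targ, Parg>::default();
//! req.task = task;
//! req.profile = profile; // hypothetical field, shown for symmetry with `req.task`
//! ```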
//!
//! ## Signal Handling
//!
//! Think about a scenario where errors, bugs, or unexpected accidents show up while your app is
//! running. What would you do? Simply stopping the app leaves the collected data incomplete and
//! the result unreliable. Instead, `dyer` backs up your history at a configurable interval, and
//! resumption is at your choice.
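//!
//! For instance, when restarting after an interruption you can leave history enabled so that the
//! backed-up `Task`s and `Profile`s are reloaded. A sketch reusing the `skip_history` flag from
//! the walk-through example below (`false` being the default is an assumption here):
//!
//! ```ignore
//! let mut app: App<Items, Targ, Parg> = App::<Items, Targ, Parg>::new();
//! // reload the `Task`/`Profile`/`Request` backups left over from the
//! // previous run instead of starting from scratch
//! app.rt_args.lock().unwrap().skip_history = false;
//! // spider, middleware and pipeline constructed as in the walk-through below
//! app.run(&spider, &middleware, pipeline).await;
//! ```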
//!
//! ## Run-time Control
//!
//! Each `Task` and each `Profile` is scheduled with a certain gap and carries a time stamp for
//! validation; only the expired ones are fed to the engine of `dyer`. In addition, `dyer` limits
//! the number of `Request`s sent to poll, `Profile`s to make, `Task`s to load or store, and so on [see `AppArg`].
//!
//! [see `AppArg`]: crate::engine::AppArg
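//!
//! These run-time knobs live in `AppArg` and are reachable through `app.rt_args`. A sketch
//! (only `skip_history` appears in the walk-through example below; the rate-limiting fields are
//! documented on `AppArg` and not reproduced here):
//!
//! ```ignore
//! // the run-time arguments sit behind a mutex, so they can be inspected
//! // and tuned while the engine is running
//! let mut args = app.rt_args.lock().unwrap();
//! args.skip_history = true; // the same flag used in the walk-through below
//! drop(args);               // release the lock before `app.run(...)`
//! ```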
//!
//! # Walking Through an Example
//!
//! Take the [Scrapy Tutorial] as a guide to learn the basics, step by step:
//! * add `dyer` as a dependency in your `Cargo.toml`
//! * write a struct implementing the `Spider` trait, customizing a parser to extract data and generate recursive `Task`s
//! * write a `PipeLine` and a `MiddleWare` to process the data if necessary
//!
//! [Scrapy Tutorial]: https://docs.scrapy.org/en/latest/intro/tutorial.html
//!
//! ## Add as Dependency
//!
//! `dyer` is written in Rust; put the following in the `[dependencies]` section of your
//! `Cargo.toml`, along with the other libraries needed further on.
//!
//! ```toml
//! dyer = "*"
//! tokio = { version = "0.2", features = [ "macros", "rt-threaded" ] }
//! futures = "*"
//! serde = {version = "*", features = ["derive"] }
//! serde_json = "*"
//! select = "*"
//! log = "*"
//! simple_logger = "*"
//! ```
//!
//! ## Customize Your Code in `src/main.rs`
//!
//! ```no_run
//! extern crate dyer;
//! extern crate futures;
//! extern crate select;
//! extern crate serde;
//! extern crate serde_json;
//! extern crate tokio;
//! extern crate log;
//! extern crate simple_logger;
//!
//! use dyer::{
//!     App, MiddleWare, ParseResult, PipeLine, Profile, Request, ResError, Response, Spider, Task,
//! };
//! use futures::future::{BoxFuture, FutureExt};
//! use serde::{Deserialize, Serialize};
//! use std::fmt::Debug;
//! use std::fs::OpenOptions;
//! use std::io::{LineWriter, Write};
//! use std::sync::{Arc, Mutex, Once};
//! use simple_logger::SimpleLogger;
//!
//! type Stem<U> = Result<U, Box<dyn std::error::Error + Send + Sync>>;
//!
//! // the data to be collected, make sure it is Deserializable and Serializable
//! #[derive(Deserialize, Serialize, Debug, Clone)]
//! pub struct Quote {
//!     pub text: String,
//!     pub author: String,
//!     pub tags: Vec<String>,
//! }
//!
//! // use `Items` as a container of every possible item.
//! #[derive(Serialize, Debug, Clone)]
//! pub enum Items {
//!     Quote(Quote),
//! }
//!
//! // the uri alone is enough to get the data, so the generic parameters of `Task` and `Profile`
//! // are not necessary; leave them empty for the sake of appearance
//! #[derive(Serialize, Deserialize, Debug, Clone)]
//! pub struct Targ {}
//! #[derive(Serialize, Deserialize, Debug, Clone)]
//! pub struct Parg {}
//!
//! // here `select` is used to extract the data embodied in the HTML
//! // for more information about how to do that,
//! // "https://github.com/utkarshkukreti/select.rs" is recommended to explore
//! pub fn parse_quote(res: Response<Targ, Parg>) -> ParseResult<Items, Targ, Parg>
//! {
//!     let mut r = ParseResult {
//!         task: vec![],
//!         profile: vec![res.profile],
//!         req: vec![],
//!         entities: vec![],
//!         yield_err: vec![],
//!     };
//!     if res.content.is_none() {
//!         // for a `Response` with empty content, just recycle the profile
//!         return r;
//!     }
//!     let mut quotes = Vec::new();
//!     let doc = select::document::Document::from(res.content.as_ref().unwrap().as_str());
//!     for node in doc.find(select::predicate::Class("quote")) {
//!         let text = node
//!             .find(select::predicate::Class("text"))
//!             .next()
//!             .unwrap()
//!             .text();
//!         let author = node
//!             .find(select::predicate::Class("author"))
//!             .next()
//!             .unwrap()
//!             .text();
//!         let tags = node
//!             .find(select::predicate::Class("tag"))
//!             .map(|tag| tag.text())
//!             .collect::<Vec<String>>();
//!         let item = Quote { text, author, tags };
//!         quotes.push(Items::Quote(item));
//!     }
//!     r.entities = quotes;
//!
//!     // follow the next page if it exists
//!     let mut next_node = doc.find(select::predicate::Class("next"));
//!     if let Some(nd) = next_node.next() {
//!         // next page exists
//!         let next_url = nd
//!             .find(select::predicate::Name("a"))
//!             .next()
//!             .unwrap()
//!             .attr("href")
//!             .unwrap();
//!         let mut task = Task::<Targ>::default();
//!         task.uri = format!("https://quotes.toscrape.com{}", next_url);
//!         task.parser = "parse_quote".to_string();
//!         r.task.push(task);
//!     }
//!     r
//! }
//!
//! pub struct Spd {}
//! // implementing `Spider` for `Spd`
//! impl Spider<Items, Targ, Parg> for Spd {
//!     // preparation before opening spider
//!     // nothing to do in this case
//!     fn open_spider(&self, _app: &mut App<Items, Targ, Parg>) {}
//!
//!     // preparation before closing spider
//!     // nothing to do in this case
//!     fn close_spider(&self, _app: &mut App<Items, Targ, Parg>) {}
//!
//!     // set up parser that extracts `Quote` from the `Response`
//!     fn get_parser<'a>(
//!         &self,
//!         ind: String,
//!     ) -> Option<&'a dyn Fn(Response<Targ, Parg>) -> ParseResult<Items, Targ, Parg>> {
//!         if &ind == "parse_quote" {
//!             return Some(&parse_quote);
//!         }
//!         None
//!     }
//!
//!     // the `Task`s to be executed when `dyer` starts. Note that this function must return a
//!     // non-empty vector; otherwise the whole program has nothing to do.
//!     fn entry_task(&self) -> Stem<Vec<Task<Targ>>> {
//!         let mut task = Task::default();
//!
//!         // all the information needed is the uri and the parser
//!         task.uri = "https://quotes.toscrape.com".to_string();
//!         // parser is indexed by a `String` name, you can check that in the function `get_parser`.
//!         task.parser = "parse_quote".to_string();
//!         Ok(vec![task])
//!     }
//!
//!     // the generator of `Profile`:
//!     // `dyer` consumes the returned `Request`, generates a `Response`, and feeds it to the
//!     // closure to produce a `Profile`
//!     fn entry_profile(
//!         &self,
//!     ) -> (
//!         Request<Targ, Parg>,
//!         Option<
//!             &(dyn Fn(&mut Response<Targ, Parg>) -> BoxFuture<'_, Result<Profile<Parg>, ResError>>
//!                   + Send
//!                   + Sync),
//!         >,
//!     ) {
//!         let mut req = Request::<Targ, Parg>::default();
//!         req.task.uri = "https://quotes.toscrape.com".to_string();
//!         // as the domain suggests this site is specially built for crawling,
//!         // you do not have to pretend as a real device,
//!         // leave the `Profile` processor as empty
//!         (req, None)
//!     }
//! }
//!
//! // open a static file `result.json`
//! async fn open_file(path: &str) -> &'static Option<std::fs::File> {
//!     static INIT: Once = Once::new();
//!     static mut VAL: Option<std::fs::File> = None;
//!     unsafe {
//!         INIT.call_once(|| {
//!             let file = OpenOptions::new()
//!                 .create(true)
//!                 .write(true)
//!                 .append(true)
//!                 .open(path)
//!                 .unwrap();
//!             VAL = Some(file);
//!         });
//!         &VAL
//!     }
//! }
//! // store Items into file
//! async fn store_item(items: &mut Arc<Mutex<Vec<Items>>>) {
//!     let mut ser_items = Vec::new();
//!     let items_len = items.lock().unwrap().len();
//!     for _ in 0..items_len {
//!         let item = items.lock().unwrap().pop().unwrap();
//!         let s = serde_json::to_string(&item).unwrap();
//!         ser_items.push(s);
//!     }
//!     // a trailing newline keeps one serialized item per line across calls
//!     let stream = ser_items.join("\n") + "\n";
//!     let mut writer = LineWriter::new(open_file("result.json").await.as_ref().unwrap());
//!     writer.write_all(stream.as_bytes()).unwrap();
//! }
//!
//! #[tokio::main]
//! async fn main() {
//!     // for the sake of clarity, use simple_logger to display some level-varied information
//!     // initialize simple_logger
//!     SimpleLogger::new()
//!     .with_level(log::LevelFilter::Info)
//!     .with_module_level("dyer", log::LevelFilter::Debug)
//!     .init()
//!     .unwrap();
//!     static SPD: Spd = Spd {};
//!     // since the `Quote` collected by `parse_quote` is already complete, a `MiddleWare` is not necessary,
//!     let middleware = MiddleWare::<Items, Targ, Parg>::builder().build();
//!     // so just write a `PipeLine` to store the results;
//!     // for brevity, handle `Items` only
//!     let pipeline = PipeLine::<Items, std::fs::File>::builder()
//!         .process_item(&|items: &mut Arc<Mutex<Vec<Items>>>| store_item(items).boxed_local())
//!         .build();
//!
//!     // construct the app and start the crawler
//!     let mut app: App<Items, Targ, Parg> = App::<Items, Targ, Parg>::new();
//!     // do not use history (backed-up `Request`s, `Profile`s, `Task`s) from a previous run
//!     app.rt_args.lock().unwrap().skip_history = true;
//!     app.run(&SPD, &middleware, pipeline).await;
//! }
//! ```
//!
//! As you expected, it is done. Once the run finishes, the output file `result.json` sits in the
//! project root (the directory containing `Cargo.toml`), and its content looks like:
//!
//! ```json
//! {"Quote":{"text":"“... a mind needs books as a sword needs a whetstone, if it is to keep its edge.”","author":"George R.R. Martin","tags":["books","mind"]}}
//! ...
//! {"Quote":{"text":"“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”","author":"Albert Einstein","tags":["change","deep-thoughts","thinking","world"]}}
//! ```
//!
//! # Features to be added
//!
//! * proxy support
//! * debugging support (not bad for now, though)
//! * more signal support (Ctrl-C only for now)
//! * autothrottling and more customized plugin support
//! * more to go

pub mod component;
pub mod engine;
pub mod plugin;

pub use component::*;
pub use engine::{App, AppArg};
pub use plugin::{MiddleWare, PipeLine, Spider};