//! Custom-fetch hook for [`crate::website::Website`].
//!
//! Implementing [`RemoteFetcher`] and installing it on a
//! [`Website`](crate::website::Website) via
//! [`with_remote_fetcher`](crate::website::Website::with_remote_fetcher)
//! reroutes spider's per-URL network round-trip through the user's code
//! while leaving every other crawl concern — visited tracking, depth,
//! allow/deny, robots, link extraction, scheduling, subscription
//! channels — in spider's hands.
//!
//! ## Default behavior unchanged
//!
//! A `Website` with **no** fetcher installed (the default, and the only
//! state code written against the existing API can be in) runs exactly
//! the fetch path it always has: the built-in reqwest client plus all of
//! the feature-gated retry / cache / hedge / parallel-backends machinery.
//!
//! The hook is purely additive: it short-circuits *before* the built-in
//! fetch path executes, so when it fires none of those layers run. The
//! `RemoteFetcher` implementation owns those concerns on its own side
//! (gottem's orchestrator, for example, brings its own retry ladder /
//! escalation / hedge across cloud vendors).
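//!
//! As a rough, std-only illustration of what owning those concerns can
//! look like, the sketch below wraps a transport call in a small retry
//! ladder with exponential backoff. `Attempt`, `fetch_with_retries`, and
//! the closure-based transport are hypothetical names invented for this
//! example and are not part of spider; a real fetcher would run
//! equivalent logic inside its `fetch` method and would likely use its
//! async runtime's timer rather than `thread::sleep`.
//!
//! ```rust
//! use std::{thread, time::Duration};
//!
//! /// Outcome of one transport attempt (hypothetical type).
//! #[derive(Debug, PartialEq)]
//! enum Attempt {
//!     Ok(Vec<u8>),
//!     Retryable,
//!     Fatal,
//! }
//!
//! /// Calls `transport` up to `max_attempts` times, doubling the delay
//! /// after each retryable failure. Returns the body on success.
//! fn fetch_with_retries<F>(
//!     mut transport: F,
//!     max_attempts: u32,
//!     base_delay: Duration,
//! ) -> Option<Vec<u8>>
//! where
//!     F: FnMut(u32) -> Attempt,
//! {
//!     for attempt in 0..max_attempts {
//!         match transport(attempt) {
//!             Attempt::Ok(body) => return Some(body),
//!             // A fatal error ends the ladder immediately.
//!             Attempt::Fatal => return None,
//!             Attempt::Retryable => {
//!                 // Back off only if another attempt will follow.
//!                 if attempt + 1 < max_attempts {
//!                     thread::sleep(base_delay * 2u32.saturating_pow(attempt));
//!                 }
//!             }
//!         }
//!     }
//!     None
//! }
//! ```
//!
//! A fetcher built along these lines decides for itself what counts as
//! retryable; once the hook fires, spider's own retry layer is not
//! consulted.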
//!
//! ## Scope (today)
//!
//! Today the hook fires only in the **HTTP** crawl path
//! ([`Website::crawl`](crate::website::Website::crawl) /
//! [`Website::crawl_raw`](crate::website::Website::crawl_raw)). The
//! chrome / webdriver / smart variants still drive their own
//! browser-backed fetches; setting a fetcher on a chrome-mode Website
//! has no effect there. Extending the hook to those paths is future
//! work — straightforward, but each path has its own machinery and
//! would expand the surface beyond what this addition is designed for.
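//!
//! The scope rule, together with the short-circuit described earlier,
//! can be modelled in a few lines. This is a simplified, synchronous
//! sketch and not spider's actual dispatch code: `Fetch`, `Mode`, `Site`,
//! and `Canned` are hypothetical stand-ins, and the returned strings only
//! label which path ran.
//!
//! ```rust
//! use std::sync::Arc;
//!
//! /// Stand-in for a user transport (hypothetical, sync for brevity).
//! trait Fetch: Send + Sync {
//!     fn fetch(&self, url: &str) -> String;
//! }
//!
//! /// Which crawl path is running (hypothetical enum).
//! #[derive(Clone, Copy)]
//! enum Mode {
//!     Http,
//!     Chrome,
//! }
//!
//! struct Site {
//!     mode: Mode,
//!     fetcher: Option<Arc<dyn Fetch>>,
//! }
//!
//! impl Site {
//!     fn fetch_page(&self, url: &str) -> String {
//!         match (self.mode, &self.fetcher) {
//!             // HTTP path, fetcher installed: the built-in path is skipped.
//!             (Mode::Http, Some(f)) => f.fetch(url),
//!             // HTTP path, no fetcher: the built-in path runs as before.
//!             (Mode::Http, None) => format!("builtin-http:{url}"),
//!             // Browser-backed path: an installed fetcher is ignored.
//!             (Mode::Chrome, _) => format!("browser:{url}"),
//!         }
//!     }
//! }
//!
//! /// Trivial fetcher used to make the routing visible.
//! struct Canned;
//!
//! impl Fetch for Canned {
//!     fn fetch(&self, url: &str) -> String {
//!         format!("custom:{url}")
//!     }
//! }
//! ```
//!
//! In this model only the first arm reaches the user's code, which
//! mirrors the statement that a fetcher set on a chrome-mode `Website`
//! has no effect.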
//!
//! ## Example
//!
//! ```no_run
//! use std::sync::Arc;
//! use spider::fetcher::{FetchContext, RemoteFetcher};
//! use spider::utils::PageResponse;
//! use spider::website::Website;
//!
//! struct MyFetcher;
//!
//! #[async_trait::async_trait]
//! impl RemoteFetcher for MyFetcher {
//!     async fn fetch(&self, ctx: FetchContext<'_>) -> PageResponse {
//!         // … call your transport (HTTP, gRPC, cloud API, anything) …
//!         // Return a PageResponse spider can consume.
//!         let mut resp = PageResponse::default();
//!         resp.final_url = Some(ctx.url.to_string());
//!         resp.content = Some(b"<html>hi</html>".to_vec());
//!         resp
//!     }
//! }
//!
//! # async fn ex() {
//! let mut site = Website::new("https://example.com");
//! site.with_remote_fetcher(MyFetcher);
//! site.crawl().await; // every per-URL fetch flows through MyFetcher
//! # }
//! ```
use std::sync::Arc;
use crate::configuration::Configuration;
use crate::utils::PageResponse;
/// Per-request context handed to a [`RemoteFetcher`]. Borrowed for the
/// duration of one fetch call.
/// User-supplied fetch transport. When installed on a
/// [`Website`](crate::website::Website) via
/// [`with_remote_fetcher`](crate::website::Website::with_remote_fetcher),
/// spider invokes this on every URL that survives the `is_allowed`
/// gate, replacing its built-in reqwest fetch.
///
/// Cancellation: the fetcher implementation is responsible for honoring
/// any cancellation contract its own caller provides. Spider does not
/// pass a cancel token into the trait; dropping the future returned by
/// `fetch` is the only cancellation signal.
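///
/// To make the drop-based contract concrete, here is a minimal std-only
/// sketch. `AbortOnDrop`, `SlowFetch`, `NoopWake`, and
/// `poll_once_then_drop` are hypothetical names for illustration and are
/// not spider APIs. The guard stands in for whatever cleanup a real
/// fetcher needs, such as aborting an in-flight request.
///
/// ```rust
/// use std::future::Future;
/// use std::pin::Pin;
/// use std::sync::atomic::{AtomicBool, Ordering};
/// use std::sync::Arc;
/// use std::task::{Context, Poll, Wake, Waker};
///
/// /// Sets a shared flag when dropped.
/// struct AbortOnDrop(Arc<AtomicBool>);
///
/// impl Drop for AbortOnDrop {
///     fn drop(&mut self) {
///         self.0.store(true, Ordering::SeqCst);
///     }
/// }
///
/// /// Stand-in for an in-flight fetch that never completes on its own.
/// struct SlowFetch {
///     _guard: AbortOnDrop,
/// }
///
/// impl Future for SlowFetch {
///     type Output = ();
///     fn poll(self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<()> {
///         Poll::Pending
///     }
/// }
///
/// /// Waker that does nothing; enough to poll a future once by hand.
/// struct NoopWake;
///
/// impl Wake for NoopWake {
///     fn wake(self: Arc<Self>) {}
/// }
///
/// /// Polls the future once, drops it, and returns the flag as seen
/// /// before and after the drop.
/// fn poll_once_then_drop() -> (bool, bool) {
///     let aborted = Arc::new(AtomicBool::new(false));
///     let mut fut = Box::pin(SlowFetch {
///         _guard: AbortOnDrop(aborted.clone()),
///     });
///     let waker = Waker::from(Arc::new(NoopWake));
///     let mut cx = Context::from_waker(&waker);
///     // The fetch is still in flight after one poll.
///     let still_pending = fut.as_mut().poll(&mut cx).is_pending();
///     debug_assert!(still_pending);
///     let before = aborted.load(Ordering::SeqCst);
///     drop(fut); // the caller loses interest; the guard's `Drop` runs here
///     (before, aborted.load(Ordering::SeqCst))
/// }
/// ```
///
/// Any state the future owns is released when it is dropped, so cleanup
/// that must happen on cancellation belongs in `Drop` implementations
/// rather than in code placed after an `.await`.
///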
/// Type alias used internally by `Website` to store an installed
/// fetcher. `Arc<dyn ...>` keeps the slot tiny when unset (`None`).
pub type SharedRemoteFetcher = Arc<dyn RemoteFetcher>;