texting_robots/lib.rs
/*!
Crate `texting_robots` is a library for parsing `robots.txt` files.
A key design goal of this crate is to have a thorough test suite, tested
against real world data across millions of sites. While `robots.txt` is
itself a simple specification, the scale and complexity of the web tease
out every possible edge case.

To read more about the `robots.txt` specification, a good starting point is
[How Google interprets the robots.txt specification][google-spec].

This library cannot guard you against all possible edge cases but should
give you a strong starting point from which to ensure you and your code
constitute a positive addition to the internet at large.

[google-spec]: https://developers.google.com/search/docs/advanced/robots/robots_txt

# Installation

You can install the library by adding this entry:

```plain
[dependencies]
texting_robots = "0.2"
```

to your `Cargo.toml` dependency list.

# Overview of usage

This crate provides simple, high-level usage through the `Robot` struct.

The `Robot` struct is responsible for consuming the `robots.txt` file,
processing the contents, and deciding whether a given URL is allowed for
your bot or not. Additional information such as your bot's crawl delay
and any sitemaps that may exist are also available.

Given the many options and potential preferences, Texting Robots does not
perform caching or the HTTP GET request for the `robots.txt` files themselves.
That step is left to the user of the library; a sketch of the fetch step
follows the example below.

```rust
use texting_robots::{Robot, get_robots_url};

// If you want to fetch a URL we'll find the URL for `robots.txt`
let url = "https://www.rust-lang.org/learn";
let robots_url = get_robots_url(url).unwrap();
// Then we fetch `robots.txt` from `robots_url` to parse as below

// A `robots.txt` file in String or byte format.
let txt = r"User-Agent: FerrisCrawler
Allow: /ocean
Disallow: /rust
Disallow: /forest*.py
Crawl-Delay: 10
User-Agent: *
Disallow: /
Sitemap: https://www.example.com/site.xml";

// Build the Robot for our friendly User-Agent
let r = Robot::new("FerrisCrawler", txt.as_bytes()).unwrap();

// Ferris has a crawl delay of one second per limb
// (Crabs have 10 legs so Ferris must wait 10 seconds!)
assert_eq!(r.delay, Some(10.0));

// Any listed sitemaps are available to any user agent who finds them
assert_eq!(r.sitemaps, vec!["https://www.example.com/site.xml"]);

// We can also check which pages Ferris is allowed to crawl
// Notice we can supply either the full URL or a relative path
assert_eq!(r.allowed("https://www.rust-lang.org/ocean"), true);
assert_eq!(r.allowed("/ocean"), true);
assert_eq!(r.allowed("/ocean/reef.html"), true);
// Sadly Ferris is allowed in the ocean but not in the rust
assert_eq!(r.allowed("/rust"), false);
// Ferris is also friendly but not very good with pythons
assert_eq!(r.allowed("/forest/tree/snake.py"), false);
```
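
If you need to fetch `robots.txt` yourself, the fetch step might look like the
hedged sketch below using the `reqwest` blocking client (any HTTP client works;
neither `reqwest` nor the shown agent string is part of this crate):

```ignore
use texting_robots::{Robot, get_robots_url};

// Derive the robots.txt URL, fetch it, and hand the bytes to Robot::new
let robots_url = get_robots_url("https://www.rust-lang.org/learn")?;
let body = reqwest::blocking::Client::new()
    .get(&robots_url)
    .header("User-Agent", "FerrisCrawler/0.1")
    .send()?
    .bytes()?;
let robot = Robot::new("FerrisCrawler", &body)?;
```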

# Crawling considerations

## Obtaining `robots.txt`

Obtaining `robots.txt` requires performing an initial HTTP GET request against
the domain in question. When handling the HTTP status codes and deciding how they
impact `robots.txt`, the [suggestions made by Google are recommended][google-spec]
(a sketch of one possible mapping follows the list below).

- 2xx (success): Attempt to process the resulting payload
- 3xx (redirection): Follow a reasonable number of redirects
- 4xx (client error): Assume there are no crawl restrictions except for:
  - 429 "Too Many Requests": Retry after a reasonable amount of time
    (potentially set by the "[Retry-After][mozilla-ra]" header)
- 5xx (server errors): Assume you should not crawl until fixed and/or interpret with care

Even when directed to "assume no crawl restrictions", it is likely reasonable and
polite to use a small fetch delay between requests.
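
As a rough illustration of the list above, one possible mapping from status
code to crawl decision might look like the following sketch (`CrawlDecision`
and `decision_for_status` are illustrative names, not part of this crate):

```rust
enum CrawlDecision { Parse, AssumeAllowAll, RetryLater, AssumeDisallowAll }

fn decision_for_status(status: u16) -> CrawlDecision {
    match status {
        // 2xx: attempt to process the payload
        200..=299 => CrawlDecision::Parse,
        // 429: back off and retry later (see Retry-After)
        429 => CrawlDecision::RetryLater,
        // Remaining 4xx: assume no crawl restrictions
        400..=499 => CrawlDecision::AssumeAllowAll,
        // 5xx: assume you should not crawl until fixed
        500..=599 => CrawlDecision::AssumeDisallowAll,
        // 3xx is best left to the HTTP client's redirect policy;
        // anything else is treated conservatively here
        _ => CrawlDecision::AssumeDisallowAll,
    }
}

assert!(matches!(decision_for_status(404), CrawlDecision::AssumeAllowAll));
assert!(matches!(decision_for_status(429), CrawlDecision::RetryLater));
```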

### Always set a User Agent

For crawling `robots.txt` (and especially for crawling in general) you should
include a user agent in your request. Most crawling libraries allow setting the
user agent in a single line.

```ignore
ClientBuilder.new().user_agent("FerrisCrawler/0.1 (https://ferris.rust/about-this-robot)")...
```

Beyond respecting `robots.txt`, a good user agent also offers a line of
communication between you and the webmaster.

## Beyond the `robots.txt` specification and general suggestions

`texting_robots` provides much of what you need for safe and respectful
crawling but is not a full solution by itself.

As an example, the HTTP error code 429 ([Too Many Requests][mozilla-tmr]) must be
tracked when requesting pages on a given site. When a 429 is seen, the crawler
should slow down, even if it is already obeying the Crawl-Delay set in `robots.txt`,
potentially using the delay set by the server's [Retry-After][mozilla-ra] header.
A sketch of combining the two delays follows below.
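
As a minimal sketch of combining the two delays, assuming the Retry-After value
has already been parsed into seconds (`effective_delay` and its inputs are
illustrative and not provided by this crate):

```rust
// Use the larger of the robots.txt Crawl-Delay and any Retry-After delay
// observed on a 429 response.
fn effective_delay(crawl_delay: Option<f32>, retry_after_secs: Option<f32>) -> f32 {
    // Fall back to a small polite default when robots.txt sets no delay
    let base = crawl_delay.unwrap_or(1.0);
    match retry_after_secs {
        Some(secs) => base.max(secs),
        None => base,
    }
}

assert_eq!(effective_delay(Some(10.0), Some(30.0)), 30.0);
assert_eq!(effective_delay(None, None), 1.0);
```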

An even more complex example is that multiple domains may be served by the same
backend web server. This is a common scenario for specific products or services
that host thousands or millions of domains. How you rate limit fairly using the
`Crawl-Delay` is entirely up to the end user (and potentially the service when
using HTTP error code 429 to rate limit traffic).

To protect against adverse input, users of Texting Robots are also advised to
follow [Google's recommendations][google-spec] and limit input to 500 kibibytes.
This is not yet done at the library level in case larger inputs are desired,
but that may be revisited depending on community feedback. A sketch of capping
the input is shown below.
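
A minimal sketch of applying such a cap before parsing (the example body and
the choice of agent are illustrative):

```rust
use texting_robots::Robot;

// In practice `body` would be the fetched robots.txt response bytes
let body: &[u8] = b"User-Agent: *\nDisallow: /private";
// Cap the input at 500 KiB before handing it to the parser
let capped = &body[..body.len().min(500 * 1024)];
let r = Robot::new("FerrisCrawler", capped).unwrap();
assert_eq!(r.allowed("/private"), false);
```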

[mozilla-tmr]: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/429
[mozilla-ra]: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Retry-After

## Usage of Texting Robots in other languages

While not yet specifically supporting any languages other than Rust, the
library was designed with future language integrations in mind. Battle
testing this interpretation of the `robots.txt` specification against the
web is easier done with friends!

A C API through Rust FFI should be relatively easy to provide given Texting Robots
only relies on strings, floats, and booleans. The lack of native fetching abilities
should ensure the library is portable across platforms, situations, and languages.

A proof of concept was performed in [WASI][wasi], the "WebAssembly System Interface",
showing that the library compiles happily and only experiences a 50% or 75% speed
penalty when used with the [Wasmer][wasmer] (LLVM backend) and [Wasmtime][wasmtime]
runtimes respectively. No optimizations have been done thus far and there's likely
low-hanging fruit to reap.

See `wasi_poc.sh` for details.

[wasi]: https://wasi.dev/
[wasmer]: https://wasmer.io/
[wasmtime]: https://wasmtime.dev/

# Testing

To run the majority of the core tests, simply execute `cargo test`.

## Unit and Integration Tests

To check Texting Robots' behaviour against the `robots.txt` specification,
almost all unit tests from [Google's C++ robots.txt parser][google-cpp] and
[Moz's reppy][moz-reppy] have been translated and included.

Certain aspects of the Google and Moz interpretations disagree with each other.
When this occurred the author deferred to as much common sense as they
were able to muster.

For a number of popular domains the `robots.txt` of the given domain was
saved and tests were written against it.

[google-cpp]: https://github.com/google/robotstxt
[moz-reppy]: https://github.com/seomoz/reppy

## Common Crawl Test Harness

To ensure that the `robots.txt` parser will not panic in real world situations,
over 34 million `robots.txt` responses were passed through Texting Robots.
While this test doesn't guarantee the `robots.txt` files were handled correctly,
it does ensure the parser is unlikely to panic in practice.

Many problematic, invalid, outrageous, and even adversarial `robots.txt`
examples were discovered in this process.

For full details see [the Common Crawl testing harness][cc-test].

[cc-test]: https://github.com/Smerity/texting_robots_cc_test

## Fuzz Testing

In the local `fuzz` directory is a fuzz testing harness. The harness is not
particularly sophisticated but does have a low level of structure awareness
through the use of [dictionary guided fuzzing][dgf]. The harness has already
revealed one low-level unwrap panic.

To run:

```bash
cargo fuzz run fuzz_target_1 -- -max_len=512 -dict=keywords.dict
```

Note:

- `cargo fuzz` requires nightly (i.e. run `rustup default nightly` in the `fuzz` directory)
- If you have multiple processors you may wish to add `--jobs N` after `cargo fuzz run`

[dgf]: https://llvm.org/docs/LibFuzzer.html#dictionaries

## Code Coverage with Tarpaulin

This project uses [Tarpaulin](https://github.com/xd009642/tarpaulin) to perform
code coverage reporting. Given the relatively small surface area of the parser
and `Robot` struct, the coverage is high. Unit testing is more important for
ensuring behavioural correctness, however.

To get line numbers for uncovered code, run:

```bash
cargo tarpaulin --ignore-tests -v
```

*/

use core::fmt;

use bstr::ByteSlice;

use percent_encoding::{utf8_percent_encode, AsciiSet, CONTROLS};

use thiserror::Error;
use url::{ParseError, Position, Url};

mod minregex;
use minregex::MinRegex as RobotRegex;

#[cfg(test)]
mod test;

#[cfg(test)]
mod test_repcpp;

#[cfg(test)]
mod test_get_robots_url;

mod parser;
use crate::parser::{robots_txt_parse, Line};

#[derive(Error, Debug)]
pub enum Error {
    /// On any parsing error encountered parsing `robots.txt` this error will
    /// be returned.
    ///
    /// Note: Parsing errors should be rare as the parser is highly forgiving.
    #[error("Failed to parse robots.txt")]
    InvalidRobots,
}

fn percent_encode(input: &str) -> String {
    // Paths outside ASCII must be percent encoded
    const FRAGMENT: &AsciiSet =
        &CONTROLS.add(b' ').add(b'"').add(b'<').add(b'>').add(b'`');
    utf8_percent_encode(input, FRAGMENT).to_string()
}

/// Construct the URL for `robots.txt` when given a base URL from the
/// target domain.
///
/// # Errors
///
/// If there are any issues in parsing the URL, a [ParseError][pe] from the
/// [URL crate](url) will be returned.
///
/// ```rust
/// use texting_robots::get_robots_url;
///
/// let robots_url = get_robots_url("https://example.com/abc/file.html").unwrap();
/// assert_eq!(robots_url, "https://example.com/robots.txt");
/// ```
///
/// [pe]: ParseError
pub fn get_robots_url(url: &str) -> Result<String, ParseError> {
    let parsed = Url::parse(url);
    match parsed {
        Ok(mut url) => {
            if url.cannot_be_a_base() {
                return Err(ParseError::SetHostOnCannotBeABaseUrl);
            }

            if url.scheme() != "http" && url.scheme() != "https" {
                // EmptyHost isn't optimal but I'd prefer to re-use errors
                return Err(ParseError::EmptyHost);
            }

            // Setting username to "" removes the username and password
            if !url.username().is_empty() {
                url.set_username("").unwrap();
            }
            if url.password().is_some() {
                url.set_password(None).unwrap();
            }

            match url.join("/robots.txt") {
                Ok(robots_url) => Ok(robots_url.to_string()),
                Err(e) => Err(e),
            }
        }
        Err(e) => Err(e),
    }
}

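/// A compiled representation of a `robots.txt` file for a single user agent,
/// produced by [`Robot::new`]. It reports whether a given URL is crawlable via
/// [`allowed`](Robot::allowed), along with the optional crawl `delay` and any
/// `sitemaps` listed in the file.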
#[allow(dead_code)]
pub struct Robot {
    // Rules are stored in the form of (regex rule, allow/disallow)
    // where the regex rule is ordered by original pattern length
    rules: Vec<(RobotRegex, bool)>,
    /// The delay in seconds between requests.
    /// If `Crawl-Delay` is set in `robots.txt` this will be `Some(f32)`
    /// and otherwise `None`.
    pub delay: Option<f32>,
    /// Any sitemaps found in the `robots.txt` file are added to this vector.
    /// According to the `robots.txt` specification, a sitemap found in `robots.txt`
    /// is accessible and available to any bot reading `robots.txt`.
    pub sitemaps: Vec<String>,
}

impl fmt::Debug for Robot {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.debug_struct("Robot")
            .field("rules", &self.rules)
            .field("delay", &self.delay)
            .field("sitemaps", &self.sitemaps)
            .finish()
    }
}

impl Robot {
    /// Construct a new Robot object specifically processed for the given user agent.
    /// All rules relevant to this user agent are extracted from `robots.txt` and
    /// stored internally. If the user agent isn't found in `robots.txt` we default
    /// to the catch-all agent `*`.
    ///
    /// Note: The agent string is lowercased before comparison, as required by the
    /// `robots.txt` specification.
    ///
    /// # Errors
    ///
    /// If there are difficulties parsing, which should be rare as the parser is quite
    /// forgiving, then an [InvalidRobots](Error::InvalidRobots) error is returned.
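    ///
    /// # Example
    ///
    /// A minimal example; the `robots.txt` content and agent name are illustrative.
    ///
    /// ```rust
    /// use texting_robots::Robot;
    ///
    /// let txt = b"User-Agent: FerrisCrawler\nDisallow: /secret";
    /// let r = Robot::new("FerrisCrawler", txt).unwrap();
    /// assert_eq!(r.allowed("/secret"), false);
    /// assert_eq!(r.allowed("/public"), true);
    /// ```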
    pub fn new(agent: &str, txt: &[u8]) -> Result<Self, anyhow::Error> {
        // Replace '\x00' with '\n'
        // This shouldn't be necessary but some websites are strange ...
        let txt = txt
            .iter()
            .map(|x| if *x == 0 { b'\n' } else { *x })
            .collect::<Vec<u8>>();

        // Parse robots.txt using the nom library
        let lines = match robots_txt_parse(&txt) {
            Ok((_, lines)) => lines,
            Err(e) => {
                let err = anyhow::Error::new(Error::InvalidRobots)
                    .context(e.to_string());
                return Err(err);
            }
        };

        // All agents are case insensitive in `robots.txt`
        let agent = agent.to_lowercase();
        let mut agent = agent.as_str();

        // Collect all sitemaps
        // Why? "The sitemap field isn't tied to any specific user agent and may be followed by all crawlers"
        let sitemaps = lines
            .iter()
            .filter_map(|x| match x {
                Line::Sitemap(url) => match String::from_utf8(url.to_vec()) {
                    Ok(url) => Some(url),
                    Err(_) => None,
                },
                _ => None,
            })
            .collect();

        // Filter out any lines that aren't User-Agent, Allow, Disallow, or CrawlDelay
        // CONFLICT: reppy's "test_robot_grouping_unknown_keys" test suggests these lines should be kept
        let lines: Vec<Line> = lines
            .iter()
            .filter(|x| !matches!(x, Line::Sitemap(_) | Line::Raw(_)))
            .copied()
            .collect();

        // Check if our crawler is explicitly referenced; otherwise fall back to the catch-all agent ("*")
        let references_our_bot = lines.iter().any(|x| match x {
            Line::UserAgent(ua) => {
                agent.as_bytes() == ua.as_bstr().to_ascii_lowercase()
            }
            _ => false,
        });
        if !references_our_bot {
            agent = "*";
        }

        // Collect only the lines relevant to this user agent
        // If there are no User-Agent lines then we capture all
        let mut capturing = false;
        if lines.iter().filter(|x| matches!(x, Line::UserAgent(_))).count()
            == 0
        {
            capturing = true;
        }
        let mut subset = vec![];
        let mut idx: usize = 0;
        while idx < lines.len() {
            let mut line = lines[idx];

            // User-Agents can be given in blocks with rules applicable to all User-Agents in the block
            // On a new block of User-Agents we're either in it or no longer active
            if let Line::UserAgent(_) = line {
                capturing = false;
            }
            while idx < lines.len() && matches!(line, Line::UserAgent(_)) {
                // Unreachable should never trigger as we ensure it's always a UserAgent
                let ua = match line {
                    Line::UserAgent(ua) => ua.as_bstr(),
                    _ => unreachable!(),
                };
                if agent.as_bytes() == ua.as_bstr().to_ascii_lowercase() {
                    capturing = true;
                }
                idx += 1;
                // If the User-Agent lines run to the end of the file, break out to avoid indexing past the end
                if idx == lines.len() {
                    break;
                }
                line = lines[idx];
            }

            if capturing {
                subset.push(line);
            }
            idx += 1;
        }

        // Collect the crawl delay
        let mut delay = subset
            .iter()
            .filter_map(|x| match x {
                Line::CrawlDelay(Some(d)) => Some(d),
                _ => None,
            })
            .copied()
            .next();

        // Special note for crawl delay:
        // Some robots.txt files have it at the top, before any User-Agent lines, to apply to all
        if delay.is_none() {
            for line in lines.iter() {
                if let Line::CrawlDelay(Some(d)) = line {
                    delay = Some(*d);
                }
                if let Line::UserAgent(_) = line {
                    break;
                }
            }
        }

        // Prepare the regex patterns for matching rules
        let mut rules = vec![];
        for line in subset
            .iter()
            .filter(|x| matches!(x, Line::Allow(_) | Line::Disallow(_)))
        {
            let (is_allowed, original) = match line {
                Line::Allow(pat) => (true, *pat),
                Line::Disallow(pat) => (false, *pat),
                _ => unreachable!(),
            };
            let pat = match original.to_str() {
                Ok(pat) => pat,
                Err(_) => continue,
            };

            // Paths outside ASCII must be percent encoded
            let pat = percent_encode(pat);

            let rule = RobotRegex::new(&pat);

            let rule = match rule {
                Ok(rule) => rule,
                Err(e) => {
                    let err = anyhow::Error::new(e)
                        .context(format!("Invalid robots.txt rule: {}", pat));
                    return Err(err);
                }
            };
            rules.push((rule, is_allowed));
        }

        Ok(Robot { rules, delay, sitemaps })
    }

    fn prepare_url(raw_url: &str) -> String {
        // Try to get only the path + query of the URL
        if raw_url.is_empty() {
            return "/".to_string();
        }
        // Note: If this fails we assume the passed URL is valid
        // i.e. We assume the user has passed us a valid relative URL
        let parsed = Url::parse(raw_url);
        let url = match parsed.as_ref() {
            // The Url library performs percent encoding
            Ok(url) => url[Position::BeforePath..].to_string(),
            Err(_) => percent_encode(raw_url),
        };
        url
    }

    /// Check if the given URL is allowed for the agent by `robots.txt`.
    /// This function returns true or false according to the rules in `robots.txt`.
    ///
    /// The provided URL can be absolute or relative depending on user preference.
    ///
    /// # Example
    ///
    /// ```rust
    /// use texting_robots::Robot;
    ///
    /// let r = Robot::new("Ferris", b"Disallow: /secret").unwrap();
    /// assert_eq!(r.allowed("https://example.com/secret"), false);
    /// assert_eq!(r.allowed("/secret"), false);
    /// assert_eq!(r.allowed("/everything-else"), true);
    /// ```
    pub fn allowed(&self, url: &str) -> bool {
        let url = Self::prepare_url(url);
        if url == "/robots.txt" {
            return true;
        }

        // Filter to only rules matching the URL
        let mut matches: Vec<&_> = self
            .rules
            .iter()
            .filter(|(rule, _)| rule.is_match(&url))
            .collect();

        // Sort according to the longest match and then by whether it's allowed
        // RobotRegex is sorted with preference going from longest to shortest
        // If two rules of equal length conflict (one allow, one disallow), the spec says allow wins
        matches.sort_by_key(|x| (&x.0, !x.1));

        match matches.first() {
            Some((_, is_allowed)) => *is_allowed,
            // If there are no rules we assume we're allowed
            None => true,
        }
    }
}
559}