/*!
Crate `texting_robots` is a library for parsing `robots.txt` files.
A key design goal of this crate is to have a thorough test suite tested
against real world data across millions of sites. While `robots.txt` is itself
a simple specification, the scale and complexity of the web teases out every
possible edge case.

To read more about the `robots.txt` specification a good starting point is
[How Google interprets the robots.txt specification][google-spec].

This library cannot guard you against all possible edge cases but should
give you a strong starting point from which to ensure you and your code
constitute a positive addition to the internet at large.

[google-spec]: https://developers.google.com/search/docs/advanced/robots/robots_txt

# Installation

You can install the library by adding this entry:

```plain
[dependencies]
texting_robots = "0.2"
```

to your `Cargo.toml` dependency list.

# Overview of usage

This crate provides a simple high level interface through the `Robot` struct.

The `Robot` struct is responsible for consuming the `robots.txt` file,
processing the contents, and deciding whether a given URL is allowed for
your bot or not. Additional information, such as your bot's crawl delay
and any sitemaps that may exist, is also available.

Given the many options and potential preferences, Texting Robots does not
perform caching or the HTTP GET request for the `robots.txt` files themselves.
That step is left to the user of the library.

```rust
use texting_robots::{Robot, get_robots_url};

// If you want to fetch a URL we'll find the URL for `robots.txt`
let url = "https://www.rust-lang.org/learn";
let robots_url = get_robots_url(url);
// Then we fetch `robots.txt` from robots_url to parse as below

// A `robots.txt` file in String or byte format.
let txt = r"User-Agent: FerrisCrawler
Allow: /ocean
Disallow: /rust
Disallow: /forest*.py
Crawl-Delay: 10
User-Agent: *
Disallow: /
Sitemap: https://www.example.com/site.xml";

// Build the Robot for our friendly User-Agent
let r = Robot::new("FerrisCrawler", txt.as_bytes()).unwrap();

// Ferris has a crawl delay of one second per limb
// (Crabs have 10 legs so Ferris must wait 10 seconds!)
assert_eq!(r.delay, Some(10.0));

// Any listed sitemaps are available for any user agent who finds them
assert_eq!(r.sitemaps, vec!["https://www.example.com/site.xml"]);

// We can also check which pages Ferris is allowed to crawl
// Notice we can supply the full URL or a relative path?
assert_eq!(r.allowed("https://www.rust-lang.org/ocean"), true);
assert_eq!(r.allowed("/ocean"), true);
assert_eq!(r.allowed("/ocean/reef.html"), true);
// Sadly Ferris is allowed in the ocean but not in the rust
assert_eq!(r.allowed("/rust"), false);
// Ferris is also friendly but not very good with pythons
assert_eq!(r.allowed("/forest/tree/snake.py"), false);
```

# Crawling considerations

## Obtaining `robots.txt`

Obtaining `robots.txt` requires performing an initial HTTP GET request to the
domain in question. When handling the HTTP status codes and how they impact
`robots.txt` the [suggestions made by Google are recommended][google-spec]
(a sketch of this flow is shown below).

- 2xx (success): Attempt to process the resulting payload
- 3xx (redirection): Follow a reasonable number of redirects
- 4xx (client error): Assume there are no crawl restrictions except for:
  - 429 "Too Many Requests": Retry after a reasonable amount of time
  (potentially set by the "[Retry-After][mozilla-ra]" header)
- 5xx (server errors): Assume you should not crawl until fixed and/or interpret with care

Even when directed to "assume no crawl restrictions" it is likely reasonable and
polite to use a small fetch delay between requests.
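
A minimal sketch of this flow, assuming the `reqwest` crate (not a dependency of this
library) and its blocking client; the exact policy choices are illustrative only:

```ignore
// Hedged sketch: fetch `robots.txt` and map HTTP status codes as suggested above.
// `reqwest` and the policy decisions below are assumptions, not part of this crate.
fn fetch_robots_txt(robots_url: &str) -> Option<Vec<u8>> {
    let client = reqwest::blocking::Client::builder()
        .user_agent("FerrisCrawler/0.1 (https://ferris.rust/about-this-robot)")
        .build()
        .ok()?;
    // reqwest follows a reasonable number of 3xx redirects by default
    let response = client.get(robots_url).send().ok()?;
    match response.status().as_u16() {
        // 2xx: attempt to process the resulting payload
        200..=299 => response.bytes().ok().map(|b| b.to_vec()),
        // 429: retry after a reasonable amount of time (see Retry-After below)
        429 => None,
        // Other 4xx: assume no crawl restrictions (an empty robots.txt allows everything)
        400..=499 => Some(Vec::new()),
        // 5xx and anything else: assume you should not crawl until fixed
        _ => None,
    }
}
```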

### Always set a User Agent

For crawling `robots.txt` (and especially for crawling in general) you should
include a user agent in your request. Most crawling libraries allow setting the
user agent in a single line.

```ignore
ClientBuilder.new().user_agent("FerrisCrawler/0.1 (https://ferris.rust/about-this-robot)")...
```

Beyond respecting `robots.txt`, a good user agent provides a line of
communication between you and the webmaster.

## Beyond the `robots.txt` specification and general suggestions

`texting_robots` provides much of what you need for safe and respectful
crawling but is not a full solution by itself.

As an example, the HTTP error code 429 ([Too Many Requests][mozilla-tmr]) must be
tracked when requesting pages on a given site. When a 429 is seen the crawler
should slow down, even if it is already obeying the `Crawl-Delay` set in `robots.txt`,
potentially using the delay set by the server's [Retry-After][mozilla-ra] header.
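
A hedged sketch of combining `Crawl-Delay` with a 429 response's `Retry-After` header
(again assuming `reqwest`, and ignoring the HTTP-date form of the header):

```ignore
// Sketch only: choose the next delay in seconds for a host after a response.
// Assumes `robot` is a texting_robots::Robot and `response` a reqwest::blocking::Response.
fn next_delay_secs(robot: &texting_robots::Robot, response: &reqwest::blocking::Response) -> f32 {
    let crawl_delay = robot.delay.unwrap_or(1.0);
    if response.status().as_u16() == 429 {
        let retry_after = response
            .headers()
            .get("retry-after")
            .and_then(|value| value.to_str().ok())
            .and_then(|value| value.parse::<f32>().ok());
        // Honour Retry-After if present, otherwise double the usual delay as a fallback
        return retry_after.unwrap_or(crawl_delay * 2.0).max(crawl_delay);
    }
    crawl_delay
}
```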

An even more complex example is that multiple domains may be served by the same
backend web server. This is a common scenario for specific products or services
that host thousands or millions of domains. How you rate limit fairly using the
`Crawl-Delay` is entirely up to the end user (and potentially the service when
using HTTP error code 429 to rate limit traffic).

To protect against adverse input, users of Texting Robots are also advised to
follow [Google's recommendations][google-spec] and limit input to 500 kibibytes.
This is not yet done at the library level in case a larger input is desired, but
that may be revisited depending on community feedback.
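
A minimal sketch of applying that limit before parsing, where `body` is a hypothetical
`Vec<u8>` holding the fetched response:

```ignore
// Mirror Google's processing limit by truncating the body to 500 KiB.
// `body` is an assumed variable; only the slice handed to Robot::new changes.
const MAX_ROBOTS_BYTES: usize = 500 * 1024;
let body = &body[..body.len().min(MAX_ROBOTS_BYTES)];
let robot = texting_robots::Robot::new("FerrisCrawler", body)?;
```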

[mozilla-tmr]: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/429
[mozilla-ra]: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Retry-After

## Usage of Texting Robots in other languages

While no languages other than Rust are specifically supported yet, the
library was designed to support language integrations in the future. Battle
testing this interpretation of the `robots.txt` specification against the web is
easier done with friends!

A C API through Rust FFI should be relatively easy to provide given Texting Robots
only relies on strings, floats, and booleans. The lack of native fetching abilities
should ensure the library is portable across platforms, situations, and languages.
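
As an illustration only (no such API ships with the crate today), a thin C-compatible
wrapper might look roughly like the following; every symbol here is hypothetical:

```ignore
use std::ffi::CStr;
use std::os::raw::c_char;

// Hypothetical FFI sketch: a real binding would expose constructor/destructor
// functions so the Robot isn't re-parsed on every call, plus proper error reporting.
#[no_mangle]
pub unsafe extern "C" fn tr_allowed(
    agent: *const c_char,
    txt: *const u8,
    txt_len: usize,
    url: *const c_char,
) -> bool {
    let agent = CStr::from_ptr(agent).to_string_lossy();
    let txt = std::slice::from_raw_parts(txt, txt_len);
    let url = CStr::from_ptr(url).to_string_lossy();
    match texting_robots::Robot::new(&agent, txt) {
        Ok(robot) => robot.allowed(&url),
        // Failing closed on parse errors is a policy choice, not a crate guarantee
        Err(_) => false,
    }
}
```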

A proof of concept was performed in [WASI][wasi], the "WebAssembly System Interface",
showing that the library compiles happily and only experiences a 50% or 75% speed
penalty when used with the [Wasmer][wasmer] (LLVM backend) and [Wasmtime][wasmtime]
runtimes respectively. No optimizations have been done thus far and there's likely
low hanging fruit to reap.

See `wasi_poc.sh` for details.

[wasi]: https://wasi.dev/
[wasmer]: https://wasmer.io/
[wasmtime]: https://wasmtime.dev/

# Testing

To run the majority of core tests simply execute `cargo test`.

## Unit and Integration Tests

To check Texting Robots' behaviour against the `robots.txt` specification
almost all unit tests from [Google's C++ robots.txt parser][google-cpp] and
[Moz's reppy][moz-reppy] have been translated and included.

Certain aspects of the Google and Moz interpretations disagree with each other.
When this occurred the author deferred to as much common sense as they
were able to muster.

For a number of popular domains the `robots.txt` of the given domain was
saved and tests were written against it.

[google-cpp]: https://github.com/google/robotstxt
[moz-reppy]: https://github.com/seomoz/reppy

## Common Crawl Test Harness

To ensure that the `robots.txt` parser will not panic in real world situations
over 34 million `robots.txt` responses were passed through Texting Robots.
While this test doesn't guarantee the `robots.txt` files were handled correctly
it does show the parser is unlikely to panic in practice.

Many problematic, invalid, outrageous, and even adversarial `robots.txt`
examples were discovered in this process.

For full details see [the Common Crawl testing harness][cc-test].

[cc-test]: https://github.com/Smerity/texting_robots_cc_test

## Fuzz Testing

In the local `fuzz` directory is a fuzz testing harness. The harness is not
particularly sophisticated but does gain a low level of structure awareness
through [dictionary guided fuzzing][dgf]. The harness has already
revealed one low-level unwrap panic.

To run:

```bash
cargo fuzz run fuzz_target_1 -- -max_len=512 -dict=keywords.dict
```

Note:

- `cargo fuzz` requires nightly (i.e. run `rustup default nightly` in the `fuzz` directory)
- If you have multiple processors you may wish to add `--jobs N` after `cargo fuzz run`

[dgf]: https://llvm.org/docs/LibFuzzer.html#dictionaries

## Code Coverage with Tarpaulin

This project uses [Tarpaulin](https://github.com/xd009642/tarpaulin) to perform
code coverage reporting. Given the relatively small surface area of the parser
and Robot struct the coverage is high. Unit testing is, however, more important
for ensuring behavioural correctness.

To get line numbers for uncovered code run:

```bash
cargo tarpaulin --ignore-tests -v
```

*/

use core::fmt;

use bstr::ByteSlice;

use percent_encoding::{utf8_percent_encode, AsciiSet, CONTROLS};

use thiserror::Error;
use url::{ParseError, Position, Url};

mod minregex;
use minregex::MinRegex as RobotRegex;

#[cfg(test)]
mod test;

#[cfg(test)]
mod test_repcpp;

#[cfg(test)]
mod test_get_robots_url;

mod parser;
use crate::parser::{robots_txt_parse, Line};

#[derive(Error, Debug)]
pub enum Error {
    /// On any error encountered while parsing `robots.txt` this error will
    /// be returned.
    ///
    /// Note: Parsing errors should be rare as the parser is highly forgiving.
    #[error("Failed to parse robots.txt")]
    InvalidRobots,
}

fn percent_encode(input: &str) -> String {
    // Paths outside ASCII must be percent encoded
    const FRAGMENT: &AsciiSet =
        &CONTROLS.add(b' ').add(b'"').add(b'<').add(b'>').add(b'`');
    utf8_percent_encode(input, FRAGMENT).to_string()
}

/// Construct the URL for `robots.txt` when given a base URL from the
/// target domain.
///
/// # Errors
///
/// If there are any issues in parsing the URL, a [ParseError][pe] from the
/// [URL crate](url) will be returned.
///
/// ```rust
/// use texting_robots::get_robots_url;
///
/// let robots_url = get_robots_url("https://example.com/abc/file.html").unwrap();
/// assert_eq!(robots_url, "https://example.com/robots.txt");
/// ```
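///
/// URLs that cannot host a `robots.txt`, such as non-HTTP(S) schemes, return an error:
///
/// ```rust
/// use texting_robots::get_robots_url;
///
/// assert!(get_robots_url("ftp://example.com/file.html").is_err());
/// ```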
///
/// [pe]: ParseError
pub fn get_robots_url(url: &str) -> Result<String, ParseError> {
    let parsed = Url::parse(url);
    match parsed {
        Ok(mut url) => {
            if url.cannot_be_a_base() {
                return Err(ParseError::SetHostOnCannotBeABaseUrl);
            }

            if url.scheme() != "http" && url.scheme() != "https" {
                // EmptyHost isn't optimal but I'd prefer to re-use errors
                return Err(ParseError::EmptyHost);
            }

            // Setting username to "" removes the username and password
            if !url.username().is_empty() {
                url.set_username("").unwrap();
            }
            if url.password().is_some() {
                url.set_password(None).unwrap();
            }

            match url.join("/robots.txt") {
                Ok(robots_url) => Ok(robots_url.to_string()),
                Err(e) => Err(e),
            }
        }
        Err(e) => Err(e),
    }
}

#[allow(dead_code)]
pub struct Robot {
    // Rules are stored in the form of (regex rule, allow/disallow)
    // where the regex rule is ordered by original pattern length
    rules: Vec<(RobotRegex, bool)>,
    /// The delay in seconds between requests.
    /// If `Crawl-Delay` is set in `robots.txt` this will be `Some(f32)`
    /// and otherwise `None`.
    pub delay: Option<f32>,
    /// Any sitemaps found in the `robots.txt` file are added to this vector.
    /// According to the `robots.txt` specification a sitemap found in `robots.txt`
    /// is accessible and available to any bot reading `robots.txt`.
    pub sitemaps: Vec<String>,
}

impl fmt::Debug for Robot {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.debug_struct("Robot")
            .field("rules", &self.rules)
            .field("delay", &self.delay)
            .field("sitemaps", &self.sitemaps)
            .finish()
    }
}

impl Robot {
    /// Construct a new Robot object specifically processed for the given user agent.
    /// All rules relevant to that user agent are extracted from `robots.txt` and stored
    /// internally. If the user agent isn't found in `robots.txt` we default to `*`.
    ///
    /// Note: The agent string is lowercased before comparison, as required by the
    /// `robots.txt` specification.
    ///
    /// # Errors
    ///
    /// If there are difficulties parsing, which should be rare as the parser is quite
    /// forgiving, then an [InvalidRobots](Error::InvalidRobots) error is returned.
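    ///
    /// # Example
    ///
    /// A bot not listed in `robots.txt` falls back to the rules given for `*`:
    ///
    /// ```rust
    /// use texting_robots::Robot;
    ///
    /// let r = Robot::new("UnknownBot", b"User-Agent: *\nDisallow: /private").unwrap();
    /// assert_eq!(r.allowed("/private"), false);
    /// assert_eq!(r.allowed("/public"), true);
    /// ```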
    pub fn new(agent: &str, txt: &[u8]) -> Result<Self, anyhow::Error> {
        // Replace '\x00' with '\n'
        // This shouldn't be necessary but some websites are strange ...
        let txt = txt
            .iter()
            .map(|x| if *x == 0 { b'\n' } else { *x })
            .collect::<Vec<u8>>();

        // Parse robots.txt using the nom library
        let lines = match robots_txt_parse(&txt) {
            Ok((_, lines)) => lines,
            Err(e) => {
                let err = anyhow::Error::new(Error::InvalidRobots)
                    .context(e.to_string());
                return Err(err);
            }
        };

        // All agents are case insensitive in `robots.txt`
        let agent = agent.to_lowercase();
        let mut agent = agent.as_str();

        // Collect all sitemaps
        // Why? "The sitemap field isn't tied to any specific user agent and may be followed by all crawlers"
        let sitemaps = lines
            .iter()
            .filter_map(|x| match x {
                Line::Sitemap(url) => match String::from_utf8(url.to_vec()) {
                    Ok(url) => Some(url),
                    Err(_) => None,
                },
                _ => None,
            })
            .collect();

        // Filter out any lines that aren't User-Agent, Allow, Disallow, or CrawlDelay
        // CONFLICT: reppy's "test_robot_grouping_unknown_keys" test suggests these lines should be kept
        let lines: Vec<Line> = lines
            .iter()
            .filter(|x| !matches!(x, Line::Sitemap(_) | Line::Raw(_)))
            .copied()
            .collect();

        // Check if our crawler is explicitly referenced; otherwise we're the catch-all agent ("*")
        let references_our_bot = lines.iter().any(|x| match x {
            Line::UserAgent(ua) => {
                agent.as_bytes() == ua.as_bstr().to_ascii_lowercase()
            }
            _ => false,
        });
        if !references_our_bot {
            agent = "*";
        }

        // Collect only the lines relevant to this user agent
        // If there are no User-Agent lines then we capture all
        let mut capturing = false;
        if lines.iter().filter(|x| matches!(x, Line::UserAgent(_))).count()
            == 0
        {
            capturing = true;
        }
        let mut subset = vec![];
        let mut idx: usize = 0;
        while idx < lines.len() {
            let mut line = lines[idx];

            // User-Agents can be given in blocks with rules applicable to all User-Agents in the block
            // On a new block of User-Agents we're either in it or no longer active
            if let Line::UserAgent(_) = line {
                capturing = false;
            }
            while idx < lines.len() && matches!(line, Line::UserAgent(_)) {
                // Unreachable should never trigger as we ensure it's always a UserAgent
                let ua = match line {
                    Line::UserAgent(ua) => ua.as_bstr(),
                    _ => unreachable!(),
                };
                if agent.as_bytes() == ua.as_bstr().to_ascii_lowercase() {
                    capturing = true;
                }
                idx += 1;
                // If User-Agent lines run to the end of the file, stop to avoid reading past the end
                if idx == lines.len() {
                    break;
                }
                line = lines[idx];
            }

            if capturing {
                subset.push(line);
            }
            idx += 1;
        }

        // Collect the crawl delay
        let mut delay = subset
            .iter()
            .filter_map(|x| match x {
                Line::CrawlDelay(Some(d)) => Some(d),
                _ => None,
            })
            .copied()
            .next();

        // Special note for crawl delay:
        // Some robots.txt files have it at the top, before any User-Agent lines, to apply to all
        if delay.is_none() {
            for line in lines.iter() {
                if let Line::CrawlDelay(Some(d)) = line {
                    delay = Some(*d);
                }
                if let Line::UserAgent(_) = line {
                    break;
                }
            }
        }

        // Prepare the regex patterns for matching rules
        let mut rules = vec![];
        for line in subset
            .iter()
            .filter(|x| matches!(x, Line::Allow(_) | Line::Disallow(_)))
        {
            let (is_allowed, original) = match line {
                Line::Allow(pat) => (true, *pat),
                Line::Disallow(pat) => (false, *pat),
                _ => unreachable!(),
            };
            let pat = match original.to_str() {
                Ok(pat) => pat,
                Err(_) => continue,
            };

            // Paths outside ASCII must be percent encoded
            let pat = percent_encode(pat);

            let rule = RobotRegex::new(&pat);

            let rule = match rule {
                Ok(rule) => rule,
                Err(e) => {
                    let err = anyhow::Error::new(e)
                        .context(format!("Invalid robots.txt rule: {}", pat));
                    return Err(err);
                }
            };
            rules.push((rule, is_allowed));
        }

        Ok(Robot { rules, delay, sitemaps })
    }

    fn prepare_url(raw_url: &str) -> String {
        // Try to get only the path + query of the URL
        if raw_url.is_empty() {
            return "/".to_string();
        }
        // Note: If this fails we assume the passed URL is valid
        // i.e. We assume the user has passed us a valid relative URL
        let parsed = Url::parse(raw_url);
        let url = match parsed.as_ref() {
            // The Url library performs percent encoding
            Ok(url) => url[Position::BeforePath..].to_string(),
            Err(_) => percent_encode(raw_url),
        };
        url
    }

    /// Check if the given URL is allowed for the agent by `robots.txt`.
    /// This function returns true or false according to the rules in `robots.txt`.
    ///
    /// The provided URL can be absolute or relative depending on user preference.
    ///
    /// # Example
    ///
    /// ```rust
    /// use texting_robots::Robot;
    ///
    /// let r = Robot::new("Ferris", b"Disallow: /secret").unwrap();
    /// assert_eq!(r.allowed("https://example.com/secret"), false);
    /// assert_eq!(r.allowed("/secret"), false);
    /// assert_eq!(r.allowed("/everything-else"), true);
    /// ```
    pub fn allowed(&self, url: &str) -> bool {
        let url = Self::prepare_url(url);
        if url == "/robots.txt" {
            return true;
        }

        // Filter to only rules matching the URL
        let mut matches: Vec<&_> = self
            .rules
            .iter()
            .filter(|(rule, _)| rule.is_match(&url))
            .collect();

        // Sort according to the longest match and then by whether it's allowed
        // RobotRegex is sorted with preference going from longest to shortest
        // If there are two rules of equal length, allow and disallow, spec says allow
        matches.sort_by_key(|x| (&x.0, !x.1));

        match matches.first() {
            Some((_, is_allowed)) => *is_allowed,
            // If there are no rules we assume we're allowed
            None => true,
        }
    }
}