The implementation of the robots.txt protocol (or URL exclusion protocol) with support for the crawl-delay, sitemap, and universal * match extensions (according to the RFC 9309 specification). Also check out other xwde projects.
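For orientation, those extensions look like this inside a robots.txt file (an illustrative hand-written snippet, not output of this crate):

User-Agent: *               # universal match: the group applies to any crawler
Crawl-Delay: 10             # crawl-delay extension: seconds to wait between requests
Disallow: /private/*.json   # the * wildcard inside a path pattern

Sitemap: https://example.com/sitemap.xml   # sitemap extension: absolute URL of a sitemap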
Examples
- parse the user-agent in the provided robots.txt file (see Robots):
use robotxt::Robots;
let txt = r#"
User-Agent: foobot
Disallow: *
Allow: /example/
Disallow: /example/nope.txt
"#.as_bytes();
let r = Robots::from_bytes(txt, "foobot");
assert!(r.is_allowed("/example/yeah.txt"));
assert!(!r.is_allowed("/example/nope.txt"));
assert!(!r.is_allowed("/invalid/path.txt"));
- build the new robots.txt file from the provided directives (see Factory):
use url::Url;
use robotxt::Factory;

let txt = Factory::default()
    .header("Robots.txt Header")
    .group(["foobot"], |u| {
        u.crawl_delay(5)
            .header("Rules for Foobot: Start")
            .allow("/example/yeah.txt")
            .disallow("/example/nope.txt")
            .footer("Rules for Foobot: End")
    })
    .group(["barbot", "nombot"], |u| {
        u.crawl_delay(2)
            .disallow("/example/yeah.txt")
            .disallow("/example/nope.txt")
    })
    .sitemap(Url::parse("https://example.com/sitemap_1.xml").unwrap())
    .sitemap(Url::parse("https://example.com/sitemap_2.xml").unwrap())
    .footer("Robots.txt Footer");

println!("{}", txt.to_string());
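Since the builder is rendered through to_string above, the generated file can be fed straight back into the parser; a small round-trip sketch reusing txt from the example above:

use robotxt::Robots;

let generated = txt.to_string();
let robots = Robots::from_bytes(generated.as_bytes(), "foobot");
assert!(robots.is_allowed("/example/yeah.txt"));
assert!(!robots.is_allowed("/example/nope.txt"));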
Links
- Request for Comments: 9309 on RFC-Editor.com
- Introduction to Robots.txt on Google.com
- How Google interprets Robots.txt on Google.com
- What is a Robots.txt file on Moz.com
Notes
- The parser is based on Smerity/texting_robots.
- The Host directive is not supported.
Structs
- The Factory struct represents a set of formatted user-agent groups that can be written to the generic writer in the robots.txt-compliant format.
- The Robots struct represents the set of directives related to the specific user-agent in the provided robots.txt file.
- The Section struct represents a single user-agent group.
Enums
- The AccessResult enum represents the result of the robots.txt retrieval attempt. See Robots::from_access.
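As a hedged sketch of how a retrieval result might feed the parser: the variant names below (Successful, Unavailable, Unreachable) and the from_access signature are assumptions drawn from the RFC 9309 terminology, so verify them against the AccessResult and Robots::from_access docs:

use robotxt::{AccessResult, Robots};

// Pretend the robots.txt request came back as 404 Not Found.
let status = 404;
let access = match status {
    200 => AccessResult::Successful(b"User-Agent: *\nDisallow: /private/"),
    400..=499 => AccessResult::Unavailable, // RFC 9309: "unavailable" means allow-all
    _ => AccessResult::Unreachable,         // RFC 9309: "unreachable" means disallow-all
};

let robots = Robots::from_access(access, "foobot");
assert!(robots.is_allowed("/private/data.txt")); // unavailable robots.txt grants full access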
Constants
- Google currently enforces a robots.txt file size limit of 500 kibibytes (KiB). See How Google interprets Robots.txt.
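In practice that limit matters when fetching: a sketch of truncating an oversized response before parsing (the 500 KiB value is hard-coded here as an assumption; prefer the crate's exported constant if one is available):

use robotxt::Robots;

// 500 KiB, matching the documented limit.
const LIMIT: usize = 500 * 1024;

// Stand-in for a fetched response body.
let body = b"User-Agent: *\nDisallow: /private/".to_vec();
let truncated = &body[..body.len().min(LIMIT)];
let robots = Robots::from_bytes(truncated, "foobot");
assert!(!robots.is_allowed("/private/index.html"));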