sqlite-robotstxt 0.0.1-alpha.2

# sqlite-robotstxt

A SQLite extension for parsing [`robots.txt`](https://en.wikipedia.org/wiki/Robots.txt) files. Based on [`sqlite-loadable-rs`](https://github.com/asg017/sqlite-loadable-rs) and the [`robotstxt` crate](https://docs.rs/robotstxt/latest/robotstxt/).

## Usage

Check whether a given User-Agent may access a given path under the rules of a `robots.txt` file.

```sql
select robotstxt_matches(
  readfile('robots.txt'),
  'My-Agent',
  '/path'
); -- 0 or 1
```
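The same call can be applied to several paths at once. The following is a minimal sketch, assuming `robotstxt_matches` behaves as an ordinary scalar function that can be evaluated per row; the `paths` list and the `matches` column alias are hypothetical and only serve the illustration.

```sql
-- Hypothetical list of paths to check; replace with your own.
with paths(path) as (
  values ('/'), ('/admin'), ('/search')
)
select
  path,
  -- Same three-argument call as above, evaluated once per row.
  robotstxt_matches(
    readfile('robots.txt'),
    'My-Agent',
    path
  ) as matches
from paths;
```

Each row should carry the same 0 or 1 value that the single-path call returns.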

Find all User-Agents listed in a `robots.txt` file.

```sql
select *
from robotstxt_user_agents(
  readfile('robots.txt')
);
/*
┌─────────────────────┬────────┬───────┐
│        name         │ source │ rules │
├─────────────────────┼────────┼───────┤
│ *                   │ 1      │       │
│ AdsBot-Google       │ 280    │       │
│ Twitterbot          │ 288    │       │
│ facebookexternalhit │ 295    │       │
└─────────────────────┴────────┴───────┘
*/
```

Use with `sqlite-http` to request `robots.txt` files on the fly.

```sql
select *
from robotstxt_user_agents(
  http_get_body('https://en.wikipedia.org/robots.txt')
)
limit 10;
/*
┌────────────────────────────┬────────┬───────┐
│            name            │ source │ rules │
├────────────────────────────┼────────┼───────┤
│ MJ12bot                    │ 11     │       │
│ Mediapartners-Google*      │ 15     │       │
│ IsraBot                    │ 19     │       │
│ Orthogaffe                 │ 22     │       │
│ UbiCrawler                 │ 27     │       │
│ DOC                        │ 30     │       │
│ Zao                        │ 33     │       │
│ sitecheck.internetseer.com │ 38     │       │
│ Zealbot                    │ 41     │       │
│ MSIECrawler                │ 44     │       │
└────────────────────────────┴────────┴───────┘
*/
```
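The fetched body can presumably be passed to `robotstxt_matches` in the same way. This sketch assumes both extensions are loaded and that `http_get_body` returns the `robots.txt` contents in a form `robotstxt_matches` accepts; the agent name and path are placeholders.

```sql
-- Fetch robots.txt over HTTP and test one agent/path pair.
select robotstxt_matches(
  http_get_body('https://en.wikipedia.org/robots.txt'),
  'My-Agent',      -- placeholder User-Agent
  '/wiki/SQLite'   -- placeholder path
); -- 0 or 1
```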

## TODO

- [ ] `robotstxt_allowed(rules, path)` overload on `robotstxt_user_agents`
- [ ] `robotstxt_rules`
- [ ] sitemaps?
- [ ] unknown directives?