unobtanium-crawler 3.0.0

# Database schema for an unnamed crawler

## Data structures

### Origin

An origin uniquely identifies a web service. It consists of a scheme (usually http or https), a domain name, and a port number. Because the port is part of the origin, it should be useful for your own private services too.
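As a minimal sketch of how such an origin could be derived from a URL, the following Python uses only the standard library. The names `Origin`, `origin_from_url`, and `DEFAULT_PORTS` are hypothetical and not part of the crawler. Filling in a default port when the URL omits one is an assumption, not something this document specifies.

```python
from typing import NamedTuple
from urllib.parse import urlsplit

# Assumption: only the two schemes named in the text get a default port.
DEFAULT_PORTS = {"http": 80, "https": 443}


class Origin(NamedTuple):
    """Hypothetical in-memory form of an origin: scheme, domain name, port."""
    scheme: str
    domain_name: str
    port: int


def origin_from_url(url: str) -> Origin:
    """Derive the (scheme, domain name, port) triple from an absolute URL."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    if not scheme or parts.hostname is None:
        raise ValueError(f"not an absolute URL: {url!r}")
    # urlsplit returns port=None when the URL carries no explicit port.
    port = parts.port if parts.port is not None else DEFAULT_PORTS.get(scheme)
    if port is None:
        raise ValueError(f"no port given and no default known for {scheme!r}")
    # parts.hostname is already lowercased by urlsplit.
    return Origin(scheme, parts.hostname, port)
```

With this sketch, `https://example.org/a` and `https://example.org:443/b` map to the same origin, while `http://localhost:8080/` gets its own.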

## Database

### Origins

* origin_id
* scheme
* domain_name
* port
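
One possible way to realise this table, sketched with Python's built-in `sqlite3` module. The column types, the `UNIQUE` constraint, and the helper `get_or_create_origin_id` are assumptions for illustration. The scheme column is spelled `scheme` here, following the Origin definition.

```python
import sqlite3

# Assumed column types; the document only lists the column names.
ORIGINS_SCHEMA = """
CREATE TABLE IF NOT EXISTS origins (
    origin_id   INTEGER PRIMARY KEY,
    scheme      TEXT    NOT NULL,
    domain_name TEXT    NOT NULL,
    port        INTEGER NOT NULL,
    UNIQUE (scheme, domain_name, port)
);
"""


def get_or_create_origin_id(conn: sqlite3.Connection,
                            scheme: str, domain_name: str, port: int) -> int:
    """Return the origin_id for the triple, inserting a row if it is new."""
    key = (scheme, domain_name, port)
    # The UNIQUE constraint turns a repeated insert into a no-op.
    conn.execute(
        "INSERT OR IGNORE INTO origins (scheme, domain_name, port) "
        "VALUES (?, ?, ?)",
        key,
    )
    row = conn.execute(
        "SELECT origin_id FROM origins "
        "WHERE scheme = ? AND domain_name = ? AND port = ?",
        key,
    ).fetchone()
    return row[0]
```

The uniqueness constraint keeps the table in line with the idea that an origin identifies a web service exactly once.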

### Ratelimit

Maps an origin to ratelimit information (crawl delay, time of the last request)
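
A rough sketch of how these two values could be combined to decide when the next request to an origin is allowed. The function name `seconds_until_next_request` and the choice to pass the current time in explicitly are assumptions made for the example.

```python
from datetime import datetime, timedelta
from typing import Optional


def seconds_until_next_request(last_request: Optional[datetime],
                               crawl_delay: timedelta,
                               now: datetime) -> float:
    """Seconds to wait before the origin may be contacted again."""
    if last_request is None:
        # No request has been sent to this origin yet.
        return 0.0
    remaining = (last_request + crawl_delay) - now
    # Never return a negative wait once the delay has already passed.
    return max(0.0, remaining.total_seconds())
```

A worker could call this before sending a request and sleep for the returned number of seconds.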

### Requests

Contains information on ongoing and completed requests

* request_id
* worker_id
* origin_id
* url
* result (unreachable, timeout, request successful)
* time_request_sent
* request_duration
* command_id

### Commands

* command_id
* url
* command (check, discover, index, preview, robotstxt, …)
* causal_parent_command_id
* causal_parent_request_id
* time_requested
* time_finished
* requesting_worker_id
* executing_worker_id
* status (waiting, running, finished, failed, …)
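
To illustrate how the `status` and `executing_worker_id` columns might work together, here is a sketch in which a worker claims the oldest waiting command. The column types, integer worker ids, ISO 8601 timestamps stored as text, and the helper `claim_next_command` are all assumptions. The document does not say how commands are scheduled.

```python
import sqlite3
from typing import Optional

# Assumed column types; only the column names come from the list above.
COMMANDS_SCHEMA = """
CREATE TABLE IF NOT EXISTS commands (
    command_id               INTEGER PRIMARY KEY,
    url                      TEXT NOT NULL,
    command                  TEXT NOT NULL,
    causal_parent_command_id INTEGER REFERENCES commands(command_id),
    causal_parent_request_id INTEGER,
    time_requested           TEXT NOT NULL,
    time_finished            TEXT,
    requesting_worker_id     INTEGER,
    executing_worker_id      INTEGER,
    status                   TEXT NOT NULL DEFAULT 'waiting'
);
"""


def claim_next_command(conn: sqlite3.Connection,
                       worker_id: int) -> Optional[int]:
    """Mark the oldest waiting command as running and return its id."""
    with conn:  # commits on success, rolls back on an exception
        row = conn.execute(
            "SELECT command_id FROM commands WHERE status = 'waiting' "
            "ORDER BY time_requested, command_id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        # The status check guards against another worker claiming it first.
        updated = conn.execute(
            "UPDATE commands SET status = 'running', executing_worker_id = ? "
            "WHERE command_id = ? AND status = 'waiting'",
            (worker_id, row[0]),
        )
        return row[0] if updated.rowcount == 1 else None
```

Ordering by `time_requested` gives first-in, first-out processing, which is only one of several reasonable policies.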

### File Results

* file_id
* http_status_code // or HTTP equivalent
* request_id
* mimetype
* filesize
* date_fetched
* canonical_url
* date_created
* date_last_modified

### Events

* file_id
* date
* event_type (file_updated, file_published, date_of_event_represented_by_file)

### Content

* file_id
* index // number that orders the entries by occurrence
* text
* context (body, article, main, footer, header, metadata)
* element_type (title, description, headline, paragraph)
* element_level
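
As an illustration of how a small page might be broken into Content rows, the sketch below models the rows as plain Python objects. The class `ContentRow`, the helper `text_in_document_order`, and the sample values are hypothetical. Treating `element_level` as the heading depth (and 0 for non-headings) is an assumption, since the document does not define it.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ContentRow:
    file_id: int
    index: int          # orders the entries by occurrence in the file
    text: str
    context: str        # e.g. body, article, main, footer, header, metadata
    element_type: str   # e.g. title, description, headline, paragraph
    element_level: int  # assumption: heading depth, 0 for non-headings


def text_in_document_order(rows: List[ContentRow], file_id: int) -> List[str]:
    """Return the text entries of one file, sorted by their index."""
    own = [r for r in rows if r.file_id == file_id]
    return [r.text for r in sorted(own, key=lambda r: r.index)]


# Made-up sample rows, deliberately stored out of order.
SAMPLE_ROWS = [
    ContentRow(1, 2, "First paragraph.", "article", "paragraph", 0),
    ContentRow(1, 0, "Example page", "metadata", "title", 0),
    ContentRow(1, 1, "Welcome", "article", "headline", 1),
    ContentRow(2, 0, "Another file", "metadata", "title", 0),
]
```

Sorting by `index` recovers the reading order of a file regardless of how the rows are stored.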

### Links

* file_id
*