archive_is 0.2.0

# archive.is api exploration

### trying to archive a site
clicking "save" button with a url in the bar calls the following api:

**Request URL**
```
  Request URL: https://archive.ph/submit/?submitid=XsyYbX9a8aiFaUmVbhCCFgl1x0ukSnnULKgw52ZfM6JsmspLFezx51rSKUJnnmqg&url=https%3A%2F%2Fwww.theinformation.com%2Farticles%2Falphabets-google-and-deepmind-pause-grudges-join-forces-to-chase-openai%3Futm_source%3Dti_app
  Request Method: GET
  Status Code: 302 
  Remote Address: 91.193.43.144:443
  Referrer Policy: strict-origin-when-cross-origin
```
**Payload (decoded):**
  - **submitid:** XsyYbX9a8aiFaUmVbhCCFgl1x0ukSnnULKgw52ZfM6JsmspLFezx51rSKUJnnmqg
  - **url:** https://www.theinformation.com/articles/alphabets-google-and-deepmind-pause-grudges-join-forces-to-chase-openai?utm_source=ti_app

**Payload (encoded):**
 - **submitid:** XsyYbX9a8aiFaUmVbhCCFgl1x0ukSnnULKgw52ZfM6JsmspLFezx51rSKUJnnmqg
 - **url:** https%3A%2F%2Fwww.theinformation.com%2Farticles%2Falphabets-google-and-deepmind-pause-grudges-join-forces-to-chase-openai%3Futm_source%3Dti_app

**Response Headers**
```
  cache-control: private, no-cache, no-store, must-revalidate, maxage=0
  content-length: 0
  date: Thu, 30 Mar 2023 20:02:58 GMT
  expires: Sat, 01 Jan 2000 00:00:00 GMT
  location: https://archive.ph/x2vQs
  pragma: no-cache
  server: nginx
  x-host: p-archiveweb31
```
**Request Headers**
```
  :authority: archive.ph
  :method: GET
  :path: /submit/?submitid=XsyYbX9a8aiFaUmVbhCCFgl1x0ukSnnULKgw52ZfM6JsmspLFezx51rSKUJnnmqg&url=https%3A%2F%2Fwww.theinformation.com%2Farticles%2Falphabets-google-and-deepmind-pause-grudges-join-forces-to-chase-openai%3Futm_source%3Dti_app
  :scheme: https
  accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7
  accept-encoding: gzip, deflate, br
  accept-language: en,en-GB;q=0.9,en-US;q=0.8,en-CA;q=0.7,es-US;q=0.6,es;q=0.5,it-IT;q=0.4,it;q=0.3
  cookie: ga=GA1.2.661111166.1680206568
  referer: https://archive.ph/
  sec-ch-ua: "Google Chrome";v="111", "Not(A:Brand";v="8", "Chromium";v="111"
  sec-ch-ua-mobile: ?0
  sec-ch-ua-platform: "Linux"
  sec-fetch-dest: document
  sec-fetch-mode: navigate
  sec-fetch-site: same-origin
  sec-fetch-user: ?1
  upgrade-insecure-requests: 1
  user-agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36
```

afterwards, archive.is redirects the user to the url listed in the `location` field of the returned headers above.  This is the actual archive entry users will use.

### second run
running the save request a second time (this time without the `?utm_source` parameter in the requested url), we get:

**Request URL**
```
  Request URL: https://archive.is/submit/?submitid=ZZiwDA836TdRU7X0tKLjBaqeQRi6F%2Bae2rYWPATBD6BsmspLFezx51rSKUJnnmqg&url=https%3A%2F%2Fwww.theinformation.com%2Farticles%2Falphabets-google-and-deepmind-pause-grudges-join-forces-to-chase-openai
  Request Method: GET
  Status Code: 302 
  Remote Address: 91.193.43.144:443
  Referrer Policy: strict-origin-when-cross-origin
```
**Payload (decoded)**:
  - **submitid:** ZZiwDA836TdRU7X0tKLjBaqeQRi6F+ae2rYWPATBD6BsmspLFezx51rSKUJnnmqg
  - **url:** https://www.theinformation.com/articles/alphabets-google-and-deepmind-pause-grudges-join-forces-to-chase-openai

**Payload (encoded):**
  *- *submitid:** ZZiwDA836TdRU7X0tKLjBaqeQRi6F%2Bae2rYWPATBD6BsmspLFezx51rSKUJnnmqg
  - **url:** https%3A%2F%2Fwww.theinformation.com%2Farticles%2Falphabets-google-and-deepmind-pause-grudges-join-forces-to-chase-openai

**Response Headers**
```
  cache-control: private, no-cache, no-store, must-revalidate, maxage=0
  content-length: 0
  date: Thu, 30 Mar 2023 20:12:14 GMT
  expires: Sat, 01 Jan 2000 00:00:00 GMT
  location: https://archive.is/GkZPl
  pragma: no-cache
  server: nginx
  x-host: p-archiveweb31
```
**Request Headers**
```
  :authority: archive.is
  :method: GET
  :path: /submit/?submitid=ZZiwDA836TdRU7X0tKLjBaqeQRi6F%2Bae2rYWPATBD6BsmspLFezx51rSKUJnnmqg&url=https%3A%2F%2Fwww.theinformation.com%2Farticles%2Falphabets-google-and-deepmind-pause-grudges-join-forces-to-chase-openai
  :scheme: https
  accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7
  accept-encoding: gzip, deflate, br
  accept-language: en,en-GB;q=0.9,en-US;q=0.8,en-CA;q=0.7,es-US;q=0.6,es;q=0.5,it-IT;q=0.4,it;q=0.3
  cookie: a=GA1.2.661111166.1680206997
  referer: https://archive.is/
  sec-ch-ua: "Google Chrome";v="111", "Not(A:Brand";v="8", "Chromium";v="111"
  sec-ch-ua-mobile: ?0
  sec-ch-ua-platform: "Linux"
  sec-fetch-dest: document
  sec-fetch-mode: navigate
  sec-fetch-site: same-origin
  sec-fetch-user: ?1
  upgrade-insecure-requests: 1
  user-agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36
```
### third run (on website that's never been archived)
**Request URL**
```
Request URL: https://archive.is/submit/?submitid=ly1WQYvPxnOtd%2Fd9OJdEu7b7QwwdcZq%2FvfxsGZh1LjBsmspLFezx51rSKUJnnmqg&url=https%3A%2F%2Fgithub.com%2Funsafeoats
Request Method: GET
Status Code: 200 
Remote Address: 91.193.43.144:443
Referrer Policy: strict-origin-when-cross-origin
```
**Response Headers**
```
  accept-ranges: bytes
  cache-control: private, no-cache, no-store, must-revalidate, maxage=0
  content-encoding: gzip
  content-length: 244
  content-type: text/html;charset=utf-8
  date: Thu, 30 Mar 2023 20:55:39 GMT
  expires: Sat, 01 Jan 2000 00:00:00 GMT
  pragma: no-cache
  refresh: 0;url=https://archive.is/wip/WJW3i
  server: nginx
  x-host: p-archiveweb31
```
**Request Headers**
```
  authority: archive.is
  :method: GET
  :path: /submit/?submitid=ly1WQYvPxnOtd%2Fd9OJdEu7b7QwwdcZq%2FvfxsGZh1LjBsmspLFezx51rSKUJnnmqg&url=https%3A%2F%2Fgithub.com%2Funsafeoats
  :scheme: https
  accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7
  accept-encoding: gzip, deflate, br
  accept-language: en,en-GB;q=0.9,en-US;q=0.8,en-CA;q=0.7,es-US;q=0.6,es;q=0.5,it-IT;q=0.4,it;q=0.3
  cookie: ga=GA1.2.661111166.1680209721
  referer: https://archive.is/
  sec-ch-ua: "Google Chrome";v="111", "Not(A:Brand";v="8", "Chromium";v="111"
  sec-ch-ua-mobile: ?0
  sec-ch-ua-platform: "Linux"
  sec-fetch-dest: document
  sec-fetch-mode: navigate
  sec-fetch-site: same-origin
  sec-fetch-user: ?1
  upgrade-insecure-requests: 1
  user-agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36
```
### questions

1) where does the `?submitid` field in the `/submit` route come from?  is it just random or does it need to be deterministic?
    - answer found: this is a unique code generated every time someone opens the archive homepage.  if you hit `https://archive.is` with a get request, the html returned will have the following section:
      ```html
      <input type="hidden" name="submitid" value="HtIfTUPGq8UZJNXhb+0DdVFW+6OMcgLs2Ma5Ijrb/5BsmspLFezx51rSKUJnnmqg"/>
      ```
    -  the `submitid` can be extracted from here.

### general workflow for archiving site

1) send get request to `https://archive.is" and extract submitid value`
2) send get request to `https://archive.is/submit/?submitid={encoded submitid}&url={encoded url}`
3) check response headers to see if `location` or `refresh` is present
  - if `location` is present, extract `location` and return it's value
    * returned `status codes` observed in this situation have all been in range [302,] so far
  - if `location` is not present but `refresh` is (with pattern `0;url=https://archive.is/wip/{new archive identifier}`), extract the future url (`https://archive.is/{new archive identifier}`) and return it
    * return `status codes` observed in this situation have all been in range [200,] so far
    * can also monitor the `wip` version of the url found above and wait until it returns a `location` field before returning the url