Quickly Mark & Kill
===================

Quickly Mark & Kill (henceforth, QMK) is [iocaine]'s built-in default configuration, which includes a default request handler and a small template. While nowhere near as advanced as [Nam-Shub of Enki][nsoe], it is a decent default, with room to grow. It is not meant to be complete or fine-tuned. It is meant to be a starting point: helpful and useful as it is, but easier to change or extend than Nam-Shub of Enki.

 [iocaine]: https://iocaine.madhouse-project.org/
 [nsoe]: https://git.madhouse-project.org/iocaine/nam-shub-of-enki

<details>
<summary>Table of Contents</summary>

- [Features](#features)
- [Usage](#usage)
- [Configuration](#configuration)
  - [Configuring iocaine](#configuring-iocaine)
  - [Configuring QMK](#configuring-qmk)
- [Metrics](#metrics)

</details>

## Features

- Supports sending robots in [ai.robots.txt] into the maze. (Requires configuration)
- Supports simple browser verification to route a lot of disguising bots into the maze.
- Includes a simple, configurable template.
- Metrics. (Optional, requires configuration)

 [ai.robots.txt]: https://github.com/ai-robots-txt/ai.robots.txt

## Usage

`iocaine start`

That's it. This is a default; it is *meant to be* simple to use. It starts up iocaine listening on `127.0.0.1:42069` with the built-in request handler.

## Configuration

There are two parts that can be configured: iocaine itself, and QMK. Both can be configured from the same file, mind you, just in different parts of it! In either case, to augment the default configuration rather than replace it, write your overrides into a KDL file, and point iocaine to the containing *directory*. Assuming the files are in, say, `config.d`, relative to iocaine's working directory:

```shellsession
# iocaine --config-path config.d start
```

To look at the default config, you can use `iocaine show config`. The `show config` command always shows the merged configuration: if you run `iocaine --config-path config.d show config`, it will show the configuration with the overrides in `config.d` applied.

It is also possible to look at *any* embedded file, via the `iocaine show embeds` command:

```shellsession
# iocaine show embeds --contents /defaults/config.kdl
// ...contents of the file...
```

Without the `--contents` argument, we get a list of filenames:

```shellsession
# iocaine show embeds '/defaults/*'
/defaults/config.kdl
...etc..
```

And with no arguments, it will list all files.

### Configuring iocaine

There isn't a whole lot to change here when it comes to the defaults, but we'll look at the options anyway! For example, to enable metrics, we'll need to spin up a new server, and tell the default server to use it. Drop the following snippet into `config.d/metrics.kdl`:

```kdl
prometheus-server default:metrics {
    bind "127.0.0.1:42042"
    //persist-path "/var/lib/iocaine/default.metrics.json"
}
http-server default {
    use metrics=default:metrics
}
```

But that is not all. You can change anything regarding the default server! We can bind it to an abstract unix domain socket, for example! That saves a bit of TCP overhead, and since it isn't on the file system, does not require permission games either.

```kdl
http-server default {
    bind "@iocaine.default.socket"
}
```
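
The same `bind` key also takes a TCP address, in the same form as the `prometheus-server` example above. A minimal sketch, assuming the hypothetical port `8080` is free on the loopback interface and that the override replaces the default `127.0.0.1:42069` listener:

```kdl
http-server default {
    // Hypothetical port, chosen only for illustration.
    bind "127.0.0.1:8080"
}
```

Running `iocaine --config-path config.d show config` afterwards is a quick way to check what the merged result looks like.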

The included request handler also supports HAProxy, but no server is spun up for it by default. We can change that by declaring one. Place the following into `config.d/haproxy.kdl`:

```kdl
haproxy-spoa-server default:spoa {
    bind "@iocaine.default-spoa.socket"
    use metrics=default:metrics handler-from=default
}
```

This will start an HAProxy SPOA server, using the same metrics instance, but a separate instance of the request handler. Wiring this up with HAProxy is left as an exercise for the reader.
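
Since the SPOA snippet refers to `default:metrics`, that metrics server presumably has to be declared somewhere in `config.d` as well. The sketch below simply puts the pieces shown so far into one file; the single-file layout is an assumption made for illustration, and keeping them in separate files should work just as well:

```kdl
// Metrics endpoint, shared by both servers below.
prometheus-server default:metrics {
    bind "127.0.0.1:42042"
}

// The default HTTP server, reporting to the shared metrics instance.
http-server default {
    use metrics=default:metrics
}

// HAProxy SPOA server: same metrics instance, handler taken from `default`.
haproxy-spoa-server default:spoa {
    bind "@iocaine.default-spoa.socket"
    use metrics=default:metrics handler-from=default
}
```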

Oh, and we can configure an initial seed, too. The purpose of an initial seed is to alter the generated randomness from time to time. Without a seed, the generated data will remain the same as long as the training sources and the template remain the same. With a seed, you can change that. Changing the seed requires a restart, and shouldn't be done too often, but doing it every once in a while helps: it introduces a bit of variety, and the bots that crawl the maze will be happy that they're not seeing static garbage! They're seeing dynamic garbage. Whee!

Anyway, the initial seed can be set either globally, or on a per-server level:

```kdl
initial-seed-file "/boot/grub/grub.cfg"
http-server default {
    initial-seed "Oceania was at war with Eastasia. Oceania had always been at war with Eastasia."
}
```

Using `initial-seed-file` tells iocaine to read the seed from said file. This can be used to externalize the seed.

### Configuring QMK

Most of the configuration knobs documented herein apply to QMK. All of these options should be placed within the `declare-handler default` block, like so:

```kdl
declare-handler default {
    // configuration comes here!
}
```

Having a number of snippets that all use this structure is supported: the keys will be merged. Let's start with configuring [ai.robots.txt]! Assuming we have its `robots.json` downloaded to `data/robots.json`, the following snippet (to be placed in `config.d/ai.robots.txt.kdl`, for example) will tell the request handler where to find it:

```kdl
declare-handler default {
    ai-robots-txt-path "data/robots.json"
}
```
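
To illustrate the merging mentioned above, here is a sketch with two separate files, each holding its own `declare-handler default` block (the `logging` key is described further down, and the file names are only examples). Going by the merging behaviour, the two should end up equivalent to the single block at the bottom:

```kdl
// config.d/ai.robots.txt.kdl
declare-handler default {
    ai-robots-txt-path "data/robots.json"
}

// config.d/logging.kdl
declare-handler default {
    logging
}

// Expected to behave like this one merged block:
declare-handler default {
    ai-robots-txt-path "data/robots.json"
    logging
}
```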

#### Sources

By default, iocaine will use its own source code (and this document, and the default config, and the request handler) as its source for training data and wordlist. This is simple, but the output is somewhat disappointing. You may wish to give the script something else to train on.

Once you have a good corpus, you can point the script at it via a snippet similar to the following (place it in, say, `config.d/sources.kdl`):

```kdl
declare-handler default {
    sources {
        training-corpus "/path/to/file1.txt" "/path/to/file2.txt" // ..etc
        wordlists "/path/to/file.txt" "/path/to/another.txt"
    }
}
```

#### Unwanted visitors

While gently guiding known and disguising crawlers into the maze will get us quite far, there are a number of other bots we may not wish to see join the gang in there. This can be easily arranged, with a quick drop into a file in `config.d`, like `config.d/unwanted-visitors.kdl`:

```kdl
declare-handler default {
    unwanted-visitors Perplexity GoogleBot
}
```

Just list whatever you want there! Do note that these are plain patterns, not regular expressions. If any of these strings is found in the `User-Agent` field, the visitor will find itself in the maze.
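
KDL lets simple values be written bare, but a pattern that contains spaces has to be quoted. A small sketch, where `Example Crawler` is a made-up pattern used purely for illustration:

```kdl
declare-handler default {
    // Bare and quoted strings can be mixed; "Example Crawler" is hypothetical.
    unwanted-visitors Perplexity GoogleBot "Example Crawler"
}
```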

#### Logging

If logging is enabled, QMK will log every request to standard output, in JSON format: various request properties (the request method, path, headers, and queries), along with the decision, and the ruleset responsible for the decision. Each request emits one line of JSON.

To enable it, drop the following into `config.d/logging.kdl`:

```kdl
declare-handler default {
    logging
}
```

#### Trusted Decision Header

When using QMK with HAProxy, where decision making and output generation are done in discrete steps, the current practice for channeling the decision to the output generation is to pass it as an HTTP header. HAProxy can make sure that the header never reaches iocaine from the outside, and that HAProxy itself is the one to set it. But we need to extract that header!

QMK's `decide()` function can do that. If the `trusted-decision-header` property is set in its config, that's the header it will check. If the header is set, `decide()` will short-circuit, and return the value of the header, without performing the rest of the decision making.

This makes it possible to use QMK both as the filter function, and as the garbage generator when using HAProxy.

```kdl
declare-handler default {
    trusted-decision-header "iocaine-decision"
}
```

Setting this property on a handler that is used in a server that isn't guarded against receiving this header from untrusted sources will leave a big door open: any client that sends the header gets to dictate the decision.

#### Garbage generation settings

There are a couple of knobs you can tweak to change how much garbage is generated. The example below is (hopefully) self-explanatory:

```kdl
declare-handler default {
    garbage {
        title {
            min-words 2
            max-words 15
        }
        paragraphs {
            min-count 1
            max-count 5
            min-words 10
            max-words 69
        }
        links {
            min-count 1
            max-count 8
            min-uri-parts 1
            max-uri-parts 2
            min-text-words 2
            max-text-words 5
            uri-separator "-"
        }
    }
}
```

#### Template

The built-in template is intentionally simple, and the request handler doesn't let you configure much about it. You can, however, change the template! Mind you, the template is purely for display. It can only work with garbage generated ahead of time. Nevertheless, you can still give it your own flair!

To change the template, you can use either the `template` key, to define the template inline, or the `template-file` key, to pull it from a file. As usual, place a small snippet into, say, `config.d/template.kdl`:

```kdl
declare-handler default {
    template-file "/path/to/a/file.html"
    template #"""
      <!doctype html>
      <!-- you can imagine the rest here -->
      """#
}
```

Apart from this, you can also control whether the HTML should be minified (it is minified by default):

```kdl
declare-handler default {
    minify #false
}
```
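
Since `template` and `template-file` are alternatives, a real configuration would presumably set only one of them. A sketch of a `config.d/template.kdl` that loads the template from a file (the path is a placeholder) and turns minification off, which may make the output easier to read while working on the markup:

```kdl
declare-handler default {
    // Placeholder path; point this at your own template.
    template-file "/path/to/a/file.html"
    // Keep the generated HTML readable while iterating on the template.
    minify #false
}
```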

## Metrics

When a `prometheus-server` is configured, and bound to the default server, the following metrics will be available (along with a number of default process metrics):

<dl>
  <dt><code>qmk_requests{host}</code></dt>
  <dd>

  The number of requests served, keyed by host.

  </dd>

  <dt><code>qmk_ruleset_hits{ruleset, outcome}</code></dt>
  <dd>

  Number of times a particular ruleset was hit, and its outcome. The outcome is either `garbage` or `default`, and the rulesets are `ai.robots.txt`, `major-browsers`, `unwanted-visitors`, or `default`.

  </dd>

  <dt><code>qmk_garbage_generated{host}</code></dt>
  <dd>

  Amount of garbage generated, in bytes, keyed by host.

  </dd>
</dl>