ieql 0.2.7 - Docs.rs

# IEQL Specification (v0.1.0)

This document specifies the format of IEQL queries. For an overview of IEQL, please see [README.md](README.md).

## Data Format

IEQL queries must be valid RON objects. They must have the root keys `triggers`, `scope`, `threshold`, and `response` (all other root keys are optional). Note that the following is not valid RON; instead, it is meant to provide a primer on the general structure of an IEQL query.

```ron
Query (
  triggers: <array of trigger objects>,
  scope: (
    pattern: <pattern object>,
    content: <the type of content to match>,
  ),
  threshold: (
    considers: <array of trigger IDs and/or other compositions>,
    requires: <number of considered items required for a match>,
  ),
  response: (
    kind: <`Full` or `Partial`>,
    include: <array of items to include>,
  ),
)
```

> **Why RON?**
> 
> We use RON for IEQL queries because it is type-explicit and verbose. Remember — you shouldn't be writing IEQL by hand! Instead, you should be using a tool to generate IEQL queries for you.

#### Pattern Object

**Pattern objects** are used for RegEx-like pattern matching throughout IEQL. They must have two keys: `content` and `kind`. `content` is the content to match (either a RegEx query or raw text), and `kind` defines whether the parser should treat `content` as RegEx or as raw text. (`kind` may be one of `Raw` or `RegEx`.)

An example pattern object is the following:

```ron
Pattern (
  content: "something.+",
  kind: RegEx,
)
```

#### Trigger Object

A **trigger object** must have two keys: `pattern` and `id`. `pattern` must be a valid pattern object. `id` (string) is the unique ID assigned to the trigger; the `threshold` refers to triggers by this ID.

An example trigger object might look like the following:

```ron
Trigger (
  pattern: (
    content: "IEQL is ([ A-Za-z]+)(awesome|incredible)",
    kind: RegEx
  ),
  id: "awesomeQuery"
)
```
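
Because thresholds refer to triggers by `id`, an implementation may want to reject queries in which two triggers share an ID. The following Rust sketch shows one possible check; `find_duplicate_ids` is a hypothetical helper written for this document, not part of any IEQL implementation's API.

```rust
use std::collections::HashSet;

/// Returns every trigger ID that appears more than once, each reported once,
/// in the order the duplicates are first encountered.
/// `find_duplicate_ids` is a hypothetical helper for illustration only.
fn find_duplicate_ids<'a>(ids: &[&'a str]) -> Vec<&'a str> {
    let mut seen = HashSet::new();
    let mut duplicates = Vec::new();
    for &id in ids {
        // `insert` returns false when the ID was already present.
        if !seen.insert(id) && !duplicates.contains(&id) {
            duplicates.push(id);
        }
    }
    duplicates
}
```

A query whose triggers are `"A"`, `"B"`, and `"A"` would be reported as having the duplicate ID `"A"`.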

#### Scope

The **scope** defines the type of data that will be fed to the triggers. The **scope** must be a valid RON object with two keys: `pattern` and `content`.

**`pattern`** must be a valid pattern object; it is matched against the URL of each document. For example, for an IEQL query to be performed only on `.net` or `.org` domains, `pattern` would be a RegEx pattern object whose content is `(https?:\/\/)?(www\.)?[a-z0-9-]+\.(net|org)(\.[a-z]{2,3})?`. It seems complicated, but that's the power of RegEx! To match all URLs, set `pattern` to the RegEx pattern object whose content is `.+`.

**`content`** may be one of `Raw` or `Text`. (Additional values are possible; refer to your implementation's source code for more information about the available content types.) `Raw` means that the triggers will be fed the raw document content. `Text` means that the triggers will be fed a cleaned version of the document containing only its text. This functionality is only available for some content types; if the text cannot be extracted, the triggers will be given the equivalent of `Raw`.

An example scope definition might be as follows:

```ron
Scope (
  pattern: (
    content: ".+",
    kind: RegEx,
  ),
  content: Text
)
```

This scope would consider all Internet documents to be in scope, and would feed the preprocessed text to the triggers.
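
The fallback from `Text` to `Raw` can be sketched in Rust as follows. This is a minimal illustration under stated assumptions: the names `ScopeContent`, `extract_text`, and `content_for_triggers` are hypothetical, the MIME check is invented for the example, and the tag stripper is deliberately naive rather than a real HTML parser.

```rust
/// The two content types named in the specification.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ScopeContent {
    Raw,
    Text,
}

/// Hypothetical, deliberately naive text extraction: drops anything between
/// `<` and `>` for HTML documents and gives up on every other MIME type.
fn extract_text(mime: &str, raw: &str) -> Option<String> {
    if mime != "text/html" {
        return None;
    }
    let mut text = String::new();
    let mut in_tag = false;
    for ch in raw.chars() {
        match ch {
            '<' => in_tag = true,
            '>' => in_tag = false,
            _ if !in_tag => text.push(ch),
            _ => {}
        }
    }
    Some(text)
}

/// Chooses what the triggers are fed. `Text` falls back to the raw
/// document when no text can be extracted.
fn content_for_triggers(content: ScopeContent, mime: &str, raw: &str) -> String {
    match content {
        ScopeContent::Raw => raw.to_string(),
        ScopeContent::Text => extract_text(mime, raw).unwrap_or_else(|| raw.to_string()),
    }
}
```

With this sketch, an HTML document under a `Text` scope reaches the triggers without its tags, while a document of any other type reaches them unchanged.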

#### Threshold

The **threshold** defines which triggers are required in order for the document to be a match. The threshold object must have the root keys `considers`, `requires`, and optionally `inverse`.

**`considers`** (array) lists the various triggers and/or other threshold objects that should be considered. Triggers are identified by their IDs (strings), while other thresholds are themselves valid threshold objects. In this way, a threshold object may itself contain other threshold objects. (See below for an example.)

**`requires`** (integer) defines the minimum number of objects listed in `considers` that must evaluate to `true` in order for the threshold to be met (and the IEQL query to match). For an `OR`-like relationship between `considers`, `requires` should be `1`. For an `AND`-like relationship, `requires` should be the total number of objects in `considers`. If `requires` is greater than the number of objects in `considers`, the threshold will never be met; conversely, if `requires` is `0`, the threshold will _always_ be met.

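**`inverse`** (boolean) determines whether the result is inverted: when `inverse` is `true`, the threshold is met only if fewer than `requires` of the objects in `considers` evaluate to `true`. With the inclusion of `inverse`, _any boolean expression can be expressed in IEQL_.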

If three triggers are available, `A`, `B`, and `C`, we can define a threshold that matches whenever at least one of them matches as follows:

```ron
Threshold (
  considers: [Trigger("A"), Trigger("B"), Trigger("C")],
  requires: 1,
  inverse: false,
)
```

Alternatively, if we want _all_ triggers to be required for a match, we can write the following threshold:

```ron
Threshold (
  considers: [Trigger("A"), Trigger("B"), Trigger("C")],
  requires: 3,
  inverse: false,
)
```

If we want _no_ triggers to match—i.e., a match is defined as whenever _no triggers_ are activated—we could define the following threshold:

```ron
Threshold (
  considers: [Trigger("A"), Trigger("B"), Trigger("C")],
  requires: 1,
  inverse: true,
)
```

Finally, if we want to define a threshold in which `A` _must_ match _and_ either `B` or `C` (or both) must match, we can write the following threshold:

```ron
Threshold (
  considers: [Trigger("A"), NestedThreshold (
    Threshold (
      considers: [
        Trigger("B"),
        Trigger("C"),
      ],
      requires: 1,
      inverse: false,
    )
  )],
  requires: 2,
  inverse: false,
)
```

Threshold composition is very powerful!
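
To make the semantics of `considers`, `requires`, and `inverse` concrete, here is a minimal Rust sketch of a recursive evaluator. It is an illustration of the rules described in this section, not the code of any particular IEQL implementation; the type names `Consideration` and `Threshold` and the function `evaluate` are assumptions chosen to mirror the RON examples.

```rust
use std::collections::HashSet;

/// One entry in `considers`: a trigger ID or a nested threshold.
enum Consideration {
    Trigger(String),
    NestedThreshold(Threshold),
}

/// Mirrors the three keys of a threshold object.
struct Threshold {
    considers: Vec<Consideration>,
    requires: usize,
    inverse: bool,
}

/// Evaluates a threshold against the set of trigger IDs that matched.
fn evaluate(threshold: &Threshold, matched: &HashSet<&str>) -> bool {
    // Count how many considered items evaluate to true.
    let satisfied = threshold
        .considers
        .iter()
        .filter(|item| match item {
            Consideration::Trigger(id) => matched.contains(id.as_str()),
            Consideration::NestedThreshold(nested) => evaluate(nested, matched),
        })
        .count();
    // The threshold is met when enough items are satisfied;
    // `inverse` flips that result.
    (satisfied >= threshold.requires) != threshold.inverse
}
```

Under this sketch, the nested example above is met when `A` and `C` match, but not when only `A` matches, because the inner threshold then contributes nothing toward the outer `requires: 2`.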

#### Response

The **response** defines the type of data that the IEQL query will return. It must have two keys: `kind` and `include`.

**`kind`** is either `Full` or `Partial`. `Full` indicates that each match of the IEQL query should be its own IEQL response, and no MapReduce-style operations should be performed on it. (In other words, it should be piped directly into the database.) `Partial` indicates that the IEQL query should return an IEQL partial response, which can then be aggregated.

**`include`** specifies the information that should be included in the IEQL response. It is an array of include items. The following items are supported:

* `Excerpt` (full only)
* `Url` (full only)
* `Domain`
* `Mime`
* `FullContent`
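
The "full only" restriction in the list above can be checked mechanically. The Rust sketch below is one possible way to do so; the enum names and the function `invalid_includes` are hypothetical and chosen only to mirror the values used in this specification.

```rust
/// The two response types from the specification.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ResponseKind {
    Full,
    Partial,
}

/// The supported include items.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ResponseItem {
    Excerpt,
    Url,
    Domain,
    Mime,
    FullContent,
}

/// Returns the requested items that are not allowed for the given response
/// type. `Excerpt` and `Url` are marked "full only" in the list above.
fn invalid_includes(kind: ResponseKind, include: &[ResponseItem]) -> Vec<ResponseItem> {
    include
        .iter()
        .copied()
        .filter(|item| {
            kind == ResponseKind::Partial
                && matches!(item, ResponseItem::Excerpt | ResponseItem::Url)
        })
        .collect()
}
```

An empty result means that every requested item is compatible with the response type.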

#### Example Full Query

```ron
Query (
    response: (
        kind: Full,
        include: [
            Excerpt,
            Url,
        ],
    ),
    scope: (
        pattern: (
            content: ".+",
            kind: RegEx,
        ),
        content: Raw,
    ),
    threshold: (
        considers: [
            Trigger("A"),
            NestedThreshold((
                considers: [
                    Trigger("B"),
                    Trigger("C"),
                ],
                requires: 1,
                inverse: false,
            )),
        ],
        requires: 2,
        inverse: false,
    ),
    triggers: [
        (
            pattern: (
                content: "hello",
                kind: RegEx,
            ),
            id: "A",
        ),
        (
            pattern: (
                content: "everyone",
                kind: RegEx,
            ),
            id: "B",
        ),
        (
            pattern: (
                content: "around",
                kind: RegEx,
            ),
            id: "C",
        ),
    ],
    id: Some("Test Trigger #1"),
)
```

Minified, this query would look like this:

```ron
(response:(kind:Full,include:[Excerpt,Url,],),scope:(pattern:(content:".+",kind:RegEx,),content:Raw,),threshold:(considers:[Trigger("A"),NestedThreshold((considers:[Trigger("B"),Trigger("C"),],requires:1,inverse:false,)),],requires:2,inverse:false,),triggers:[(pattern:(content:"hello",kind:RegEx,),id:"A",),(pattern:(content:"everyone",kind:RegEx,),id:"B",),(pattern:(content:"around",kind:RegEx,),id:"C",),],id:Some("Test Trigger #1"),)
```

#### Filetype

IEQL queries are typically stored in files with the extension `.ieql`. Using query files that _do not_ end in `.ieql` will likely cause the IEQL interpreter to emit a warning.
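
As a rough illustration, a loader could perform the extension check like this. The function names are hypothetical, and the exact, case-sensitive comparison is an assumption; the specification does not say how an interpreter decides when to warn.

```rust
use std::path::Path;

/// Returns true when the path ends in the conventional `.ieql` extension.
/// The comparison is exact and case-sensitive (an assumption of this sketch).
fn has_ieql_extension(path: &Path) -> bool {
    path.extension().and_then(|ext| ext.to_str()) == Some("ieql")
}

/// Hypothetical loader hook: warns on stderr but does not reject the file.
fn warn_if_unconventional(path: &Path) {
    if !has_ieql_extension(path) {
        eprintln!("warning: {} does not end in .ieql", path.display());
    }
}
```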

## Philosophy

As you implement or compose IEQL queries, keep the following in mind:

- All the triggers are fed the same data; triggers may not have different scopes or preprocessing steps. If triggers require different scopes, they should not be part of the same IEQL query.
- All queries should be able to identify the specific part(s) of the document that caused them to trigger.
- You should _never_ allow untrusted entities to create IEQL queries. If you are running IEQL queries on your own server, be careful about which queries you run; a malicious query could severely slow down your scanning infrastructure.