# HQL-Lite

## Overview

HQL-Lite is a query language derived from Hanskens full HQL human. HQL stands for Hansken Query Language and can be
used to search or match traces. Since not all elements of full HQL can be used in the context of an extraction,
extraction plugins use HQL-Lite, a lightweight version of HQL. This document describes the usage of HQL-Lite in the
context of extraction plugins.

.. _howdoeshanskenwork:

## How does Hansken work?

- Let's say we have a Hansken image `hansken_image1` with 10 pdf files, and 5 jpegs.
- And our Hansken contains 2 tools:
  - PdfPlugin
  - JpegTool

.. note:: All plugins are Hansken tools, but not all Hansken tools are plugins. Some tools are included in Hansken core.

Let's look at a (simplified) pseudocode example of the inner workings of Hansken:

```python
    for each trace in new_traces {
        for each datastream in trace {
            for each tool in hansken_tools {
                if tool.can_this_tool_process_the_provided_trace(trace, datastream) {
                    tool.process_the_trace(trace, datastream)
                }
            }
        }
    }
```

So in this example we know the following:

- `new_traces` has
  - 10 pdf files
  - 5 jpeg files
- `hansken_tools` contains:
  - PdfPlugin
  - JpegTool

So the question here is, how do we prevent that traces are not processed by incompatible tools?

The answer is the `tool.can_this_tool_process_the_provided_trace()` part of the pseudocode.

### What does `can_this_tool_process_the_provided_trace()` do?

Hansken actually contains many more tools/plugins than these 2, and instead of 15 files/traces, we usually deal with
millions.

.. note:: If each trace has 1 extra second of overhead, 1 million traces would take 11.5 days of extra CPU time

#### Matchers to the rescue

To reduce the unnecessary overhead of processing all traces (even the ones the tool cannot actually process), Hansken
implements the concept of a `matcher` for each tool. This _matcher_ basically checks the _trace_ for _"matching
conditions"_, that would allow the tool to process it.

Sometimes these _matching conditions_ can be as simple as a specific `filename` or `extension`, but are often more
elaborate in the sense that they check multiple factors that require some intimate knowledge of Hansken.

## What is HQL-Lite?

HQL-Lite is a language based on HQL (Hansken Query Language) that allows plugin developers to write _matchers_ for
Hansken Extraction Plugins. It could be said that HQL-Lite contains a subset of HQL features, plus some HQL-Lite unique
features that are only interesting for _matchers_.

.. note::
  Please note that even though the HQL-Lite query is part of the plugin, it is compiled and stored in Hansken during
  startup to achieve performance.

### Why not just use HQL for plugins?

HQL was designed to search for traces stored in the Elasticsearch database. As such, some of its features are tightly
coupled to the Elasticsearch implementation, making it difficult to re-implement them for plugins.

Also, even though HQL is more complex than the requirements for _matching_ in plugins, a couple of minor features that
are absolutely necessary for _matching_ are not implemented in HQL, as they don't make much sense from a search point of
view. This is because HQL was designed to be used with _finished extractions_ with all the traces stored in the
database, while HQL-Lite was designed for _active extractions_.

.. _hqllite syntax:

### HQL-Lite syntax

| Matcher  | Syntax                 | remarks                                                                                                |
|----------|------------------------|--------------------------------------------------------------------------------------------------------|
| All      | `""`                   | an empty string translates to match for __all__ traces                                                 |
| And      | `foo:1 AND bar:2`      | the case-sensitive `AND` operator behaves like a logical AND of 2 conditions                           |
| Not      | `NOT foo` or `-foo`    | the case-sensitive `NOT` or `-` negates the expression that follows                                    |
| Range    | `foo>1` or `1<=foo<10` | a numbered-range check with a min or/and max range(s)                                                  |
| Or       | `foo:1 OR bar:2`       | the case-sensitive `OR` operator behaves like a logical OR of 2 conditions                             |
| Data     | `$data.foo:1`          | see `$data` section below                                                                              |
| DataType | `$data.type:raw`       | this query matches against the type of the current datastream                                          |
| Types    | `type:email`           | this query checks if the trace contains a certain trace type as defined in the Hansken trace model     |

There are also a couple of general guidelines that apply to all matchers:

- Equals/not equals:
  - `:` or `=` : The most basic of left equals right statements. note that `=` is also valid.
  - `!=` : The opposite of equals, not equals. Note that `!:` is __NOT__ supported.
- Wildcards:
  - `?` : Match against any single character. E.g. `foo:r?w` will match against `raw, row` but not against `rowing`.
  - `*` : Match against any chars. E.g. `foo:r*` will match against `r, ra, raw, raaaaaw` but not against `aw`.
- Exact match: By surrounding a value with quotes, we tell the parser that it is a single value. This is especially
  helpful for values that might contain separators. E.g. `foo:'hello hql-lite'`.
- CSV: Currently only the `type` query supports multiple values to check against. E.g. `type:email,chatMessage` will only
  return `true` if both types exist for this trace.
- `()` grouping: You can group statements by putting brackets around them. E.g. `foo:1 AND (bar:2 OR bla:3)` which
  translates to `foo:1` plus one of the statements in the brackets.
- Escaping `\"\\ .\t\r\n:=><!()~/,[]{}`: Some characters are used internally by HQL-Lite, and need to be escaped if they
  are used in the value side of the key-value pair. These values can be escaped by adding prepending `\\` to the
  character(s). Example: `foo:foo bar` should be `foo:foo\\ bar`, `foo.bar:foo:bar` should be `foo.bar:foo\\:bar`
  ...etc.
  - The only exceptions to this rule are **unix paths**: 
    - Acceptable paths:
      - `foo:/`
      - `foo:bar/baz`
      - `foo:/bar/baz`
      - `foo:'/bar/baz/he llo'`
      - `foo:*bar/baz*`
    - Unacceptable paths:
      - `foo:/bar/` -> this is the regex matcher, which is unsupported in HQL-lite 
      - `foo:c:\`   -> should be `foo:c\:\\`, both the `colon` and the `slash` need to be escaped
      - `foo:'c:\'` -> should be `foo:'c:\\'`, the `slash` still needs to be escaped
        - .. note:: the backslash is the universal escape character, so it **always** needs to be escaped. 

#### $data matchers

In Hansken, a trace can have multiple :ref:`datastreams <datastreams>`. The exact content of said datastreams is
discussed elsewhere, but the basic idea is that a trace can have multiple representations. For example, a trace might
have a `raw` datastream, but after we identify that the raw bytes contain a __text__ file, we might add a separate
datastream `text`.

.. note::
  The `process()` method of each plugin is called for each datastream of each trace. This is explained
  in :ref:`How does Hansken work? <howdoeshanskenwork>` . Subsequently, you might have the same property for a
  different datastream. For example: you might have a `data.raw.size` and a `data.text.size` property. The reason you
  might have the same property multiple times, is because it could have a different meaning.

For example:

- data.raw.size: is the size in bytes
- data.text.size: is the number of bytes in the text representation of the raw stream

If we want to check if either of these properties is not empty by using a `$data` matcher, we do:

```text
  $data.size>0
```

##### When is it useful to use a $data matcher?

For example, there is a simple plugin called `LetterCountPlugin`, that counts the letters in text based datastreams.

So to match on these text based datastreams, we have 2 choices:

- List all the possibilities
  - Which is too tedious, and not very flexible when new types are supported
- Match on a common property
  - More compact, but sometimes difficult to find a common property

In this case we might match on mimeType, which we know is `text/plain` or `text/x-log` for 2 of types we want to match:

```text
  $data.mimeType=text\\/*
```

This will match the following:

- `data.text.mimeType=text\\/plain`
- `data.text.mimeType=text\\/not\\ plain`
- `data.pdf.mimeType=text\\/encoded`
- `data.foo.mimeType=text\\/bar`

But will __not__ match any of the following:

- `data.text.mimeType=txt`
- `data.text.mimeType=pdf`
- `data.text.mime=text\\/plain`
- `data.foo.bar=text\\/plain`

## How to write a matcher?

The functional requirements for writing a matcher can be summarized in the following:

1. What does my plugin expect as input?
2. How can I describe that input with the information Hansken provides?

### PdfPlugin example

Let's say we just finished writing a `PdfPlugin`. This is a simple plugin that checks if pdf files contain the
word `the`.

So let's go over our checklist:

#### _What does my plugin expect as input?_

PDF files.

#### _How can I describe that input with the information Hansken provides?_

Hansken consumes and produces :ref:`Traces <traces>`. To that effect, we can only match on trace properties that are
available in Hansken.

##### Match on extension

The easiest way would be to only allow traces with the `.pdf` extension. Looking at the :ref:`Hansken trace model` (or a
Hansken extraction), we can see that there's a property `file`
which contains a property `extension`.

So what would that look like in HQL-lite? Something like

```text
  file.extension=pdf
```

.. warning:: This of course **only** works if the file has the correct extension (note that matchers are case-sensitive).

So what do we do, if we also want to match pdf files that are (un)intentionally misnamed?

##### Match on mime-type

Looking at Wikipedia, we see that `pdf` has a couple of mime-types. In return looking at our extraction and the
trace-model, we see both at `data.raw.mimeType`, with a further explanation in the :ref:`Hansken trace model` that
the `raw` portion of the property is the __data type__ of the datastream.

If we don't know which datastream has the `mimeType` property beforehand, we could use the broad-scoped `$data.` matcher
to look at every datastream.

So our matcher becomes:

```text
  file.extension=pdf OR
  (
    $data.mimeType=application\\/pdf OR
    $data.mimeType=application\\/x-pdf
  )
```

##### Match on data size

Some pdf files can be huge, meaning that parsing them will need a lot of resources. Could we add a data size check to
the matcher? According to the :ref:`Hansken trace model` `data` has a property `size` (similar to `mimeType`) that we
could use for this.

.. note:: This is also a good way to check if a file is empty or not.

Let's say our cutoff limit is 1 MB, meaning our matcher becomes:

```text
  0 < $data.size < 1000000 AND
  (
    file.extension=pdf OR
    (
      $data.mimeType=application\\/pdf OR
      $data.mimeType=application\\/x-pdf
    )
  )
```

##### Match if 'property is set'

It is not uncommon to have some overlap between tools/plugins. For example:

- PdfPlugin: a plugin that only supports pdf documents
- DocumentPlugin: this plugin supports a lot of document types, including `pdf`.

So how would we prevent our plugin from processing a trace that has already been processed by the `DocumentPlugin`?

The easiest solution would be to check if a certain property has already been set. Meaning, that if both plugins set
the `foo.bar` property, we check if said property has already been set.

So we __only__ process the trace if `foo.bar` is __empty__, meaning our matcher becomes:

```text
  foo.bar!=* AND
  0 < $data.size < 1000000 AND
  (
    file.extension=pdf OR
    (
      $data.mimeType=application\\/pdf OR
      $data.mimeType=application\\/x-pdf
    )
  )
```

##### Match on excluding a certain path

It is also not uncommon to exclude certain paths from your plugin. These paths might contain invalid or encrypted files,
for example.

So let's say we want to exclude all files under in the `/tmp/virus` path. How do we go about it?

Again, we check our extraction/:ref:`Hansken trace model`, and we see that `file.path` looks promising.

So excluding `/tmp/virus` would look something like:

```text
  -file.path=/tmp/virus* AND
  foo.bar!=* AND
  0 < $data.size < 1000000 AND
  (
    file.extension=pdf OR
    (
      $data.mimeType=application\\/pdf OR
      $data.mimeType=application\\/x-pdf
    )
  )
```

.. _hql datastreams:

##### Match on specific datastream type, an anti-pattern

.. warning:: Matching on specific datastream types is an anti-pattern! It is not recommended to match on a datastream 
   type. Instead one should match on other datastream properties, such as `fileType`, `mimeType` or `mimeClass`. 
   The reason for this is explained in the paragraph below.

Using a matcher that is too loose or too tight can yield unexpected results. Usually one will match on properties
of a datastream like `fileType`, `mimeType` or `mimeClass` as these are reliable properties that have been added by 
Hansken tools. Matching on a specific datastream says nothing about the type of file. For example a PDF file may be 
available in a `raw` as well as in a `decrypted` datastream. By matching on the datastream type one may exclude traces 
that were not intended to be excluded. 
Contrarily, note that matching on a datastream type may include *more* traces than you expected as well. For example, 
someone may think "Plugin A puts data on the `plain` datastream, so I'll match on the `plain` datastream with Plugin B", 
forgetting that `plain` may be used by other tools as well. In other words, there may be traces with that datastream 
type that you did not know of, potentially crashing your plugin. See :ref:`datastreams` for more information.

Now that you know why it is an anti-pattern, lets explain how it would be done (for those edge cases where it's needed): 
Lets say we want our `PdfPlugin` to __ONLY__ process `raw` datastreams. 
The best way to do this would be to match
on `$data.type:raw`. Note that `$data.type` matches against the type of the current datastream, so in this case it
matches only when the current datastream is of type `raw`.

An __incorrect__ way to do it would be to replace `$data.` matcher(s) with `data.raw.`. This means the matcher
will match whenever a trace has this datastream type, even if the current datastream type is different.
Remember that the `process` method of an extraction plugin is always called once for each datastream on each trace.
For example, lets say a trace has two datastreams, `raw` and `text`. The matcher would match for both the datastreams
because the trace has a `raw` datastream (even though the current datastream type may be `text`). This results in the 
`process` method being called twice (for `raw` and for `text`), which may lead to other bugs if the developer doesn't
know this. For example, the second time the plugin may be trying to overwrite data on a trace which is prohibited. 

So, using `$data.type`, our matcher would look like:

```text
  $data.type:raw AND
  -file.path=/tmp/virus* AND
  foo.bar!=* AND
  0 < $data.size < 1000000 AND
  (
    file.extension=pdf OR
    (
      $data.mimeType=application\\/pdf OR
      $data.mimeType=application\\/x-pdf
    )
  )
```

## How precise should a matcher be?

In practice, only you as the plugin dev can answer this question.

Know that from the point of view of Hansken, we only care that the plugin:

- __Should not crash__: If a matcher does not compile, then your plugin will not be available in Hansken. Tip: be sure
  to test your plugin with the :ref:`test framework <test_framework>`.
- __Should not be slow__: Matching is designed to be extremely fast, but of course, if you make it too complex it can
  take longer than we want. In the example above, we calculated that 1 second extra for 1 million traces is 11 days of
  extra CPU time. Unlike processing, matching is done for __every trace__, in every extraction iteration, so be careful!
- __Should match on the bare minimum__: Don't go too far by matching 50 different criteria before allowing a trace to be
  processed. Note that a lot of (if not all) of these criteria depend on properties set by other tools, and you don't
  really have any control on how these tools work.