HQL-Lite

Overview

HQL-Lite is a query language derived from Hansken's full HQL human. HQL stands for Hansken Query Language and can be used to search or match traces. Because not all elements of full HQL can be used in the context of an extraction, extraction plugins use HQL-Lite, a lightweight version of HQL. This document describes how to use HQL-Lite in extraction plugins.

How does Hansken work?

Let’s say we have a Hansken image hansken_image1 with 10 pdf files and 5 jpeg files.
Our Hansken instance contains 2 tools:
- PdfPlugin
- JpegTool

Note

All plugins are Hansken tools, but not all Hansken tools are plugins. Some tools are included in Hansken core.

Let’s look at a (simplified) pseudocode example of the inner workings of Hansken:

for each trace in new_traces {
    for each datastream in trace {
        for each tool in hansken_tools {
            if tool.can_this_tool_process_the_provided_trace(trace, datastream) {
                tool.process_the_trace(trace, datastream)
            }
        }
    }
}

So in this example we know the following:

new_traces has
- 10 pdf files
- 5 jpeg files
hansken_tools contains:
- PdfPlugin
- JpegTool

So the question here is: how do we prevent traces from being processed by incompatible tools?

The answer is the tool.can_this_tool_process_the_provided_trace() part of the pseudocode.
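The pseudocode above can be turned into a small runnable sketch. Everything in it is hypothetical and only mirrors the example: the `Trace` and `ExtensionTool` classes, the file names, and the idea that a tool decides by file extension are assumptions made for illustration, not Hansken's actual implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Trace:
    """Hypothetical trace: a name plus the datastream types it carries."""
    name: str
    datastreams: list = field(default_factory=lambda: ["raw"])


class ExtensionTool:
    """Hypothetical tool that accepts traces by file extension."""

    def __init__(self, name, extension):
        self.name = name
        self.extension = extension
        self.processed = []

    def can_process(self, trace, datastream):
        # Stand-in for can_this_tool_process_the_provided_trace()
        return trace.name.endswith(self.extension)

    def process(self, trace, datastream):
        # Stand-in for process_the_trace(): only record what was handled
        self.processed.append((trace.name, datastream))


def run_extraction(new_traces, hansken_tools):
    """Offer every datastream of every trace to every tool."""
    for trace in new_traces:
        for datastream in trace.datastreams:
            for tool in hansken_tools:
                if tool.can_process(trace, datastream):
                    tool.process(trace, datastream)


new_traces = [Trace(f"doc{i}.pdf") for i in range(10)]
new_traces += [Trace(f"img{i}.jpeg") for i in range(5)]
pdf_plugin = ExtensionTool("PdfPlugin", ".pdf")
jpeg_tool = ExtensionTool("JpegTool", ".jpeg")

run_extraction(new_traces, [pdf_plugin, jpeg_tool])
print(len(pdf_plugin.processed), len(jpeg_tool.processed))  # 10 5
```

Each tool only ever sees the traces its check accepts, which is the role the matcher plays in the rest of this document.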

What does `can_this_tool_process_the_provided_trace()` do?

Hansken actually contains many more tools/plugins than these 2, and instead of 15 files/traces, we usually deal with millions.

Note

If each trace incurs 1 extra second of overhead, 1 million traces add roughly 11.6 days of extra CPU time (1,000,000 seconds divided by 86,400 seconds per day).
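The estimate in the note is plain arithmetic: the total overhead in seconds divided by the number of seconds in a day. A minimal sketch, where the function name `overhead_days` is made up for this example:

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def overhead_days(trace_count, overhead_seconds_per_trace):
    """Total extra CPU time in days for a given per-trace overhead."""
    return trace_count * overhead_seconds_per_trace / SECONDS_PER_DAY


print(round(overhead_days(1_000_000, 1.0), 1))  # 11.6
```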

Matchers to the rescue

To reduce the unnecessary overhead of processing all traces (even the ones the tool cannot actually process), Hansken implements the concept of a matcher for each tool. The matcher checks the trace for “matching conditions” that would allow the tool to process it.

Sometimes these matching conditions can be as simple as a specific filename or extension, but are often more elaborate in the sense that they check multiple factors that require some intimate knowledge of Hansken.

What is HQL-Lite?

HQL-Lite is a language based on HQL (Hansken Query Language) that allows plugin developers to write matchers for Hansken Extraction Plugins. It could be said that HQL-Lite contains a subset of HQL features, plus some HQL-Lite unique features that are only interesting for matchers.

Note

Please note that even though the HQL-Lite query is part of the plugin, it is compiled and stored in Hansken during startup to improve performance.

Why not just use HQL for plugins?

HQL was designed to search for traces stored in the Elasticsearch database. As such, some of its features are tightly coupled to the Elasticsearch implementation, making it difficult to re-implement them for plugins.

Also, even though HQL is more complex than the requirements for matching in plugins, a couple of minor features that are absolutely necessary for matching are not implemented in HQL, as they don’t make much sense from a search point of view. This is because HQL was designed to be used with finished extractions with all the traces stored in the database, while HQL-Lite was designed for active extractions.

HQL-Lite syntax

Matcher	Syntax	Remarks
All	`""`	an empty string matches all traces
And	`foo:1 AND bar:2`	the case-sensitive `AND` operator behaves like a logical AND of 2 conditions
Not	`NOT foo` or `-foo`	the case-sensitive `NOT` operator, or `-`, negates the expression that follows
Range	`foo>1` or `1<=foo<10`	a numeric range check with a minimum and/or maximum bound
Or	`foo:1 OR bar:2`	the case-sensitive `OR` operator behaves like a logical OR of 2 conditions
Data	`$data.foo:1`	see the `$data` matchers section below
DataType	`$data.type:raw`	matches against the type of the current datastream
Types	`type:email`	checks whether the trace contains a certain trace type as defined in the Hansken trace model

There are also a couple of general guidelines that apply to all matchers:

Equals/not equals:
- : or = : The most basic statement: left equals right. Both : and = are valid.
- != : The opposite of equals, not equals. Note that !: is NOT supported.
Wildcards:
- ? : Match against any single character. E.g. foo:r?w will match against raw and row, but not against rowing.
- * : Match against any sequence of characters, including none. E.g. foo:r* will match against r, ra, raw, raaaaaw but not against aw.
Exact match: By surrounding a value with quotes, we tell the parser that it is a single value. This is especially helpful for values that might contain separators. E.g. foo:'hello hql-lite'.
CSV: Currently only the type query supports multiple values to check against. E.g. type:email,chatMessage will only return true if both types exist for this trace.
() grouping: You can group statements by putting brackets around them. E.g. foo:1 AND (bar:2 OR bla:3) which translates to foo:1 plus one of the statements in the brackets.
Escaping \"\.\t\r\n:=><!()~/,[]{}: Some characters are used internally by HQL-Lite, and need to be escaped if they are used in the value side of the key-value pair. These characters can be escaped by prepending \\ to the character(s). Example: foo:foo bar should be foo:foo\\ bar, foo.bar:foo:bar should be foo.bar:foo\\:bar, etc.
- The only exceptions to this rule are unix paths:
  - Acceptable paths:
    - foo:/
    - foo:bar/baz
    - foo:/bar/baz
    - foo:'/bar/baz/he llo'
    - foo:*bar/baz*
  - Unacceptable paths:
    - foo:/bar/ -> this is the regex matcher, which is unsupported in HQL-lite
    - foo:c:\ -> should be foo:c\:\\, both the colon and the backslash need to be escaped
    - foo:'c:\' -> should be foo:'c:\\', the backslash still needs to be escaped
      - Note: the backslash is the universal escape character, so it always needs to be escaped.
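The wildcard rules listed above can be illustrated with a short sketch that translates a pattern into a regular expression. This only models the documented semantics of ? and * on an already unescaped value; it is not how Hansken implements matching, and the helper names `wildcard_to_regex` and `wildcard_match` are made up for this example.

```python
import re


def wildcard_to_regex(pattern):
    """Translate ? (one character) and * (zero or more characters) to a regex."""
    parts = []
    for char in pattern:
        if char == "?":
            parts.append(".")
        elif char == "*":
            parts.append(".*")
        else:
            # Every other character must match literally
            parts.append(re.escape(char))
    return re.compile("".join(parts), re.DOTALL)


def wildcard_match(pattern, value):
    """Case-sensitive match of the whole value against the pattern."""
    return wildcard_to_regex(pattern).fullmatch(value) is not None


print([wildcard_match("r?w", v) for v in ("raw", "row", "rowing")])
# [True, True, False]
print([wildcard_match("r*", v) for v in ("r", "ra", "raw", "raaaaaw", "aw")])
# [True, True, True, True, False]
```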

$data matchers

In Hansken, a trace can have multiple datastreams. The exact content of said datastreams is discussed elsewhere, but the basic idea is that a trace can have multiple representations. For example, a trace might have a raw datastream, but after we identify that the raw bytes contain a text file, we might add a separate datastream text.

Note

The process() method of each plugin is called for each datastream of each trace, as explained in How does Hansken work?. Consequently, the same property can exist for several datastreams: for example, a trace might have both a data.raw.size and a data.text.size property. The same property can appear multiple times because it can have a different meaning per datastream.

For example:

data.raw.size: is the size in bytes
data.text.size: is the number of bytes in the text representation of the raw stream

If we want to check if either of these properties is not empty by using a $data matcher, we do:

$data.size>0
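One way to picture this is to read `$data.size` as "the size property of whichever datastream is being looked at". The sketch below assumes a trace is a flat dictionary of property names, which is a simplification made for this example; the function name `matching_datastreams` is hypothetical as well.

```python
def matching_datastreams(properties, data_property, predicate):
    """Return the datastream types whose data.<type>.<data_property> passes the predicate."""
    matches = []
    for name, value in properties.items():
        parts = name.split(".")
        # Only names of the form data.<type>.<data_property> are considered
        if len(parts) == 3 and parts[0] == "data" and parts[2] == data_property:
            if predicate(value):
                matches.append(parts[1])
    return matches


trace_properties = {
    "data.raw.size": 2048,
    "data.text.size": 0,
    "file.extension": "txt",
}

# Models $data.size>0
print(matching_datastreams(trace_properties, "size", lambda size: size > 0))  # ['raw']
```

In this sample trace only the raw datastream satisfies the condition, because the text datastream is empty.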

When is it useful to use a $data matcher?

For example, there is a simple plugin called LetterCountPlugin that counts the letters in text-based datastreams.

So to match on these text-based datastreams, we have 2 choices:

List all the possibilities
- Which is too tedious, and not very flexible when new types are supported
Match on a common property
- More compact, but sometimes difficult to find a common property

In this case we might match on mimeType, which we know is text/plain or text/x-log for 2 of the types we want to match:

$data.mimeType=text\\/*

This will match the following:

data.text.mimeType=text\\/plain
data.text.mimeType=text\\/not\\ plain
data.pdf.mimeType=text\\/encoded
data.foo.mimeType=text\\/bar

But will not match any of the following:

data.text.mimeType=txt
data.text.mimeType=pdf
data.text.mime=text\\/plain
data.foo.bar=text\\/plain
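The two lists above can be reproduced with a small sketch. It uses Python's `fnmatch.fnmatchcase` as a stand-in for the `*` wildcard and works on unescaped values, since the escaping is only part of the query syntax. The helper name `data_property_matches` and the flat (name, value) representation are assumptions for this example.

```python
from fnmatch import fnmatchcase


def data_property_matches(name, value, wanted, pattern):
    """True if name is data.<any type>.<wanted> and value fits the pattern."""
    parts = name.split(".")
    if len(parts) != 3 or parts[0] != "data" or parts[2] != wanted:
        return False
    return fnmatchcase(value, pattern)


properties = [
    ("data.text.mimeType", "text/plain"),
    ("data.text.mimeType", "text/not plain"),
    ("data.pdf.mimeType", "text/encoded"),
    ("data.foo.mimeType", "text/bar"),
    ("data.text.mimeType", "txt"),
    ("data.text.mimeType", "pdf"),
    ("data.text.mime", "text/plain"),
    ("data.foo.bar", "text/plain"),
]

# Models $data.mimeType=text/*
results = [data_property_matches(n, v, "mimeType", "text/*") for n, v in properties]
print(results)
# [True, True, True, True, False, False, False, False]
```

The last two entries fail because the property name differs, even though the value itself would fit the pattern.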

How to write a matcher?

The functional requirements for writing a matcher can be summarized in the following:

What does my plugin expect as input?
How can I describe that input with the information Hansken provides?

PdfPlugin example

Let’s say we just finished writing a PdfPlugin. This is a simple plugin that checks if pdf files contain the word the.

So let’s go over our checklist:

What does my plugin expect as input?

PDF files.

How can I describe that input with the information Hansken provides?

Hansken consumes and produces traces. As a result, we can only match on trace properties that are available in Hansken.

Match on extension

The easiest way would be to only allow traces with the .pdf extension. Looking at the Hansken trace model (or a Hansken extraction), we can see that there’s a property file which contains a property extension.

So what would that look like in HQL-lite? Something like

file.extension=pdf

Warning

This of course only works if the file has the correct extension (note that matchers are case-sensitive).

So what do we do, if we also want to match pdf files that are (un)intentionally misnamed?

Match on mime-type

Looking at Wikipedia, we see that pdf has a couple of mime-types. In turn, looking at our extraction and the trace model, we find both at data.raw.mimeType; the Hansken trace model further explains that the raw portion of the property name is the data type of the datastream.

If we don’t know which datastream has the mimeType property beforehand, we could use the broad-scoped $data. matcher to look at every datastream.

So our matcher becomes:

file.extension=pdf OR
(
  $data.mimeType=application\\/pdf OR
  $data.mimeType=application\\/x-pdf
)

Match on data size

Some pdf files can be huge, meaning that parsing them will need a lot of resources. Could we add a data size check to the matcher? According to the Hansken trace model data has a property size (similar to mimeType) that we could use for this.

Note

This is also a good way to check if a file is empty or not.

Let’s say our cutoff limit is 1 MB, meaning our matcher becomes:

0 < $data.size < 1000000 AND
(
  file.extension=pdf OR
  (
    $data.mimeType=application\\/pdf OR
    $data.mimeType=application\\/x-pdf
  )
)
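To make the grouping explicit, the matcher above can be read as the following boolean predicate. This is only a sketch of the logic, assuming the size is expressed in bytes and that 1 MB means 1,000,000 bytes as in the matcher; the function `matches_pdf` and its parameters are hypothetical.

```python
PDF_MIME_TYPES = {"application/pdf", "application/x-pdf"}


def matches_pdf(extension, mime_type, size, max_size=1_000_000):
    """Mirror of: 0 < size < max AND (extension=pdf OR mimeType is a pdf type)."""
    size_ok = 0 < size < max_size
    looks_like_pdf = extension == "pdf" or mime_type in PDF_MIME_TYPES
    return size_ok and looks_like_pdf


print(matches_pdf("pdf", None, 500))               # True: extension matches
print(matches_pdf("txt", "application/pdf", 500))  # True: misnamed pdf
print(matches_pdf("pdf", "application/pdf", 0))    # False: empty datastream
print(matches_pdf("pdf", None, 1_000_000))         # False: at the cutoff
print(matches_pdf("PDF", "text/plain", 500))       # False: case-sensitive
```

The size condition applies to both alternatives inside the brackets, which is why it sits outside the group.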

Match if ‘property is set’

It is not uncommon to have some overlap between tools/plugins. For example:

PdfPlugin: a plugin that only supports pdf documents
DocumentPlugin: this plugin supports a lot of document types, including pdf.

So how would we prevent our plugin from processing a trace that has already been processed by the DocumentPlugin?

The easiest solution would be to check whether a certain property has already been set: if both plugins set the foo.bar property, we check whether it already has a value.

So we only process the trace if foo.bar is empty, meaning our matcher becomes:

foo.bar!=* AND
0 < $data.size < 1000000 AND
(
  file.extension=pdf OR
  (
    $data.mimeType=application\\/pdf OR
    $data.mimeType=application\\/x-pdf
  )
)
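The effect of the `foo.bar!=*` condition can be sketched as follows. The sketch assumes that "not set" means the property is absent from the trace, and it models a trace as a plain dictionary; both are simplifications, and the function names are made up for this example.

```python
def is_set(properties, name):
    """True if the trace already carries a value for the property."""
    return properties.get(name) is not None


def pdf_plugin_should_process(properties):
    # Models the foo.bar!=* part of the matcher: skip traces that were handled
    return not is_set(properties, "foo.bar")


trace = {"file.extension": "pdf"}
print(pdf_plugin_should_process(trace))  # True: nobody has set foo.bar yet

# The hypothetical DocumentPlugin processes the trace first and sets the property
trace["foo.bar"] = "some value"
print(pdf_plugin_should_process(trace))  # False: already processed
```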

Match on excluding a certain path

It is also not uncommon to exclude certain paths from your plugin. These paths might contain invalid or encrypted files, for example.

So let’s say we want to exclude all files under the /tmp/virus path. How do we go about it?

Again, we check our extraction/Hansken trace model, and we see that file.path looks promising.

So excluding /tmp/virus would look something like:

-file.path=/tmp/virus* AND
foo.bar!=* AND
0 < $data.size < 1000000 AND
(
  file.extension=pdf OR
  (
    $data.mimeType=application\\/pdf OR
    $data.mimeType=application\\/x-pdf
  )
)
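Because `*` matches any sequence of characters, the pattern `/tmp/virus*` is a prefix match. The sketch below uses Python's `fnmatch.fnmatchcase` as a stand-in for the wildcard to show which paths such a pattern would cover; the sample paths and the function name `is_excluded` are invented for this example.

```python
from fnmatch import fnmatchcase


def is_excluded(path, pattern="/tmp/virus*"):
    """True if the whole path fits the exclusion pattern."""
    return fnmatchcase(path, pattern)


paths = [
    "/tmp/virus/a.pdf",
    "/tmp/virus",
    "/tmp/viruses/b.pdf",
    "/home/user/a.pdf",
    "/var/tmp/virus/a.pdf",
]
print([is_excluded(p) for p in paths])
# [True, True, True, False, False]
```

Note that a sibling directory such as /tmp/viruses is covered by the pattern as well, which may or may not be what you want.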

Match on specific datastream type, an anti-pattern

Warning

Matching on specific datastream types is an anti-pattern! It is not recommended to match on a datastream type. Instead one should match on other datastream properties, such as fileType, mimeType or mimeClass. The reason for this is explained in the paragraph below.

Using a matcher that is too loose or too tight can yield unexpected results. Usually one will match on properties of a datastream like fileType, mimeType or mimeClass, as these are reliable properties that have been added by Hansken tools. Matching on a specific datastream says nothing about the type of file. For example, a PDF file may be available in a raw as well as in a decrypted datastream. By matching on the datastream type one may exclude traces that were not intended to be excluded. Conversely, matching on a datastream type may also include more traces than expected. For example, someone may think “Plugin A puts data on the plain datastream, so I’ll match on the plain datastream with Plugin B”, forgetting that plain may be used by other tools as well. In other words, there may be traces with that datastream type that you did not know of, potentially crashing your plugin. See Data streams for more information.

Now that you know why it is an anti-pattern, let’s explain how it would be done (for those edge cases where it’s needed). Let’s say we want our PdfPlugin to ONLY process raw datastreams. The best way to do this would be to match on $data.type:raw. Note that $data.type matches against the type of the current datastream, so in this case it matches only when the current datastream is of type raw.

An incorrect way to do it would be to replace the $data. matcher(s) with data.raw. matchers. Such a matcher matches whenever a trace has this datastream type, even if the current datastream type is different. Remember that the process method of an extraction plugin is always called once for each datastream on each trace. For example, let’s say a trace has two datastreams, raw and text. The matcher would match for both datastreams because the trace has a raw datastream (even though the current datastream type may be text). This results in the process method being called twice (for raw and for text), which may lead to other bugs if the developer is not aware of it. For example, the second time the plugin may try to overwrite data on a trace, which is prohibited.
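The difference between the two approaches can be sketched by counting how often process would be called for a trace with a raw and a text datastream. The functions below are hypothetical stand-ins for the two kinds of matcher, not Hansken code.

```python
def matches_trace_level(trace_datastreams, current_datastream):
    # Anti-pattern (data.raw.<property>): true whenever the trace HAS a raw
    # datastream, regardless of which datastream is currently offered
    return "raw" in trace_datastreams


def matches_current_stream(trace_datastreams, current_datastream):
    # $data.type:raw: true only when the CURRENT datastream is raw
    return current_datastream == "raw"


def count_process_calls(trace_datastreams, matcher):
    """The matcher is evaluated once per datastream; count the hits."""
    return sum(1 for ds in trace_datastreams if matcher(trace_datastreams, ds))


streams = ["raw", "text"]
print(count_process_calls(streams, matches_trace_level))     # 2
print(count_process_calls(streams, matches_current_stream))  # 1
```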

So, using $data.type, our matcher would look like:

$data.type:raw AND
-file.path=/tmp/virus* AND
foo.bar!=* AND
0 < $data.size < 1000000 AND
(
  file.extension=pdf OR
  (
    $data.mimeType=application\\/pdf OR
    $data.mimeType=application\\/x-pdf
  )
)

How precise should a matcher be?

In practice, only you as the plugin developer can answer this question.

Know that from the point of view of Hansken, we only care that the plugin:

Should not crash: If a matcher does not compile, then your plugin will not be available in Hansken. Tip: be sure to test your plugin with the test framework.
Should not be slow: Matching is designed to be extremely fast, but of course, if you make it too complex it can take longer than we want. In the example above, we calculated that 1 extra second per trace for 1 million traces adds up to more than 11 days of extra CPU time. Unlike processing, matching is done for every trace, in every extraction iteration, so be careful!
Should match on the bare minimum: Don’t go too far by matching 50 different criteria before allowing a trace to be processed. Note that a lot of (if not all) of these criteria depend on properties set by other tools, and you don’t really have any control on how these tools work.
