HQL-Lite
Overview
HQL-Lite is a query language derived from Hanskens full HQL human. HQL stands for Hansken Query Language and can be used to search or match traces. Since not all elements of full HQL can be used in the context of an extraction, extraction plugins use HQL-Lite, a lightweight version of HQL. This document describes the usage of HQL-Lite in the context of extraction plugins.
How does Hansken work?
Let’s say we have a Hansken image
hansken_image1
with 10 pdf files, and 5 jpegs.And our Hansken contains 2 tools:
PdfPlugin
JpegTool
Note
All plugins are Hansken tools, but not all Hansken tools are plugins. Some tools are included in Hansken core.
Let’s look at a (simplified) pseudocode example of the inner workings of Hansken:
for each trace in new_traces {
for each datastream in trace {
for each tool in hansken_tools {
if tool.can_this_tool_process_the_provided_trace(trace, datastream) {
tool.process_the_trace(trace, datastream)
}
}
}
}
So in this example we know the following:
new_traces
has10 pdf files
5 jpeg files
hansken_tools
contains:PdfPlugin
JpegTool
So the question here is, how do we prevent that traces are not processed by incompatible tools?
The answer is the tool.can_this_tool_process_the_provided_trace()
part of the pseudocode.
What does can_this_tool_process_the_provided_trace()
do?
Hansken actually contains many more tools/plugins than these 2, and instead of 15 files/traces, we usually deal with millions.
Note
If each trace has 1 extra second of overhead, 1 million traces would take 11.5 days of extra CPU time
Matchers to the rescue
To reduce the unnecessary overhead of processing all traces (even the ones the tool cannot actually process), Hansken
implements the concept of a matcher
for each tool. This matcher basically checks the trace for “matching
conditions”, that would allow the tool to process it.
Sometimes these matching conditions can be as simple as a specific filename
or extension
, but are often more
elaborate in the sense that they check multiple factors that require some intimate knowledge of Hansken.
What is HQL-Lite?
HQL-Lite is a language based on HQL (Hansken Query Language) that allows plugin developers to write matchers for Hansken Extraction Plugins. It could be said that HQL-Lite contains a subset of HQL features, plus some HQL-Lite unique features that are only interesting for matchers.
Note
Please note that even though the HQL-Lite query is part of the plugin, it is compiled and stored in Hansken during startup to achieve performance.
Why not just use HQL for plugins?
HQL was designed to search for traces stored in the Elasticsearch database. As such, some of its features are tightly coupled to the Elasticsearch implementation, making it difficult to re-implement them for plugins.
Also, even though HQL is more complex than the requirements for matching in plugins, a couple of minor features that are absolutely necessary for matching are not implemented in HQL, as they don’t make much sense from a search point of view. This is because HQL was designed to be used with finished extractions with all the traces stored in the database, while HQL-Lite was designed for active extractions.
HQL-Lite syntax
Matcher |
Syntax |
remarks |
---|---|---|
All |
|
an empty string translates to match for all traces |
And |
|
the case-sensitive |
Not |
|
the case-sensitive |
Range |
|
a numbered-range check with a min or/and max range(s) |
Or |
|
the case-sensitive |
Data |
|
see |
DataType |
|
this query matches against the type of the current datastream |
Types |
|
this query checks if the trace contains a certain trace type as defined in the Hansken trace model |
There are also a couple of general guidelines that apply to all matchers:
Equals/not equals:
:
or=
: The most basic of left equals right statements. note that=
is also valid.!=
: The opposite of equals, not equals. Note that!:
is NOT supported.
Wildcards:
?
: Match against any single character. E.g.foo:r?w
will match againstraw, row
but not againstrowing
.*
: Match against any chars. E.g.foo:r*
will match againstr, ra, raw, raaaaaw
but not againstaw
.
Exact match: By surrounding a value with quotes, we tell the parser that it is a single value. This is especially helpful for values that might contain separators. E.g.
foo:'hello hql-lite'
.CSV: Currently only the
type
query supports multiple values to check against. E.g.type:email,chatMessage
will only returntrue
if both types exist for this trace.()
grouping: You can group statements by putting brackets around them. E.g.foo:1 AND (bar:2 OR bla:3)
which translates tofoo:1
plus one of the statements in the brackets.Escaping
\"\.\t\r\n:=><!()~/,[]{}
: Some characters are used internally by HQL-Lite, and need to be escaped if they are used in the value side of the key-value pair. These values can be escaped by adding prepending\\
to the character(s). Example:foo:foo bar
should befoo:foo\\ bar
,foo.bar:foo:bar
should befoo.bar:foo\\:bar
…etc.The only exceptions to this rule are unix paths:
Acceptable paths:
foo:/
foo:bar/baz
foo:/bar/baz
foo:'/bar/baz/he llo'
foo:*bar/baz*
Unacceptable paths:
foo:/bar/
-> this is the regex matcher, which is unsupported in HQL-litefoo:c:\
-> should befoo:c\:\\
, both thecolon
and theslash
need to be escapedfoo:'c:\'
-> should befoo:'c:\\'
, theslash
still needs to be escapedNote
the backslash is the universal escape character, so it always needs to be escaped.
$data matchers
In Hansken, a trace can have multiple datastreams. The exact content of said datastreams is
discussed elsewhere, but the basic idea is that a trace can have multiple representations. For example, a trace might
have a raw
datastream, but after we identify that the raw bytes contain a text file, we might add a separate
datastream text
.
Note
The process() method of each plugin is called for each datastream of each trace. This is explained in How does Hansken work? . Subsequently, you might have the same property for a different datastream. For example: you might have a data.raw.size and a data.text.size property. The reason you might have the same property multiple times, is because it could have a different meaning.
For example:
data.raw.size: is the size in bytes
data.text.size: is the number of bytes in the text representation of the raw stream
If we want to check if either of these properties is not empty by using a $data
matcher, we do:
$data.size>0
When is it useful to use a $data matcher?
For example, there is a simple plugin called LetterCountPlugin
, that counts the letters in text based datastreams.
So to match on these text based datastreams, we have 2 choices:
List all the possibilities
Which is too tedious, and not very flexible when new types are supported
Match on a common property
More compact, but sometimes difficult to find a common property
In this case we might match on mimeType, which we know is text/plain
or text/x-log
for 2 of types we want to match:
$data.mimeType=text\\/*
This will match the following:
data.text.mimeType=text\\/plain
data.text.mimeType=text\\/not\\ plain
data.pdf.mimeType=text\\/encoded
data.foo.mimeType=text\\/bar
But will not match any of the following:
data.text.mimeType=txt
data.text.mimeType=pdf
data.text.mime=text\\/plain
data.foo.bar=text\\/plain
How to write a matcher?
The functional requirements for writing a matcher can be summarized in the following:
What does my plugin expect as input?
How can I describe that input with the information Hansken provides?
PdfPlugin example
Let’s say we just finished writing a PdfPlugin
. This is a simple plugin that checks if pdf files contain the
word the
.
So let’s go over our checklist:
What does my plugin expect as input?
PDF files.
How can I describe that input with the information Hansken provides?
Hansken consumes and produces Traces. To that effect, we can only match on trace properties that are available in Hansken.
Match on extension
The easiest way would be to only allow traces with the .pdf
extension. Looking at the Hansken trace model (or a
Hansken extraction), we can see that there’s a property file
which contains a property extension
.
So what would that look like in HQL-lite? Something like
file.extension=pdf
Warning
This of course only works if the file has the correct extension (note that matchers are case-sensitive).
So what do we do, if we also want to match pdf files that are (un)intentionally misnamed?
Match on mime-type
Looking at Wikipedia, we see that pdf
has a couple of mime-types. In return looking at our extraction and the
trace-model, we see both at data.raw.mimeType
, with a further explanation in the Hansken trace model that
the raw
portion of the property is the data type of the datastream.
If we don’t know which datastream has the mimeType
property beforehand, we could use the broad-scoped $data.
matcher
to look at every datastream.
So our matcher becomes:
file.extension=pdf OR
(
$data.mimeType=application\\/pdf OR
$data.mimeType=application\\/x-pdf
)
Match on data size
Some pdf files can be huge, meaning that parsing them will need a lot of resources. Could we add a data size check to
the matcher? According to the Hansken trace model data
has a property size
(similar to mimeType
) that we
could use for this.
Note
This is also a good way to check if a file is empty or not.
Let’s say our cutoff limit is 1 MB, meaning our matcher becomes:
0 < $data.size < 1000000 AND
(
file.extension=pdf OR
(
$data.mimeType=application\\/pdf OR
$data.mimeType=application\\/x-pdf
)
)
Match if ‘property is set’
It is not uncommon to have some overlap between tools/plugins. For example:
PdfPlugin: a plugin that only supports pdf documents
DocumentPlugin: this plugin supports a lot of document types, including
pdf
.
So how would we prevent our plugin from processing a trace that has already been processed by the DocumentPlugin
?
The easiest solution would be to check if a certain property has already been set. Meaning, that if both plugins set
the foo.bar
property, we check if said property has already been set.
So we only process the trace if foo.bar
is empty, meaning our matcher becomes:
foo.bar!=* AND
0 < $data.size < 1000000 AND
(
file.extension=pdf OR
(
$data.mimeType=application\\/pdf OR
$data.mimeType=application\\/x-pdf
)
)
Match on excluding a certain path
It is also not uncommon to exclude certain paths from your plugin. These paths might contain invalid or encrypted files, for example.
So let’s say we want to exclude all files under in the /tmp/virus
path. How do we go about it?
Again, we check our extraction/Hansken trace model, and we see that file.path
looks promising.
So excluding /tmp/virus
would look something like:
-file.path=/tmp/virus* AND
foo.bar!=* AND
0 < $data.size < 1000000 AND
(
file.extension=pdf OR
(
$data.mimeType=application\\/pdf OR
$data.mimeType=application\\/x-pdf
)
)
Match on specific datastream type, an anti-pattern
Warning
Matching on specific datastream types is an anti-pattern! It is not recommended to match on a datastream type. Instead one should match on other datastream properties, such as fileType, mimeType or mimeClass. The reason for this is explained in the paragraph below.
Using a matcher that is too loose or too tight can yield unexpected results. Usually one will match on properties
of a datastream like fileType
, mimeType
or mimeClass
as these are reliable properties that have been added by
Hansken tools. Matching on a specific datastream says nothing about the type of file. For example a PDF file may be
available in a raw
as well as in a decrypted
datastream. By matching on the datastream type one may exclude traces
that were not intended to be excluded.
Contrarily, note that matching on a datastream type may include more traces than you expected as well. For example,
someone may think “Plugin A puts data on the plain
datastream, so I’ll match on the plain
datastream with Plugin B”,
forgetting that plain
may be used by other tools as well. In other words, there may be traces with that datastream
type that you did not know of, potentially crashing your plugin. See Data streams for more information.
Now that you know why it is an anti-pattern, lets explain how it would be done (for those edge cases where it’s needed):
Lets say we want our PdfPlugin
to ONLY process raw
datastreams.
The best way to do this would be to match
on $data.type:raw
. Note that $data.type
matches against the type of the current datastream, so in this case it
matches only when the current datastream is of type raw
.
An incorrect way to do it would be to replace $data.
matcher(s) with data.raw.
. This means the matcher
will match whenever a trace has this datastream type, even if the current datastream type is different.
Remember that the process
method of an extraction plugin is always called once for each datastream on each trace.
For example, lets say a trace has two datastreams, raw
and text
. The matcher would match for both the datastreams
because the trace has a raw
datastream (even though the current datastream type may be text
). This results in the
process
method being called twice (for raw
and for text
), which may lead to other bugs if the developer doesn’t
know this. For example, the second time the plugin may be trying to overwrite data on a trace which is prohibited.
So, using $data.type
, our matcher would look like:
$data.type:raw AND
-file.path=/tmp/virus* AND
foo.bar!=* AND
0 < $data.size < 1000000 AND
(
file.extension=pdf OR
(
$data.mimeType=application\\/pdf OR
$data.mimeType=application\\/x-pdf
)
)
How precise should a matcher be?
In practice, only you as the plugin dev can answer this question.
Know that from the point of view of Hansken, we only care that the plugin:
Should not crash: If a matcher does not compile, then your plugin will not be available in Hansken. Tip: be sure to test your plugin with the test framework.
Should not be slow: Matching is designed to be extremely fast, but of course, if you make it too complex it can take longer than we want. In the example above, we calculated that 1 second extra for 1 million traces is 11 days of extra CPU time. Unlike processing, matching is done for every trace, in every extraction iteration, so be careful!
Should match on the bare minimum: Don’t go too far by matching 50 different criteria before allowing a trace to be processed. Note that a lot of (if not all) of these criteria depend on properties set by other tools, and you don’t really have any control on how these tools work.