# Traces & Trace model

.. _traces:

## Traces

Traces are structured data objects produced by tools/plugins during an extraction. A trace represents a piece of
information found in an evidence file.

The following figure shows the main elements of a trace. Each element is described in more detail in the following
paragraphs.

![trace.svg](trace.svg)

.. The following diagram is used to produce the trace.png file

    @startuml
    hide empty methods

    Interface Trace << (T,#FF7700 >> {
      {static} String id
      {static} String name
      {static} String[] path
      ..
      Map<Type, Properties> properties
      ..
      List<DataStream>
      List<Trace> children
    }

    Interface DataStream  << (D,orchid>> {
      {static} String type
      {static} String length
      {static} byte[] data
      ..
      String fileType
      Properties properties
    }

    Trace *-- Trace : a trace could have child traces
    Trace *-- DataStream : a trace could have data streams

    @enduml

### Types and Properties

A trace has properties that describe the information of it by means of a property value. Trace properties are grouped by
a trace type. A trace can have multiple types.

All types and properties that can be set are defined in the :ref:`Hansken trace model`.

An example of a type is `document`, which could have the properties `application` and `createdOn`. The trace will have a
type `document`, and can the following properties with values:

    document.application: Libre Office
    document.createdOn: 2021-09-18 20:00:00

### Intrinsic properties

A trace has several *intrinsic* properties. These are properties that are not related to a trace type. The intrinsic
properties available to extraction plugins are:

* `id`: a unique identifier of the trace, generated when a trace is created
* `name`: a name given to a trace when it is created
* `path`: a logical path of the trace, of which the elements are the names of traces from the root trace until this
  trace

.. _datastreams:

### Data streams

Typically, a trace represents a piece of data found in an evidence file. This data is part of the trace and available as
a data stream. A trace can have multiple data streams. Each data stream has a type. Data streams can also have
properties that apply to the data stream itself. The data stream properties are modeled as properties of the trace, in
the following pattern:
`data.<datastreamtype>.propertyname` (where `<datastreamtype>` is substituted by the actual type of the data stream).

The set of data stream types and data stream properties is fixed. All allowed types and properties are defined in the
:ref:`Hansken trace model` (see `data`).

An important data stream property is the `fileType` property. This property contains a textual description of the
*detected* file type for the data stream. An example of a `fileType` is  'Adobe Pdf'. The `fileType` is a good candidate
to use in a extraction plugin 'matcher'. This `fileType` is detected by Hansken using file type heuristics, which are
primarily based on the data stream bytes itself, and secondarily on other metadata such as a file extension.
(N.B. The `fileType` is detected in Hansken by the extraction tool `Firefli`.) For more information on how datastream 
properties can be used for matching, see :ref:`here<hql datastreams>`.

Note that not all traces have data streams. In these cases it is a trace of meta-data derived from another trace.

Usually, each trace with data has a data stream of type `raw`. This data stream contains the bytes of the traces as they
were found when the trace was created. In some occasions, the `raw` data can be represented in a different form before it
can be processed further, for example if the data can be decoded or decrypted. Hansken tools and plugins can decode
the `raw` data stream to a standard UTF-8 data stream, or can decrypt the data if a decryption key is present. Hansken
tools and extraction plugins can store the new data at the new trace in a new data stream. This new data stream has a
different type than the `raw` type.

Examples in code can be found here:

* Adding a Datastream :ref:`Java <datastreams java>`
* Adding a Datastream :ref:`Python <datastreams python>`

.. TODO: Add a link to matcher in the paragraph above

.. _child traces:

### Child traces

A trace can have child traces. For example, a trace of type `archive` can have children, where each child is a trace
that represents an entry in the archive.

With an extraction plugin it is possible to create child traces for a trace that is being processed. New properties,
data streams, and other child traces can be set on the new child traces. When a child trace is created, the plugin
should provide a `name` for the child trace. The `id` of the child trace is generated, in the following
form: `parenttraceid-childnumber`. For example, if the parent has an id `0-0-0-0-0:0-9`, the first child gets the
id `0-0-0-0-0:0-9-1`, the second child gets the id `0-0-0-0-0:0-9-2`, and so on.

Note that a trace does not have (direct) access to its parent trace.

### Trace property types

The SDK supports the following property types for traces:

|             | Java                  | Python                |
|-------------|-----------------------|-----------------------|
| binary      | `byte[]`              | `bytes` / `bytearray` |
| boolean     | `boolean`             | `bool`                |
| integer     | `int` / `long`        | `int`                 |
| real        | `float` / `double`    | `float`               |
| string      | `String`              | `str`                 |
| date & time | `Date`                | `datetime`            |
| list        | `List`                | `list` / `tuple`      |
| mapping     | `Map`                 | `dict`                |
| location    | `LatLong`             | `GeographicLocation`  |
| vector      | `Vector`              | `Vector`              |
| tracelet    | *see Tracelets below* | *see Tracelets below* |

Both location and vector types are available from the SDK, Java package `org.hansken.plugin.extraction.api` or Python module `hansken.util`.

.. _vector:

#### Vector

A vector is a data type that can be used to store points in n-dimensional space as an array of floating point values.
Once indexed, the vectors can then be used in a gui or other client to search for traces that have a nearby vectors.
For example, it is possible to use a neural network that provides embeddings of human faces as vectors. Once indexed,
the vectors can then be used to find pictures with similar faces. To do this, the search rest api can be used to 
sort by the euclidean- or manhattan distance, or cosine similarity to a given vector.

.. _tracelets:

#### Tracelets

A Tracelet is a bundle of property values that belong to a single type. It is a property on a trace that can have
multiple properties itself, making it a list of key/value pairs. The API doesn't specify the cardinality, but the
implementation is limited to cardinality 'Few'. In Hansken these are called FVT's (Few Valued Types).

.. note::  MVT's (Many Valued Types) are currently not supported in the SDK and will be added in a future release.

An example of a tracelet is the `prediction` property, which describes a category or class a trace belongs to. It is
possible for a trace to have multiple predictions. Therefore `prediction` is a tracelet. Other examples of
tracelets are `identity` and `collection`.

Examples in code can be found here:

* Adding tracelets in :ref:`Java <tracelets java>`
* Adding tracelets in :ref:`Python <tracelets python>`

.. _Hansken trace model:

## Hansken trace model

All traces in Hansken are based on a specific version of the trace model, and must comply to that version of the trace
model. This is a nested data structure composed of origins, categories, types and properties.

All non-inrinsic trace properties are optional and are grouped by **type**. These types are defined under the trace
model section 'categories'. Every **category** has a list of allowed types. When a trace is identified as being a
document, it will get this set of predefined document properties. Trace types can have different **origins**. The
possible origins are defined in the trace model section 'origins'. An example of this is the processed types that are
always generated by the system during an extraction.

The details of the current trace model can be retrieved using the `/tracemodel`
REST call on the Gatekeeper endpoint of Hansken, or check the Hansken Documentation on the trace model.

.. TODO: nice screenshot on how to find the trace model in Hansken?

.. _Trace model and the extraction plugin SDK:

### Trace model and the extraction plugin SDK

.. warning:: The extraction plugin SDK has no knowledge of the trace model

The Extraction Plugins SDK has no knowledge of the trace model at this time. It is however possible to create new traces
with plugins. If any newly created Traces don't comply to the model, Hansken will not accept them and mark the plugin
execution as failed. The Extraction Plugins SDK and the provided [Test Framework](test_framework.md) don't check this.
Please make sure to use the right naming when creating new Traces, as provided by the trace model.

If an erroneous trace property is set, Hansken will show an error. The error can be found in the Hansken Expert UI
interface by double-clicking on the trace. Then the trace details screen will be opened and the error will be displayed
as follows:

.. image:: toolrun_error.png

This error describes that a property does not exist in the trace model. To get more information about the error, the
extraction log can be viewed. In the extraction log you have to search
for `java.lang.IllegalArgumentException: no such type` to find out which property is not supported by the trace model.

In the example extraction log below, the property `this_property_does_not_exist` could not be found 681 times.

    Cumulative warnings, based on the message without numbers, uuids and trace objects. Only showing full message for first warning of this type.
          Count | Key                                                   | Message
            681 | org.hansken.ep.shade.io.grpc.StatusRuntimeException   | CANCELLED: Cancelled by client with StreamObserver.onError(); org.hansken.ep.shade.io.grpc.StatusRuntimeException: ABORTED: java.lang.IllegalArgumentException: no such type: this_property_does_not_exist
              7 | java.lang.IllegalStateException                       | call was cancelled
              1 | org.hansken.ep.shade.io.grpc.StatusRuntimeException   | UNAVAILABLE: HTTP/2 error code: NO_ERROR Received Rst Stream