# Python API Changelog

This document summarizes all important API changes in the Extraction Plugin API.
It only lists changes that are relevant to plugin developers.
For a full list of changes per version, please refer to the general :ref:`changelog `.

.. If present, remove `..` before `## |version|` if you create a new entry after a previous release.

## |version|

* This version introduces a new docker image build utility `label_plugin`.
  This utility will eventually replace `build_plugin`, which is now deprecated.

  `label_plugin` is a utility to add labels to an extraction plugin image.
  Labeling a plugin is required for Hansken to detect extraction plugins in a plugin image registry.

  To label a plugin, first build the plugin image with [docker build](https://docs.docker.com/reference/cli/docker/image/build/),
  for example by using one of the following commands:

  ```shell
  docker build . -t my_plugin
  docker build . -t my_plugin --build-arg https_proxy=http://your_proxy:8080
  ```

  Next, run the `label_plugin` utility to label the built plugin image:

  ```shell
  label_plugin my_plugin
  ```

  The result of `label_plugin` is a plugin image that can be :ref:`uploaded to Hansken`.

  `label_plugin` is preferred over `build_plugin`, as it does not require a full (virtual) environment with all plugin
  dependencies and resources. This is especially useful when the plugin uses (big) data models or (external) dependencies.

  For usage read further in [packaging](packaging.md).

## 0.7.0

* Escaping the `/` character in matchers is now optional.
  This simplifies matchers and improves compatibility between HQL and HQL-Lite.
  For more information and examples, see the :ref:`HQL-Lite syntax documentation`.

  Examples:

  * Old: `file.path:\/Users\/*\/AppData` -> new: `file.path:/Users/*/AppData`
  * Old: `file.path:\\/Users\\/*\\/AppData` -> new: `file.path:/Users/*/AppData`
  * Old: `registryEntry.key:\/Software\/Dropbox\/ks*\/Client-p` -> new: `registryEntry.key:/Software/Dropbox/ks*/Client-p`

* Hansken now returns `file.path` properties (outside the scope of matchers) as a `String` property, instead of a list of strings.
  Example: `trace.get('file.path')` now returns `'/dev/null'`; previously it returned `['dev', 'null']`.

* Improved plugin loading when using `serve_plugin` and `build_plugin`:
  `import` statements now work for modules (python files) that are located in the same directory structure as the plugin.

* A plugin can now stream data to a trace using `trace.open(mode='wb')`.
  This removes the limit on the size of data that can be written.
  See also :ref:`the python code snippet`. Example:

  ```python
  with trace.open(mode='wb') as writer:
      writer.write(b'a string')
      writer.write(bytes(another_string, 'utf-8'))
  ```

  _note_: this does not work when using `run_with_hanskenpy`.

## 0.6.1

* The docker image build script `build_plugin` has been updated to allow for extension of the docker command.
  This is especially useful for specifying a proxy.

  Build your plugin container image with the following command:

  ```bash
  build_plugin PLUGIN_FILE DOCKER_FILE_DIRECTORY [DOCKER_IMAGE_NAME] [DOCKER_ARGS]
  ```

  .. warning:: Note that the `DOCKER_IMAGE_NAME` argument no longer requires a `-n` parameter to be specified.

  For usage read further in [packaging](packaging.md).

## 0.6.0

.. warning:: This is an API breaking change. Upgrading your plugin to this version will require code changes.
   Plugins built with earlier versions of the SDK, from `0.3.0` onwards, will still work with Hansken.

.. warning:: It is strongly recommended to upgrade your plugins to this new version because it significantly improves the start-up time of Hansken.
   See the migration steps below.

This release contains both build pipeline changes and API changes. Please read all changes carefully.

### Build pipeline change

* Extraction plugin container images are now labeled with PluginInfo.
  This allows Hansken to efficiently load extraction plugins.

  Migration steps from earlier versions:

  1. Update the SDK version in your `setup.py` / `requirements.txt`
  2. If you come from a version prior to `0.4.0`, or if you use a plugin name instead of a plugin id in your `pluginInfo()`,
     switch to the plugin id style (read instructions for version `0.4.0`)
  3. Update your build scripts to build your plugin (Docker) container image.
     Be sure to [have the Extraction Plugins SDK installed](getting_started.md#Installation).
     Then, you should build your plugin container image with the following command:

     ```bash
     build_plugin PLUGIN_FILE DOCKER_FILE_DIRECTORY -n [DOCKER_IMAGE_NAME]
     ```

     For example:

     ```bash
     build_plugin plugin/chatplugin.py . -n extraction-plugins/chatplugin
     ```

     This will generate a plugin image:

     * The extraction plugin is added to your local image registry (`docker images`),
     * Note that DOCKER\_IMAGE\_NAME is optional and will default to `extraction-plugin/PLUGINID`, e.g. `extraction-plugin/nfi.nl/extract/chat/whatsapp`,
     * The image is tagged with two tags: `latest`, and your plugin version.

### API changes

* The field `plugin` has been removed from `PluginInfo`.
* The field `pluginId` should now be the first argument of PluginInfo (when using unnamed arguments).

  Old (unnamed arguments):

  ```python
  def plugin_info(self):
      return PluginInfo(self, '1.0.0', 'description', author, MaturityLevel.PROOF_OF_CONCEPT, '*', 'https://hansken.org', PluginId(...), 'Apache License 2.0')
  ```

  New (removed `self`, and moved `PluginId(...)` to first argument position):

  ```python
  def plugin_info(self):
      return PluginInfo(PluginId(...), '1.0.0', 'description', author, MaturityLevel.PROOF_OF_CONCEPT, '*', 'https://hansken.org', 'Apache License 2.0')
  ```

  Old (named arguments):

  ```python
  def plugin_info(self):
      return PluginInfo(plugin=self, version='1.0.0', ...)
  ```

  New (removed `plugin=self`):

  ```python
  def plugin_info(self):
      return PluginInfo(version='1.0.0', ...)
  ```

* Plugin `data_context.data_size` is now a variable instead of a method:

  Old:

  ```python
  def process(self, trace: ExtractionTrace, data_context: DataContext):
      size = data_context.data_size()
  ```

  New:

  ```python
  def process(self, trace: ExtractionTrace, data_context: DataContext):
      size = data_context.data_size
  ```

* Declaring the required runtime resources in a plugin's info has been simplified.
  Extraction plugin resources no longer use the builder pattern.

  Old:

  ```python
  return PluginInfo(
      ...,
      resources=PluginResources.builder().maximum_cpu(0.5).maximum_memory(1000).build()
  )
  ```

  New:

  ```python
  # no need for a builder, declare resources by direct instantiation
  return PluginInfo(
      ...,
      resources=PluginResources(maximum_cpu=2.0, maximum_memory=2048)
  )

  # or, as before, specify just one resource
  return PluginInfo(
      ...,
      resources=PluginResources(maximum_memory=4096)
  )
  ```

## 0.5.1

* Simplify tracelet properties by making the tracelet type prefix optional.

  ```python
  # using a Tracelet object
  trace.add_tracelet(Tracelet("prediction", {
      "type": "example",
      "confidence": 0.8
  }))

  # or without a Tracelet object
  trace.add_tracelet("identity", {"name": "John Doe", "status": "online"})
  ```

* Enabled _manual_ plugin testing, as described on :ref:`advanced use of the test framework in Python`.

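To picture what an optional tracelet type prefix means for the property names, the following minimal sketch shows one way short and fully qualified keys could be brought to the same form. It is an illustration only: the helper `qualify_tracelet_properties` is hypothetical, is not part of the SDK, and does not claim to reflect how the SDK handles the prefix internally.

```python
def qualify_tracelet_properties(tracelet_type: str, properties: dict) -> dict:
    """Return a copy of ``properties`` in which every key carries the type prefix.

    Hypothetical helper for illustration; not part of the Extraction Plugin SDK.
    """
    prefix = f"{tracelet_type}."
    return {
        # keys that already start with the prefix are kept as they are,
        # all other keys get the prefix prepended
        (key if key.startswith(prefix) else prefix + key): value
        for key, value in properties.items()
    }


# the short form and the fully qualified form end up with the same keys
short = qualify_tracelet_properties("prediction", {"type": "example", "confidence": 0.8})
full = qualify_tracelet_properties(
    "prediction", {"prediction.type": "example", "prediction.confidence": 0.8}
)
```

Under this assumption both `short` and `full` describe the same tracelet, which is why either spelling can be passed to `trace.add_tracelet`.
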
## 0.5.0

* Support vector data type in trace properties.

  ```python
  embedding = Vector.from_sequence((width, height))
  tracelet = Tracelet("prediction", {
      "prediction.type": "example-vector",
      "prediction.embedding": embedding
  })
  trace.add_tracelet(tracelet)
  ```

## 0.4.13

* When writing input search traces for tests, it is no longer required to explicitly set an `id` property.
  Ids are automatically generated when executing tests.

## 0.4.7

* More `$data` matchers are supported in the Hansken.py plugin runner.
  Before this improvement it was only possible to match on `$data.type`.
  Now it is also possible to match on, for example, `$data.mimeType` and `$data.mimeClass`.
  The `$data` matcher should still be at the end of the query, as before.

## 0.4.6

* It is now possible to specify maximum system resources in the `PluginInfo`.
  To run a plugin with 0.5 cpu (= 0.5 vCPU/Core/hyperthread) and 1 GB memory, for example,
  the following configuration can be added to `PluginInfo`:

  ```python
  plugin_info = PluginInfo(..., resources=PluginResources.builder().maximum_cpu(0.5).maximum_memory(1000).build())
  ```

## 0.4.0

* Extraction Plugins are now identified with a `PluginInfo.PluginId` containing a domain, category and name.
  The method `PluginInfo.name(pluginName)` has been replaced by `PluginInfo.id(new PluginId(domain, category, name))`.
  More details on the plugin naming conventions can be found in the :doc:`../concepts/plugin_naming_convention` section.
* `PluginInfo.name()` is now deprecated (but will still work for backwards compatibility).
* A new license field `PluginInfo.license` has also been added in this release.
* The following example creates a PluginInfo for a plugin with the name `TestPlugin`, licensed under the `Apache License 2.0` license:

  ```python
  class TestPlugin(ExtractionPlugin):

      def plugin_info(self) -> PluginInfo:
          return PluginInfo(self,
                            version='1.0.0',
                            description='A plugin for testing.',
                            author=Author('The Externals', 'tester@holmes.nl', 'NFI'),
                            maturity=MaturityLevel.PROOF_OF_CONCEPT,
                            webpage_url='https://hansken.org',
                            matcher='file.extension=txt',
                            id=PluginId(domain='nfi.nl', category='test', name='TestPlugin'),
                            license='Apache License 2.0'
                            )
  ```

## 0.3.0

* Extraction Plugins can now create new datastreams on a Trace through data transformations.
  Data transformations describe how data can be obtained from a source.

  An example case is an extraction plugin that processes an archive file.
  The plugin creates a child trace per entry in the archive file.
  Each child trace will have a datastream that is a transformation that marks the start and length of the entry in the original archive data.
  By just describing the data instead of specifying the actual data, a lot of space is saved.

  Although Hansken supports various transformations, the Extraction Plugins SDK for now only supports ranged data transformations.
  Ranged data transformations define data as a list of ranges, each range with an offset and length in a bytearray.

  The following example sets a new datastream with dataType `html` on a trace, by setting a ranged data transformation:

  ```python
  trace.add_transformation('html', RangedTransformation(Range(offset, length)))
  ```

  The following example creates a child trace and sets a new datastream with dataType `raw` on it,
  by setting a ranged data transformation with two ranges:

  ```python
  child = trace.child_builder('new trace')
  child.add_transformation('raw', RangedTransformation.builder()
                                                      .add_range(10, 20)
                                                      .add_range(50, 30)
                                                      .build())
  ```

  More detailed documentation will follow in an upcoming SDK release.

## 0.2.0

.. warning:: This is an API breaking change.
   Plugins created with an earlier version of the extraction plugin SDK are not compatible with Hansken versions that use `0.2.0` or later.

* Introduced a new extraction plugin type `api.extraction_plugin.DeferredExtractionPlugin`.
  Deferred Extraction plugins can be run at a different extraction stage.
  This type of plugin also allows accessing other traces using the searcher.

* The class `api.extraction_context.ExtractionContext` has been renamed to `api.data_context.DataContext`.
  The new name `DataContext` represents the class contents better.
  Plugins have to update matching import statements accordingly.
  Plugins should also rename the named argument `context` of the plugin `process()` method to `data_context`.
  This change has no functional side effects.

  Old:

  ```python
  from hansken_extraction_plugin.api.extraction_context import ExtractionContext

  def process(self, trace, context):
      pass
  ```

  New:

  ```python
  from hansken_extraction_plugin.api.data_context import DataContext

  def process(self, trace, data_context):
      pass
  ```

* Moved `api.author.Author` to `api.plugin_info.Author`, and moved `api.maturity_level.MaturityLevel` to `api.plugin_info.MaturityLevel`.
  This is a more *pythonic* way of grouping classes into modules.
  This change has no functional side effects.
  Plugins have to update matching import statements accordingly.

  Old:

  ```python
  from hansken_extraction_plugin.api.author import Author
  from hansken_extraction_plugin.api.maturity_level import MaturityLevel
  from hansken_extraction_plugin.api.plugin_info import PluginInfo
  ```

  New:

  ```python
  from hansken_extraction_plugin.api.plugin_info import Author, MaturityLevel, PluginInfo
  ```

* Removed `DataContext.get_first_bytes()` from the public API.

* Removed `api.extraction_trace.validate_update_arguments(..)` from the public API.
  This method is still invoked implicitly when setting trace properties.