# Python API Changelog

This document summarizes all important API changes in the Extraction Plugin API. It only lists changes that are relevant to plugin developers. For a full list of changes per version, please refer to the general :ref:`changelog`.

.. If present, remove `..` before `## |version|` if you create a new entry after a previous release.
.. ## |version|

## 0.9.4

* This release supports providing all types of transformer arguments when using the `execute_transformer` script.

## 0.9.2

* This release introduces a simple flow-control mechanism that fixes `connection RST` errors that can occur when a plugin produces traces too fast. Plugins built with this version are **not backwards compatible**.
* Python plugins now support transformers: remote methods that can be executed using the Hansken REST API. More information can be found in [the docs](transformers.md).
* The `build_plugin` and `label_plugin` utilities shut down containers prematurely if the building and labeling process takes too long, causing the process to fail for slow containers. If your plugin takes a long time to start, you may want to increase the timeout before the script stops trying to connect and aborts the process of building the plugin. This can be done using the new optional `--timeout` argument. The default is set to 30 seconds.
* The optional image name argument of `build_plugin` is changed to a flag. Build scripts can be updated using `--target-name DOCKER_IMAGE_NAME`. An example invocation with the new flags is shown below.
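
The following sketch combines both new options in a single `build_plugin` call. The flag names are taken from this release; the positional argument (assumed here to be the directory containing the Dockerfile) and the image name are illustrative, see [packaging](packaging.md) for the authoritative usage.

```shell
# allow slow plugins two minutes to start, and name the resulting image explicitly
build_plugin . --timeout 120 --target-name extraction-plugins/my_plugin
```
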
## 0.8.3

* This release addresses important load balancing issues. Please use release `0.8.3` as a drop-in replacement for releases `0.8.2` and `0.8.1`.

## 0.8.2

* ⚠️ This release is deprecated, please upgrade to `0.8.3`
* The `build_plugin` utility has been updated and the deprecation status has been removed. As with `label_plugin`, `build_plugin` no longer requires a full (virtual) environment with all plugin dependencies and resources. This will greatly reduce build times for plugins with big dependencies and/or large models. The first argument of the command (a pointer to your `plugin.py` file) has been removed. Please do not forget to remove the first argument of `build_plugin` in your `tox.ini` or other build tooling. For usage read further in [packaging](packaging.md).
* The default read-buffer of `trace.open('rb')` has been changed from 1 Megabyte to 6 Megabyte to reduce overhead while reading data.
* The data stream writer of `trace.open('wb')` is now buffered as well. This means that multiple small writes will be flushed after every 6 Megabytes of data has been written (or when the writer is closed).
* The read-buffer or write-buffer size can be overridden by the user, by passing the `buffer_size=` argument to `trace.open()`:

  ```python
  with trace.open('rb', buffer_size=1024*1024):      # set a 1 Megabyte buffer size
      pass

  with trace.open('wb', buffer_size=1024*1024*12):   # set a 12 Megabyte buffer size
      pass

  with trace.open('wb', buffer_size=1):  # a buffer_size of 1 effectively disables the buffer:
      pass                               # each write will be flushed to Hansken directly
  ```
* It is now possible to write `str` values to `trace.open(..)`. To do so, pass `mode='w'` as an additional argument. By default, it is assumed that the written text is 'utf-8' encoded. The default can be overridden by using the `encoding=` argument. In a future Hansken update, Hansken will set the correct data-stream properties for your text stream (`mimeType`, `mimeClass`, and `fileType`).

  Example use cases are:

  * write picture-to-text (OCR) data to a trace
  * write translations to a trace
  * write audio-to-text (audio transcriptions) to a trace
  * write the results of a JSON dump, e.g.: `json.dump(['your', 'data'], text_writer)`

  Examples in code:

  ```python
  with trace.open(data_type='raw', mode='w', encoding='utf-8') as text_writer:
      text_writer.write('hello.world')            # write strings directly to it
      json.dump({'hello': 'world'}, text_writer)  # or pass the writer to json.dump
  ```

  See also :ref:`the python code snippet`.

## 0.8.1

* ⚠️ This release is deprecated, please upgrade to `0.8.3`

## 0.8.0

* The trace property `imageId` is renamed to `image`. This brings it in line with the Hansken REST API and Python API. When updating your plugin, please change your calls to `trace.get('imageId')` into `trace.get('image')` (a short before/after snippet follows at the end of this section).
* [#774](https://git.eminjenv.nl/hansken/hbacklog/-/issues/774) By default, deferred extraction plugin searches are now scoped to the image of the trace that is currently being processed. Optionally, a project-wide search can be done by passing an optional scope argument.

  ```python
  def process(trace, data_context, searcher):
      # only search for traces inside the same image as the trace that is being processed
      searcher.search('*')
      searcher.search('*', scope='image')            # explicit alternative, using a str
      searcher.search('*', scope=SearchScope.image)  # explicit alternative, using the SearchScope enum

      # search for traces inside the entire project
      searcher.search('*', scope='project')
      searcher.search('*', scope=SearchScope.project)
  ```
* Support trace properties of type `list[float]`. This enables you to write multiple offsets and confidence scores in tracelets of type prediction. For example:

  ```python
  trace.add_tracelet('prediction', {
      'modelName': 'my_cat_detector',
      'modelVersion': '0.0.BETA',
      'type': 'classification',
      'label': 'cat',
      # the best score
      'offset': 3.0,
      'confidence': 0.4,
      # all scores
      'offsets': [0.0, 3.0, 6.0, 9.0],
      'confidences': [0.1, 0.4, 0.03, 0.09],
  })
  ```
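
As a minimal before/after illustration of the `imageId` rename from the first item in this section (the surrounding variable name is arbitrary):

```python
# before (SDK < 0.8.0)
image_id = trace.get('imageId')

# after (SDK >= 0.8.0): the property is now called 'image'
image_id = trace.get('image')
```
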
## 0.7.3

* This version introduces a new docker image build utility `label_plugin`. This utility will eventually replace `build_plugin`. `build_plugin` is now deprecated.

  `label_plugin` is a utility to add labels to an extraction plugin image. Labeling a plugin is required for Hansken to detect extraction plugins in a plugin image registry. To label a plugin, first build the plugin image with [docker build](https://docs.docker.com/reference/cli/docker/image/build/); for example by using one of the following commands:

  ```shell
  docker build . -t my_plugin
  docker build . -t my_plugin --build-arg https_proxy=http://your_proxy:8080
  ```

  Next, run the `label_plugin` utility to label the plugin image you just built:

  ```shell
  label_plugin my_plugin
  ```

  The result of `label_plugin` is a plugin image that can be :ref:`uploaded to Hansken`.

  `label_plugin` is preferred over `build_plugin`, as it does not require a full (virtual) environment with all plugin dependencies and resources. This is especially preferred when the plugin uses (big) data models or (external) dependencies. For usage read further in [packaging](packaging.md).

## 0.7.0

* Escaping the `/` character in matchers is now optional. This simplifies matchers and aims for better HQL and HQL-Lite compatibility. For more information and examples, see the :ref:`HQL-Lite syntax documentation`.

  Examples:

  * Old: `file.path:\/Users\/*\/AppData` -> new: `file.path:/Users/*/AppData`
  * Old: `file.path:\\/Users\\/*\\/AppData` -> new: `file.path:/Users/*/AppData`
  * Old: `registryEntry.key:\/Software\/Dropbox\/ks*\/Client-p` -> new: `registryEntry.key:/Software/Dropbox/ks*/Client-p`
* Hansken returns `file.path` properties (outside the scope of matchers) as a `String` property, instead of a list of strings. Example: `trace.get('file.path')` now returns `'/dev/null'`; previously this was `['dev', 'null']`.
* Improved plugin loading when using `serve_plugin` and `build_plugin`: `import` statements now work for modules (python files) that are located in the same directory structure as the plugin.
* A plugin can now stream data to a trace using `trace.open(mode='wb')`. This removes the limit on the size of data that could be written. See also :ref:`the python code snippet`. Example:

  ```python
  with trace.open(mode='wb') as writer:
      writer.write(b'a string')
      writer.write(bytes(another_string, 'utf-8'))
  ```

  _note_: this does not work when using `run_with_hanskenpy`.

## 0.6.1

* The docker image build script `build_plugin` has been updated to allow for extension of the docker command. This can be especially handy for specifying a proxy. You should build your plugin container image with the following command:

  ```bash
  build_plugin PLUGIN_FILE DOCKER_FILE_DIRECTORY [DOCKER_IMAGE_NAME] [DOCKER_ARGS]
  ```

.. warning:: Note that the `DOCKER_IMAGE_NAME` argument no longer requires a `-n` parameter to be specified.

For usage read further in [packaging](packaging.md).
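
For example, the sketch below builds the chat plugin from the `0.6.0` entry further down through a proxy. The `--build-arg` part mirrors the `docker build` example under `0.7.3` above; whether such extra arguments are forwarded to docker verbatim may depend on your environment.

```bash
# hypothetical example: pass a proxy setting to the docker build via DOCKER_ARGS
build_plugin plugin/chatplugin.py . extraction-plugins/chatplugin --build-arg https_proxy=http://your_proxy:8080
```
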
## 0.6.0

.. warning:: This is an API breaking change. Upgrading your plugin to this version will require code changes. Plugins built with previous versions of the SDK (from `0.3.0` onwards) will still work with Hansken.

.. warning:: It is strongly recommended to upgrade your plugins to this new version because it significantly improves the start-up time of Hansken. See the migration steps below.

This release contains both build pipeline changes and API changes. Please read all changes carefully.

### Build pipeline change

* Extraction plugin container images are now labeled with PluginInfo. This allows Hansken to efficiently load extraction plugins.

Migration steps from earlier versions:

1. Update the SDK version in your `setup.py` / `requirements.txt`
2. If you come from a version prior to `0.4.0`, or if you use a plugin name instead of a plugin id in your `pluginInfo()`, switch to the plugin id style (read the instructions for version `0.4.0`)
3. Update your build scripts to build your plugin (Docker) container image. Be sure to [have the Extraction Plugins SDK installed](getting_started.md#Installation). Then, you should build your plugin container image with the following command:

   ```bash
   build_plugin PLUGIN_FILE DOCKER_FILE_DIRECTORY -n [DOCKER_IMAGE_NAME]
   ```

   For example:

   ```bash
   build_plugin plugin/chatplugin.py . -n extraction-plugins/chatplugin
   ```

   This will generate a plugin image:

   * The extraction plugin is added to your local image registry (`docker images`),
   * Note that `DOCKER_IMAGE_NAME` is optional and will default to `extraction-plugin/PLUGINID`, e.g. `extraction-plugin/nfi.nl/extract/chat/whatsapp`,
   * The image is tagged with two tags: `latest`, and your plugin version.

### API changes

* The field `plugin` has been removed from `PluginInfo`.
* The field `pluginId` should now be the first argument of `PluginInfo` (when using unnamed arguments).

  Old (unnamed arguments):

  ```python
  def plugin_info(self):
      return PluginInfo(self, '1.0.0', 'description', author, MaturityLevel.PROOF_OF_CONCEPT, '*', 'https://hansken.org', PluginId(...), 'Apache License 2.0')
  ```

  New (removed `self`, and moved `PluginId(...)` to the first argument position):

  ```python
  def plugin_info(self):
      return PluginInfo(PluginId(...), '1.0.0', 'description', author, MaturityLevel.PROOF_OF_CONCEPT, '*', 'https://hansken.org', 'Apache License 2.0')
  ```

  Old (named arguments):

  ```python
  def plugin_info(self):
      return PluginInfo(plugin=self, version='1.0.0', ...)
  ```

  New (removed `plugin=self`):

  ```python
  def plugin_info(self):
      return PluginInfo(version='1.0.0', ...)
  ```
* Plugin `data_context.data_size` is now a variable instead of a method:

  Old:

  ```python
  def process(self, trace: ExtractionTrace, data_context: DataContext):
      size = data_context.data_size()
  ```

  New:

  ```python
  def process(self, trace: ExtractionTrace, data_context: DataContext):
      size = data_context.data_size
  ```
* Simplify declaring required runtime resources in a plugin's info. Extraction plugin resources don't use the builder pattern anymore.

  Old:

  ```python
  return PluginInfo(
      ...,
      resources=PluginResources.builder().maximum_cpu(0.5).maximum_memory(1000).build()
  )
  ```

  New:

  ```python
  # no need for a builder, declare resources by direct instantiation
  return PluginInfo(
      ...,
      resources=PluginResources(maximum_cpu=2.0, maximum_memory=2048)
  )

  # or, as before, specify just one resource
  return PluginInfo(
      ...,
      resources=PluginResources(maximum_memory=4096)
  )
  ```

## 0.5.1

* Simplify tracelet properties by making the tracelet type prefix optional.

  ```python
  # using a Tracelet object
  trace.add_tracelet(Tracelet("prediction", {
      "type": "example",
      "confidence": 0.8
  }))

  # or without a Tracelet object
  trace.add_tracelet("identity", {"name": "John Doe", "status": "online"})
  ```
* Enabled _manual_ plugin testing, as described on :ref:`advanced use of the test framework in Python`.

## 0.5.0

* Support vector data type in trace properties.

  ```python
  embedding = Vector.from_sequence((width, height))
  tracelet = Tracelet("prediction", {
      "prediction.type": "example-vector",
      "prediction.embedding": embedding
  })
  trace.add_tracelet(tracelet)
  ```

## 0.4.13

* When writing input search traces for tests, it is no longer required to explicitly set an `id` property. These are automatically generated when executing tests.

## 0.4.7

* More `$data` matchers are supported in the Hansken.py plugin runner. Before this improvement it was only possible to match on `$data.type`. Now it is also possible to match, for example, on `$data.mimeType` and `$data.mimeClass`. The `$data` matcher should still be at the end of the query, as before.
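
As an illustration, a plugin that should only process PDF data streams could now declare a matcher along these lines. The plugin class and the property value are hypothetical; consult the HQL-Lite documentation for the exact matcher syntax supported by the plugin runner.

```python
class PdfPlugin(ExtractionPlugin):

    def plugin_info(self) -> PluginInfo:
        return PluginInfo(...,
                          # match on the MIME type of the data stream;
                          # the $data part still has to be at the end of the query
                          matcher='$data.mimeType=application/pdf')
```
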
## 0.4.6

* It is now possible to specify maximum system resources in the `PluginInfo`. To run a plugin with 0.5 cpu (= 0.5 vCPU/Core/hyperthread) and 1 GB memory, for example, the following configuration can be added to `PluginInfo`:

  ```python
  plugin_info = PluginInfo(..., resources=PluginResources.builder().maximum_cpu(0.5).maximum_memory(1000).build())
  ```

## 0.4.0

* Extraction Plugins are now identified with a `PluginInfo.PluginId` containing a domain, category and name. The method `PluginInfo.name(pluginName)` has been replaced by `PluginInfo.id(PluginId(domain, category, name))`. More details on the plugin naming conventions can be found in the :doc:`../concepts/plugin_naming_convention` section.
* `PluginInfo.name()` is now deprecated (but will still work for backwards compatibility).
* A new license field `PluginInfo.license` has also been added in this release.
* The following example creates a PluginInfo for a plugin with the name `TestPlugin`, licensed under the `Apache License 2.0` license:

  ```python
  class TestPlugin(ExtractionPlugin):

      def plugin_info(self) -> PluginInfo:
          return PluginInfo(self,
                            version='1.0.0',
                            description='A plugin for testing.',
                            author=Author('The Externals', 'tester@holmes.nl', 'NFI'),
                            maturity=MaturityLevel.PROOF_OF_CONCEPT,
                            webpage_url='https://hansken.org',
                            matcher='file.extension=txt',
                            id=PluginId(domain='nfi.nl', category='test', name='TestPlugin'),
                            license='Apache License 2.0')
  ```

## 0.3.0

* Extraction Plugins can now create new datastreams on a Trace through data transformations. Data transformations describe how data can be obtained from a source. An example case is an extraction plugin that processes an archive file. The plugin creates a child trace per entry in the archive file. Each child trace will have a datastream that is a transformation that marks the start and length of the entry in the original archive data. By just describing the data instead of specifying the actual data, a lot of space is saved.

  Although Hansken supports various transformations, the Extraction Plugins SDK for now only supports ranged data transformations. Ranged data transformations define data as a list of ranges, each range with an offset and length in a bytearray.

  The following example sets a new datastream with dataType `html` on a trace, by setting a ranged data transformation:

  ```python
  trace.add_transformation('html', RangedTransformation(Range(offset, length)))
  ```

  The following example creates a child trace and sets a new datastream with dataType `raw` on it, by setting a ranged data transformation with two ranges:

  ```python
  child = trace.child_builder('new trace')
  child.add_transformation('raw', RangedTransformation.builder()
                                  .add_range(10, 20)
                                  .add_range(50, 30)
                                  .build())
  ```

  More detailed documentation will follow in an upcoming SDK release.

## 0.2.0

.. warning:: This is an API breaking change. Plugins created with an earlier version of the extraction plugin SDK are not compatible with a Hansken version that uses `0.2.0` or later.

* Introduced a new extraction plugin type `api.extraction_plugin.DeferredExtractionPlugin`. Deferred Extraction plugins can be run at a different extraction stage. This type of plugin also allows accessing other traces using the searcher (a minimal sketch of such a plugin follows at the end of this section).
* The class `api.extraction_context.ExtractionContext` has been renamed to `api.data_context.DataContext`. The new name `DataContext` represents the class contents better. Plugins have to update matching import statements accordingly. Plugins should also rename the named argument `context` to `data_context` in the plugin's `process()` method. This change has no functional impact.

  Old:

  ```python
  from hansken_extraction_plugin.api.extraction_context import ExtractionContext

  def process(self, trace, context):
      pass
  ```

  New:

  ```python
  from hansken_extraction_plugin.api.data_context import DataContext

  def process(self, trace, data_context):
      pass
  ```
* Moved `api.author.Author` to `api.plugin_info.Author`, and moved `api.maturity_level.MaturityLevel` to `api.plugin_info.MaturityLevel`. This is a more *pythonic* way of grouping classes into modules. This change has no functional side effects. Plugins have to update matching import statements accordingly.

  Old:

  ```python
  from hansken_extraction_plugin.api.author import Author
  from hansken_extraction_plugin.api.maturity_level import MaturityLevel
  from hansken_extraction_plugin.api.plugin_info import PluginInfo
  ```

  New:

  ```python
  from hansken_extraction_plugin.api.plugin_info import Author, MaturityLevel, PluginInfo
  ```
* Removed `DataContext.get_first_bytes()` from the public API.
* Removed `api.extraction_trace.validate_update_arguments(..)` from the public API. This method is still invoked implicitly when setting trace properties.
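
As referenced in the first item of this section, a minimal sketch of a deferred extraction plugin is shown below. The module and class name follow that item; the `process(trace, data_context, searcher)` signature follows the deferred-search example under `0.8.0` above. The plugin class name and the query are hypothetical, and `plugin_info()` is omitted for brevity.

```python
from hansken_extraction_plugin.api.extraction_plugin import DeferredExtractionPlugin


class MyDeferredPlugin(DeferredExtractionPlugin):
    # plugin_info() is omitted here for brevity

    def process(self, trace, data_context, searcher):
        # deferred plugins can access other traces through the searcher;
        # the query below is just a placeholder
        other_traces = searcher.search('*')
        ...
```
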