Python API Changelog
This document summarizes all important API changes in the Extraction Plugin API that are relevant to plugin developers. For a full list of changes per version, please refer to the general changelog.
0.8.3
This release addresses important load-balancing issues. Please use release 0.8.3 as a drop-in replacement for releases 0.8.2 and 0.8.1.
0.8.2
⚠️ This release is deprecated, please upgrade to 0.8.3.
The `build_plugin` utility has been updated and its deprecation status has been removed. As with `label_plugin`, `build_plugin` now no longer requires a full (virtual) environment with all plugin dependencies and resources. This will greatly reduce build times for plugins with big dependencies and/or large models.

The first argument of the command (a pointer to your `plugin.py` file) has been removed. Please do not forget to remove the first argument of `build_plugin` in your `tox.ini` or other build tooling.

For usage, read further in packaging.
The default read buffer of `trace.open('rb')` has been changed from 1 Megabyte to 6 Megabyte to reduce overhead while reading data.

The data stream writer of `trace.open('wb')` is now buffered as well. This means that multiple small writes are flushed after every 6 Megabytes of data has been written (or when the writer is closed).

The read-buffer or write-buffer size can be overridden by the user, by passing the `buffer_size=` argument to `trace.open()`:

```python
with trace.open('rb', buffer_size=1024*1024):     # set a 1 Megabyte buffer size
    pass

with trace.open('wb', buffer_size=1024*1024*12):  # set a 12 Megabyte buffer size
    pass

with trace.open('wb', buffer_size=1):
    # a buffer_size of 1 effectively disables the buffer:
    # each write will be flushed to Hansken directly
    pass
```
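The buffering behaviour can be illustrated with Python's standard-library `io.BufferedWriter` (an analogy only, not the SDK's actual implementation):

```python
import io

raw = io.BytesIO()
# like the SDK's write buffer, BufferedWriter coalesces small writes and only
# pushes them to the underlying stream when the buffer fills, is flushed, or is closed
buffered = io.BufferedWriter(raw, buffer_size=8)

buffered.write(b'abc')        # smaller than the buffer: kept in memory
assert raw.getvalue() == b''  # nothing has reached the underlying stream yet

buffered.flush()              # flushing (or closing) writes the buffered bytes out
assert raw.getvalue() == b'abc'
```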
It is now possible to write `str` values to `trace.open(..)`. To do so, pass `mode='w'` as an additional argument. By default, the written text is assumed to be 'utf-8' encoded. The default can be overridden by using the `encoding=` argument.

In a future Hansken update, Hansken will set the correct data-stream properties for your text stream (`mimeType`, `mimeClass`, and `fileType`).

Example use cases are:

- write picture-to-text (OCR) data to a trace
- write translations to a trace
- write audio-to-text (audio transcriptions) to a trace
- write the results of a JSON dump, e.g.: `json.dump(['your', 'data'], text_writer)`

Examples in code:

```python
with trace.open(data_type='raw', mode='w', encoding='utf-8') as text_writer:
    text_writer.write('hello.world')            # write strings directly to it
    json.dump({'hello': 'world'}, text_writer)  # or pass the writer to json.dump
```
See also the python code snippet.
0.8.1
⚠️ This release is deprecated, please upgrade to 0.8.3.
0.8.0
The trace property `imageId` has been renamed to `image`, to be in line with the Hansken REST API and Python API. When updating your plugin, please update your calls from `trace.get('imageId')` to `trace.get('image')`.

#774 By default, deferred extraction plugin searches are now scoped to the image of the trace that is currently being processed. Optionally, a project-wide search can be done by passing an optional scope argument.
```python
def process(trace, data_context, searcher):
    # only search for traces inside the same image as the trace that is being processed
    searcher.search('*')
    searcher.search('*', scope='image')            # explicit alternative, using a str
    searcher.search('*', scope=SearchScope.image)  # explicit alternative, using the SearchScope enum

    # search for traces in the entire project
    searcher.search('*', scope='project')
    searcher.search('*', scope=SearchScope.project)
```
Support trace properties of type `list[float]`. This enables you to write multiple offsets and confidence scores in tracelets of type prediction. For example:

```python
trace.add_tracelet('prediction', {
    'modelName': 'my_cat_detector',
    'modelVersion': '0.0.BETA',
    'type': 'classification',
    'label': 'cat',
    # the best score
    'offset': 3.0,
    'confidence': 0.4,
    # all scores
    'offsets': [0.0, 3.0, 6.0, 9.0],
    'confidences': [0.1, 0.4, 0.03, 0.09],
})
```
0.7.3
This version introduces a new docker image build utility `label_plugin`. This utility will eventually replace `build_plugin`. `build_plugin` is now deprecated.

`label_plugin` is a utility to add labels to an extraction plugin image. Labeling a plugin is required for Hansken to detect extraction plugins in a plugin image registry.

To label a plugin, first build the plugin image with docker build, for example by using one of the following commands:

```
docker build . -t my_plugin
docker build . -t my_plugin --build-arg https_proxy=http://your_proxy:8080
```
Next, run the `label_plugin` utility to label the built plugin container:

```
label_plugin my_plugin
```

The result of `label_plugin` is a plugin image that can be uploaded to Hansken. `label_plugin` is preferred over `build_plugin`, as it does not require a full (virtual) environment with all plugin dependencies and resources. This is especially preferred when the plugin uses (big) data models or (external) dependencies.

For usage, read further in packaging.
0.7.0
Escaping the `/` character in matchers is now optional. This simplifies matchers and improves HQL and HQL-Lite compatibility. See the HQL-Lite syntax documentation for more information and examples.

Examples:

- Old: `file.path:\/Users\/*\/AppData` -> new: `file.path:/Users/*/AppData`
- Old: `file.path:\\/Users\\/*\\/AppData` -> new: `file.path:/Users/*/AppData`
- Old: `registryEntry.key:\/Software\/Dropbox\/ks*\/Client-p` -> new: `registryEntry.key:/Software/Dropbox/ks*/Client-p`
Hansken returns `file.path` properties (outside the scope of matchers) as a string property, instead of a list of strings. Example: `trace.get('file.path')` now returns `'/dev/null'`, where this used to be `['dev', 'null']`.
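A plugin that needs to run against both old and new Hansken versions could normalize either form; a minimal sketch (the helper name is hypothetical, not part of the SDK):

```python
def normalize_path(value):
    # accept both the old list-of-strings form and the new single-string form
    if isinstance(value, list):
        return '/' + '/'.join(value)
    return value

normalize_path(['dev', 'null'])  # → '/dev/null'
normalize_path('/dev/null')      # → '/dev/null'
```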
Improved plugin loading when using `serve_plugin` and `build_plugin`: `import` statements now work for modules (python files) that are located in the same directory structure as the plugin.

A plugin can now stream data to a trace using `trace.open(mode='wb')`. This removes the limit on the size of data that can be written. See also the python code snippet. Example:

```python
with trace.open(mode='wb') as writer:
    writer.write(b'a string')
    writer.write(bytes(another_string, 'utf-8'))
```
Note: this does not work when using `run_with_hanskenpy`.
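With the size limit gone, large sources are best streamed to the writer in fixed-size chunks; a sketch with a hypothetical helper (any pair of file-like objects works):

```python
def copy_in_chunks(src, writer, chunk_size=1024 * 1024):
    # read fixed-size chunks so the full source never has to fit in memory
    while chunk := src.read(chunk_size):
        writer.write(chunk)
```

For example, inside `process()`: `with trace.open(mode='wb') as writer: copy_in_chunks(source_file, writer)`.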
0.6.1
The docker image build script `build_plugin` has been updated to allow for extension of the docker command. This can be especially handy for specifying a proxy. You should build your plugin container image with the following command:

```
build_plugin PLUGIN_FILE DOCKER_FILE_DIRECTORY [DOCKER_IMAGE_NAME] [DOCKER_ARGS]
```

Warning

Note that the `DOCKER_IMAGE_NAME` argument no longer requires a `-n` parameter to be specified.

For usage, read further in packaging.
0.6.0
Warning
This is an API breaking change. Upgrading your plugin to this version will require code changes. Plugins built with previous versions of the SDK, from 0.3.0 onwards, will still work with Hansken.
Warning
It is strongly recommended to upgrade your plugins to this new version because it significantly improves the start-up time of Hansken. See the migration steps below.
This release contains both build pipeline changes and API changes. Please read all changes carefully.
Build pipeline change
Extraction plugin container images are now labeled with PluginInfo. This allows Hansken to efficiently load extraction plugins. Migration steps from earlier versions:

1. Update the SDK version in your `setup.py` / `requirements.txt`.
2. If you come from a version prior to `0.4.0`, or if you use a plugin name instead of a plugin id in your `pluginInfo()`, switch to the plugin id style (read the instructions for version `0.4.0`).
3. Update your build scripts to build your plugin (Docker) container image. Be sure to have the Extraction Plugins SDK installed. Then build your plugin container image with the following command:

```
build_plugin PLUGIN_FILE DOCKER_FILE_DIRECTORY -n [DOCKER_IMAGE_NAME]
```

For example:

```
build_plugin plugin/chatplugin.py . -n extraction-plugins/chatplugin
```
This will generate a plugin image:

- The extraction plugin is added to your local image registry (`docker images`).
- Note that DOCKER_IMAGE_NAME is optional and will default to `extraction-plugin/PLUGINID`, e.g. `extraction-plugin/nfi.nl/extract/chat/whatsapp`.
- The image is tagged with two tags: `latest`, and your plugin version.
API changes
The field `plugin` has been removed from `PluginInfo`.

The field `pluginId` should now be the first argument of PluginInfo (when using unnamed arguments).

Old (unnamed arguments):

```python
def plugin_info(self):
    return PluginInfo(self,
                      '1.0.0',
                      'description',
                      author,
                      MaturityLevel.PROOF_OF_CONCEPT,
                      '*',
                      'https://hansken.org',
                      PluginId(...),
                      'Apache License 2.0')
```

New (removed `self`, and moved `PluginId(...)` to the first argument position):

```python
def plugin_info(self):
    return PluginInfo(PluginId(...),
                      '1.0.0',
                      'description',
                      author,
                      MaturityLevel.PROOF_OF_CONCEPT,
                      '*',
                      'https://hansken.org',
                      'Apache License 2.0')
```
Old (named arguments):

```python
def plugin_info(self):
    return PluginInfo(plugin=self, version='1.0.0', ...)
```

New (removed `plugin=self`):

```python
def plugin_info(self):
    return PluginInfo(version='1.0.0', ...)
```
The plugin's `data_context.data_size` is now a property instead of a method.

Old:

```python
def process(self, trace: ExtractionTrace, data_context: DataContext):
    size = data_context.data_size()
```

New:

```python
def process(self, trace: ExtractionTrace, data_context: DataContext):
    size = data_context.data_size
```
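The switch from method to attribute access can be mimicked with Python's `@property` decorator; a simplified sketch, not the SDK's actual `DataContext` implementation:

```python
class DataContext:
    def __init__(self, data_size):
        self._data_size = data_size

    @property
    def data_size(self):
        # accessed as an attribute (data_context.data_size), without call parentheses
        return self._data_size

DataContext(4096).data_size  # → 4096
```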
Simplify declaring required runtime resources in a plugin's info: extraction plugin resources don't use the builder pattern anymore.

Old:

```python
return PluginInfo(
    ...,
    resources=PluginResources.builder().maximum_cpu(0.5).maximum_memory(1000).build())
```

New:

```python
# no need for a builder, declare resources by direct instantiation
return PluginInfo(
    ...,
    resources=PluginResources(maximum_cpu=2.0, maximum_memory=2048))

# or, as before, specify just one resource
return PluginInfo(
    ...,
    resources=PluginResources(maximum_memory=4096))
```
0.5.1
Simplify tracelet properties by making the tracelet type prefix optional.

```python
# using a Tracelet object
trace.add_tracelet(Tracelet("prediction", {
    "type": "example",
    "confidence": 0.8
}))

# or without a Tracelet object
trace.add_tracelet("identity", {"name": "John Doe", "status": "online"})
```
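Making the prefix optional means bare keys are interpreted relative to the tracelet type; a hypothetical normalization helper illustrates the equivalence (not SDK code):

```python
def expand_tracelet_keys(tracelet_type, properties):
    # prefix bare keys with the tracelet type; already-prefixed keys are kept as-is
    prefix = tracelet_type + '.'
    return {key if key.startswith(prefix) else prefix + key: value
            for key, value in properties.items()}

expand_tracelet_keys('prediction', {'type': 'example', 'confidence': 0.8})
# → {'prediction.type': 'example', 'prediction.confidence': 0.8}
```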
Enabled manual plugin testing, as described in the advanced use of the test framework in Python.
0.5.0
Support vector data type in trace properties.

```python
embedding = Vector.from_sequence((width, height))
tracelet = Tracelet("prediction", {
    "prediction.type": "example-vector",
    "prediction.embedding": embedding
})
trace.add_tracelet(tracelet)
```
0.4.13
When writing input search traces for tests, it is no longer required to explicitly set an `id` property. These are generated automatically when executing tests.
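Conceptually, the test framework now fills in a missing `id` along these lines (an illustration with a hypothetical helper, not the framework's actual code):

```python
import uuid

def ensure_id(trace_properties):
    # generate an id only when the test author did not provide one explicitly
    trace_properties.setdefault('id', str(uuid.uuid4()))
    return trace_properties

ensure_id({'name': 'test trace'})            # an 'id' is generated
ensure_id({'id': 'fixed', 'name': 'other'})  # the explicit 'id' is kept
```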
0.4.7
More `$data` matchers are supported in the Hansken.py plugin runner. Before this improvement it was only possible to match on `$data.type`. Now it is also possible to match, for example, on `$data.mimeType` and `$data.mimeClass`. The `$data` matcher should still be at the end of the query, as before.
0.4.6
It is now possible to specify maximum system resources in the `PluginInfo`. To run a plugin with 0.5 cpu (= 0.5 vCPU/Core/hyperthread) and 1 GB memory, for example, the following configuration can be added to `PluginInfo`:

```python
plugin_info = PluginInfo(...,
                         resources=PluginResources.builder()
                                                  .maximum_cpu(0.5)
                                                  .maximum_memory(1000)
                                                  .build())
```
0.4.0
Extraction Plugins are now identified with a `PluginInfo.PluginId` containing a domain, category and name. The method `PluginInfo.name(pluginName)` has been replaced by `PluginInfo.id(PluginId(domain, category, name))`. More details on the plugin naming conventions can be found in the Plugin naming convention section. `PluginInfo.name()` is now deprecated (but will still work for backwards compatibility).

A new license field `PluginInfo.license` has also been added in this release.

The following example creates a PluginInfo for a plugin with the name `TestPlugin`, licensed under the `Apache License 2.0` license:

```python
class TestPlugin(ExtractionPlugin):

    def plugin_info(self) -> PluginInfo:
        return PluginInfo(self,
                          version='1.0.0',
                          description='A plugin for testing.',
                          author=Author('The Externals', 'tester@holmes.nl', 'NFI'),
                          maturity=MaturityLevel.PROOF_OF_CONCEPT,
                          webpage_url='https://hansken.org',
                          matcher='file.extension=txt',
                          id=PluginId(domain='nfi.nl', category='test', name='TestPlugin'),
                          license='Apache License 2.0')
```
0.3.0
Extraction Plugins can now create new datastreams on a Trace through data transformations. Data transformations describe how data can be obtained from a source.
An example case is an extraction plugin that processes an archive file. The plugin creates a child trace per entry in the archive file. Each child trace will have a datastream that is a transformation that marks the start and length of the entry in the original archive data. By just describing the data instead of specifying the actual data, a lot of space is saved.
Although Hansken supports various transformations, the Extraction Plugins SDK currently only supports ranged data transformations. A ranged data transformation defines data as a list of ranges, each with an offset and a length into the source data.
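Since a ranged transformation is just a list of (offset, length) pairs, the size of the described data is simply the sum of the lengths; a small illustration (not SDK code):

```python
def transformed_size(ranges):
    # total length of the described data: the lengths of all ranges summed
    return sum(length for _offset, length in ranges)

# e.g. two ranges of 20 and 30 bytes describe 50 bytes of data
transformed_size([(10, 20), (50, 30)])  # → 50
```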
The following example sets a new datastream with dataType `html` on a trace, by setting a ranged data transformation:

```python
trace.add_transformation('html', RangedTransformation(Range(offset, length)))
```
The following example creates a child trace and sets a new datastream with dataType `raw` on it, by setting a ranged data transformation with two ranges:

```python
child = trace.child_builder('new trace')
child.add_transformation('raw', RangedTransformation.builder()
                                                    .add_range(10, 20)
                                                    .add_range(50, 30)
                                                    .build())
```
More detailed documentation will follow in an upcoming SDK release.
0.2.0
Warning
This is an API breaking change. Plugins created with an earlier version of the extraction plugin SDK are not compatible with Hansken versions that use 0.2.0 or later.
Introduced a new extraction plugin type `api.extraction_plugin.DeferredExtractionPlugin`. Deferred extraction plugins can be run at a different extraction stage. This type of plugin also allows accessing other traces using the searcher.

The class `api.extraction_context.ExtractionContext` has been renamed to `api.data_context.DataContext`. The new name `DataContext` represents the class contents better. Plugins have to update matching import statements accordingly. Plugins should also update the named argument `context` to `data_context` of the plugin `process()` method. This change has no functional impact.

Old:

```python
from hansken_extraction_plugin.api.extraction_context import ExtractionContext

def process(self, trace, context):
    pass
```

New:

```python
from hansken_extraction_plugin.api.data_context import DataContext

def process(self, trace, data_context):
    pass
```
Moved `api.author.Author` to `api.plugin_info.Author`, and moved `api.maturity_level.MaturityLevel` to `api.plugin_info.MaturityLevel`. This is a more pythonic way of grouping classes into modules. This change has no functional side effects. Plugins have to update matching import statements accordingly.

Old:

```python
from hansken_extraction_plugin.api.author import Author
from hansken_extraction_plugin.api.maturity_level import MaturityLevel
from hansken_extraction_plugin.api.plugin_info import PluginInfo
```

New:

```python
from hansken_extraction_plugin.api.plugin_info import Author, MaturityLevel, PluginInfo
```
Removed `DataContext.get_first_bytes()` from the public API.

Removed `api.extraction_trace.validate_update_arguments(..)` from the public API. This method is still invoked implicitly when setting trace properties.