Python API Changelog

This document summarizes all important API changes in the Extraction Plugin API. This document only shows changes that are important to plugin developers. For a full list of changes per version, please refer to the general changelog.

0.8.3

  • This release addresses important load balancing issues. Please use release 0.8.3 as a drop-in-replacement for releases 0.8.2 and 0.8.1.

0.8.2

  • ⚠️ This release is deprecated, please upgrade to 0.8.3

  • The build_plugin utility has been updated and the deprecation status has been removed. As with label_plugin, build_plugin now no longer requires a full (virtual) environment with all plugin dependencies and resources. This will greatly reduce build times for plugins with big dependencies and/or large models.

    The first argument of the command (a pointer to your plugin.py file) has been removed. Please do not forget to remove the first argument of build_plugin in your tox.ini or other build tooling.

    For usage read further in packaging.

  • The default read-buffer of trace.open('rb') as been changed from 1 Megabyte to 6 Megabyte to reduce overhead while data reading.

  • The data stream writer of trace.open('wb') is now buffered as well. This means that multiple small writes will be flushed after every 6 Megabytes of data has been written (or when the writer is closed).

  • The read-buffer or write-buffer size can be overridden by the user, by passing the buffer_size= argument to trace.open():

    with trace.open('rb', buffer_size=1024*1024):  # set a 1 Megabyte buffer size
        pass
    
    with trace.open('wb', buffer_size=1024*1024*12):  # set a 12 Megabyte buffer size
        pass
    
    with trace.open('wb', buffer_size=1):  # a buffer_size of 1 effectively disables the buffer:
        pass                               # each write will be flushed to Hansken directly
    
  • It is now possible to write str values to trace.open(..). To do so, pass mode='w' as additional argument. By default, it is assumed that the written text is ‘utf-8’ encoded. The default can be overwritten by using the 'encoding=' argument.

    In a future Hansken update, Hansken will set the correct data-stream properties for your text stream (mimeType, mimeClass, and fileType).

    Example use cases are:

    • write picture-to-text (OCR) data to a trace

    • write translations to a trace

    • write audio-to-text (audio transcriptions) to a trace

    • write the results of a JSON dump, e.g.: json.dump(['your', 'data'], text_writer)

    Examples in code:

    with trace.open(data_type='raw', mode='w', encoding='utf-8') as text_writer:
        text_writer.write('hello.world')  # write strings directly to it
        json.dump({'hello': 'world'}, text_writer)  # or pass the writer to json.dump
    

    See also the python code snippet.

0.8.1

  • ⚠️ This release is deprecated, please upgrade to 0.8.3

0.8.0

  • The trace property imageId is renamed to image. This is to be in line with the Hansken REST API and Python API. When updating your plugin, please update your calls trace.get('imageId') to trace.get('image').

  • #774 By default, deferred extraction plugin searches are now scoped to the image of the trace that is currently being processed. Optionally, a project-wide search can be done by passing an optional scope argument.

    def process(trace, data_context, searcher):
        # only search for traces inside the same image as the trace that is being processed
        searcher.search('*')
        searcher.search('*', scope='image')  # explicit alternative, using a str
        searcher.search('*', scope=SearchScope.image)  # explicit alternative, using the SearchScope enum
    
        # only search for traces inside the same image as the trace that is being processed
        searcher.search('*', scope='project')
        searcher.search('*', scope=SearchScope.project)
    
  • Support trace properties of type list[float]. This enables you to write multiple offsets and confidence scores in tracelets of type prediction.

    For example:

    trace.add_tracelet('prediction', {
        'modelName': 'my_cat_detector',
        'modelVersion': '0.0.BETA',
        'type': 'classification',
        'label': 'cat',
    
        # the best score
        'offset': 3.0,
        'confidence': 0.4,
    
        # all scores
        'offsets':     [0.0, 3.0, 6.0, 9.0],
        'confidences': [0.1, 0.4, 0.03, 0.09],
    })
    

0.7.3

  • This version introduces a new docker image build utility label_plugin. This utility will eventually replace build_plugin. build_plugin is now deprecated.

    label_plugin is a utility to add labels to an extraction plugin image. Labeling a plugin is required for Hansken to detect extraction plugins in a plugin image registry.

    To label a plugin, first build the plugin image with docker build; for example by using one of the following commands:

    docker build . -t my_plugin
    docker build . -t my_plugin --build-arg https_proxy=http://your_proxy:8080
    

    Next, run the label_plugin utility to label the build plugin container:

    label_plugin my_plugin
    

    The result of label_plugin is a plugin image that can be uploaded to Hansken.

    label_plugin is preferred over build_plugin, as it does not require a full (virtual) environment with all plugin dependencies and resources. This is especially preferred when the plugin uses (big) data models or (external) dependencies.

    For usage read further in packaging.

0.7.0

  • Escaping the / character in matchers is optional. This simplifies and aims for better HQL and HQL-Lite compatability. See for more information and examples the HQL-Lite syntax documentation.

    Examples:

    • Old: file.path:\/Users\/*\/AppData -> new: file.path:/Users/*/AppData

    • Old: file.path:\\/Users\\/*\\/AppData -> new: file.path:/Users/*/AppData

    • Old: registryEntry.key:\/Software\/Dropbox\/ks*\/Client-p -> new: registryEntry.key:/Software/Dropbox/ks*/Client-p

  • Hansken returns file.path properties (outside the scope of matchers) as a String property, instead of a list of strings. Example: trace.get('file.path') now returns '/dev/null', this was ['dev', 'null'].

  • Improved plugin loading when using serve_plugin and build_plugin: import statements now work for modules (python files) that are located the same directory structure of a plugin.

  • A plugin can now stream data to a trace using trace.open(mode='wb'). This removes the limit on the size of data that could be written. See also the python code snippet.

    Example:

    with trace.open(mode='wb') as writer:
        writer.write(b'a string')
        writer.write(bytes(another_string, 'utf-8'))
    

    note: this does not work when using run_with_hanskenpy.

0.6.1

  • The docker image build script build_plugin has been updated to allow for extension of the docker command. This can be especially handy for specifying a proxy. You should build your plugin container image with the following command:

    build_plugin PLUGIN_FILE DOCKER_FILE_DIRECTORY [DOCKER_IMAGE_NAME] [DOCKER_ARGS]
    

    Warning

    Note that the DOCKER_IMAGE_NAME argument no longer requires a -n parameter to be specified.

    For usage read further in packaging.

0.6.0

Warning

This is an API breaking change. Upgrading your plugin to this version will require code changes. Plugins built with previous versions of the SDK from 0.3.0 will still work with Hansken.

Warning

It is strongly recommended to upgrade your plugins to this new version because it significantly improves the start-up time of Hansken. See the migration steps below.

This release contains both build pipeline changes and API changes. Please read all changes carefully.

Build pipeline change

  • Extraction plugin container images are now labeled with PluginInfo. This allows Hansken to efficiently load extraction plugins. Migration steps from earlier versions:

    1. Update the SDK version in your setup.py / requirements.txt

    2. If you come from a version prior to 0.4.0, or if you use a plugin name instead of a plugin id in your pluginInfo(), switch to the plugin id style (read instructions for version 0.4.0)

    3. Update your build scripts to build your plugin (Docker) container image. Be sure to have the Extraction Plugins SDK installed. Then, you should build your plugin container image with the following command:

      build_plugin PLUGIN_FILE DOCKER_FILE_DIRECTORY -n [DOCKER_IMAGE_NAME]
      

      For example:

      build_plugin plugin/chatplugin.py . -n extraction-plugins/chatplugin
      

      This will generate a plugin image:

      • The extraction plugin is added to your local image registry (docker images),

      • Note that DOCKER_IMAGE_NAME is optional and will default to extraction-plugin/PLUGINID, e.g. extraction-plugin/nfi.nl/extract/chat/whatsapp,

      • The image is tagged with two tags: latest, and your plugin version.

API changes

  • The field plugin has been removed from PluginInfo.

  • The field pluginId should now be the first argument of PluginInfo (when using unnamed arguments).

    Old (unnamed arguments):

    def plugin_info(self):
        return PluginInfo(self, '1.0.0', 'description', author,
                          MaturityLevel.PROOF_OF_CONCEPT, '*, 'https://hansken.org',
                          PluginId(...), 'Apache License 2.0')
    

    New (removed self, and moved PluginId(...) to first argument position):

    def plugin_info(self):
        return PluginInfo(PluginId(...), '1.0.0', 'description',
                          author, MaturityLevel.PROOF_OF_CONCEPT,
                          '*', 'https://hansken.org', 'Apache License 2.0')
    

    Old (named arguments):

    def plugin_info(self):
        return PluginInfo(plugin=self,
                          version='1.0.0',
                          ...)
    

    New (removed plugin=self):

    def plugin_info(self):
        return PluginInfo(version='1.0.0',
                          ...)
    
  • Plugin data_context.data_size is now a variable instead of a method:

    Old:

    def process(self, trace: ExtractionTrace, data_context: DataContext):
        size = data_context.data_size()
    

    New:

    def process(self, trace: ExtractionTrace, data_context: DataContext):
        size = data_context.data_size
    
  • Simplify declaring required runtime resources in a plugin’s info.

    Extraction plugin resources don’t use the builder pattern anymore.

    Old:

    return PluginInfo(
        ...,
        resources=PluginResources.builder().maximum_cpu(0.5).maximum_memory(1000).build())
    )
    

    New:

    # no need for a builder, declare resources by direct instantiation
    return PluginInfo(
        ...,
        resources=PluginResources(maximum_cpu=2.0, maximum_memory=2048)
    )
    # or, as before, specify just on resource
    return PluginInfo(
        ...,
        resources=PluginResources(maximum_memory=4096)
    )
    

0.5.1

  • Simplify tracelet properties by making the tracelet type prefix optional.

    # using a Tracelet object
    trace.add_tracelet(Tracelet("prediction", {
        "type": "example",
        "confidence": 0.8
    }))
    # or without a Tracelet object
    trace.add_tracelet("identity", {"name": "John Doe", "status": "online"})
    
  • Enabled manual plugin testing, as described on advanced use of the test framework in Python.

0.5.0

  • Support vector data type in trace properties.

    embedding = Vector.from_sequence((width, height))
    tracelet = Tracelet("prediction", {
        "prediction.type": "example-vector",
        "prediction.embedding": embedding
    })
    trace.add_tracelet(tracelet)
    

0.4.13

  • When writing input search traces for tests, it is no longer required to explicitly set an id property. These are automatically generated when executing tests.

0.4.7

  • More $data matchers are supported in Hansken.py plugin runner. Before this improvement it was only possible to match on $data.type. Now it is also possible to match for example on $data.mimeType and $data.mimeClass. The $data matcher should still be at the end of the query as before.

0.4.6

  • It is now possible to specify maximum system resources in the PluginInfo. To run a plugin with 0.5 cpu (= 0.5 vCPU/Core/hyperthread) and 1 gb memory, for example, the following configuration can be added to PluginInfo:

    plugin_info = PluginInfo(...,
                             resources=PluginResources.builder().maximum_cpu(0.5).maximum_memory(1000).build())
    

0.4.0

  • Extraction Plugins are now identified with a PluginInfo.PluginId containing a domain, category and name. The method PluginInfo.name(pluginName) has been replaced by PluginInfo.id(new PluginId(domain, category, name). More details on the plugin naming conventions can be found at the Plugin naming convention section.

  • PluginInfo.name() is now deprecated (but will still work for backwards compatibility).

  • A new license field PluginInfo.license has also been added in this release.

  • The following example creates a PluginInfo for a plugin with the name TestPlugin, licensed under the Apache License 2.0 license:

    class TestPlugin(ExtractionPlugin):
        def plugin_info(self) -> PluginInfo:
            return PluginInfo(self,
                              version='1.0.0',
                              description='A plugin for testing.',
                              author=Author('The Externals', 'tester@holmes.nl', 'NFI'),
                              maturity=MaturityLevel.PROOF_OF_CONCEPT,
                              webpage_url='https://hansken.org',
                              matcher='file.extension=txt',
                              id=PluginId(domain='nfi.nl', category='test', name='TestPlugin'),
                              license='Apache License 2.0'
                              )
    

0.3.0

  • Extraction Plugins can now create new datastreams on a Trace through data transformations. Data transformations describe how data can be obtained from a source.

    An example case is an extraction plugin that processes an archive file. The plugin creates a child trace per entry in the archive file. Each child trace will have a datastream that is a transformation that marks the start and length of the entry in the original archive data. By just describing the data instead of specifying the actual data, a lot of space is saved.

    Although Hansken supports various transformations, the Extraction Plugins SDK for now only supports ranged data transformations. Ranged data transformations define data as a list of ranges, each range with an offset and length in a bytearray.

    The following example sets a new datastream with dataType html on a trace, by setting a ranged data transformation:

    trace.add_transformation('html', RangedTransformation(Range(offset, length)))
    

    The following example creates a child trace and sets a new datastream with dataType raw on it, by setting a ranged data transformation with two ranges:

    child = trace.child_builder('new trace')
    child.add_transformation('raw', RangedTransformation.builder()
                                                        .add_range(10, 20)
                                                        .add_range(50, 30)
                                                        .build())
    });
    

    More detailed documentation will follow in an upcoming SDK release.

0.2.0

Warning

This is an API breaking change. Plugins created with an earlier version of the extraction plugin SDK are not compatible with Hansken that uses 0.2.0 or later.

  • Introduced a new extraction plugin type api.extraction_plugin.DeferredExtractioPlugin. Deferred Extraction plugins can be run at a different extraction stage. This type of plugin also allows accessing other traces using the searcher.

  • The class api.extraction_context.ExtractionContext has been renamed to api.data_context.DataContext. The new name DataContext represents the class contents better. Plugins have to update matching import statements accordingly. Plugins should also update the named argument context to data_context of the plugin process() method. This change has no functional changes.

    Old:

    from hansken_extraction_plugin.api.extraction_context import ExtractionContext
    
    def process(self, trace, context):
      pass
    

    New:

    from hansken_extraction_plugin.api.data_context import DataContext
    
    def process(self, trace, data_context):
      pass
    
  • Moved api.author.Author to api.plugin_info.Author, and moved api.maturity_level.MaturityLevel to api.plugin_info.MaturityLevel This is a more pythonic way of grouping of classes into modules. This change has no functional side effects.

    Plugins have to update matching import statements accordingly.

    Old:

    from hansken_extraction_plugin.api.author import Author
    from hansken_extraction_plugin.api.maturity_level import MaturityLevel
    from hansken_extraction_plugin.api.plugin_info import PluginInfo
    

    New:

    from hansken_extraction_plugin.api.plugin_info import Author, MaturityLevel, PluginInfo
    
  • Removed DataContext.get_first_bytes() from the public API.

  • Removed api.extraction_trace.validate_update_arguments(..) from the public API. This method is still invoked implicitly when setting trace properties.