How to debug an Extraction Plugin

Debugging is the art of removing bugs — hopefully quickly.

Locally

To debug a plugin locally, it is recommended to start the plugin via the IDE. This has the advantage that breakpoints can easily be put in the code instead of printing log statements, for example. To start a plugin locally, a piece of code must be added, see Testing for more information.

Logging

The logging of the extraction plugin is displayed in the console.

Locally with Docker

Debugging an extraction plugin via docker is a bit trickier. In order to debug in Python, a debugger must be added to the extraction plugin. There are several debug modules for Python available, but one debug module that works well with Visual Studio Code is debugpy. This package is developed by Microsoft specifically for use in Visual Studio Code with Python.

Note

debugpy implements the Debug Adapter Protocol (DAP), which is a standardised way for development tools to communicate with debuggers.

Using debugpy with Docker containers requires 4 distinct steps:

  1. Install debugpy

  2. Configuring debugpy in Python

  3. Build a docker image

  4. Configuring the connection to the Docker container

  5. Setting breakpoints in your code

Install debugpy

First, add debugpy to your setup.py.

from setuptools import setup

setup(
    # ...
    install_requires=[
        "hansken-extraction-plugin==0.4.7",  # the plugin SDK
        "debugpy==1.5.1"
    ]
)

Configuring debugpy in Python

At the beginning of your script, import debugpy, and call debugpy.listen() to start the debug adapter, passing a (host, port) tuple as the first argument. Use the debugpy.wait_for_client() function to block program execution until the client is attached.

import debugpy

debugpy.listen(("0.0.0.0", 5678))
debugpy.wait_for_client()  # blocks execution until client is attached

# your extraction plugin code

Build a Docker image

If the Docker image is not built, first build the image as described here.

Configuring the connection to the Docker container

debugpy is now set up to accept connections inside a Docker container. To connect to debugpy in the docker container, port 5678 must be published. To make a port available to services outside of Docker, use the –publish or -p flag. This creates a firewall rule which maps a container port to a port on the Docker host to the outside world.

To run the extraction plugin with the published port the following command can be used:

docker run -p 5678:5678 your_extraction_plugin_name

The next step is to configure Visual Studio Code. A launch.json file must be created in order for Visual Studio Code to connect to the extraction plugin in Docker. This minimal launch.json example below tells the debugger to attach to localhost on port 5678.

{
  "version": "0.2.0",
  "configurations": [
    {
      "name": "Python: Remote Attach",
      "type": "python",
      "request": "attach",
      "connect": {
        "host": "localhost",
        "port": 5678
      },
      "pathMappings": [
        {
          "localRoot": "${workspaceFolder}",
          "remoteRoot": "."
        }
      ]
    }
  ]
}

Setting breakpoints in the code

The last step is to add breakpoints in the code.

Logging in Docker

The logging of the extraction plugin is displayed in the console after running the docker run command. In addition, the logging is also displayed in the Visual Studio Code console while debugging.

Kubernetes

In kubernetes it is currently not possible to debug via debugpy because no debug ports are published.

Logging in Kubernetes

If there is authorization to the kubernetes cluster, the logging can be viewed with the following command:

kubectl logs -f hansken-extraction-plugins/your_extraction_plugin_pod

Debug HQL

An HQL query can be debugged by running the test framework with the --verbose option enabled. The found HQL matches will then be displayed in the console. To test a plugin in python with the --verbose option enabled use the following command:

test_plugin --standalone plugin/your_plugin.py --regenerate --verbose

The following output will then be displayed in the console:

HQL match found for:
$data.type=jpg
With trace:
dataType=jpg
types={file, data}
properties={data.raw.mimeType=image/jpg, path=/test-input-trace, file.name=image.jpg, name=test-input-trace, id=0}

If the HQL query contains an error, it will be shown in the generated test results. An example of an invalid query is $data.mimeType=image/jpg (slash not escaped). This query will produce an error like the one shown below.

{
  "class": "org.hansken.plugin.extraction.hql_lite.lang.ParseException",
  "message": "HqlLiteHumanQueryParser: line 1:20 token recognition error at: '/jpg'"
}

Note

The error is only shown in the generated trace, so to find out the ParseException run the test_plugin command with the --regenerate option enabled.