Setting up an experiment

This page shows how to write an experiment setup using LiR.

Before you begin, make sure you have a working lir command.

Experiments are defined in an experiment setup file. For a quick start, use one of the examples.

An experiment setup file is a YAML file with at least the experiments property, which lists the experiments, and the output_path property:

output_path: ./output  # this is where generated output is saved
experiments:
  - ... definition of experiment 1 ...
  - ... definition of experiment 2 ...

LiR uses the confidence package for parsing YAML files. Refer to its documentation to make the most of it! In particular, you may use the timestamp variable to substitute the current date and time in the YAML configuration, for example:

output_path: ./${timestamp}_output  # this is where generated output is saved
experiments:
  - ... definition of experiment 1 ...
  - ... definition of experiment 2 ...

Experiment definition

An experiment definition contains all the configuration required to run an experiment. There are several ways to set up an experiment, but the simplest strategy is to use a single LR system and run it once. This is the single_run strategy. Below is a fully working example:

output_path: ./output/${timestamp}_single_run_example  # This is where generated output is saved.
experiments:
  - name: my_experiment               # Choose a descriptive name for the experiment.
    strategy: single_run              # single_run is the most basic strategy.

    # The data section: which data to use and how to use it.
    data:
      provider:                       # The data provider specifies which data to use.
        method: parse_features_from_csv_url  # We use the generic CSV parser to get glass data.
        url: https://raw.githubusercontent.com/NetherlandsForensicInstitute/elemental_composition_glass/refs/heads/main/training.csv
        source_id_column: Item        # This column identifies the source.
        ignore_columns:               # We don't need these columns at the moment, so we can safely ignore them.
          - id
          - Piece

      splits:                         # The data splitter specifies how to split the data into training and test sets.
        strategy: train_test_sources  # Use a single train/test split, and split on the source IDs.
        test_size: 0.5                # Split the data 50/50.

    # The LR system section: this is where the action is.
    lr_system:
      architecture: score_based
      preprocessing: standard_scaler  # The preprocessing is applied on the single measurements, before pairing.
      pairing:                        # The pairing method governs the way the measurements are paired.
        method: source_pairs          # The source_pairs method generates one pair for each combination of sources.
        ratio_limit: 1                # Subsample different-source pairs until their number is the same as the number of
                                      # same-source pairs.
      comparing:                      # The comparing module is applied on pairs.
        method: pipeline              # We need two modules, so we combine them in a pipeline.
        steps:
          score: manhattan_distance   # First, we use the manhattan distance to calculate a "score".
          to_llr: logistic_calibrator # Next, we turn the scores into LLRs.

    # The output section: where results are aggregated.
    output:
      - method: metrics
        columns:
          - cllr
          - cllr_min

Every experiment configuration has a strategy property, which defines the type of experiment and which configuration settings are required. Our single_run setup has three main sections:

  • data, which defines the dataset (provider) and how it is split into training and test data (splits);

  • lr_system, which defines the LR system; and

  • output, which specifies the required output.

More advanced strategies, such as the grid strategy and the optuna strategy, are extensions of this. They can evaluate a range of LR systems for model selection, sensitivity analyses, or hyperparameter optimization.

For now, we’ll stick with the single_run strategy. Save this setup to a file named minimal-single-run.yaml and run it as:

lir minimal-single-run.yaml

This will:

  • load data from a CSV file;

  • split the data randomly into a training set and a test set;

  • train the LR system on the training set;

  • apply the LR system on the test set;

  • calculate the CLLR and the CLLR_min on the test set.
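The Cllr metric reported here can be computed from the calibrated log-LRs of the same-source and different-source pairs. Below is a minimal sketch of the standard Cllr formula (assuming base-2 log-LRs); it illustrates the metric and is not LiR's implementation:

```python
import math

def cllr(ss_llrs, ds_llrs):
    """Log-likelihood-ratio cost.

    ss_llrs: log2-LRs of same-source pairs.
    ds_llrs: log2-LRs of different-source pairs.
    """
    ss = sum(math.log2(1 + 2 ** -llr) for llr in ss_llrs) / len(ss_llrs)
    ds = sum(math.log2(1 + 2 ** llr) for llr in ds_llrs) / len(ds_llrs)
    return 0.5 * (ss + ds)

# A system that always outputs LR = 1 (log-LR = 0) is uninformative and scores Cllr = 1.
print(cllr([0.0], [0.0]))  # → 1.0
```

Lower values are better: a well-calibrated, discriminating system has Cllr close to 0, and Cllr_min is the value after optimal recalibration.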

Results are written to the directory in output_path.

By default, a detailed log file is written to output_path/log.txt.

Within output_path, a directory is created with the same name as the experiment. In the above example, the experiment only has a single run, but other experiment strategies may have multiple. In that case, a separate directory is created for each run, with the following files:

  • data.yaml: the data organization that was used for the run;

  • lrsystem.yaml: the LR system setup that was used for the run.

Additionally, we expect to find the output specified in the output section. After running the example, the full directory listing is as follows.

my_experiment/
my_experiment/data.yaml
my_experiment/metrics.csv
my_experiment/lrsystem.yaml
log.txt
config.yaml

Data organization

The data provider delivers the dataset. It has at least the method property; any other property is passed as a parameter to the data provider method.

In evaluative settings, the data provider delivers labeled data. In the setup above, that means that the instances include source IDs. We don’t have hypothesis labels yet, because we evaluate the instances only after pairing them.
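The way hypothesis labels arise from source IDs during pairing can be sketched as follows. This is a simplified, hypothetical illustration (pairing individual instances directly, which is not LiR's API): a pair is same-source when both members share a source ID.

```python
from itertools import combinations

# Instances labeled with source IDs; a pair is "same source" (H1)
# when both members come from the same source.
instances = [("s1", [0.1]), ("s1", [0.2]), ("s2", [0.9])]

pairs = [
    ((a, b), source_a == source_b)
    for (source_a, a), (source_b, b) in combinations(instances, 2)
]

# 3 instances yield 3 pairs: one same-source, two different-source.
print([label for _, label in pairs])  # → [True, False, False]
```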

We could replace the glass data with another data provider. Let’s try synthesized data:

provider:
  method: synthesized_normal_multiclass  # for a comparative evaluation, we use the multiclass variant
  seed: 0                    # the random seed
  population:
    size: 100                # the number of sources
    instances_per_source: 2  # the number of instances that are drawn for each source
  dimensions:
    - mean: 0                # the average true value of all sources
      std: 1                 # the standard deviation of the true value of all sources
      error_std:             # the standard deviation of the measurement error

In the single_run example above, we used a train/test split (train_test_sources) to split the data into a training set and a test set. This is fine if you have plenty of data, but other splitting strategies use the data more efficiently. We could instead use cross-validation:

splits:
  strategy: cross_validation_sources
  folds: 5

Now, if we run the experiment again, the system will be trained and applied five times, each time on a different subset. The five test sets are merged, so we can calculate the metrics on the full dataset!
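The essence of splitting on sources is that all instances of a source land on the same side of the split, and over all folds every instance is tested exactly once. A hypothetical sketch of the idea (not LiR's implementation):

```python
# Split on *sources*: all instances of one source end up in the same fold.
sources = ["s1", "s1", "s2", "s2", "s3", "s3", "s4", "s4", "s5", "s5"]
folds = 5

# Assign each unique source to a fold.
assignment = {s: i % folds for i, s in enumerate(sorted(set(sources)))}

test_sets = [
    [i for i, s in enumerate(sources) if assignment[s] == fold]
    for fold in range(folds)
]

# Merging the test sets covers the full dataset exactly once.
merged = sorted(i for test in test_sets for i in test)
print(merged == list(range(len(sources))))  # → True
```

Splitting on sources rather than on instances prevents leakage: no source contributes to both training and testing within a fold.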

In some cases, we want full control over which instances are used for training and which end up in the test set. In that case, the train/test roles are assigned by the data provider, and we use one of the predefined splitting strategies, predefined_train_test or predefined_cross_validation:

data:
  provider:
    method: glass
    cache_dir: .glass-data
  splits:
    strategy: predefined_train_test

So far, we have assumed that the instances are paired by the LR system: the data provider delivers source IDs, but not hypothesis labels. If the instances are not paired, and the data provider delivers hypothesis labels directly, we also need to choose our splitting strategy differently. Applicable splitting strategies are train_test and cross_validation. See the setup file specific_source_evaluation.yaml in the examples folder for a fully working example.

LR systems

The LR system section defines the LR system. There are various architectures, and the architecture as well as the processing modules should support the input data.

In the experiment setup example above, we used the score_based architecture. That means that we can specify a preprocessing method, a pairing method, and a comparing method.

Preprocessing is done on the individual instances. The pairing method governs how instances are combined into pairs. Comparing is done after pairing and should involve calculating LLRs.

Both the preprocessing and the comparing methods are modules that transform the data, but never change the number or order of the instances. The pipeline module can be used to arrange a sequence of modules.

Above, our preprocessing was done by the standard scaler. The shortest way to write this in the setup is as follows.

lr_system:
  architecture: score_based
  preprocessing: standard_scaler
  ...

However, if we need to pass additional parameters to the scaler, we use the standard form.

lr_system:
  architecture: score_based
  preprocessing:
    method: standard_scaler
    with_mean: True
    with_std: True
  ...

In this form, the method specifies the module, and any other fields are passed as parameters to the module on initialization.
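This configuration pattern, where method names a module and the sibling fields become initialization parameters, can be sketched in Python. The registry and class below are hypothetical stand-ins, not LiR's internals:

```python
# Hypothetical module registry: `method` selects the class,
# remaining fields are passed as keyword arguments on initialization.
class StandardScaler:
    def __init__(self, with_mean=True, with_std=True):
        self.with_mean = with_mean
        self.with_std = with_std

REGISTRY = {"standard_scaler": StandardScaler}

def build(config):
    if isinstance(config, str):   # shorthand form: just the method name
        return REGISTRY[config]()
    params = dict(config)         # standard form: method + parameters
    method = params.pop("method")
    return REGISTRY[method](**params)

scaler = build({"method": "standard_scaler", "with_mean": True, "with_std": False})
print(scaler.with_std)  # → False
```

The same dispatch idea explains why the shorthand form (a bare string) and the standard form (a mapping with method) are interchangeable.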

Still, we may not be satisfied, because we want to use more than one module for preprocessing. So, we create a pipeline.

lr_system:
  architecture: score_based
  preprocessing:
    method: pipeline
    steps:
      imputer: sklearn.impute.SimpleImputer
      scaler:
        method: standard_scaler
        with_mean: True
        with_std: True
  ...

Output

The output section declares how to aggregate the results from the test set.
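For example, the metrics output used in the single_run example above writes the requested metric columns to metrics.csv:

```yaml
output:
  - method: metrics
    columns:
      - cllr
      - cllr_min
```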

Hyperparameters and data parameters

Experiments that involve multiple runs have a hyperparameters or a dataparameters section.

For example, the grid strategy runs the LR system for each combination of hyperparameter values. The optuna strategy runs the LR system a fixed number of times, while trying to optimize parameter values.
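The grid behavior, one run per combination of parameter values, can be sketched with itertools.product. The parameter paths and options below are taken from the examples on this page; the expansion itself is illustrative, not LiR's code:

```python
from itertools import product

# Illustrative grid expansion: one run per combination of parameter values.
grid = {
    "comparing.steps.to_llr": ["logistic_calibrator", "isotonic_calibrator"],
    "provider.head": [100, 200, 300],
}

runs = [dict(zip(grid, combo)) for combo in product(*grid.values())]
print(len(runs))  # → 6: 2 calibrators x 3 data sizes
```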

Model selection

In the single_run example, we have a fully working LR system that uses logistic regression to calculate LRs. Let’s say we want to try other LR calculation methods as well, and compare the results. To make this work, we use the same “baseline LR system”, but make two modifications to the experiment:

  • replace the single_run strategy by the grid strategy; and

  • define a hyperparameter for the LR calculation method.

output_path: ./output/${timestamp}_model_selection_example  # This is where generated output is saved.
experiments:
  - name: my_model_selection_exp      # Choose a descriptive name for the experiment.
    strategy: grid                    # grid is more suitable for model selection.

    # The data section: which data to use and how to use it.
    data:
      provider:                       # The data provider specifies which data to use.
        method: parse_features_from_csv_url  # We use the generic CSV parser to get glass data.
        url: https://raw.githubusercontent.com/NetherlandsForensicInstitute/elemental_composition_glass/refs/heads/main/training.csv
        source_id_column: Item        # This column identifies the source.
        ignore_columns:               # We don't need these columns at the moment, so we can safely ignore them.
          - id
          - Piece

      splits:                         # The data splitter specifies how to split the data into training and test sets.
        strategy: train_test_sources  # Use a single train/test split, and split on the source IDs.
        test_size: 0.5                # Split the data 50/50.

    # The LR system section: this is where the action is.
    lr_system:
      architecture: score_based
      preprocessing: standard_scaler  # The preprocessing is applied on the single measurements, before pairing.
      pairing:                        # The pairing method governs the way the measurements are paired.
        method: source_pairs          # The source_pairs method generates one pair for each combination of sources.
        ratio_limit: 1                # Subsample different-source pairs until their number is the same as the number of
                                      # same-source pairs.
      comparing:                      # The comparing module is applied on pairs.
        method: pipeline              # We need two modules, so we combine them in a pipeline.
        steps:
          score: manhattan_distance   # First, we use the manhattan distance to calculate a "score".
          to_llr: logistic_calibrator # Next, we turn the scores into LLRs.

    # The output section: where results are aggregated.
    output:
      - method: metrics
        columns:
          - cllr
          - cllr_min

    hyperparameters:
      - path: comparing.steps.to_llr
        options:
          - logistic_calibrator
          - method: kde
            bandwidth: silverman
          - isotonic_calibrator

This will run the LR system three times, once for each LR calculation method. All metrics are collected in metrics.csv; its contents are as follows.

lrsystem.comparing.steps.to_llr,cllr,cllr_min
logistic_calibrator,0.08701830410819332,0.06516921385665418
option1,0.05602183806606936,0.03418827108609096
isotonic_calibrator,1.0,1.0

Sensitivity analysis

We may also want to know how well the LR system is able to cope with few data points.

To this end, the CSV reader has the argument head, which reads only the first n instances. We are going to use that argument to vary the amount of input data. The setup is similar to the model selection setup, but since we vary the input data, we use dataparameters instead of hyperparameters, like so:

output_path: ./output/${timestamp}_sensitivity_example  # This is where generated output is saved.
experiments:
  - name: my_sensitivity_exp          # Choose a descriptive name for the experiment.
    strategy: grid                    # grid runs the LR system for each parameter value.

    # The data section: which data to use and how to use it.
    data:
      provider:                       # The data provider specifies which data to use.
        method: parse_features_from_csv_url  # We use the generic CSV parser to get glass data.
        url: https://raw.githubusercontent.com/NetherlandsForensicInstitute/elemental_composition_glass/refs/heads/main/training.csv
        source_id_column: Item        # This column identifies the source.
        ignore_columns:               # We don't need these columns at the moment, so we can safely ignore them.
          - id
          - Piece
        head: 100

      splits:                         # The data splitter specifies how to split the data into training and test sets.
        strategy: train_test_sources  # Use a single train/test split, and split on the source IDs.
        test_size: 0.5                # Split the data 50/50.

    # The LR system section: this is where the action is.
    lr_system:
      architecture: score_based
      preprocessing: standard_scaler  # The preprocessing is applied on the single measurements, before pairing.
      pairing:                        # The pairing method governs the way the measurements are paired.
        method: source_pairs          # The source_pairs method generates one pair for each combination of sources.
        ratio_limit: 1                # Subsample different-source pairs until their number is the same as the number of
                                      # same-source pairs.
      comparing:                      # The comparing module is applied on pairs.
        method: pipeline              # We need two modules, so we combine them in a pipeline.
        steps:
          score: manhattan_distance   # First, we use the manhattan distance to calculate a "score".
          to_llr: logistic_calibrator # Next, we turn the scores into LLRs.

    # The output section: where results are aggregated.
    output:
      - method: metrics
        columns:
          - cllr
          - cllr_min

    dataparameters:
      - path: provider.head
        options:
          - 100
          - 200
          - 300
          - 400
          - 500
          - 600

Again, this generates the results for the different data sizes, and the metrics are collected in metrics.csv:

data.provider.head,cllr,cllr_min
100,0.16605603596871377,0.0860902344426084
200,0.25262925926489155,0.12960160282804303
300,0.027107104051074863,0.0
400,0.09130910966818657,0.0
500,0.05583397551142244,0.021485955204981026
600,0.0872875165474782,0.05609640474436812

Advanced use of parameters

So far, we have only used categorical parameters. There are other types of parameters, such as numerical parameters, cluster parameters, or constants.

In the sensitivity analysis, the data size (provider.head) is implicitly understood to be a categorical parameter. We may instead use a numerical parameter, which yields exactly the same results:

output_path: ./output/${timestamp}_sensitivity_example  # This is where generated output is saved.
experiments:
  - name: my_sensitivity_exp          # Choose a descriptive name for the experiment.
    strategy: grid                    # grid runs the LR system for each parameter value.

    # The data section: which data to use and how to use it.
    data:
      provider:                       # The data provider specifies which data to use.
        method: parse_features_from_csv_url  # We use the generic CSV parser to get glass data.
        url: https://raw.githubusercontent.com/NetherlandsForensicInstitute/elemental_composition_glass/refs/heads/main/training.csv
        source_id_column: Item        # This column identifies the source.
        ignore_columns:               # We don't need these columns at the moment, so we can safely ignore them.
          - id
          - Piece
        head: 100

      splits:                         # The data splitter specifies how to split the data into training and test sets.
        strategy: train_test_sources  # Use a single train/test split, and split on the source IDs.
        test_size: 0.5                # Split the data 50/50.

    # The LR system section: this is where the action is.
    lr_system:
      architecture: score_based
      preprocessing: standard_scaler  # The preprocessing is applied on the single measurements, before pairing.
      pairing:                        # The pairing method governs the way the measurements are paired.
        method: source_pairs          # The source_pairs method generates one pair for each combination of sources.
        ratio_limit: 1                # Subsample different-source pairs until their number is the same as the number of
                                      # same-source pairs.
      comparing:                      # The comparing module is applied on pairs.
        method: pipeline              # We need two modules, so we combine them in a pipeline.
        steps:
          score: manhattan_distance   # First, we use the manhattan distance to calculate a "score".
          to_llr: logistic_calibrator # Next, we turn the scores into LLRs.

    # The output section: where results are aggregated.
    output:
      - method: metrics
        columns:
          - cllr
          - cllr_min

    dataparameters:
      - path: provider.head
        low: 100
        high: 600
        step: 100

In the examples above, the parameters are automatically recognized as categorical or numerical. However, we can also specify the parameter type explicitly. The following is equivalent to the hyperparameters in the model selection example above.

hyperparameters:
  - path: comparing.steps.to_llr
    type: categorical
    options:
      - logistic_calibrator
      - method: kde
        bandwidth: silverman
      - isotonic_calibrator

In most cases, the type property can be omitted, but it may be necessary when using custom parameter types.