Setting up an experiment ======================== This page shows how write an experiment setup using LiR. Before you begin, make sure you have a working `lir command`_ Experiments are defined in an experiment setup file. For a quick start, use one of the `examples`_. An experiment setup file is a YAML file with at least the ``experiments`` property that lists the experiments, and the ``output_path`` property: .. code-block:: yaml output_path: ./output # this is where generated output is saved experiments: - ... definition of experiment 1 ... - ... definition of experiment 2 ... LiR uses the ``confidence`` package for parsing YAML files. Refer to its `documentation`_ to make the most of it! In particular, you may use the ``timestamp`` variable to substitute the current date and time in the YAML configuration, for example: .. code-block:: yaml output_path: ./${timestamp}_output # this is where generated output is saved experiments: - ... definition of experiment 1 ... - ... definition of experiment 2 ... .. _lir command: index.html .. _examples: lrsystem_yaml.html .. _documentation: https://github.com/NetherlandsForensicInstitute/confidence Experiment definition --------------------- An experiment definition has all the configuration required to run an experiment. There are several ways to setup an experiment, but the most simple strategy is to use a single LR system and run it. This is the ``single_run`` strategy. Below is a fully working example: .. literalinclude:: snippets/minimal-single-run.yaml :language: yaml Every experiment configuration has a ``strategy`` property, which defines the type of experiment, and also which configuration settings are required. Our ``single_run`` setup has three main sections: - ``data``, which defines the dataset (``provider``) and how it is split into training and test data (``splits``); - ``lr_system``, which defines the LR system; and - ``output``, which specifies the required output. More advanced `strategies`_, such as the ``grid`` strategy and the ``optuna`` strategy, are extensions of this. They can evaluate a range of LR systems for model selection, sensitivity analyses, or hyperparameter optimization. .. _strategies: reference.html#experiment-strategies For now, we'll stick with the ``single_run`` strategy. Save this setup to a file named ``minimal-single-run.yaml`` and run it as: .. code-block:: shell lir minimal-single-run.yaml This will: - load data from a CSV file; - split the data randomly into a training set and a test set; - train the LR system on the training set; - apply the LR system on the test set; - calculate the CLLR and the CLLR_min on the test set. Results are written to the directory in ``output_path``. By default, a detailed log file is written to ``output_path/log.txt``. Within ``output_path``, a directory is created with the same name as the experiment. In the above example, the experiment only has a single run, but other experiment strategies may have multiple. In that case, a separate directory is created for each run, with the following files: - ``data.yaml``: the data organization that was used for the run; - ``lrsystem.yaml``: the LR system setup that was used for the run. Additionally, we expect to find the output specified in the ``output`` section. After running the example, the full directory listing is as follows. .. jupyter-execute:: :hide-code: import tempfile import glob import os import lir.main def run_and_list_directory(setup_file: str): with tempfile.TemporaryDirectory() as tmpdirname: lir.main.main([setup_file, '--set', f'output_path={tmpdirname}']) for filename in glob.glob(f'{tmpdirname}/**', recursive=True): shortname = filename[len(tmpdirname)+1:] if not shortname: pass # do not print the root elif os.path.isdir(filename): print(f'{shortname}/') else: print(shortname) run_and_list_directory('docs/snippets/minimal-single-run.yaml') Data organization ----------------- The `data provider`_ delivers the dataset. It has at least the ``method`` property, and any other property is passed as a parameter of the data provision method. In evaluative settings, the data provider delivers **labeled data**. In the setup above, that means that the instances include **source ids**. We don't have hypothesis labels yet, because we evaluate the instances only after pairing them. We could substitute the glass data for another data provider. Let's try synthesized data: .. code-block:: yaml provider: method: synthesized_normal_multiclass # for a comparative evaluation, we use the multiclass variant seed: 0 # the random seed population: size: 100 # the number of sources instances_per_source: 2 # the number of instances that are drawn for each source dimensions: - mean: 0 # the average true value of all sources std: 1 # the standard deviation of the true value of all sources error_std: # the standard deviation of the measurement error In the ``single_run`` example above, we used a train test split (``train_test_sources``) to split the data into a training set and a test set. This is fine if you have plenty of data, but other `splitting strategies`_ use the data more efficiently. We could instead use `cross validation`_: .. _cross validation: https://en.wikipedia.org/wiki/Cross-validation_(statistics) .. code-block:: yaml splits: strategy: cross_validation_sources folds: 5 Now, if we run the experiment again, the system will be trained and applied five times, on different subsets. The five test sets are merged and we can calculate the metrics on the full dataset! In some cases, we want full control over which instances are used for training and which ones end up in the test set. In that case, the train/test roles are assigned by the data provider, and we can use the predefined splitting strategies ``predefined_train_test`` or ``predefined_cross_validation``. .. code-block:: yaml data: provider: method: glass cache_dir: .glass-data splits: strategy: predefined_train_test .. _data provider: reference.html#data-providers .. _splitting strategies: reference.html#data-strategies So far, we assumed that the instances are paired by the LR system. The data provider delivers source ids, but not hypothesis labels. If the instances are not paired, and the data provider delivers hypothesis labels, we also need to choose our splitting strategy differently. Applicable data strategies are ``train_test`` and ``cross_validation``. See the setup file ``specific_source_evaluation.yaml`` in the ``examples`` folder for a fully working example. LR systems ---------- The LR system section defines the LR system. There are various `architectures`_, and the architecture as well as the processing modules should support the input data. In the experiment setup example above, we used the ``score_based`` architecture. That means that we can specify a preprocessing method, a pairing method, and a comparing method. Preprocessing is done on the individual instances. The `pairing method`_ governs how instances are combined into pairs. Comparing is done after pairing and should involve calculating LLRs. Both the preprocessing and the comparing methods are `modules`_ that transform the data, but *never* change the number of instances or the order of the instances. The ``pipeline`` module can be used to arrange a sequence of modules. .. _architectures: reference.html#lr-system-architecture .. _modules: reference.html#lr-system-modules .. _pairing method: reference.html#pairing-methods Above, our preprocessing was done by the scaler. The **shortest** way to write this in the setup is as follows. .. code-block:: yaml lr_system: architecture: score_based preprocessing: standard_scaler ... However, if we need to pass additional **parameters** to the scaler, we use the standard form. .. code-block:: yaml lr_system: architecture: score_based preprocessing: method: standard_scaler with_mean: True with_std: True ... In this form, the ``method`` specifies the module, and any other fields are passed as parameters to the module on initialization. Still, we may not be satisfied, because we want to use more than one module for preprocessing. So, we create a pipeline. .. code-block:: yaml lr_system: architecture: score_based preprocessing: method: pipeline steps: imputer: sklearn.impute.SimpleImputer scaler: method: standard_scaler with_mean: True with_std: True ... Output ------ The output section declares how to aggregate the results from the test set. Hyperparameters and data parameters ----------------------------------- `Experiments`_ that involve multiple runs have a ``hyperparameters`` or a ``dataparameters`` section. .. _Experiments: reference.html#experiment-strategies For example, the ``grid`` strategy runs the LR system for each combination of hyperparameter values. The ``optuna`` strategy runs the LR system a fixed number of times, while trying to optimize parameter values. Model selection ^^^^^^^^^^^^^^^ In the ``single_run`` example, we have a fully working LR system that uses logistic regression to calculate LRs. Let's say we want to try other LR calculation methods as well, and compare the results. To make this work, we use the same "baseline LR system", but make two modifications to the experiment: - replace the ``single_run`` strategy by the ``grid`` strategy; and - define a hyperparameter for the LR calculation method. .. literalinclude:: snippets/model-selection.yaml :language: yaml :emphasize-lines: 3-4,41-47 This will run the LR system three times, once for each LR calculation method. All metrics are collected in ``metrics.csv`` and its contents is the following. .. jupyter-execute:: :hide-code: with tempfile.TemporaryDirectory() as tmpdirname: lir.main.main(['docs/snippets/model-selection.yaml', '--set', f'output_path={tmpdirname}']) with open(f'{tmpdirname}/my_model_selection_exp/metrics.csv', 'r') as f: print(f.read()) Sensitivity analysis ^^^^^^^^^^^^^^^^^^^^ We may also want to know how wel the LR system is able to cope with few data points. Therefore, the CSV reader has the argument ``head`` to read only the first ``n`` instances. We are going to use that argument to vary the amount of input data. This is similar to the model selection setup, but since we vary the input data, we use ``dataparameters`` instead of ``hyperparameters``, like so: .. literalinclude:: snippets/sensitivity-analysis.yaml :language: yaml :emphasize-lines: 3,16,42-50 Again, this generates the results for the different data sizes, and metrics are collected in ``metrics.csv``. .. jupyter-execute:: :hide-code: with tempfile.TemporaryDirectory() as tmpdirname: lir.main.main(['docs/snippets/sensitivity-analysis.yaml', '--set', f'output_path={tmpdirname}']) with open(f'{tmpdirname}/my_sensitivity_exp/metrics.csv', 'r') as f: print(f.read()) Advanced use of parameters ^^^^^^^^^^^^^^^^^^^^^^^^^^ So far, we only used categorical parameters. There are `other types of parameters`_, such as numerical parameters, cluster parameters or constants. In the sensitivity analysis, the data size (``provider.head``) is implicitly understood to be a categorical parameter. We may instead use a numerical variable, which yields the exact same results: .. literalinclude:: snippets/sensitivity-analysis-numerical.yaml :language: yaml :emphasize-lines: 43-46 In the examples above, the parameters are automatically recognized to be categorical or numerical. However, we can also explicitly specify the parameter type. The following is equivalent to the above. .. code-block:: yaml :emphasize-lines: 2 hyperparameters: - path: comparing.steps.to_llr type: categorical options: - logistic_calibrator - method: kde bandwidth: silverman - isotonic_calibrator In most cases, the ``type`` property can be omitted, but it may be necessary when using custom parameter types. .. _other types of parameters: reference.html#hyperparameters