Assessing LR system performance
===============================

This page shows how LiR can be used to assess the `performance`_ of an LR system, in particular discriminatory power
and `consistency`_. These metrics are typically used to judge whether an LR system is good enough for use in casework,
or to find the best LR system among several candidates.

**Note:** in the literature, the term 'calibration' is sometimes used to refer to whether an LR system is 'well-calibrated'
or 'consistent'. Here, we reserve 'calibration' for the process, while we use 'consistency' for the quality of the LLRs.


Metrics
-------

A `widely used`_ metric of performance is Cllr, which measures both discrimination and consistency.
Other metrics may have practical use as well. See the table below for a list of metrics.

.. _performance: https://doi.org/10.1016/j.forsciint.2016.03.048
.. _widely used: https://doi.org/10.1016/j.fsisyn.2024.100466
.. _consistency: https://doi.org/10.1016/j.forsciint.2021.110722

+-------------------------------------+--------------------------------+
| Metric                              | Assessment of                  |
+=====================================+================================+
| log likelihood ratio cost (`cllr`_) | discrimination and consistency |
+-------------------------------------+--------------------------------+
| minimized cllr (`cllr_min`_)        | discrimination                 |
+-------------------------------------+--------------------------------+
| calibration loss (`cllr_cal`_)      | consistency                    |
+-------------------------------------+--------------------------------+
| rate of misleading evidence         | mostly consistency             |
+-------------------------------------+--------------------------------+
| `devPAV`_                           | consistency                    |
+-------------------------------------+--------------------------------+
| expected LR for both hypotheses     | discrimination                 |
+-------------------------------------+--------------------------------+

By definition, the log likelihood ratio cost ``cllr`` equals ``cllr_min`` + ``cllr_cal``.

.. _cllr: api/lir.metrics.html#lir.metrics.cllr
.. _cllr_min: api/lir.metrics.html#lir.metrics.cllr_min
.. _cllr_cal: api/lir.metrics.html#lir.metrics.cllr_cal
.. _devPAV: api/lir.metrics.html#lir.algorithms.devpav.devpav
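
To make the table more concrete, the sketch below computes Cllr and the rates of misleading evidence directly with
NumPy. It is an illustration only, not LiR's implementation: the names ``cllr_sketch`` and ``misleading_rates`` are
hypothetical, and the LLRs are assumed to be base-10 logarithms with label 1 for H1 and label 0 for H2.

.. code-block:: python

    import numpy as np

    def cllr_sketch(llrs, labels):
        """Log likelihood ratio cost for base-10 LLRs (illustrative helper)."""
        log2_lrs = np.asarray(llrs, dtype=float) * np.log2(10)
        labels = np.asarray(labels)
        # logaddexp2(0, v) equals log2(1 + 2**v), evaluated in a numerically stable way
        penalty_h1 = np.logaddexp2(0, -log2_lrs[labels == 1])  # log2(1 + 1/LR)
        penalty_h2 = np.logaddexp2(0, log2_lrs[labels == 0])   # log2(1 + LR)
        return 0.5 * (penalty_h1.mean() + penalty_h2.mean())

    def misleading_rates(llrs, labels):
        """Fraction of H1 instances with LLR < 0 and of H2 instances with LLR > 0."""
        llrs = np.asarray(llrs, dtype=float)
        labels = np.asarray(labels)
        return np.mean(llrs[labels == 1] < 0), np.mean(llrs[labels == 0] > 0)

    # made-up LLRs: one misleading value for each hypothesis
    labels = np.array([1, 1, 1, 0, 0, 0])
    llrs = np.array([2.0, 1.0, -0.5, -2.0, -1.0, 0.5])

    print(f'cllr: {cllr_sketch(llrs, labels):.3f}')
    print(f'rates of misleading evidence (H1, H2): {misleading_rates(llrs, labels)}')

A single strongly misleading LLR contributes far more to Cllr than a correct LLR of the same magnitude, which is why
Cllr reflects both discrimination and consistency.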


Visualizations
--------------

While a one-dimensional metric is often useful, visualizations give more insight into the behavior of the LR system.
Examples of visualizations are:

- `LR histogram`_, for assessment of discrimination
- Pool adjacent violators (`PAV`_) transformation, for assessment of consistency
- Empirical cross-entropy (`ECE`_)
- `Tippett`_

.. _LR histogram: api/lir.plotting.html#lir.plotting.lr_histogram
.. _PAV: api/lir.plotting.html#lir.plotting.pav
.. _ECE: api/lir.plotting.html#lir.plotting.expected_calibration_error.plot_ece
.. _Tippett: api/lir.plotting.html#lir.plotting.tippett

The `PAV transformation`_ is particularly useful to inspect a system (or, more precisely, a set of LLRs) for consistency. It
optimizes a set of LLRs (for which the ground truth is known) for consistency without changing the order of the LLRs.
The corresponding visualization shows a scatter plot of the original, "pre-calibrated" LLRs versus the "post-calibrated"
LLRs, after transformation.

.. _PAV transformation: https://en.wikipedia.org/wiki/Isotonic_regression
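
The transformation itself can be sketched with scikit-learn's ``IsotonicRegression``. This is an illustration under
stated assumptions, not LiR's implementation: the helper name ``pav_llrs`` and the synthetic data are hypothetical, the
LLRs are base-10 logarithms, and label 1 denotes H1 while label 0 denotes H2.

.. code-block:: python

    import numpy as np
    from sklearn.isotonic import IsotonicRegression

    def pav_llrs(llrs, labels):
        """Return order-preserving, optimally consistent LLRs for a labeled set (sketch)."""
        llrs = np.asarray(llrs, dtype=float)
        labels = np.asarray(labels, dtype=int)

        # isotonic regression of the labels on the LLRs yields posterior probabilities of H1
        iso = IsotonicRegression(y_min=0.0, y_max=1.0, increasing=True, out_of_bounds='clip')
        posterior = iso.fit_transform(llrs, labels)

        # remove the prior odds implied by the class proportions to get LLRs
        n_h1 = labels.sum()
        prior_log_odds = np.log10(n_h1 / (len(labels) - n_h1))
        with np.errstate(divide='ignore'):  # posteriors of 0 or 1 give infinite LLRs
            posterior_log_odds = np.log10(posterior) - np.log10(1 - posterior)
        return posterior_log_odds - prior_log_odds

    # synthetic LLRs, for illustration only
    rng = np.random.default_rng(0)
    llrs = np.concatenate([rng.normal(1, 1, 200), rng.normal(-1, 1, 200)])
    labels = np.concatenate([np.ones(200, dtype=int), np.zeros(200, dtype=int)])

    post = pav_llrs(llrs, labels)
    print(np.column_stack([llrs, post])[:5])  # pre-calibrated versus post-calibrated LLRs

Plotting ``llrs`` against ``post`` gives the scatter plot described above. Note that the most extreme LLRs may be
mapped to plus or minus infinity when no instance of the other hypothesis lies beyond them.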

How to read a PAV plot? The example below shows how to interpret the different sections of the plot.

.. jupyter-execute::
    :hide-code:

    from matplotlib.patches import Polygon
    from sklearn.preprocessing import StandardScaler

    from lir import plotting
    from lir.algorithms.bayeserror import ELUBBounder
    from lir.algorithms.logistic_regression import LogitCalibrator
    from lir.datasets.glass import GlassData
    from lir.data_strategies import PredefinedTrainTestSplit
    from lir.lrsystems.score_based import ScoreBasedSystem
    from lir.transform import as_transformer
    from lir.transform.distance import ManhattanDistance
    from lir.transform.pairing import SourcePairing
    from lir.transform.pipeline import Pipeline

    instances = GlassData(cache_dir='glass-data').get_instances()
    train, test = next(PredefinedTrainTestSplit().apply(instances))

    scoring = Pipeline(steps=[('diff', ManhattanDistance()), ('calib', LogitCalibrator(random_state=0)), ('elub', ELUBBounder())])
    lrsystem = ScoreBasedSystem(preprocessing_pipeline=as_transformer(StandardScaler()), evaluation_pipeline=scoring, pairing_function=SourcePairing(ratio_limit=1, seed=0))

    lrsystem.fit(train)
    llrs = lrsystem.apply(test)

    with plotting.show() as ax:
        ax.pav(llrs)
        ax.set_xlim(-1.5, 1.5)
        ax.set_ylim(-1.5, 1.5)
        x0, x1 = ax.get_xlim()
        y0, y1 = ax.get_ylim()

        poly = Polygon([(max(x0, y0), max(x0, y0)), (min(x1, y1), min(x1, y1)), (x0, y1)], facecolor='orange', edgecolor='0.5', alpha=.2)
        ax.add_patch(poly)
        ax.text(x0*.5, y1*.25, "bias towards H2", horizontalalignment='center', fontsize=14)

        poly = Polygon([(max(x0, y0), max(x0, y0)), (min(x1, y1), min(x1, y1)), (x1, y0)], facecolor='blue', edgecolor='0.5', alpha=.2)
        ax.add_patch(poly)
        ax.text(x1*.5, y0*.5, "bias towards H1", horizontalalignment='center', fontsize=14)

- LLRs that are **on the diagonal** remained unchanged and appear to be well-calibrated. Ideally, all LLRs are somewhere
  near the diagonal. In that case, the calibration loss will be close to 0.
- LLRs that appear **above the diagonal** are increased after optimization, and the original LLRs were therefore
  **biased towards H2**.
- LLRs that appear **below the diagonal** are decreased after optimization, and the original LLRs were therefore
  **biased towards H1**.

Another way to look at it is to distinguish between overestimated and underestimated LLRs.

.. jupyter-execute::
    :hide-code:

    with plotting.show() as ax:
        ax.pav(llrs)
        ax.set_xlim(-1.5, 1.5)
        ax.set_ylim(-1.5, 1.5)
        x0, x1 = ax.get_xlim()
        y0, y1 = ax.get_ylim()

        #poly = Polygon([(0, 0), (x1, 0), (x1, y0), (0, y0)], facecolor='red', edgecolor='0.5', alpha=.2)
        #ax.add_patch(poly)
        #ax.text(x1/2, y0*.5, "misleading", horizontalalignment='center', fontsize=14)

        #poly = Polygon([(0, 0), (x0, 0), (x0, y1), (0, y1)], facecolor='red', edgecolor='0.5', alpha=.2)
        #ax.add_patch(poly)
        #ax.text(x0/2, y1*.25, "misleading", horizontalalignment='center', fontsize=14)

        poly = Polygon([(0, 0), (min(x1, y1), min(x1, y1)), (x1, y0), (0, y0)], facecolor='orange', edgecolor='0.5', alpha=.2)
        ax.add_patch(poly)
        ax.text(x1*.65, y0*.5, "overestimated", horizontalalignment='center', fontsize=14)

        poly = Polygon([(0, 0), (max(x0, y0), max(x0, y0)), (x0, y1), (0, y1)], facecolor='orange', edgecolor='0.5', alpha=.2)
        ax.add_patch(poly)
        ax.text(x0*.65, y1*.25, "overestimated", horizontalalignment='center', fontsize=14)

        poly = Polygon([(0, 0), (min(x1, y1), min(x1, y1)), (0, y1)], facecolor='blue', edgecolor='0.5', alpha=.2)
        ax.add_patch(poly)
        ax.text(x1*.35, y1*.75, "underestimated", horizontalalignment='center', fontsize=14)

        poly = Polygon([(0, 0), (max(x0, y0), max(x0, y0)), (0, y0)], facecolor='blue', edgecolor='0.5', alpha=.2)
        ax.add_patch(poly)
        ax.text(x0*.35, y0*.75, "underestimated", horizontalalignment='center', fontsize=14)

- Positive LLRs that appear above the diagonal, and negative LLRs that appear below the diagonal, became stronger after
  optimization (i.e. further away from 0), and the original LLRs were therefore **underestimated**.
- Positive LLRs that appear below the diagonal, and negative LLRs that appear above the diagonal, became weaker after
  optimization (i.e. closer to 0), and the original LLRs were therefore **overestimated**.
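
The reading rules for a single point in the PAV plot can be summarized in a few lines of code. The function
``describe_pav_point`` is a hypothetical helper written for this page; it only restates the rules above and is not
part of LiR.

.. code-block:: python

    def describe_pav_point(pre_llr, post_llr):
        """Classify a point (pre-calibrated LLR, post-calibrated LLR) in a PAV plot."""
        if post_llr == pre_llr:
            # on the diagonal: the LLR was left unchanged
            return 'unbiased', 'consistent'

        above_diagonal = post_llr > pre_llr
        bias = 'biased towards H2' if above_diagonal else 'biased towards H1'

        if pre_llr == 0:
            # a neutral LLR is neither too strong nor too weak in a particular direction
            strength = 'neutral'
        elif above_diagonal == (pre_llr > 0):
            # positive and raised, or negative and lowered
            strength = 'underestimated'
        else:
            strength = 'overestimated'
        return bias, strength

    print(describe_pav_point(1.0, 2.0))
    print(describe_pav_point(1.0, 0.5))
    print(describe_pav_point(-1.0, -0.5))

The two classifications are independent: a positive LLR above the diagonal is both biased towards H2 and
underestimated, while a negative LLR above the diagonal is biased towards H2 but overestimated.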

In any case, bias and overestimation do not say anything about the ground truth of individual instances. In a
particular dataset, LLRs of 3 can be underestimated, but that doesn't mean that there cannot be an instance with LLR=3
whose ground truth is H2!


Appearance of plots and metrics
-------------------------------

Let's see how these metrics and visualizations behave on different types of data.


Neutral LLRs
^^^^^^^^^^^^

First, non-informative data, where all LLRs are zero (i.e. neutral). These data are not discriminative, but perfectly
consistent!

.. jupyter-execute::

    import numpy as np
    import matplotlib.pyplot as plt
    from lir.data.models import LLRData
    from lir.metrics import cllr, cllr_min, cllr_cal
    from lir.algorithms.devpav import devpav
    from lir.plotting import lr_histogram, pav, tippett
    from lir.plotting.expected_calibration_error import plot_ece

    plt.rcParams.update({'font.size': 9})

    def llr_metrics_and_visualizations(llrs: LLRData):
        # print the metrics
        print(f'cllr: {cllr(llrs)}')
        print(f'cllr_min: {cllr_min(llrs)}')
        print(f'cllr_cal: {cllr_cal(llrs)}')
        print(f'devpav: {devpav(llrs)}')

        # initialize the plot
        fig, ((ax_lrhist, ax_pav), (ax_ece, ax_tippett)) = plt.subplots(2, 2)
        fig.set_figwidth(10)

        # create the visualizations
        lr_histogram(ax_lrhist, llrs)
        pav(ax_pav, llrs)
        plot_ece(ax_ece, llrs)
        tippett(ax_tippett, llrs)

        # generate the image
        fig.tight_layout()
        fig.show()

    # generate neutral LLRs
    llrs = LLRData(features=np.zeros((6, 1)), labels=np.array([0, 0, 0, 1, 1, 1]))

    # show results
    print('results for neutral (non-informative) LLR values')
    llr_metrics_and_visualizations(llrs)

Observe that:

- the value for ``cllr`` is 1;
- the value for ``cllr_min`` is 1;
- the value for ``cllr_cal`` is 0;
- the LR histogram shows a single bar;
- in the ECE plot, the LRs line, the reference line and the PAV LRs line coincide;
- the PAV plot and the Tippett plot hardly make sense if all LLRs have the same value.
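
The value of 1 can be checked by hand. Assuming the usual definition of Cllr, with :math:`N_1` and :math:`N_2` the
numbers of LRs obtained under H1 and H2 respectively (the symbols are chosen here for illustration):

.. math::

    C_{llr} = \frac{1}{2} \left( \frac{1}{N_1} \sum_{i=1}^{N_1} \log_2\left(1 + \frac{1}{LR_i}\right)
    + \frac{1}{N_2} \sum_{j=1}^{N_2} \log_2\left(1 + LR_j\right) \right)

For neutral data every LR equals 1, so every logarithm equals :math:`\log_2 2 = 1` and Cllr becomes
:math:`\tfrac{1}{2}(1 + 1) = 1`. Because all LLRs are identical, no order-preserving transformation can separate the
two hypotheses, so ``cllr_min`` is 1 as well and ``cllr_cal`` is their difference, 0.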


Well-calibrated LLRs
^^^^^^^^^^^^^^^^^^^^

Now we have LLRs that are both discriminative and consistent. The data for both hypotheses are drawn from normal
distributions, and the results look as follows.

.. jupyter-execute::

    from lir.algorithms.logistic_regression import LogitCalibrator
    from lir.data_strategies import TrainTestSplit
    from lir.datasets.synthesized_normal_binary import SynthesizedNormalData, SynthesizedNormalBinaryData

    # set the parameters for H1 data and H2 data
    h1_data = SynthesizedNormalData(mean=1, std=1, size=1000)
    h2_data = SynthesizedNormalData(mean=-1, std=1, size=1000)

    # generate the data
    instances = SynthesizedNormalBinaryData(h1_data, h2_data, seed=42).get_instances()

    # split the data into a 50% training set and a 50% test set
    training_instances, test_instances = next(TrainTestSplit(test_size=.5).apply(instances))

    # build a simple LR system for these data
    calibrator = LogitCalibrator()

    # train the system on the training set, and calculate the LLRs for the test set
    llrs = calibrator.fit(training_instances).apply(test_instances)

    # assess performance
    print('results for well-calibrated LLR values')
    llr_metrics_and_visualizations(llrs)


Observe that, for discriminative and well-calibrated LLRs:

- the value for ``cllr`` is lower than 1;
- the value for ``cllr_min`` is close to ``cllr``;
- the value for ``cllr_cal`` is close to 0;
- the LR histogram shows distinct distributions;
- in the LR histogram, the peak of the overlap of both distributions is at 0;
- the PAV plot approximately follows the diagonal;
- in the ECE plot, the LRs line is close to the PAV-LRs line, and the reference line is well above both of them.
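
For this synthetic setup the true LLR is known in closed form, which helps explain why a simple calibrator can do well.
Assuming the features are values :math:`x` drawn from a normal distribution with mean 1 under H1 and mean -1 under H2,
both with standard deviation 1, the natural logarithm of the LR is :math:`2x`, which is linear in :math:`x`. A linear
model in the log-odds domain, as the module name ``logistic_regression`` suggests ``LogitCalibrator`` is, can represent
this exactly. The sketch below verifies the closed form with SciPy; it does not use LiR.

.. code-block:: python

    import numpy as np
    from scipy.stats import norm

    x = np.linspace(-3, 3, 7)

    # log10 LR computed from the two densities: H1 ~ N(1, 1), H2 ~ N(-1, 1)
    llr_from_densities = (norm.logpdf(x, loc=1, scale=1) - norm.logpdf(x, loc=-1, scale=1)) / np.log(10)

    # closed form: ln LR = 2x, converted from natural to base-10 logarithm
    llr_closed_form = 2 * x / np.log(10)

    print(np.allclose(llr_from_densities, llr_closed_form))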


Badly calibrated data
^^^^^^^^^^^^^^^^^^^^^

LR systems may misbehave in several ways, resulting in inconsistent LLRs.
If this happens, check whether the training data are suitable for the test data. Inconsistent LLRs can result, for
example, when the training data are measurements of a different type of glass, when the training data are microphone
recordings while the test data are telephone intercepts, or from any other kind of mismatch between the
training set and the test set.

LLRs may be inconsistent in several ways, including:

- bias towards H1, meaning that the LLRs are too big;
- bias towards H2, meaning that the LLRs are too small;
- overestimation, meaning that the LLRs are too extreme;
- underestimation, meaning that the LLRs are too close to 0.
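
All four distortions can be mimicked by shifting or scaling a set of well-calibrated LLRs. Both operations preserve the
order of the LLRs, so discrimination is unaffected while consistency degrades. The NumPy sketch below illustrates this
on synthetic data; ``cllr_sketch`` is a hypothetical helper (not LiR's ``cllr``), and the LLRs are assumed to be
base-10 logarithms with label 1 for H1 and label 0 for H2.

.. code-block:: python

    import numpy as np

    def cllr_sketch(llrs, labels):
        """Log likelihood ratio cost for base-10 LLRs (illustrative helper)."""
        log2_lrs = np.asarray(llrs, dtype=float) * np.log2(10)
        labels = np.asarray(labels)
        return 0.5 * (np.logaddexp2(0, -log2_lrs[labels == 1]).mean()
                      + np.logaddexp2(0, log2_lrs[labels == 0]).mean())

    # features from N(1, 1) under H1 and N(-1, 1) under H2; the exact log10 LR is 2x / ln(10)
    rng = np.random.default_rng(0)
    n = 5000
    x = np.concatenate([rng.normal(1, 1, n), rng.normal(-1, 1, n)])
    labels = np.concatenate([np.ones(n, dtype=int), np.zeros(n, dtype=int)])
    llrs = 2 * x / np.log(10)

    variants = {
        'well-calibrated': llrs,
        'biased towards H1': llrs + 2,
        'biased towards H2': llrs - 2,
        'overestimated': llrs * 2,
        'underestimated': llrs / 2,
    }

    order = np.argsort(llrs)
    for name, values in variants.items():
        # the distorted values are still non-decreasing when sorted by the original LLRs
        order_preserved = bool(np.all(np.diff(values[order]) >= 0))
        print(f'{name}: cllr={cllr_sketch(values, labels):.3f}, order preserved: {order_preserved}')

Since PAV depends only on the order of the LLRs, ``cllr_min`` is the same for all five variants, and any increase in
``cllr`` shows up entirely as calibration loss.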

Let's have a look at the metrics and visualizations for each of these inconsistent sets of LLRs.


.. jupyter-execute::

    print('all LLR values are *shifted* towards H1')
    biased_llrs_towards_h1 = llrs.replace(features=llrs.features + 2)
    llr_metrics_and_visualizations(biased_llrs_towards_h1)


Observations for LLRs that are biased towards H1:

- the value for ``cllr`` is higher than for well-calibrated data;
- the value for ``cllr_min`` is the same as in well-calibrated data;
- the value for ``cllr_cal`` is greater than 0;
- the LR histogram still shows distinct distributions, but they are shifted to the **right**;
- in the LR histogram, the peak of the overlap of both distributions is to the right of 0;
- the PAV plot is **below** the diagonal;
- in the ECE plot, the LRs line is evidently above the PAV-LRs line and closer to the reference line (if the LLRs are
  wildly biased, the LRs line may even be partially above the reference line);
- the Tippett plot is shifted to the right.


.. jupyter-execute::

    print('all LLR values are *shifted* towards H2')
    biased_llrs_towards_h2 = llrs.replace(features=llrs.features - 2)
    llr_metrics_and_visualizations(biased_llrs_towards_h2)


Observations for LLRs that are biased towards H2:

- the values of ``cllr``, ``cllr_min`` and ``cllr_cal`` are similar to those for data biased towards H1;
- the LR histogram still shows distinct distributions, but they are shifted to the **left**;
- in the LR histogram, the peak of the overlap of both distributions is to the left of 0;
- the PAV plot is **above** the diagonal;
- in the ECE plot, the LRs line is evidently above the PAV-LRs line, similar to the LLRs shifted towards H1;
- the Tippett plot is shifted to the left.


.. jupyter-execute::

    print('the LLRs are *increased* towards the extremes on both sides')
    overestimated_llrs = llrs.replace(features=llrs.features * 2)
    llr_metrics_and_visualizations(overestimated_llrs)


Observations for overestimated LLRs:

- the values of ``cllr``, ``cllr_min`` and ``cllr_cal`` are similar to those for biased data;
- the LR histogram still shows distinct distributions, but the scale on the X-axis is increased;
- the PAV plot is flatter than the diagonal, and crosses it near the origin;
- in the ECE plot, the LRs line is slightly further away from the PAV-LRs line, and may cross the reference line near
  the extremes;
- the scale of the Tippett plot is increased.


.. jupyter-execute::

    print('the LLRs are *reduced* towards neutrality')
    underestimated_llrs = llrs.replace(features=llrs.features / 2)
    llr_metrics_and_visualizations(underestimated_llrs)


Observations for underestimated LLRs:

- the values of ``cllr``, ``cllr_min`` and ``cllr_cal`` are similar to those for biased data;
- the LR histogram still shows distinct distributions, but the scale on the X-axis is decreased;
- the PAV plot is steeper than the diagonal, and crosses it near the origin;
- in the ECE plot, the LRs line is slightly further away from the PAV-LRs line;
- the scale of the Tippett plot is decreased.


That's all for now!