Overview of the Python API
==========================

This is an introduction to the Python API. You will learn the basic concepts of data handling and LR calculation in
LiR.

Data classes
------------

In LiR, a dataset is represented as an `InstanceData`_ object, be it numeric features, scores, LLRs, or something else.
Specialized subclasses are:

- `FeatureData`_, for instances that have numerical features (subclass of ``InstanceData``);
- `PairedFeatureData`_, for pairs of instances that have numerical features (subclass of ``FeatureData``);
- `LLRData`_, for LLRs, with or without intervals (subclass of ``FeatureData``).

.. _InstanceData: api/lir.data.html#lir.data.models.InstanceData
.. _FeatureData: api/lir.data.html#FeatureData
.. _PairedFeatureData: api/lir.data.html#PairedFeatureData
.. _LLRData: api/lir.data.html#LLRData

These objects can be instantiated manually, but in an experimental setup, they are generally provided by a
`DataProvider`_. A ``DataProvider`` is a class that is specialized in generating, parsing or fetching a particular type
of data. A data provider subclass implements the ``get_instances()`` method.

Example for glass data:

.. jupyter-execute::

    import numpy as np
    from lir.datasets.glass import GlassData
    from lir.transform.distance import ManhattanDistance
    from lir.algorithms.logistic_regression import LogitCalibrator

    # retrieve the data
    data_provider = GlassData(cache_dir='cache')
    glass_data = data_provider.get_instances()

    print(f'The glass dataset has numeric features, so it is of type {type(glass_data)}.')
    print(f'It has {len(glass_data)} instances, each of which is a measurement on a range of chemical elements.')

    # get the number of instances for each unique source id
    unique_source_ids, instance_count_by_source_id = np.unique(glass_data.source_ids, return_counts=True)
    print(f'The measurements are of {len(unique_source_ids)} different sources, i.e. glass fragments.')

    # get the number of sources with this many instances
    instance_counts, source_counts = np.unique(instance_count_by_source_id, return_counts=True)
    for instance_count, source_count in zip(instance_counts, source_counts):
        print(f'There are {source_count} sources with {instance_count} instances.')

Most LR systems compare two instances, one from the trace source and one from a reference source, and make an
inference about source identity. So far, we only have measurements on single glass fragments, so we `pair`_ the
instances to obtain pairs to compare.

.. _pair: reference.html#pairing-methods

.. jupyter-execute::

    from lir.transform.pairing import SourcePairing

    # combine the instances into pairs
    pairing_method = SourcePairing(ratio_limit=1)
    pairs = pairing_method.pair(glass_data, n_ref_instances=1, n_trace_instances=1)

    print(f'We have combined the {len(glass_data)} instances into {len(pairs)} pairs.')
    print(f'The paired data has features of all instances in the pairs, so it has type {type(pairs)}.')
    print(f'Of {len(pairs[pairs.labels==1])} pairs, both instances are from the same source.')
    print(f'Of {len(pairs[pairs.labels==0])} pairs, the two instances are from different sources.')

We have created a same-source pair for each source that has at least two instances. The number of different-source
pairs is potentially much larger, but the ``ratio_limit`` argument limits the number of different-source pairs to the
number of same-source pairs.

Some LR systems can work with multiple trace and reference instances in each pair. This is relevant if there are
repeated measurements of the same source. The code below creates pairs of 2 trace instances and 3 reference instances,
so a source needs at least 5 instances to form a same-source pair.

.. jupyter-execute::

    from lir.transform.pairing import SourcePairing

    # combine the instances into pairs
    pairing_method = SourcePairing(ratio_limit=1)
    pairs_3x2 = pairing_method.pair(glass_data, n_trace_instances=2, n_ref_instances=3)

    print(f'We have combined the {len(glass_data)} instances into {len(pairs_3x2)} pairs.')
    print(f'The paired data has features of all instances in the pairs, so it has type {type(pairs_3x2)}.')
    print(f'Of {len(pairs_3x2[pairs_3x2.labels==1])} pairs, all instances are from the same source.')
    print(f'Of {len(pairs_3x2[pairs_3x2.labels==0])} pairs, the instances are from two different sources.')

The actual comparison can take many forms, including a distance or similarity function such as the
`Manhattan distance`_.

.. _Manhattan distance: api/lir.transform.html#lir.transform.distance.ManhattanDistance

.. jupyter-execute::

    from lir.transform.distance import ManhattanDistance
    from lir.algorithms.logistic_regression import LogitCalibrator

    # reduce the pairs to a single value by calculating the Manhattan distance
    distances = ManhattanDistance().apply(pairs)

    print(f'The set of distances has type {type(distances)}.')
    print(f'The distances have an average of {np.mean(distances.features)}.')
    print(f'The standard deviation is {np.std(distances.features)}.')

Now it is time to calculate LLRs...

.. jupyter-execute::

    # calculate LLRs
    llrs = LogitCalibrator().fit_apply(distances)
    different_source_llrs = llrs[llrs.labels==0]
    same_source_llrs = llrs[llrs.labels==1]

    print(f'The set of LLRs has type {type(llrs)}.')
    print(f'The median LLR for different-source pairs is {np.median(different_source_llrs.llrs)}.')
    print(f'The median LLR for same-source pairs is {np.median(same_source_llrs.llrs)}.')

Data strategies
---------------

Above, we trained the system and calculated LLRs on the same pairs, which is **not** a sound experimental setup! In an
experiment we work with `data strategies`_. A data strategy can be a simple train/test split, or a more advanced
configuration such as cross-validation. A data strategy inherits from `DataStrategy`_ and implements an ``apply()``
method that returns an iterator of (training set, test set) pairs.

.. _data strategies: api/lir.data.html#module-lir.data_strategies
.. _DataProvider: api/lir.data.html#lir.data.models.DataProvider
.. _DataStrategy: api/lir.data.html#lir.data.models.DataStrategy
.. _Pipeline: api/lir.transform.html#lir.transform.pipeline.Pipeline

Example:

.. jupyter-execute::

    from lir.data_strategies import SourcesTrainTestSplit

    splitter = SourcesTrainTestSplit(test_size=0.5)
    ((training_data, test_data),) = splitter.apply(glass_data)

    print(f'We have {len(training_data)} instances available for training our models.')
    print(f'We have {len(test_data)} instances available as test data.')

LR systems
----------

There can be many different `LR system architectures`_. Refer to the `selection guide`_ if unsure which one to use.

The exact parameters depend on the type of LR system. The main ingredient is typically a pipeline of *modules*. The
modules are executed one by one: each module takes the output of the previous module, transforms the data, and passes
the result to the next module. The modules may reduce or expand the number of features, but never change the number of
instances or pairs.

An LR system may have multiple modules or pipelines as its arguments, each of which has a different role. For example,
a score-based LR system has a preprocessing module, to process single instances, and a comparing module or pipeline,
to calculate LLRs for the pairs.

The `Pipeline`_ class accepts any LiR module, scikit-learn transformer, scikit-learn estimator, or even other
pipelines as its modules, as long as the module can work with the data. For example, scikit-learn's
``StandardScaler`` rescales each feature to have mean 0 and standard deviation 1.

Example:

.. jupyter-execute::

    from sklearn.preprocessing import StandardScaler
    from lir.transform.pipeline import Pipeline

    preprocessing_pipeline = Pipeline([
        ('scale', StandardScaler()),
    ])

    normalized_glass_data = preprocessing_pipeline.fit_apply(glass_data)

A simple LR system could take the Manhattan distance of each pair and then apply some kind of calibration. We can
combine those steps in a pipeline.

.. jupyter-execute::

    from lir.transform.distance import ManhattanDistance
    from lir.algorithms.logistic_regression import LogitCalibrator

    comparing_pipeline = Pipeline([
        ('distance', ManhattanDistance()),
        ('logit', LogitCalibrator()),
    ])

    llrs = comparing_pipeline.fit_apply(pairs)

We can use these components in a score-based LR system:

.. jupyter-execute::

    from lir.lrsystems.score_based import ScoreBasedSystem

    # initialize the score-based LR system with the components we created before
    lrsystem = ScoreBasedSystem(preprocessing_pipeline=preprocessing_pipeline,
                                pairing_function=pairing_method,
                                evaluation_pipeline=comparing_pipeline)

    # use the training data to fit the LR system
    lrsystem.fit(training_data)

    # use the test data to calculate LLRs
    llrs = lrsystem.apply(test_data)

    # plot results
    import lir.plotting
    with lir.plotting.show() as ax:
        ax.lr_histogram(llrs)

    # zoom in on the LLRs around 0
    with lir.plotting.show() as ax:
        ax.set_xlim(-8, 3)
        ax.lr_histogram(llrs, bins=100)

Above, we used a simple train/test split. Alternatives such as **cross-validation** use the data more efficiently, but
we have to deal with multiple train/test splits.

.. jupyter-execute::

    from lir.data_strategies import SourcesCrossValidation
    from lir.data.models import concatenate_instances

    # initialize 5-fold cross-validation
    splitter = SourcesCrossValidation(folds=5)

    # initialize the results as an empty list
    results = []

    for training_data, test_data in splitter.apply(glass_data):
        # since we do five-fold cross-validation, we have five different
        # train/test splits of the same data

        # fitting and applying an LR system can be a one-liner
        subset_llrs = lrsystem.fit(training_data).apply(test_data)

        # add the LLRs to results list
        results.append(subset_llrs)

    # combine all LLRs into a single object
    llrs = concatenate_instances(*results)

    # zoom in on the LLRs around 0
    with lir.plotting.show() as ax:
        ax.set_xlim(-8, 3)
        ax.lr_histogram(llrs, bins=100)

This time, we have twice as much test data altogether, because each instance ends up in exactly one of the test sets,
whereas the train/test split reserved only half of the data for testing. The results should look similar to those of
the train/test split.

Is this an adequate LR system?

**Next step:** learn how to `assess an LR system's performance`_.

.. _LR system architectures: reference.html#lrsystem-architecture
.. _selection guide: reference.html#lrsystem_yaml
.. _assess an LR system's performance: lrsystem-assessment.html
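
As a side note on the cross-validation example: the sketch below illustrates, with NumPy only, why a source-level
five-fold split uses every instance as test data exactly once. It is not LiR's implementation. It assumes that each
source is assigned to exactly one fold, which the name ``SourcesCrossValidation`` suggests but which is not confirmed
here. The function ``source_folds`` and the toy data are hypothetical and serve only as illustration.

.. code-block:: python

    import numpy as np


    def source_folds(source_ids, n_folds, seed=0):
        """Yield (train_mask, test_mask) pairs, splitting by source instead of by instance.

        Hypothetical helper for illustration; not part of LiR.
        """
        rng = np.random.default_rng(seed)
        # shuffle the unique sources and divide them over the folds
        unique_sources = rng.permutation(np.unique(source_ids))
        for test_sources in np.array_split(unique_sources, n_folds):
            test_mask = np.isin(source_ids, test_sources)
            yield ~test_mask, test_mask


    # toy data (assumption): 10 sources with 3 instances each
    source_ids = np.repeat(np.arange(10), 3)

    # count how often each instance is used as test data
    test_counts = np.zeros(len(source_ids), dtype=int)
    for train_mask, test_mask in source_folds(source_ids, n_folds=5):
        # a source never contributes to both the training set and the test set
        assert not set(source_ids[train_mask]) & set(source_ids[test_mask])
        test_counts += test_mask

    print(f'Instances used as test data over all folds: {test_counts.sum()} of {len(source_ids)}.')

Under these assumptions, all 30 toy instances are tested once, where a single split with half of the sources in the
test set would test only 15 of them. Splitting by source rather than by instance keeps measurements of the same
fragment out of both sets at once.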