lir.data.datasets package
Submodules
lir.data.datasets.alcohol_breath_analyser module
- class lir.data.datasets.alcohol_breath_analyser.AlcoholBreathAnalyser(ill_calibrated: bool = False)
Bases: DataProvider

Alcohol Breath Analyser example class.
- Example from paper:
Peter Vergeer, Andrew van Es, Arent de Jongh, Ivo Alberink and Reinoud Stoel, Numerical likelihood ratios outputted by LR systems are often based on extrapolation: When to stop extrapolating? In: Science and Justice 56 (2016) 482–491.
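The `ill_calibrated` flag suggests this provider can mimic a system whose likelihood ratios overstate the evidence. As a purely hypothetical illustration (not the lir implementation), a likelihood ratio compares the density of a measurement under two competing hypotheses, and miscalibration can be modelled by exaggerating that ratio:

```python
from statistics import NormalDist

# Two hypothetical score distributions under competing hypotheses.
h1 = NormalDist(mu=1.0, sigma=0.5)  # e.g. same-source scores
h2 = NormalDist(mu=0.0, sigma=0.5)  # e.g. different-source scores

def lr(x: float) -> float:
    """Likelihood ratio of observation x under h1 versus h2."""
    return h1.pdf(x) / h2.pdf(x)

def ill_calibrated_lr(x: float, exponent: float = 2.0) -> float:
    """One way a system can be ill-calibrated: the evidence is
    systematically overstated, modelled here as LR ** exponent."""
    return lr(x) ** exponent
```

At the midpoint between the two means the well-calibrated LR is exactly 1; the exaggerated variant departs from the true LR everywhere else.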
lir.data.datasets.feature_data_csv module
- class lir.data.datasets.feature_data_csv.FeatureDataCsvFileParser(file: PathLike, **kwargs: Any)
Bases: FeatureDataCsvParser

Read CSV data from file.
- get_instances() FeatureData
Retrieve FeatureData instances.
- class lir.data.datasets.feature_data_csv.FeatureDataCsvHttpParser(url: str, session: Session, **kwargs: Any)
Bases: FeatureDataCsvParser

Read CSV data from a URL.
- get_instances() FeatureData
Retrieve FeatureData from the remote resource.
- class lir.data.datasets.feature_data_csv.FeatureDataCsvParser(source_id_column: str | None = None, label_column: str | None = None, instance_id_column: str | None = None, role_assignment_column: str | None = None, ignore_columns: list[str] | None = None, message_prefix: str = '')
Bases: DataProvider, ABC

Parses a CSV file into a FeatureData object.

This is an abstract class with implementations for different sources:

- for reading from a local file, use FeatureDataCsvFileParser;
- for reading from a URL, use FeatureDataCsvHttpParser;
- for reading from a stream, use FeatureDataCsvStreamParser.
Example: let’s say we have data with three features and source ids.

```csv
source_id,feature1,feature2,feature3,name_of_an_irrelevant_column
0,1,10,1,sherlock
0,1,11,1,holmes
1,20,30,1,irene
1,18,32,3,adler
2,5,10,8,professor
2,1,11,8,moriarty
```

This file can be parsed from the following YAML:

```yaml
data:
  provider: feature_data_csv
  path: path/to/file.csv
  source_id_column: source_id
  ignore_columns:
    - name_of_an_irrelevant_column
```
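The parsing behaviour configured above can be sketched with the standard library alone. This is a hypothetical illustration of the column handling (not the lir implementation): the `source_id_column` is pulled out, `ignore_columns` are dropped, and the remaining columns become feature values.

```python
import csv
import io

# Inline copy of the example CSV from the documentation above.
raw = """source_id,feature1,feature2,feature3,name_of_an_irrelevant_column
0,1,10,1,sherlock
0,1,11,1,holmes
1,20,30,1,irene
1,18,32,3,adler
2,5,10,8,professor
2,1,11,8,moriarty
"""

source_id_column = "source_id"
ignore_columns = {"name_of_an_irrelevant_column"}

source_ids, features = [], []
for row in csv.DictReader(io.StringIO(raw)):
    # Pull the source id out of the row, then keep only feature columns.
    source_ids.append(row.pop(source_id_column))
    features.append([float(v) for k, v in row.items() if k not in ignore_columns])
```

The result corresponds to a FeatureData object with six instances from three sources, each instance carrying three feature values.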
- class lir.data.datasets.feature_data_csv.FeatureDataCsvStreamParser(fp: IO, **kwargs: Any)
Bases: FeatureDataCsvParser

Read data from a streamed CSV.
- get_instances() FeatureData
Retrieve FeatureData instances from CSV stream.
lir.data.datasets.glass module
- class lir.data.datasets.glass.GlassData(cache_dir: PathLike)
Bases: DataProvider

LA-ICP-MS measurements of elemental concentration from float glass.
The measurements are from reference glass from casework, collected in the past 10 years or so. For more info on the DataProvider, see: https://github.com/NetherlandsForensicInstitute/elemental_composition_glass
This data provider has a pre-defined train/test split, with a training set of three instances per source, and a test set of five instances per source.
Data are retrieved from the web as needed and stored locally for later use.
- get_instances() FeatureData
Returns data with pre-defined assignments of training data and test data.
The training data is read from training.csv and has three instances (replicates) per source. The test data is read from duplo.csv and triplo.csv and has a total of five instances per source.
The features are elemental concentrations on a log_10 basis, and normalized to Si. The elements are: K39, Ti49, Mn55, Rb85, Sr88, Zr90, Ba137, La139, Ce140, Pb208
The source_ids are unique identifiers of a glass particle. Each particle is from a different reference window. An instance is a replicate measurement on a glass particle. Source ids are prefixed with the role assignment, e.g. ‘test-123’ and ‘train-123’. The ids ‘test-123’ and ‘train-123’ refer to different glass particles (and therefore different reference windows).
The instance_ids values of an instance are a concatenation of the filename and a row number, e.g. “training.csv:22”.
The data are returned as a FeatureData object with the following properties:

- features: an (n, 10) array of feature values
- source_ids: a 1d array of source ids (str)
- instance_ids: a 1d array of unique instance ids (str)
- role_assignments: a 1d array of role assignments (values “train” or “test”)
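The role prefix on each source id makes the pre-defined split easy to recover downstream. A hypothetical sketch (not the lir API) of separating instances by that prefix, using made-up ids in the documented formats:

```python
# Made-up ids following the documented formats: source ids are prefixed
# with the role, instance ids are "filename:row_number".
source_ids = ["train-1", "train-1", "train-1", "test-1", "test-1"]
instance_ids = ["training.csv:2", "training.csv:3", "training.csv:4",
                "duplo.csv:2", "triplo.csv:2"]

split = {"train": [], "test": []}
for sid, iid in zip(source_ids, instance_ids):
    # "train-1" -> role "train", particle id "1".
    role, _, particle = sid.partition("-")
    split[role].append((particle, iid))
```

Note that, as documented above, particle "1" in the train split and particle "1" in the test split are different glass particles from different reference windows.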
lir.data.datasets.synthesized_normal_binary module
- class lir.data.datasets.synthesized_normal_binary.SynthesizedNormalBinaryData(data_classes: Mapping[Any, SynthesizedNormalDataClass], seed: int)
Bases: DataProvider

Implementation of a data source generating normally distributed binary class data.
- get_instances() FeatureData
Returns instances with randomly synthesized data and binary labels.
The features are drawn from a normal distribution, as configured. The metadata vector is empty, with dimensions (n, 0).
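The constructor signature (a mapping from label to a per-class distribution, plus a seed) suggests the following generation scheme. This is a hypothetical stdlib sketch, not the lir implementation, with made-up class parameters:

```python
import random

# label -> (mean, std, size); hypothetical parameters for two classes.
data_classes = {0: (0.0, 1.0, 100), 1: (3.0, 1.0, 100)}

rng = random.Random(42)  # fixed seed for reproducibility
features, labels = [], []
for label, (mean, std, size) in data_classes.items():
    for _ in range(size):
        features.append(rng.gauss(mean, std))
        labels.append(label)
```

With a fixed seed the synthesized data set is reproducible, which is what makes such providers useful for debugging a pipeline.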
- class lir.data.datasets.synthesized_normal_binary.SynthesizedNormalDataClass(mean: float, std: float, size: int | tuple[int, int])
Bases: object

Representation of normally distributed data, generated with a random number generator.

The generated data is useful for debugging purposes or for gaining insight into the effect of varying parts of the LR system pipeline.
- get(rng: Generator) ndarray
Draw random samples from the configured normal distribution.
lir.data.datasets.synthesized_normal_multiclass module
- class lir.data.datasets.synthesized_normal_multiclass.SynthesizedDimension(population_mean: float, population_std: float, sources_std: float)
Bases: NamedTuple

Representation of a data distribution.
- population_mean: float
Alias for field number 0
- population_std: float
Alias for field number 1
- sources_std: float
Alias for field number 2
- class lir.data.datasets.synthesized_normal_multiclass.SynthesizedNormalMulticlassData(dimensions: list[SynthesizedDimension], population_size: int, sources_size: int, seed: int | None)
Bases: DataProvider

Implementation of a data source generating normally distributed multiclass data.
- get_instances() FeatureData
Return instances with randomly synthesized data and multi-class labels.
The features are drawn from a normal distribution, as configured.
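The SynthesizedDimension fields (population_mean, population_std, sources_std) together with population_size and sources_size suggest a two-level model: each source gets its own mean drawn from the population distribution, and instances scatter around their source mean. A hypothetical stdlib sketch of that model (not the lir implementation), for a single dimension:

```python
import random

# Hypothetical parameters for one SynthesizedDimension-like setup.
population_mean, population_std = 0.0, 10.0   # between-source variation
sources_std = 1.0                              # within-source variation
population_size, sources_size = 5, 3           # 5 sources, 3 instances each

rng = random.Random(0)  # seeded for reproducibility
features, source_ids = [], []
for source in range(population_size):
    # Each source has its own mean, drawn from the population.
    source_mean = rng.gauss(population_mean, population_std)
    for _ in range(sources_size):
        # Instances scatter around their source's mean.
        features.append(rng.gauss(source_mean, sources_std))
        source_ids.append(source)
```

Because population_std is much larger than sources_std here, instances from the same source cluster tightly relative to the spread between sources, which is the structure an LR system is meant to exploit.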