lir.data.datasets package
Submodules
lir.data.datasets.alcohol_breath_analyser module
- class lir.data.datasets.alcohol_breath_analyser.AlcoholBreathAnalyser(ill_calibrated: bool = False)
Bases: DataProvider

Alcohol Breath Analyser example class.
- Example from paper:
Peter Vergeer, Andrew van Es, Arent de Jongh, Ivo Alberink and Reinoud Stoel, Numerical likelihood ratios outputted by LR systems are often based on extrapolation: When to stop extrapolating? In: Science and Justice 56 (2016) 482–491.
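The `ill_calibrated` flag suggests this provider can mimic a system whose likelihood ratios overstate the evidence. As a purely hypothetical illustration (not the lir implementation), a likelihood ratio compares the density of a measurement under two competing hypotheses, and miscalibration can be modelled by exaggerating that ratio:

```python
from statistics import NormalDist

# Two hypothetical score distributions under competing hypotheses.
h1 = NormalDist(mu=1.0, sigma=0.5)  # e.g. same-source scores
h2 = NormalDist(mu=0.0, sigma=0.5)  # e.g. different-source scores

def lr(x: float) -> float:
    """Likelihood ratio of observation x under h1 versus h2."""
    return h1.pdf(x) / h2.pdf(x)

def ill_calibrated_lr(x: float, exponent: float = 2.0) -> float:
    """One way a system can be ill-calibrated: the evidence is
    systematically overstated, modelled here as LR ** exponent."""
    return lr(x) ** exponent
```

At the midpoint between the two means the well-calibrated LR is exactly 1; the exaggerated variant departs from the true LR everywhere else.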
lir.data.datasets.feature_data_csv module
- class lir.data.datasets.feature_data_csv.FeatureDataCsvFileParser(file: PathLike, **kwargs: Any)
Bases: FeatureDataCsvParser

Read CSV data from file.
- get_instances() FeatureData
Retrieve FeatureData instances.
- class lir.data.datasets.feature_data_csv.FeatureDataCsvHttpParser(url: str, session: Session, **kwargs: Any)
Bases: FeatureDataCsvParser

Read CSV data from a URL.
- get_instances() FeatureData
Retrieve FeatureData from the remote resource.
- class lir.data.datasets.feature_data_csv.FeatureDataCsvParser(source_id_column: str | None = None, label_column: str | None = None, instance_id_column: str | None = None, role_assignment_column: str | None = None, ignore_columns: list[str] | None = None, message_prefix: str = '')
Bases: DataProvider, ABC

Parses a CSV file into a FeatureData object.

This is an abstract class with implementations for different sources:

- for reading from a local file, use FeatureDataCsvFileParser;
- for reading from a URL, use FeatureDataCsvHttpParser;
- for reading from a stream, use FeatureDataCsvStreamParser.
Example: let’s say we have data with three features and source ids.

```csv
source_id,feature1,feature2,feature3,name_of_an_irrelevant_column
0,1,10,1,sherlock
0,1,11,1,holmes
1,20,30,1,irene
1,18,32,3,adler
2,5,10,8,professor
2,1,11,8,moriarty
```

This file can be parsed from the following YAML:

```yaml
data:
  provider: feature_data_csv
  path: path/to/file.csv
  source_id_column: source_id
  ignore_columns:
    - name_of_an_irrelevant_column
```
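The parsing behaviour configured above can be sketched with the standard library alone. This is a hypothetical illustration of the column handling (not the lir implementation): the `source_id_column` is pulled out, `ignore_columns` are dropped, and the remaining columns become feature values.

```python
import csv
import io

# Inline copy of the example CSV from the documentation above.
raw = """source_id,feature1,feature2,feature3,name_of_an_irrelevant_column
0,1,10,1,sherlock
0,1,11,1,holmes
1,20,30,1,irene
1,18,32,3,adler
2,5,10,8,professor
2,1,11,8,moriarty
"""

source_id_column = "source_id"
ignore_columns = {"name_of_an_irrelevant_column"}

source_ids, features = [], []
for row in csv.DictReader(io.StringIO(raw)):
    # Pull the source id out of the row, then keep only feature columns.
    source_ids.append(row.pop(source_id_column))
    features.append([float(v) for k, v in row.items() if k not in ignore_columns])
```

The result corresponds to a FeatureData object with six instances from three sources, each instance carrying three feature values.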
- class lir.data.datasets.feature_data_csv.FeatureDataCsvStreamParser(fp: IO, **kwargs: Any)
Bases: FeatureDataCsvParser

Read data from a streamed CSV.
- get_instances() FeatureData
Retrieve FeatureData instances from CSV stream.
lir.data.datasets.glass module
- class lir.data.datasets.glass.GlassData(cache_dir: PathLike)
Bases: DataProvider

LA-ICP-MS measurements of elemental concentration from float glass.
The measurements are from reference glass from casework, collected in the past 10 years or so. For more info on the DataProvider, see: https://github.com/NetherlandsForensicInstitute/elemental_composition_glass
This data provider has a pre-defined train/test split, with a training set of three instances per source, and a test set of five instances per source.
Data are retrieved from the web as needed and stored locally for later use.
- get_instances() FeatureData
Returns data with pre-defined assignments of training data and test data.
The training data is read from training.csv and has three instances (replicates) per source. The test data is read from duplo.csv and triplo.csv and has a total of five instances per source.
The features are elemental concentrations on a log_10 basis, and normalized to Si. The elements are: K39, Ti49, Mn55, Rb85, Sr88, Zr90, Ba137, La139, Ce140, Pb208
The source_ids are unique identifiers of a glass particle. Each particle is from a different reference window. An instance is a replicate measurement on a glass particle. Source ids are prefixed with the role assignment, e.g. ‘test-123’ and ‘train-123’. The ids ‘test-123’ and ‘train-123’ refer to different glass particles (and therefore different reference windows).
The instance_ids values of an instance are a concatenation of the filename and a row number, e.g. “training.csv:22”.
The data are returned as a FeatureData object with the following properties:

- features: an (n, 10) array of feature values
- source_ids: a 1d array of source ids (str)
- instance_ids: a 1d array of unique instance ids (str)
- role_assignments: a 1d array of role assignments (values “train” or “test”)
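The role prefix on each source id makes the pre-defined split easy to recover downstream. A hypothetical sketch (not the lir API) of separating instances by that prefix, using made-up ids in the documented formats:

```python
# Made-up ids following the documented formats: source ids are prefixed
# with the role, instance ids are "filename:row_number".
source_ids = ["train-1", "train-1", "train-1", "test-1", "test-1"]
instance_ids = ["training.csv:2", "training.csv:3", "training.csv:4",
                "duplo.csv:2", "triplo.csv:2"]

split = {"train": [], "test": []}
for sid, iid in zip(source_ids, instance_ids):
    # "train-1" -> role "train", particle id "1".
    role, _, particle = sid.partition("-")
    split[role].append((particle, iid))
```

Note that, as documented above, particle "1" in the train split and particle "1" in the test split are different glass particles from different reference windows.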
lir.data.datasets.synthesized_normal_binary module
- class lir.data.datasets.synthesized_normal_binary.SynthesizedNormalBinaryData(data_classes: Mapping[Any, SynthesizedNormalDataClass], seed: int)
Bases: DataProvider

Implementation of a data source generating normally distributed binary class data.
- get_instances() FeatureData
Returns instances with randomly synthesized data and binary labels.
The features are drawn from a normal distribution, as configured. The metadata vector is empty, with dimensions (n, 0).
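The constructor signature (a mapping from label to a per-class distribution, plus a seed) suggests the following generation scheme. This is a hypothetical stdlib sketch, not the lir implementation, with made-up class parameters:

```python
import random

# label -> (mean, std, size); hypothetical parameters for two classes.
data_classes = {0: (0.0, 1.0, 100), 1: (3.0, 1.0, 100)}

rng = random.Random(42)  # fixed seed for reproducibility
features, labels = [], []
for label, (mean, std, size) in data_classes.items():
    for _ in range(size):
        features.append(rng.gauss(mean, std))
        labels.append(label)
```

With a fixed seed the synthesized data set is reproducible, which is what makes such providers useful for debugging a pipeline.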
- class lir.data.datasets.synthesized_normal_binary.SynthesizedNormalDataClass(mean: float, std: float, size: int | tuple[int, int])
Bases: object

Representation of normally distributed data, generated with a random number generator.

The generated data is useful for debugging purposes or for gaining insight into the effect of varying parts of the LR system pipeline.
- get(rng: Generator) ndarray
Draw random samples from the configured normal distribution.
lir.data.datasets.synthesized_normal_multiclass module
- class lir.data.datasets.synthesized_normal_multiclass.SynthesizedDimension(population_mean: float, population_std: float, sources_std: float)
Bases: NamedTuple

Representation of a data distribution.
- population_mean: float
Alias for field number 0
- population_std: float
Alias for field number 1
- sources_std: float
Alias for field number 2
- class lir.data.datasets.synthesized_normal_multiclass.SynthesizedNormalMulticlassData(dimensions: list[SynthesizedDimension], population_size: int, sources_size: int, seed: int | None)
Bases: DataProvider

Implementation of a data source generating normally distributed multiclass data.
- get_instances() FeatureData
Return instances with randomly synthesized data and multi-class labels.
The features are drawn from a normal distribution, as configured.
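The SynthesizedDimension fields (population_mean, population_std, sources_std) together with population_size and sources_size suggest a two-level model: each source gets its own mean drawn from the population distribution, and instances scatter around their source mean. A hypothetical stdlib sketch of that model (not the lir implementation), for a single dimension:

```python
import random

# Hypothetical parameters for one SynthesizedDimension-like setup.
population_mean, population_std = 0.0, 10.0   # between-source variation
sources_std = 1.0                              # within-source variation
population_size, sources_size = 5, 3           # 5 sources, 3 instances each

rng = random.Random(0)  # seeded for reproducibility
features, source_ids = [], []
for source in range(population_size):
    # Each source has its own mean, drawn from the population.
    source_mean = rng.gauss(population_mean, population_std)
    for _ in range(sources_size):
        # Instances scatter around their source's mean.
        features.append(rng.gauss(source_mean, sources_std))
        source_ids.append(source)
```

Because population_std is much larger than sources_std here, instances from the same source cluster tightly relative to the spread between sources, which is the structure an LR system is meant to exploit.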