lir.data package
Submodules
lir.data.data_strategies module
- class lir.data.data_strategies.BinaryCrossValidation(folds: int, seed: int | None = None)
Bases: DataStrategy
Representation of a K-fold cross validation iterator over each train/test split fold.
The input data should have class labels. This split assigns instances of both classes to each “fold” subset.
This strategy might be referenced in the YAML registry as follows:
```
data_strategies:
  binary_cross_validation: lir.data.data_strategies.BinaryCrossValidation
```
In the benchmark configuration YAML, this strategy can be referenced as follows:
```
splits:
  strategy: binary_cross_validation
```
- apply(instances: FeatureData) Iterable[tuple[FeatureData, FeatureData]]
Allow iteration by looping over the resulting train/test split(s).
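Example (a minimal, hypothetical sketch based on the documented constructor; it assumes binary labels encoded as 0/1):
```
import numpy as np
from lir.data.data_strategies import BinaryCrossValidation
from lir.data.models import FeatureData

# toy data: 20 instances with 3 features each, alternating class labels
data = FeatureData(
    features=np.random.default_rng(0).normal(size=(20, 3)),
    labels=np.array([0, 1] * 10),
)

cv = BinaryCrossValidation(folds=5, seed=42)
for train, test in cv.apply(data):
    print(train.features.shape, test.features.shape)
```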
- class lir.data.data_strategies.BinaryTrainTestSplit(test_size: float | int, seed: int | None = None)
Bases: DataStrategy
Representation of a train/test split.
The input data should have hypothesis labels. This split assigns instances of both classes to the training set and the test set.
- apply(instances: FeatureData) Iterable[tuple[FeatureData, FeatureData]]
Allow iteration by looping over the resulting train/test split(s).
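Example (a sketch reusing the toy `data` from the previous example; `test_size` is given as a fraction here):
```
from lir.data.data_strategies import BinaryTrainTestSplit

split = BinaryTrainTestSplit(test_size=0.25, seed=1)
for train, test in split.apply(data):
    print(len(train.features), len(test.features))
```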
- class lir.data.data_strategies.MulticlassCrossValidation(folds: int)
Bases: DataStrategy
Representation of a K-fold cross validation iterator over train/test splits.
The input data should have source_ids. This split assigns all instances of a source to the same “fold” subset.
This strategy might be referenced in the YAML registry as follows:
```
data_strategies:
  multiclass_cross_validation: lir.data.data_strategies.MulticlassCrossValidation
```
In the benchmark configuration YAML, this strategy can be referenced as follows:
```
data:
  […]
  splits:
    strategy: multiclass_cross_validation
    folds: 5
```
- apply(instances: FeatureData) Iterable[tuple[FeatureData, FeatureData]]
Allow iteration by looping over the resulting train/test split(s).
- class lir.data.data_strategies.MulticlassTrainTestSplit(test_size: float | int, seed: int | None = None)
Bases: DataStrategy
Representation of a multi-class train/test split.
The input data should have source_ids. This split assigns all instances of a source to either the training set or the test set.
- apply(instances: FeatureData) Iterable[tuple[FeatureData, FeatureData]]
Allow iteration by looping over the resulting train/test split(s).
- class lir.data.data_strategies.PredefinedTrainTestSplit
Bases: DataStrategy
Splits data into a training set and a test set, according to pre-existing assignments in the data.
Presumes a role_assignments field in the data, which has the value “train” for instances that will be part of the training set, and “test” for instances in the test set.
In the benchmark configuration YAML, this strategy can be referenced as follows:
```
cross_validation_splits:
  strategy: predefined_train_test_split
  data_origin: ${data}
```
- apply(instances: FeatureData) Iterable[tuple[FeatureData, FeatureData]]
Split the FeatureData into a train and a test split.
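Example (a sketch assuming role_assignments can be supplied as an extra field, since FeatureData accepts **extra_data):
```
import numpy as np
from lir.data.data_strategies import PredefinedTrainTestSplit
from lir.data.models import FeatureData

data = FeatureData(
    features=np.zeros((4, 2)),
    labels=np.array([0, 1, 0, 1]),
    role_assignments=np.array(["train", "train", "test", "test"]),
)
for train, test in PredefinedTrainTestSplit().apply(data):
    print(len(train.features), len(test.features))  # 2 2
```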
lir.data.io module
- class lir.data.io.DataFileBuilderCsv(path: Path, write_mode: str = 'w', write_header: bool | None = None)
Bases: object
This class adds convenience methods to write data to an output CSV file.
- add_column(data: ndarray, header_prefix: str = '', dimension_headers: dict[int, list[str]] | None = None) None
Append data and corresponding headers to self._all_data and self._all_headers.
The data argument is an arbitrary numpy array. Its first dimension holds the rows; any other dimension becomes columns in the CSV output.
- Parameters:
data – the numpy array to append
header_prefix – the prefix for all headers
dimension_headers – a mapping from each dimension to its headers; the dimensions correspond to the dimensions of the data. Because dimension 0 corresponds to rows, it should have no headers
- write() None
Write data to CSV file.
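Example (an illustrative sketch; the exact header layout in the output file may differ):
```
from pathlib import Path
import numpy as np
from lir.data.io import DataFileBuilderCsv

builder = DataFileBuilderCsv(Path("scores.csv"))
scores = np.random.default_rng(0).normal(size=(10, 2))  # 10 rows, 2 columns
# headers for dimension 1 (the columns); dimension 0 holds the rows
builder.add_column(scores, header_prefix="score_", dimension_headers={1: ["a", "b"]})
builder.write()
```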
- class lir.data.io.RemoteResource(url: str, local_directory: Path)
Bases: object
Provides a method to open files from a remote source.
This can be handy if a resource is located on, e.g., a GitHub repository.
- open(filename: str, mode: str = 'r') IO[Any]
Return an open file stream for a remote resource.
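Example (a sketch; the URL is a placeholder, and it assumes open() resolves filename relative to the configured url):
```
from pathlib import Path
from lir.data.io import RemoteResource

resource = RemoteResource(
    url="https://example.org/datasets",  # placeholder, not a real resource
    local_directory=Path("/tmp/lir-cache"),
)
with resource.open("data.csv") as f:
    header = f.readline()
```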
- lir.data.io.search_path(path: Path) Path
Searches the Python path for a file.
If path is absolute, it is normalized by Path.resolve() and returned.
If path is relative, the file is searched for in sys.path: the path is interpreted as relative to each sys.path element in turn, and if the file exists there, it is normalized by Path.resolve() and returned.
If the file is not found, the path is normalized and made absolute by Path.resolve() and returned.
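Example (a sketch; the relative path is a placeholder):
```
from pathlib import Path
from lir.data.io import search_path

# a relative path is looked up in sys.path; an absolute path is resolved directly
config_path = search_path(Path("configs/benchmark.yaml"))
```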
lir.data.models module
- class lir.data.models.DataProvider
Bases: ABC
Base class for data providers.
Each data provider should provide access to instance data by implementing the get_instances() method.
- abstractmethod get_instances() FeatureData
Returns a FeatureData object, containing data for a set of instances.
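Example (a sketch of a hypothetical provider serving a fixed in-memory dataset):
```
import numpy as np
from lir.data.models import DataProvider, FeatureData

class InMemoryProvider(DataProvider):
    """Hypothetical provider returning a fixed random dataset."""

    def get_instances(self) -> FeatureData:
        return FeatureData(
            features=np.random.default_rng(0).normal(size=(6, 2)),
            labels=np.array([0, 0, 0, 1, 1, 1]),
        )

instances = InMemoryProvider().get_instances()
```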
- class lir.data.models.DataStrategy
Bases: ABC
Base class for data (splitting) strategies.
- abstractmethod apply(instances: FeatureData) Iterable[tuple[FeatureData, FeatureData]]
Provide iterator to access training and test set.
Returns an iterator over tuples of a training set and a test set. Both the training set and the test set are represented by FeatureData objects.
- class lir.data.models.FeatureData(*, labels: Annotated[ndarray | None, AfterValidator(func=_validate_labels)] = None, source_ids: ndarray | None = None, features: Annotated[ndarray, AfterValidator(func=_validate_features)], **extra_data: Any)
Bases: InstanceData
Data class for feature data.
- Variables:
features – an array of instance features, with one row per instance
- check_features() Self
Validate the features.
- check_matching_shapes() Self
Validate the shape of the features and the labels are matching.
- features: Annotated[ndarray, AfterValidator(func=_validate_features)]
- model_config = {'arbitrary_types_allowed': True, 'extra': 'allow', 'frozen': True}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- class lir.data.models.InstanceData(*, labels: Annotated[ndarray | None, AfterValidator(func=_validate_labels)] = None, source_ids: ndarray | None = None, **extra_data: Any)
Bases: BaseModel, ABC
Base class for data on instances.
- Variables:
labels – an array of labels, a 1-dimensional array with one value per instance
source_ids – an array of source ids, a 2-dimensional array with one column and one value per instance
- property all_fields: list[str]
Returns a list of all fields, including both mandatory and extra fields.
- apply(fn: Callable, *args: Any, **kwargs: Any) Self
Apply a custom function to this InstanceData object.
The function fn is applied to all Numpy fields. Other fields are copied as-is.
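Example (a sketch; the toy data is hypothetical):
```
import numpy as np
from lir.data.models import FeatureData

data = FeatureData(
    features=np.arange(12.0).reshape(6, 2),
    labels=np.array([0, 1, 0, 1, 0, 1]),
)
# slices every numpy field (features, labels, ...); other fields are copied
subset = data.apply(lambda arr: arr[:3])
```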
- check_sourceid_shape() Self
Validate the shape of the source_ids.
- check_sourceids_labels_match() Self
Validate the source_ids and labels have matching shapes.
- combine(others: list[InstanceData] | InstanceData, fn: Callable, *args: Any, **kwargs: Any) Self
Apply a custom combination function to InstanceData objects.
All objects must have the same types and fields, and the same values for all non-numpy-array fields, or an error is raised. Numpy array fields are combined using fn. Other fields are copied as-is.
- concatenate(*others: InstanceData) Self
Concatenate instances from InstanceData objects.
All concatenated objects must have the same types and fields. How fields are concatenated may depend on the subclass. By default, they must have the same values for all non-numpy array fields, or an error is raised. Numpy fields are concatenated using np.concatenate. Other fields are copied as-is.
Returns a new object with the concatenated instances.
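Example (a sketch with hypothetical toy data):
```
import numpy as np
from lir.data.models import FeatureData

a = FeatureData(features=np.zeros((2, 3)), labels=np.array([0, 1]))
b = FeatureData(features=np.ones((4, 3)), labels=np.array([0, 1, 0, 1]))
merged = a.concatenate(b)  # 6 instances; numpy fields joined with np.concatenate
```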
- property has_labels: bool
Returns True iff the instances are labeled.
- has_same_type(other: Any) bool
Compare these instance data to another class.
Returns True iff:
- other has the same class
- other has the same fields
- all fields have the same type
- labels: Annotated[ndarray | None, AfterValidator(func=_validate_labels)]
- model_config = {'arbitrary_types_allowed': True, 'extra': 'allow', 'frozen': True}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- replace(**kwargs: Any) Self
Returns a modified copy with updated values.
- Parameters:
kwargs – the fields to replace
- Returns:
the modified copy
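Example (a sketch; since the model is frozen, replace() returns a copy and leaves the original untouched):
```
import numpy as np
from lir.data.models import FeatureData

data = FeatureData(features=np.zeros((3, 2)), labels=np.array([0, 1, 1]))
flipped = data.replace(labels=np.array([1, 0, 0]))  # `data` itself is unchanged
```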
- replace_as(datatype: type[InstanceDataType], **kwargs: Any) InstanceDataType
Returns a modified copy with updated data type and values.
- Parameters:
datatype – the return type
kwargs – the fields to replace
- Returns:
the modified copy
- property require_labels: ndarray
Returns the labels, guaranteeing they are not None (or raising an error).
- source_ids: ndarray | None
- class lir.data.models.LLRData(*, labels: Annotated[ndarray | None, AfterValidator(func=_validate_labels)] = None, source_ids: ndarray | None = None, features: Annotated[ndarray, AfterValidator(func=_validate_features)], llr_upper_bound: float | None = None, llr_lower_bound: float | None = None, **extra_data: Any)
Bases: FeatureData
Representation of calculated LLR values.
- Variables:
llrs – 1-dimensional numpy array of LLR values
has_intervals – indicates whether the LLRs have intervals
llr_intervals – numpy array of LLR values of dimensions (n, 2), or None if the LLRs have no intervals
llr_upper_bound – upper bound applied to the LLRs, or None if no upper bound was applied
llr_lower_bound – lower bound applied to the LLRs, or None if no lower bound was applied
- check_features_are_llrs() Self
Validate the feature data.
- check_misleading_finite() None
Check whether all values are either finite or not misleading.
- property has_intervals: bool
Indicates whether the LLRs have intervals.
- property llr_bounds: tuple[float | None, float | None]
Returns a tuple (min_llr, max_llr).
- property llr_intervals: ndarray | None
Returns a numpy array of LLR values of dimensions (n, 2), or None if the LLRs have no intervals.
- llr_lower_bound: float | None
- llr_upper_bound: float | None
- property llrs: ndarray
Returns a 1-dimensional numpy array of LLR values.
- model_config = {'arbitrary_types_allowed': True, 'extra': 'allow', 'frozen': True}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- class lir.data.models.PairedFeatureData(*, labels: Annotated[ndarray | None, AfterValidator(func=_validate_labels)] = None, source_ids: ndarray | None = None, features: Annotated[ndarray, AfterValidator(func=_validate_features)], n_trace_instances: int, n_ref_instances: int, **extra_data: Any)
Bases: FeatureData
Data class for instance pair data.
- Variables:
n_trace_instances – the number of trace instances in each pair
n_ref_instances – the number of reference instances in each pair
features – the features of all instances in the pair, with pairs along the first dimension, and instances along the second
source_ids – the source ids of the trace and reference instances of each pair, a 2-dimensional array with two columns
features_trace – the features of the trace instances
features_ref – the features of the reference instances
source_ids_trace – the source ids of the trace instances
source_ids_ref – the source ids of the reference instances
- check_features_dimensions() Self
Validate feature dimensions.
- check_sourceid_shape() Self
Overrides the InstanceData implementation.
- property features_ref: ndarray
Get the features of the reference instances.
- property features_trace: ndarray
Get the features of the trace instances.
- model_config = {'arbitrary_types_allowed': True, 'extra': 'allow', 'frozen': True}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- n_ref_instances: int
- n_trace_instances: int
- property source_ids_ref: ndarray | None
Get the source ids of the reference instances.
- property source_ids_trace: ndarray | None
Get the source ids of the trace instances.
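Example (a sketch assuming features are shaped (pairs, instances per pair, feature dimensions), as described above):
```
import numpy as np
from lir.data.models import PairedFeatureData

pairs = PairedFeatureData(
    features=np.zeros((5, 2, 3)),  # 5 pairs of 1 trace + 1 reference instance
    source_ids=np.array([[i, i + 100] for i in range(5)]),  # trace and ref source ids
    n_trace_instances=1,
    n_ref_instances=1,
)
trace_features = pairs.features_trace  # features of the trace instances only
```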
- lir.data.models.concatenate_instances(first: InstanceDataType, *others: InstanceDataType) InstanceDataType
Concatenate the instances of the given InstanceData objects.
Alias for first.concatenate(*others).
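Example (a minimal sketch):
```
import numpy as np
from lir.data.models import FeatureData, concatenate_instances

a = FeatureData(features=np.zeros((2, 3)))
b = FeatureData(features=np.ones((1, 3)))
merged = concatenate_instances(a, b)  # equivalent to a.concatenate(b)
```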