lir.data package

Submodules

lir.data.data_strategies module

class lir.data.data_strategies.BinaryCrossValidation(folds: int, seed: int | None = None)

Bases: DataStrategy

Representation of a K-fold cross validation iterator over train/test splits.

The input data should have class labels. This split assigns instances of both classes to each “fold” subset.

This method might be referenced in the YAML registry as follows:

```
data_strategies:
  binary_cross_validation: lir.data.data_strategies.BinaryCrossValidation
```

In the benchmark configuration YAML, this validation can be referenced as follows:

```
splits:
  strategy: binary_cross_validation
```

apply(instances: FeatureData) Iterable[tuple[FeatureData, FeatureData]]

Return an iterable over the resulting train/test split(s).
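For illustration, a minimal sketch of using this strategy directly; the toy features and labels are made up for the example:

```python
import numpy as np

from lir.data.data_strategies import BinaryCrossValidation
from lir.data.models import FeatureData

# Toy data: 20 instances with 3 features each, alternating class labels.
data = FeatureData(
    features=np.random.default_rng(0).normal(size=(20, 3)),
    labels=np.array([0, 1] * 10),
)

strategy = BinaryCrossValidation(folds=5, seed=42)
for train, test in strategy.apply(data):
    print(train.features.shape, test.features.shape)
```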

class lir.data.data_strategies.BinaryTrainTestSplit(test_size: float | int, seed: int | None = None)

Bases: DataStrategy

Representation of a train/test split.

The input data should have hypothesis labels. This split assigns instances of both classes to the training set and the test set.

apply(instances: FeatureData) Iterable[tuple[FeatureData, FeatureData]]

Return an iterable over the resulting train/test split(s).

class lir.data.data_strategies.MulticlassCrossValidation(folds: int)

Bases: DataStrategy

Representation of a K-fold cross validation iterator over train/test splits.

The input data should have source_ids. This split assigns all instances of a source to the same “fold” subset.

This method might be referenced in the YAML registry as follows:

```
data_strategies:
  multiclass_cross_validation: lir.data.data_strategies.MulticlassCrossValidation
```

In the benchmark configuration YAML, this validation can be referenced as follows:

```
data:
  […]
  splits:
    strategy: multiclass_cross_validation
    folds: 5
```

apply(instances: FeatureData) Iterable[tuple[FeatureData, FeatureData]]

Return an iterable over the resulting train/test split(s).

class lir.data.data_strategies.MulticlassTrainTestSplit(test_size: float | int, seed: int | None = None)

Bases: DataStrategy

Representation of a multi-class train/test split.

The input data should have source_ids. This split assigns all instances of a source to either the training set or the test set.

apply(instances: FeatureData) Iterable[tuple[FeatureData, FeatureData]]

Return an iterable over the resulting train/test split(s).
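A sketch of a source-aware split; the source_ids layout (a 2-dimensional array with one column) follows the InstanceData description below:

```python
import numpy as np

from lir.data.data_strategies import MulticlassTrainTestSplit
from lir.data.models import FeatureData

# Four sources with three instances each; source_ids is a 2-dimensional
# array with one column, as described for InstanceData.
data = FeatureData(
    features=np.random.default_rng(0).normal(size=(12, 2)),
    source_ids=np.repeat(np.arange(4), 3).reshape(-1, 1),
)

strategy = MulticlassTrainTestSplit(test_size=0.25, seed=1)
for train, test in strategy.apply(data):
    # All instances of a source end up on the same side of the split.
    print(sorted(set(train.source_ids[:, 0])), sorted(set(test.source_ids[:, 0])))
```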

class lir.data.data_strategies.PredefinedTrainTestSplit

Bases: DataStrategy

Splits data into a training set and a test set, according to pre-existing assignments in the data.

Presumes a role_assignments field in the data, which has the value “train” for instances that will be part of the training set, and “test” for instances in the test set.

In the benchmark configuration YAML, this validation can be referenced as follows:

```
cross_validation_splits:
  strategy: predefined_train_test_split
  data_origin: ${data}
```

apply(instances: FeatureData) Iterable[tuple[FeatureData, FeatureData]]

Split the FeatureData into a training split and a test split.
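A sketch of a predefined split; it assumes the role_assignments field can be supplied as an extra numpy array field when constructing the data:

```python
import numpy as np

from lir.data.data_strategies import PredefinedTrainTestSplit
from lir.data.models import FeatureData

# role_assignments is passed as an extra field; the "train"/"test"
# values follow the convention described above.
data = FeatureData(
    features=np.arange(8, dtype=float).reshape(4, 2),
    role_assignments=np.array(["train", "train", "test", "test"]),
)

for train, test in PredefinedTrainTestSplit().apply(data):
    print(len(train.features), "train instances;", len(test.features), "test instances")
```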

class lir.data.data_strategies.RoleAssignment(*values)

Bases: Enum

Indicate whether the data is part of the train or the test split.

TEST = 'test'
TRAIN = 'train'

lir.data.io module

class lir.data.io.DataFileBuilderCsv(path: Path, write_mode: str = 'w', write_header: bool | None = None)

Bases: object

This class adds convenience methods to write data to an output CSV file.

add_column(data: ndarray, header_prefix: str = '', dimension_headers: dict[int, list[str]] | None = None) None

Append data and corresponding headers to self._all_data and self._all_headers.

The data argument is an arbitrary numpy array. Its first dimension corresponds to the rows; every other dimension becomes columns in the CSV output.

Parameters:
  • data – the data to append, an arbitrary numpy array

  • header_prefix – the prefix for all headers

  • dimension_headers – a mapping from dimensions to their headers; the dimensions correspond to the dimensions of the data. Because dimension 0 corresponds to rows, it should have no headers

write() None

Write data to CSV file.
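A minimal usage sketch; how header_prefix and dimension_headers combine into the final header names is an assumption here:

```python
from pathlib import Path

import numpy as np

from lir.data.io import DataFileBuilderCsv

builder = DataFileBuilderCsv(Path("output.csv"))

# A (3,) array becomes a single CSV column.
builder.add_column(np.array([0.1, 0.7, 0.4]), header_prefix="score")

# A (3, 2) array becomes two columns; dimension 1 gets explicit headers.
builder.add_column(
    np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]),
    header_prefix="llr_",
    dimension_headers={1: ["lower", "upper"]},
)

builder.write()
```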

class lir.data.io.RemoteResource(url: str, local_directory: Path)

Bases: object

Provide a method to open files from a remote source.

This can be handy if a resource is located on, e.g., a GitHub repository.

open(filename: str, mode: str = 'r') IO[Any]

Return an open file stream for a remote resource.
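A usage sketch; the URL and filename are placeholders, and caching under local_directory is presumed rather than documented:

```python
from pathlib import Path

from lir.data.io import RemoteResource

resource = RemoteResource(
    url="https://raw.githubusercontent.com/example/repo/main/data",
    local_directory=Path("cache"),
)

# Open the remote file as a regular text stream.
with resource.open("measurements.csv") as f:
    header = f.readline()
```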

lir.data.io.search_path(path: Path) Path

Search the Python path for a file.

If path is absolute, it is normalized by Path.resolve() and returned.

If path is relative, the file is searched in sys.path. The path is interpreted as relative to sys.path elements one by one, and if it exists, it is normalized by Path.resolve() and returned.

If the file is not found, it is normalized and made absolute by Path.resolve() and returned.
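For example (the relative path is illustrative):

```python
from pathlib import Path

from lir.data.io import search_path

# A relative path is resolved against sys.path entries one by one;
# an absolute path is simply normalized by Path.resolve().
resolved = search_path(Path("data/measurements.csv"))
print(resolved)
```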

lir.data.models module

class lir.data.models.DataProvider

Bases: ABC

Base class for data providers.

Each data provider should provide access to instance data by implementing the get_instances() method.

abstractmethod get_instances() FeatureData

Return a FeatureData object containing data for a set of instances.
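A sketch of a hypothetical in-memory provider implementing this interface:

```python
import numpy as np

from lir.data.models import DataProvider, FeatureData

class InMemoryProvider(DataProvider):
    """Hypothetical provider serving a fixed set of instances."""

    def __init__(self, features: np.ndarray, labels: np.ndarray):
        self._features = features
        self._labels = labels

    def get_instances(self) -> FeatureData:
        # Wrap the stored arrays in the documented data class.
        return FeatureData(features=self._features, labels=self._labels)

provider = InMemoryProvider(np.zeros((4, 2)), np.array([0, 0, 1, 1]))
instances = provider.get_instances()
```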

class lir.data.models.DataStrategy

Bases: ABC

Base class for data (splitting) strategies.

abstractmethod apply(instances: FeatureData) Iterable[tuple[FeatureData, FeatureData]]

Provide an iterator to access the training and test sets.

Returns an iterator over tuples of a training set and a test set. Both the training set and the test set are represented by a FeatureData object.

class lir.data.models.FeatureData(*, labels: Annotated[ndarray | None, AfterValidator(func=_validate_labels)] = None, source_ids: ndarray | None = None, features: Annotated[ndarray, AfterValidator(func=_validate_features)], **extra_data: Any)

Bases: InstanceData

Data class for feature data.

- features – an array of instance features, with one row per instance

check_features() Self

Validate the features.

check_matching_shapes() Self

Validate that the shapes of the features and the labels match.

features: Annotated[ndarray, AfterValidator(func=_validate_features)]
model_config = {'arbitrary_types_allowed': True, 'extra': 'allow', 'frozen': True}

Configuration for the model, should be a dictionary conforming to pydantic.config.ConfigDict.

class lir.data.models.InstanceData(*, labels: Annotated[ndarray | None, AfterValidator(func=_validate_labels)] = None, source_ids: ndarray | None = None, **extra_data: Any)

Bases: BaseModel, ABC

Base class for data on instances.

- labels – an array of labels, a 1-dimensional array with one value per instance

- source_ids – an array of source ids, a 2-dimensional array with one column and one value per instance

property all_fields: list[str]

a list of all fields, including both mandatory and extra fields


apply(fn: Callable, *args: Any, **kwargs: Any) Self

Apply a custom function to this InstanceData object.

The function fn is applied to all Numpy fields. Other fields are copied as-is.
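For example, selecting the first instance of every numpy field (a sketch):

```python
import numpy as np

from lir.data.models import FeatureData

data = FeatureData(
    features=np.array([[1.0, 2.0], [3.0, 4.0]]),
    labels=np.array([0, 1]),
)

# The lambda is applied to the numpy fields (`features` and `labels`);
# non-numpy fields would be copied as-is.
first_only = data.apply(lambda a: a[:1])
```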

check_sourceid_shape() Self

Validate the shape of the source_ids.

check_sourceids_labels_match() Self

Validate that the source_ids and labels have matching shapes.

combine(others: list[InstanceData] | InstanceData, fn: Callable, *args: Any, **kwargs: Any) Self

Apply a custom combination function to InstanceData objects.

All objects must have the same types and fields, and the same values for all non-numpy-array fields, or an error is raised. Numpy fields are combined using fn. Other fields are copied as-is.

concatenate(*others: InstanceData) Self

Concatenate instances from InstanceData objects.

All concatenated objects must have the same types and fields. How fields are concatenated may depend on the subclass. By default, they must have the same values for all non-numpy array fields, or an error is raised. Numpy fields are concatenated using np.concatenate. Other fields are copied as-is.

Returns a new object with the concatenated instances.
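For example:

```python
import numpy as np

from lir.data.models import FeatureData

a = FeatureData(features=np.zeros((2, 3)), labels=np.array([0, 1]))
b = FeatureData(features=np.ones((3, 3)), labels=np.array([1, 1, 0]))

# Numpy fields are concatenated along the instance dimension.
combined = a.concatenate(b)
print(combined.features.shape)  # (5, 3)
```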

property has_labels: bool

True iff the instances are labeled


has_same_type(other: Any) bool

Compare these instance data to another class.

Returns True iff:
- other has the same class
- other has the same fields
- all fields have the same type

labels: Annotated[ndarray | None, AfterValidator(func=_validate_labels)]
model_config = {'arbitrary_types_allowed': True, 'extra': 'allow', 'frozen': True}

Configuration for the model, should be a dictionary conforming to pydantic.config.ConfigDict.

replace(**kwargs: Any) Self

Returns a modified copy with updated values.

Parameters:

kwargs – the fields to replace

Returns:

the modified copy
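For example (instances are frozen, so replace() is the way to derive a modified copy):

```python
import numpy as np

from lir.data.models import FeatureData

data = FeatureData(features=np.zeros((3, 2)))

# The original object is frozen; replace() returns a copy with the
# updated field.
labeled = data.replace(labels=np.array([0, 1, 1]))
```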

replace_as(datatype: type[InstanceDataType], **kwargs: Any) InstanceDataType

Returns a modified copy with updated data type and values.

Parameters:
  • datatype – the return type

  • kwargs – the fields to replace

Returns:

the modified copy

property require_labels: ndarray

Return the labels, guaranteeing that they are not None (or raise an error).

source_ids: ndarray | None

class lir.data.models.LLRData(*, labels: Annotated[ndarray | None, AfterValidator(func=_validate_labels)] = None, source_ids: ndarray | None = None, features: Annotated[ndarray, AfterValidator(func=_validate_features)], llr_upper_bound: float | None = None, llr_lower_bound: float | None = None, **extra_data: Any)

Bases: FeatureData

Representation of calculated LLR values.

- llrs – 1-dimensional numpy array of LLR values

- has_intervals – indicate whether the LLRs have intervals

- llr_intervals – numpy array of LLR values of dimensions (n, 2), or None if the LLRs have no intervals

- llr_upper_bound – upper bound applied to the LLRs, or None if no upper bound was applied

- llr_lower_bound – lower bound applied to the LLRs, or None if no lower bound was applied

check_features_are_llrs() Self

Validate the feature data.

check_misleading_finite() None

Check whether all values are either finite or not misleading.

property has_intervals: bool

indicate whether the LLRs have intervals

property llr_bounds: tuple[float | None, float | None]

a tuple (min_llr, max_llr)


property llr_intervals: ndarray | None

numpy array of LLR values of dimensions (n, 2), or None if the LLRs have no intervals

llr_lower_bound: float | None
llr_upper_bound: float | None
property llrs: ndarray

1-dimensional numpy array of LLR values


model_config = {'arbitrary_types_allowed': True, 'extra': 'allow', 'frozen': True}

Configuration for the model, should be a dictionary conforming to pydantic.config.ConfigDict.
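A construction sketch; storing plain LLRs as a single feature column is an assumption here (the actual layout is validated by check_features_are_llrs()):

```python
import numpy as np

from lir.data.models import LLRData

llr_data = LLRData(
    features=np.array([[-1.2], [0.3], [2.5]]),  # assumed layout: one LLR per row
    labels=np.array([0, 1, 1]),
    llr_lower_bound=-10.0,
    llr_upper_bound=10.0,
)

print(llr_data.llrs)        # 1-dimensional array of LLR values
print(llr_data.llr_bounds)  # (-10.0, 10.0)
```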

class lir.data.models.PairedFeatureData(*, labels: Annotated[ndarray | None, AfterValidator(func=_validate_labels)] = None, source_ids: ndarray | None = None, features: Annotated[ndarray, AfterValidator(func=_validate_features)], n_trace_instances: int, n_ref_instances: int, **extra_data: Any)

Bases: FeatureData

Data class for instance pair data.

- n_trace_instances – the number of trace instances in each pair

- n_ref_instances – the number of reference instances in each pair

- features – the features of all instances in the pair, with pairs along the first dimension, and instances along the second

- source_ids – the source ids of the trace and reference instances of each pair, a 2-dimensional array with two columns

- features_trace – the features of the trace instances

- features_ref – the features of the reference instances

- source_ids_trace – the source ids of the trace instances

- source_ids_ref – the source ids of the reference instances

check_features_dimensions() Self

Validate feature dimensions.

check_sourceid_shape() Self

Overrides the InstanceData implementation.

property features_ref: ndarray

Get the features of the reference instances.

property features_trace: ndarray

Get the features of the trace instances.

model_config = {'arbitrary_types_allowed': True, 'extra': 'allow', 'frozen': True}

Configuration for the model, should be a dictionary conforming to pydantic.config.ConfigDict.

n_ref_instances: int
n_trace_instances: int
property source_ids_ref: ndarray | None

Get the source ids of the reference instances.

property source_ids_trace: ndarray | None

Get the source ids of the trace instances.
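A construction sketch; the 3-dimensional feature layout (pairs, instances, feature values) is inferred from the attribute descriptions above and validated by check_features_dimensions():

```python
import numpy as np

from lir.data.models import PairedFeatureData

pairs = PairedFeatureData(
    # Assumed layout: 5 pairs, 2 instances per pair (1 trace + 1 reference),
    # 3 feature values per instance.
    features=np.random.default_rng(0).normal(size=(5, 2, 3)),
    # Two columns: trace source id and reference source id per pair.
    source_ids=np.array([[0, 1], [0, 2], [1, 2], [1, 3], [2, 3]]),
    n_trace_instances=1,
    n_ref_instances=1,
)

print(pairs.features_trace.shape)  # features of the trace instances
print(pairs.features_ref.shape)    # features of the reference instances
```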

lir.data.models.concatenate_instances(first: InstanceDataType, *others: InstanceDataType) InstanceDataType

Concatenate the instances of the InstanceData objects.

Alias for first.concatenate(*others).

Module contents