lir.data package

Submodules

lir.data.data_strategies module

class lir.data.data_strategies.BinaryCrossValidation(folds: int, seed: int | None = None)

Bases: DataStrategy

Representation of a K-fold cross validation iterator over train/test splits.

The input data should have class labels. This split assigns instances of both classes to each “fold” subset.

This method might be referenced in the YAML registry as follows:

```
data_strategies:
  binary_cross_validation: lir.data.data_strategies.BinaryCrossValidation
```

In the benchmark configuration YAML, this validation can be referenced as follows:

```
splits:
  strategy: binary_cross_validation
```

apply(instances: FeatureData) Iterable[tuple[FeatureData, FeatureData]]

Return an iterable over the resulting train/test split(s).
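For illustration, a minimal sketch of using this strategy directly; the toy features and labels are made up for the example:

```python
import numpy as np

from lir.data.data_strategies import BinaryCrossValidation
from lir.data.models import FeatureData

# Toy data: 20 instances with 3 features each, alternating class labels.
data = FeatureData(
    features=np.random.default_rng(0).normal(size=(20, 3)),
    labels=np.array([0, 1] * 10),
)

strategy = BinaryCrossValidation(folds=5, seed=42)
for train, test in strategy.apply(data):
    print(train.features.shape, test.features.shape)
```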

class lir.data.data_strategies.BinaryTrainTestSplit(test_size: float | int, seed: int | None = None)

Bases: DataStrategy

Representation of a train/test split.

The input data should have hypothesis labels. This split assigns instances of both classes to the training set and the test set.

apply(instances: FeatureData) Iterable[tuple[FeatureData, FeatureData]]

Return an iterable over the resulting train/test split(s).

class lir.data.data_strategies.MulticlassCrossValidation(folds: int)

Bases: DataStrategy

Representation of a K-fold cross validation iterator over train/test splits.

The input data should have source_ids. This split assigns all instances of a source to the same “fold” subset.

This method might be referenced in the YAML registry as follows:

```
data_strategies:
  multiclass_cross_validation: lir.data.data_strategies.MulticlassCrossValidation
```

In the benchmark configuration YAML, this validation can be referenced as follows:

```
data:
  […]
  splits:
    strategy: multiclass_cross_validation
    folds: 5
```

apply(instances: FeatureData) Iterable[tuple[FeatureData, FeatureData]]

Return an iterable over the resulting train/test split(s).

class lir.data.data_strategies.MulticlassTrainTestSplit(test_size: float | int, seed: int | None = None)

Bases: DataStrategy

Representation of a multi-class train/test split.

The input data should have source_ids. This split assigns all instances of a source to either the training set or the test set.

apply(instances: FeatureData) Iterable[tuple[FeatureData, FeatureData]]

Return an iterable over the resulting train/test split(s).
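A sketch of a source-aware split; the source_ids layout (a 2-dimensional array with one column) follows the InstanceData description below:

```python
import numpy as np

from lir.data.data_strategies import MulticlassTrainTestSplit
from lir.data.models import FeatureData

# Four sources with three instances each; source_ids is a 2-dimensional
# array with one column, as described for InstanceData.
data = FeatureData(
    features=np.random.default_rng(0).normal(size=(12, 2)),
    source_ids=np.repeat(np.arange(4), 3).reshape(-1, 1),
)

strategy = MulticlassTrainTestSplit(test_size=0.25, seed=1)
for train, test in strategy.apply(data):
    # All instances of a source end up on the same side of the split.
    print(sorted(set(train.source_ids[:, 0])), sorted(set(test.source_ids[:, 0])))
```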

class lir.data.data_strategies.PredefinedTrainTestSplit

Bases: DataStrategy

Splits data into a training set and a test set, according to pre-existing assignments in the data.

Presumes a role_assignments field in the data, which has the value “train” for instances that will be part of the training set, and “test” for instances in the test set.

In the benchmark configuration YAML, this validation can be referenced as follows:

```
cross_validation_splits:
  strategy: predefined_train_test_split
  data_origin: ${data}
```

apply(instances: FeatureData) Iterable[tuple[FeatureData, FeatureData]]

Split the FeatureData into a training split and a test split.
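A sketch of a predefined split; it assumes the role_assignments field can be supplied as an extra numpy array field when constructing the data:

```python
import numpy as np

from lir.data.data_strategies import PredefinedTrainTestSplit
from lir.data.models import FeatureData

# role_assignments is passed as an extra field; the "train"/"test"
# values follow the convention described above.
data = FeatureData(
    features=np.arange(8, dtype=float).reshape(4, 2),
    role_assignments=np.array(["train", "train", "test", "test"]),
)

for train, test in PredefinedTrainTestSplit().apply(data):
    print(len(train.features), "train instances;", len(test.features), "test instances")
```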

class lir.data.data_strategies.RoleAssignment(*values)

Bases: Enum

Indicate whether the data is part of the train or the test split.

TEST = 'test'
TRAIN = 'train'

lir.data.io module

class lir.data.io.DataFileBuilderCsv(path: Path, write_mode: str = 'w', write_header: bool | None = None)

Bases: object

This class adds convenience methods to write data to an output CSV file.

add_column(data: ndarray, header_prefix: str = '', dimension_headers: dict[int, list[str]] | None = None) None

Append data and corresponding headers to self._all_data and self._all_headers.

The data argument is an arbitrary numpy array. Its first dimension corresponds to the rows; every other dimension becomes columns in the CSV output.

Parameters:
  • data – the data to append, an arbitrary numpy array

  • header_prefix – the prefix for all headers

  • dimension_headers – a mapping from dimensions to their headers; the dimensions correspond to the dimensions of the data. Because dimension 0 corresponds to rows, it should have no headers

write() None

Write data to CSV file.
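A minimal usage sketch; how header_prefix and dimension_headers combine into the final header names is an assumption here:

```python
from pathlib import Path

import numpy as np

from lir.data.io import DataFileBuilderCsv

builder = DataFileBuilderCsv(Path("output.csv"))

# A (3,) array becomes a single CSV column.
builder.add_column(np.array([0.1, 0.7, 0.4]), header_prefix="score")

# A (3, 2) array becomes two columns; dimension 1 gets explicit headers.
builder.add_column(
    np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]),
    header_prefix="llr_",
    dimension_headers={1: ["lower", "upper"]},
)

builder.write()
```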

class lir.data.io.RemoteResource(url: str, local_directory: Path)

Bases: object

Provide a method to open files from a remote source.

This can be handy if a resource is located on, e.g., a GitHub repository.

open(filename: str, mode: str = 'r') IO[Any]

Return an open file stream for a remote resource.
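A usage sketch; the URL and filename are placeholders, and caching under local_directory is presumed rather than documented:

```python
from pathlib import Path

from lir.data.io import RemoteResource

resource = RemoteResource(
    url="https://raw.githubusercontent.com/example/repo/main/data",
    local_directory=Path("cache"),
)

# Open the remote file as a regular text stream.
with resource.open("measurements.csv") as f:
    header = f.readline()
```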

lir.data.io.search_path(path: Path) Path

Search the Python path for a file.

If path is absolute, it is normalized by Path.resolve() and returned.

If path is relative, the file is searched in sys.path. The path is interpreted as relative to sys.path elements one by one, and if it exists, it is normalized by Path.resolve() and returned.

If the file is not found, it is normalized and made absolute by Path.resolve() and returned.
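For example (the relative path is illustrative):

```python
from pathlib import Path

from lir.data.io import search_path

# A relative path is resolved against sys.path entries one by one;
# an absolute path is simply normalized by Path.resolve().
resolved = search_path(Path("data/measurements.csv"))
print(resolved)
```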

lir.data.models module

class lir.data.models.DataProvider

Bases: ABC

Base class for data providers.

Each data provider should provide access to instance data by implementing the get_instances() method.

abstractmethod get_instances() FeatureData

Return a FeatureData object containing data for a set of instances.
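A sketch of a hypothetical in-memory provider implementing this interface:

```python
import numpy as np

from lir.data.models import DataProvider, FeatureData

class InMemoryProvider(DataProvider):
    """Hypothetical provider serving a fixed set of instances."""

    def __init__(self, features: np.ndarray, labels: np.ndarray):
        self._features = features
        self._labels = labels

    def get_instances(self) -> FeatureData:
        # Wrap the stored arrays in the documented data class.
        return FeatureData(features=self._features, labels=self._labels)

provider = InMemoryProvider(np.zeros((4, 2)), np.array([0, 0, 1, 1]))
instances = provider.get_instances()
```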

class lir.data.models.DataStrategy

Bases: ABC

Base class for data (splitting) strategies.

abstractmethod apply(instances: FeatureData) Iterable[tuple[FeatureData, FeatureData]]

Provide an iterator to access the training and test sets.

Returns an iterator over tuples of a training set and a test set. Both the training set and the test set are represented by a FeatureData object.

class lir.data.models.FeatureData(*, labels: Annotated[ndarray | None, AfterValidator(func=_validate_labels)] = None, source_ids: ndarray | None = None, features: Annotated[ndarray, AfterValidator(func=_validate_features)], **extra_data: Any)

Bases: InstanceData

Data class for feature data.

- features – an array of instance features, with one row per instance

check_features() Self

Validate the features.

check_matching_shapes() Self

Validate that the shapes of the features and the labels match.

features: Annotated[ndarray, AfterValidator(func=_validate_features)]
model_config = {'arbitrary_types_allowed': True, 'extra': 'allow', 'frozen': True}

Configuration for the model, should be a dictionary conforming to pydantic.config.ConfigDict.

class lir.data.models.InstanceData(*, labels: Annotated[ndarray | None, AfterValidator(func=_validate_labels)] = None, source_ids: ndarray | None = None, **extra_data: Any)

Bases: BaseModel, ABC

Base class for data on instances.

- labels – an array of labels, a 1-dimensional array with one value per instance

- source_ids – an array of source ids, a 2-dimensional array with one column and one value per instance

property all_fields: list[str]

a list of all fields, including both mandatory and extra fields


apply(fn: Callable, *args: Any, **kwargs: Any) Self

Apply a custom function to this InstanceData object.

The function fn is applied to all Numpy fields. Other fields are copied as-is.
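For example, selecting the first instance of every numpy field (a sketch):

```python
import numpy as np

from lir.data.models import FeatureData

data = FeatureData(
    features=np.array([[1.0, 2.0], [3.0, 4.0]]),
    labels=np.array([0, 1]),
)

# The lambda is applied to the numpy fields (`features` and `labels`);
# non-numpy fields would be copied as-is.
first_only = data.apply(lambda a: a[:1])
```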

check_sourceid_shape() Self

Validate the shape of the source_ids.

check_sourceids_labels_match() Self

Validate that the source_ids and labels have matching shapes.

combine(others: list[InstanceData] | InstanceData, fn: Callable, *args: Any, **kwargs: Any) Self

Apply a custom combination function to InstanceData objects.

All objects must have the same types and fields, and the same values for all non-numpy-array fields, or an error is raised. Numpy fields are combined using fn. Other fields are copied as-is.

concatenate(*others: InstanceData) Self

Concatenate instances from InstanceData objects.

All concatenated objects must have the same types and fields. How fields are concatenated may depend on the subclass. By default, they must have the same values for all non-numpy array fields, or an error is raised. Numpy fields are concatenated using np.concatenate. Other fields are copied as-is.

Returns a new object with the concatenated instances.
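For example:

```python
import numpy as np

from lir.data.models import FeatureData

a = FeatureData(features=np.zeros((2, 3)), labels=np.array([0, 1]))
b = FeatureData(features=np.ones((3, 3)), labels=np.array([1, 1, 0]))

# Numpy fields are concatenated along the instance dimension.
combined = a.concatenate(b)
print(combined.features.shape)  # (5, 3)
```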

property has_labels: bool

True iff the instances are labeled


has_same_type(other: Any) bool

Compare these instance data to another class.

Returns True iff:
- other has the same class
- other has the same fields
- all fields have the same type

labels: Annotated[ndarray | None, AfterValidator(func=_validate_labels)]
model_config = {'arbitrary_types_allowed': True, 'extra': 'allow', 'frozen': True}

Configuration for the model, should be a dictionary conforming to pydantic.config.ConfigDict.

replace(**kwargs: Any) Self

Returns a modified copy with updated values.

Parameters:

kwargs – the fields to replace

Returns:

the modified copy
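For example (instances are frozen, so replace() is the way to derive a modified copy):

```python
import numpy as np

from lir.data.models import FeatureData

data = FeatureData(features=np.zeros((3, 2)))

# The original object is frozen; replace() returns a copy with the
# updated field.
labeled = data.replace(labels=np.array([0, 1, 1]))
```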

replace_as(datatype: type[InstanceDataType], **kwargs: Any) InstanceDataType

Returns a modified copy with updated data type and values.

Parameters:
  • datatype – the return type

  • kwargs – the fields to replace

Returns:

the modified copy

property require_labels: ndarray

Return the labels, guaranteeing that they are not None (or raise an error).

source_ids: ndarray | None

class lir.data.models.LLRData(*, labels: Annotated[ndarray | None, AfterValidator(func=_validate_labels)] = None, source_ids: ndarray | None = None, features: Annotated[ndarray, AfterValidator(func=_validate_features)], llr_upper_bound: float | None = None, llr_lower_bound: float | None = None, **extra_data: Any)

Bases: FeatureData

Representation of calculated LLR values.

- llrs – 1-dimensional numpy array of LLR values

- has_intervals – indicate whether the LLRs have intervals

- llr_intervals – numpy array of LLR values of dimensions (n, 2), or None if the LLRs have no intervals

- llr_upper_bound – upper bound applied to the LLRs, or None if no upper bound was applied

- llr_lower_bound – lower bound applied to the LLRs, or None if no lower bound was applied

check_features_are_llrs() Self

Validate the feature data.

check_misleading_finite() None

Check whether all values are either finite or not misleading.

property has_intervals: bool

indicate whether the LLRs have intervals

property llr_bounds: tuple[float | None, float | None]

a tuple (min_llr, max_llr)


property llr_intervals: ndarray | None

numpy array of LLR values of dimensions (n, 2), or None if the LLRs have no intervals

llr_lower_bound: float | None
llr_upper_bound: float | None
property llrs: ndarray

1-dimensional numpy array of LLR values


model_config = {'arbitrary_types_allowed': True, 'extra': 'allow', 'frozen': True}

Configuration for the model, should be a dictionary conforming to pydantic.config.ConfigDict.
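A construction sketch; storing plain LLRs as a single feature column is an assumption here (the actual layout is validated by check_features_are_llrs()):

```python
import numpy as np

from lir.data.models import LLRData

llr_data = LLRData(
    features=np.array([[-1.2], [0.3], [2.5]]),  # assumed layout: one LLR per row
    labels=np.array([0, 1, 1]),
    llr_lower_bound=-10.0,
    llr_upper_bound=10.0,
)

print(llr_data.llrs)        # 1-dimensional array of LLR values
print(llr_data.llr_bounds)  # (-10.0, 10.0)
```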

class lir.data.models.PairedFeatureData(*, labels: Annotated[ndarray | None, AfterValidator(func=_validate_labels)] = None, source_ids: ndarray | None = None, features: Annotated[ndarray, AfterValidator(func=_validate_features)], n_trace_instances: int, n_ref_instances: int, **extra_data: Any)

Bases: FeatureData

Data class for instance pair data.

- n_trace_instances – the number of trace instances in each pair

- n_ref_instances – the number of reference instances in each pair

- features – the features of all instances in the pair, with pairs along the first dimension, and instances along the second

- source_ids – the source ids of the trace and reference instances of each pair, a 2-dimensional array with two columns

- features_trace – the features of the trace instances

- features_ref – the features of the reference instances

- source_ids_trace – the source ids of the trace instances

- source_ids_ref – the source ids of the reference instances

check_features_dimensions() Self

Validate feature dimensions.

check_sourceid_shape() Self

Overrides the InstanceData implementation.

property features_ref: ndarray

Get the features of the reference instances.

property features_trace: ndarray

Get the features of the trace instances.

model_config = {'arbitrary_types_allowed': True, 'extra': 'allow', 'frozen': True}

Configuration for the model, should be a dictionary conforming to pydantic.config.ConfigDict.

n_ref_instances: int
n_trace_instances: int
property source_ids_ref: ndarray | None

Get the source ids of the reference instances.

property source_ids_trace: ndarray | None

Get the source ids of the trace instances.
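A construction sketch; the 3-dimensional feature layout (pairs, instances, feature values) is inferred from the attribute descriptions above and validated by check_features_dimensions():

```python
import numpy as np

from lir.data.models import PairedFeatureData

pairs = PairedFeatureData(
    # Assumed layout: 5 pairs, 2 instances per pair (1 trace + 1 reference),
    # 3 feature values per instance.
    features=np.random.default_rng(0).normal(size=(5, 2, 3)),
    # Two columns: trace source id and reference source id per pair.
    source_ids=np.array([[0, 1], [0, 2], [1, 2], [1, 3], [2, 3]]),
    n_trace_instances=1,
    n_ref_instances=1,
)

print(pairs.features_trace.shape)  # features of the trace instances
print(pairs.features_ref.shape)    # features of the reference instances
```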

lir.data.models.concatenate_instances(first: InstanceDataType, *others: InstanceDataType) InstanceDataType

Concatenate the instances of the InstanceData objects.

Alias for first.concatenate(*others).

Module contents