lir.data.models module

class lir.data.models.DataProvider[source]

Bases: ABC

Base class for data providers.

Each data provider should provide access to instance data by implementing the get_instances() method.

abstractmethod get_instances() → InstanceData[source]

Return an InstanceData object, containing data for a set of instances.

Returns:: Instance data for the set of instances served by this provider.
Return type:: InstanceData
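
The provider contract above can be sketched as follows. This is an illustration, not lir's own code: `DataProviderSketch` and `InMemoryProvider` are hypothetical stand-ins, and a plain dict takes the place of an InstanceData object so that the snippet runs without lir installed.

```python
from abc import ABC, abstractmethod


class DataProviderSketch(ABC):
    """Hypothetical stand-in for lir.data.models.DataProvider."""

    @abstractmethod
    def get_instances(self) -> dict:
        """Return the data for a set of instances."""


class InMemoryProvider(DataProviderSketch):
    """Hypothetical provider that serves data already held in memory."""

    def __init__(self, features: list, labels: list):
        self._features = features
        self._labels = labels

    def get_instances(self) -> dict:
        # A dict stands in for an InstanceData object here.
        return {"features": self._features, "labels": self._labels}


provider = InMemoryProvider([[0.1], [0.9]], [0, 1])
instances = provider.get_instances()
```

Because `get_instances()` is abstract, the base class itself cannot be instantiated; only subclasses that implement the method can.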

class lir.data.models.DataStrategy[source]

Bases: ABC

Base class for data (splitting) strategies.

abstractmethod apply(instances: DataType) → Iterable[tuple[DataType, DataType]][source]

Provide iterator to access training and test set.

Returns an iterator over tuples of a training set and a test set. Both the training set and the test set are represented by an InstanceData object.

Parameters:: instances (DataType) – Instances to split into training and test sets.
Returns:: Iterable of (train_set, test_set) splits for the provided data.
Return type:: Iterable[tuple[DataType, DataType]]
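
As a rough illustration of a splitting strategy, the sketch below yields (train_set, test_set) tuples from a plain list. `DataStrategySketch` and `KFoldSketch` are hypothetical names, and lists stand in for InstanceData objects; the fold assignment shown is one possible choice, not necessarily what any lir strategy does.

```python
from abc import ABC, abstractmethod
from collections.abc import Iterable, Iterator


class DataStrategySketch(ABC):
    """Hypothetical stand-in for lir.data.models.DataStrategy."""

    @abstractmethod
    def apply(self, instances: list) -> Iterable:
        """Yield (train_set, test_set) tuples."""


class KFoldSketch(DataStrategySketch):
    """Hypothetical k-fold split over a plain list of instances."""

    def __init__(self, n_folds: int):
        self.n_folds = n_folds

    def apply(self, instances: list) -> Iterator:
        for fold in range(self.n_folds):
            # Every n_folds-th instance, starting at `fold`, goes to the test set.
            test = instances[fold :: self.n_folds]
            train = [x for i, x in enumerate(instances) if i % self.n_folds != fold]
            yield train, test


splits = list(KFoldSketch(n_folds=2).apply([10, 11, 12, 13]))
```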

class lir.data.models.FeatureData(*, labels: Annotated[ndarray | None, AfterValidator(func=_validate_labels)] = None, source_ids: Annotated[ndarray | None, AfterValidator(func=_validate_source_ids)] = None, features: Annotated[ndarray, AfterValidator(func=_validate_features)], **extra_data: Any)[source]

Bases: InstanceData

Data class for feature data.

Feature data can be any type of numeric data that is associated with the instances, such as measurements on a single instance or similarity scores between a pair of instances.

If the object describes single instance data, the features attribute is generally 2-dimensional, with one row per instance and one or more feature columns.

More than 2 dimensions may be used for paired data, see PairedFeatureData.

- features

Type:: an array of instance features, with one row per instance
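
The layout described above for single-instance data can be pictured with plain NumPy arrays. The values are made up, and the final comparison only illustrates the kind of shape consistency the class is documented to validate.

```python
import numpy as np

# Three single instances with two feature columns each: shape (3, 2).
features = np.array([[0.2, 1.5], [0.7, 0.3], [0.9, 2.1]])
labels = np.array([0, 1, 1])  # one hypothesis label per instance

# The number of rows in `features` should match the number of labels.
shapes_match = features.shape[0] == labels.shape[0]
```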

check_features() → Self[source]

Validate the features.

Returns:: This feature-data object after numeric type validation.
Return type:: Self

check_matching_shapes() → Self[source]

Validate that the shapes of the features and the labels match.

Returns:: This feature-data object after shape consistency checks.
Return type:: Self

features: Annotated[ndarray, AfterValidator(func=_validate_features)]

model_config = {'arbitrary_types_allowed': True, 'extra': 'allow', 'frozen': True}: Configuration for the model, should be a dictionary conforming to ConfigDict (pydantic.config.ConfigDict).

class lir.data.models.InstanceData(*, labels: Annotated[ndarray | None, AfterValidator(func=_validate_labels)] = None, source_ids: Annotated[ndarray | None, AfterValidator(func=_validate_source_ids)] = None, **extra_data: Any)[source]

Bases: BaseModel, ABC

Base class for data on instances.

An InstanceData object may be labeled or unlabeled with ground-truth data. If it is labeled, the label values correspond to the hypotheses and are either 0 or 1. In the literature, the values 1 and 0 go by different names, respectively:

hypothesis 1 and hypothesis 2 (or H1 and H2)
prosecutor’s hypothesis and defense hypothesis (or Hp and Hd)
same-source and different-source (or Hss and Hds)

The instances may optionally be associated with sources by means of the source_ids attribute. If available, each instance will generally have one source id if the object holds single instances, or two source ids if the object holds pairs of instances.

This class imposes no restrictions on the actual instance data. Subclass implementations specialize in particular data types.

- `labels`

Type:: The hypothesis labels of the instances, as a 1-dimensional array with one value per instance; each value is either 0 or 1.

- `source_ids`

Type:: The ids of all sources that contributed to the instances. Each instance is from a single source, except if it is a pair, in which case it has two sources. The source ids are either a 1-dimensional array or a 2-dimensional array with two columns.
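
The two source-id layouts can be illustrated with small NumPy arrays. The ids and labels below are invented for the example, and the last line is only an illustrative derivation of a same-source indicator, not something the class is documented to compute.

```python
import numpy as np

# 1 = H1 / Hp / same-source, 0 = H2 / Hd / different-source
labels = np.array([1, 0, 1])

# Single instances: one source id per instance, 1-dimensional.
single_source_ids = np.array(["a", "b", "c"])

# Pairs of instances: two source ids per pair, 2-dimensional with two columns.
pair_source_ids = np.array([["a", "a"], ["a", "b"], ["c", "c"]])

# Illustrative check only: a pair is same-source when both ids are equal.
same_source = pair_source_ids[:, 0] == pair_source_ids[:, 1]
```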

property all_fields: list[str]

Return all available field names for this data object.

Returns:: Names of all standard and extra fields available on the instance.
Return type:: list[str]

apply(fn: Callable, *args: Any, **kwargs: Any) → Self[source]

Apply a custom function to this InstanceData object.

The function fn is applied to all Numpy fields. Other fields are copied as-is.

Parameters:

fn (Callable) – Function applied to each Numpy field.
*args (Any) – Additional positional arguments forwarded to the underlying call.
**kwargs (Any) – Additional keyword arguments forwarded to the underlying call.

Returns:

New instance data object after applying the function to numpy fields.

Return type:

Self
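
The documented behaviour (fn applied to Numpy fields, other fields copied as-is) can be approximated on a plain dict of fields. `apply_to_numpy_fields` is a hypothetical helper written for this illustration; it does not reproduce how lir stores or rebuilds its objects.

```python
import numpy as np


def apply_to_numpy_fields(fields: dict, fn, *args, **kwargs) -> dict:
    """Hypothetical helper: apply fn to ndarray values, copy the rest."""
    return {
        name: fn(value, *args, **kwargs) if isinstance(value, np.ndarray) else value
        for name, value in fields.items()
    }


fields = {
    "features": np.array([[1.0], [2.0], [3.0]]),
    "labels": np.array([0, 1, 1]),
    "description": "toy data",  # non-NumPy field, copied as-is
}

# Keep only the first two instances in every NumPy field.
subset = apply_to_numpy_fields(fields, lambda a: a[:2])
```

Applying the same slicing function to every array keeps features and labels aligned, which is the main reason to route such operations through one call.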

check_both_labels() → ndarray[source]

Return labels or raise an error if they are missing or if they do not represent both hypotheses.

Raises:: ValueError – If hypothesis labels are missing or either label is not represented.
Returns:: Label array containing both classes 0 and 1.
Return type:: np.ndarray
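
A minimal sketch of this check, assuming labels are stored as a NumPy array or None. The function name `check_both_labels_sketch` and the error messages are hypothetical; only the documented outcome (return the labels, or raise ValueError) is taken from the description above.

```python
import numpy as np


def check_both_labels_sketch(labels):
    """Hypothetical re-implementation of the documented behaviour."""
    if labels is None:
        raise ValueError("labels are missing")
    if not (np.any(labels == 0) and np.any(labels == 1)):
        raise ValueError("both labels 0 and 1 must be present")
    return labels


checked = check_both_labels_sketch(np.array([0, 1, 1]))
```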

check_sourceids_labels_match() → Self[source]

Validate that the source_ids and labels have matching shapes.

Returns:: This instance data object after post-init validation.
Return type:: Self

combine(others: list[InstanceData] | InstanceData, fn: Callable, *args: Any, **kwargs: Any) → Self[source]

Apply a custom combination function to InstanceData objects.

All objects must have the same types and fields, and the same values for all non-numpy array fields, or an error is raised. Numpy fields are concatenated using fn. Other fields are copied as-is.

Parameters:

others (list[InstanceData] | InstanceData) – The other object or objects to combine with this one.
fn (Callable) – Function used to combine the Numpy fields.
*args (Any) – Additional positional arguments forwarded to the underlying call.
**kwargs (Any) – Additional keyword arguments forwarded to the underlying call.

Returns:

New instance data object after applying the combination function.

Return type:

Self

concatenate(*others: InstanceData) → Self[source]

Concatenate instances from InstanceData objects.

All concatenated objects must have the same types and fields. How fields are concatenated may depend on the subclass. By default, they must have the same values for all non-numpy array fields, or an error is raised. Numpy fields are concatenated using np.concatenate. Other fields are copied as-is.

Returns a new object with the concatenated instances.

Parameters:: *others (InstanceData) – Objects whose instances are concatenated to those of this object.
Returns:: New instance data object with concatenated rows.
Return type:: Self
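
The default rules described above can be sketched on plain dicts of fields. `concatenate_fields` is a hypothetical helper, not lir code; it joins array fields with np.concatenate and insists that all other fields are equal before copying them.

```python
import numpy as np


def concatenate_fields(first: dict, *others: dict) -> dict:
    """Hypothetical helper mirroring the documented default behaviour."""
    result = {}
    for name, value in first.items():
        if isinstance(value, np.ndarray):
            # NumPy fields are joined along the first (instance) axis.
            result[name] = np.concatenate([value] + [o[name] for o in others])
        else:
            # Other fields must agree across all objects and are copied as-is.
            if any(o[name] != value for o in others):
                raise ValueError(f"field {name!r} differs between objects")
            result[name] = value
    return result


a = {"features": np.zeros((2, 1)), "labels": np.array([0, 0]), "unit": "mm"}
b = {"features": np.ones((3, 1)), "labels": np.array([1, 1, 1]), "unit": "mm"}
merged = concatenate_fields(a, b)
```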

property has_labels: bool

Indicate whether label values are available.

Returns:: True when label information is present.
Return type:: bool

has_same_type(other: Any) → bool[source]

Compare this instance data to another object.

Returns True iff:

other has the same class;
other has the same fields;
all fields have the same type.

Parameters:: other (Any) – Object to compare with.
Returns:: True when type, fields, and field value types all match.
Return type:: bool
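
The three conditions can be written down as a short check. This is a sketch under assumptions: `has_same_type_sketch` is a hypothetical function, `SimpleNamespace` objects stand in for InstanceData, and exact type identity is used for the comparison, which may differ from what lir does internally.

```python
from types import SimpleNamespace


def has_same_type_sketch(a: object, b: object) -> bool:
    """Hypothetical check mirroring the three documented conditions."""
    if type(a) is not type(b):
        return False  # different class
    fields_a, fields_b = vars(a), vars(b)
    if fields_a.keys() != fields_b.keys():
        return False  # different fields
    # Every field must hold a value of the same type in both objects.
    return all(type(fields_a[k]) is type(fields_b[k]) for k in fields_a)


x = SimpleNamespace(labels=[0, 1], note="a")
y = SimpleNamespace(labels=[1], note="b")
z = SimpleNamespace(labels=None, note="b")
```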

labels: Annotated[ndarray | None, AfterValidator(func=_validate_labels)]

model_config = {'arbitrary_types_allowed': True, 'extra': 'allow', 'frozen': True}: Configuration for the model, should be a dictionary conforming to ConfigDict (pydantic.config.ConfigDict).

replace(**kwargs: Any) → Self[source]

Return a modified copy with updated values.

Parameters:: **kwargs (Any) – Field values to replace, given as keyword arguments.
Returns:: Copy of this object with the provided fields replaced.
Return type:: Self
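
Since the model configuration marks these objects as frozen, updates go through a modified copy rather than in-place assignment. The sketch below shows the same pattern with a frozen standard-library dataclass; `FrozenRecord` is a hypothetical stand-in, and `dataclasses.replace` is used here as an analogy for this method, not as its implementation.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class FrozenRecord:
    """Hypothetical stand-in for a frozen data object."""

    labels: tuple
    description: str


original = FrozenRecord(labels=(0, 1), description="raw")

# A new object is created; the original is left untouched.
updated = replace(original, description="cleaned")

# Assigning to a field of a frozen object is rejected.
in_place_update_rejected = False
try:
    original.description = "changed"
except FrozenInstanceError:
    in_place_update_rejected = True
```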

replace_as(datatype: type[InstanceDataType], **kwargs: Any) → InstanceDataType[source]

Return a modified copy with updated data type and values.

Parameters:

datatype (type[InstanceDataType]) – Data type of the returned object.
**kwargs (Any) – Field values to replace, given as keyword arguments.

Returns:

Copy of this object as an instance of datatype, with the provided fields replaced.

Return type:

InstanceDataType

property require_labels: ndarray

Return labels and guarantee that it is not None (or raise an error).

Returns:: Label array, guaranteed not to be None.
Return type:: np.ndarray

source_ids: Annotated[ndarray | None, AfterValidator(func=_validate_source_ids)]

property source_ids_1d: ndarray

Return source identifiers as a one-dimensional array.

Returns:: One-dimensional source-id array with one source per instance.
Return type:: np.ndarray

class lir.data.models.LLRData(*, labels: Annotated[ndarray | None, AfterValidator(func=_validate_labels)] = None, source_ids: Annotated[ndarray | None, AfterValidator(func=_validate_source_ids)] = None, features: Annotated[ndarray, AfterValidator(func=_validate_features)], llr_upper_bound: float | None = None, llr_lower_bound: float | None = None, **extra_data: Any)[source]

Bases: FeatureData

Representation of calculated LLR values.

An object of LLRData adds a specific interpretation to the features attribute.

If the features attribute has a single column (i.e. dimensions (n, 1)), the values are LLRs.
If the features attribute has three columns (i.e. dimensions (n, 3)), the values are LLRs and their confidence intervals.

The values are also accessible by the attributes llrs and llr_intervals.

- llrs

Type:: 1-dimensional numpy array of LLR values

- has_intervals

Type:: indicate whether the LLRs have intervals

- llr_intervals

Type:: numpy array of LLR values of dimensions (n, 2), or None if the LLRs have no intervals

- llr_upper_bound

Type:: upper bound applied to the LLRs, or None if no upper bound was applied

- llr_lower_bound

Type:: lower bound applied to the LLRs, or None if no lower bound was applied
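
The relation between the features array and the derived attributes can be pictured with NumPy slicing. The column order used here (LLR first, then lower and upper interval bounds) is an assumption made for the illustration; the text above only states that a three-column array holds LLRs and their intervals.

```python
import numpy as np

# Assumed column order: LLR, lower interval bound, upper interval bound.
features = np.array([[1.2, 0.8, 1.6], [-0.5, -0.9, -0.1]])

llrs = features[:, 0]  # 1-dimensional, shape (n,)
has_intervals = features.shape[1] == 3
llr_intervals = features[:, 1:3] if has_intervals else None  # shape (n, 2)
```

With a single-column array of shape (n, 1), the same slicing would give `has_intervals` equal to False and `llr_intervals` equal to None.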

check_features_are_llrs() → Self[source]

Validate the feature data.

Returns:: This LLR object after validating LLR-specific feature constraints.
Return type:: Self

check_misleading_finite() → None[source]: Check whether all values are either finite or not misleading.

feature_for_plot(source_key: str) → ndarray | None[source]

Return the feature values for a given source key, or None if not available.

The feature values must have been saved during LR system execution by using the save_features_after_step configuration option. If the feature values for the given source key are not available, this method returns None. Use require_feature_for_plots instead if an error should be raised when the feature values are not available.

Parameters:: source_key (str) – Key identifying the source of the feature values to be returned.
Returns:: Feature values for the specified source key, or None if not available.
Return type:: np.ndarray | None

property has_intervals: bool

Indicate whether interval bounds are present for each LLR.

Returns:: True when lower and upper interval bounds are included.
Return type:: bool

property llr_bounds: tuple[float | None, float | None]

Return global lower and upper bounds applied to LLR values.

Returns:: Tuple containing global lower and upper LLR clipping bounds.
Return type:: tuple[float | None, float | None]

property llr_intervals: ndarray | None

Return interval bounds for each LLR when available.

Returns:: Two-column array with lower and upper LLR bounds, if available.
Return type:: np.ndarray | None

llr_lower_bound: float | None

llr_upper_bound: float | None

property llrs: ndarray

Return the core LLR values.

Returns:: One-dimensional array containing the central LLR values.
Return type:: np.ndarray

model_config = {'arbitrary_types_allowed': True, 'extra': 'allow', 'frozen': True}: Configuration for the model, should be a dictionary conforming to ConfigDict (pydantic.config.ConfigDict).

require_feature_for_plots(source_key: str) → ndarray[source]

Return the feature values for a given source key, raising an error if not available.

If the feature values for the given source key are not available, this method raises a ValueError with an informative error message. Use the feature_for_plot method if you want to return None instead of raising an error when the feature values for the given source key are not available.

Parameters:: source_key (str) – Key identifying the source of the feature values to be returned.
Returns:: Feature values for the specified source key.
Return type:: np.ndarray
Raises:: ValueError – If the feature values for the given source key are not available.
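
The difference between feature_for_plot and require_feature_for_plots is the usual "return None" versus "raise" pairing. The sketch below shows that pairing with a plain dict; the `saved_features` store, the function names with the `_sketch` suffix, and the error message are all hypothetical.

```python
from typing import Optional

# Hypothetical store of feature values saved per source key.
saved_features = {"step_a": [0.1, 0.4]}


def feature_for_plot_sketch(source_key: str) -> Optional[list]:
    """Return the saved values, or None when the key is unknown."""
    return saved_features.get(source_key)


def require_feature_for_plots_sketch(source_key: str) -> list:
    """Return the saved values, or raise ValueError when the key is unknown."""
    values = feature_for_plot_sketch(source_key)
    if values is None:
        raise ValueError(f"no features saved for source key {source_key!r}")
    return values
```

The returning variant suits optional plots that can be skipped; the raising variant suits code paths that cannot continue without the data.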

class lir.data.models.PairedFeatureData(*, labels: Annotated[ndarray | None, AfterValidator(func=_validate_labels)] = None, source_ids: Annotated[ndarray | None, AfterValidator(func=_validate_source_ids)] = None, features: Annotated[ndarray, AfterValidator(func=_validate_features)], n_trace_instances: int, n_ref_instances: int, **extra_data: Any)[source]

Bases: FeatureData

Data class for instance pair data.

Each item in this data set represents instances from the “trace” source and from the “reference” source. The number of instances from either source must be at least one.

The features attribute has at least 3 dimensions:

the pairs are along the first dimension;
the instances are along the second dimension (e.g. in a comparison of 1 trace instance and 1 reference instance, the length of this dimension is 2);
the features are along the third dimension onward.

The source_ids, if available, must have two values for each item, i.e. 2 columns.
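
The dimension layout can be illustrated with a zero-filled array. Two details are assumptions made for this sketch and are not stated above: that trace instances precede reference instances along the second dimension, and that the first source-id column belongs to the trace.

```python
import numpy as np

n_trace_instances, n_ref_instances = 1, 2
n_pairs, n_features = 4, 5

# Shape: (pairs, instances per pair, features).
features = np.zeros((n_pairs, n_trace_instances + n_ref_instances, n_features))

# Assumption: trace instances come first along the second dimension.
features_trace = features[:, :n_trace_instances]
features_ref = features[:, n_trace_instances:]

# Two source ids per pair; assumed order: trace column, then reference column.
source_ids = np.array([[1, 1], [1, 2], [2, 2], [3, 1]])
source_ids_trace, source_ids_ref = source_ids[:, 0], source_ids[:, 1]
```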

- n_trace_instances

Type:: the number of trace instances in each pair

- n_ref_instances

Type:: the number of reference instances in each pair

- features

Type:: the features of all instances in the pair, with pairs along the first dimension, and instances along the second

- source_ids

Type:: the source ids of the trace and reference instances of each pair, a 2-dimensional array with two columns

- features_trace

Type:: the features of the trace instances

- features_ref

Type:: the features of the reference instances

- source_ids_trace

Type:: the source ids of the trace instances

- source_ids_ref

Type:: the source ids of the reference instances

check_features_dimensions() → Self[source]

Validate feature dimensions.

Returns:: This paired-feature object after feature-dimension validation.
Return type:: Self

check_sourceid_shape() → Self[source]

Override the InstanceData implementation.

Returns:: This paired-feature object after source-id shape validation.
Return type:: Self

property features_ref: ndarray

Get the features of the reference instances.

Returns:: Feature tensor slice containing reference-instance features.
Return type:: np.ndarray

property features_trace: ndarray

Get the features of the trace instances.

Returns:: Feature tensor slice containing trace-instance features.
Return type:: np.ndarray

model_config = {'arbitrary_types_allowed': True, 'extra': 'allow', 'frozen': True}: Configuration for the model, should be a dictionary conforming to ConfigDict (pydantic.config.ConfigDict).

n_ref_instances: int

n_trace_instances: int

property source_ids_ref: ndarray | None

Get the source ids of the reference instances.

Returns:: Reference source IDs when available, otherwise None.
Return type:: np.ndarray | None

property source_ids_trace: ndarray | None

Get the source ids of the trace instances.

Returns:: Trace source IDs when available, otherwise None.
Return type:: np.ndarray | None

lir.data.models.concatenate_instances(first: InstanceDataType, *others: InstanceDataType) → InstanceDataType[source]

Concatenate the results of the InstanceData objects.

Alias for first.concatenate(*others).

Parameters:

first (InstanceDataType) – First object; the others are concatenated to it.
*others (InstanceDataType) – Further objects to concatenate to first.

Returns:

New instance data object with the concatenated instances.

Return type:

InstanceDataType

lir.data.models.get_instances_by_category(instances: InstanceDataType, category_field: str, category_shape: tuple[int] | None = None) → Iterator[tuple[ndarray, InstanceDataType]][source]

Return subsets of a set of instances by category.

The instances object must have a field by the name of category_field. That field is a numpy array with one row per instance. Its values are the categories of each instance. The field may have any shape, as long as the number of rows matches the number of instances.

If category_shape is provided, the shape of the category field is checked against this value.

The returned value is an iterator with each item being a tuple of the category and the subset of instances of that category.

Parameters:

instances (InstanceDataType) – Input instances to be processed by this method.
category_field (str) – Name of the field that holds the category of each instance.
category_shape (tuple[int] | None) – Expected shape of the category field; if provided, the shape of the field is checked against it.
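
The grouping described above can be sketched on a plain dict of NumPy fields. `by_category_sketch` is a hypothetical generator written for this illustration; it omits the category_shape check and makes no claim about the order in which lir yields the categories.

```python
import numpy as np


def by_category_sketch(fields: dict, category_field: str):
    """Hypothetical generator yielding (category, subset) per distinct category."""
    categories = fields[category_field]
    for category in np.unique(categories, axis=0):
        # Compare every row with the category; reduce over any trailing axes.
        mask = categories == category
        mask = mask.reshape(len(categories), -1).all(axis=1)
        subset = {
            name: value[mask] if isinstance(value, np.ndarray) else value
            for name, value in fields.items()
        }
        yield category, subset


fields = {
    "features": np.array([[1.0], [2.0], [3.0], [4.0]]),
    "device": np.array([0, 1, 0, 1]),  # one category value per instance
}
groups = [
    (int(category), subset["features"].ravel().tolist())
    for category, subset in by_category_sketch(fields, "device")
]
```

Each yielded subset keeps all array fields aligned, because the same row mask is applied to every one of them.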