lir.data_strategies package

class lir.data_strategies.AutoCrossValidation(folds: int, shuffle: bool | None = None, random_state: int | None = None)

Bases: DataStrategy

K-fold cross-validation iterator over successive train/test splits.

This splitter attempts to find a suitable splitting strategy for the input data. Candidate strategies are:

The data strategy is decided in apply(). Subsequent calls to apply() are not guaranteed to use the same strategy, although in realistic use cases they most likely will. If no suitable strategy is found, a ValueError is raised.

This strategy may be referenced in a YAML setup as follows:

splits:
  strategy: auto_cross_validation
  folds: 5  # the number k in k-fold cross-validation
  random_state: 42  # optional

Parameters:
  • folds (int) – Number of cross-validation folds to generate.

  • shuffle (bool | None) – Whether to shuffle the groups before splitting into batches. If None, the data will be shuffled if random_state is not None.

  • random_state (int | None) – Random seed controlling stochastic behavior for reproducible results.
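
The interaction between shuffle and random_state described above can be sketched as follows. This is a minimal illustration, not the library's own code; the helper name resolve_shuffle is hypothetical.

```python
from __future__ import annotations


def resolve_shuffle(shuffle: bool | None, random_state: int | None) -> bool:
    """Hypothetical helper: derive the effective shuffle flag."""
    if shuffle is None:
        # No explicit choice was made: shuffle only when a seed was supplied.
        return random_state is not None
    # An explicit True or False always wins over the seed.
    return shuffle


print(resolve_shuffle(None, None))  # False
print(resolve_shuffle(None, 42))    # True
print(resolve_shuffle(False, 42))   # False
```

Under this reading, passing only random_state is enough to get a shuffled but reproducible split.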

apply(instances: DataType) → Iterator[tuple[DataType, DataType]]

Return an iterator over k train/test splits.

Parameters:

instances (InstanceDataType) – Input instances to be processed by this method.

Returns:

An iterator over pairs of a training set and a test set.

Return type:

Iterator[tuple[DataType, DataType]]

class lir.data_strategies.AutoTrainTestSplit(test_size: float | int = 0.5, random_state: int | None = None)

Bases: DataStrategy

Split the data into a training set and a test set.

This splitter attempts to find a suitable splitting strategy for the input data. Candidate strategies are, in order of priority:

The data strategy is decided in apply(). Subsequent calls to apply() are not guaranteed to use the same strategy, although in realistic use cases they most likely will. If no suitable strategy is found, a ValueError is raised.

In an experiment setup file, the split strategy can be referenced as:

splits:
  strategy: auto_train_test
  test_size: 0.2  # the (hold-out) test set is 20% of the data
  random_state: 42  # optional

Parameters:
  • test_size (float | int) – Size of the test set. If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. The default value is 0.5.

  • random_state (int | None) – Random seed controlling stochastic behaviour for reproducible results.

apply(instances: DataType) → Iterator[tuple[DataType, DataType]]

Split the data into a training set and a test set.

Parameters:

instances (DataType) – Instances to split.

Returns:

An iterator over a single item, which is a tuple of the training set and the test set.

Return type:

Iterator[tuple[DataType, DataType]]

class lir.data_strategies.CrossValidation(folds: int, shuffle: bool | None = None, seed: int | None = None)

Bases: DataStrategy

K-fold cross-validation iterator over successive train/test splits.

The input data must contain hypothesis labels. If the data has source_ids but not hypothesis labels, use SourcesCrossValidation instead.

Each fold is constructed so that instances from both hypotheses are present in every split.
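
The idea of keeping both hypotheses present in every split can be sketched in plain Python. This is an illustration only, not the implementation used by CrossValidation: the function name stratified_folds, the use of index lists instead of instance objects, and the round-robin assignment are all assumptions.

```python
from __future__ import annotations

import random
from collections.abc import Iterator


def stratified_folds(
    labels: list[int], folds: int, seed: int | None = None
) -> Iterator[tuple[list[int], list[int]]]:
    """Yield (train_indices, test_indices) with every label in every split."""
    if folds < 2:
        raise ValueError("at least two folds are required")
    rng = random.Random(seed)
    fold_of = [0] * len(labels)
    for label in sorted(set(labels)):
        members = [i for i, y in enumerate(labels) if y == label]
        if len(members) < folds:
            raise ValueError(f"label {label} has fewer than {folds} instances")
        if seed is not None:
            rng.shuffle(members)
        # Deal the instances of this label round-robin over the folds,
        # so that each fold receives at least one instance per label.
        for position, index in enumerate(members):
            fold_of[index] = position % folds
    for k in range(folds):
        test = [i for i, f in enumerate(fold_of) if f == k]
        train = [i for i, f in enumerate(fold_of) if f != k]
        yield train, test


labels = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
for train, test in stratified_folds(labels, folds=2, seed=42):
    print(sorted(train), sorted(test))
```

Each instance lands in the test set of exactly one fold, and both label values occur in the training and test part of every split.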

This strategy may be referenced in a YAML setup as follows:

splits:
  strategy: cross_validation
  folds: 5  # the number k in k-fold cross-validation
  seed: 42  # optional

Parameters:
  • folds (int) – Number of cross-validation folds to generate.

  • shuffle (bool | None) – Whether to shuffle the data before splitting. If None, the data will be shuffled if seed is not None.

  • seed (int | None) – Random seed controlling stochastic behaviour for reproducible results.

apply(instances: DataType) → Iterator[tuple[DataType, DataType]]

Return an iterator over k train/test splits.

Parameters:

instances (InstanceDataType) – Input instances to be processed by this method.

class lir.data_strategies.LeaveOneSourceOut

Bases: DataStrategy

Leave-one-out by source id.

This data strategy uses the source_ids attribute and assigns a single source at a time to the test set, and the others to the training set. There will be as many splits as there are sources, so that each source will be in the test set once.
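
The splitting rule above can be sketched on a plain list of source ids. This is a hedged illustration rather than the library's code: the function name leave_one_source_out and the use of index lists are assumptions, and the order in which sources are held out (first seen first) is chosen here for readability.

```python
from __future__ import annotations

from collections.abc import Hashable, Iterator, Sequence


def leave_one_source_out(
    source_ids: Sequence[Hashable],
) -> Iterator[tuple[list[int], list[int]]]:
    """Yield (train_indices, test_indices), one split per distinct source."""
    # dict.fromkeys removes duplicates while keeping first-seen order.
    for held_out in dict.fromkeys(source_ids):
        test = [i for i, s in enumerate(source_ids) if s == held_out]
        train = [i for i, s in enumerate(source_ids) if s != held_out]
        yield train, test


splits = list(leave_one_source_out(["a", "a", "b", "c", "c"]))
print(len(splits))  # 3, one split per source
print(splits[0])    # ([2, 3, 4], [0, 1])
```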

In an experiment setup file, the data strategy can be referenced as:

splits:
  strategy: leave_one_source_out

apply(instances: DataType) → Iterator[tuple[DataType, DataType]]

Perform leave-one-source-out.

Parameters:

instances (InstanceDataType) – Input instances with a source_ids attribute.

Returns:

An iterator over train/test splits.

Return type:

Iterator[tuple[DataType, DataType]]

class lir.data_strategies.PairsTrainTestSplit(test_size: float | int, seed: int | None = None)

Bases: DataStrategy

A train/test split policy for paired instances.

The input data should have source_ids with two columns. This split assigns all sources to either the training set or the test set. The pairs are assigned to training or testing if both of their sources have that role. Pairs with mixed roles are omitted.
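
The pair assignment rule can be sketched as follows. This is a simplified illustration under stated assumptions, not the library's implementation: the function name split_pairs is hypothetical, pairs are represented as tuples of two source ids, and the number of test sources is passed directly as an int.

```python
from __future__ import annotations

import random


def split_pairs(
    pair_sources: list[tuple[str, str]],
    n_test_sources: int,
    seed: int | None = None,
) -> tuple[list[int], list[int]]:
    """Return (train_indices, test_indices); mixed-role pairs are dropped."""
    # Collect every source that occurs in either column, then pick test sources.
    sources = sorted({s for pair in pair_sources for s in pair})
    random.Random(seed).shuffle(sources)
    test_sources = set(sources[:n_test_sources])

    train: list[int] = []
    test: list[int] = []
    for i, (a, b) in enumerate(pair_sources):
        if a in test_sources and b in test_sources:
            test.append(i)
        elif a not in test_sources and b not in test_sources:
            train.append(i)
        # Otherwise the pair mixes a training and a test source: omit it.
    return train, test


pairs = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("b", "d"), ("c", "d")]
train, test = split_pairs(pairs, n_test_sources=2, seed=0)
print(len(train), len(test))  # 1 1, the other four pairs are mixed
```

Note how many pairs can be lost: with four sources split two against two, only two of the six pairs survive.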

Parameters:
  • test_size (float | int) – Fraction or absolute number of items assigned to the test split.

  • seed (int | None) – Random seed controlling stochastic behaviour for reproducible results.

apply(instances: DataType) → Iterator[tuple[DataType, DataType]]

Split the data into a training set and a test set.

Parameters:

instances (InstanceDataType) – Input instances to be processed by this method.

Yields:

tuple[DataType, DataType] – An iterator over a single item, which is a tuple of the training set and the test set.

class lir.data_strategies.PredefinedCrossValidation

Bases: DataStrategy

Split data into cross validation folds based on predefined assignments.

This strategy expects a fold_assignments field in the data. For example, parse_features_from_csv_file creates this field when fold_assignment_column is specified.

Each instance should be labelled with the test set (fold) it belongs to. Take care to use the correct number of folds (equal to the number of unique labels) and to decide whether the folds are based on sources or on instances.

In the experiment setup file, this split strategy can be referenced as follows:

cross_validation_splits:
    strategy: predefined_cross_validation

apply(instances: DataType) → Iterator[tuple[DataType, DataType]]

Perform cross-validation based on predefined fold assignments.

This strategy expects a fold_assignments field in the data, where each instance is labelled with a fold identifier. The strategy will return one train/test split for each unique fold identifier, using the instances with that identifier as the test set and the others as the training set.
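
The behaviour described above can be sketched on a plain list of fold identifiers. This is an illustration, not the library's code: the function name predefined_folds and the index-list representation are assumptions, and the order of the splits (first-seen fold first) is a choice made here.

```python
from __future__ import annotations

from collections.abc import Hashable, Iterator, Sequence


def predefined_folds(
    fold_assignments: Sequence[Hashable],
) -> Iterator[tuple[list[int], list[int]]]:
    """Yield one (train_indices, test_indices) pair per unique fold id."""
    for fold in dict.fromkeys(fold_assignments):
        # Instances labelled with this fold id form the test set.
        test = [i for i, f in enumerate(fold_assignments) if f == fold]
        train = [i for i, f in enumerate(fold_assignments) if f != fold]
        yield train, test


for train, test in predefined_folds([0, 1, 0, 2, 1]):
    print(train, test)
# [1, 3, 4] [0, 2]
# [0, 2, 3] [1, 4]
# [0, 1, 2, 4] [3]
```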

Parameters:

instances (InstanceDataType) – Input instances to be processed by this method.

class lir.data_strategies.PredefinedTrainTestSplit

Bases: DataStrategy

Split data into a training set and a test set based on predefined assignments.

This strategy expects a role_assignments field in the data, where each instance is labelled either "train" (included in the training set) or "test" (included in the test set).
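
A split by predefined roles can be sketched as follows. This is a minimal illustration and not the library's implementation: Role is a local stand-in enum with the same "train" and "test" values, and split_by_role is a hypothetical helper working on index lists.

```python
from __future__ import annotations

from enum import Enum


class Role(Enum):
    """Local stand-in with the two role values described above."""

    TRAIN = "train"
    TEST = "test"


def split_by_role(role_assignments: list[str]) -> tuple[list[int], list[int]]:
    """Return (train_indices, test_indices) from per-instance role labels."""
    # Role(...) raises ValueError for any label other than "train" or "test".
    roles = [Role(label) for label in role_assignments]
    train = [i for i, role in enumerate(roles) if role is Role.TRAIN]
    test = [i for i, role in enumerate(roles) if role is Role.TEST]
    return train, test


print(split_by_role(["train", "test", "train"]))  # ([0, 2], [1])
```

Converting the labels through an enum first means a misspelled role fails loudly instead of silently dropping the instance.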

In the experiment setup file, this split strategy can be referenced as follows:

train_test_splits:
    strategy: predefined_train_test

apply(instances: DataType) → Iterator[tuple[DataType, DataType]]

Split the data into a training set and a test set.

Parameters:

instances (InstanceDataType) – Input instances to be processed by this method.

Yields:

tuple[DataType, DataType] – An iterator over a single item, which is a tuple of the training set and the test set.

class lir.data_strategies.RoleAssignment(*values)

Bases: Enum

Indicate whether the data is part of the train or the test split.

TEST = 'test'
TRAIN = 'train'

class lir.data_strategies.SourcesCrossValidation(folds: int, shuffle: bool | None = None, random_state: int | None = None)

Bases: DataStrategy

K-fold cross-validation by source id.

This data strategy uses the source_ids attribute and distributes the sources over k different subsets. If the data have hypothesis labels, use CrossValidation instead.

Each of the subsets will be offered once as the test set, using the others as the training set. Each source is assigned to exactly one of the subsets, and no sources will have instances that appear in more than one.
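
The grouping rule can be sketched in plain Python. Note that this is a simplified round-robin assignment written for illustration; it does not reproduce the fold balancing of the library that the class uses internally. The function name source_k_fold and the index-list representation are assumptions.

```python
from __future__ import annotations

import random
from collections.abc import Hashable, Iterator, Sequence


def source_k_fold(
    source_ids: Sequence[Hashable], folds: int, random_state: int | None = None
) -> Iterator[tuple[list[int], list[int]]]:
    """Yield (train_indices, test_indices); a source never spans two subsets."""
    sources = list(dict.fromkeys(source_ids))
    if folds < 2 or folds > len(sources):
        raise ValueError("folds must be between 2 and the number of sources")
    if random_state is not None:
        random.Random(random_state).shuffle(sources)
    # Deal the sources round-robin so each lands in exactly one subset.
    fold_of = {source: i % folds for i, source in enumerate(sources)}
    for k in range(folds):
        test = [i for i, s in enumerate(source_ids) if fold_of[s] == k]
        train = [i for i, s in enumerate(source_ids) if fold_of[s] != k]
        yield train, test


ids = ["a", "a", "b", "c", "c", "d"]
for train, test in source_k_fold(ids, folds=2):
    print(train, test)
# [2, 5] [0, 1, 3, 4]
# [0, 1, 3, 4] [2, 5]
```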

In an experiment setup file, the data strategy can be referenced as:

splits:
  strategy: cross_validation_sources
  folds: 5
  random_state: 0

This class internally uses sklearn.model_selection.GroupKFold.

Parameters:
  • folds (int) – Number of cross-validation folds to generate.

  • shuffle (bool | None) – Whether to shuffle the groups before splitting into batches. Note that the samples within each split will not be shuffled. If None, the data will be shuffled if random_state is not None.

  • random_state (int | None) – When shuffle is True, random_state affects the ordering of the indices, which controls the randomness of each fold. Otherwise, this parameter has no effect. Pass an int for reproducible output across multiple function calls.

apply(instances: DataType) → Iterator[tuple[DataType, DataType]]

Perform k-fold cross-validation.

Return an iterator over k train/test splits.

Parameters:

instances (InstanceDataType) – Input instances to be processed by this method.

class lir.data_strategies.SourcesTrainTestSplit(test_size: float | int, seed: int | None = None)

Bases: DataStrategy

Split the data into a training set and a test set by their source ids.

This splitter uses the source_ids attribute and distributes the sources over the training and test set. Each source is assigned to either the training set or the test set, so no source has instances in both.
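
A source-based split can be sketched as follows. This is an illustration under stated assumptions rather than the library's code: split_by_source is a hypothetical helper, the data are reduced to a list of source ids, and the number of test sources is given directly as an int.

```python
from __future__ import annotations

import random
from collections.abc import Hashable, Sequence


def split_by_source(
    source_ids: Sequence[Hashable], n_test_sources: int, seed: int | None = None
) -> tuple[list[int], list[int]]:
    """Return (train_indices, test_indices) with disjoint sets of sources."""
    sources = list(dict.fromkeys(source_ids))
    random.Random(seed).shuffle(sources)
    test_sources = set(sources[:n_test_sources])
    # All instances of a source follow that source into one of the two sets.
    train = [i for i, s in enumerate(source_ids) if s not in test_sources]
    test = [i for i, s in enumerate(source_ids) if s in test_sources]
    return train, test


ids = ["a", "a", "b", "c", "c", "d"]
train, test = split_by_source(ids, n_test_sources=2, seed=1)
print({ids[i] for i in train}, {ids[i] for i in test})
```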

This splitter is suitable for most common-source setups. Alternatively, use the SourcesCrossValidation strategy for cross-validation.

In an experiment setup file, the split strategy can be referenced as:

splits:
  strategy: train_test_sources
  test_size: 0.5  # the proportion of sources in the test set
  seed: 42        # optional

This class internally uses sklearn.model_selection.GroupShuffleSplit.

Parameters:
  • test_size (float | int) – Fraction or absolute number of sources assigned to the test split. If float, should be between 0.0 and 1.0 and represent the proportion of sources to include in the test split (rounded up). If int, represents the absolute number of test sources.

  • seed (int | None) – Random seed controlling stochastic behaviour for reproducible results.
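
How a float or int test_size translates into a number of test sources can be sketched as below. The helper name resolve_test_size is hypothetical, and treating the bounds 0.0 and 1.0 as exclusive is an assumption; only the rounding up of the proportion is taken from the parameter description.

```python
from __future__ import annotations

import math


def resolve_test_size(test_size: float | int, n_sources: int) -> int:
    """Hypothetical helper: turn test_size into a count of test sources."""
    if isinstance(test_size, float):
        if not 0.0 < test_size < 1.0:
            raise ValueError("a float test_size must lie between 0.0 and 1.0")
        # A proportion is rounded up to a whole number of sources.
        return math.ceil(test_size * n_sources)
    # An int is taken as the absolute number of test sources.
    return test_size


print(resolve_test_size(0.5, 5))  # 3
print(resolve_test_size(3, 10))   # 3
```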

apply(instances: DataType) → Iterator[tuple[DataType, DataType]]

Split the data into a training set and a test set.

Parameters:

instances (InstanceDataType) – Input instances to be processed by this method.

Yields:

tuple[DataType, DataType] – An iterator over a single item, which is a tuple of the training set and the test set.

class lir.data_strategies.TrainTestSplit(test_size: float | int, seed: int | None = None)

Bases: DataStrategy

Split the data into a training set and a test set.

This splitter distributes the instances randomly over a training set and a test set. Each instance is assigned to exactly one of the two sets. The hypothesis labels are used to distribute the instances of each hypothesis proportionally over both sets.

This splitter is suitable for most specific-source setups. If you have a common-source setup, take a look at SourcesTrainTestSplit. Alternatively, use the CrossValidation strategy for cross-validation.

The input data should have hypothesis labels. This split assigns instances of both classes to the training set and the test set.
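
The proportional distribution per hypothesis can be sketched as follows. This is an illustrative stand-in, not the library's implementation: stratified_split is a hypothetical helper, it works on index lists, it only handles a float test_size, and rounding the per-label test count with round() is an assumption.

```python
from __future__ import annotations

import random


def stratified_split(
    labels: list[int], test_size: float, seed: int | None = None
) -> tuple[list[int], list[int]]:
    """Return (train_indices, test_indices), split separately per label."""
    rng = random.Random(seed)
    train: list[int] = []
    test: list[int] = []
    for label in sorted(set(labels)):
        members = [i for i, y in enumerate(labels) if y == label]
        rng.shuffle(members)
        # Take the same proportion of test instances from each hypothesis.
        n_test = round(len(members) * test_size)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return sorted(train), sorted(test)


labels = [0] * 10 + [1] * 20
train, test = stratified_split(labels, test_size=0.2, seed=42)
print(len(train), len(test))  # 24 6
```

With 10 instances of one hypothesis and 20 of the other, a test_size of 0.2 places 2 and 4 instances respectively in the test set, so the class ratio is the same in both sets.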

In an experiment setup file, the split strategy can be referenced as:

splits:
  strategy: train_test
  test_size: 0.2  # the (hold-out) test set is 20% of the data
  seed: 42  # optional

Parameters:
  • test_size (float | int) – Size of the test set. If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples.

  • seed (int | None) – Random seed controlling stochastic behaviour for reproducible results.

apply(instances: DataType) → Iterator[tuple[DataType, DataType]]

Split the data into a training set and a test set.

Parameters:

instances (DataType) – Instances to split.

Yields:

tuple[DataType, DataType] – An iterator over a single item, which is a tuple of the training set and the test set.

Submodules