lir.data_strategies.labels module

class lir.data_strategies.labels.CrossValidation(folds: int, shuffle: bool | None = None, seed: int | None = None)

Bases: DataStrategy

K-fold cross-validation iterator over successive train/test splits.

The input data must contain hypothesis labels. If the data has source_ids but not hypothesis labels, use SourcesCrossValidation instead.

Each fold is constructed so that instances from both hypotheses are present in every split.
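The idea of keeping both hypotheses present in every split can be illustrated with a minimal, library-independent sketch. The helper names `stratified_folds` and `cross_validation_splits` are hypothetical and work on plain index lists; this is not the lir implementation, only one way such folds could be built. It assumes each hypothesis has at least as many instances as there are folds.

```python
import random
from collections import defaultdict


def stratified_folds(labels, folds, seed=None):
    """Assign every index to one of `folds` folds, one label at a time.

    Hypothetical helper for illustration; not part of lir.
    """
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    rng = random.Random(seed)
    assignment = [[] for _ in range(folds)]
    for indices in by_label.values():
        if seed is not None:
            rng.shuffle(indices)
        # Deal the indices of this label round-robin over the folds,
        # so each fold receives instances of every label.
        for position, index in enumerate(indices):
            assignment[position % folds].append(index)
    return assignment


def cross_validation_splits(labels, folds, seed=None):
    """Yield one (train_indices, test_indices) pair per fold."""
    assignment = stratified_folds(labels, folds, seed)
    for k in range(folds):
        test = sorted(assignment[k])
        train = sorted(
            index
            for j, fold in enumerate(assignment)
            if j != k
            for index in fold
        )
        yield train, test


# Toy data: 6 instances of hypothesis 0 and 9 instances of hypothesis 1.
labels = [0] * 6 + [1] * 9
splits = list(cross_validation_splits(labels, folds=3, seed=42))
```

Each element of `splits` is a pair of index lists; every instance appears in exactly one test set across the three folds.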

This strategy may be registered in a YAML registry as follows:

splits:
  strategy: cross_validation
  folds: 5  # the number k in k-fold cross-validation
  seed: 42  # optional

Parameters:

folds (int) – Number of cross-validation folds to generate.
shuffle (bool | None) – Whether to shuffle the data before splitting. If None, the data is shuffled if seed is not None.
seed (int | None) – Random seed controlling stochastic behaviour for reproducible results.
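The default behaviour of shuffle can be summarised in a few lines. This is a sketch of the documented rule only, reading the condition as referring to the seed argument (the only random-state parameter in the signature); `resolve_shuffle` is a hypothetical helper, not a function provided by lir.

```python
def resolve_shuffle(shuffle, seed):
    """Return the effective shuffle flag (hypothetical helper).

    An explicit True/False wins; None falls back to "shuffle only
    when a seed was given".
    """
    if shuffle is None:
        return seed is not None
    return shuffle


effective = {
    "no shuffle, no seed": resolve_shuffle(None, None),
    "no shuffle, seed given": resolve_shuffle(None, 42),
    "shuffle disabled, seed given": resolve_shuffle(False, 42),
}
```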

apply(instances: DataType) → Iterator[tuple[DataType, DataType]]

Return an iterator over k train/test splits.

Parameters: instances (DataType) – Input instances to split into folds.

class lir.data_strategies.labels.TrainTestSplit(test_size: float | int, seed: int | None = None)

Bases: DataStrategy

Split the data into a training set and a test set.

This splitter distributes the instances randomly over a training set and test set. Each instance is assigned to either the training set or the test set, but no sources will have instances that appear in both. The hypothesis labels are used to distribute the instances of each hypothesis proportionally to both sets.

This splitter is suitable for most specific-source setups. If you have a common-source setup, take a look at SourcesTrainTestSplit. Alternatively, use the CrossValidation strategy for cross-validation.

The input data should have hypothesis labels. This split assigns instances of both classes to the training set and the test set.

In an experiment setup file, the split strategy can be referenced as:

splits:
  strategy: train_test
  test_size: 0.2  # the (hold-out) test set is 20% of the data
  seed: 42  # optional

Parameters:

test_size (float | int) – Size of the test set. If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples.
seed (int | None) – Random seed controlling stochastic behaviour for reproducible results.
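The two interpretations of test_size can be made concrete with a small sketch. `resolve_test_count` is a hypothetical helper, not part of lir; rounding a fractional count up and treating the bounds as exclusive are assumptions made here for illustration, and the library may behave differently.

```python
import math


def resolve_test_count(test_size, n_instances):
    """Translate test_size into an absolute number of test instances.

    Hypothetical helper: floats are proportions, ints are counts.
    """
    if isinstance(test_size, bool):
        raise TypeError("test_size must be a float or an int")
    if isinstance(test_size, float):
        if not 0.0 < test_size < 1.0:
            raise ValueError("a float test_size must lie between 0.0 and 1.0")
        # Assumption: a fractional count is rounded up.
        return math.ceil(test_size * n_instances)
    if isinstance(test_size, int):
        if not 0 < test_size < n_instances:
            raise ValueError("an int test_size must leave room for a training set")
        return test_size
    raise TypeError("test_size must be a float or an int")


as_proportion = resolve_test_count(0.25, 8)  # a quarter of 8 instances
as_count = resolve_test_count(3, 10)         # exactly 3 instances
```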

apply(instances: DataType) → Iterator[tuple[DataType, DataType]]

Split the data into a training set and a test set.

Parameters: instances (DataType) – Instances to split.
Yields: tuple[DataType, DataType] – An iterator over a single item, which is a tuple of the training set and the test set.
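The single-item iterator contract, combined with a proportional split per hypothesis, can be sketched without the library. `stratified_train_test` is a hypothetical stand-in that works on index lists and a fractional test size; it is not the lir implementation and only mirrors the documented shape of the result.

```python
import random
from collections import defaultdict


def stratified_train_test(labels, test_fraction, seed=None):
    """Yield exactly one (train_indices, test_indices) pair.

    Hypothetical stand-in: each label contributes the same
    fraction of its instances to the test set.
    """
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    rng = random.Random(seed)
    train, test = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    yield sorted(train), sorted(test)


# Toy data: 10 instances of hypothesis 0 and 20 of hypothesis 1.
labels = [0] * 10 + [1] * 20

# The iterator holds a single item, so next() retrieves the only split.
train, test = next(stratified_train_test(labels, 0.2, seed=42))
```

Because the result is an iterator rather than a bare tuple, the same loop that consumes cross-validation folds (`for train, test in ...`) also works for a single train/test split.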

lir.data_strategies.labels.is_valid_input(instances: InstanceData) → bool

Return True iff label-based strategies can be applied.