lir.data_strategies.auto module

class lir.data_strategies.auto.AutoCrossValidation(folds: int, shuffle: bool | None = None, random_state: int | None = None)

Bases: DataStrategy

K-fold cross-validation iterator over successive train/test splits.

This splitter attempts to find a suitable splitting strategy for the input data. Candidate strategies are:

PredefinedCrossValidation, if the dataset has role assignments;
SourcesCrossValidation, if the dataset has pairs with source ids (i.e., two source ids per pair);
CrossValidation, if the instances in the dataset have hypothesis labels.

The strategy is chosen each time apply() is called, based on the data passed in. Subsequent calls to apply() are therefore not guaranteed to use the same strategy, although in realistic use cases they most likely will. If no suitable strategy is found, a ValueError is raised.

This strategy may be referenced in a YAML setup as follows:

splits:
  strategy: auto_cross_validation
  folds: 5  # the number k in k-fold cross-validation
  random_state: 42  # optional

Parameters:

folds (int) – Number of cross-validation folds to generate.
shuffle (bool | None) – Whether to shuffle the groups before splitting them into folds. If None, the data is shuffled if random_state is not None.
random_state (int | None) – Random seed controlling stochastic behavior for reproducible results.
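
The interaction between shuffle and random_state can be summarised in a few lines. This is a minimal sketch of the documented default only; the helper name resolve_shuffle is hypothetical and is not part of lir.

```python
from __future__ import annotations


def resolve_shuffle(shuffle: bool | None, random_state: int | None) -> bool:
    """Hypothetical helper mirroring the documented default for `shuffle`."""
    if shuffle is None:
        # No explicit choice was made: shuffle only when a seed was supplied.
        return random_state is not None
    # An explicit True or False always wins over the seed.
    return shuffle
```

Under this reading, passing only random_state=42 enables shuffling, while passing neither argument leaves the data in its original order.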

apply(instances: DataType) → Iterator[tuple[DataType, DataType]]

Return an iterator over k train/test splits.

Parameters: instances (DataType) – Instances to split into folds.
Returns: An iterator over (training set, test set) pairs, one per fold.
Return type: Iterator[tuple[DataType, DataType]]
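
To illustrate the shape of the iterator that apply() returns, the sketch below implements plain k-fold splitting over a list using only the standard library. It is a stand-in, not lir's code: the function k_fold_splits is hypothetical, it ignores source ids and role assignments, and it assumes instances can be indexed like a sequence.

```python
from __future__ import annotations

import random
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def k_fold_splits(
    instances: Sequence[T], folds: int, random_state: int | None = None
) -> Iterator[tuple[list[T], list[T]]]:
    """Yield `folds` (train, test) pairs; each instance is tested exactly once."""
    if not 2 <= folds <= len(instances):
        raise ValueError("folds must be between 2 and the number of instances")
    indices = list(range(len(instances)))
    if random_state is not None:
        # Seeded shuffle, so that repeated runs produce the same folds.
        random.Random(random_state).shuffle(indices)
    for k in range(folds):
        # Fold k takes every `folds`-th position, starting at offset k.
        test_idx = set(indices[k::folds])
        train = [instances[i] for i in indices if i not in test_idx]
        test = [instances[i] for i in indices if i in test_idx]
        yield train, test
```

A caller would typically loop over the pairs, for example `for train, test in k_fold_splits(data, folds=5): ...`, fitting on train and evaluating on test in each iteration.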

class lir.data_strategies.auto.AutoTrainTestSplit(test_size: float | int = 0.5, random_state: int | None = None)

Bases: DataStrategy

Split the data into a training set and a test set.

This splitter attempts to find a suitable splitting strategy for the input data. Candidate strategies are, in order of priority:

PredefinedTrainTestSplit, if the dataset has role assignments;
PairsTrainTestSplit, if the dataset has pairs with source ids (i.e., two source ids per pair);
SourcesTrainTestSplit, if the instances in the dataset have source ids;
TrainTestSplit, if the instances in the dataset have hypothesis labels.

The strategy is chosen each time apply() is called, based on the data passed in. Subsequent calls to apply() are therefore not guaranteed to use the same strategy, although in realistic use cases they most likely will. If no suitable strategy is found, a ValueError is raised.
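
The selection described above can be pictured as a priority-ordered chain of checks. The sketch below is not lir's implementation: the dictionary-based dataset and the keys roles, source_id_pairs, source_ids and labels are hypothetical stand-ins for whatever the library actually inspects. Only the strategy names, their order and the ValueError come from the description above.

```python
from __future__ import annotations

from typing import Any, Callable

# Hypothetical predicates: each one tests for the metadata that the
# corresponding strategy needs. Real lir datasets are not plain dicts.
CANDIDATES: list[tuple[str, Callable[[dict[str, Any]], bool]]] = [
    ("PredefinedTrainTestSplit", lambda d: d.get("roles") is not None),
    ("PairsTrainTestSplit", lambda d: d.get("source_id_pairs") is not None),
    ("SourcesTrainTestSplit", lambda d: d.get("source_ids") is not None),
    ("TrainTestSplit", lambda d: d.get("labels") is not None),
]


def choose_strategy(dataset: dict[str, Any]) -> str:
    """Return the name of the first candidate whose requirement is met."""
    for name, is_applicable in CANDIDATES:
        if is_applicable(dataset):
            return name
    # Mirrors the documented behavior when nothing matches.
    raise ValueError("no suitable data strategy found for this dataset")
```

Because the choice depends only on the data passed in, two datasets with different metadata can end up with different strategies, which is why repeated calls are not guaranteed to agree.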

In an experiment setup file, the split strategy can be referenced as:

splits:
  strategy: auto_train_test
  test_size: 0.2  # the (hold-out) test set is 20% of the data
  random_state: 42  # optional

Parameters:

test_size (float | int) – Size of the test set. If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. The default value is 0.5.
random_state (int | None) – Random seed controlling stochastic behavior for reproducible results.
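
The two interpretations of test_size can be made concrete with a small calculation. The helper below is a hypothetical sketch, not lir's code. It assumes that a fractional size is rounded up (as scikit-learn's train_test_split does) and that both bounds are exclusive; neither detail is stated in this documentation.

```python
from __future__ import annotations

import math


def n_test_samples(test_size: float | int, n_samples: int) -> int:
    """Hypothetical helper: translate `test_size` into a number of test samples."""
    if isinstance(test_size, float):
        # Float: a proportion of the dataset (exclusive bounds are an assumption).
        if not 0.0 < test_size < 1.0:
            raise ValueError("a float test_size must lie between 0.0 and 1.0")
        # Rounding up is an assumption borrowed from scikit-learn.
        return math.ceil(test_size * n_samples)
    # Int: an absolute number of test samples.
    if not 0 < test_size < n_samples:
        raise ValueError("an int test_size must lie between 0 and n_samples")
    return test_size
```

With the default test_size of 0.5 and 10 instances, this gives 5 test samples; with test_size=3 it gives exactly 3, regardless of the dataset size.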

apply(instances: DataType) → Iterator[tuple[DataType, DataType]]

Split the data into a training set and a test set.

Parameters: instances (DataType) – Instances to split.
Returns: An iterator over (training set, test set) tuples.
Return type: Iterator[tuple[DataType, DataType]]