lir.data_strategies.auto module
- class lir.data_strategies.auto.AutoCrossValidation(folds: int, shuffle: bool | None = None, random_state: int | None = None)
Bases: DataStrategy
K-fold cross-validation iterator over successive train/test splits.
This splitter attempts to find a suitable splitting strategy for the input data. Candidate strategies are:
- PredefinedCrossValidation, if the dataset has role assignments;
- SourcesCrossValidation, if the dataset has pairs with source ids (i.e., two source ids per pair);
- CrossValidation, if the instances in the dataset have hypothesis labels.
The data strategy is decided in apply(). Subsequent calls to apply() are not guaranteed to use the same strategy, although in realistic use cases they most likely will. If no suitable strategy is found, a ValueError is raised.
This strategy may be referenced in a YAML setup as follows:
splits:
  strategy: auto_cross_validation
  folds: 5  # the number k in k-fold cross-validation
  random_state: 42  # optional
- Parameters:
folds (int) – Number of cross-validation folds to generate.
shuffle (bool | None) – Whether to shuffle the groups before splitting into batches. If None, the data will be shuffled if random_state is not None.
random_state (int | None) – Random seed controlling stochastic behavior for reproducible results.
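The default for shuffle described above can be sketched in a few lines. This is a minimal illustration of the documented rule only; resolve_shuffle is a hypothetical helper name and is not part of lir.

```python
from __future__ import annotations


def resolve_shuffle(shuffle: bool | None, random_state: int | None) -> bool:
    """Hypothetical helper: work out the effective shuffle flag.

    An explicit True/False always wins. If shuffle is None, the data is
    shuffled only when a random_state has been supplied.
    """
    if shuffle is not None:
        return shuffle
    return random_state is not None


print(resolve_shuffle(None, 42))    # shuffle left unset, seed given
print(resolve_shuffle(None, None))  # shuffle left unset, no seed
print(resolve_shuffle(False, 42))   # explicit False overrides the seed
```

Note that under this rule a seed of 0 still counts as "given", because the check is against None rather than truthiness.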
- apply(instances: DataType) → Iterator[tuple[DataType, DataType]]
Return an iterator over k train/test splits.
- Parameters:
instances (InstanceDataType) – Input instances to be processed by this method.
- Returns:
An iterator over pairs of a training set and a test set.
- Return type:
Iterator[tuple[DataType, DataType]]
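To make the shape of the return value concrete, the sketch below yields k (train, test) pairs over a plain Python list. It is not the lir implementation: k_fold_splits is a hypothetical stand-in that ignores shuffling, source ids and role assignments, and it assumes each item lands in exactly one test fold.

```python
from __future__ import annotations

from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def k_fold_splits(items: Sequence[T], folds: int) -> Iterator[tuple[list[T], list[T]]]:
    """Hypothetical stand-in: yield `folds` (train, test) pairs, unshuffled."""
    if folds < 2 or folds > len(items):
        raise ValueError("folds must be between 2 and the number of items")
    for k in range(folds):
        # Item i goes to the test set of fold (i mod folds), else to training.
        test = [x for i, x in enumerate(items) if i % folds == k]
        train = [x for i, x in enumerate(items) if i % folds != k]
        yield train, test


splits = list(k_fold_splits(list(range(10)), folds=5))
for train, test in splits:
    print(len(train), len(test))
```

A caller would typically fit a model on each training set and evaluate it on the matching test set, then aggregate the results over all folds.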
- class lir.data_strategies.auto.AutoTrainTestSplit(test_size: float | int = 0.5, random_state: int | None = None)
Bases: DataStrategy
Split the data into a training set and a test set.
This splitter attempts to find a suitable splitting strategy for the input data. Candidate strategies are, in order of priority:
- PredefinedTrainTestSplit, if the dataset has role assignments;
- PairsTrainTestSplit, if the dataset has pairs with source ids (i.e., two source ids per pair);
- SourcesTrainTestSplit, if the instances in the dataset have source ids;
- TrainTestSplit, if the instances in the dataset have hypothesis labels.
The data strategy is decided in apply(). Subsequent calls to apply() are not guaranteed to use the same strategy, although in realistic use cases they most likely will. If no suitable strategy is found, a ValueError is raised.
In an experiment setup file, the split strategy can be referenced as:
splits:
  strategy: auto_train_test
  test_size: 0.2  # the (hold-out) test set is 20% of the data
  seed: 42  # optional
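The priority order among the candidate strategies can be read as a first-match dispatch. The sketch below illustrates that reading only. DatasetTraits and choose_train_test_strategy are hypothetical names, the boolean flags stand in for whatever checks the library actually performs on the data, and the function returns a strategy name rather than a strategy object.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class DatasetTraits:
    """Hypothetical summary of what a dataset provides (not part of lir)."""

    has_role_assignments: bool = False
    has_pair_source_ids: bool = False
    has_source_ids: bool = False
    has_hypothesis_labels: bool = False


def choose_train_test_strategy(traits: DatasetTraits) -> str:
    """Return the name of the first candidate whose precondition holds."""
    # Candidates in the documented order of priority.
    candidates = [
        ("PredefinedTrainTestSplit", traits.has_role_assignments),
        ("PairsTrainTestSplit", traits.has_pair_source_ids),
        ("SourcesTrainTestSplit", traits.has_source_ids),
        ("TrainTestSplit", traits.has_hypothesis_labels),
    ]
    for name, applicable in candidates:
        if applicable:
            return name
    # Mirrors the documented behaviour when nothing applies.
    raise ValueError("no suitable splitting strategy found")


print(choose_train_test_strategy(
    DatasetTraits(has_source_ids=True, has_hypothesis_labels=True)
))
```

Because the decision depends on the data passed in, two calls with differently shaped datasets can select different strategies, which is consistent with the caveat about subsequent calls to apply().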
- Parameters:
test_size (float | int) – Size of the test set. If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. The default value is 0.5.
random_state (int | None) – Random seed controlling stochastic behavior for reproducible results.
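The two interpretations of test_size (proportion versus absolute count) can be illustrated as follows. This is a sketch under stated assumptions: resolve_test_count is a hypothetical helper, the bounds are treated as inclusive, and rounding to the nearest whole instance is an assumption, since the documentation does not say how fractional counts are handled.

```python
from __future__ import annotations


def resolve_test_count(test_size: float | int, n_instances: int) -> int:
    """Hypothetical helper: turn test_size into an absolute number of test instances."""
    if isinstance(test_size, bool):
        raise TypeError("test_size must be a float or an int, not a bool")
    if isinstance(test_size, int):
        # An int is taken as the absolute number of test samples.
        if not 0 <= test_size <= n_instances:
            raise ValueError("integer test_size must lie between 0 and n_instances")
        return test_size
    # A float is taken as the proportion of the dataset to hold out.
    if not 0.0 <= test_size <= 1.0:
        raise ValueError("float test_size must lie between 0.0 and 1.0")
    # Assumption: round to the nearest whole instance; the library may differ.
    return round(test_size * n_instances)


print(resolve_test_count(0.2, 100))  # proportion of 100 instances
print(resolve_test_count(30, 100))   # absolute count
```

The type of the argument matters here: 1 (an int) requests a single test instance, whereas 1.0 (a float) requests the whole dataset.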
- apply(instances: DataType) → Iterator[tuple[DataType, DataType]]
Split the data into a training set and a test set.
- Parameters:
instances (DataType) – Instances to split.
- Returns:
An iterator over pairs of a training set and a test set; each pair is a tuple (training set, test set).
- Return type:
Iterator[tuple[DataType, DataType]]