lir.data_strategies.sources module
- class lir.data_strategies.sources.LeaveOneSourceOut
Bases: DataStrategy
Leave-one-out by source id.
This data strategy uses the source_ids attribute and assigns a single source at a time to the test set, and the others to the training set. There will be as many splits as there are sources, so that each source will be in the test set once.
In an experiment setup file, the data strategy can be referenced as:
splits:
  strategy: leave_one_source_out
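The splitting rule described above can be sketched in plain Python. This is a minimal illustration of the semantics only, not the library's implementation: `leave_one_source_out` is a hypothetical helper, and it works on a plain list of source ids and returns index lists rather than lir instance objects.

```python
from __future__ import annotations

from collections.abc import Iterator, Sequence


def leave_one_source_out(
    source_ids: Sequence[str],
) -> Iterator[tuple[list[int], list[int]]]:
    """Yield (train_indices, test_indices), holding out one source per split.

    Hypothetical helper for illustration; it is not part of lir.
    """
    for held_out in sorted(set(source_ids)):
        # Every instance of the held-out source goes to the test set.
        test = [i for i, s in enumerate(source_ids) if s == held_out]
        # All instances of the remaining sources form the training set.
        train = [i for i, s in enumerate(source_ids) if s != held_out]
        yield train, test


source_ids = ["a", "a", "b", "c", "c", "c"]
splits = list(leave_one_source_out(source_ids))
```

With three distinct sources the sketch produces three splits, and every instance lands in a test set exactly once.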
- class lir.data_strategies.sources.SourcesCrossValidation(folds: int, shuffle: bool | None = None, random_state: int | None = None)
Bases: DataStrategy
K-fold cross-validation by source id.
This data strategy uses the source_ids attribute and distributes the sources over k different subsets. If the data have hypothesis labels, use CrossValidation instead.
Each of the subsets will be offered once as the test set, using the others as the training set. Each source is assigned to exactly one of the subsets, so no source has instances in more than one subset.
In an experiment setup file, the data strategy can be referenced as:
splits:
  strategy: cross_validation_sources
  folds: 5
  random_state: 0
This class internally uses GroupKFold.
- Parameters:
folds (int) – Number of cross-validation folds to generate.
shuffle (bool | None) – Whether to shuffle the groups before splitting into batches. Note that the samples within each split will not be shuffled. If None, the data will be shuffled if random_state is not None.
random_state (int | None) – When shuffle is True, random_state affects the ordering of the indices, which controls the randomness of each fold. Otherwise, this parameter has no effect. Pass an int for reproducible output across multiple function calls.
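The fold assignment can be illustrated with a small standard-library sketch. It is a simplification under stated assumptions: `sources_k_fold` is a hypothetical helper, it assigns sources to folds round-robin, and it does not reproduce how GroupKFold balances fold sizes. It only demonstrates the guarantee that each source ends up in exactly one test fold.

```python
from __future__ import annotations

import random
from collections.abc import Iterator, Sequence


def sources_k_fold(
    source_ids: Sequence[str], folds: int, random_state: int | None = None
) -> Iterator[tuple[list[int], list[int]]]:
    """Yield (train_indices, test_indices) for k folds grouped by source.

    Hypothetical helper for illustration; it is not part of lir.
    """
    sources = sorted(set(source_ids))
    if folds > len(sources):
        raise ValueError("folds cannot exceed the number of distinct sources")
    # Assumption mirroring the documented default: shuffle only when a
    # random_state is given.
    if random_state is not None:
        random.Random(random_state).shuffle(sources)
    # Round-robin assignment: each source belongs to exactly one fold.
    fold_of = {source: n % folds for n, source in enumerate(sources)}
    for fold in range(folds):
        test = [i for i, s in enumerate(source_ids) if fold_of[s] == fold]
        train = [i for i, s in enumerate(source_ids) if fold_of[s] != fold]
        yield train, test


source_ids = ["a", "a", "b", "c", "c", "d", "e"]
cv_splits = list(sources_k_fold(source_ids, folds=3, random_state=0))
```

Passing the same `random_state` gives the same folds on every call, which is what makes a cross-validation run reproducible.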
- class lir.data_strategies.sources.SourcesTrainTestSplit(test_size: float | int, seed: int | None = None)
Bases: DataStrategy
Split the data into a training set and a test set by their source ids.
This splitter uses the source_ids attribute and distributes the sources over the training and test set. Each source is assigned to either the training set or the test set, so no source has instances in both.
This splitter is suitable for most common-source setups. Alternatively, use the SourcesCrossValidation strategy for cross-validation.
In an experiment setup file, the split strategy can be referenced as:
splits:
  strategy: train_test_sources
  test_size: 0.5  # the proportion of sources in the test set
  seed: 42  # optional
This class internally uses sklearn.model_selection.GroupShuffleSplit.
- Parameters:
test_size (float | int) – Fraction or absolute number of sources assigned to the test split. If float, it should be between 0.0 and 1.0 and represents the proportion of sources to include in the test split (rounded up). If int, it represents the absolute number of test sources.
seed (int | None) – Random seed controlling stochastic behaviour for reproducible results.
- apply(instances: DataType) -> Iterator[tuple[DataType, DataType]]
Split the data into a training set and a test set.
- Parameters:
instances (InstanceDataType) – Input instances to be processed by this method.
- Yields:
tuple[DataType, DataType] – An iterator over a single item, which is a tuple of the training set and the test set.
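The effect of test_size and seed can be sketched with the standard library alone. This is an illustration under stated assumptions rather than the class's actual code: `sources_train_test_split` is a hypothetical helper, it returns index lists instead of lir instance objects, and it only mirrors the documented rule that a fractional test_size is rounded up to a whole number of sources.

```python
from __future__ import annotations

import math
import random
from collections.abc import Sequence


def sources_train_test_split(
    source_ids: Sequence[str], test_size: float | int, seed: int | None = None
) -> tuple[list[int], list[int]]:
    """Return (train_indices, test_indices), split by source id.

    Hypothetical helper for illustration; it is not part of lir.
    """
    sources = sorted(set(source_ids))
    if isinstance(test_size, float):
        # A fraction is converted to a number of sources, rounded up.
        n_test = math.ceil(test_size * len(sources))
    else:
        n_test = test_size
    if not 0 < n_test < len(sources):
        raise ValueError("test_size must leave at least one source per set")
    # A fixed seed makes the choice of test sources reproducible.
    test_sources = set(random.Random(seed).sample(sources, n_test))
    train = [i for i, s in enumerate(source_ids) if s not in test_sources]
    test = [i for i, s in enumerate(source_ids) if s in test_sources]
    return train, test


source_ids = ["a", "a", "b", "c", "c", "d", "e"]
train_idx, test_idx = sources_train_test_split(source_ids, test_size=0.5, seed=42)
```

In this example there are five distinct sources, so a test_size of 0.5 selects three of them for the test set and leaves two for training.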
- lir.data_strategies.sources.is_valid_input(instances: InstanceData) -> bool
Return True if and only if source-id-based strategies can be applied.