lir.datasets.feature_data_csv module

class lir.datasets.feature_data_csv.ExtraField(name: str, column_names: list[str], validate_cell: Callable[[str], Any])

Bases: NamedTuple

Extra field for CSV parsing.

column_names: list[str]

Alias for field number 1

name: str

Alias for field number 0

parse_row(row: dict[str, str]) → list[Any]

Take the appropriate values from a dictionary and return them as a list.

Parameters:

row (dict[str, str]) – CSV row dictionary to parse.

Returns:

Parsed values extracted from the input row.

Return type:

list[Any]

validate_cell: Callable[[str], Any]

Alias for field number 2
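
The following is a minimal sketch of how an extra field might be used. It does not import the library: ExtraFieldSketch is a hypothetical stand-in that copies only the three documented fields, and its parse_row body is an assumption (apply validate_cell to each column listed in column_names), not the library's actual implementation. The field name "position" and the columns "x" and "y" are made up for illustration.

```python
from typing import Any, Callable, NamedTuple


class ExtraFieldSketch(NamedTuple):
    """Hypothetical stand-in mirroring the documented fields of ExtraField."""

    name: str                            # field number 0
    column_names: list[str]              # field number 1
    validate_cell: Callable[[str], Any]  # field number 2

    def parse_row(self, row: dict[str, str]) -> list[Any]:
        # Assumed behaviour: take this field's columns from the row and
        # convert each raw cell string with validate_cell.
        return [self.validate_cell(row[column]) for column in self.column_names]


# Hypothetical extra field that reads two numeric columns as floats.
position = ExtraFieldSketch(name="position", column_names=["x", "y"], validate_cell=float)

# A CSV row as csv.DictReader would produce it: every cell is a string.
row = {"x": "1.5", "y": "2", "label": "a"}
parsed = position.parse_row(row)
print(parsed)  # [1.5, 2.0]
```

Under this assumption, columns that are not listed in column_names (here "label") are left for other parts of the parser to handle.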

class lir.datasets.feature_data_csv.FeatureDataCsvFileParser(file: PathLike, **kwargs: Any)

Bases: FeatureDataCsvParser

Read CSV data from file.

Parameters:
  • file (PathLike) – Path to the input file.

  • **kwargs (Any) – Additional keyword arguments forwarded to the FeatureDataCsvParser constructor.

get_instances() → FeatureData

Retrieve FeatureData instances.

Returns:

FeatureData object parsed from the source.

Return type:

FeatureData

class lir.datasets.feature_data_csv.FeatureDataCsvHttpParser(url: str, session: Session, **kwargs: Any)

Bases: FeatureDataCsvParser

Read CSV data from a URL.

By default, this class uses requests-cache to cache retrieved data. The cache is persistent and stored in the user cache folder; the location of that folder is written to the log file.

Parameters:
  • url (str) – URL of the remote resource to read.

  • session (requests.Session) – Session used to perform the HTTP request.

  • **kwargs (Any) – Additional keyword arguments forwarded to the FeatureDataCsvParser constructor.

get_instances() → FeatureData

Retrieve FeatureData from the remote resource.

Returns:

FeatureData object parsed from the source.

Return type:

FeatureData

class lir.datasets.feature_data_csv.FeatureDataCsvParser(source_id_column: str | list[str] | None = None, label_column: str | None = None, instance_id_column: str | None = None, role_assignment_column: str | None = None, fold_assignment_column: str | None = None, extra_fields: list[ExtraField] | None = None, ignore_columns: list[str] | None = None, head: int | None = None, message_prefix: str = '')

Bases: DataProvider, ABC

Parse a CSV file into a FeatureData object.

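This is an abstract base class with concrete implementations for different data sources:

  • FeatureDataCsvFileParser – reads CSV data from a file.

  • FeatureDataCsvHttpParser – reads CSV data from a URL.

  • FeatureDataCsvStreamParser – reads CSV data from an open stream.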

Parameters:
  • source_id_column (str | list[str] | None) – Column name(s) containing source identifiers.

  • label_column (str | None) – Column name containing class labels.

  • instance_id_column (str | None) – Column name containing instance identifiers.

  • role_assignment_column (str | None) – Column name containing predefined train/test roles.

  • fold_assignment_column (str | None) – Column name containing predefined fold assignments.

  • extra_fields (list[ExtraField] | None) – Optional extra fields to parse from each row.

  • ignore_columns (list[str] | None) – Column names ignored when extracting features.

  • head (int | None) – Maximum number of rows to read from the source.

  • message_prefix (str) – Prefix added to parser log and error messages.

Examples

Assume a CSV file containing source identifiers, three features, and one irrelevant column:

source_id,feature1,feature2,feature3,name_of_an_irrelevant_column
0,1,10,1,sherlock
0,1,11,1,holmes
1,20,30,1,irene
1,18,32,3,adler
2,5,10,8,professor
2,1,11,8,moriarty

This file can be parsed using the following YAML configuration:

data:
  provider: parse_features_from_csv_file
  path: path/to/file.csv
  source_id_column: source_id
  ignore_columns:
    - name_of_an_irrelevant_column

A CSV file hosted at a URL can be parsed in the same way:

data:
  provider: parse_features_from_csv_url
  url: https://raw.githubusercontent.com/NetherlandsForensicInstitute/elemental_composition_glass/refs/heads/main/training.csv
  source_id_column: Item
  ignore_columns:
    - id
    - Piece
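
To make the roles of source_id_column, ignore_columns and head concrete, here is a rough standard-library sketch applied to the example CSV above. It is not the parser's implementation: split_rows is a hypothetical helper, and it assumes that every column that is neither the source id column nor an ignored column is a numeric feature, and that head limits the number of data rows read.

```python
import csv
import io

CSV_TEXT = """source_id,feature1,feature2,feature3,name_of_an_irrelevant_column
0,1,10,1,sherlock
0,1,11,1,holmes
1,20,30,1,irene
1,18,32,3,adler
2,5,10,8,professor
2,1,11,8,moriarty
"""


def split_rows(fp, source_id_column, ignore_columns=(), head=None):
    """Illustrative only: split CSV rows into source ids and float features."""
    reader = csv.DictReader(fp)
    # Assumption: all remaining columns are treated as features.
    skipped = {source_id_column, *ignore_columns}
    feature_columns = [name for name in reader.fieldnames if name not in skipped]
    source_ids, features = [], []
    for index, row in enumerate(reader):
        # Assumption: head caps the number of data rows that are read.
        if head is not None and index >= head:
            break
        source_ids.append(row[source_id_column])
        features.append([float(row[name]) for name in feature_columns])
    return feature_columns, source_ids, features


columns, source_ids, features = split_rows(
    io.StringIO(CSV_TEXT),
    source_id_column="source_id",
    ignore_columns=["name_of_an_irrelevant_column"],
    head=4,
)
print(columns)     # ['feature1', 'feature2', 'feature3']
print(source_ids)  # ['0', '0', '1', '1']
```

With head=4, only the first four data rows are kept, so the rows for source 2 are never read in this sketch.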

source_id_columns: list[str]

class lir.datasets.feature_data_csv.FeatureDataCsvStreamParser(fp: IO, **kwargs: Any)

Bases: FeatureDataCsvParser

Read data from a streamed CSV.

Parameters:
  • fp (IO) – Open file-like object to read from.

  • **kwargs (Any) – Additional keyword arguments forwarded to the FeatureDataCsvParser constructor.

get_instances() → FeatureData

Retrieve FeatureData instances from the CSV stream.

Returns:

FeatureData object parsed from the source.

Return type:

FeatureData