deeprvat.deeprvat.train

Module Contents

Classes

MultiphenoDataset

Class that structures the data and exposes a __getitem__ method to the dataloader, which uses it to load batches into the model.

MultiphenoBaggingData

Preprocess the underlying dataframe, to then load it into a dataset object

Functions

cli

subset_samples

make_dataset_

Subfunction of make_dataset(). Converts a dataset file to the sparse format used for training and testing associations.

make_dataset

Uses make_dataset_() to convert the dataset to sparse format and stores the resulting data.

run_bagging

Main function called during training. Also used for trial pruning and sampling new parameters in Optuna.

train

Main function called during training. Also used for trial pruning and sampling new parameters in Optuna.

best_training_run

Function to extract the best trial from an Optuna study and handle associated model checkpoints and configurations.

Data

logger

METRICS

OPTIMIZERS

ACTIVATIONS

DEFAULT_OPTIMIZER

API

deeprvat.deeprvat.train.logger = 'getLogger(...)'
deeprvat.deeprvat.train.METRICS = None
deeprvat.deeprvat.train.OPTIMIZERS = None
deeprvat.deeprvat.train.ACTIVATIONS = None
deeprvat.deeprvat.train.DEFAULT_OPTIMIZER = None
deeprvat.deeprvat.train.cli()
deeprvat.deeprvat.train.subset_samples(input_tensor: torch.Tensor, covariates: torch.Tensor, y: torch.Tensor, min_variant_count: int) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor]
deeprvat.deeprvat.train.make_dataset_(debug: bool, pickle_only: bool, compression_level: int, training_dataset_file: Optional[str], config_file: Union[str, pathlib.Path], input_tensor_out_file: str, covariates_out_file: str, y_out_file: str)

Subfunction of make_dataset(). Converts a dataset file to the sparse format used for training and testing associations.

Parameters:
  • config (Dict) – Dictionary containing configuration parameters, built from a YAML file

  • debug (bool) – Use a strongly reduced dataframe (optional)

  • training_dataset_file (str) – Path to the file in which training data is stored. (optional)

  • pickle_only (bool) – If True, only store dataset as pickle file and return None. (optional)

Returns:

Tuple containing input_tensor, covariates, and target values.

Return type:

Tuple[torch.Tensor, torch.Tensor, torch.Tensor]

deeprvat.deeprvat.train.make_dataset(debug: bool, pickle_only: bool, compression_level: int, training_dataset_file: Optional[str], config_file: str, input_tensor_out_file: str, covariates_out_file: str, y_out_file: str)

Uses make_dataset_() to convert the dataset to sparse format and stores the resulting data.

Parameters:
  • debug (bool) – Use a strongly reduced dataframe

  • pickle_only (bool) – Flag to indicate whether only to save data using pickle

  • compression_level (int) – Level of compression in ZARR to be applied to training data.

  • training_dataset_file (Optional[str]) – Path to the file in which training data is stored. (optional)

  • config_file (str) – Path to a YAML file, which serves for configuration.

  • input_tensor_out_file (str) – Path to save the training data to.

  • covariates_out_file (str) – Path to save the covariates to.

  • y_out_file (str) – Path to save the ground truth data to.

Returns:

None

class deeprvat.deeprvat.train.MultiphenoDataset(data: Dict[str, Dict], batch_size: int, split: str = 'train', cache_tensors: bool = False, temp_dir: Optional[str] = None, chunksize: int = 1000)

Bases: torch.utils.data.Dataset

Class that structures the data and exposes a __getitem__ method to the dataloader, which uses it to load batches into the model.

Initialization

Initialize the MultiphenoDataset.

Parameters:
  • data (Dict[str, Dict]) – Underlying dataframe from which data is structured into batches.

  • min_variant_count (int) – Minimum number of variants available for each gene.

  • batch_size (int) – Number of samples/individuals available in one batch.

  • split (str) – Contains a prefix indicating the dataset the model operates on. Defaults to “train”. (optional)

  • cache_tensors (bool) – Indicates if samples have been pre-loaded or need to be extracted from zarr. (optional)

__len__()

Denotes the total number of batches

__getitem__(index)

Generates one batch of data
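The batch-level indexing described for __len__ and __getitem__ can be illustrated with a minimal, framework-free sketch. The class name BatchedSamples and its fields are hypothetical stand-ins; the real class subclasses torch.utils.data.Dataset and works on tensors. How the real class treats a final partial batch is not documented here, so rounding up is an assumption.

```python
import math
from typing import List, Sequence


class BatchedSamples:
    """Hypothetical stand-in: each index addresses a whole batch, not one sample."""

    def __init__(self, samples: Sequence[int], batch_size: int) -> None:
        if batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        self.samples = list(samples)
        self.batch_size = batch_size

    def __len__(self) -> int:
        # Total number of batches; the last batch may be smaller (assumption).
        return math.ceil(len(self.samples) / self.batch_size)

    def __getitem__(self, index: int) -> List[int]:
        if not 0 <= index < len(self):
            raise IndexError(index)
        start = index * self.batch_size
        return self.samples[start : start + self.batch_size]


dataset = BatchedSamples(samples=range(10), batch_size=4)
batches = [dataset[i] for i in range(len(dataset))]
```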

index_input_tensor_zarr(pheno: str, indices: numpy.ndarray)

class deeprvat.deeprvat.train.MultiphenoBaggingData(data: Dict[str, Dict], train_proportion: float, sample_with_replacement: bool = True, upsampling_factor: int = 1, batch_size: Optional[int] = None, num_workers: Optional[int] = 0, pin_memory: bool = False, cache_tensors: bool = False, temp_dir: Optional[str] = None, chunksize: int = 1000, deterministic: bool = False)

Bases: pytorch_lightning.LightningDataModule

Preprocess the underlying dataframe, to then load it into a dataset object

Initialization

Initialize the MultiphenoBaggingData.

Parameters:
  • data (Dict[str, Dict]) – Underlying dataframe from which data is structured into batches.

  • train_proportion (float) – Proportion of the data assigned to the training split; the remainder forms the validation split.

  • sample_with_replacement (bool) – If True, a sample can be selected multiple times in one epoch. Defaults to True. (optional)

  • min_variant_count (int) – Minimum number of variants available for each gene. Defaults to 1. (optional)

  • upsampling_factor (int) – Factor by which to upsample the data; must be >= 1. Defaults to 1. (optional)

  • batch_size (Optional[int]) – Number of samples/individuals available in one batch. Defaults to None. (optional)

  • num_workers (Optional[int]) – Number of workers simultaneously putting data into RAM. Defaults to 0. (optional)

  • cache_tensors (bool) – Indicates if samples have been pre-loaded or need to be extracted from zarr. Defaults to False. (optional)

upsample() → numpy.ndarray

Does not currently work for multi-phenotype training; minor changes are needed to make it work again.

train_dataloader()

Training samples have already been selected; to structure them and load them as batches, they are packed into a dataset class, which is then wrapped by a data loader.

val_dataloader()

Validation samples have already been selected; to structure them and load them as batches, they are packed into a dataset class, which is then wrapped by a data loader.
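The roles of train_proportion and sample_with_replacement can be sketched as a bagging-style split. This is an illustration under assumptions, not the class's actual code: the helper name bagging_split, the seed argument, and the choice to validate on samples that were never drawn are hypothetical.

```python
import random
from typing import List, Tuple


def bagging_split(
    sample_indices: List[int],
    train_proportion: float,
    sample_with_replacement: bool = True,
    seed: int = 0,
) -> Tuple[List[int], List[int]]:
    """Hypothetical helper: draw a training bag, validate on the left-out samples."""
    rng = random.Random(seed)
    n_train = round(len(sample_indices) * train_proportion)
    if sample_with_replacement:
        # A sample may be drawn more than once.
        train = rng.choices(sample_indices, k=n_train)
    else:
        train = rng.sample(sample_indices, k=n_train)
    chosen = set(train)
    # Validation uses every sample that was never drawn for training (assumption).
    val = [i for i in sample_indices if i not in chosen]
    return sorted(train), val


train_idx, val_idx = bagging_split(list(range(100)), train_proportion=0.8)
```

With replacement, the training bag holds 80 draws but fewer distinct samples, so the validation set in this sketch contains at least 20 samples.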

deeprvat.deeprvat.train.run_bagging(config: Dict, data: Dict[str, Dict], log_dir: str, checkpoint_file: Optional[str] = None, trial: Optional[optuna.trial.Trial] = None, trial_id: Optional[int] = None, debug: bool = False, deterministic: bool = False) → Optional[float]

Main function called during training. Also used for trial pruning and sampling new parameters in Optuna.

Parameters:
  • config (Dict) – Dictionary containing configuration parameters, built from a YAML file

  • data (Dict[str, Dict]) – Dict of phenotypes, each containing a dict storing the underlying data.

  • log_dir (str) – Path to where logs are written.

  • checkpoint_file (Optional[str]) – Path to where the weights of the trained model should be saved. (optional)

  • trial (Optional[optuna.trial.Trial]) – Optuna object generated from the study. (optional)

  • trial_id (Optional[int]) – Current trial in range n_trials. (optional)

  • debug (bool) – Use a strongly reduced dataframe

  • deterministic (bool) – Set random seeds for reproducibility

Returns:

The average of the lowest scores across all loss metrics.

Return type:

Optional[float]
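Read literally, the return value takes the lowest score of each loss metric and averages those minima. The sketch below shows only that arithmetic; the helper name average_of_best_scores, the metric names, and the numbers are made up for illustration, and the real function may aggregate differently (for example, across bags).

```python
from statistics import mean
from typing import Dict, List


def average_of_best_scores(history: Dict[str, List[float]]) -> float:
    """Hypothetical helper: take each metric's lowest value, then average them."""
    best_per_metric = [min(values) for values in history.values()]
    return mean(best_per_metric)


# Made-up per-epoch validation losses for two hypothetical metrics.
history = {
    "val_mse": [0.9, 0.5, 0.7],
    "val_mae": [0.8, 0.75, 1.0],
}
score = average_of_best_scores(history)
```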

deeprvat.deeprvat.train.train(debug: bool, deterministic: bool, training_gene_file: Optional[str], n_trials: int, trial_id: Optional[int], sample_file: Optional[str], phenotype: Tuple[Tuple[str, str, str, str]], config_file: str, log_dir: str, hpopt_file: str)

Main function called during training. Also used for trial pruning and sampling new parameters in Optuna.

Parameters:
  • debug (bool) – Use a strongly reduced dataframe

  • training_gene_file (Optional[str]) – Path to a pickle file specifying on which genes training should be executed. (optional)

  • n_trials (int) – Number of trials to be performed by the given setting.

  • trial_id (Optional[int]) – Current trial in range n_trials. (optional)

  • sample_file (Optional[str]) – Path to a pickle file specifying which samples should be considered during training. (optional)

  • phenotype (Tuple[Tuple[str, str, str, str]]) – Array of phenotypes, each containing an array of paths where the underlying data is stored:
      - str: Phenotype name
      - str: Annotated gene variants as zarr file
      - str: Covariates of each sample as zarr file
      - str: Ground truth phenotypes as zarr file

  • config_file (str) – Path to a YAML file, which serves for configuration.

  • log_dir (str) – Path to where logs are stored.

  • hpopt_file (str) – Path to where a .db file should be created in which the results of hyperparameter optimization are stored.

Raises:

ValueError – If no phenotype option is specified.
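The shape of the phenotype argument, one 4-tuple per phenotype, can be sketched as follows. The helper index_phenotypes, the dictionary keys, and all phenotype names and paths are placeholders chosen for illustration; only the tuple layout and the ValueError on a missing phenotype come from the description above.

```python
from typing import Dict, Tuple

PhenotypeSpec = Tuple[str, str, str, str]


def index_phenotypes(phenotype: Tuple[PhenotypeSpec, ...]) -> Dict[str, Dict[str, str]]:
    """Hypothetical helper: key each phenotype's three file paths by its name."""
    if len(phenotype) == 0:
        raise ValueError("At least one phenotype must be specified")
    return {
        name: {"input_tensor": input_tensor, "covariates": covariates, "y": y}
        for name, input_tensor, covariates, y in phenotype
    }


# Placeholder phenotype names and paths, for illustration only.
phenotype = (
    ("LDL", "ldl/input_tensor.zarr", "ldl/covariates.zarr", "ldl/y.zarr"),
    ("HDL", "hdl/input_tensor.zarr", "hdl/covariates.zarr", "hdl/y.zarr"),
)
paths = index_phenotypes(phenotype)
```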

deeprvat.deeprvat.train.best_training_run(debug: bool, log_dir: str, checkpoint_dir: str, hpopt_db: str, config_file_out: str)

Function to extract the best trial from an Optuna study and handle associated model checkpoints and configurations.

Parameters:
  • debug (bool) – Use a strongly reduced dataframe

  • log_dir (str) – Path to where logs are stored.

  • checkpoint_dir (str) – Directory where checkpoints have been stored.

  • hpopt_db (str) – Path to the database file containing the Optuna study results.

  • config_file_out (str) – Path to store a reduced configuration file.

Returns:

None