`deeprvat.deeprvat.train`

Module Contents

Classes

`MultiphenoDataset`	Structures the data and exposes a `__getitem__` method that the dataloader uses to load batches into the model.
`MultiphenoBaggingData`	Preprocesses the underlying dataframe and loads it into a dataset object.

Functions

`cli`
`subset_samples`
`make_dataset_`	Subfunction of make_dataset(). Converts a dataset file to the sparse format used for training and testing associations.
`make_dataset`	Uses make_dataset_() to convert a dataset to the sparse format and stores the resulting data.
`run_bagging`	Main function called during training. Also used for trial pruning and sampling new parameters in Optuna.
`train`	Main function called during training. Also used for trial pruning and sampling new parameters in Optuna.
`best_training_run`	Function to extract the best trial from an Optuna study and handle associated model checkpoints and configurations.

Data

`logger`
`METRICS`
`OPTIMIZERS`
`ACTIVATIONS`
`DEFAULT_OPTIMIZER`

API

deeprvat.deeprvat.train.logger = 'getLogger(...)'

deeprvat.deeprvat.train.METRICS = None

deeprvat.deeprvat.train.OPTIMIZERS = None

deeprvat.deeprvat.train.ACTIVATIONS = None

deeprvat.deeprvat.train.DEFAULT_OPTIMIZER = None

deeprvat.deeprvat.train.cli()

deeprvat.deeprvat.train.subset_samples(input_tensor: torch.Tensor, covariates: torch.Tensor, y: torch.Tensor, min_variant_count: int) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor]

deeprvat.deeprvat.train.make_dataset_(debug: bool, pickle_only: bool, compression_level: int, training_dataset_file: Optional[str], config_file: Union[str, pathlib.Path], input_tensor_out_file: str, covariates_out_file: str, y_out_file: str)

Subfunction of make_dataset(). Converts a dataset file to the sparse format used for training and testing associations.

Parameters:

config (Dict) – Dictionary containing configuration parameters, built from a YAML file
debug (bool) – Use a strongly reduced dataframe (optional)
training_dataset_file (str) – Path to the file in which training data is stored. (optional)
pickle_only (bool) – If True, only store dataset as pickle file and return None. (optional)

Returns:

Tuple containing input_tensor, covariates, and target values.

Return type:

Tuple[torch.Tensor, torch.Tensor, torch.Tensor]

deeprvat.deeprvat.train.make_dataset(debug: bool, pickle_only: bool, compression_level: int, training_dataset_file: Optional[str], config_file: str, input_tensor_out_file: str, covariates_out_file: str, y_out_file: str)

Uses make_dataset_() to convert a dataset to the sparse format and stores the resulting data.

Parameters:

debug (bool) – Use a strongly reduced dataframe
pickle_only (bool) – If True, only save the data using pickle.
compression_level (int) – Level of Zarr compression to apply to the training data.
training_dataset_file (Optional[str]) – Path to the file in which training data is stored. (optional)
config_file (str) – Path to a YAML file, which serves for configuration.
input_tensor_out_file (str) – Path to save the training data to.
covariates_out_file (str) – Path to save the covariates to.
y_out_file (str) – Path to save the ground truth data to.

Returns:

None

class deeprvat.deeprvat.train.MultiphenoDataset(data: Dict[str, Dict], batch_size: int, split: str = 'train', cache_tensors: bool = False, temp_dir: Optional[str] = None, chunksize: int = 1000)

Bases: torch.utils.data.Dataset

Structures the data and exposes a `__getitem__` method that the dataloader uses to load batches into the model.

Initialization

Initialize the MultiphenoDataset.

Parameters:

data (Dict[str, Dict]) – Underlying dataframe from which data is structured into batches.
min_variant_count (int) – Minimum number of variants available for each gene.
batch_size (int) – Number of samples/individuals available in one batch.
split (str) – Contains a prefix indicating the dataset the model operates on. Defaults to “train”. (optional)
cache_tensors (bool) – Indicates if samples have been pre-loaded or need to be extracted from zarr. (optional)

__len__(): Denotes the total number of batches

__getitem__(index): Generates one batch of data

index_input_tensor_zarr(pheno: str, indices: numpy.ndarray)
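As a rough illustration of what `__len__` and `__getitem__` provide to the dataloader, the sketch below batches a plain list of sample indices. It is not the DeepRVAT implementation: the class name `ToyBatchDataset` is hypothetical, and the sketch assumes that a smaller final batch is kept, which the documentation above does not state.

```python
import math
from typing import List


class ToyBatchDataset:
    """Hypothetical stand-in that hands out whole batches, not single samples."""

    def __init__(self, n_samples: int, batch_size: int) -> None:
        self.samples = list(range(n_samples))
        self.batch_size = batch_size

    def __len__(self) -> int:
        # Total number of batches; assumes a partial final batch is kept.
        return math.ceil(len(self.samples) / self.batch_size)

    def __getitem__(self, index: int) -> List[int]:
        # One batch of data: a contiguous slice of sample indices.
        if not 0 <= index < len(self):
            raise IndexError(index)
        start = index * self.batch_size
        return self.samples[start:start + self.batch_size]


dataset = ToyBatchDataset(n_samples=10, batch_size=4)
batches = [dataset[i] for i in range(len(dataset))]
```

Because each item is already a full batch, a dataloader wrapping such a dataset would typically be configured so that it does not batch again.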

class deeprvat.deeprvat.train.MultiphenoBaggingData(data: Dict[str, Dict], train_proportion: float, sample_with_replacement: bool = True, upsampling_factor: int = 1, batch_size: Optional[int] = None, num_workers: Optional[int] = 0, pin_memory: bool = False, cache_tensors: bool = False, temp_dir: Optional[str] = None, chunksize: int = 1000, deterministic: bool = False)

Bases: pytorch_lightning.LightningDataModule

Preprocesses the underlying dataframe and loads it into a dataset object.

Initialization

Initialize the MultiphenoBaggingData.

Parameters:

data (Dict[str, Dict]) – Underlying dataframe from which data is structured into batches.
train_proportion (float) – Proportion of the data assigned to the training split; the remainder forms the validation split.
sample_with_replacement (bool) – If True, a sample can be selected multiple times in one epoch. Defaults to True. (optional)
min_variant_count (int) – Minimum number of variants available for each gene. Defaults to 1. (optional)
upsampling_factor (int) – Factor by which to upsample the data; must be >= 1. Defaults to 1. (optional)
batch_size (Optional[int]) – Number of samples/individuals available in one batch. Defaults to None. (optional)
num_workers (Optional[int]) – Number of workers simultaneously putting data into RAM. Defaults to 0. (optional)
cache_tensors (bool) – Indicates if samples have been pre-loaded or need to be extracted from zarr. Defaults to False. (optional)
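The roles of `train_proportion` and `sample_with_replacement` can be sketched with the standard library alone. This is a simplified assumption about the behaviour, not the module's code: the helper `split_and_sample` and its `seed` argument are hypothetical, and `train_proportion` is assumed to be a fraction between 0 and 1.

```python
import random
from typing import List, Tuple


def split_and_sample(
    n_samples: int,
    train_proportion: float,
    sample_with_replacement: bool = True,
    seed: int = 0,
) -> Tuple[List[int], List[int]]:
    """Hypothetical helper: split sample indices, then draw the training bag."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)

    # Assumed split rule: the first share goes to training, the rest to validation.
    n_train = round(n_samples * train_proportion)
    train_pool, val = indices[:n_train], indices[n_train:]

    if sample_with_replacement:
        # Bootstrap: a sample may appear several times in the training bag.
        train = rng.choices(train_pool, k=n_train)
    else:
        train = train_pool
    return train, val


train_idx, val_idx = split_and_sample(100, train_proportion=0.8)
```

Drawing the bag only from the training pool keeps the validation samples unseen during training, whichever sampling mode is chosen.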

upsample() → numpy.ndarray: Does not currently work for multi-phenotype training; minor changes are needed to make it work again.

train_dataloader(): The training samples have already been selected; to structure them and load them in batches, they are packed into a dataset class, which is then wrapped by a dataloader object.

val_dataloader(): The validation samples have already been selected; to structure them and load them in batches, they are packed into a dataset class, which is then wrapped by a dataloader object.

deeprvat.deeprvat.train.run_bagging(config: Dict, data: Dict[str, Dict], log_dir: str, checkpoint_file: Optional[str] = None, trial: Optional[optuna.trial.Trial] = None, trial_id: Optional[int] = None, debug: bool = False, deterministic: bool = False) → Optional[float]

Main function called during training. Also used for trial pruning and sampling new parameters in Optuna.

Parameters:

config (Dict) – Dictionary containing configuration parameters, built from a YAML file
data (Dict[str, Dict]) – Dict of phenotypes, each containing a dict storing the underlying data.
log_dir (str) – Path to where logs are written.
checkpoint_file (Optional[str]) – Path to where the weights of the trained model should be saved. (optional)
trial (Optional[optuna.trial.Trial]) – Optuna object generated from the study. (optional)
trial_id (Optional[int]) – Current trial in range n_trials. (optional)
debug (bool) – Use a strongly reduced dataframe
deterministic (bool) – Set random seeds for reproducibility

Returns:

Optional[float]: computes the lowest scores of all loss metrics and returns their average

Return type:

Optional[float]
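The return value described above, the average of each loss metric's lowest score, can be illustrated as follows. The names `history` and `average_of_best_scores` are hypothetical, and tracking one score per epoch for each metric is an assumption made for this sketch.

```python
from statistics import mean
from typing import Dict, List


def average_of_best_scores(history: Dict[str, List[float]]) -> float:
    """Take the minimum of each loss metric, then average those minima."""
    best_per_metric = [min(scores) for scores in history.values()]
    return mean(best_per_metric)


# Hypothetical per-epoch validation scores for two loss metrics.
history = {
    "metric_a": [0.9, 0.5, 0.7],
    "metric_b": [1.5, 1.0, 1.25],
}
result = average_of_best_scores(history)
```

A single scalar of this kind is what a hyperparameter search needs as its objective value when several loss metrics are tracked at once.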

deeprvat.deeprvat.train.train(debug: bool, deterministic: bool, training_gene_file: Optional[str], n_trials: int, trial_id: Optional[int], sample_file: Optional[str], phenotype: Tuple[Tuple[str, str, str, str]], config_file: str, log_dir: str, hpopt_file: str)

Main function called during training. Also used for trial pruning and sampling new parameters in Optuna.

Parameters:

debug (bool) – Use a strongly reduced dataframe
training_gene_file (Optional[str]) – Path to a pickle file specifying on which genes training should be executed. (optional)
n_trials (int) – Number of trials to be performed by the given setting.
trial_id (Optional[int]) – Current trial in range n_trials. (optional)
sample_file (Optional[str]) – Path to a pickle file specifying which samples should be considered during training. (optional)
phenotype (Tuple[Tuple[str, str, str, str]]) – Array of phenotypes, each containing the phenotype name and the paths where the underlying data is stored:
- str: Phenotype name
- str: Annotated gene variants as zarr file
- str: Covariates of each sample as zarr file
- str: Ground truth phenotypes as zarr file
config_file (str) – Path to a YAML file, which serves for configuration.
log_dir (str) – Path to where logs are stored.
hpopt_file (str) – Path to where a .db file should be created in which the results of hyperparameter optimization are stored.

Raises:

ValueError – If no phenotype option is specified.
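To make the shape of the `phenotype` argument concrete, the sketch below turns such 4-tuples into a nested dictionary and raises `ValueError` when none are given. The helper `collect_phenotypes`, the dictionary keys, and the paths are all hypothetical placeholders; the actual internal layout used by `train` is not documented here.

```python
from typing import Dict, Tuple

PhenotypeSpec = Tuple[str, str, str, str]


def collect_phenotypes(
    phenotype: Tuple[PhenotypeSpec, ...],
) -> Dict[str, Dict[str, str]]:
    """Hypothetical helper: map each phenotype name to its three zarr paths."""
    if len(phenotype) == 0:
        raise ValueError("At least one phenotype option must be specified")
    return {
        name: {
            "input_tensor_file": input_tensor,  # annotated gene variants
            "covariates_file": covariates,      # covariates of each sample
            "y_file": y,                        # ground truth phenotypes
        }
        for name, input_tensor, covariates, y in phenotype
    }


# Placeholder phenotype name and paths, for illustration only.
specs = (
    ("pheno_a", "pheno_a/input_tensor.zarr", "pheno_a/covariates.zarr", "pheno_a/y.zarr"),
)
files = collect_phenotypes(specs)
```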

deeprvat.deeprvat.train.best_training_run(debug: bool, log_dir: str, checkpoint_dir: str, hpopt_db: str, config_file_out: str)

Function to extract the best trial from an Optuna study and handle associated model checkpoints and configurations.

Parameters:

debug (bool) – Use a strongly reduced dataframe
log_dir (str) – Path to where logs are stored.
checkpoint_dir (str) – Directory where checkpoints have been stored.
hpopt_db (str) – Path to the database file containing the Optuna study results.
config_file_out (str) – Path to store a reduced configuration file.

Returns:

None
