deeprvat.deeprvat.train
Module Contents
Classes
MultiphenoDataset | Class used to structure the data and present a __getitem__ function to the dataloader, that will be used to load batches into the model
MultiphenoBaggingData | Preprocess the underlying dataframe, to then load it into a dataset object
Functions
make_dataset_ | Subfunction of make_dataset(). Converts a dataset file to the sparse format used for training and testing associations
make_dataset | Uses make_dataset_() to convert the dataset to sparse format and stores the resulting data
run_bagging | Main function called during training. Also used for trial pruning and sampling new parameters in Optuna.
train | Main function called during training. Also used for trial pruning and sampling new parameters in Optuna.
best_training_run | Function to extract the best trial from an Optuna study and handle associated model checkpoints and configurations.
Data
API
- deeprvat.deeprvat.train.logger = 'getLogger(...)'
- deeprvat.deeprvat.train.METRICS = None
- deeprvat.deeprvat.train.OPTIMIZERS = None
- deeprvat.deeprvat.train.ACTIVATIONS = None
- deeprvat.deeprvat.train.DEFAULT_OPTIMIZER = None
- deeprvat.deeprvat.train.cli()
- deeprvat.deeprvat.train.subset_samples(input_tensor: torch.Tensor, covariates: torch.Tensor, y: torch.Tensor, min_variant_count: int) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor]
- deeprvat.deeprvat.train.make_dataset_(debug: bool, pickle_only: bool, compression_level: int, training_dataset_file: Optional[str], config_file: Union[str, pathlib.Path], input_tensor_out_file: str, covariates_out_file: str, y_out_file: str)
Subfunction of make_dataset(). Converts a dataset file to the sparse format used for training and testing associations.
- Parameters:
config (Dict) – Dictionary containing configuration parameters, built from a YAML file
debug (bool) – Use a strongly reduced dataframe (optional)
training_dataset_file (str) – Path to the file in which training data is stored. (optional)
pickle_only (bool) – If True, only store dataset as pickle file and return None. (optional)
- Returns:
Tuple containing input_tensor, covariates, and target values.
- Return type:
Tuple[torch.Tensor, torch.Tensor, torch.Tensor]
- deeprvat.deeprvat.train.make_dataset(debug: bool, pickle_only: bool, compression_level: int, training_dataset_file: Optional[str], config_file: str, input_tensor_out_file: str, covariates_out_file: str, y_out_file: str)
Uses make_dataset_() to convert the dataset to sparse format and stores the resulting data.
- Parameters:
debug (bool) – Use a strongly reduced dataframe
pickle_only (bool) – Flag to indicate whether only to save data using pickle
compression_level (int) – Level of compression in ZARR to be applied to training data.
training_dataset_file (Optional[str]) – Path to the file in which training data is stored. (optional)
config_file (str) – Path to a YAML file, which serves for configuration.
input_tensor_out_file (str) – Path to save the training data to.
covariates_out_file (str) – Path to save the covariates to.
y_out_file (str) – Path to save the ground truth data to.
- Returns:
None
- class deeprvat.deeprvat.train.MultiphenoDataset(data: Dict[str, Dict], batch_size: int, split: str = 'train', cache_tensors: bool = False, temp_dir: Optional[str] = None, chunksize: int = 1000)
Bases: torch.utils.data.Dataset
Class used to structure the data and present a __getitem__ function to the dataloader, which uses it to load batches into the model.
Initialization
Initialize the MultiphenoDataset.
- Parameters:
data (Dict[str, Dict]) – Underlying dataframe from which data is structured into batches.
min_variant_count (int) – Minimum number of variants available for each gene.
batch_size (int) – Number of samples/individuals available in one batch.
split (str) – Contains a prefix indicating the dataset the model operates on. Defaults to “train”. (optional)
cache_tensors (bool) – Indicates if samples have been pre-loaded or need to be extracted from zarr. (optional)
- __len__()
Denotes the total number of batches
- __getitem__(index)
Generates one batch of data
- index_input_tensor_zarr(pheno: str, indices: numpy.ndarray)
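The batch-oriented indexing described for __len__() and __getitem__() can be illustrated with a minimal, dependency-free sketch. The class name BatchedSamples and the handling of a trailing partial batch are assumptions made for illustration, and plain Python lists stand in for the tensors that MultiphenoDataset serves; this is not the library's implementation.

```python
import math
from typing import List, Sequence


class BatchedSamples:
    """Hypothetical stand-in: each index addresses a whole batch, not one sample."""

    def __init__(self, samples: Sequence[int], batch_size: int) -> None:
        if batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        self.samples = list(samples)
        self.batch_size = batch_size

    def __len__(self) -> int:
        # Total number of batches; a trailing partial batch counts as one (assumption).
        return math.ceil(len(self.samples) / self.batch_size)

    def __getitem__(self, index: int) -> List[int]:
        # Return the slice of samples that makes up batch number `index`.
        if not 0 <= index < len(self):
            raise IndexError(index)
        start = index * self.batch_size
        return self.samples[start:start + self.batch_size]


dataset = BatchedSamples(samples=range(10), batch_size=4)
batches = [dataset[i] for i in range(len(dataset))]
```

When a dataset already returns full batches like this, a wrapping torch.utils.data.DataLoader is typically created with batch_size=None so that it does not batch a second time.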
- class deeprvat.deeprvat.train.MultiphenoBaggingData(data: Dict[str, Dict], train_proportion: float, sample_with_replacement: bool = True, upsampling_factor: int = 1, batch_size: Optional[int] = None, num_workers: Optional[int] = 0, pin_memory: bool = False, cache_tensors: bool = False, temp_dir: Optional[str] = None, chunksize: int = 1000, deterministic: bool = False)
Bases: pytorch_lightning.LightningDataModule
Preprocess the underlying dataframe, to then load it into a dataset object.
Initialization
Initialize the MultiphenoBaggingData.
- Parameters:
data (Dict[str, Dict]) – Underlying dataframe from which data is structured into batches.
train_proportion (float) – Proportion by which the data is divided into training and validation splits.
sample_with_replacement (bool) – If True, a sample can be selected multiple times in one epoch. Defaults to True. (optional)
min_variant_count (int) – Minimum number of variants available for each gene. Defaults to 1. (optional)
upsampling_factor (int) – Factor by which to upsample the data; must be >= 1. Defaults to 1. (optional)
batch_size (Optional[int]) – Number of samples/individuals available in one batch. Defaults to None. (optional)
num_workers (Optional[int]) – Number of workers simultaneously putting data into RAM. Defaults to 0. (optional)
cache_tensors (bool) – Indicates if samples have been pre-loaded or need to be extracted from zarr. Defaults to False. (optional)
- upsample() → numpy.ndarray
Does not currently work for multi-phenotype training; minor changes are needed to make it work again.
- train_dataloader()
Training samples have already been selected; to structure them and load them in batches, they are packed into a dataset class, which is then wrapped by a dataloader object.
- val_dataloader()
Validation samples have already been selected; to structure them and load them in batches, they are packed into a dataset class, which is then wrapped by a dataloader object.
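The roles of train_proportion and sample_with_replacement can be sketched as follows. The helper name bagging_split, the seed argument, and the choice to validate on every sample that was never drawn for training are assumptions made for illustration; the real class may select its splits differently.

```python
import random
from typing import List, Tuple


def bagging_split(
    n_samples: int,
    train_proportion: float,
    sample_with_replacement: bool = True,
    seed: int = 0,
) -> Tuple[List[int], List[int]]:
    """Hypothetical helper: draw training indices, keep the rest for validation."""
    rng = random.Random(seed)
    population = range(n_samples)
    n_train = round(n_samples * train_proportion)
    if sample_with_replacement:
        # Bootstrap draw: the same sample may be selected more than once.
        train = sorted(rng.choices(population, k=n_train))
    else:
        # Each sample can be selected at most once.
        train = sorted(rng.sample(population, k=n_train))
    # Assumption: validation uses every sample that was never drawn for training.
    drawn = set(train)
    val = [i for i in population if i not in drawn]
    return train, val


train_idx, val_idx = bagging_split(n_samples=100, train_proportion=0.8)
```

With replacement, the training draw contains repeats, so the validation set is usually larger than the nominal remainder (here, more than 20 samples).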
- deeprvat.deeprvat.train.run_bagging(config: Dict, data: Dict[str, Dict], log_dir: str, checkpoint_file: Optional[str] = None, trial: Optional[optuna.trial.Trial] = None, trial_id: Optional[int] = None, debug: bool = False, deterministic: bool = False) → Optional[float]
Main function called during training. Also used for trial pruning and sampling new parameters in Optuna.
- Parameters:
config (Dict) – Dictionary containing configuration parameters, built from a YAML file
data (Dict[str, Dict]) – Dict of phenotypes, each containing a dict storing the underlying data.
log_dir (str) – Path to where logs are written.
checkpoint_file (Optional[str]) – Path to where the weights of the trained model should be saved. (optional)
trial (Optional[optuna.trial.Trial]) – Optuna object generated from the study. (optional)
trial_id (Optional[int]) – Current trial in range n_trials. (optional)
debug (bool) – Use a strongly reduced dataframe
deterministic (bool) – Set random seeds for reproducibility
- Returns:
The average of the lowest scores of all loss metrics.
- Return type:
Optional[float]
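The return value described above (the lowest score of each loss metric, averaged) can be written out as a short sketch. The helper name average_of_best_scores, the metric names, and the numbers are hypothetical; how run_bagging actually collects its metrics is not shown here.

```python
from statistics import fmean
from typing import Dict, List


def average_of_best_scores(history: Dict[str, List[float]]) -> float:
    """Hypothetical helper: take each metric's lowest value, then average them."""
    if not history:
        raise ValueError("history must contain at least one metric")
    best_per_metric = [min(values) for values in history.values()]
    return fmean(best_per_metric)


# Illustrative loss histories, one list of recorded values per metric.
history = {
    "metric_a": [1.0, 0.5, 0.75],
    "metric_b": [0.5, 0.25],
}
score = average_of_best_scores(history)  # (0.5 + 0.25) / 2
```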
- deeprvat.deeprvat.train.train(debug: bool, deterministic: bool, training_gene_file: Optional[str], n_trials: int, trial_id: Optional[int], sample_file: Optional[str], phenotype: Tuple[Tuple[str, str, str, str]], config_file: str, log_dir: str, hpopt_file: str)
Main function called during training. Also used for trial pruning and sampling new parameters in Optuna.
- Parameters:
debug (bool) – Use a strongly reduced dataframe
training_gene_file (Optional[str]) – Path to a pickle file specifying on which genes training should be executed. (optional)
n_trials (int) – Number of trials to be performed by the given setting.
trial_id (Optional[int]) – Current trial in range n_trials. (optional)
sample_file (Optional[str]) – Path to a pickle file specifying which samples should be considered during training. (optional)
phenotype (Tuple[Tuple[str, str, str, str]]) – Array of phenotypes, each given as a name followed by the paths where the underlying data is stored:
- str: Phenotype name
- str: Annotated gene variants as zarr file
- str: Covariates of each sample as zarr file
- str: Ground truth phenotypes as zarr file
config_file (str) – Path to a YAML file, which serves for configuration.
log_dir (str) – Path to where logs are stored.
hpopt_file (str) – Path to where a .db file should be created in which the results of hyperparameter optimization are stored.
- Raises:
ValueError – If no phenotype option is specified.
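The shape of the phenotype argument, and the ValueError raised when it is empty, can be illustrated with a small sketch. The helper name collect_phenotype_paths, the dictionary keys, and the file names are hypothetical and chosen only to mirror the four documented fields.

```python
from typing import Dict, Sequence, Tuple

# (phenotype name, input tensor zarr, covariates zarr, ground-truth zarr)
PhenotypeSpec = Tuple[str, str, str, str]


def collect_phenotype_paths(
    phenotype: Sequence[PhenotypeSpec],
) -> Dict[str, Dict[str, str]]:
    """Hypothetical helper: map each phenotype name to its three input paths."""
    if len(phenotype) == 0:
        raise ValueError("At least one phenotype must be specified")
    paths: Dict[str, Dict[str, str]] = {}
    for name, input_tensor_file, covariates_file, y_file in phenotype:
        paths[name] = {
            "input_tensor_file": input_tensor_file,
            "covariates_file": covariates_file,
            "y_file": y_file,
        }
    return paths


# Placeholder phenotype name and file names, for illustration only.
specs = (
    ("pheno_a", "pheno_a/input_tensor.zarr", "pheno_a/covariates.zarr", "pheno_a/y.zarr"),
)
paths = collect_phenotype_paths(specs)
```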
- deeprvat.deeprvat.train.best_training_run(debug: bool, log_dir: str, checkpoint_dir: str, hpopt_db: str, config_file_out: str)
Function to extract the best trial from an Optuna study and handle associated model checkpoints and configurations.
- Parameters:
debug (bool) – Use a strongly reduced dataframe
log_dir (str) – Path to where logs are stored.
checkpoint_dir (str) – Directory where checkpoints have been stored.
hpopt_db (str) – Path to the database file containing the Optuna study results.
config_file_out (str) – Path to store a reduced configuration file.
- Returns:
None
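Selecting the best trial can be sketched without Optuna. TrialRecord and select_best_trial are hypothetical names, and the optimization direction is left as a parameter because it is not stated here. Checkpoint and configuration handling is omitted.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class TrialRecord:
    """Hypothetical, simplified view of one trial of a study."""
    number: int
    value: Optional[float]  # None if the trial was pruned or failed
    params: Dict[str, float]


def select_best_trial(trials: List[TrialRecord], minimize: bool = True) -> TrialRecord:
    """Return the completed trial with the best objective value."""
    completed = [t for t in trials if t.value is not None]
    if not completed:
        raise ValueError("no completed trials to choose from")
    pick = min if minimize else max
    return pick(completed, key=lambda t: t.value)


# Illustrative trial results; the parameter name "lr" is a placeholder.
trials = [
    TrialRecord(0, 0.5, {"lr": 0.01}),
    TrialRecord(1, None, {"lr": 0.1}),
    TrialRecord(2, 0.25, {"lr": 0.001}),
]
best = select_best_trial(trials)
```

With Optuna itself, a loaded study exposes the corresponding trial through its best_trial attribute.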