deeprvat.deeprvat.associate

Module Contents

Functions

cli

get_burden

Compute burden scores for rare variants.

separate_parallel_results

Separate results from running regression on each gene.

make_dataset_

Create a dataset based on the configuration.

make_dataset

Create a dataset based on the provided configuration and save to a pickle file.

compute_xy_

Compute sample IDs, covariates x, and target phenotypes y from a dataset.

compute_xy

Compute sample IDs, covariates x, and target phenotypes y for the provided dataset.

make_regenie_input_

make_regenie_input

convert_regenie_output_

convert_regenie_output

load_one_model

Load a single burden score computation model from a checkpoint file.

reverse_models

Determine if the burden score computation PyTorch model should reverse the output based on PLOF annotations.

load_models

Load models from multiple checkpoints for multiple repeats.

compute_burdens_

Compute burdens using the PyTorch model for each repeat.

compute_burdens

Compute burdens based on the provided model and dataset.

combine_burden_chunks

combine_burden_chunks_

regress_on_gene_scoretest

Perform regression on a gene using the score test.

regress_on_gene

Perform regression on a gene using Ordinary Least Squares (OLS).

regress_

Perform regression on multiple genes.

regress

Perform regression analysis.

combine_regression_results

Combine multiple regression result files.

average_burdens

regress_common

regress_common_

Data

logger

PLOF_COLS

AGG_FCT

API

deeprvat.deeprvat.associate.logger = 'getLogger(...)'
deeprvat.deeprvat.associate.PLOF_COLS = ['Consequence_stop_gained', 'Consequence_frameshift_variant', 'Consequence_stop_lost', 'Consequence_...
deeprvat.deeprvat.associate.AGG_FCT = None
deeprvat.deeprvat.associate.cli()
deeprvat.deeprvat.associate.get_burden(batch: Dict, agg_models: Dict[str, List[torch.nn.Module]], device: torch.device = torch.device('cpu')) → Tuple[torch.Tensor, torch.Tensor]

Compute burden scores for rare variants.

Parameters:
  • batch (Dict) – A dictionary containing batched data from the DataLoader.

  • agg_models (Dict[str, List[nn.Module]]) – Loaded PyTorch model(s) for each repeat used for burden computation. Each key in the dictionary corresponds to a respective repeat.

  • device (torch.device) – Device to perform computations on, defaults to “cpu”.

  • skip_burdens (bool) – Flag to skip burden computation, defaults to False.

Returns:

Tuple containing burden scores, target y phenotype values, x phenotypes and sample ids.

Return type:

Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]

Note

Checkpoint models all corresponding to the same repeat are averaged for that repeat.
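The averaging described in the note can be sketched as follows. This is a minimal NumPy illustration, not the module's code: the PyTorch models are stood in by plain callables, and average_per_repeat is a hypothetical helper name.

```python
from typing import Callable, Dict, List

import numpy as np


def average_per_repeat(
    x: np.ndarray,
    agg_models: Dict[str, List[Callable[[np.ndarray], np.ndarray]]],
) -> np.ndarray:
    """Average the outputs of all models that belong to the same repeat.

    Returns an array of shape (n_samples, n_repeats).
    """
    per_repeat = []
    for models in agg_models.values():
        # Stack the outputs of every checkpoint model for this repeat
        # and average over the checkpoint axis.
        outputs = np.stack([model(x) for model in models], axis=0)
        per_repeat.append(outputs.mean(axis=0))
    return np.stack(per_repeat, axis=-1)


# Hypothetical stand-ins for two checkpoints of a single repeat.
models = {"repeat_0": [lambda x: x.sum(axis=1), lambda x: 3 * x.sum(axis=1)]}
x = np.ones((4, 2))
burdens = average_per_repeat(x, models)
```

Each key of the dictionary contributes one column to the result, which mirrors the one-key-per-repeat layout of agg_models described above.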

deeprvat.deeprvat.associate.separate_parallel_results(results: List) → Tuple[List, ...]

Separate results from running regression on each gene.

Parameters:

results (List) – List of results obtained from regression analysis.

Returns:

Tuple of lists containing separated results of regressed_genes, betas, and pvals.

Return type:

Tuple[List, …]
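One way to picture the separation is a transpose of the per-gene result tuples. The sketch below assumes each entry of results is a (genes, betas, pvals) tuple as returned by the per-gene regression functions; separate_results is a hypothetical name, and the gene names and values are made up for illustration.

```python
from typing import List, Tuple


def separate_results(
    results: List[Tuple[List[str], List[float], List[float]]],
) -> Tuple[List, ...]:
    """Split per-gene result tuples into three parallel lists."""
    # zip(*results) transposes [(genes, betas, pvals), ...] into
    # (all_genes, all_betas, all_pvals).
    return tuple(list(column) for column in zip(*results))


results = [(["GENE_A"], [0.12], [0.03]), (["GENE_B"], [-0.40], [0.0005])]
regressed_genes, betas, pvals = separate_results(results)
```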

deeprvat.deeprvat.associate.make_dataset_(config: Dict, debug: bool = False, data_key: str = 'association_testing_data', skip_genotypes: bool = False, samples: Optional[List[int]] = None) → torch.utils.data.Dataset

Create a dataset based on the configuration.

Parameters:
  • config (Dict) – Configuration dictionary.

  • debug (bool) – Flag for debugging, defaults to False.

  • data_key (str) – Key for dataset configuration in the config dictionary, defaults to “association_testing_data”.

  • skip_genotypes (bool) – Retrieve only covariates and phenotypes, not genotypes, defaults to False.

  • samples (List[int]) – List of sample indices to include in the dataset, defaults to None.

Returns:

Loaded instance of the created dataset.

Return type:

Dataset

deeprvat.deeprvat.associate.make_dataset(debug: bool, data_key: str, skip_genotypes: bool, config_file: str, out_file: str)

Create a dataset based on the provided configuration and save to a pickle file.

Parameters:
  • debug (bool) – Flag for debugging.

  • data_key (str) – Key for dataset configuration in the config dictionary, defaults to “association_testing_data”.

  • skip_genotypes (bool) – Retrieve only covariates and phenotypes, not genotypes.

  • config_file (str) – Path to the configuration file.

  • out_file (str) – Path to the output file.

Returns:

Created dataset saved to out_file.pkl
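Saving to and restoring from a pickle file can be sketched with the standard library alone. The dictionary below is a hypothetical stand-in for the real dataset object, and the temporary directory is used only to keep the example self-contained.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical stand-in for the dataset object built from the configuration.
dataset = {"samples": [0, 1, 2]}

with tempfile.TemporaryDirectory() as tmp:
    out_file = Path(tmp) / "association_dataset.pkl"
    # Write the dataset to disk ...
    with open(out_file, "wb") as f:
        pickle.dump(dataset, f)
    # ... and read it back, as a later step would.
    with open(out_file, "rb") as f:
        restored = pickle.load(f)
```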

deeprvat.deeprvat.associate.compute_xy_(config: Dict, ds: torch.utils.data.Dataset, data_key='association_testing_data') → Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray]

Compute sample IDs, covariates x, and target phenotypes y from a dataset.

Parameters:
  • config (Dict) – Configuration dictionary.

  • ds (torch.utils.data.Dataset) – Torch dataset.

  • data_key (str) – Key for dataset configuration in the config dictionary, defaults to “association_testing_data”.

Returns:

Tuple containing sample IDs, covariates x, and target phenotypes y

Return type:

Tuple[np.ndarray, np.ndarray, np.ndarray]

Note

Checkpoint models all corresponding to the same repeat are averaged for that repeat.

deeprvat.deeprvat.associate.compute_xy(dataset_file: Optional[str], data_key: str, data_config_file: str, sample_file: pathlib.Path, x_file: pathlib.Path, y_file: pathlib.Path)

Compute sample IDs, covariates x, and target phenotypes y for the provided dataset.

Parameters:
  • dataset_file (Optional[str]) – Path to the dataset file, i.e., association_dataset.pkl.

  • data_key (str) – Key for dataset configuration in the config dictionary, defaults to “association_testing_data”.

  • data_config_file (str) – Path to the data configuration file.

  • out_dir (str) – Path to the output directory.

deeprvat.deeprvat.associate.make_regenie_input_(debug: bool, skip_covariates: bool, skip_phenotypes: bool, skip_burdens: bool, burdens_genes_samples: Optional[Tuple[pathlib.Path, pathlib.Path, pathlib.Path]], repeat: int, average_repeats: bool, phenotype: Tuple[Tuple[str, pathlib.Path, pathlib.Path, pathlib.Path]], sample_file: Optional[pathlib.Path], covariate_file: Optional[pathlib.Path], phenotype_file: Optional[pathlib.Path], bgen: Optional[pathlib.Path], gene_metadata_file: pathlib.Path, gtf: pathlib.Path)
deeprvat.deeprvat.associate.make_regenie_input(debug: bool, skip_covariates: bool, skip_phenotypes: bool, skip_burdens: bool, burdens_genes_samples: Optional[Tuple[pathlib.Path, pathlib.Path, pathlib.Path]], repeat: int, average_repeats: bool, phenotype: Tuple[Tuple[str, pathlib.Path, pathlib.Path]], sample_file: Optional[pathlib.Path], covariate_file: Optional[pathlib.Path], phenotype_file: Optional[pathlib.Path], bgen: Optional[pathlib.Path], gene_metadata_file: pathlib.Path, gtf: pathlib.Path)
deeprvat.deeprvat.associate.convert_regenie_output_(repeat: int, phenotype: Tuple[str, Tuple[pathlib.Path, pathlib.Path]], gene_file: pathlib.Path)
deeprvat.deeprvat.associate.convert_regenie_output(repeat: int, phenotype: Tuple[str, Tuple[pathlib.Path, pathlib.Path]], gene_file: pathlib.Path)
deeprvat.deeprvat.associate.load_one_model(config: Dict, checkpoint: str, device: torch.device = torch.device('cpu'))

Load a single burden score computation model from a checkpoint file.

Parameters:
  • config (Dict) – Configuration dictionary.

  • checkpoint (str) – Path to the model checkpoint file.

  • device (torch.device) – Device to load the model onto, defaults to “cpu”.

Returns:

Loaded PyTorch model for burden score computation.

Return type:

nn.Module

deeprvat.deeprvat.associate.reverse_models(model_config_file: str, data_config_file: str, checkpoint_files: Tuple[str])

Determine if the burden score computation PyTorch model should reverse the output based on PLOF annotations.

Parameters:
  • model_config_file (str) – Path to the model configuration file.

  • data_config_file (str) – Path to the data configuration file.

  • checkpoint_files (Tuple[str]) – Paths to checkpoint files.

Returns:

checkpoint.reverse file is created if the model should reverse the burden score output.

deeprvat.deeprvat.associate.load_models(config: Dict, checkpoint_files: Tuple[str], device: torch.device = torch.device('cpu')) → Dict[str, List[torch.nn.Module]]

Load models from multiple checkpoints for multiple repeats.

Parameters:
  • config (Dict) – Configuration dictionary.

  • checkpoint_files (Tuple[str]) – Paths to checkpoint files.

  • device (torch.device) – Device to load the models onto, defaults to “cpu”.

Returns:

Dictionary of loaded PyTorch models for burden score computation for each repeat.

Return type:

Dict[str, List[nn.Module]]

Examples:

>>> config = {"model": {"type": "MyModel", "config": {"param": "value"}}}
>>> checkpoint_files = ("checkpoint1.pth", "checkpoint2.pth")
>>> load_models(config, checkpoint_files)
{'repeat_0': [MyModel(), MyModel()]}
deeprvat.deeprvat.associate.compute_burdens_(debug: bool, config: Dict, ds: torch.utils.data.Dataset, cache_dir: str, agg_models: Dict[str, List[torch.nn.Module]], data_key: str = 'association_testing_data', n_chunks: Optional[int] = None, chunk: Optional[int] = None, device: torch.device = torch.device('cpu'), bottleneck: bool = False, compression_level: int = 1) → Tuple[numpy.ndarray, zarr.core.Array, zarr.core.Array]

Compute burdens using the PyTorch model for each repeat.

Parameters:
  • debug (bool) – Flag for debugging.

  • data_key (str) – Key for dataset configuration in the config dictionary, defaults to “association_testing_data”.

  • config (Dict) – Configuration dictionary.

  • ds (torch.utils.data.Dataset) – Torch dataset.

  • cache_dir (str) – Directory to cache zarr files of computed burdens, x phenotypes, and y phenotypes.

  • agg_models (Dict[str, List[nn.Module]]) – Loaded PyTorch model(s) for each repeat used for burden computation. Each key in the dictionary corresponds to a respective repeat.

  • n_chunks (Optional[int]) – Number of chunks to split data for processing, defaults to None.

  • chunk (Optional[int]) – Index of the chunk of data, defaults to None.

  • device (torch.device) – Device to perform computations on, defaults to “cpu”.

  • bottleneck (bool) – Flag to enable bottlenecking number of batches, defaults to False.

  • compression_level (int) – Blosc compressor compression level for zarr files, defaults to 1.

Returns:

Tuple containing genes, burdens, target y phenotypes, x phenotypes and sample ids.

Return type:

Tuple[np.ndarray, zarr.core.Array, zarr.core.Array, zarr.core.Array, zarr.core.Array]

Note

Checkpoint models all corresponding to the same repeat are averaged for that repeat.
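The interplay of n_chunks and chunk can be illustrated with a small index calculation. How the module actually partitions the data is not stated here, so the contiguous, equally sized split below is an assumption, and chunk_bounds is a hypothetical helper.

```python
import math
from typing import Optional, Tuple


def chunk_bounds(
    n_total: int, n_chunks: Optional[int], chunk: Optional[int]
) -> Tuple[int, int]:
    """Return the half-open [start, stop) range of items covered by one chunk."""
    if n_chunks is None or chunk is None:
        # No chunking requested: process everything.
        return 0, n_total
    if not 0 <= chunk < n_chunks:
        raise ValueError("chunk must satisfy 0 <= chunk < n_chunks")
    # Assumed scheme: contiguous chunks of equal length, the last one shorter.
    chunk_length = math.ceil(n_total / n_chunks)
    start = min(chunk * chunk_length, n_total)
    stop = min(start + chunk_length, n_total)
    return start, stop


bounds = [chunk_bounds(10, 3, c) for c in range(3)]
```

Under this scheme every item falls into exactly one chunk, so per-chunk outputs can later be concatenated in chunk order.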

deeprvat.deeprvat.associate.compute_burdens(debug: bool, bottleneck: bool, data_key: str, n_chunks: Optional[int], chunk: Optional[int], dataset_file: Optional[str], data_config_file: str, model_config_file: str, checkpoint_files: Tuple[str], out_dir: str)

Compute burdens based on the provided model and dataset.

Parameters:
  • debug (bool) – Flag for debugging.

  • bottleneck (bool) – Flag to enable bottlenecking number of batches.

  • data_key (str) – Key for dataset configuration in the config dictionary, defaults to “association_testing_data”.

  • n_chunks (Optional[int]) – Number of chunks to split data for processing, defaults to None.

  • chunk (Optional[int]) – Index of the chunk of data, defaults to None.

  • dataset_file (Optional[str]) – Path to the dataset file, i.e., association_dataset.pkl.

  • data_config_file (str) – Path to the data configuration file.

  • model_config_file (str) – Path to the model configuration file.

  • checkpoint_files (Tuple[str]) – Paths to model checkpoint files.

  • out_dir (str) – Path to the output directory.

Returns:

Corresponding genes, computed burdens, y phenotypes, x phenotypes and sample ids are saved in the out_dir.

Return type:

[np.ndarray], [zarr.core.Array], [zarr.core.Array], [zarr.core.Array], [zarr.core.Array]

Note

Checkpoint models all corresponding to the same repeat are averaged for that repeat.

deeprvat.deeprvat.associate.combine_burden_chunks(n_chunks: int, skip_burdens: bool, overwrite: bool, burdens_chunks_dir: pathlib.Path, result_dir: pathlib.Path)
deeprvat.deeprvat.associate.combine_burden_chunks_(n_chunks: int, burdens_chunks_dir: pathlib.Path, skip_burdens: bool, overwrite: bool, result_dir: pathlib.Path)
deeprvat.deeprvat.associate.regress_on_gene_scoretest(gene: str, burdens: numpy.ndarray, model_score) → Tuple[List[str], List[float], List[float]]

Perform regression on a gene using the score test.

Parameters:
  • gene (str) – Gene name.

  • burdens (np.ndarray) – Burden scores associated with the gene.

  • model_score (Any) – Model for score test.

Returns:

Tuple containing gene name, beta, and p-value.

Return type:

Tuple[List[str], List[float], List[float]]

deeprvat.deeprvat.associate.regress_on_gene(gene: str, X: numpy.ndarray, y: numpy.ndarray, x_pheno: numpy.ndarray, use_bias: bool, use_x_pheno: bool) → Tuple[List[str], List[float], List[float]]

Perform regression on a gene using Ordinary Least Squares (OLS).

Parameters:
  • gene (str) – Gene name.

  • X (np.ndarray) – Burden score data.

  • y (np.ndarray) – Y phenotype data.

  • x_pheno (np.ndarray) – X phenotype data.

  • use_bias (bool) – Flag to include bias term.

  • use_x_pheno (bool) – Flag to include x phenotype data in regression.

Returns:

Tuple containing gene name, beta, and p-value.

Return type:

Tuple[List[str], List[float], List[float]]
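The roles of use_bias and use_x_pheno are easiest to see in how the design matrix is assembled. The following is a minimal NumPy sketch of an OLS fit for a single gene, not the module's implementation: ols_on_gene is a hypothetical name, the data are simulated, and the p-value uses a normal approximation to the t distribution to avoid extra dependencies.

```python
import math
from typing import List, Tuple

import numpy as np


def ols_on_gene(
    gene: str,
    X: np.ndarray,
    y: np.ndarray,
    x_pheno: np.ndarray,
    use_bias: bool,
    use_x_pheno: bool,
) -> Tuple[List[str], List[float], List[float]]:
    """Fit y on the burden score (plus optional covariates and intercept)."""
    n = len(y)
    columns = [X.reshape(n, -1)]  # burden score comes first
    if use_x_pheno:
        columns.append(x_pheno.reshape(n, -1))  # covariates
    if use_bias:
        columns.append(np.ones((n, 1)))  # intercept
    design = np.hstack(columns)

    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    dof = n - design.shape[1]
    sigma2 = residuals @ residuals / dof
    cov = sigma2 * np.linalg.inv(design.T @ design)

    beta = float(coef[0])  # coefficient of the burden score
    z = beta / math.sqrt(cov[0, 0])
    # Two-sided p-value, normal approximation (assumption of this sketch).
    pval = math.erfc(abs(z) / math.sqrt(2))
    return [gene], [beta], [pval]


# Simulated data with a true burden effect of 0.5.
rng = np.random.default_rng(0)
n = 200
burden = rng.normal(size=n)
covariates = rng.normal(size=(n, 2))
y = 0.5 * burden + covariates @ np.array([1.0, -1.0]) + 0.1 * rng.normal(size=n)
genes, betas, pvals = ols_on_gene(
    "GENE_A", burden, y, covariates, use_bias=True, use_x_pheno=True
)
```

The return value follows the documented shape: three single-element lists holding the gene name, the burden coefficient, and its p-value.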

deeprvat.deeprvat.associate.regress_(config: Dict, use_bias: bool, burdens: numpy.ndarray, y: numpy.ndarray, gene_indices: numpy.ndarray, genes: pandas.Series, x_pheno: numpy.ndarray, use_x_pheno: bool = True, do_scoretest: bool = True) → pandas.DataFrame

Perform regression on multiple genes.

Parameters:
  • config (Dict) – Configuration dictionary.

  • use_bias (bool) – Flag to include bias term when performing OLS regression.

  • burdens (np.ndarray) – Burden score data.

  • y (np.ndarray) – Y phenotype data.

  • gene_indices (np.ndarray) – Indices of genes.

  • genes (pd.Series) – Gene names.

  • x_pheno (np.ndarray) – X phenotype data.

  • use_x_pheno (bool) – Flag to include x phenotype data when performing OLS regression, defaults to True.

  • do_scoretest (bool) – Flag to use the scoretest from SEAK, defaults to True.

Returns:

DataFrame containing regression results on all genes.

Return type:

pd.DataFrame

deeprvat.deeprvat.associate.regress(debug: bool, chunk: int, n_chunks: int, use_bias: bool, gene_file: str, config_file: str, xy_dir: str, burden_file: str, out_dir: str, do_scoretest: bool, sample_file: Optional[str])

Perform regression analysis.

Parameters:
  • debug (bool) – Flag for debugging.

  • chunk (int) – Index of the chunk of data, defaults to 0.

  • n_chunks (int) – Number of chunks to split data for processing, defaults to 1.

  • use_bias (bool) – Flag to include bias term when performing OLS regression.

  • gene_file (str) – Path to the gene file.

  • config_file (str) – Path to the configuration file.

  • xy_dir (str) – Path to the directory containing the x.zarr and y.zarr files.

  • burden_file (str) – Path to the burdens.zarr file.

  • out_dir (str) – Path to the output directory.

  • do_scoretest (bool) – Flag to use the scoretest from SEAK.

  • sample_file (Optional[str]) – Path to the sample file.

Returns:

Regression results saved to out_dir as “burden_associations_{chunk}.parquet”
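Because each chunk writes its own file, a later step needs the full list of per-chunk paths. A small sketch of building that list is shown below; it assumes chunk indices run from 0 to n_chunks - 1, and chunk_result_files and the "results" directory are hypothetical.

```python
from pathlib import Path
from typing import List


def chunk_result_files(out_dir: str, n_chunks: int) -> List[Path]:
    """List the expected per-chunk result files inside out_dir."""
    return [
        Path(out_dir) / f"burden_associations_{chunk}.parquet"
        for chunk in range(n_chunks)
    ]


files = chunk_result_files("results", 3)
```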

deeprvat.deeprvat.associate.combine_regression_results(result_files: Tuple[str], out_file: str, model_name: Optional[str])

Combine multiple regression result files.

Parameters:
  • result_files (Tuple[str]) – List of paths to regression result files.

  • out_file (str) – Path to the output file.

  • model_name (Optional[str]) – Name of the regression model.

Returns:

Concatenated regression results saved to a parquet file.
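The concatenation step can be sketched with pandas on in-memory frames. This is an illustration under assumptions: combine_results is a hypothetical name, the column names are made up, and storing model_name in a "model" column is a guess at how the name might be recorded.

```python
from typing import List, Optional

import pandas as pd


def combine_results(
    frames: List[pd.DataFrame], model_name: Optional[str] = None
) -> pd.DataFrame:
    """Concatenate per-chunk regression results into one table."""
    combined = pd.concat(frames, ignore_index=True)
    if model_name is not None:
        # Hypothetical column used to tag the results with the model name.
        combined["model"] = model_name
    return combined


chunk_0 = pd.DataFrame({"gene": ["GENE_A"], "beta": [0.12], "pval": [0.03]})
chunk_1 = pd.DataFrame({"gene": ["GENE_B"], "beta": [-0.40], "pval": [0.0005]})
combined = combine_results([chunk_0, chunk_1], model_name="DeepRVAT")
```

With files on disk, the frames would typically come from pd.read_parquet and the result would be written with DataFrame.to_parquet, both of which need a parquet engine such as pyarrow.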

deeprvat.deeprvat.associate.average_burdens(repeats: Tuple, burden_file: str, burden_out_file: str, agg_fct: Optional[str] = 'mean', n_chunks: Optional[int] = None, chunk: Optional[int] = None)
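The signature above suggests that burdens from several repeats are reduced with an aggregation function, "mean" by default. The sketch below illustrates the idea in NumPy under stated assumptions: the (samples, genes, repeats) array layout, the "max" option, and the aggregate_repeats name are all hypothetical.

```python
from typing import Sequence

import numpy as np

# "mean" is the documented default; "max" is an assumed alternative.
AGGREGATIONS = {"mean": np.mean, "max": np.max}


def aggregate_repeats(
    burdens: np.ndarray, repeats: Sequence[int], agg_fct: str = "mean"
) -> np.ndarray:
    """Reduce the selected repeats of a (samples, genes, repeats) array."""
    selected = burdens[:, :, list(repeats)]
    return AGGREGATIONS[agg_fct](selected, axis=2)


burdens = np.arange(24, dtype=float).reshape(2, 3, 4)
mean_burdens = aggregate_repeats(burdens, repeats=(0, 1))
max_burdens = aggregate_repeats(burdens, repeats=(0, 1), agg_fct="max")
```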
deeprvat.deeprvat.associate.regress_common(debug: bool, chunk: int, n_chunks: int, use_bias: bool, gene_file: str, repeat: int, config_file: str, burden_dir: str, out_file: str, do_scoretest: bool, sample_file: Optional[str], burden_file: Optional[str], genes_to_keep: Optional[str], common_genotype_prefix: str)
deeprvat.deeprvat.associate.regress_common_(config: Dict, use_bias: bool, burdens: numpy.ndarray, y: numpy.ndarray, gene_indices: numpy.ndarray, genes: pandas.Series, x_pheno: numpy.ndarray, common_genotype_prefix: str, use_x_pheno: bool = True, do_scoretest: bool = True) → pandas.DataFrame