deeprvat.deeprvat.associate
Module Contents
Functions
- get_burden: Compute burden scores for rare variants.
- separate_parallel_results: Separate results from running regression on each gene.
- make_dataset_: Create a dataset based on the configuration.
- make_dataset: Create a dataset based on the provided configuration and save to a pickle file.
- compute_xy_: Extract sample IDs, covariates x, and target phenotypes y from the dataset.
- compute_xy: Extract sample IDs, covariates x, and target phenotypes y from the provided dataset and save them to the given files.
- load_one_model: Load a single burden score computation model from a checkpoint file.
- reverse_models: Determine if the burden score computation PyTorch model should reverse the output based on PLOF annotations.
- load_models: Load models from multiple checkpoints for multiple repeats.
- compute_burdens_: Compute burdens using the PyTorch model for each repeat.
- compute_burdens: Compute burdens based on the provided model and dataset.
- regress_on_gene_scoretest: Perform regression on a gene using the score test.
- regress_on_gene: Perform regression on a gene using Ordinary Least Squares (OLS).
- regress_: Perform regression on multiple genes.
- regress: Perform regression analysis.
- combine_regression_results: Combine multiple regression result files.
Data
- logger
- PLOF_COLS
- AGG_FCT
API
- deeprvat.deeprvat.associate.logger = 'getLogger(...)'
- deeprvat.deeprvat.associate.PLOF_COLS = ['Consequence_stop_gained', 'Consequence_frameshift_variant', 'Consequence_stop_lost', 'Consequence_...
- deeprvat.deeprvat.associate.AGG_FCT = None
- deeprvat.deeprvat.associate.cli()
- deeprvat.deeprvat.associate.get_burden(batch: Dict, agg_models: Dict[str, List[torch.nn.Module]], device: torch.device = torch.device('cpu')) → Tuple[torch.Tensor, torch.Tensor]
Compute burden scores for rare variants.
- Parameters:
batch (Dict) – A dictionary containing batched data from the DataLoader.
agg_models (Dict[str, List[nn.Module]]) – Loaded PyTorch model(s) for each repeat used for burden computation. Each key in the dictionary corresponds to a respective repeat.
device (torch.device) – Device to perform computations on, defaults to “cpu”.
skip_burdens (bool) – Flag to skip burden computation, defaults to False.
- Returns:
Tuple containing burden scores, target y phenotype values, x phenotypes and sample ids.
- Return type:
Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]
Note
All checkpoint models that correspond to the same repeat are averaged for that repeat.
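The per-repeat averaging described in the note can be sketched in plain Python. This is a minimal illustration, not the module's implementation: the toy models are ordinary callables standing in for torch.nn.Module instances, the helper name average_per_repeat is hypothetical, and processing the keys in the numeric order of their "repeat_<n>" suffix is an assumption.

```python
from statistics import fmean
from typing import Callable, Dict, List, Sequence

# Hypothetical stand-in for a burden model: maps inputs to one score per sample.
Model = Callable[[Sequence[float]], List[float]]


def average_per_repeat(
    x: Sequence[float], agg_models: Dict[str, List[Model]]
) -> Dict[str, List[float]]:
    """Average the outputs of all models that share a repeat key."""
    averaged = {}
    # Assumption: keys look like "repeat_<n>" and are processed in numeric order.
    for key in sorted(agg_models, key=lambda k: int(k.split("_")[1])):
        outputs = [model(x) for model in agg_models[key]]
        # zip(*outputs) groups the scores of one sample across all models.
        averaged[key] = [fmean(scores) for scores in zip(*outputs)]
    return averaged


def double(x):
    return [2.0 * v for v in x]


def triple(x):
    return [3.0 * v for v in x]


# Toy input: two models for repeat 0, one model for repeat 1.
toy_models = {"repeat_1": [triple], "repeat_0": [double, triple]}
burdens = average_per_repeat([1.0, 2.0], toy_models)
print(burdens)
```

With a single model per repeat the average is simply that model's output, which is why repeat_1 above returns the tripled inputs unchanged.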
- deeprvat.deeprvat.associate.separate_parallel_results(results: List) → Tuple[List, ...]
Separate results from running regression on each gene.
- Parameters:
results (List) – List of results obtained from regression analysis.
- Returns:
Tuple of lists containing separated results of regressed_genes, betas, and pvals.
- Return type:
Tuple[List, …]
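Separating per-gene results into one list per field amounts to a transpose. The sketch below shows one common way to do this with zip; the helper name separate_results and the toy gene names and values are made up for illustration, and the real function may differ in detail.

```python
from typing import List, Tuple


def separate_results(results: List[Tuple]) -> Tuple[List, ...]:
    """Transpose per-gene result tuples into one list per field."""
    return tuple(map(list, zip(*results)))


# Toy per-gene results: (gene, beta, p-value). Values are illustrative only.
per_gene = [("GENE_A", 0.12, 0.03), ("GENE_B", -0.40, 0.20)]
regressed_genes, betas, pvals = separate_results(per_gene)
print(regressed_genes, betas, pvals)
```

Note that an empty results list transposes to an empty tuple, so unpacking into three names as above would fail in that case.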
- deeprvat.deeprvat.associate.make_dataset_(config: Dict, debug: bool = False, data_key: str = 'association_testing_data', skip_genotypes: bool = False, samples: Optional[List[int]] = None) → torch.utils.data.Dataset
Create a dataset based on the configuration.
- Parameters:
config (Dict) – Configuration dictionary.
debug (bool) – Flag for debugging, defaults to False.
data_key (str) – Key for dataset configuration in the config dictionary, defaults to “association_testing_data”.
skip_genotypes (bool) – If True, retrieve only covariates and phenotypes, not genotypes; defaults to False.
samples (Optional[List[int]]) – List of sample indices to include in the dataset, defaults to None.
- Returns:
Loaded instance of the created dataset.
- Return type:
Dataset
- deeprvat.deeprvat.associate.make_dataset(debug: bool, data_key: str, skip_genotypes: bool, config_file: str, out_file: str)
Create a dataset based on the provided configuration and save to a pickle file.
- Parameters:
debug (bool) – Flag for debugging.
data_key (str) – Key for dataset configuration in the config dictionary, defaults to “association_testing_data”.
skip_genotypes (bool) – If True, retrieve only covariates and phenotypes, not genotypes.
config_file (str) – Path to the configuration file.
out_file (str) – Path to the output file.
- Returns:
Created dataset saved to out_file.pkl
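Since the dataset is written as a pickle file, it can be read back with the standard pickle module. The round trip below is a minimal sketch that uses a plain dictionary as a stand-in for the real dataset object; the file name association_dataset.pkl follows the example given elsewhere in this reference, and the dictionary contents are invented.

```python
import pickle
import tempfile
from pathlib import Path

# Toy stand-in for a dataset object; the real object is a torch Dataset.
toy_dataset = {"samples": [0, 1, 2], "phenotypes": [0.5, 1.2, -0.3]}

with tempfile.TemporaryDirectory() as tmp_dir:
    out_file = Path(tmp_dir) / "association_dataset.pkl"
    # Write the object to disk.
    with open(out_file, "wb") as f:
        pickle.dump(toy_dataset, f)
    # Read it back, as a later step that receives the dataset file would.
    with open(out_file, "rb") as f:
        restored = pickle.load(f)

print(restored == toy_dataset)
```

Unpickling a real dataset additionally requires that the class definition be importable in the loading process.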
- deeprvat.deeprvat.associate.compute_xy_(config: Dict, ds: torch.utils.data.Dataset, data_key='association_testing_data') → Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray]
Extract sample IDs, covariates x, and target phenotypes y from the dataset.
- Parameters:
config (Dict) – Configuration dictionary.
ds (torch.utils.data.Dataset) – Torch dataset.
data_key (str) – Key for dataset configuration in the config dictionary, defaults to “association_testing_data”.
- Returns:
Tuple containing sample IDs, covariates x, and target phenotypes y
- Return type:
Tuple[np.ndarray, np.ndarray, np.ndarray]
Note
Checkpoint models all corresponding to the same repeat are averaged for that repeat.
- deeprvat.deeprvat.associate.compute_xy(dataset_file: Optional[str], data_key: str, data_config_file: str, sample_file: pathlib.Path, x_file: pathlib.Path, y_file: pathlib.Path)
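Extract sample IDs, covariates x, and target phenotypes y from the provided dataset and save them to the given files.
- Parameters:
dataset_file (Optional[str]) – Path to the dataset file, i.e., association_dataset.pkl.
data_key (str) – Key for dataset configuration in the config dictionary, defaults to “association_testing_data”.
data_config_file (str) – Path to the data configuration file.
sample_file (pathlib.Path) – Path to the file for sample IDs.
x_file (pathlib.Path) – Path to the file for covariates x.
y_file (pathlib.Path) – Path to the file for target phenotypes y.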
- deeprvat.deeprvat.associate.make_regenie_input_(debug: bool, skip_covariates: bool, skip_phenotypes: bool, skip_burdens: bool, burdens_genes_samples: Optional[Tuple[pathlib.Path, pathlib.Path, pathlib.Path]], repeat: int, average_repeats: bool, phenotype: Tuple[Tuple[str, pathlib.Path, pathlib.Path, pathlib.Path]], sample_file: Optional[pathlib.Path], covariate_file: Optional[pathlib.Path], phenotype_file: Optional[pathlib.Path], bgen: Optional[pathlib.Path], gene_metadata_file: pathlib.Path, gtf: pathlib.Path)
- deeprvat.deeprvat.associate.make_regenie_input(debug: bool, skip_covariates: bool, skip_phenotypes: bool, skip_burdens: bool, burdens_genes_samples: Optional[Tuple[pathlib.Path, pathlib.Path, pathlib.Path]], repeat: int, average_repeats: bool, phenotype: Tuple[Tuple[str, pathlib.Path, pathlib.Path]], sample_file: Optional[pathlib.Path], covariate_file: Optional[pathlib.Path], phenotype_file: Optional[pathlib.Path], bgen: Optional[pathlib.Path], gene_metadata_file: pathlib.Path, gtf: pathlib.Path)
- deeprvat.deeprvat.associate.convert_regenie_output_(repeat: int, phenotype: Tuple[str, Tuple[pathlib.Path, pathlib.Path]], gene_file: pathlib.Path)
- deeprvat.deeprvat.associate.convert_regenie_output(repeat: int, phenotype: Tuple[str, Tuple[pathlib.Path, pathlib.Path]], gene_file: pathlib.Path)
- deeprvat.deeprvat.associate.load_one_model(config: Dict, checkpoint: str, device: torch.device = torch.device('cpu'))
Load a single burden score computation model from a checkpoint file.
- Parameters:
config (Dict) – Configuration dictionary.
checkpoint (str) – Path to the model checkpoint file.
device (torch.device) – Device to load the model onto, defaults to “cpu”.
- Returns:
Loaded PyTorch model for burden score computation.
- Return type:
nn.Module
- deeprvat.deeprvat.associate.reverse_models(model_config_file: str, data_config_file: str, checkpoint_files: Tuple[str])
Determine if the burden score computation PyTorch model should reverse the output based on PLOF annotations.
- Parameters:
model_config_file (str) – Path to the model configuration file.
data_config_file (str) – Path to the data configuration file.
checkpoint_files (Tuple[str]) – Paths to checkpoint files.
- Returns:
checkpoint.reverse file is created if the model should reverse the burden score output.
- deeprvat.deeprvat.associate.load_models(config: Dict, checkpoint_files: Tuple[str], device: torch.device = torch.device('cpu')) → Dict[str, List[torch.nn.Module]]
Load models from multiple checkpoints for multiple repeats.
- Parameters:
config (Dict) – Configuration dictionary.
checkpoint_files (Tuple[str]) – Paths to checkpoint files.
device (torch.device) – Device to load the models onto, defaults to “cpu”.
- Returns:
Dictionary of loaded PyTorch models for burden score computation for each repeat.
- Return type:
Dict[str, List[nn.Module]]
- Examples:
>>> config = {"model": {"type": "MyModel", "config": {"param": "value"}}}
>>> checkpoint_files = ("checkpoint1.pth", "checkpoint2.pth")
>>> load_models(config, checkpoint_files)
{'repeat_0': [MyModel(), MyModel()]}
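The returned dictionary groups models under keys such as "repeat_0". How a checkpoint is assigned to a repeat is not stated here, so the sketch below rests on an assumption: the repeat index is read from a "repeat_<n>" component of the checkpoint path, and paths without one fall back to "repeat_0". The helper name group_by_repeat and the example paths are hypothetical, and the sketch groups paths only; it does not load any model.

```python
import re
from collections import defaultdict
from typing import Dict, List, Sequence

# Assumed naming convention for repeat directories or file names.
REPEAT_PATTERN = re.compile(r"repeat_(\d+)")


def group_by_repeat(checkpoint_files: Sequence[str]) -> Dict[str, List[str]]:
    """Group checkpoint paths under 'repeat_<n>' keys."""
    grouped = defaultdict(list)
    for path in checkpoint_files:
        match = REPEAT_PATTERN.search(path)
        # Paths without a repeat marker fall back to "repeat_0".
        key = f"repeat_{match.group(1)}" if match else "repeat_0"
        grouped[key].append(path)
    return dict(grouped)


flat = group_by_repeat(("checkpoint1.pth", "checkpoint2.pth"))
nested = group_by_repeat(
    (
        "models/repeat_0/bag_0.ckpt",
        "models/repeat_1/bag_0.ckpt",
        "models/repeat_0/bag_1.ckpt",
    )
)
print(flat)
print(nested)
```

Under this assumption the two unmarked checkpoints from the example above both land in "repeat_0", matching the shape of the documented return value.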
- deeprvat.deeprvat.associate.compute_burdens_(debug: bool, config: Dict, ds: torch.utils.data.Dataset, cache_dir: str, agg_models: Dict[str, List[torch.nn.Module]], data_key: str = 'association_testing_data', n_chunks: Optional[int] = None, chunk: Optional[int] = None, device: torch.device = torch.device('cpu'), bottleneck: bool = False, compression_level: int = 1) → Tuple[numpy.ndarray, zarr.core.Array, zarr.core.Array]
Compute burdens using the PyTorch model for each repeat.
- Parameters:
debug (bool) – Flag for debugging.
data_key (str) – Key for dataset configuration in the config dictionary, defaults to “association_testing_data”.
config (Dict) – Configuration dictionary.
ds (torch.utils.data.Dataset) – Torch dataset.
cache_dir (str) – Directory to cache zarr files of computed burdens, x phenotypes, and y phenotypes.
agg_models (Dict[str, List[nn.Module]]) – Loaded PyTorch model(s) for each repeat used for burden computation. Each key in the dictionary corresponds to a respective repeat.
n_chunks (Optional[int]) – Number of chunks to split data for processing, defaults to None.
chunk (Optional[int]) – Index of the chunk of data, defaults to None.
device (torch.device) – Device to perform computations on, defaults to “cpu”.
bottleneck (bool) – Flag to enable bottlenecking number of batches, defaults to False.
compression_level (int) – Blosc compressor compression level for zarr files, defaults to 1.
- Returns:
Tuple containing genes, burdens, target y phenotypes, x phenotypes and sample ids.
- Return type:
Tuple[np.ndarray, zarr.core.Array, zarr.core.Array, zarr.core.Array, zarr.core.Array]
Note
All checkpoint models that correspond to the same repeat are averaged for that repeat.
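The n_chunks and chunk parameters select one slice of the data for processing. One plausible way to map a chunk index to a range of sample indices is shown below. The helper name chunk_bounds and the contiguous, ceiling-sized split are assumptions made for illustration; the actual partitioning may differ.

```python
import math
from typing import Tuple


def chunk_bounds(n_samples: int, n_chunks: int, chunk: int) -> Tuple[int, int]:
    """Return the half-open index range [start, end) covered by one chunk."""
    if not 0 <= chunk < n_chunks:
        raise ValueError("chunk must satisfy 0 <= chunk < n_chunks")
    # Every chunk except possibly the last ones holds chunk_length samples.
    chunk_length = math.ceil(n_samples / n_chunks)
    start = min(n_samples, chunk * chunk_length)
    end = min(n_samples, start + chunk_length)
    return start, end


bounds = [chunk_bounds(10, 3, c) for c in range(3)]
print(bounds)  # [(0, 4), (4, 8), (8, 10)]
```

The ranges are disjoint and together cover every sample exactly once, so chunks can be processed independently and combined afterwards.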
- deeprvat.deeprvat.associate.compute_burdens(debug: bool, bottleneck: bool, data_key: str, n_chunks: Optional[int], chunk: Optional[int], dataset_file: Optional[str], data_config_file: str, model_config_file: str, checkpoint_files: Tuple[str], out_dir: str)
Compute burdens based on the provided model and dataset.
- Parameters:
debug (bool) – Flag for debugging.
bottleneck (bool) – Flag to enable bottlenecking number of batches.
data_key (str) – Key for dataset configuration in the config dictionary, defaults to “association_testing_data”.
n_chunks (Optional[int]) – Number of chunks to split data for processing, defaults to None.
chunk (Optional[int]) – Index of the chunk of data, defaults to None.
dataset_file (Optional[str]) – Path to the dataset file, i.e., association_dataset.pkl.
data_config_file (str) – Path to the data configuration file.
model_config_file (str) – Path to the model configuration file.
checkpoint_files (Tuple[str]) – Paths to model checkpoint files.
out_dir (str) – Path to the output directory.
- Returns:
The corresponding genes, computed burdens, y phenotypes, x phenotypes, and sample IDs are saved in out_dir.
- Return type:
[np.ndarray], [zarr.core.Array], [zarr.core.Array], [zarr.core.Array], [zarr.core.Array]
Note
All checkpoint models that correspond to the same repeat are averaged for that repeat.
- deeprvat.deeprvat.associate.combine_burden_chunks(n_chunks: int, skip_burdens: bool, overwrite: bool, burdens_chunks_dir: pathlib.Path, result_dir: pathlib.Path)
- deeprvat.deeprvat.associate.combine_burden_chunks_(n_chunks: int, burdens_chunks_dir: pathlib.Path, skip_burdens: bool, overwrite: bool, result_dir: pathlib.Path)
- deeprvat.deeprvat.associate.regress_on_gene_scoretest(gene: str, burdens: numpy.ndarray, model_score) → Tuple[List[str], List[float], List[float]]
Perform regression on a gene using the score test.
- Parameters:
gene (str) – Gene name.
burdens (np.ndarray) – Burden scores associated with the gene.
model_score (Any) – Model for score test.
- Returns:
Tuple containing gene name, beta, and p-value.
- Return type:
Tuple[List[str], List[float], List[float]]
- deeprvat.deeprvat.associate.regress_on_gene(gene: str, X: numpy.ndarray, y: numpy.ndarray, x_pheno: numpy.ndarray, use_bias: bool, use_x_pheno: bool) → Tuple[List[str], List[float], List[float]]
Perform regression on a gene using Ordinary Least Squares (OLS).
- Parameters:
gene (str) – Gene name.
X (np.ndarray) – Burden score data.
y (np.ndarray) – Y phenotype data.
x_pheno (np.ndarray) – X phenotype data.
use_bias (bool) – Flag to include bias term.
use_x_pheno (bool) – Flag to include x phenotype data in regression.
- Returns:
Tuple containing gene name, beta, and p-value.
- Return type:
Tuple[List[str], List[float], List[float]]
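The effect of use_bias and use_x_pheno on the OLS fit can be illustrated with a small NumPy sketch. It is not the module's implementation: the helper name ols_burden_beta, the column order of the design matrix, and the toy data are assumptions, and only the burden coefficient is computed. A p-value would additionally require a statistics package.

```python
import numpy as np


def ols_burden_beta(burden, y, x_pheno, use_bias, use_x_pheno):
    """Return the OLS coefficient of the burden column."""
    y = np.asarray(y, dtype=float)
    # The burden score is always the first column of the design matrix.
    columns = [np.asarray(burden, dtype=float).reshape(len(y), -1)]
    if use_bias:
        # Constant column for the intercept (bias) term.
        columns.append(np.ones((len(y), 1)))
    if use_x_pheno:
        # Covariates ("x phenotypes") enter as additional columns.
        columns.append(np.asarray(x_pheno, dtype=float).reshape(len(y), -1))
    design = np.hstack(columns)
    coefficients, *_ = np.linalg.lstsq(design, y, rcond=None)
    return float(coefficients[0])


# Toy data: y depends linearly on the burden, one covariate, and an offset.
burden = np.array([0.0, 1.0, 2.0, 3.0])
covariate = np.array([1.0, -1.0, 1.0, -1.0])
y = 2.0 * burden + 0.5 * covariate + 1.0

beta = ols_burden_beta(burden, y, covariate, use_bias=True, use_x_pheno=True)
print(round(beta, 6))
```

Because the toy phenotype is an exact linear function of the included columns, the fitted burden coefficient recovers the slope used to generate it. Dropping the bias or the covariate would generally change the estimate.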
- deeprvat.deeprvat.associate.regress_(config: Dict, use_bias: bool, burdens: numpy.ndarray, y: numpy.ndarray, gene_indices: numpy.ndarray, genes: pandas.Series, x_pheno: numpy.ndarray, use_x_pheno: bool = True, do_scoretest: bool = True) → pandas.DataFrame
Perform regression on multiple genes.
- Parameters:
config (Dict) – Configuration dictionary.
use_bias (bool) – Flag to include bias term when performing OLS regression.
burdens (np.ndarray) – Burden score data.
y (np.ndarray) – Y phenotype data.
gene_indices (np.ndarray) – Indices of genes.
genes (pd.Series) – Gene names.
x_pheno (np.ndarray) – X phenotype data.
use_x_pheno (bool) – Flag to include x phenotype data when performing OLS regression, defaults to True.
do_scoretest (bool) – Flag to use the scoretest from SEAK, defaults to True.
- Returns:
DataFrame containing regression results on all genes.
- Return type:
pd.DataFrame
- deeprvat.deeprvat.associate.regress(debug: bool, chunk: int, n_chunks: int, use_bias: bool, gene_file: str, config_file: str, xy_dir: str, burden_file: str, out_dir: str, do_scoretest: bool, sample_file: Optional[str])
Perform regression analysis.
- Parameters:
debug (bool) – Flag for debugging.
chunk (int) – Index of the chunk of data, defaults to 0.
n_chunks (int) – Number of chunks to split data for processing, defaults to 1.
use_bias (bool) – Flag to include bias term when performing OLS regression.
gene_file (str) – Path to the gene file.
config_file (str) – Path to the configuration file.
xy_dir (str) – Path to the directory containing the x.zarr and y.zarr files.
burden_file (str) – Path to the burdens.zarr file.
out_dir (str) – Path to the output directory.
do_scoretest (bool) – Flag to use the scoretest from SEAK.
sample_file (Optional[str]) – Path to the sample file.
- Returns:
Regression results saved to out_dir as “burden_associations_{chunk}.parquet”
- deeprvat.deeprvat.associate.combine_regression_results(result_files: Tuple[str], out_file: str, model_name: Optional[str])
Combine multiple regression result files.
- Parameters:
result_files (Tuple[str]) – List of paths to regression result files.
out_file (str) – Path to the output file.
model_name (Optional[str]) – Name of the regression model.
- Returns:
Concatenated regression results saved to a parquet file.
- deeprvat.deeprvat.associate.average_burdens(repeats: Tuple, burden_file: str, burden_out_file: str, agg_fct: Optional[str] = 'mean', n_chunks: Optional[int] = None, chunk: Optional[int] = None)
- deeprvat.deeprvat.associate.regress_common(debug: bool, chunk: int, n_chunks: int, use_bias: bool, gene_file: str, repeat: int, config_file: str, burden_dir: str, out_file: str, do_scoretest: bool, sample_file: Optional[str], burden_file: Optional[str], genes_to_keep: Optional[str], common_genotype_prefix: str)
- deeprvat.deeprvat.associate.regress_common_(config: Dict, use_bias: bool, burdens: numpy.ndarray, y: numpy.ndarray, gene_indices: numpy.ndarray, genes: pandas.Series, x_pheno: numpy.ndarray, common_genotype_prefix: str, use_x_pheno: bool = True, do_scoretest: bool = True) → pandas.DataFrame