`deeprvat.deeprvat.associate`

Module Contents

Functions

`cli`
`get_burden`	Compute burden scores for rare variants.
`separate_parallel_results`	Separate results from running regression on each gene.
`make_dataset_`	Create a dataset based on the configuration.
`make_dataset`	Create a dataset based on the provided configuration and save to a pickle file.
`compute_xy_`	Extract sample IDs, covariates x, and target phenotypes y from a dataset.
`compute_xy`	Extract sample IDs, covariates x, and target phenotypes y from a dataset and save them to files.
`make_regenie_input_`
`make_regenie_input`
`convert_regenie_output_`
`convert_regenie_output`
`load_one_model`	Load a single burden score computation model from a checkpoint file.
`reverse_models`	Determine if the burden score computation PyTorch model should reverse the output based on PLOF annotations.
`load_models`	Load models from multiple checkpoints for multiple repeats.
`compute_burdens_`	Compute burdens using the PyTorch model for each repeat.
`compute_burdens`	Compute burdens based on the provided model and dataset.
`combine_burden_chunks`
`combine_burden_chunks_`
`regress_on_gene_scoretest`	Perform regression on a gene using the score test.
`regress_on_gene`	Perform regression on a gene using Ordinary Least Squares (OLS).
`regress_`	Perform regression on multiple genes.
`regress`	Perform regression analysis.
`combine_regression_results`	Combine multiple regression result files.
`average_burdens`
`regress_common`
`regress_common_`

Data

`logger`
`PLOF_COLS`
`AGG_FCT`

API

deeprvat.deeprvat.associate.logger = 'getLogger(...)'

deeprvat.deeprvat.associate.PLOF_COLS = ['Consequence_stop_gained', 'Consequence_frameshift_variant', 'Consequence_stop_lost', 'Consequence_...

deeprvat.deeprvat.associate.AGG_FCT = None

deeprvat.deeprvat.associate.cli()

deeprvat.deeprvat.associate.get_burden(batch: Dict, agg_models: Dict[str, List[torch.nn.Module]], device: torch.device = torch.device('cpu')) → Tuple[torch.Tensor, torch.Tensor]

Compute burden scores for rare variants.

Parameters:

batch (Dict) – A dictionary containing batched data from the DataLoader.
agg_models (Dict[str, List[nn.Module]]) – Loaded PyTorch model(s) for each repeat used for burden computation. Each key in the dictionary corresponds to a respective repeat.
device (torch.device) – Device to perform computations on, defaults to “cpu”.
skip_burdens (bool) – Flag to skip burden computation, defaults to False.

Returns:

Tuple containing burden scores, target y phenotype values, x phenotypes and sample ids.

Return type:

Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]

Note

Checkpoint models all corresponding to the same repeat are averaged for that repeat.
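
The averaging described in the note can be pictured with the sketch below. It is an illustration under stated assumptions, not the module's implementation: the models are plain Python callables standing in for `nn.Module` objects, and `average_per_repeat` is a hypothetical helper name.

```python
from statistics import fmean
from typing import Callable, Dict, List, Sequence

# Stand-in for a loaded model: maps one batch to one score per sample.
Model = Callable[[Sequence[float]], List[float]]


def average_per_repeat(
    batch: Sequence[float], agg_models: Dict[str, List[Model]]
) -> Dict[str, List[float]]:
    """Average the outputs of all models that belong to the same repeat."""
    averaged = {}
    for repeat, models in agg_models.items():
        outputs = [model(batch) for model in models]
        # zip(*outputs) groups the per-sample scores across models.
        averaged[repeat] = [fmean(scores) for scores in zip(*outputs)]
    return averaged


# Two toy "checkpoints" for a single repeat (hypothetical stand-ins).
agg_models = {
    "repeat_0": [
        lambda xs: [2.0 * x for x in xs],
        lambda xs: [4.0 * x for x in xs],
    ]
}
result = average_per_repeat([1.0, 2.0], agg_models)
```

Each key of `agg_models` yields one averaged score vector, mirroring the "one entry per repeat" layout of the `agg_models` parameter.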

deeprvat.deeprvat.associate.separate_parallel_results(results: List) → Tuple[List, ...]

Separate results from running regression on each gene.

Parameters:

results (List) – List of results obtained from regression analysis.

Returns:

Tuple of lists containing separated results of regressed_genes, betas, and pvals.

Return type:

Tuple[List, …]
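
One way to picture this separation is a transpose of per-gene tuples. The sketch below assumes each result is shaped like the return value of `regress_on_gene` (three lists); `separate_results_sketch` is a hypothetical name, and the real function may post-process the lists differently (for example, by flattening them).

```python
from typing import List, Tuple

# One entry per gene, shaped like the return value of `regress_on_gene`.
GeneResult = Tuple[List[str], List[float], List[float]]


def separate_results_sketch(results: List[GeneResult]) -> Tuple[List, ...]:
    """Turn a list of per-gene tuples into one list per field."""
    # zip(*results) transposes [(genes, betas, pvals), ...] into three groups.
    return tuple(list(field) for field in zip(*results))


results = [
    (["GENE_A"], [0.12], [0.03]),
    (["GENE_B"], [-0.40], [0.50]),
]
regressed_genes, betas, pvals = separate_results_sketch(results)
```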

deeprvat.deeprvat.associate.make_dataset_(config: Dict, debug: bool = False, data_key: str = 'association_testing_data', skip_genotypes: bool = False, samples: Optional[List[int]] = None) → torch.utils.data.Dataset

Create a dataset based on the configuration.

Parameters:

config (Dict) – Configuration dictionary.
debug (bool) – Flag for debugging, defaults to False.
data_key (str) – Key for dataset configuration in the config dictionary, defaults to “association_testing_data”.
skip_genotypes (bool) – Flag to retrieve only covariates and phenotypes, not genotypes, defaults to False.
samples (List[int]) – List of sample indices to include in the dataset, defaults to None.

Returns:

Loaded instance of the created dataset.

Return type:

Dataset

deeprvat.deeprvat.associate.make_dataset(debug: bool, data_key: str, skip_genotypes: bool, config_file: str, out_file: str)

Create a dataset based on the provided configuration and save to a pickle file.

Parameters:

debug (bool) – Flag for debugging.
data_key (str) – Key for dataset configuration in the config dictionary, defaults to “association_testing_data”.
skip_genotypes (bool) – Flag to retrieve only covariates and phenotypes, not genotypes.
config_file (str) – Path to the configuration file.
out_file (str) – Path to the output file.

Returns:

Created dataset saved to out_file.pkl

deeprvat.deeprvat.associate.compute_xy_(config: Dict, ds: torch.utils.data.Dataset, data_key='association_testing_data') → Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray]

Extract sample IDs, covariates x, and target phenotypes y from the dataset.

Parameters:

config (Dict) – Configuration dictionary.
ds (torch.utils.data.Dataset) – Torch dataset.
data_key (str) – Key for dataset configuration in the config dictionary, defaults to “association_testing_data”.

Returns:

Tuple containing sample IDs, covariates x, and target phenotypes y

Return type:

Tuple[np.ndarray, np.ndarray, np.ndarray]

Note

Checkpoint models all corresponding to the same repeat are averaged for that repeat.

deeprvat.deeprvat.associate.compute_xy(dataset_file: Optional[str], data_key: str, data_config_file: str, sample_file: pathlib.Path, x_file: pathlib.Path, y_file: pathlib.Path)

Extract sample IDs, covariates x, and target phenotypes y from the dataset and save them to the given files.

Parameters:

dataset_file (Optional[str]) – Path to the dataset file, i.e., association_dataset.pkl.
data_key (str) – Key for dataset configuration in the config dictionary, defaults to “association_testing_data”.
data_config_file (str) – Path to the data configuration file.
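sample_file (pathlib.Path) – Path to the file for the sample IDs.
x_file (pathlib.Path) – Path to the file for the covariates x.
y_file (pathlib.Path) – Path to the file for the target phenotypes y.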

deeprvat.deeprvat.associate.make_regenie_input_(debug: bool, skip_covariates: bool, skip_phenotypes: bool, skip_burdens: bool, burdens_genes_samples: Optional[Tuple[pathlib.Path, pathlib.Path, pathlib.Path]], repeat: int, average_repeats: bool, phenotype: Tuple[Tuple[str, pathlib.Path, pathlib.Path, pathlib.Path]], sample_file: Optional[pathlib.Path], covariate_file: Optional[pathlib.Path], phenotype_file: Optional[pathlib.Path], bgen: Optional[pathlib.Path], gene_metadata_file: pathlib.Path, gtf: pathlib.Path)

deeprvat.deeprvat.associate.make_regenie_input(debug: bool, skip_covariates: bool, skip_phenotypes: bool, skip_burdens: bool, burdens_genes_samples: Optional[Tuple[pathlib.Path, pathlib.Path, pathlib.Path]], repeat: int, average_repeats: bool, phenotype: Tuple[Tuple[str, pathlib.Path, pathlib.Path]], sample_file: Optional[pathlib.Path], covariate_file: Optional[pathlib.Path], phenotype_file: Optional[pathlib.Path], bgen: Optional[pathlib.Path], gene_metadata_file: pathlib.Path, gtf: pathlib.Path)

deeprvat.deeprvat.associate.convert_regenie_output_(repeat: int, phenotype: Tuple[str, Tuple[pathlib.Path, pathlib.Path]], gene_file: pathlib.Path)

deeprvat.deeprvat.associate.convert_regenie_output(repeat: int, phenotype: Tuple[str, Tuple[pathlib.Path, pathlib.Path]], gene_file: pathlib.Path)

deeprvat.deeprvat.associate.load_one_model(config: Dict, checkpoint: str, device: torch.device = torch.device('cpu'))

Load a single burden score computation model from a checkpoint file.

Parameters:

config (Dict) – Configuration dictionary.
checkpoint (str) – Path to the model checkpoint file.
device (torch.device) – Device to load the model onto, defaults to “cpu”.

Returns:

Loaded PyTorch model for burden score computation.

Return type:

nn.Module

deeprvat.deeprvat.associate.reverse_models(model_config_file: str, data_config_file: str, checkpoint_files: Tuple[str])

Determine if the burden score computation PyTorch model should reverse the output based on PLOF annotations.

Parameters:

model_config_file (str) – Path to the model configuration file.
data_config_file (str) – Path to the data configuration file.
checkpoint_files (Tuple[str]) – Paths to checkpoint files.

Returns:

checkpoint.reverse file is created if the model should reverse the burden score output.

deeprvat.deeprvat.associate.load_models(config: Dict, checkpoint_files: Tuple[str], device: torch.device = torch.device('cpu')) → Dict[str, List[torch.nn.Module]]

Load models from multiple checkpoints for multiple repeats.

Parameters:

config (Dict) – Configuration dictionary.
checkpoint_files (Tuple[str]) – Paths to checkpoint files.
device (torch.device) – Device to load the models onto, defaults to “cpu”.

Returns:

Dictionary of loaded PyTorch models for burden score computation for each repeat.

Return type:

Dict[str, List[nn.Module]]

Examples:

>>> config = {"model": {"type": "MyModel", "config": {"param": "value"}}}
>>> checkpoint_files = ("checkpoint1.pth", "checkpoint2.pth")
>>> load_models(config, checkpoint_files)
{'repeat_0': [MyModel(), MyModel()]}

deeprvat.deeprvat.associate.compute_burdens_(debug: bool, config: Dict, ds: torch.utils.data.Dataset, cache_dir: str, agg_models: Dict[str, List[torch.nn.Module]], data_key: str = 'association_testing_data', n_chunks: Optional[int] = None, chunk: Optional[int] = None, device: torch.device = torch.device('cpu'), bottleneck: bool = False, compression_level: int = 1) → Tuple[numpy.ndarray, zarr.core.Array, zarr.core.Array]

Compute burdens using the PyTorch model for each repeat.

Parameters:

debug (bool) – Flag for debugging.
data_key (str) – Key for dataset configuration in the config dictionary, defaults to “association_testing_data”.
config (Dict) – Configuration dictionary.
ds (torch.utils.data.Dataset) – Torch dataset.
cache_dir (str) – Directory to cache zarr files of computed burdens, x phenotypes, and y phenotypes.
agg_models (Dict[str, List[nn.Module]]) – Loaded PyTorch model(s) for each repeat used for burden computation. Each key in the dictionary corresponds to a respective repeat.
n_chunks (Optional[int]) – Number of chunks to split data for processing, defaults to None.
chunk (Optional[int]) – Index of the chunk of data, defaults to None.
device (torch.device) – Device to perform computations on, defaults to “cpu”.
bottleneck (bool) – Flag to enable bottlenecking number of batches, defaults to False.
compression_level (int) – Blosc compressor compression level for zarr files, defaults to 1.

Returns:

Tuple containing genes, burdens, target y phenotypes, x phenotypes and sample ids.

Return type:

Tuple[np.ndarray, zarr.core.Array, zarr.core.Array, zarr.core.Array, zarr.core.Array]

Note

Checkpoint models all corresponding to the same repeat are averaged for that repeat.
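
The `n_chunks` and `chunk` parameters split the work into independent pieces. The sketch below shows one plausible scheme, a contiguous near-equal split of sample indices with a 0-based `chunk` index. Both the scheme and the helper name `chunk_bounds` are assumptions for illustration; the module's actual partitioning may differ.

```python
from typing import Optional


def chunk_bounds(n_samples: int, n_chunks: Optional[int], chunk: Optional[int]) -> range:
    """Return the sample indices covered by one chunk (contiguous split)."""
    if n_chunks is None or chunk is None:
        return range(n_samples)  # no chunking: process everything
    if not 0 <= chunk < n_chunks:
        raise ValueError("chunk must satisfy 0 <= chunk < n_chunks")
    # Spread the remainder over the first chunks so sizes differ by at most 1.
    base, extra = divmod(n_samples, n_chunks)
    start = chunk * base + min(chunk, extra)
    stop = start + base + (1 if chunk < extra else 0)
    return range(start, stop)


# Ten samples split into three chunks.
chunks = [list(chunk_bounds(10, 3, i)) for i in range(3)]
```

Under this scheme every sample falls into exactly one chunk, so per-chunk outputs can be combined afterwards without overlap.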

deeprvat.deeprvat.associate.compute_burdens(debug: bool, bottleneck: bool, data_key: str, n_chunks: Optional[int], chunk: Optional[int], dataset_file: Optional[str], data_config_file: str, model_config_file: str, checkpoint_files: Tuple[str], out_dir: str)

Compute burdens based on the provided model and dataset.

Parameters:

debug (bool) – Flag for debugging.
bottleneck (bool) – Flag to enable bottlenecking number of batches.
data_key (str) – Key for dataset configuration in the config dictionary, defaults to “association_testing_data”.
n_chunks (Optional[int]) – Number of chunks to split data for processing, defaults to None.
chunk (Optional[int]) – Index of the chunk of data, defaults to None.
dataset_file (Optional[str]) – Path to the dataset file, i.e., association_dataset.pkl.
data_config_file (str) – Path to the data configuration file.
model_config_file (str) – Path to the model configuration file.
checkpoint_files (Tuple[str]) – Paths to model checkpoint files.
out_dir (str) – Path to the output directory.

Returns:

Corresponding genes, computed burdens, y phenotypes, x phenotypes and sample ids are saved in the out_dir.

Return type:

[np.ndarray], [zarr.core.Array], [zarr.core.Array], [zarr.core.Array], [zarr.core.Array]

Note

Checkpoint models all corresponding to the same repeat are averaged for that repeat.

deeprvat.deeprvat.associate.combine_burden_chunks(n_chunks: int, skip_burdens: bool, overwrite: bool, burdens_chunks_dir: pathlib.Path, result_dir: pathlib.Path)

deeprvat.deeprvat.associate.combine_burden_chunks_(n_chunks: int, burdens_chunks_dir: pathlib.Path, skip_burdens: bool, overwrite: bool, result_dir: pathlib.Path)

deeprvat.deeprvat.associate.regress_on_gene_scoretest(gene: str, burdens: numpy.ndarray, model_score) → Tuple[List[str], List[float], List[float]]

Perform regression on a gene using the score test.

Parameters:

gene (str) – Gene name.
burdens (np.ndarray) – Burden scores associated with the gene.
model_score (Any) – Model for score test.

Returns:

Tuple containing gene name, beta, and p-value.

Return type:

Tuple[List[str], List[float], List[float]]

deeprvat.deeprvat.associate.regress_on_gene(gene: str, X: numpy.ndarray, y: numpy.ndarray, x_pheno: numpy.ndarray, use_bias: bool, use_x_pheno: bool) → Tuple[List[str], List[float], List[float]]

Perform regression on a gene using Ordinary Least Squares (OLS).

Parameters:

gene (str) – Gene name.
X (np.ndarray) – Burden score data.
y (np.ndarray) – Y phenotype data.
x_pheno (np.ndarray) – X phenotype data.
use_bias (bool) – Flag to include bias term.
use_x_pheno (bool) – Flag to include x phenotype data in regression.

Returns:

Tuple containing gene name, beta, and p-value.

Return type:

Tuple[List[str], List[float], List[float]]
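
For intuition, the simplest case of such a regression (a single burden score, a bias term, and no x phenotype covariates) can be worked out by hand. The sketch below is a textbook OLS fit in plain Python, not the function's implementation; `ols_single_regressor` is a hypothetical name, and the p-value step (a t-test on `beta / se_beta`) is left out.

```python
from math import sqrt
from typing import List, Tuple


def ols_single_regressor(x: List[float], y: List[float]) -> Tuple[float, float]:
    """Fit y = intercept + beta * x and return (beta, standard error of beta)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    beta = sxy / sxx
    intercept = mean_y - beta * mean_x
    # Residual variance with n - 2 degrees of freedom (intercept and slope).
    rss = sum((yi - intercept - beta * xi) ** 2 for xi, yi in zip(x, y))
    se_beta = sqrt(rss / (n - 2) / sxx)
    return beta, se_beta


# Toy burden scores and phenotype values (made up for illustration).
burden = [0.0, 1.0, 2.0, 3.0]
phenotype = [1.0, 3.1, 4.9, 7.0]
beta, se_beta = ols_single_regressor(burden, phenotype)
```

With covariates included (`use_x_pheno=True`), the same idea extends to a multiple regression in which the burden coefficient is the quantity of interest.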

deeprvat.deeprvat.associate.regress_(config: Dict, use_bias: bool, burdens: numpy.ndarray, y: numpy.ndarray, gene_indices: numpy.ndarray, genes: pandas.Series, x_pheno: numpy.ndarray, use_x_pheno: bool = True, do_scoretest: bool = True) → pandas.DataFrame

Perform regression on multiple genes.

Parameters:

config (Dict) – Configuration dictionary.
use_bias (bool) – Flag to include bias term when performing OLS regression.
burdens (np.ndarray) – Burden score data.
y (np.ndarray) – Y phenotype data.
gene_indices (np.ndarray) – Indices of genes.
genes (pd.Series) – Gene names.
x_pheno (np.ndarray) – X phenotype data.
use_x_pheno (bool) – Flag to include x phenotype data when performing OLS regression, defaults to True.
do_scoretest (bool) – Flag to use the scoretest from SEAK, defaults to True.

Returns:

DataFrame containing regression results on all genes.

Return type:

pd.DataFrame

deeprvat.deeprvat.associate.regress(debug: bool, chunk: int, n_chunks: int, use_bias: bool, gene_file: str, config_file: str, xy_dir: str, burden_file: str, out_dir: str, do_scoretest: bool, sample_file: Optional[str])

Perform regression analysis.

Parameters:

debug (bool) – Flag for debugging.
chunk (int) – Index of the chunk of data, defaults to 0.
n_chunks (int) – Number of chunks to split data for processing, defaults to 1.
use_bias (bool) – Flag to include bias term when performing OLS regression.
gene_file (str) – Path to the gene file.
config_file (str) – Path to the configuration file.
xy_dir (str) – Path to the directory containing the x.zarr and y.zarr files.
burden_file (str) – Path to the burdens.zarr file.
out_dir (str) – Path to the output directory.
do_scoretest (bool) – Flag to use the scoretest from SEAK.
sample_file (Optional[str]) – Path to the sample file.

Returns:

Regression results saved to out_dir as “burden_associations_{chunk}.parquet”
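
Given the naming pattern above, the per-chunk result files can be enumerated before they are passed on, for example to `combine_regression_results`. This sketch assumes chunk indices run from 0 to `n_chunks - 1`; `expected_result_files` is a hypothetical helper, not part of the module.

```python
from pathlib import Path
from typing import List


def expected_result_files(out_dir: str, n_chunks: int) -> List[Path]:
    """List the result file expected for each chunk index."""
    return [
        Path(out_dir) / f"burden_associations_{chunk}.parquet"
        for chunk in range(n_chunks)
    ]


files = expected_result_files("results", 2)
```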

deeprvat.deeprvat.associate.combine_regression_results(result_files: Tuple[str], out_file: str, model_name: Optional[str])

Combine multiple regression result files.

Parameters:

result_files (Tuple[str]) – Paths to regression result files.
out_file (str) – Path to the output file.
model_name (Optional[str]) – Name of the regression model.

Returns:

Concatenated regression results saved to a parquet file.

deeprvat.deeprvat.associate.average_burdens(repeats: Tuple, burden_file: str, burden_out_file: str, agg_fct: Optional[str] = 'mean', n_chunks: Optional[int] = None, chunk: Optional[int] = None)

deeprvat.deeprvat.associate.regress_common(debug: bool, chunk: int, n_chunks: int, use_bias: bool, gene_file: str, repeat: int, config_file: str, burden_dir: str, out_file: str, do_scoretest: bool, sample_file: Optional[str], burden_file: Optional[str], genes_to_keep: Optional[str], common_genotype_prefix: str)

deeprvat.deeprvat.associate.regress_common_(config: Dict, use_bias: bool, burdens: numpy.ndarray, y: numpy.ndarray, gene_indices: numpy.ndarray, genes: pandas.Series, x_pheno: numpy.ndarray, common_genotype_prefix: str, use_x_pheno: bool = True, do_scoretest: bool = True) → pandas.DataFrame
