--- orphan: true --- # Training and association testing with DeepRVAT We have developed multiple modes of running DeepRVAT to suit your needs. Below are listed various running setups that entail just training DeepRVAT, using pretrained DeepRVAT models for association testing, using precomputed burdens for association testing, including REGENIE in training and association testing and also combinations of these scenarios. The general procedure is to have the relevant input data for a given setup appropriately prepared, which may include having already completed the [preprocessing pipeline](https://deeprvat.readthedocs.io/en/latest/preprocessing.html) and [annotation pipeline](https://deeprvat.readthedocs.io/en/latest/annotations.html). (common requirements for input data)= ## Input data: Common requirements for all pipelines An example overview of what your `experiment` directory should contain can be seen here: `[path_to_deeprvat]/example/` Replace `[path_to_deeprvat]` with the path to your clone of the repository. Note that the example data contained within the example directory is randomly generated, and is only suited for testing. - `genotypes.h5` contains the genotypes for all samples in a custom sparse format. The sample ids in the `samples` dataset are the same as in the VCF files the `genotypes.h5` has been read from. This is output by the preprocessing pipeline. Instructions [here](https://deeprvat.readthedocs.io/en/latest/preprocessing.html). - `variants.parquet` contains variant characteristics (`chrom`, `pos`, `ref`, `alt`) and the assigned variant `id` for all unique variants in `genotypes.h5`. This is output from the input VCF files using the preprocessing pipeline. Instructions [here](https://deeprvat.readthedocs.io/en/latest/preprocessing.html). - `annotations.parquet` contains the variant annotations for all variants in `variants.parquet`, which is an output from the annotation pipeline. Each variant is identified by its `id`. Instructions [here](https://deeprvat.readthedocs.io/en/latest/annotations.html). - `protein_coding_genes.parquet` Maps the integer `gene_id` used in `annotations.parquet` to standard gene IDs (EnsemblID and HGNC gene name). This is an output from the annotation pipeline. Instructions [here](https://deeprvat.readthedocs.io/en/latest/annotations.html). - `config.yaml` contains the configuration parameters for setting phenotypes, training data, model, training, and association data variables. - `phenotypes.parquet` contains the measured phenotypes for all samples (see `[path_to_deeprvat]/example/`). The row index must be the sample id as strings (same ids as used in the VCF file) and the column names the phenotype name. Phenotypes can be quantitative or binary (0,1). Use `NA` for missing values. Samples missing in `phenotypes.parquet` won't be used in DeepRVAT training/testing. The user must generate this file as it's not output by the preprocessing/annotation pipeline. This file must also contain all covariates that should be used during training/association testing (e.g., genetic sex, age, genetic principal components). - `baseline_results` directory containing the results of the seed gene discovery pipline. Insturctions [here](seed_gene_discovery.md) (common configuration parameters)= ## Configuration file: Common parameters The `config.yaml` file located in your `experiment` directory contains the configuration parameters of key sections: phenotypes, baseline_results, training_data, and data. It also allows to set many other configurations detailed below. `config['training_data']` contains the relevant specifications for the training dataset creation. `config['data']` contains the relevant specifications for the association dataset creation. ### Baseline results `config['baseline_results']` specifies paths to results from the seed gene discovery pipeline (Burden/SKAT test with pLoF and missense variants). When using the seed gene discovery pipeline provided with this package, simply link the directory as 'baseline_results' in the experiment directory without any further changes. If you want to provide custom baseline results (already combined across tests), store them like `baseline_results/{phenotype}/combined_baseline/eval/burden_associations.parquet` and set the `baseline_results` in the config to ``` - base: baseline_results type: combined_baseline ``` Baseline files have to be provided for each `{phenotype}` in `config['training']['phenotypes']`. The `burden_associations.parquet` must have the columns `gene` (gene id as assigned in `protein_coding_genes.parquet`) and `pval` (see `[path_to_deeprvat]/example/baseline_results`). ### Phenotypes `config['phenotypes]` should consist of a complete list of phenotypes. To change phenotypes used during training, use `config['training']['phenotypes']`. The phenotypes that are not listed under `config['training']['phenotypes']`, but are listed under `config['phenotypes]` will subsequently be used only for association testing. All phenotypes listed either in `config['phenotypes']` or `config['training']['phenotypes']` have to be in the column names of `phenotypes.parquet`. ### Customizing the input data via the config file #### Data transformation The pipeline supports z-score standardization (`standardize`) and quantile transformation (`quantile_transform`) as transformation to of the target phenotypes. It has to be set in `config[key]['dataset_config']['y_transformation]`, where `key` is `training_data` or `data` to transform the training data and association testing data, respectively. For the annotations and the covariates, we allow standardization via `config[key]['dataset_config']['standardize_xpheno'] = True` (default = True) and `config[key]['dataset_config']['standardize_anno'] = True` (default = False). If custom transformations are whished, we recommend to replace the respective columns in `phenotypes.parquet` or `annotations.parquet` with the transformed values. #### Variant annotations All variant anntations that should be included in DeepRVAT's variant annotation vectors have to be listed in `config[key]['dataset_config']['annotations']` and `config[key]['dataset_config']['rare_embedding']['config']['annotations']` (this will be simplified in future). Any annotation that is used for variant filtering `config[key]['dataset_config']['rare_embedding']['config']['thresholds']` also has to be included in `config[key]['dataset_config']['annotations']`. #### Variant minor allele frequency filter To filter for variants with a MAF below a certain value (e.g., UKB_MAF < 0.1%), use: `config[key]['dataset_config']['rare_embedding']['config']['thresholds']['UKB_MAF'] = "UKB_MAF < 1e-3"`. In this example, `UKB_MAF` represents the MAF column from `annotations.parquet` here denoting MAF in the UK Biobank. #### Additional variant filters Additional variant filters can be added via `config[key]['dataset_config']['rare_embedding']['config']['thresholds'][{anno}] = "{anno} > X"`. For example, `config['data]['dataset_config']['rare_embedding']['config']['thresholds']['CADD_PHRED'] = "CADD_PHRED > 5"` will only include variants with a CADD score > 5 during association testing. Mind that all annotations used in the `threshold` section also have to be listed in `config[key]['dataset_config']['annotations']`. #### Subsetting samples To specify a sample file for training or association testing, use: `config[key]['dataset_config']['sample_file']`. Only `.pkl` files containing a list of sample IDs (string) are supported at the moment. For example, if DeepRVAT training and association testing should be done on two separate data sets, you can provide two sample files `training_samples.pkl` and `test_samples.pkl` via `config['training_data']['dataset_config']['sample_file] = training_samples.pkl` and `config['data']['dataset_config']['sample_file] = test_samples.pkl`. ## Association testing using precomputed burdens _Coming soon_ (Association_testing)= ## Association testing using pretrained models If you already have a pretrained DeepRVAT model, we have setup pipelines for running only the association testing stage. This includes creating the association dataset files, computing burdens, regression, and evaluation. ### Input data The following files should be contained within your `experiment` directory: - `config.yaml` - `genotypes.h5` - `variants.parquet` - `annotations.parquet` - `phenotypes.parquet` - `protein_coding_genes.parquet` ### Configuration file The annotations in `config['data']['dataset_config']['rare_embedding']['config']['annotations']` must be the same (and in the same order) as in `config['data']['dataset_config']['rare_embedding']['config']['annotations']` from the pre-trained model. If you use the pre-trained DeepRVAT model provided with this package, use `config['data']['dataset_config']['rare_embedding']['config']['annotations']` from the `[path_to_deeprvat]/example/config.yaml` to ensure the ordering of annotations is correct. ### Running the association testing pipeline with REGENIE _Coming soon_ ## Training To run only the training stage of DeepRVAT, comprised of training data creation and running the DeepRVAT model, we have set up a training pipeline. ### Input data The following files should be contained within your `experiment` directory: - `config.yaml` - `genotypes.h5` - `variants.parquet` - `annotations.parquet` - `phenotypes.parquet` - `protein_coding_genes.parquet` - `baseline_results` directory where `[path_to_deeprvat]/pipelines/seed_gene_discovery.snakefile` has been run ### Configuration file Changes to the model architecture and training parameters can be made via `config['training']`, `config['pl_trainer']`, `config['early_stopping']`, `config['model']`. Per default, DeepRVAT scores are ensembled from 6 models. This can be changed via `config['n_repeats']`. ### Running the training pipeline ```shell cd experiment snakemake -j 1 --snakefile [path_to_deeprvat]/pipelines/run_training.snakefile ``` ## Training and association testing using cross-validation DeepRVAT offers a CV scheme, where it's trained on all samples except those in the held-out fold. Then, it computes gene impairment scores for the held-out samples using models that excluded them. This is repeated for all folds, yielding DeepRVAT scores for all samples. ### Input data and configuration file The following files should be contained within your `experiment` directory: - `config.yaml` - `genotypes.h5` - `variants.parquet` - `annotations.parquet` - `phenotypes.parquet` - `protein_coding_genes.parquet` - `baseline_results` directory - `sample_files` provides training and test samples for each cross-validation fold as pickle files. ### Config and sample files For running 5-fold cross-validation include the following configuration in the config: ``` cv_path: sample_files n_folds: 5 ``` Provide sample files structured as `sample_files/5_fold/samples_{split}{fold}.pkl`, where `{split}` represents train/test and `{fold}` is a number from `0 to 4`. ### Run the pipeline ```shell cd experiment snakemake -j 1 --snakefile [path_to_deeprvat]/pipelines/cv_training/cv_training_association_testing.snakefile ``` ## Running only a portion of any pipeline The snakemake pipelines outlined above are compromised of integrated common workflows. These smaller snakefiles which breakdown specific pipelines sections are in the following directories: - `[path_to_deeprvat]/pipeline/association_testing` contains snakefiles breaking down stages of the association testing. - `[path_to_deeprvat]/pipeline/cv_training` contains snakefiles used to run training in a cross-validation setup. - `[path_to_deeprvat]/pipeline/training` contains snakefiles used in setting up deepRVAT training.