DeepRVAT annotation pipeline

This pipeline is based on snakemake. It uses bcftools + samstools, as well as perl, deepRiPe and deepSEA as well as VEP, including plugins for primateAI and spliceAI. DeepRiPe annotations were acquired using faatpipe repository by HealthML [1] and DeepSea annotations were calculated using kipoi-veff2 [2], abSplice scores were computed using abSplice [3]. The pipeline was tested on linux and macOS.

dag

Figure 1: Rule graph of the annotation pipeline.

Output

This pipeline outputs a parquet file including all annotations as well as a file containing IDs to all protein coding genes needed to run DeepRVAT. Apart from this the pipeline outputs a PCA transformation matrix for deepSEA as well as means and standard deviations used to standardize deepSEA scores before PCA analysis. This is helpful to recreate results using a different dataset. Furthermore, the pipeline outputs one annotation file for VEP, CADD, DeepRiPe, DeepSea and AbSplice for each input vcf-file. The tool then creates concatenates the files, performs PCA on the deepSEA scores and merges the result into a single file.

Input

The pipeline uses left-normalized bcf files containing variant information (e.g. the bcf files created by the preprocessing pipeline) , a reference fasta file, and a gtf file for gene information. It is expected that the bcf files contain the columns “CHROM” “POS” “ID” “REF” and “ALT”. Any other columns, including genotype information, are stripped from the data before annotation tools are used on the data. The variants may be split into several vcf files for each chromosome and each “block” of data. The filenames should then contain the corresponding chromosome and block number. The pattern of the file names, as well as file structure may be specified in the corresponding config file. The pipeline also requires input data and repositories descried in requirements.

Requirements

BCFtools as well as HTSlib should be installed on the machine, VEP should be installed for running the pipeline. The faatpipe repo, kipoi-veff2 repo and vep-plugins repository should be cloned. Annotation data for CADD, spliceAI and primateAI should be downloaded. The path to the data may be specified in the corresponding config file. Download paths:

CADD: “All possible SNVs of GRCh38/hg38” and “gnomad.genomes.r3.0.indel.tsv.gz” incl. their Tabix Indices for CADD version 1.6
SpliceAI: “genome_scores_v1.3”/”spliceai_scores.raw.snv.hg38.vcf.gz” and “spliceai_scores.raw.indel.hg38.vcf.gz”
PrimateAI PrimateAI supplementary data/”PrimateAI_scores_v0.2_GRCh38_sorted.tsv.bgz”
AlphaMissense

Further requirements:

A reference GTF file containing transcript annotations is required, this can be downloaded from here.
A file containing all genes, which deeprvat should consider together with a unique integer id for each gene. This file may be created manually by the user or automatically using the gtf file as input to create a gene id file for all protein coding genes. See here for more details.

Configure the annotation pipeline

The snakemake annotation pipeline is configured using a yaml file with the format akin to the example config file.

The config above would use the following directory structure:

|--reference
|   |-- fasta file
|   |-- GTF file 
|   |-- gene id file

|-- preprocessing_workdir
|   |-- norm
|   |   |-- bcf
|   |   |   |-- bcf_input_files
|   |   |   |-- ...
|   |   |-- variants
|   |   |   |-- variants.tsv.gz
|   |-- preprocessed
|   |   |-- genotypes.h5


|-- output_dir
|   |-- annotations
|   |   |-- tmp
|   |   |   |-- deepSEA_PCA
|   |   |   |-- absplice

|-- repo_dir
|   |-- ensembl-vep
|   |   |-- cache
|   |   |-- plugins
|   |-- faatpipe
|   |-- kipoi-veff2

|-- annotation_data
|   |-- cadd
|   |-- spliceAI
|   |-- primateAI

Bcf files created by the preprocessing pipeline are used as input data. The input data directory should only contain the files needed. The pipeline also uses the variant.tsv file, the reference file and the genotypes file from the preprocessing pipeline. A GTF file as described in requirements and the FASTA file used for preprocessing is also necessary. The output is stored in the output_dir/annotations folder and any temporary files in the tmp subfolder. All repositories used including VEP with its corresponding cache as well as plugins are stored in repo_dir. Data for VEP plugins and the CADD cache are stored in annotation_data.

Running the annotation pipeline on example data

If you haven’t done so already, run the preprocessing pipeline
change into the example/annotations directory

Install the annotation environment

mamba env create -f path/to/deeprvat/deeprvat_annotations.yml
mamba activate deeprvat_annotations
pip install -e path/to/deeprvat

Clone/Link the repositories mentioned in requirements into repo_dir and install the required kipoi-veff2 conda environment with
```
mamba env create -f repo_dir/kipoi-veff2/environment.minimal.linux.yml
```
If you already have some of the needed repositories on your machine you can edit the paths in the config.
Inside the annotation directory create a directory annotation_data and download/link the prescored files for CADD, SpliceAI, and PrimateAI (see requirements)

Run the annotations pipeline inside the deeprvat_annotations environment:

  mamba activate deeprvat_annotations
  snakemake -j <nr_cores> -s ../../pipelines/annotations.snakefile --configfile ../config/deeprvat_annotation_config.yaml --use-conda

Running the pipeline on your own data

Modify the path in the config file s.t. they point to the output directory of the preprocessing pipeline run on your data.

Configuring the annotation pipeline

You can add/remove VEP plugins in the additional_vep_plugin_cmds part of the config by adding /removing plugin commands to be added to the vep run command. You can omit abSplice and/or deepSea by setting include_absplice/ include_deepSEA to Falsein the config. When you add/remove annotations you have to alter the values in example/config/annotation_colnames_filling_values.yaml. This file consist of the names of the columns of the tool used, the name to be used in the output data frame, the default value replacing all NA values as well as the data type, for example:

  'CADD_RAW' : 
    - 'CADD_raw'
    - 0
    - float

Here CADD_RAW is the name of the column of the VEP output when the plugin is used, it is then renamed in the final annotation data frame to CADD_raw, all NA values are set to 0 and the values are of type float.

You can also modify the example/config/annotation_colnames_filling_values.yaml file to choose custom filling values for each of the annotations. For each of the annotations the second value represents the value to use to fill in NA values, i.e. in the example above, in the CADD_raw column NA values are filled using 0. If you want to keep a copy of the annotations data before any NA values are filled, you can add

keep_unfilled: True

to the config file.

You can also change the way the allele frequencies are calculated by adding af_mode key to the config file. By default, the allele frequencies are calculated from the data the annotation pipeline is run with. To use gnomade or gnomadg allele frequencies (from VEP ) instead, add

af_mode : 'af_gnomade'

or

af_mode : 'af_gnomadg'

to the config file.

Gene id file

As mentioned in the requirements section, the pipeline expects a parquet file containing all genes that deeprvat should consider, together with a unique integer id for each gene. This file can be created automatically using a GTF file as input. The output is then a parquet file in the expected format containing all protein coding genes of the provided GTF file. To automatically create the gene id file, make sure the annotation environment (mentioned here ) is active and run

deeprvat_annotations create-gene-id-file deeprvat/example/annotations/reference/gencode.v44.annotation.gtf.gz deeprvat/example/annotations/reference/protein_coding_genes.parquet

with deeprvat/example/annotations/reference/gencode.v44.annotation.gtf.gz pointing to any downloaded GTF file and deeprvat/example/annotations/reference/protein_coding_genes.parquet pointing to the desired output path, which has to be specified in the config file.

Alternatively, when the user want to select a specific set of genes to consider, the gene id file may be created by the user. The file is expected to have two columns:

columngene:str name for each gene
column id:int unique id for each gene Each row represents a gene the user want to include in the analysis.

Manually adding precomputed and downloaded AbSplice scores

Instead of running the absplice part of the annotation pipeline, users can include include_absplice: false in the annotation config file and manually add Absplice annotations after the pipeline ran through. To do this, after downloading and unpacking the precomputed absplice scores from zenodo, and running the annotation pipeline without absplice, users can run deeprvat_annotations aggregate_and_concat_absplice inside the deeprvat_annotations environment with --absplice-dir pointing to the unpacked directory containing the absplice scores and --ab-splice-agg-score-file pointing to the path where the intermediate aggregated absplice scores should be saved. Run deeprvat_annotations merge_absplice_scores afterwards with the annotations, variants and protein_coding_genes.parquet as well as the aggregated absplice scores as input. Finally state where you want the output annotation file stored (--output).

References

[1] Monti, R., Rautenstrauch, P., Ghanbari, M. et al. Identifying interpretable gene-biomarker associations with functionally informed kernel-based tests in 190,000 exomes. Nat Commun 13, 5332 (2022). https://doi.org/10.1038/s41467-022-32864-2

[2] Žiga Avsec et al., “Kipoi: accelerating the community exchange and reuse of predictive models for genomics,” bioRxiv, p. 375345, Jan. 2018, doi: 10.1101/375345.

[3]N. Wagner et al., “Aberrant splicing prediction across human tissues,” Nature Genetics, vol. 55, no. 5, pp. 861–870, May 2023, doi: 10.1038/s41588-023-01373-3.