DeepRVAT preprocessing pipeline

The DeepRVAT preprocessing pipeline is based on snakemake. It uses bcftools+samtools and a python script preprocessing.py.

DeepRVAT preprocessing pipeline

Output

The important files that this pipeline produces that are needed in DeepRVAT are:

  • preprocessed/genotypes.h5 The main sparse hdf5 file

  • norm/variants/variants.parquet List of variants in parquet format

Setup environment

Create the DeepRVAT processing environment

Clone this repository:

git clone git@github.com:PMBio/deeprvat.git

Change directory to the repository: cd deeprvat

mamba env create --file deeprvat_preprocessing_env.yml

Activate the environment

mamba activate deeprvat_preprocess

Install DeepRVAT in the environment

pip install -e .

Configure preprocessing

The snakemake preprocessing is configured using a yaml file with the format below. An example file is included in this repo: example config.

# What chromosomes should be processed
included_chromosomes : [21,22]

# The format of the name of the "raw" vcf files
vcf_files_list: vcf_files_list.txt

# If you need to run a cmd to load bcf and samtools specify it here, see example
bcftools_load_cmd : # module load bcftools/1.10.2 &&
samtools_load_cmd : # module load samtools/1.9 &&

# Path to where you want to write results and intermediate data
working_dir: workdir

# These paths are all relative to the working dir
# Here will the finished preprocessed files end up
preprocessed_dir_name : preprocessed
# Path to directory with fasta reference file
reference_dir_name : reference
# Here we will store normalized bcf files
norm_dir_name : norm
# Here we store "sparsified" bcf files
sparse_dir_name : sparse

# Expected to be found in working_dir/reference_dir
reference_fasta_file : GRCh38.primary_assembly.genome.fa

# You can specify a different zcat cmd for example gzcat here, default zcat
zcat_cmd:

The config above would use the following directory structure:

parent_directory
`-- workdir
    |-- norm
    |   |-- bcf
    |   |-- sparse
    |   `-- variants
    |-- preprocessed
    |-- qc
    |   |-- allelic_imbalance
    |   |-- duplicate_vars
    |   |-- filtered_samples
    |   |-- hwe
    |   |-- indmiss
    |   |   |-- samples
    |   |   |-- sites
    |   |   `-- stats
    |   |-- read_depth
    |   `-- varmiss
    `-- reference

vcf_files_list

The vcf_files_list variable specifies the path to a text file that contains paths to the raw vcf files you want to process.

ex:

data/vcf/test_vcf_data_c21_b1.vcf.gz
data/vcf/test_vcf_data_c22_b1.vcf.gz

The easiest way to create vcf_files_list (if you have your files in data/vcf under the parent_directory)

cd <parent_directory>
find data/vcf -type f -name "*.vcf*" > vcf_files_list.txt

Running the preprocess pipeline

There are two versions of the pipeline, one with qc (quality control) and one without, the version with qc is the one we used when we wrote the paper. The qc is specific to the UKBB data, so if you want/need to do your own qc use the pipeline without qc.

Run the preprocess pipeline with example data and qc

DeepRVAT preprocessing pipeline

The vcf files in the example data folder was generated using fake-vcf (with some manual editing). hence does not contain real data.

  1. cd into the preprocessing example dir

cd <path_to_repo>
cd example/preprocess
  1. Download the fasta file

wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/GRCh38.primary_assembly.genome.fa.gz -P workdir/reference
  1. Unpack the fasta file

gzip -d workdir/reference/GRCh38.primary_assembly.genome.fa.gz
  1. Run with the example config

snakemake -j 1 --snakefile ../../pipelines/preprocess_with_qc.snakefile --configfile ../config/deeprvat_preprocess_config.yaml
  1. Enjoy the preprocessed data 🎉

ls -l workdir/preprocessed
total 48
-rw-r--r--  1 user  staff  6404 Aug  2 14:06 genotypes.h5
-rw-r--r--  1 user  staff  6354 Aug  2 14:06 genotypes_chr21.h5
-rw-r--r--  1 user  staff  6354 Aug  2 14:06 genotypes_chr22.h5

Run the preprocess pipeline with example data and no qc

DeepRVAT preprocessing pipeline

The vcf files in the example data folder was generated using fake-vcf (with some manual editing). hence does not contain real data.

  1. cd into the preprocessing example dir

cd <path_to_repo>
cd example/preprocess
  1. Download the fasta file

wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/GRCh38.primary_assembly.genome.fa.gz -P workdir/reference
  1. Unpack the fasta file

gzip -d workdir/reference/GRCh38.primary_assembly.genome.fa.gz
  1. Run with the example config

snakemake -j 1 --snakefile ../../pipelines/preprocess_no_qc.snakefile --configfile ../../example/config/deeprvat_preprocess_config.yaml
  1. Enjoy the preprocessed data 🎉

ls -l workdir/preprocessed
total 48
-rw-r--r--  1 user  staff  6404 Aug  2 14:06 genotypes.h5
-rw-r--r--  1 user  staff  6354 Aug  2 14:06 genotypes_chr21.h5
-rw-r--r--  1 user  staff  6354 Aug  2 14:06 genotypes_chr22.h5

Run on your own data with qc

After configuration and activating the environment run the pipeline using snakemake:

snakemake -j<nr_cores> --configfile config/deeprvat_preprocess_config.yaml -s preprocess_with_qc.snakefile  

Run on your own data without qc

After configuration and activating the environment run the pipeline using snakemake:

snakemake -j<nr_cores> --configfile config/deeprvat_preprocess_config.yaml -s preprocess_no_qc.snakefile