DeepRVAT preprocessing pipeline
The DeepRVAT preprocessing pipeline is based on snakemake. It uses bcftools+samtools and a python script preprocessing.py.
Output
The important files that this pipeline produces that are needed in DeepRVAT are:
preprocessed/genotypes.h5 The main sparse hdf5 file
norm/variants/variants.parquet List of variants in parquet format
Setup environment
Create the DeepRVAT processing environment
Clone this repository:
git clone git@github.com:PMBio/deeprvat.git
Change directory to the repository: cd deeprvat
mamba env create --file deeprvat_preprocessing_env.yml
Activate the environment
mamba activate deeprvat_preprocess
Install DeepRVAT in the environment
pip install -e .
Configure preprocessing
The snakemake preprocessing is configured using a yaml file with the format below. An example file is included in this repo: example config.
# What chromosomes should be processed
included_chromosomes : [21,22]
# The format of the name of the "raw" vcf files
vcf_files_list: vcf_files_list.txt
# If you need to run a cmd to load bcf and samtools specify it here, see example
bcftools_load_cmd : # module load bcftools/1.10.2 &&
samtools_load_cmd : # module load samtools/1.9 &&
# Path to where you want to write results and intermediate data
working_dir: workdir
# These paths are all relative to the working dir
# Here will the finished preprocessed files end up
preprocessed_dir_name : preprocessed
# Path to directory with fasta reference file
reference_dir_name : reference
# Here we will store normalized bcf files
norm_dir_name : norm
# Here we store "sparsified" bcf files
sparse_dir_name : sparse
# Expected to be found in working_dir/reference_dir
reference_fasta_file : GRCh38.primary_assembly.genome.fa
# You can specify a different zcat cmd for example gzcat here, default zcat
zcat_cmd:
The config above would use the following directory structure:
parent_directory
`-- workdir
|-- norm
| |-- bcf
| |-- sparse
| `-- variants
|-- preprocessed
|-- qc
| |-- allelic_imbalance
| |-- duplicate_vars
| |-- filtered_samples
| |-- hwe
| |-- indmiss
| | |-- samples
| | |-- sites
| | `-- stats
| |-- read_depth
| `-- varmiss
`-- reference
vcf_files_list
The vcf_files_list variable specifies the path to a text file that contains paths to the raw vcf files you want to
process.
ex:
data/vcf/test_vcf_data_c21_b1.vcf.gz
data/vcf/test_vcf_data_c22_b1.vcf.gz
The easiest way to create vcf_files_list (if you have your files in data/vcf under the parent_directory)
cd <parent_directory>
find data/vcf -type f -name "*.vcf*" > vcf_files_list.txt
Running the preprocess pipeline
There are two versions of the pipeline, one with qc (quality control) and one without, the version with qc is the one we used when we wrote the paper. The qc is specific to the UKBB data, so if you want/need to do your own qc use the pipeline without qc.
Run the preprocess pipeline with example data and qc
The vcf files in the example data folder was generated using fake-vcf (with some manual editing). hence does not contain real data.
cd into the preprocessing example dir
cd <path_to_repo>
cd example/preprocess
Download the fasta file
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/GRCh38.primary_assembly.genome.fa.gz -P workdir/reference
Unpack the fasta file
gzip -d workdir/reference/GRCh38.primary_assembly.genome.fa.gz
Run with the example config
snakemake -j 1 --snakefile ../../pipelines/preprocess_with_qc.snakefile --configfile ../config/deeprvat_preprocess_config.yaml
Enjoy the preprocessed data 🎉
ls -l workdir/preprocessed
total 48
-rw-r--r-- 1 user staff 6404 Aug 2 14:06 genotypes.h5
-rw-r--r-- 1 user staff 6354 Aug 2 14:06 genotypes_chr21.h5
-rw-r--r-- 1 user staff 6354 Aug 2 14:06 genotypes_chr22.h5
Run the preprocess pipeline with example data and no qc
The vcf files in the example data folder was generated using fake-vcf (with some manual editing). hence does not contain real data.
cd into the preprocessing example dir
cd <path_to_repo>
cd example/preprocess
Download the fasta file
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/GRCh38.primary_assembly.genome.fa.gz -P workdir/reference
Unpack the fasta file
gzip -d workdir/reference/GRCh38.primary_assembly.genome.fa.gz
Run with the example config
snakemake -j 1 --snakefile ../../pipelines/preprocess_no_qc.snakefile --configfile ../../example/config/deeprvat_preprocess_config.yaml
Enjoy the preprocessed data 🎉
ls -l workdir/preprocessed
total 48
-rw-r--r-- 1 user staff 6404 Aug 2 14:06 genotypes.h5
-rw-r--r-- 1 user staff 6354 Aug 2 14:06 genotypes_chr21.h5
-rw-r--r-- 1 user staff 6354 Aug 2 14:06 genotypes_chr22.h5
Run on your own data with qc
After configuration and activating the environment run the pipeline using snakemake:
snakemake -j<nr_cores> --configfile config/deeprvat_preprocess_config.yaml -s preprocess_with_qc.snakefile
Run on your own data without qc
After configuration and activating the environment run the pipeline using snakemake:
snakemake -j<nr_cores> --configfile config/deeprvat_preprocess_config.yaml -s preprocess_no_qc.snakefile