# Seed gene discovery

This pipeline discovers *seed genes* for DeepRVAT training. It runs SKAT and burden tests on missense and pLoF variants, weighting variants with Beta(MAF, 1, 25). The tests use the `Scoretest` from the [SEAK](https://github.com/HealthML/seak) package, which has to be installed from GitHub.
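
To make the Beta(MAF, 1, 25) weighting concrete, here is a minimal sketch that evaluates the Beta(1, 25) density at a variant's minor allele frequency. The function names `beta_pdf` and `beta_maf_weight` are hypothetical and not part of DeepRVAT or SEAK, and whether the pipeline applies any further transformation to these weights is not covered here.

```python
import math


def beta_pdf(x: float, a: float, b: float) -> float:
    """Density of the Beta(a, b) distribution at x, for 0 < x < 1."""
    # log B(a, b) = lgamma(a) + lgamma(b) - lgamma(a + b)
    log_norm = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    log_density = (a - 1) * math.log(x) + (b - 1) * math.log1p(-x) - log_norm
    return math.exp(log_density)


def beta_maf_weight(maf: float) -> float:
    """Hypothetical helper: Beta(1, 25) density evaluated at the MAF."""
    return beta_pdf(maf, 1.0, 25.0)


# Rarer variants receive larger weights.
for maf in (1e-5, 1e-4, 1e-3):
    print(f"MAF={maf:g} weight={beta_maf_weight(maf):.4f}")
```

For these parameters the density simplifies to 25 * (1 - MAF)^24, so the weight decreases monotonically as the MAF grows.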

To run the pipeline, create an experiment directory containing a `seed_gene_discovery_input_config.yaml` file. See the [example file](https://github.com/PMBio/deeprvat/blob/main/example/config/seed_gene_discovery_input_config.yaml). When the seed gene discovery pipeline is executed, a comprehensive `sg_discovery_config.yaml` file is automatically generated from the `seed_gene_discovery_input_config.yaml` input.

(input-data)=
## Input data

In addition, the experiment directory must contain the same input data as specified for [DeepRVAT](#input-data-formats), including:
- `annotations.parquet`
- `protein_coding_genes.parquet`
- `genotypes.h5`
- `variants.parquet`
- `phenotypes.parquet`
- `seed_gene_discovery_input_config.yaml` (use [this](https://github.com/PMBio/deeprvat/blob/main/example/config/seed_gene_discovery_input_config.yaml) as a template)

The `annotations.parquet` dataframe generated by the annotation pipeline can be used. To indicate whether a variant is a putative loss-of-function (pLoF) variant, add a column `is_plof` with values 0 or 1. We recommend setting it to `1` if the variant has been classified as any of these VEP consequences: `["splice_acceptor_variant", "splice_donor_variant", "frameshift_variant", "stop_gained", "stop_lost", "start_lost"]`.
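
The recommended rule can be sketched as a small function that maps a variant's VEP consequence terms to the 0/1 value of `is_plof`. This is only an illustration: the helper `is_plof` and the toy variant IDs are hypothetical, and how the consequences are stored in your `annotations.parquet` (and therefore how you apply the rule to the dataframe) depends on your annotation setup.

```python
from typing import Iterable

# VEP consequence terms recommended above for flagging pLoF variants.
PLOF_CONSEQUENCES = frozenset({
    "splice_acceptor_variant",
    "splice_donor_variant",
    "frameshift_variant",
    "stop_gained",
    "stop_lost",
    "start_lost",
})


def is_plof(consequences: Iterable[str]) -> int:
    """Return 1 if any consequence term is in the pLoF set, else 0."""
    return int(any(term in PLOF_CONSEQUENCES for term in consequences))


# Toy variants (hypothetical IDs) with their VEP consequence terms.
toy_variants = {
    "var_a": ["missense_variant"],
    "var_b": ["stop_gained", "splice_region_variant"],
    "var_c": ["synonymous_variant"],
}
flags = {variant_id: is_plof(terms) for variant_id, terms in toy_variants.items()}
print(flags)
```

The resulting values would then be written to the `is_plof` column, one per variant row.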

(configuration-file)=
## Configuration file

You can choose which variant types are tested, missense variants (identified by the `Consequence_missense_variant` column in `annotations.parquet`) and/or pLoF variants (`is_plof` column), via
```
variant_types:
    - missense
    - plof
```
and specify the test types that will be run via 
```
test_types:
   - skat
   - burden
```

The minor allele frequency threshold is set via 

```
rare_maf: 0.001
```

You can specify further test details in the test config using the following parameters:

- `center_genotype` Whether to center the genotype matrix (`True` or `False`)
- `neglect_homozygous` Whether the genotype value for homozygous variants is 1 (`True`) or 2 (`False`)
- `collapse_method` Burden test collapsing method. Supported values are `sum` and `max`
- `var_weight_function` Variant weighting function. Supported values are `beta_maf` (Beta(MAF, 1, 25)) and `sift_polpyen` (mean of 1-SIFT and PolyPhen-2 score)
- `min_mac` Minimum expected allele count for a gene to be included. The expected allele count is the cumulative allele frequency of the variants in the burden mask (e.g., pLoF variants) for a given gene, multiplied by the cohort size for quantitative traits or by the number of cases for binary traits.
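
As a worked illustration of the `min_mac` criterion, the sketch below computes the expected allele count for one gene from toy allele frequencies. The function names and numbers are hypothetical, and treating the threshold as inclusive (`>=`) is an assumption; the pipeline's own implementation may differ in detail.

```python
from typing import Optional, Sequence


def expected_allele_count(
    mask_mafs: Sequence[float],
    n_samples: int,
    n_cases: Optional[int] = None,
) -> float:
    """Cumulative allele frequency of a gene's mask variants times the sample count.

    For quantitative traits the cohort size is used; for binary traits
    pass n_cases, which is then used instead.
    """
    cumulative_af = sum(mask_mafs)
    n = n_samples if n_cases is None else n_cases
    return cumulative_af * n


def passes_min_mac(expected_count: float, min_mac: float = 50) -> bool:
    """Assumption: a gene is kept if its expected count reaches min_mac."""
    return expected_count >= min_mac


# Toy gene with three pLoF variants (made-up allele frequencies).
toy_mafs = [0.0002, 0.0001, 0.0005]

quantitative = expected_allele_count(toy_mafs, n_samples=100_000)
binary = expected_allele_count(toy_mafs, n_samples=100_000, n_cases=5_000)

print(quantitative, passes_min_mac(quantitative))
print(binary, passes_min_mac(binary))
```

With the same variants, a gene can pass the threshold for a quantitative trait measured in the whole cohort but fail it for a binary trait with few cases.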

```
test_config:
    center_genotype: True
    neglect_homozygous: False
    collapse_method: sum # collapsing method for the burden test
    var_weight_function: beta_maf
    min_mac: 50 # minimum expected allele count
```
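
To illustrate how `neglect_homozygous` and `collapse_method` interact, here is a minimal sketch of a per-sample burden score for one gene. It assumes that variant weights are applied before collapsing and leaves out centering; the helper names are hypothetical and this is not the pipeline's actual code.

```python
from typing import Sequence


def encode_genotype(alt_allele_count: int, neglect_homozygous: bool) -> int:
    """Map an alternate-allele count (0, 1 or 2) to the value used in the test."""
    if alt_allele_count == 2 and neglect_homozygous:
        return 1  # homozygous carriers are treated like heterozygous ones
    return alt_allele_count


def burden_score(
    alt_allele_counts: Sequence[int],
    weights: Sequence[float],
    collapse_method: str = "sum",
    neglect_homozygous: bool = False,
) -> float:
    """Collapse one sample's weighted genotypes across a gene's variants."""
    weighted = [
        w * encode_genotype(g, neglect_homozygous)
        for g, w in zip(alt_allele_counts, weights)
    ]
    if collapse_method == "sum":
        return sum(weighted)
    if collapse_method == "max":
        return max(weighted, default=0.0)
    raise ValueError(f"Unsupported collapse_method: {collapse_method!r}")


# One sample, three variants in the gene (toy values).
counts = [0, 1, 2]
weights = [1.0, 2.0, 3.0]

for method in ("sum", "max"):
    for neglect in (False, True):
        score = burden_score(counts, weights, method, neglect)
        print(method, neglect, score)
```

With `sum`, every carried variant contributes to the score; with `max`, only the largest weighted genotype counts.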

## Running the seed gene discovery pipeline

In a directory with all the [input data](#input-data) required and your [configuration file](#configuration-file) set up, run: 

```
snakemake -j 1 --snakefile [path_to_deeprvat]/pipelines/seed_gene_discovery.snakefile
```

Replace `[path_to_deeprvat]` with the path to your clone of the repository.