deeprvat.annotations.annotations

Module Contents

Functions

precision

Calculate precision, a metric for the accuracy of the positive predictions.

recall

Calculate recall, a metric for the ability to capture true positive instances.

deepripe_get_model_info

Retrieve information about the paths and names of saved deepRiPe models.

seq_to_1hot

Convert a nucleotide sequence to one-hot encoding.

convert2bed

Convert a variants file to BED format.

deepripe_encode_variant_bedline

Encode a variant bedline into one-hot encoded sequences.

readYamlColumns

get_parquet_columns

cli

filter_annotations_by_exon_distance

Filters annotations based on their distance to the nearest exon of the gene they are associated with.

deepsea_pca

Perform Principal Component Analysis (PCA) on DeepSEA data and save the results.

scorevariants_deepripe

Score variants using deep learning models trained on PAR-CLIP and eCLIP data.

process_chunk

Process a chunk of data from absplice site results and merge it with the remaining annotation data.

aggregate_abscores

Aggregate AbSplice scores from AbSplice results and save the results.

deepripe_score_variant_onlyseq_all

Compute variant scores using a deep learning model for each specified variant.

calculate_scores_max

merge_abscores

Merge AbSplice scores with the current annotation file and save the result.

merge_deepripe

Merge deepRiPe scores with an annotation file and save the result.

merge_deepsea_pcas

Merge deepRiPe PCA scores with an annotation file and save the result.

process_chunk_addids

Process a chunk of data by adding identifiers from a variants dataframe.

add_ids

Add identifiers from a variant file to an annotation file and save the result.

add_ids_dask

Add identifiers from a variant file to an annotation file using Dask and save the result.

chunks

Split a list into chunks of size ‘n’.

read_deepripe_file

Read a DeepRipe file from the specified path.

concatenate_deepsea

Concatenate DeepSEA files based on the provided patterns and chromosome blocks.

merge_annotations

Merge VEP, DeepRipe (parclip, hg2, k5), and variant files into one DataFrame and save the result as a parquet file.

process_deepripe

Process the DeepRipe DataFrame, rename columns and drop duplicates.

process_vep

Process the VEP DataFrame, extracting relevant columns and handling data types.

compute_plof

Computes and adds a plof column based on the plof function.

concat_annotations

Concatenate multiple annotation files based on the specified pattern and create a single output file.

get_af_from_gt

Compute allele frequencies from genotype data.

merge_af

Merge allele frequency data into annotations and save to a file.

calculate_maf

Calculate minor allele frequency (MAF) from allele frequency data in annotations.

add_gene_ids

Add gene IDs to the annotations based on gene ID mapping file.

create_gene_id_file

Create a protein ID mapping file from the GTF file.

select_rename_fill_annotations

Select, rename, and fill missing values in annotation columns based on a YAML configuration file.

aggregate_and_concat_absplice

Aggregates AbSplice scores from multiple files in a directory and saves them to a single Parquet file. This function performs the following steps: 1. Iterates through all files in the specified directory. 2. Reads each file and aggregates the AbSplice scores using the specified aggregation function. 3. Saves the final aggregated scores to a specified Parquet file.

merge_absplice_scores

Processes and merges AbSplice scores into the annotation parquet file.

Data

logger

API

deeprvat.annotations.annotations.precision(y_true, y_pred)

Calculate precision, a metric for the accuracy of the positive predictions.

Precision is defined as the fraction of relevant instances among the retrieved instances.

Parameters: - y_true (Tensor): The true labels (ground truth). - y_pred (Tensor): The predicted labels.

Returns: float: Precision value.

Notes: - This function uses the Keras backend functions to perform calculations. - Precision is calculated as true_positives / (predicted_positives + epsilon), where epsilon is a small constant to avoid division by zero.

References: - https://en.wikipedia.org/wiki/Precision_and_recall
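The formula in the notes can be illustrated with a NumPy stand-in. This is a minimal sketch and not the module's Keras implementation: the function name `precision_np`, the clip-and-round step, and the `eps` value are assumptions made for illustration.

```python
import numpy as np


def precision_np(y_true, y_pred, eps=1e-7):
    """NumPy illustration of true_positives / (predicted_positives + eps)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Clip to [0, 1] and round so that probabilities become hard 0/1 calls.
    true_positives = np.round(np.clip(y_true * y_pred, 0, 1)).sum()
    predicted_positives = np.round(np.clip(y_pred, 0, 1)).sum()
    return float(true_positives / (predicted_positives + eps))


y_true = [1, 0, 1, 1]
y_pred = [0.9, 0.8, 0.2, 0.7]
# Three predictions round to positive; two of them are true positives.
precision_value = precision_np(y_true, y_pred)
```

With these toy inputs the value is close to 2/3, slightly reduced by the epsilon in the denominator.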

deeprvat.annotations.annotations.recall(y_true, y_pred)

Calculate recall, a metric for the ability to capture true positive instances.

Recall is defined as the fraction of relevant instances that were retrieved.

Parameters: - y_true (Tensor): The true labels (ground truth). - y_pred (Tensor): The predicted labels.

Returns: - float: Recall value.

Notes: - This function uses the Keras backend functions to perform calculations. - Recall is calculated as true_positives / (possible_positives + epsilon), where epsilon is a small constant to avoid division by zero.

References: - https://en.wikipedia.org/wiki/Precision_and_recall
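As with precision, the recall formula can be sketched in NumPy. The function name `recall_np`, the clip-and-round step, and the `eps` value are assumptions for illustration; the module itself is documented as using Keras backend functions.

```python
import numpy as np


def recall_np(y_true, y_pred, eps=1e-7):
    """NumPy illustration of true_positives / (possible_positives + eps)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Clip to [0, 1] and round so that probabilities become hard 0/1 calls.
    true_positives = np.round(np.clip(y_true * y_pred, 0, 1)).sum()
    possible_positives = np.round(np.clip(y_true, 0, 1)).sum()
    return float(true_positives / (possible_positives + eps))


y_true = [1, 1, 1, 0]
y_pred = [0.9, 0.1, 0.2, 0.8]
# Three labels are positive; only the first one is recovered.
recall_value = recall_np(y_true, y_pred)
```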

deeprvat.annotations.annotations.deepripe_get_model_info(saved_models_dict, saved_deepripe_models_path)

Retrieve information about the paths and names of saved deepRiPe models.

Parameters: - saved_models_dict (dict): A dictionary containing keys for different types of models. Keys include “parclip” for PAR-CLIP models, “eclip_hg2” for eCLIP models in HepG2, and “eclip_k5” for eCLIP models in K562. Values are model identifiers. - saved_deepripe_models_path (str): The path to the directory where the deepRiPe models are saved.

Returns: tuple: A tuple containing two dictionaries. The first dictionary contains paths for each type of model, with keys “parclip”, “eclip_hg2”, and “eclip_k5” and values as lists of paths corresponding to high, medium, and low sequence models. The second dictionary contains lists of RBP names for each type of model, with keys “parclip”, “eclip_hg2”, and “eclip_k5” and values as lists of RBP names for high, medium, and low sequence models.

Notes: - The function constructs file paths based on the provided model identifiers. - The resulting dictionary structure allows easy access to model paths for different types.

deeprvat.annotations.annotations.seq_to_1hot(seq, randomsel=True)

Convert a nucleotide sequence to one-hot encoding.

Parameters: - seq (str): The input nucleotide sequence. - randomsel (bool): If True, treat ambiguous base as random base. If False, return only zero rows for ambiguous case.

Returns: numpy.ndarray: A 2D array representing the one-hot encoding of the input sequence. Rows correspond to nucleotides ‘A’, ‘C’, ‘G’, ‘T’ in that order. Columns correspond to positions in the input sequence.

Notes: - Ambiguous bases are handled based on the ‘randomsel’ parameter.

References: - one-hot encoding: https://en.wikipedia.org/wiki/One-hot
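The layout described above (rows for A, C, G, T and one column per position) can be sketched as follows. This is an illustrative stand-in rather than the module's code: `one_hot_sketch` is a hypothetical name, and it only covers the `randomsel=False` behaviour, where an ambiguous base yields an all-zero column.

```python
import numpy as np


def one_hot_sketch(seq):
    """Return a 4 x len(seq) array; rows are A, C, G, T, columns are positions."""
    row_of = {"A": 0, "C": 1, "G": 2, "T": 3}
    encoded = np.zeros((4, len(seq)), dtype=np.float32)
    for position, base in enumerate(seq.upper()):
        row = row_of.get(base)
        if row is not None:  # ambiguous bases such as N leave the column all-zero
            encoded[row, position] = 1.0
    return encoded


encoded = one_hot_sketch("ACGTN")
```

Here the first four columns form an identity matrix and the final column, for `N`, stays zero.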

deeprvat.annotations.annotations.convert2bed(variants_file, output_dir)

Convert a variants file to BED format.

Parameters: - variants_file (str): The path to the input variants file. - output_dir (str): The directory where the BED file will be saved.

Returns: None

Notes: - The input variants file should be in tab-separated format with columns: “#CHROM”, “POS”, “ID”, “REF”, “ALT”. - The generated BED file will have columns: “CHR”, “Start”, “End”, “ID”, “VAR”, “Strand”. - The “Start” and “End” columns are set to the “POS” values, and “Strand” is set to ‘.’ for all entries.
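The column mapping in the notes can be sketched with pandas. This is a hedged illustration on in-memory toy data, not the function itself; in particular, the content of the "VAR" column is not specified above, so rendering it as "REF/ALT" is an assumption.

```python
import io

import pandas as pd

# Toy tab-separated input with the documented columns.
variants_tsv = (
    "#CHROM\tPOS\tID\tREF\tALT\n"
    "chr1\t100\t1\tA\tG\n"
    "chr2\t250\t2\tC\tT\n"
)
variants = pd.read_csv(io.StringIO(variants_tsv), sep="\t")

bed = pd.DataFrame(
    {
        "CHR": variants["#CHROM"],
        "Start": variants["POS"],  # Start and End both take the POS value
        "End": variants["POS"],
        "ID": variants["ID"],
        # Assumption: VAR is shown here as "REF/ALT"; the real format may differ.
        "VAR": variants["REF"] + "/" + variants["ALT"],
        "Strand": ".",
    }
)
```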

deeprvat.annotations.annotations.deepripe_encode_variant_bedline(bedline, genomefasta, flank_size=75)

Encode a variant bedline into one-hot encoded sequences.

Parameters: - bedline (list): A list representing a variant bedline, containing elements for chromosome, start position, end position, reference allele, alternate allele, and strand. - genomefasta (str): The path to the genome FASTA file for sequence retrieval. - flank_size (int): The size of flanking regions to include in the sequence around the variant position.

Returns: numpy.ndarray: A 3D array representing one-hot encoded sequences. The dimensions are (num_sequences, sequence_length, nucleotide_channels).

Notes: - The input bedline should follow the format: [chromosome, start position, end position, reference allele, alternate allele, strand]. - The function retrieves the wild-type and mutant sequences flanked by the specified size. - The wild-type sequence is extracted from the genome FASTA file and mutated at the variant position. - The resulting sequences are one-hot encoded and returned as a numpy array.

References: - pybedtools.BedTool: https://daler.github.io/pybedtools/main.html - FASTA format: https://en.wikipedia.org/wiki/FASTA_format

deeprvat.annotations.annotations.readYamlColumns(annotation_columns_yaml_file)
deeprvat.annotations.annotations.get_parquet_columns(parquet_file)
deeprvat.annotations.annotations.cli()
deeprvat.annotations.annotations.filter_annotations_by_exon_distance(anno_path: str, gtf_path: str, genes_path: str, output_path: str, max_dist: int) -> None

Filters annotations based on their distance to the nearest exon of the gene they are associated with.

Args:

anno_path (str): Annotation parquet file containing variant annotations to filter.
gtf_path (str): GTF file containing start and end positions of all relevant exons of all relevant genes. DataFrame is filtered for protein coding exons.
genes_path (str): List of protein coding genes and their IDs in the annotation DataFrame.
output_path (str): Where to write the resulting parquet file.
max_dist (int): Base pairs used to filter.

Returns:

None

Writes:

Parquet file containing filtered annotations.

deeprvat.annotations.annotations.deepsea_pca(deepsea_file: str, pca_object: str, means_sd_df: str, out_dir: str, n_components: int)

Perform Principal Component Analysis (PCA) on DeepSEA data and save the results.

Parameters: - n_components (int): Number of principal components to retain, default is 100. - deepsea_file (str): Path to the DeepSEA data in parquet format. - pca_object (str): Path to save or load the PCA object (components) in npy or pickle format. - means_sd_df (str): Path to a DataFrame containing pre-calculated means and SDs for standardization. If path does not exist, standardization will be done using the calculated mean and SD, result will then be saved under this path - out_dir (str): Path to the output directory where the PCA results will be saved.

Returns: None

Raises: AssertionError: If there are NaN values in the PCA results DataFrame.

Notes: - If ‘means_sd_df’ is provided, the data will be standardized using the existing mean and SD. Otherwise, the data will be standardized using the mean and SD calculated from the data. - If ‘pca_object’ exists, it will be loaded as a PCA object. If it doesn’t exist, a new PCA object will be created, and its components will be saved to ‘pca_object’.

Example: $ python annotations.py deepsea_pca --n-components 50 deepsea_data.parquet pca_components.npy means_sd.parquet results/
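The standardize-then-project flow described in the notes can be sketched with NumPy alone. This is a minimal illustration on random toy data; it computes PCA through a singular value decomposition, which is an assumption and may differ from how the module builds or loads its PCA object.

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.normal(size=(20, 5))  # toy stand-in for DeepSEA feature columns

# Standardize each column; in the documented workflow these statistics may be precomputed.
means = scores.mean(axis=0)
sds = scores.std(axis=0)
standardized = (scores - means) / sds

# PCA through SVD: rows of vt are the principal axes, ordered by explained variance.
n_components = 2
_, _, vt = np.linalg.svd(standardized, full_matrices=False)
components = vt[:n_components]           # shape (n_components, n_features)
projected = standardized @ components.T  # shape (n_samples, n_components)

# Mirrors the documented check that the PCA results contain no NaN values.
assert not np.isnan(projected).any()
```

Reusing stored means, standard deviations, and components on new data applies the same transformation to every chunk, which is presumably why the function accepts paths for both.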

deeprvat.annotations.annotations.scorevariants_deepripe(variants_file: str, output_dir: str, genomefasta: str, pybedtools_tmp_dir: str, saved_deepripe_models_path: str, n_jobs: int, saved_model_type: str = 'parclip')

Score variants using deep learning models trained on PAR-CLIP and eCLIP data.

Parameters: - variants_file (str): Path to the file containing variant information to be annotated. - output_dir (str): Path to the output directory where the results will be saved. - genomefasta (str): Path to the reference genome in FASTA format. - pybedtools_tmp_dir (str): Path to the temporary directory for pybedtools. - saved_deepripe_models_path (str): Path to the directory containing saved deepRiPe models. - n_jobs (int): Number of parallel jobs for scoring variants. - saved_model_type (str, optional): Type of the saved deepRiPe model to use (parclip, eclip_hg2, eclip_k5). Default is “parclip”.

Returns: None

Raises: AssertionError: If there are NaN values in the generated DataFrame.

Notes: - This function scores variants using deepRiPe models trained on different CLIP-seq datasets. - The results are saved as a CSV file in the specified output directory.

Example: $ python annotations.py scorevariants_deepripe variants.csv results/ reference.fasta tmp_dir/ saved_models/ 8 eclip_k5

deeprvat.annotations.annotations.process_chunk(chrom_file, abs_splice_res_dir, tissues_to_exclude, tissue_agg_function, ca_shortened)

Process a chunk of data from absplice site results and merge it with the remaining annotation data.

Parameters: - chrom_file (str): The filename for the chunk of absplice site results. - abs_splice_res_dir (Path): The directory containing the absplice site results. - tissues_to_exclude (list): List of tissues to exclude from the absplice site results. - tissue_agg_function (str): The aggregation function to use for tissue-specific AbSplice scores. - ca_shortened (DataFrame): The remaining annotation data to merge with the absplice site results.

Returns: DataFrame: Merged DataFrame containing aggregated tissue-specific AbSplice scores and remaining annotation data.

Notes: - The function reads the absplice site results for a specific chromosome, excludes specified tissues, and aggregates AbSplice scores using the specified tissue aggregation function. - The resulting DataFrame is merged with the remaining annotation data based on the chromosome, position, reference allele, alternative allele, and gene ID.

Example: merged_data = process_chunk("chr1_results.csv", Path("abs_splice_results/"), ["Brain", "Heart"], "max", ca_shortened_df)
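The exclude, aggregate, and merge sequence from the notes can be sketched with pandas on toy data. All column names here (`tissue`, `AbSplice_DNA`, and the key columns) are assumptions for illustration and may not match the real AbSplice output.

```python
import pandas as pd

# Toy AbSplice-like results: one row per variant, gene, and tissue (hypothetical columns).
absplice = pd.DataFrame(
    {
        "chrom": ["chr1"] * 4,
        "pos": [100, 100, 100, 200],
        "ref": ["A", "A", "A", "C"],
        "alt": ["G", "G", "G", "T"],
        "gene_id": ["g1", "g1", "g1", "g2"],
        "tissue": ["Brain", "Liver", "Lung", "Liver"],
        "AbSplice_DNA": [0.9, 0.2, 0.4, 0.1],
    }
)
annotations = pd.DataFrame(
    {
        "chrom": ["chr1", "chr1"],
        "pos": [100, 200],
        "ref": ["A", "C"],
        "alt": ["G", "T"],
        "gene_id": ["g1", "g2"],
        "id": [1, 2],
    }
)

keys = ["chrom", "pos", "ref", "alt", "gene_id"]
tissues_to_exclude = ["Brain"]

# Drop excluded tissues, then collapse the remaining tissues to one score per variant and gene.
kept = absplice[~absplice["tissue"].isin(tissues_to_exclude)]
aggregated = kept.groupby(keys, as_index=False)["AbSplice_DNA"].agg("max")
merged = aggregated.merge(annotations, on=keys, how="left")
```

Because "Brain" is excluded, the first variant keeps the maximum over Liver and Lung (0.4) rather than 0.9.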

deeprvat.annotations.annotations.aggregate_abscores(current_annotation_file: str, abs_splice_res_dir: str, absplice_score_file: str, njobs: int)

Aggregate AbSplice scores from AbSplice results and save the results.

Parameters: - current_annotation_file (str): Path to the current annotation file in parquet format. - abs_splice_res_dir (str): Path to the directory containing AbSplice results. - absplice_score_file (str): Path to save the aggregated AbSplice scores in parquet format. - njobs (int): Number of parallel jobs for processing AbSplice results.

Returns: None

Notes: - The function reads the current annotation file and extracts necessary information for merging. - It then processes AbSplice results in parallel chunks, aggregating AbSplice scores. - The aggregated scores are saved to the specified file.

Example: $ python annotations.py aggregate_abscores annotations.parquet abs_splice_results/ absplice_scores.parquet 4

deeprvat.annotations.annotations.logger = 'getLogger(...)'
deeprvat.annotations.annotations.deepripe_score_variant_onlyseq_all(model_group, variant_bed, genomefasta, seq_len=200, batch_size=1024, n_jobs=32)

Compute variant scores using a deep learning model for each specified variant.

Parameters:
  • model_group (dict): A dictionary containing deep learning models for different choices. Each entry should be a key-value pair, where the key is the choice name and the value is a tuple containing the model and additional information.

  • variant_bed (list): A list of variant bedlines, where each bedline represents a variant.

  • genomefasta (str): Path to the reference genome in FASTA format.

  • seq_len (int, optional): The length of the sequence to use around each variant. Default is 200.

  • batch_size (int, optional): Batch size for parallelization. Default is 1024.

  • n_jobs (int, optional): Number of parallel jobs for processing variant bedlines. Default is 32.

Returns:
dict: A dictionary containing variant scores for each choice in the model_group.

Each entry has the choice name as the key and the corresponding scores as the value.

deeprvat.annotations.annotations.calculate_scores_max(scores)
deeprvat.annotations.annotations.merge_abscores(current_annotation_file: str, absplice_score_file: str, out_file: str)

Merge AbSplice scores with the current annotation file and save the result.

Parameters: - current_annotation_file (str): Path to the current annotation file in parquet format. - absplice_score_file (str): Path to the AbSplice scores file in parquet format. - out_file (str): Path to save the merged annotation file with AbSplice scores.

Returns: None

Notes: - The function reads AbSplice scores and the current annotation file. - It merges the AbSplice scores with the current annotation file based on chromosome, position, reference allele, alternative allele, and gene ID. - The merged file is saved with AbSplice scores.

Example: $ python annotations.py merge_abscores current_annotation.parquet absplice_scores.parquet merged_annotations.parquet

deeprvat.annotations.annotations.merge_deepripe(annotation_file: str, deepripe_file: str, out_file: str, column_prefix: str)

Merge deepRiPe scores with an annotation file and save the result.

Parameters: - annotation_file (str): Path to the annotation file in parquet format. - deepripe_file (str): Path to the deepRiPe scores file in CSV format. - out_file (str): Path to save the merged file with deepRiPe scores. - column_prefix (str): Prefix to add to the deepRiPe score columns in the merged file.

Returns: None

Notes: - The function reads the annotation file and deepRiPe scores file. - It renames the columns in the deepRiPe scores file with the specified prefix. - The two dataframes are merged based on chromosome, position, reference allele, alternative allele, and variant ID. - The merged file is saved with deepRiPe scores.

Example: $ python annotations.py merge_deepripe annotations.parquet deepripe_scores.csv merged_deepripe.parquet deepripe
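The prefix-and-merge steps from the notes can be sketched as follows. The key column names, the score column `QKI`, and joining the prefix with an underscore are assumptions made for this illustration.

```python
import pandas as pd

key_cols = ["chrom", "pos", "ref", "alt", "id"]

annotations = pd.DataFrame(
    {
        "chrom": ["chr1", "chr1"],
        "pos": [100, 200],
        "ref": ["A", "C"],
        "alt": ["G", "T"],
        "id": [1, 2],
        "Consequence": ["missense", "synonymous"],
    }
)
deepripe = pd.DataFrame(
    {
        "chrom": ["chr1", "chr1"],
        "pos": [100, 200],
        "ref": ["A", "C"],
        "alt": ["G", "T"],
        "id": [1, 2],
        "QKI": [0.3, -0.1],  # hypothetical RBP score column
    }
)

column_prefix = "deepripe"
# Prefix every score column, leaving the merge keys untouched.
renamed = deepripe.rename(
    columns={c: f"{column_prefix}_{c}" for c in deepripe.columns if c not in key_cols}
)
merged = annotations.merge(renamed, on=key_cols, how="left")
```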

deeprvat.annotations.annotations.merge_deepsea_pcas(annotation_file: str, deepripe_pca_file: str, column_yaml_file: str, out_file: str)

Merge deepRiPe PCA scores with an annotation file and save the result.

Parameters: - annotation_file (str): Path to the annotation file in parquet format. - deepripe_pca_file (str): Path to the deepRiPe PCA scores file in parquet format. - column_yaml_file(str): Path to the yaml file containing all needed columns for the model, including their filling values. - out_file (str): Path to save the merged file with deepRiPe PCA scores.

Returns: None

Notes: - The function reads the annotation file and deepRiPe PCA scores file. - It drops duplicates in both files based on chromosome, position, reference allele, alternative allele, variant ID, and gene ID. - The two dataframes are merged based on chromosome, position, reference allele, alternative allele, and variant ID. - The merged file is saved with deepRiPe PCA scores.

Example: $ python annotations.py merge_deepsea_pcas annotations.parquet deepripe_pca_scores.parquet merged_deepsea_pcas.parquet

deeprvat.annotations.annotations.process_chunk_addids(chunk: pandas.DataFrame, variants: pandas.DataFrame) -> pandas.DataFrame

Process a chunk of data by adding identifiers from a variants dataframe.

Parameters: - chunk (pd.DataFrame): Chunk of data containing variant information. - variants (pd.DataFrame): Dataframe containing variant identifiers.

Returns: pd.DataFrame: Processed chunk with added variant identifiers.

Raises: AssertionError: If the shape of the processed chunk does not match expectations.

Notes: - The function renames columns for compatibility. - Drops duplicates in the chunk based on the key columns. - Merges the chunk with the variants dataframe based on the key columns. - Performs assertions to ensure the shape of the processed chunk meets expectations.

Example:

```python
chunk = pd.read_csv("chunk_data.csv")
variants = pd.read_csv("variants_data.csv")
processed_chunk = process_chunk_addids(chunk, variants)
```
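The deduplicate, merge, and shape-check pattern from the notes can be sketched on in-memory data. The key columns and the exact assertion are assumptions for illustration; the real function also renames columns first, which is omitted here.

```python
import pandas as pd

key_cols = ["chrom", "pos", "ref", "alt"]

chunk = pd.DataFrame(
    {
        "chrom": ["chr1", "chr1", "chr1"],
        "pos": [100, 100, 200],
        "ref": ["A", "A", "C"],
        "alt": ["G", "G", "T"],
        "score": [0.5, 0.5, 0.1],
    }
)
variants = pd.DataFrame(
    {
        "chrom": ["chr1", "chr1"],
        "pos": [100, 200],
        "ref": ["A", "C"],
        "alt": ["G", "T"],
        "id": [11, 12],
    }
)

deduplicated = chunk.drop_duplicates(subset=key_cols)
processed = deduplicated.merge(variants, on=key_cols, how="left")

# The merge must neither drop nor duplicate rows, and it adds exactly one column (id).
assert processed.shape == (deduplicated.shape[0], deduplicated.shape[1] + 1)
```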

deeprvat.annotations.annotations.add_ids(annotation_file: str, variant_file: str, njobs: int, out_file: str)

Add identifiers from a variant file to an annotation file and save the result.

Parameters: - annotation_file (str): Path to the input annotation file in CSV format. - variant_file (str): Path to the input variant file in TSV format. - njobs (int): Number of parallel jobs to process the data. - out_file (str): Path to save the processed data in Parquet format.

Returns: None

Notes: - The function reads the annotation file in chunks and the entire variant file. - It uses parallel processing to apply the ‘process_chunk_addids’ function to each chunk. - The result is saved in Parquet format.

Example: $ python annotations.py add_ids annotation_data.csv variant_data.tsv 4 processed_data.parquet

deeprvat.annotations.annotations.add_ids_dask(annotation_file: str, variant_file: str, out_file: str)

Add identifiers from a variant file to an annotation file using Dask and save the result.

Parameters: - annotation_file (str): Path to the input annotation file in Parquet format. - variant_file (str): Path to the input variant file in Parquet format. - out_file (str): Path to save the processed data in Parquet format.

Returns: None

Notes: - The function uses Dask to read annotation and variant files with large block sizes. - It renames columns for compatibility and drops duplicates based on key columns. - Merges the Dask dataframes using the ‘merge’ function. - The result is saved in Parquet format with compression.

Example: $ python annotations.py add_ids_dask annotation_data.parquet variant_data.parquet processed_data.parquet

deeprvat.annotations.annotations.chunks(lst, n)

Split a list into chunks of size ‘n’.

Parameters: - lst (list): The input list to be split into chunks. - n (int): The size of each chunk.

Yields: list: A chunk of the input list.
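A generator with this behaviour can be sketched in a few lines. The name `chunks_sketch` is hypothetical, and the module's own implementation may differ in detail.

```python
def chunks_sketch(lst, n):
    """Yield successive slices of length n; the last one may be shorter."""
    for start in range(0, len(lst), n):
        yield lst[start:start + n]


pieces = list(chunks_sketch([1, 2, 3, 4, 5], 2))
```

Splitting five items into chunks of two leaves a final chunk with a single element.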

deeprvat.annotations.annotations.read_deepripe_file(f: str)

Read a DeepRipe file from the specified path.

Parameters: - f (str): Path to the DeepRipe file.

Returns: pd.DataFrame: DataFrame containing the data from the DeepRipe file.

Example:

```python
file_path = "path/to/deepripe_file.txt"
deepripe_data = read_deepripe_file(file_path)
```

deeprvat.annotations.annotations.concatenate_deepsea(deepsea_files: str, out_file: str, njobs: int)

Concatenate DeepSEA files based on the provided patterns and chromosome blocks.

Parameters: - deepsea_files (str): Comma-separated list of DeepSEA files to concatenate. - out_file (str): Path to save the concatenated output file in Parquet format. - njobs (int): Number of parallel jobs for processing.

Returns: None

Example: $ python annotations.py concatenate_deepSEA chr1_block0.CLI.deepseapredict.diff.tsv,chr1_block1.CLI.deepseapredict.diff.tsv,chr1_block2.CLI.deepseapredict.diff.tsv concatenated_output.parquet 4

deeprvat.annotations.annotations.merge_annotations(vep_header_line: int, vep_file: str, deepripe_parclip_file: str, deepripe_hg2_file: str, deepripe_k5_file: str, variant_file: str, vcf_file: str, out_file: str, column_yaml: str)

Merge VEP, DeepRipe (parclip, hg2, k5), and variant files into one DataFrame and save the result as a parquet file.

Parameters: - vep_header_line (int): Line number of the header line in the VEP output file. - vep_file (str): Path to the VEP file. - deepripe_parclip_file (str): Path to the DeepRipe parclip file. - deepripe_hg2_file (str): Path to the DeepRipe hg2 file. - deepripe_k5_file (str): Path to the DeepRipe k5 file. - variant_file (str): Path to the variant file. - vcf_file (str): VCF file containing chrom, pos, ref and alt information. - out_file (str): Path to save the merged output file in Parquet format. - column_yaml (str): Path to the column YAML file.

Returns: None

Example: $ python annotations.py merge_annotations 1 vep_file.tsv deepripe_parclip.csv deepripe_hg2.csv deepripe_k5.csv variant_file.tsv merged_output.parquet --vepcols_to_retain="AlphaMissense,PolyPhen"

deeprvat.annotations.annotations.process_deepripe(deepripe_df: pandas.DataFrame, column_prefix: str) -> pandas.DataFrame

Process the DeepRipe DataFrame, rename columns and drop duplicates.

Parameters: - deepripe_df (pd.DataFrame): DataFrame containing DeepRipe data. - column_prefix (str): Prefix to be added to column names.

Returns: pd.DataFrame: Processed DeepRipe DataFrame.

Example: deepripe_df = process_deepripe(deepripe_df, "parclip")

deeprvat.annotations.annotations.process_vep(vep_file: pandas.DataFrame, vcf_file: str, types_mapping: dict) -> pandas.DataFrame

Process the VEP DataFrame, extracting relevant columns and handling data types.

Parameters: - vep_file (pd.DataFrame): DataFrame containing VEP data. - vcf_file (str): VCF file containing chrom, pos, ref and alt information. - types_mapping (dict): Columns to retain as keys and their corresponding types as values.

Returns: pd.DataFrame: Processed VEP DataFrame.

Example: vep_file = process_vep(vep_file, vcf_file, types_mapping)

deeprvat.annotations.annotations.compute_plof(anno_df_in, anno_df_out)

Computes and adds a plof column based on the plof function.

Parameters: - anno_df_in (str): File path of the annotation file to read in. - anno_df_out (str): File path of the output file.

Returns: None

Example: deeprvat_annotations compute_plof annotations.parquet annotations_plof.parquet

deeprvat.annotations.annotations.concat_annotations(filenames: str, out_file: str)

Concatenate multiple annotation files based on the specified pattern and create a single output file.

Parameters: - filenames (str): File paths for annotation files to concatenate - out_file (str): Output file path.

Returns: None

Example: concat_annotations "annotations/chr1_block0_merged.parquet,annotations/chr1_block1_merged.parquet,annotations/chr1_block2_merged.parquet" "output.parquet"

deeprvat.annotations.annotations.get_af_from_gt(genotype_file: str, variants_filepath: str, out_file: str)

Compute allele frequencies from genotype data.

Parameters: - genotype_file (str): Path to the genotype file. - variants_filepath (str): Path to the variants file. - out_file (str): Output file path for storing allele frequencies.

deeprvat.annotations.annotations.merge_af(annotations_path: str, af_df_path: str, out_file: str)

Merge allele frequency data into annotations and save to a file.

Parameters: - annotations_path (str): Path to the annotations file. - af_df_path (str): Path to the allele frequency DataFrame file. - out_file (str): Path to the output file to save merged data.

deeprvat.annotations.annotations.calculate_maf(annotations_path: str, out_file: str, af_column_name: str)

Calculate minor allele frequency (MAF) from allele frequency data in annotations.

Parameters: - annotations_path (str): Path to the annotations file containing allele frequency data. - out_file (str): Path to the output file to save the calculated MAF data. - af_column_name (str): Name of the allele frequency column to calculate MAF from.
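The minor allele frequency is the frequency of the rarer of the two alleles, that is, the smaller of AF and 1 - AF. A minimal pandas sketch, assuming hypothetical column names `af` and `maf` (the names used by the module are not stated here):

```python
import numpy as np
import pandas as pd

annotations = pd.DataFrame({"af": [0.01, 0.5, 0.97]})

# The minor allele is whichever allele is rarer, so MAF never exceeds 0.5.
annotations["maf"] = np.minimum(annotations["af"], 1 - annotations["af"])
```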

deeprvat.annotations.annotations.add_gene_ids(gene_id_file: str, annotations_path: str, out_file: str)

Add gene IDs to the annotations based on gene ID mapping file.

Parameters: - gene_id_file (str): Path to the gene ID mapping file. - annotations_path (str): Path to the annotations file. - out_file (str): Path to the output file to save the annotations with gene IDs.

deeprvat.annotations.annotations.create_gene_id_file(gtf_filepath: str, out_file: str)

Create a protein ID mapping file from the GTF file.

Parameters: - gtf_filepath (str): Path to the GTF file. - out_file (str): Path to save the output protein ID mapping file.

deeprvat.annotations.annotations.select_rename_fill_annotations(annotation_columns_yaml_file: str, annotations_path: str, out_file: str, keep_unfilled: str)

Select, rename, and fill missing values in annotation columns based on a YAML configuration file.

Parameters: - annotation_columns_yaml_file (str): Path to the YAML file containing name and fill value mappings. - annotations_path (str): Path to the annotations file. - out_file (str): Path to save the modified annotations file. - keep_unfilled (str, optional): Path to save annotations data frame containing NA values before filling them
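The select, rename, and fill sequence can be sketched with pandas. The configuration is shown as an in-memory dict with a hypothetical layout (source column mapped to a new name and a fill value); the actual YAML structure, and the column names used here, are assumptions.

```python
import pandas as pd

# Hypothetical mapping: source column -> (new name, fill value); the real YAML layout may differ.
column_config = {
    "CADD_raw": ("cadd", 0.0),
    "sift_score": ("sift", 1.0),
}

annotations = pd.DataFrame(
    {
        "id": [1, 2],
        "CADD_raw": [2.5, None],
        "sift_score": [None, 0.2],
        "unused": ["x", "y"],
    }
)

# Keep the identifier plus the configured columns, then rename them.
selected = annotations[["id", *column_config]].rename(
    columns={old: new for old, (new, _) in column_config.items()}
)
# Fill missing values per column with the configured fill value.
filled = selected.fillna({new: fill for new, fill in column_config.values()})
```

Keeping `selected` around before filling corresponds to what `keep_unfilled` is described as saving: the data frame that still contains NA values.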

deeprvat.annotations.annotations.aggregate_and_concat_absplice(absplice_dir: str, ab_splice_agg_score_file: str) -> None

Aggregates AbSplice scores from multiple files in a directory and saves them to a single Parquet file.

This function performs the following steps:
1. Iterates through all files in the specified directory.
2. Reads each file and aggregates the AbSplice scores using the specified aggregation function.
3. Saves the final aggregated scores to a specified Parquet file.

Parameters: - absplice_dir (str): Path to the directory containing AbSplice outputs. - ab_splice_agg_score_file (str): Path to save the aggregated AbSplice score file.

Returns: None

deeprvat.annotations.annotations.merge_absplice_scores(annotations, variants, genes, ab_splice_agg_score_file, output, verbose, mem_limit, fill_nan)

Processes and merges AbSplice scores into the annotation parquet file.

This function performs the following steps:
1. Loads and transforms genes.parquet by renaming columns and splitting the gene field.
2. Loads annotations.parquet, excluding specified columns if they exist.
3. Merges the modified genes table with the annotations table on the ‘gene_id’ column.
4. Loads variants.parquet and merges it with the previous result on the ‘id’ column.
5. Loads splice.parquet and merges it with the previous result on columns ‘chrom’, ‘pos’, ‘ref’, ‘alt’, and ‘gene_id’.
6. Saves the final merged table to a specified output parquet file.

Parameters: - annotations (str): Path to the annotations.parquet file. - variants (str): Path to the variants.parquet file. - genes (str): Path to the genes.parquet file. - ab_splice_agg_score_file (str): Path to the splice.parquet file. - output (str): Path to save the output splice_anno.parquet file. - verbose (bool): If True, prints detailed logging information at each step.

Returns: None