Download

About MutationBase

MutationBase is an open-access database of pathogenic mutations associated with monogenic inherited diseases. The dataset is compiled by using Large Language Models (LLMs) to extract information from biomedical literature, including over 40 million PubMed abstracts and 7 million PMC full-text articles. MutationBase aims to provide a comprehensive, structured resource for clinical genomics and mutation research.

Data Overview

This directory contains pathogenic mutation records retrieved from MutationBase. The data is partitioned into two formats based on whether the variants could be standardized (normalized) against a reference genome.

1. VCF Files (.vcf)

Scope: Contains pathogenic mutations that have been successfully normalized to specific genomic coordinates.

Each VCF file follows the standard VCF specification. Biological annotations are stored in the INFO column using the following fields:

Field          Description
GENE_ID        NCBI Entrez Gene ID(s).
GENE_SYMBOL    Normalized HGNC gene symbol(s).
HGVS_ALL       HGVS nomenclature for the variant (e.g., c.123G>A).
MUTATION_TYPE  Mutation type(s) (e.g., SNV, Insertion, Deletion).
DISEASE        Disease(s) linked to this mutation.
STRAND         Genomic strand orientation (+/-).
PMID           PubMed IDs providing new evidence for this mutation.
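
As a minimal sketch of how the INFO fields listed above might be read, the Python function below parses one VCF data line into a dictionary. It assumes the usual VCF layout (tab-separated columns, with INFO as the eighth column holding semicolon-separated KEY=VALUE pairs). How multi-valued fields such as PMID are delimited is not documented here, so values are kept as raw strings. The function name parse_vcf_line is hypothetical and not part of any MutationBase tooling.

```python
from typing import Optional


def parse_vcf_line(line: str) -> Optional[dict]:
    """Parse one VCF line; return None for header or blank lines."""
    line = line.rstrip("\r\n")
    if not line or line.startswith("#"):
        return None

    cols = line.split("\t")
    if len(cols) < 8:
        raise ValueError("expected at least 8 tab-separated VCF columns")

    chrom, pos, variant_id, ref, alt, _qual, _filter, info = cols[:8]

    # INFO is assumed to be KEY=VALUE pairs separated by ';'.
    # Keys without '=' are treated as flags, as the VCF format allows.
    info_fields = {}
    for item in info.split(";"):
        if not item or item == ".":
            continue
        key, sep, value = item.partition("=")
        info_fields[key] = value if sep else True

    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": variant_id,
        "ref": ref,
        "alt": alt,
        "info": info_fields,  # e.g. info_fields.get("GENE_SYMBOL")
    }
```

Because values are left unsplit, a caller who knows the delimiter actually used in the files can split fields such as GENE_ID or PMID afterwards without this function guessing at it.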

2. JSONL Files (.jsonl)

Scope: Contains pathogenic mutations that could not be fully normalized to genomic coordinates (e.g., because the description lacks a specific position or uses non-standard nomenclature).

Each line is a JSON object with the following fields, consistent with the dataset's core schema:

Field          Description
Mutation_ID    Unique identifier in the MutationBase dataset.
Gene_ID        NCBI Entrez Gene ID(s). Set to ["None"] if the gene cannot be normalized.
Gene_Symbol    Normalized HGNC gene symbol(s). Set to ["None"] if the gene cannot be normalized.
Gene_Desc      Original gene description as mentioned in the source literature.
Mutation       Original mutation description as mentioned in the source literature.
Mutation_Type  Mutation type(s) (e.g., SNV, Insertion, Deletion).
Disease        Disease(s) linked to this mutation.
PubMedID       PubMed ID(s) providing new evidence for this mutation.
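
The JSONL layout described above can be read one line at a time. The sketch below is one possible approach rather than an official loader: it parses each non-empty line as JSON and replaces the documented ["None"] placeholder in Gene_ID and Gene_Symbol with an empty list, so that downstream code can test for a missing gene with a simple truthiness check. The names iter_records and load_jsonl are hypothetical, and the file path is supplied by the caller.

```python
import json
from typing import Iterable, Iterator, List

# Fields documented as using ["None"] when the gene cannot be normalized.
SENTINEL_FIELDS = ("Gene_ID", "Gene_Symbol")


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        for field in SENTINEL_FIELDS:
            # Convert the ["None"] placeholder into an empty list.
            if record.get(field) == ["None"]:
                record[field] = []
        yield record


def load_jsonl(path: str) -> List[dict]:
    """Read a whole JSONL file into memory (path is chosen by the caller)."""
    with open(path, encoding="utf-8") as handle:
        return list(iter_records(handle))
```

Iterating with iter_records keeps memory use flat for large files, while load_jsonl is a convenience for files small enough to hold in memory.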

Export All Data

Download the complete MutationBase dataset, including both VCF and JSONL files along with the database documentation.

Contact & Support

If you encounter any issues with the data format or have technical questions, please contact our team via email: fangli9@mail.sysu.edu.cn