About MutationBase
MutationBase is an open-access database of pathogenic mutations associated with monogenic inherited diseases. The dataset is compiled by using Large Language Models (LLMs) to extract information from a large corpus of biomedical literature, including over 40 million PubMed abstracts and 7 million PMC full-text articles. MutationBase aims to provide a comprehensive, structured resource for clinical genomics and mutation research.
Data Overview
This directory contains pathogenic mutation records retrieved from MutationBase. The data is partitioned into two formats based on whether the variants could be standardized (normalized) against a reference genome.
1. VCF Files (.vcf)
Scope: Contains pathogenic mutations that have been successfully normalized to specific genomic coordinates.
Each VCF file follows the standard VCF specification. Biological annotations are stored in the INFO column using the following fields:
| Field | Description |
|---|---|
| GENE_ID | NCBI Entrez Gene ID(s). |
| GENE_SYMBOL | Normalized HGNC gene symbol(s). |
| HGVS_ALL | HGVS nomenclature for the variant (e.g., c.123G>A). |
| MUTATION_TYPE | Mutation type(s) (e.g., SNV, Insertion, Deletion). |
| DISEASE | Disease(s) linked to this mutation. |
| STRAND | Genomic strand orientation (+/-). |
| PMID | PubMed ID(s) providing new evidence for this mutation. |
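As a minimal sketch of how the INFO fields listed above might be read, the following Python snippet splits each data line on tabs and parses the INFO column into a dictionary using only the standard library. The helper names (`parse_info`, `read_vcf_records`) and the sample record are hypothetical and invented for illustration; they are not taken from the dataset. How multiple values are separated within a single field is not documented here, so values are kept as raw strings.

```python
def parse_info(info):
    """Split a VCF INFO string (key=value pairs joined by ';') into a dict."""
    fields = {}
    for entry in info.split(";"):
        if not entry:
            continue
        key, sep, value = entry.partition("=")
        # Flag-style entries carry no "=", so record them as True.
        fields[key] = value if sep else True
    return fields


def read_vcf_records(lines):
    """Yield (chrom, pos, ref, alt, info_dict) for every data line."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip meta-information and column-header lines
        cols = line.split("\t")
        chrom, pos, _id, ref, alt, _qual, _filter, info = cols[:8]
        yield chrom, int(pos), ref, alt, parse_info(info)


# Hypothetical record: every value below is a placeholder, not real data.
sample_info = ";".join([
    "GENE_ID=0000",
    "GENE_SYMBOL=GENE1",
    "HGVS_ALL=c.123G>A",
    "MUTATION_TYPE=SNV",
    "DISEASE=Example_disease",
    "STRAND=+",
    "PMID=00000000",
])
sample_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "\t".join(["1", "12345", ".", "G", "A", ".", ".", sample_info]),
]

vcf_records = list(read_vcf_records(sample_vcf))
for chrom, pos, ref, alt, info in vcf_records:
    print(chrom, pos, ref, alt, info["GENE_SYMBOL"], info["HGVS_ALL"])
```

To read a real file, pass an open file handle instead of the sample list, for example `read_vcf_records(open(path))`, where `path` points to one of the `.vcf` files in this directory.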
2. JSONL Files (.jsonl)
Scope: Contains pathogenic mutations that could not be fully normalized to genomic coordinates (e.g., because the source lacks a specific position or uses non-standard nomenclature).
Each line is a JSON object with the following fields, consistent with the dataset's core schema:
| Field | Description |
|---|---|
| Mutation_ID | Unique identifier in the MutationBase dataset. |
| Gene_ID | NCBI Entrez Gene ID(s). Set to ["None"] if the gene cannot be normalized. |
| Gene_Symbol | Normalized HGNC gene symbol(s). Set to ["None"] if the gene cannot be normalized. |
| Gene_Desc | Original gene description as mentioned in the source literature. |
| Mutation | Original mutation description as mentioned in the source literature. |
| Mutation_Type | Mutation type(s) (e.g., SNV, Insertion, Deletion). |
| Disease | Disease(s) linked to this mutation. |
| PubMedID | PubMed ID(s) providing new evidence for this mutation. |
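The fields above can be read line by line with Python's built-in `json` module. The sketch below is illustrative only: the helper names (`read_jsonl`, `has_normalized_gene`) and the sample records are hypothetical, and the value types shown for fields other than `Gene_ID` and `Gene_Symbol` (for example, lists for `Disease` and `PubMedID`) are assumptions rather than documented behavior.

```python
import json


def read_jsonl(lines):
    """Yield one dict per non-empty line of a JSONL source."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def has_normalized_gene(record):
    """Return False when Gene_ID holds the documented ["None"] placeholder."""
    return record.get("Gene_ID") != ["None"]


# Hypothetical records: all values are placeholders, not real data.
sample_jsonl = [
    json.dumps({
        "Mutation_ID": "EXAMPLE_1",
        "Gene_ID": ["0000"],
        "Gene_Symbol": ["GENE1"],
        "Gene_Desc": "example gene",
        "Mutation": "deletion of exon 3",
        "Mutation_Type": ["Deletion"],   # assumed to be a list
        "Disease": ["Example disease"],  # assumed to be a list
        "PubMedID": ["00000000"],        # assumed to be a list
    }),
    json.dumps({
        "Mutation_ID": "EXAMPLE_2",
        "Gene_ID": ["None"],
        "Gene_Symbol": ["None"],
        "Gene_Desc": "unnamed locus",
        "Mutation": "splice-site change",
        "Mutation_Type": ["SNV"],
        "Disease": ["Example disease"],
        "PubMedID": ["00000001"],
    }),
]

jsonl_records = list(read_jsonl(sample_jsonl))
normalized = [r for r in jsonl_records if has_normalized_gene(r)]
print(len(jsonl_records), "records,", len(normalized), "with a normalized gene")
```

Filtering on the `["None"]` placeholder first makes it easy to separate records that can be joined to gene-level resources from those that only carry the original free-text description.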
Export All Data
Download the complete MutationBase dataset, including both VCF and JSONL files along with the database documentation.
Contact & Support
If you encounter any issues with the data format or have technical questions, please contact our team via email: fangli9@mail.sysu.edu.cn