README
This directory contains reports from the ClinVar dataset and documents about
ClinVar development. Sections in this README are divided by type of content.
This directory has a folder for documents related to the collaboration with ClinGen
(http://www.clinicalgenome.org/). For more details, see
http://www.ncbi.nlm.nih.gov/clinvar/docs/review_guidelines.
This README file also documents ClinVar-related data in other directories, such as:
  ftp://ftp.ncbi.nlm.nih.gov/pub/GTR/standard_terms, for terminology used by
  both GTR and ClinVar
  ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/mim2gene_medgen, for gene-disease
  relationships related to OMIM
http://www.ncbi.nlm.nih.gov/clinvar/
================================================================================
SUBMISSIONS
================================================================================
Details about how to submit are provided at this site:
http://www.ncbi.nlm.nih.gov/clinvar/docs/submit/
--------------------------------------------------
clinvar_submission.xsd
--------------------------------------------------
This is the .xsd file for validating records submitted to ClinVar as XML.
If you have any questions about submitting data to ClinVar as an XML file, please
contact us via [email protected]
------------------------------------------------------------
history of the versions of clinvar_submission.xsd
subdirectory xsd_submission
------------------------------------------------------------
This subdirectory archives different versions of the submission xsd. The current
one is also accessible as clinvar_submission.xsd
--------------------------------------------------
submission_templates
subdirectory ftp://ftp.ncbi.nih.gov/pub/clinvar/submission_templates
--------------------------------------------------
This subdirectory contains Excel spreadsheets that can be used to
submit data to ClinVar:
SubmissionTemplate.xlsx
SubmissionTemplateLite.xlsx
================================================================================
EXTRACTS OF CLINVAR DATA
================================================================================
ClinVar data are provided for download as extracts in xml, vcf and tab-delimited
formats in the directories described below.
-------------------------------------------------
Updates
-------------------------------------------------
Data on the ftp site are updated monthly or weekly, depending on the file.
Please note, for each file, when the data are refreshed.
-------------------------------------------------
clinvar_public.xsd
-------------------------------------------------
The schema for the export version of the XML.
clinvar_public.xsd is a link to the current version, such as
/xsd_public/clinvar_public_1.5.xsd
------------------------------------------------------------
history of the versions of clinvar_public.xsd
subdirectory xsd_public
------------------------------------------------------------
This subdirectory archives different versions of the xsd used to validate ClinVar's
comprehensive export as XML. The current one is also accessible as
clinvar_public.xsd
----------------------------------------
release_notes
----------------------------------------
The release_notes subdirectory contains reports of the differences between
versions of clinvar_public.xsd
================================================================================
NAMES OF PHENOTYPES
================================================================================
--------------------------------------------------
disease_names
--------------------------------------------------
This document is updated daily, and is provided to report the names and
identifiers used in GTR and ClinVar. Please note there may be more than one
line per condition, when a name is used by more than one source. This
differs from the gene_condition_source_id file because it is comprehensive,
and does not require knowledge of any gene-to-disease relationship.
Tab-delimited file with the following 7 fields:
--------------------------------------------------
gene_condition_source_id
--------------------------------------------------
--------------------------------------------------
ConceptID_history.txt
--------------------------------------------------
--Added to the directory October 24, 2012
--------------------------------------------------
dbGaP_frequency_study_list
--------------------------------------------------
--Added to the directory September 29, 2015
Text and html files reporting the studies in dbGaP that were assessed for single
nucleotide variants reported in ClinVar.
dbGaP_frequency_study_list.html
dbGaP_frequency_study_list.txt
================================================================================
SUBDIRECTORIES
================================================================================
community              files generated in the initial design of ClinVar
presentations          slides or other documents about ClinVar
submission_templates   templates for submission by spreadsheet
tab_delimited          flattened tabular data summaries of several types
-----------------------VCF------------------------
See the README specific to ClinVar's VCF files:
ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh37/README_VCF.txt
ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/README_VCF.txt
--------------------------------------------------
xml An extraction of data in ClinVar as xml
The xsd for the export version of the XML is clinvar_public.xsd
For more details about the files in the xml directory, please refer to
ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/xml/_README
==================================================
tab_delimited sub-directory
Most of the files in this directory are generated weekly, usually on Mondays.
Not all of the files in this directory are archived, so please note for each file
whether monthly versions are copied to the archive subdirectory the first of each
month.
--------------------------------------------------------------------------------
1. gene_specific_summary
--------------------------------------------------------------------------------
Generated weekly
Archived monthly (first Thursday of each month)
A tab-delimited report, for each gene, of the number of submissions and the number
of different variants (alleles).
Because some variant-gene relationships are submitted, and some are calculated from
overlapping annotation, in January of 2015, the report was modified to indicate
when the gene-variant relationship was submitted.
Symbol                  Gene symbol (if officially named, from HGNC, else from
                        NCBI's Gene database)
GeneID                  Unique identifier from NCBI's Gene database
Total_submissions       Total submissions to ClinVar with variants in/overlapping
                        this gene
Total_alleles           Number of alleles submitted to ClinVar for this gene
Submissions_reporting_this_gene
                        Subset of the total submissions that also reported the gene
Alleles_reported_Pathogenic_Likely_pathogenic
                        Number of variants reported as pathogenic or likely
                        pathogenic. Excludes structural variants that may overlap
                        a gene.
Gene_MIM_Number         The MIM number for this gene
Number_Uncertain        Submissions with an interpretation of 'Uncertain
                        significance'
Number_with_conflicts   Number of VariationIDs for this gene with conflicting
                        interpretations
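
The columns above can be read into one record per gene. The following is a
minimal Python sketch, not part of the ClinVar distribution. It assumes that the
column names are given on a header line starting with '#' and that every data
line has the same number of tab-separated fields; the function name
read_gene_summary and the sample rows are made up for illustration.

```python
import io


def read_gene_summary(handle):
    """Yield one dict per gene, keyed by the column names in the header line."""
    fieldnames = None
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith("#"):
            # Assumption: the last '#' line before the data names the columns.
            fieldnames = line.lstrip("#").split("\t")
            continue
        if not line or fieldnames is None:
            continue
        yield dict(zip(fieldnames, line.split("\t")))


# Made-up rows for illustration only; these are not real ClinVar data.
sample = io.StringIO(
    "#Symbol\tGeneID\tTotal_submissions\tTotal_alleles\t"
    "Submissions_reporting_this_gene\t"
    "Alleles_reported_Pathogenic_Likely_pathogenic\t"
    "Gene_MIM_Number\tNumber_Uncertain\tNumber_with_conflicts\n"
    "GENE_A\t1\t10\t8\t6\t3\t100001\t2\t1\n"
    "GENE_B\t2\t4\t4\t4\t0\t100002\t1\t0\n"
)
rows = list(read_gene_summary(sample))

# Genes with at least one allele reported as pathogenic or likely pathogenic.
with_pathogenic = [
    row["Symbol"]
    for row in rows
    if int(row["Alleles_reported_Pathogenic_Likely_pathogenic"]) > 0
]
```

All values are kept as strings, so convert only the columns you need. A real file
may contain placeholder values that int() rejects, so check before converting.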
--------------------------------------------------------------------------------
2. variant_summary.txt
--------------------------------------------------------------------------------
Generated weekly
Archived monthly (first Thursday of each month)
A tab-delimited report based on each variant at a location on the genome for which
data have been submitted to ClinVar.
The data for the variant are reported for each assembly, so most variants have a
line for GRCh37 (hg19) and another line for GRCh38 (hg38).
Please note: Beginning in October 2016, this file was modified to restrict
reporting to attributes of an AlleleID, not a mixture of AlleleID and VariationID.
The modifications were announced in our September 2016 release notes:
ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/release_notes/20160901_data_release_notes.pdf
ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/hgvs4variation.txt.gz
ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/variation_allele.txt.gz
Please note: Beginning in November 2019, the values for referenceAllele and
alternateAllele are being written according to the VCF
standard. For single nucleotide variants there was no change in the
value.
See also the authoritative file for identifiers assigned to genes
represented by NCBI, namely:
ftp://ftp.ncbi.nih.gov/gene/DATA/GENE_INFO/Mammalia/Homo_sapiens.gene_info.gz
--------------------------------------------------------------------------------
3. cross_references.txt
--------------------------------------------------------------------------------
Generated weekly
Not archived
--------------------------------------------------------------------------------
4. var_citations.txt
--------------------------------------------------------------------------------
Generated weekly
Not archived
--------------------------------------------------------------------------------
5. summary_of_conflicting_interpretations.txt
--------------------------------------------------------------------------------
Generated weekly
Archived monthly (first Thursday of each month)
The header of the file explains the columns, which include the VariationID, the
AlleleID, the type and the HGVS expression. The NCBI GeneID and GeneSymbol are
included for ready filtering of lines in the file by gene. The assembly is provided
for the HGVS expressions based on chromosome sequences; otherwise the assembly is
reported as 'na'.
HINT: Please note that for human, the accession of the RefSeq representing each
chromosome indicates the chromosome being represented. In other words, NC_000001 is
for chromosome 1, NC_000002 is for chromosome 2, ... NC_000023 is for X, and
NC_000024 is for Y.
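
The hint above can be turned into a small lookup. The following is a minimal
Python sketch, assuming that the accession may carry a version suffix (for
example ".11") and may be followed by ":" and the rest of an HGVS expression.
The function name chromosome_from_accession is hypothetical.

```python
import re

# Matches human chromosome RefSeq accessions NC_000001 through NC_000099,
# with an optional ".N" version suffix (the suffix handling is an assumption).
_NC_PATTERN = re.compile(r"^NC_0000(\d{2})(?:\.\d+)?$")


def chromosome_from_accession(text):
    """Return '1'..'22', 'X' or 'Y' for a human NC_ accession, else None.

    The input may be a bare accession or an HGVS expression; anything after
    the first ':' is ignored.
    """
    accession = text.split(":", 1)[0]
    match = _NC_PATTERN.match(accession)
    if match is None:
        return None
    number = int(match.group(1))
    if 1 <= number <= 22:
        return str(number)
    # 23 and 24 represent X and Y; any other number is not covered by the hint.
    return {23: "X", 24: "Y"}.get(number)
```

Accessions that do not follow the NC_0000NN pattern, such as transcript
accessions, return None rather than a guess.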
In the December release, 3 columns were added to support those wishing to identify
which HGVS expressions are used for naming, which HGVS expressions were provided
explicitly by a submitter, and which are based on RefSeqs that are reference
standards on RefSeqGenes.
--------------------------------------------------------------------------------
7. variation_allele.txt
--------------------------------------------------------------------------------
Updated weekly
Not archived
Mapping of ClinVar's VariationID (used to build the URL on the web site) and the
AlleleIDs assigned to each simple variant.
1. VariationID: the identifier assigned by ClinVar and used to build the
URL, namely https://ncbi.nlm.nih.gov/clinvar/VariationID
2. Type: Types of VariationID include Variant (simple variant),
Haplotype, CompoundHeterozygote, Complex, Phase unknown, Distinct chromosomes
3. AlleleID: the integer identifier assigned by ClinVar to each
simple allele
4. Interpreted: _yes_ indicates an interpretation was submitted about
the VariationID specifically,
_no_ indicates that information about the VariationID
was submitted as a component of a different record.
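
Because a VariationID of a type such as Haplotype can list more than one
AlleleID, it is often convenient to group the rows. The following is a minimal
Python sketch, assuming the four columns appear in the order listed above and
that header or comment lines start with '#'. The function name
alleles_by_variation and the sample identifiers are made up for illustration.

```python
import csv
import io
from collections import defaultdict


def alleles_by_variation(handle):
    """Map each VariationID to the list of AlleleIDs reported for it."""
    grouped = defaultdict(list)
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Skip blank lines and header/comment lines (assumed to start with '#').
        if not row or row[0].startswith("#"):
            continue
        variation_id, _type, allele_id, _interpreted = row[:4]
        grouped[int(variation_id)].append(int(allele_id))
    return dict(grouped)


# Made-up rows for illustration only; these are not real ClinVar identifiers.
sample = io.StringIO(
    "#VariationID\tType\tAlleleID\tInterpreted\n"
    "1001\tVariant\t5001\tyes\n"
    "1002\tHaplotype\t5001\tno\n"
    "1002\tHaplotype\t5002\tno\n"
)
result = alleles_by_variation(sample)
```

For the downloaded file, pass an open text handle instead of the in-memory
sample.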
--------------------------------------------------------------------------------
8. submission_summary.txt
--------------------------------------------------------------------------------
Generated weekly
Archived monthly (first Thursday of each month)
--------------------------------------------------------------------------------
9. allele_gene.txt
--------------------------------------------------------------------------------
Updated weekly
Not archived
Reports, per ClinVar AlleleID, the genes that are related to that allele and how
they are related. The values for category are:
asserted, but not computed         Submitted as related to a gene, but not within
                                   the location of that gene on the genome
genes overlapped by variant        The gene and variant overlap
near gene, downstream              Outside the location of the gene on the
                                   genome, within 5 kb
near gene, upstream                Outside the location of the gene on the
                                   genome, within 5 kb
within multiple genes by overlap   The variant is within genes that overlap on
                                   the genome. Includes introns.
within single gene                 The variant is in only one gene. Includes
                                   introns.
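--------------------------------------------------------------------------------
10. organization_summary.txt
--------------------------------------------------------------------------------
Updated weekly
Not archived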
================================================================================
VALIDATING FILE DOWNLOADS
================================================================================
We provide md5 checksum files so that you can confirm that an ftp transfer is
complete. An md5 checksum is a string of letters and numbers that acts as a
fingerprint for a file. After you download a file, use a utility to generate its
md5 hash and compare it to the value in our md5 checksum file; matching values
indicate that your download has the entire file. We currently provide md5 files
for some of the tab_delimited files and for the ClinVarFullRelease XML file. In
the Linux environment, the utility is likely md5sum. Freeware tools are available
for other environments.
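
Where md5sum is not available, the comparison can also be scripted. The
following is a minimal Python sketch using the standard hashlib module. It
assumes that the checksum file begins with the hexadecimal hash, as in md5sum
output; the function names are made up for illustration.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the md5 hash of a file, reading it in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def expected_md5(md5_path):
    """Read the hash from a checksum file.

    Assumption: the hash is the first whitespace-separated token in the file.
    """
    with open(md5_path, "r") as handle:
        return handle.read().split()[0].lower()


def download_is_complete(data_path, md5_path):
    """Return True when the computed hash matches the provided one."""
    return md5_of_file(data_path) == expected_md5(md5_path)
```

Call download_is_complete with the path of the downloaded file and the path of
its md5 file; a False result suggests the file should be downloaded again.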
================================================================================
RELATED SITES
================================================================================
--------------------------------------------------
Gene-disease relationships
--------------------------------------------------
ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/mim2gene_medgen
--------------------------------------------------
standard terms
--------------------------------------------------
ftp://ftp.ncbi.nlm.nih.gov/pub/GTR/standard_terms/
This directory contains terms used by ClinVar and GTR in specified categories.
Cross-references to the term in other databases may also be provided.
================================================================================
ClinGen
================================================================================
ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/ClinGen/
================================================================================
MISCELLANEOUS
================================================================================
--------------------------------------------------
2013.1-hgmd-public.tsv
--------------------------------------------------
This file was removed June 19, 2013 on request from HGMD.
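--------------------------------------------------
vcf
--------------------------------------------------
This path was removed to make explicit the difference between VCF files on GRCh37
and GRCh38. It was replaced with vcf_GRCh37 and vcf_GRCh38.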
======================================================================
Partial revision history
======================================================================
December 3, 2014: Added the file /tab_delimited/summary_of_conflicting_data.txt
January 12, 2015: Documented the changes to gene_specific_summary.txt and
    explained the new directory structure for vcf files.
September 10, 2015: Documented providing md5 files.
April 7, 2016: Discontinued generating summary_of_conflicting_data.txt
The historical reports are maintained in the archive directory
(ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/archive/)
August 4, 2016: Added documentation for several new files in the
    tab-delimited subdirectory
    * hgvs4variation.txt.gz
    * submission_summary.txt
    * variation_allele.txt
    Added explanation of numbering systems for locations in VCF
    vs. all other reports
September 24, 2016: Added documentation for the allele_gene files in the
    tab-delimited directory
November 9, 2016: Corrected documentation for variant_summary that should have
been included in the October update.
December 9, 2016: Added SubmittedGeneSymbol to the description of
submission_summary.txt, and updated the description of hgvs4variation.txt.gz
January 19, 2017: Corrected inconsistent definition of NumberSubmitters in
    variant_summary.txt
November 30, 2017: Modification to summary_of_conflicting_interpretations.txt
February 20, 2019: Added documentation of organization_summary.txt
November 12, 2019: Noted change in reporting referenceAllele and
    alternateAllele in variant_summary.txt; added more information about releases.