A reference sequence - discussions and FAQs
Since references to WWW-sites are not yet acknowledged as citations, please mention den Dunnen JT and Antonarakis SE (2000). Hum.Mutat. 15:7-12 when referring to these pages.
To indicate what type of reference sequence is used for the description of the sequence variant, specific indicators are used;
For details on residue numbering see Standards. Note that recommendations also exist to describe different transcripts/protein isoforms generated from one gene (see Standards).
Discussions on a proper reference sequence have been very lively. In general it can be concluded that all suggestions made have their pros and cons, and that there is no perfect solution.
Theoretically, a genomic reference sequence is the best choice. By simply numbering nucleotides from 1 to the end of the file no problems occur with complex gene structures like multiple transcription start sites (promoters / 5'-first exons), multiple translation initiation sites (ATG-codons), alternative splicing and the use of different 3'-terminal exons and poly-A addition sites.
In practice a coding DNA reference sequence is mostly preferred. The most important reason is that from the description one immediately gets some information regarding the location of the variant; exonic or intronic, 5' of the ATG or 3' of the stop codon and, by dividing the nucleotide number by 3, the number of the amino acid residue that is affected (see Nucleotide numbering).
- for a human, a genomic reference sequence does not contain any useful information (a coding DNA reference sequence does)
- a gene can be very large (over 2.0 Mb) - this makes nucleotide numbering based on a genomic reference sequence rather impractical (e.g. g.1567234_1567235insTG). Furthermore, genomic reference sequences based on GenBank NT_ files become increasingly long (e.g. the CFTR gene in NT_007933.15, >77 Mb) and consequently lose their informativity. Downloading such large files is time consuming, even with good internet connections, and working with these files is rather difficult.
- when a genomic reference sequence is taken from a complete genome sequence, e.g. a bacterium or the human X-chromosome, the transcriptional orientation of the gene of interest may be on the minus (-) strand. This makes the description of sequence variants rather complicated, especially when the consequences on RNA and/or protein level need to be described; nucleotides on DNA and RNA level are complementary and numbering goes in different directions - a confusing situation that should be prevented.
- when different genes (partly) overlap, using the same or the minus (-) DNA strand, which reference sequence should one use to describe the variant and to which gene should the change be assigned ? (see Recommendations).
- when the gene sequence is incomplete (especially when large introns are present) - a genomic sequence can not be used.
- genes may contain very large introns with many intronic (length) variants present in the population - it is thus very difficult to give THE genomic reference sequence (see Genomic sequence changes regularly).
- the exact transcriptional start site (cap-site) of a gene has often not been determined and/or its assignment is debated - the first nucleotide can thus not be assigned with certainty. The same might be true for the translation initiation site (ATG-codon).
- a gene may have several transcripts, using different promoters / 5'-first exons, alternatively spliced internal exons, different 3'-terminal exons and polyA-addition sites - one complete coding DNA reference sequence can thus not be generated (see Alternatively spliced exons - nucleotide numbering),
- the different transcripts may encode different proteins (isoforms) with, when different promoters are used, different N-terminal sequences and even using different reading frames in one or more exons. One complete protein reference sequence can thus not be assigned.
- when different genes (partly) overlap, using the same or the minus (-) DNA strand, which reference sequence should one use to describe the variant and to which gene should the change be assigned ? (see Recommendations).
The recommendation is to use an LRG (Locus Reference Genomic sequence, e.g. LRG_329). Information on LRGs (Dalgleish et al. 2010, MacArthur et al. 2014), and how to get one for your gene of interest, can be found at the LRG website. When no LRG is available, one should be requested. In the meantime, a RefSeqGene record is a good alternative (RefSeq database, format NG_008797.2). When neither is available, request an LRG; a RefSeqGene record will be made in parallel. DO NOT use LRGs that are "pending"; they might change before they are officially released.
Reporting using LRGs as a reference is possible for genomic DNA (e.g. LRG_1:g.8463G>C), coding DNA (e.g. LRG_1t1:c.572G>C), non-coding RNA (e.g. LRG_163t1:n.5C>T) and protein (e.g. LRG_1p1:p.Gly191Ala) variants. To describe coding DNA/non-coding RNA variants the transcript must be indicated (e.g. "t1"), for protein variants the protein isoform (e.g. "p1").
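The structure of the LRG-based descriptions above can be illustrated with a small parser. This is only a sketch: the regular expression and the function name parse_lrg_description are hypothetical, written to cover the four example formats quoted above, and are not an official HGVS or LRG grammar.

```python
import re

# Hypothetical pattern for the LRG-based descriptions quoted above;
# it is NOT an official HGVS or LRG grammar.
LRG_PATTERN = re.compile(
    r"^(?P<lrg>LRG_\d+)"        # locus identifier, e.g. LRG_1
    r"(?P<feature>[tp]\d+)?"    # optional transcript (t1) or protein isoform (p1)
    r":(?P<prefix>[gcnp])\."    # sequence type: g., c., n. or p.
    r"(?P<change>.+)$"          # the variant itself, left unparsed
)

# Feature tag required by each prefix, following the text above.
REQUIRED_FEATURE = {"g": None, "c": "t", "n": "t", "p": "p"}


def parse_lrg_description(description):
    """Split an LRG-based description into its parts, or raise ValueError."""
    match = LRG_PATTERN.match(description)
    if match is None:
        raise ValueError(f"not an LRG-based description: {description!r}")
    parts = match.groupdict()
    required = REQUIRED_FEATURE[parts["prefix"]]
    feature = parts["feature"]
    if required is None and feature is not None:
        raise ValueError("genomic descriptions take no transcript or isoform tag")
    if required is not None and (feature is None or feature[0] != required):
        raise ValueError(
            f"'{parts['prefix']}.' descriptions need a '{required}' tag"
        )
    return parts


if __name__ == "__main__":
    for text in ("LRG_1:g.8463G>C", "LRG_1t1:c.572G>C",
                 "LRG_163t1:n.5C>T", "LRG_1p1:p.Gly191Ala"):
        print(parse_lrg_description(text))
```

A description such as LRG_1:c.572G>C is rejected by this sketch because the transcript is not indicated.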
When a genomic reference sequence is used the following recommendations should be followed;
For complex genes, when on the genomic reference sequence all transcripts are annotated properly, computational tools (like Mutalyzer and the Genomic Mutation Consequence Calculator) can easily predict the consequences of a sequence change on all transcripts and their encoded protein isoforms, incl. when they derive from overlapping genes.
When a coding DNA reference sequence is used the following recommendations should be followed;
NOTE: the numbering suggested can also be applied to amino acid numbering (e.g. p.Ala456+1, Cys456+2, ..., etc. or ..., p.Phe457-2, p.Gln457-1)
Suggestions have been made to extend the recommendations for the nucleotide numbering of coding DNA reference sequences to specifically indicate untranscribed nucleotides (see Discussion).
As discussed, genes can be rather complex and the choice of a good coding DNA reference sequence can be very difficult. Below we will refer to some examples of how experts have tried to resolve the issue.
Do you have other examples? Please let us know (E-mail to: J.T.den_Dunnen @ LUMC.nl).
The HGVS recommendations for the description of sequence variants do not include suggestions for the numbering of exons and introns. The simple reason is that exon/intron numbers are not required for a correct description. When necessary, the exon/intron numbers can be derived from the description at DNA level.
In fact, using exon/intron numbers introduces a lot of undesired confusion: if an exon number conflicts with the description of the variant at DNA level, which of the two should be trusted? In many genes there is no consensus on exon/intron numbering, and originally used numbering schemes had to be revised to include newly discovered exons (internal as well as 5' and/or 3' of the gene). This led to all kinds of numbering schemes without consensus or overall logic, making it very difficult for non-experts in the specific gene to keep track of all details (see Dalgleish et al. 2010). With the increasing use of genome browsers, which simply number exons from start to end as 1, 2, 3, etc., legacy numbering schemes have become even more confusing.
The only logical thing to do is to follow the standard set by the genome browsers and to start numbering with 1 for the first exon. Although this is probably difficult for the experts to accept, we can not keep on confusing newcomers by forever using legacy numbering systems. We should realize that at some point wrong assumptions will be made and a patient will end up with an erroneous diagnosis, which is of course unacceptable.
Describe variants at DNA level and do not include exon or intron numbers as part of the description. Exon and intron numbers may be mentioned, but only when their use is specified and reference sequences for the exons and introns are given. Since history will leave its tracks, when referring to older data, always mention changed numbering schemes in M&M and in Figure and Table legends to prevent any confusion. For tables, even consider adding an additional column indicating the legacy numbering.
Initial recommendations (see e.g. Antonarakis  Hum.Mut. 11: 1-3) suggested two alternative descriptions for variants in intron sequences based on a coding DNA reference sequence; the formats c.88+2T>G / c.89-1G>T and c.IVS2+2T>G / c.IVS2-1G>T. The current recommendation is that the format c.IVS2+2T>G / c.IVS2-1G>T should not be used anymore.
Reason: from the description c.IVS2+2T>G it is difficult to deduce where the intron is positioned relative to the coding DNA sequence. In addition, when one wants to deduce this position, this is often problematic. First, many authors fail to mention the genomic + coding DNA reference sequences that were used as the basis of exon/intron numbering. Second, since on first publication gene sequences are often based on incomplete sequences, the initial exon / intron structure often turns out to be incomplete and numbering changes later (see Numbering exons / introns). Consequently, descriptions using the format c.IVS2+2T>G fail the basic criterion to be unequivocal and should thus not be used. Descriptions using the format c.88+2T>G do not suffer from these problems.
NOTE: when intronic variants are described in relation to a coding DNA reference sequence authors should not forget to mention the genomic reference sequence where the intron sequence can be found.
A basic recommendation is to use the shortest description as much as possible. Therefore, in the middle of an intron nucleotide numbering changes from + to - (e.g. from "c.88+.." to "c.89-.."). In addition, when a change in an intron is described as c.88+4356A>G (instead of c.89-2A>G) it will not be clear that the change might be close to the splice acceptor site, and thus might affect splicing. This is immediately clear when the description c.89-2A>G is used.
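The switch from + to - in the middle of an intron can be sketched as follows. The function name intronic_position and its parameters are hypothetical, and the intron length of 4357 nucleotides in the usage example is an assumption chosen so that c.88+4356 and c.89-2 refer to the same nucleotide. The sketch also assumes that the central nucleotide of an odd-length intron is described with "+", and it does not handle introns 3' of the stop codon ("*" positions).

```python
def intronic_position(last_exon_nt, intron_length, offset):
    """Return the shortest c. position label for an intronic nucleotide.

    last_exon_nt:  coding DNA number of the last nucleotide of the upstream exon
    intron_length: number of nucleotides in the intron (assumed to be known)
    offset:        1-based position within the intron, counted from its 5' end
    """
    if not 1 <= offset <= intron_length:
        raise ValueError("offset lies outside the intron")
    distance_to_acceptor = intron_length - offset + 1
    # First half of the intron (assumption: including the central nucleotide
    # of an odd-length intron) is numbered from the upstream exon with '+'.
    if offset <= distance_to_acceptor:
        return f"c.{last_exon_nt}+{offset}"
    # Second half is numbered from the downstream exon with '-'.
    # There is no nucleotide 0, so -1 is followed by 1.
    next_exon_nt = 1 if last_exon_nt == -1 else last_exon_nt + 1
    return f"c.{next_exon_nt}-{distance_to_acceptor}"


if __name__ == "__main__":
    # Assumed intron of 4357 nt between c.88 and c.89.
    print(intronic_position(88, 4357, 2))      # near the splice donor site
    print(intronic_position(88, 4357, 4356))   # near the splice acceptor site
```

With these assumptions the second call yields c.89-2 rather than c.88+4356, which makes the proximity to the splice acceptor site visible in the description itself.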
When description in relation to a Reference Sequence is problematic, could one specify the change by giving 20 bp of sequence on both sides?
In many cases this would be OK but for recently duplicated genes or genes which contain repeated segments even giving 20 nucleotides to either side will not be sufficient. Furthermore, descriptions will become very long. For problematic cases the best method is probably to include the raw data, i.e. the sequence file itself.
When I retrieve a cDNA sequence from GenBank nucleotide numbering does not start with +1 at the A of the ATG translation initiation codon.
True, but such a file can be simply obtained. When you retrieve the sequence from the RefSeq-database (i.e. start at EntrezGene, enter the gene symbol or gene name, select the gene of interest, click the mRNA entry) it will be annotated extensively (see Example). Clicking the "CDS" annotation (CoDing Sequence) opens a window where the nucleotide numbering will start with 1 at the A of the ATG translation initiation codon (see Example). To assist those studying or reporting sequence variants a locus specific database (LSDB, see HGVS - list of LSDBs) usually provides the coding DNA reference sequence with the nucleotide numbering (see Example).
The recommendation on numbering genomic and coding DNA variants based on the first nucleotide of the initiation codon ATG is workable only if the reference sequence in the database is published as a single file. In the case of the gene CDKN2A, its genomic sequence is stored as multiple files, each containing one exonic sequence and partial intronic sequences on both ends of the exon. I can use the above recommendation easily to number variants in exon 1, where the initiation codon is located. The problem is how I should number variants in exon 2, which is located in another database file.
If no database file is available that contains the complete genomic sequence, a coding DNA Reference Sequence, preferably from the RefSeq database, should be used. Since for many organisms a genome sequence is freely available, a database curator can easily make a fully annotated file (genomic and coding DNA) covering the sequence of interest and submit it to the RefSeq database. This file can then be used as the reference sequence.
Question (Tracy Lester, Oxford, UK)
We are wondering how to name variants in ZRS, a regulatory sequence for SHH that lies 1 Mb upstream of SHH in intron 5 of LMBR1. Variants in ZRS are associated with various limb abnormalities and to-date have been numbered according to a sequence which does not follow HGVS guidelines. Should we create a genomic reference sequence for SHH that includes 1 Mb of upstream sequence to encompass the ZRS, number it according to the LMBR1 reference sequence, or something else?
A difficult case. I see 3 options;
Question (Isabelle Touitou, Montpellier,
If the first translation ATG is in exon 2, and we find a variant 5' to exon 1, should we include intron 1 in the counting process?
NOTE: based on a coding DNA reference sequence intron 1 is located between nucleotides -15 and -14.
Nucleotides in introns 5' of the ATG translation initiation codon (i.e. in the 5'UTR) are numbered as all other nucleotides (see Examples and Figure). In your example, based on a coding DNA reference sequence, an intron is present between nucleotides -15 and -14. The nucleotides for this intron are numbered as -15+1, -15+2, -15+3, ...., -14-3, -14-2, -14-1. Consequently, regarding the question, when a coding DNA reference sequence is used, these intronic nucleotides are not counted.
The CBS gene was originally thought to contain 16 exons. Later it was recognised that exon 15 does not exist, and recently two additional non-translated 5' exons were detected. The current gene structure therefore includes 17 exons, of which exons 3 to 17 are translated. Should the exons of a gene be counted from the exon that contains the start codon rather than the beginning of the cDNA? If so, should exons preceding the start codon be counted 0, -1, -2, etc. or should the 0 be skipped? Is there an agreement on how to deal with corrections in exon numbering?
For the description of sequence changes it does not matter how exons are numbered; exon (and intron) numbers are not used in the descriptions. In fact this is one reason why the recommendation is as it is (see Description of intronic variants). Examples (using a coding DNA reference sequence);
- c.-5G>T: a change 5' of the ATG (in the 5'UTR)
- c.5G>T: a change in the coding sequence (related to a change in amino acid 2)
- c.256+1G>T: a change in the 5' end of an intron
- c.257-1G>T: a change in the 3' end of an intron
- c.*5G>T: a change 3' of the stop codon (in the 3'UTR)
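The way these descriptions reveal the location of a variant without any exon or intron number can be sketched in a few lines. The regular expression and the function name describe_location are hypothetical and cover only simple positions like the five examples above; the amino acid number is obtained by dividing the nucleotide number by 3 and rounding up.

```python
import re

# Hypothetical pattern for simple coding DNA positions such as the examples
# above: an optional '-' or '*', a number, and an optional intronic offset.
POSITION = re.compile(r"^c\.(?P<anchor>[-*]?\d+)(?P<offset>[+-]\d+)?")


def describe_location(description):
    """Return (region, amino_acid_number) for a simple c. description.

    amino_acid_number is None unless the position lies in the coding sequence.
    """
    match = POSITION.match(description)
    if match is None:
        raise ValueError(f"unsupported description: {description!r}")
    anchor = match.group("anchor")
    if match.group("offset") is not None:
        return "intron", None
    if anchor.startswith("-"):
        return "5' of the ATG", None
    if anchor.startswith("*"):
        return "3' of the stop codon", None
    # Nucleotides 1-3 encode residue 1, 4-6 residue 2, and so on.
    return "coding", (int(anchor) + 2) // 3


if __name__ == "__main__":
    for text in ("c.-5G>T", "c.5G>T", "c.256+1G>T", "c.257-1G>T", "c.*5G>T"):
        print(text, describe_location(text))
```

For c.5G>T the sketch reports the coding sequence and amino acid 2, in line with the example list.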
For exon numbering the only logical thing to do is to start with 1 for the first exon, otherwise eventually problems will emerge. For other numbering schemes only the experts will know the history; newcomers just blindly assume that the first exon is exon 1. Consequently, when historic numbering schemes are used, at some point wrong assumptions will be made and a patient might end up with an erroneous diagnosis.
However, since history will leave its tracks, it is suggested to always mention changed numbering schemes in M&M and in all Figure and Table legends to prevent any further confusion. For tables, even consider adding an additional column indicating the historic / old exon number.
Question (Alessandra Splendore, Rio de Janeiro,
Recently two previously unidentified exons of the TCOF1 gene were identified, and named 6A and 16A. Exon 6A is present in most of the transcribed isoforms, exon 16A is included only in minor isoforms. In updating the nomenclature of reported mutations in TCOF1, should I use a sequence that corresponds to the major isoform (with exon 6A, but without 16A) or the sequence that corresponds to the longest ("most complete") isoform?
This is the eternal problem of changes in the coordinates of a reference sequence. The best solution is that the TCOF1-community gets together and decides to use an updated reference sequence representing the most complete transcript, i.e. including exons 6A and 16A. This updated sequence should be annotated properly, submitted to the RefSeq database and used from then on.
Question (JM Friedman, Vancouver,
We are working on a new locus-specific mutation database for NF1 and NF2, and we have run into a problem with the standard mutation nomenclature based on the genomic sequence. The problem is that the canonical genomic sequence (and consequent numbering) we are using as the basis of the mutation nomenclature has changed repeatedly since many of the mutations were described, and it is continuing to change. If we use the names assigned to the mutations on the basis of the version of the sequence that was used to name the mutations, they do not map to the proper position in the current version of the sequence. If we change the names to match the new sequence, they will not match the published names for these mutations and may need to be changed again the next time the sequence changes. (Actually, the current version of the NF1 sequence is annotated on the wrong strand, so all the numbering would be backwards if we used the annotated strand instead of its complement, which is really the correct one).
The solution to identifying the mutation unequivocally is to provide enough of the surrounding sequence to permit a unique result on a BLAST search, and we are doing this. However, this does not solve the problem of naming the mutations. What is your recommendation for this?
Indeed the problems you mention make life very hard. In fact, especially with genes containing large introns, there will be no one genomic reference sequence since every gene will be slightly different (see above). The problem of continuously changing genomic sequences will not settle rapidly. When designating "THE genomic reference sequence" now, one can already foresee future discussions on whether this choice was proper; it will be a "random pick" and might not be the evolutionarily correct choice. The way to go in our eyes is to declare one sequence THE genomic reference sequence (starting several kilobase pairs 5' of the promoter region), annotate it properly, submit it to the RefSeq database and use it from then on. The RefSeq database has NG_ files specifically made for this purpose (see e.g. NG_000004.2). These problems are one of the reasons why, for the LSDBs I curate (i.e. Johan den Dunnen), I prefer a coding DNA Reference Sequence. In that case the ever changing intronic sequences have only a marginal effect.
For genes that are on the minus strand of a chromosome (opposite transcriptional orientation) the description based on chromosome coordinates may differ significantly from that based on the coding DNA reference sequence. Say the chromosome sequence is -TGGGGCAT- and one of the G's is deleted (change to -TGGG_CAT-). Based on chromosome coordinates the description is g.5delG. However based on the coding DNA reference sequence (ATGCCCCA) the description is c.7delC. Not only is the deleted nucleotide different (delG vs. delC), in fact the descriptions also point to another nucleotide, g.5 vs. g.2 (equal to c.7delC). Is this correct?
Yes, this is correct. When genes are on the minus strand of a chromosome (opposite transcriptional orientation) and the change is located in a repeated sequence (mono-, di-, tri-, etc. stretches) the rule that for all descriptions the most 3' position possible should be assigned (see General recommendations) has this as a consequence.
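The example from the question can be reproduced with a short sketch that applies the most-3' rule separately on each strand. The function names are hypothetical, only single-nucleotide deletions in a mononucleotide stretch are handled, and c.1 is assumed to be the first nucleotide of the reverse-complemented sequence, as in the example.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def most_3prime_single_deletion(seq, position):
    """Shift a 1-nt deletion (1-based position) to its most 3' equivalent."""
    base = seq[position - 1]
    # seq[position] is the nucleotide directly 3' of the current position.
    while position < len(seq) and seq[position] == base:
        position += 1
    return position, base


def describe_on_both_strands(chromosome_seq, g_position):
    """Describe the same deletion on the chromosome and on the minus-strand gene."""
    g_pos, g_base = most_3prime_single_deletion(chromosome_seq, g_position)
    coding_seq = reverse_complement(chromosome_seq)
    # The same nucleotide, counted from the other end of the sequence.
    c_start = len(chromosome_seq) + 1 - g_position
    c_pos, c_base = most_3prime_single_deletion(coding_seq, c_start)
    return f"g.{g_pos}del{g_base}", f"c.{c_pos}del{c_base}"


if __name__ == "__main__":
    # Any of the four G's (g.2 to g.5) gives the same pair of descriptions.
    print(describe_on_both_strands("TGGGGCAT", 3))
```

Because each strand shifts the deletion towards its own 3' end, the two descriptions end up pointing at opposite ends of the repeated stretch, which is exactly the difference noted in the question.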
We are preparing an annotated set of Hox genes from the zebrafish for publication. If the coding DNA sequence is not completely known, but only an EST lacking 5' sequence and a genomic sequence covering the EST, how do you describe a change in this sequence; do you number it in relation to the EST or the genomic sequence? Furthermore, if there is a mismatch between the genomic and the EST sequence, and you don't know which one is correct, how do you define e.g. whether the genomic sequence has an insertion or the EST has a deletion?
First, the reference sequence chosen is always assumed to be the correct sequence simply because changes are described in relation to this sequence.
Second, when the EST sequence is incomplete one should describe changes in relation to this sequence like AA010203.2:54_55insG (assuming the reference sequence used is AA010203.2). So do not use a 'c.' or 'g.' prefix, since neither a coding DNA nor a genomic reference sequence is used. However, when a genomic sequence covering this EST is available the recommendation is to use this as a reference sequence.
Making a judgment on what is the "wild type" (wt) nucleotide for some sequences seems arbitrary at best. How would you suggest that the description be presented for these?
Changes are always described in relation to a "reference sequence". This reference sequence is considered to be the "wild type" sequence and is expected to be the one present in the database (GenBank). Consequently, reference and wild type sequence can be different. Note however that everybody has influence on the sequences in the RefSeq database and thus may request that a variant is changed into the more common allele. However, the debate about what is wild type can be unsolvable when variants are very common (near 50%) or differ between populations.
Question (M Paalman, Human Mutation)
How should sequence variants in the mitochondrial DNA (mtDNA) be described?
The mtDNA genome is rather small, completely sequenced and numbered. According to current recommendations variants in the mitochondrial DNA should be described in relation to the full mitochondrial DNA sequence, i.e. for human the Homo sapiens mitochondrion, complete genome (GenBank NC_012920.1). Descriptions should be preceded by "m.", like m.8993T>C (see Recommendations). The mtDNA encodes a range of different proteins. To prevent confusion, changes at protein level should be described including a reference to the protein changed, like ATP6:p.Leu156Pro (GenBank YP_003024031.1, ATP synthase 6).
NOTE: for issues related to mitochondrial DNA sequences see MITOMAP.
How should sequence variants be described in genes that produce only RNA (so no protein), e.g. ncRNA, miRNA, etc.?
To describe variants in genes that produce an RNA molecule but no protein, a genomic reference sequence can be used ("g." description). When available, it is also possible to use an NR_ transcript reference sequence (e.g. NR_000020.1 for the small nucleolar RNA, C/D box 33 (SNORD33) gene) using the prefix "n." (see Standards). Numbering for the transcript reference sequence starts with position "n.1" and ends with the last position.
NOTE: suggested addition, see SVD-WG002