Coursera BioinfoMethods-I Lab01 PDF
Copyright 2014 by D.S. Guttman and N.J. Provart
LAB 1a EXPLORING NCBI
[Software needed: web access]
The National Center for Biotechnology Information (NCBI), maintained by the US National
Library of Medicine and National Institutes of Health, is one of the world's most important
resources and repositories for biological data. This fantastic online resource provides an
extensive network of databases cataloging an ever-growing wealth of genetic, medical, and
biochemical information from all walks and crawls of life. Entire genomes, from viruses to
humans, are compiled, organized, and cross-referenced within these networks, such that surfing
the genome can be almost as easy as surfing the web.
But you have to know a) what you're looking for, and b) what you're looking at to get anything
out of these databases. This is what this first lab is going to help you do. Note that Google and
other search engines typically do not index database-driven websites, which is why they cannot be
used to search for information that is stored at NCBI.
The primary portal for accessing data at NCBI is called GQuery. But first, let's start by visiting
NCBI's website and examining the interface, which undergoes constant change.
1. Open your Web browser and go to NCBI's homepage: www.ncbi.nlm.nih.gov. This page
provides links to all of NCBI's databases and resources. It's worth exploring here just to
get a better idea of the scope of NCBI. If you click About the NCBI you will be taken to
a page summarizing some of these resources. The Science Primer provides a nice
introduction to some of the tools and methods used. You can also check out the NCBI
handbook (http://www.ncbi.nlm.nih.gov/books/NBK21101/) for more information.
Figure 1. The NCBI homepage.
Bioinformatic Methods I Lab 1
2. Now let's move to the GQuery portal: select All Databases from the navigation bar at the
top of the NCBI start page, then click Search with the search field left empty. First, scan down the
assortment of databases queried through GQuery. You will notice there is everything from
the biomedical literature at PubMed to nucleotide databases, taxonomy databases, protein
structure databases, and expression profile databases. Let's see what happens when you do
an unguided search on the site. In the "Search across databases" box, type in bacteria. The
output is a summary page. A search for bacteria gives thousands of hits, which is not very helpful.
We need specifics.
Figure 2. The GQuery portal page with bacteria used as a search word.
3. Usually when searching these databases, you have either a region of DNA or a protein (or
protein function) of interest. For this lab you'll be using a gene from Arabidopsis thaliana, a
small flowering plant that is like the fruit fly of the plant world as it has a comparatively
rapid life cycle and requires little space to grow. The protein product of this gene is recorded
under accession number NP_565676, and it is a structural component of the ribosome.
4. Go back to the NCBI GQuery portal page and try a more focused search. Use the search
terms associated with the gene sequence we'll be using, together with the GenBank Field
Qualifiers shown below (a full list of qualifiers is presented in Appendix 1). Try the four
different searches presented below:
gene keywords
e.g. structural constituent of ribosome
gene keyword AND organism
e.g. structural constituent of ribosome AND Arabidopsis thaliana
gene keyword [PROT] AND organism [ORGN]
e.g. structural constituent of ribosome [PROT] AND Arabidopsis thaliana [ORGN]
accession or gi number
e.g. NP_565676
That narrowed things down significantly!
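The four searches above can also be assembled as plain strings. The sketch below is an illustration only: it assumes Python 3 and NCBI's public E-utilities ESearch endpoint, which is not part of this lab, and the helper names `tagged` and `esearch_url` are hypothetical. It only builds the query strings and URLs and does not contact NCBI.

```python
from urllib.parse import urlencode

# Base URL of NCBI's E-utilities ESearch service (an assumption: not used in this lab).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def tagged(term, field=None):
    """Return a search term, optionally restricted with a field qualifier such as ORGN."""
    return f"{term} [{field}]" if field else term

def esearch_url(query, db="protein"):
    """Build (but do not send) an ESearch URL for the given query and database."""
    params = urlencode({"db": db, "term": query})
    return f"{ESEARCH}?{params}"

keyword = "structural constituent of ribosome"
organism = "Arabidopsis thaliana"

# The four searches from the list above, from least to most specific.
queries = [
    keyword,
    f"{keyword} AND {organism}",
    f"{tagged(keyword, 'PROT')} AND {tagged(organism, 'ORGN')}",
    "NP_565676",
]

for q in queries:
    print(q)
    print("   ", esearch_url(q))
```

Building the query in one place makes it easy to see exactly which qualifier is attached to which term before you paste it into the search box.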
Note that using parentheses can be very helpful in making sure you get exactly what you
want. For example:
SMC AND (yeast [ORGN] OR Arabidopsis [ORGN])
is a very different search than
SMC AND yeast [ORGN] OR Arabidopsis [ORGN]
Using quotation marks can also dramatically affect your search (e.g. "16s rRNA" vs.
16s rRNA).
Finally, always capitalize the Boolean operators such as AND / OR / NOT.
Ultimately, the most specific search items you can use are gi or accession numbers.
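To see why the parentheses in the SMC example matter, here is a toy model using Python sets. The record IDs are invented, and the sketch assumes that a query without parentheses is evaluated strictly left to right, which is how NCBI's help pages describe Boolean processing; treat it as an illustration, not as a description of the search engine's internals.

```python
# Toy record sets: the record IDs are invented purely for illustration.
smc = {1, 2, 3}            # records mentioning SMC
yeast = {2, 4, 5}          # records from yeast
arabidopsis = {3, 6, 7}    # records from Arabidopsis

# SMC AND (yeast [ORGN] OR Arabidopsis [ORGN])
with_parentheses = smc & (yeast | arabidopsis)

# SMC AND yeast [ORGN] OR Arabidopsis [ORGN], read left to right:
# (SMC AND yeast) OR Arabidopsis
without_parentheses = (smc & yeast) | arabidopsis

print(sorted(with_parentheses))     # SMC records from either organism
print(sorted(without_parentheses))  # also every Arabidopsis record, SMC or not
```

In this toy data the second query returns Arabidopsis records that have nothing to do with SMC, which is the kind of surprise the parentheses prevent.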
Box 1. Accession Numbers, Version Numbers, and GI Numbers
An Accession number is a unique identifier for a particular sequence record. An accession
number is assigned to a specific record and stays with that record forever. In other words,
Accession numbers track a particular record and do not change even if the information in the
record is changed at the author's request (e.g. if a better annotation or more complete sequence
is provided). Accession numbers are usually a combination of a letter(s) and numbers, such as a
single letter followed by five digits (e.g., U12345) or two letters followed by six digits (e.g.,
AF123456).
Version numbers follow the Accession number and indicate the revision history of that entry
starting with 1 and increasing with each revision. The standard format is Accession.Version.
Lab Quiz
Question 1
A GI number (GenInfo Identifier, sometimes written in lower case, "gi") is simply a series of
digits that are assigned consecutively to each sequence record processed by NCBI. The GI
system of identifiers runs parallel to the accession.version system; therefore, if the DNA or
protein sequence changes in any way, it will receive a new GI number.
Example: When a new entry is submitted to GenBank it will be assigned an accession number
(say AF000001). Since this is the first version the Accession will be appended with .1, so it
will look like AF000001.1. At the same time it will be given a GI number (say GI:1234567).
Now imagine that the researcher who originally submitted the record wants to update the
information. The updated record will keep the same Accession number, but increase in version
number (AF000001.2), while the new record will be given a completely new GI number (say
GI:9876543).
Why is this important? The Accession number will always give you the most up to date
information on a record, while the GI number will always take you back to a specific record.
There are times when you want the most current information, and other times when you want to
point to a particular piece of information from a particular point in time (e.g. a particular record
that you did an analysis with), even if more information has been subsequently added.
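The Accession.Version format from Box 1 is easy to take apart in code. The following is a minimal sketch, assuming only the two classic GenBank patterns described above (one letter plus five digits, or two letters plus six digits); other identifier styles such as NP_565676 would need a broader pattern. The function name `split_accession` and the toy `history` table are hypothetical, and the GI numbers are the made-up ones from the example.

```python
import re

# The two classic formats described in Box 1, with an optional ".version" suffix.
ACCESSION = re.compile(r"^([A-Z]\d{5}|[A-Z]{2}\d{6})(?:\.(\d+))?$")

def split_accession(identifier):
    """Split 'AF000001.2' into ('AF000001', 2); the version is None if absent."""
    match = ACCESSION.match(identifier)
    if match is None:
        raise ValueError(f"not a recognised accession: {identifier!r}")
    accession, version = match.groups()
    return accession, int(version) if version else None

# Toy history of one record: each version keeps the accession but gets a new GI.
history = {"AF000001.1": 1234567, "AF000001.2": 9876543}

# The most up-to-date version is the one with the highest version number.
latest = max(history, key=lambda acc_ver: split_accession(acc_ver)[1])

print(split_accession("U12345"))
print(split_accession("AF000001.2"))
print("latest version:", latest, "GI:", history[latest])
```

Looking a record up by its bare accession corresponds to picking `latest`, whereas a GI number (or a full Accession.Version) always points at one fixed entry in the table.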
Box 2. NCBI Help
This is a good time to get familiar with NCBI's thorough Help index for future reference. With
this index, you should be able to access most of the background you need for understanding
how these databases work on your own (there's also an NCBI YouTube channel, if you're so
inclined to acquire your information that way).
1. To the right of the search text box on the GQuery portal page is the Help icon.
Click on it.
2. You are now in Entrez Help. The Entrez collection of databases is queried when you use
the GQuery interface. Note the section in the right sidebar that explains everything from
search options to saving sets of records.
3. Notice that under the section Using the Advanced Search Page to Construct Complex
Search Statements some other appropriate qualifiers are given.
5. Search for your given accession number through the GQuery portal page (e.g. NP_565676
from above). It should give you one protein sequence hit. Click on it and the following link
so that you get its full GenBank description.
Figure 3. GenBank record for accession NP_565676, in GenPept format.
6. Notice all the hyperlinks within the text. It looks messy, but is in fact straightforward. For
example, for taxonomic information, click on the SOURCE ORGANISM hyperlink. Some
records have links to the primary publication where this sequence was originally cited in a
PUBMED number hyperlink (not the case in the above example, but there is a PubMed
reference for the sequence). Click around on different links and see what you find.
a. What is the taxonomic lineage of your organism?
b. Has the genome of this organism been sequenced, i.e. is there a Genome Project?
c. If so, can you find the accession for the full sequence or one of the chromosomes?
You can find out much more about the structure of the GenBank file at
http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html
7. Go back to the GenBank record and click on the CDS link, just above the actual sequence
(circled in red in Figure 3 on the previous page).
a. Where did this take you or what happened when you did this?
8. Go back to the GenBank record and examine the Related Information section on the lower
right. This gives you direct links to other databases with information on this query. Find the
Gene link.
Figure 4. The Links (Related Information) menu
9. Select Gene from the Links menu. This is a great starter resource at NCBI. Scroll through
the different sections. Use them to answer the following questions.
a. Where is your gene's position in the genome (tip: mouse-over the green bar, which
represents the gene in the sequence viewer)?
b. What are the names of the genes surrounding it (genomic context)?
c. Does it have any conserved domains (scroll down to the Genome Annotation
section)? What are they called?
d. What biological process (Gene Ontology terms) is this gene involved with (again,
scroll down!)?
Figure 5. GenBank Gene page for AT2G28830.
10. On the Gene page, there are also other links (see the sidebar on the right) to examine a gene's
structure, function and phylogenetic relationships further.
Click on Additional Links.
a. What kind of information does this section tell you?
Return to the Gene page and click on Map Viewer from the Related Information menu.
Use the selector on the left hand side of the screen to zoom in and out. Scroll along the
genome to see the order of genes. Use the gene's locus tag (found on the Gene page) to
find your gene this way using the Search function. You may have to zoom in pretty far.
You can also scroll along the genome by clicking the small up and down arrows at the
top and bottom of the Genes_seq genome graphic.
Click on the small black box pointed to by the gene's locus tag.
b. How many exons do you see in this gene? Tip: this can also be determined from
the Gene page's sequence viewer entry: how many green bars are there?
Go back to the Map Viewer.
Click around and explore the variety of ways that data is interconnected and displayed
(don't worry, you can't break anything).
Figure 6. NCBI Map Viewer for part of Arabidopsis thaliana chromosome 2.
Box 3. Helpful Hints for GQuery searches
Go back to NCBI GQuery, search for your gene again using your given accession number.
Click on Save Search beside the search box. Register for an account and save your search.
You can also combine previous searches using the History tab and the search numbers listed
within it, as well as save your searches by registering for a myNCBI account, so you don't have
to keep redoing the same searches in the future.
Lab 1b Basic BLAST (blastn)
One of the most important bioinformatic strategies used for the functional annotation of
genes and genomes is to predict the function of uncharacterized genes or proteins based on
their similarity to sequences with better functional annotations. BLAST is perhaps the
single most important tool for finding database sequences that are similar to a query of
interest.
Box 4. BLAST and Homology
The Basic Local Alignment Search Tool (BLAST) is a very powerful approach to identifying
database sequences that share local similarity to a query sequence (see below for definitions).
There is a very important chain of assumptions used in biological research that is generally
followed when using BLAST:
Homologous genes share sequence similarity
Orthologous genes have the highest similarity among multiple species
Orthologous genes most likely have similar functions
Consequently, sequences that are most similar between multiple
species share similar functions
Note, it is very important to understand that these are only assumptions, and there are many
reasons and instances where these assumptions prove to be false. Nevertheless, they are a
reasonable starting place.
Definitions:
Similar sequences: sequences that share a significant number of residues (nucleotides
or amino acids). Sequences can be similar due to homology or simply by chance. The
higher the similarity between sequences, the more likely they are to be homologous.
Homologous sequences: sequences that are related through common ancestry.
Homology is qualitative: two sequences either are, or are not, related through common
ancestry. Homologous sequences can vary greatly in their level of similarity, from
100% to 0%.
Orthologous sequences: sequences that are related through a past speciation event.
Orthologous sequences are assumed to share common functions.
Paralogous sequences: sequences that are related through a past gene duplication event.
Genes often diverge in function after duplicating; therefore, paralogous sequences are not
assumed to share a common function.
Query sequence: your sequence; the sequence you are interested in finding more about.
High Scoring Segment Pair (HSP): hits to the database. A subsequence match
between your query sequence and a database sequence returned by BLAST.
Local alignment: a sequence alignment that extends only across part of the sequence.
Global alignment: a sequence alignment that extends across the entire sequence (from
end to end).
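The difference between a local and a global alignment can be made concrete with a small dynamic-programming sketch. This is not how BLAST itself works (BLAST uses a faster heuristic); it is a textbook illustration assuming a simple scoring scheme of +1 for a match and -1 for a mismatch or gap. The function name `alignment_score` and both sequences are invented for the example.

```python
def alignment_score(a, b, local, match=1, mismatch=-1, gap=-1):
    """Best alignment score of a and b by dynamic programming.

    local=False: global alignment (Needleman-Wunsch style), end to end.
    local=True:  local alignment (Smith-Waterman style), best-matching parts only.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # A global alignment pays for gaps at the ends of the sequences.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            candidates = [
                score[i - 1][j - 1] + pair,  # align the two residues
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            ]
            if local:
                candidates.append(0)         # a local alignment may restart anywhere
            score[i][j] = max(candidates)
            best = max(best, score[i][j])
    return best if local else score[-1][-1]

# Two invented sequences sharing the motif GATTACA but with unrelated flanks.
query = "TTTTGATTACATTTT"
subject = "CCCGATTACACCC"

print("local score :", alignment_score(query, subject, local=True))
print("global score:", alignment_score(query, subject, local=False))
```

The local score rewards only the shared motif, while the global score is dragged down by the unrelated flanking regions, which is why a local method suits database searches where only part of a sequence may be conserved.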
1. First, we need a query sequence for the search. Let's start with our given gene again, but this
time we'll use its corresponding nucleotide sequence, not its protein sequence. First try
finding the gene's DNA sequence using GQuery again.
On the GQuery Portal (All Databases) page, search for your given protein sequence again
using the Accession or GI number (or alternatively, go back to the search you saved in
your NCBI account). Using the protein from the first part of this lab, we would search
for NP_565676.
The first page that comes up is the summary page. Once you're on this page you can
move to the database of interest. In this case you probably don't have hits in too many
databases since you had a very specific search.
Figure 7. GQuery portal queried for NP_565676 (partial view).
Try clicking the Gene link. Does the Gene page give you the gene sequence alone? What
do you get instead? Note the context specific link menus that pop up when you hover
over the graphic of the gene with your mouse pointer. You can click on the icon in the
pop up menu to get links to various sequences and analyses associated with the gene.
Note that the green track is a composite of the mRNA and CDS tracks; click on either
the NM_ or NP_ number to see the deconvolution of the green track (Figure 8).
Figure 8. Part of the Gene page for NP_565676, showing pop-up to sequence links.
Click on the mRNA link (NM_128442.5; the M in the accession number denotes
mRNA) and select GenBank View (you may need to scroll to the right to access this
link). This takes you to the mRNA that encodes the protein you have been looking at.
Notice the feature list in the record. One feature is gene, which corresponds to base
positions 1-3218 on this record. Another feature is the coding sequence (CDS), which
corresponds to base positions 77-2965.
a. Given your biology background knowledge, why do you think these are different?
On the pop-up on the Gene page click on the Nucleotide Link (NC_003071.7), and select
GenBank View. This takes you to the genomic region that encodes the mRNA you were
just looking at. Notice how the gene feature corresponds to positions 1-3937, while the
mRNA feature corresponds to positions 1-814, 882-1007, 1129-1394, 1670-2768,
2855-3304, 3388-3504, and 3592-3937, and the CDS feature corresponds to positions
254-814, 882-1007, 1129-1394, 1670-2768, 2855-3304, 3388-3504, and 3592-3861.
b. Again, why are these different? Tip: recall the Central Dogma of Molecular
Biology.
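If you want to check the coordinate arithmetic while thinking about these questions, a few lines of Python will do it. The sketch below simply copies the mRNA and CDS positions quoted above for the genomic record (1-based, inclusive) and adds them up; the helper names `total_length` and `gaps_between` are hypothetical, and interpreting the numbers is left to you.

```python
# Feature coordinates on the genomic record, as listed in the text (1-based, inclusive).
mrna_segments = [(1, 814), (882, 1007), (1129, 1394), (1670, 2768),
                 (2855, 3304), (3388, 3504), (3592, 3937)]
cds_segments = [(254, 814), (882, 1007), (1129, 1394), (1670, 2768),
                (2855, 3304), (3388, 3504), (3592, 3861)]

def total_length(segments):
    """Sum the lengths of 1-based, inclusive (start, end) segments."""
    return sum(end - start + 1 for start, end in segments)

def gaps_between(segments):
    """Return the regions lying between consecutive segments."""
    return [(prev_end + 1, next_start - 1)
            for (_, prev_end), (next_start, _) in zip(segments, segments[1:])]

print("gene span            :", 3937 - 1 + 1)
print("sum of mRNA segments :", total_length(mrna_segments))
print("sum of CDS segments  :", total_length(cds_segments))
print("regions between mRNA segments:", gaps_between(mrna_segments))
```

Compare the two sums with the gene and CDS positions you saw on the mRNA record (NM_128442) before answering question b.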
Figure 9. GenBank record for NM_128442 mRNA.
Let's return to the mRNA record we were previously working with (NM_128442). Click
on the CDS link. Now you are looking at the information for the coding sequence, as
opposed to the whole gene or protein (highlighted in brown).
Using the Display: FASTA option in the grey bar at the bottom of the page, generate a
FASTA-formatted version of the CDS.
Now you have the sequence in the most basic and easily managed format: FASTA
format. FASTA format is simply a header line that starts with a > followed by text
describing the sequence, and then the actual sequence beginning on the next line. The
sequence can be either DNA or protein, and may be continuous (scrolling off the page),
or cut into more manageable lines, typically 60 to 80 residues long.
Figure 10. Sequence in FASTA text format.
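Because FASTA is so simple, it takes only a few lines of code to read and write. The following is a minimal sketch in Python, assuming well-formed input; the function names `parse_fasta` and `format_fasta` are hypothetical and the example record is invented, not a real NCBI sequence.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)           # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

def format_fasta(header, sequence, width=60):
    """Write one record with the sequence wrapped at `width` residues per line."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([f">{header}"] + lines)

# Invented example record, split over two sequence lines.
example = """>example_1 made-up DNA sequence
ATGGCGTCTA
AAGGTTTGA
"""

for name, seq in parse_fasta(example):
    print(name, len(seq))
    print(format_fasta(name, seq, width=8))
```

Whether the sequence is on one long line or wrapped into short ones makes no difference to the parser, which is why both layouts are accepted by tools such as BLAST.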
2. Let's do some BLASTing. Use the Run BLAST link in the Analyze This Sequence part of
the webpage. [Or open a new tab or window in your browser and go back to the NCBI home
page (www.ncbi.nlm.nih.gov), then select BLAST from the Resources dropdown along the
top, under the DNA&RNA subsection].
There are lots of options here. We will discuss some of these next lab, but right now let's
work with the simplest. We want to do a nucleotide blast.
On the BLAST page, note that under the Enter Query Sequence section, the NCBI
system has automatically entered the accession number (but you can also enter a gi
number, or FASTA sequence). You could also copy-and-paste the FASTA formatted
mRNA sequence you found in the previous step into the query box.
Figure 11. The blastn query page.
Scan the sections of the page. You have quite a bit of control over how the algorithm runs
(particularly if you click Algorithm parameters near the bottom).
We want to query the full NCBI database; the NCBI linking system has automatically
changed the default Database (which is Human) to Other and Nucleotide collection
(nr/nt) because our sequence is non-human. The nr database is the non-redundant
collection of sequences in GenBank.
Change the Program Selected / Optimized for to Somewhat similar sequences (blastn).
Note all the small question mark icons around the page. Click any one of these to find
out more about the associated parameter. For example, by clicking the question mark in
the Program Selection section you get a very brief summary of the different methods.
By clicking more you jump to a new page with full documentation for the algorithms.
a. When would you want to use megaBLAST? What about discontinuous
megaBLAST? (if you have time, try each and see how your results differ)
Figure 12. Algorithm parameters for blastn.
Open the Algorithm Parameters near the bottom.
b. What is the Expect threshold?
c. What would happen if you decreased it? Increased it?
d. What would be the effect of increasing the Word size?
e. Why is there a Low complexity regions filter? Should we keep it on?
Make sure you have your query sequence entered in the input box, and check the box
next to Show results in a new window near the BLAST button. Now (finally) click the
BLAST button.
While BLAST is running or after the search is complete you can choose to adjust the
format of the search results by clicking on the Format options link. We won't do this
right now, as the defaults usually work fine.
Box 5. How Good is My Hit?
The quality of a BLAST HSP is quantified in a number of different ways. It is important that
you understand the differences between these metrics and use the appropriate one.
Identity: the extent to which two sequences are invariant. A very poor measure since it
doesn't take into account the subtleties of sequence relationships (e.g. a small region of a
highly conserved domain within two sequences that are otherwise very poorly
conserved).
Bit score: the alignment score (S). A very precise measure that is normalized over the
particular score system employed. Suffers from the disadvantage of being dependent on
the length of the query.
E value: the expect value. A significance measure equal to the number of different
alignments with scores at least as good as that observed that are expected to occur
Lab Quiz
Question 2
simply by chance. The lower the E value, the more significant the score. This is by far
the best metric to use since results of different searches in the same database can be
readily compared. Note that E value is dependent on the size of the database (n) and the
length of the query sequence (m). The same sequence searched on different databases
containing identical hit sequences would result in different E values being reported.
$$E = m \, n \, 2^{-S}$$
We'll go into greater detail about this calculation in next week's class.
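To get a feel for how the E value (query length m times database size n times 2 to the power of minus the bit score S) responds to database size and bit score, you can evaluate it directly. The numbers below are illustrative assumptions, not values from a real search, and the function name `expect_value` is hypothetical.

```python
def expect_value(query_length, database_size, bit_score):
    """E = m * n * 2**(-S): m = query length, n = database size, S = bit score."""
    return query_length * database_size * 2.0 ** (-bit_score)

m = 2889      # query length in nucleotides (illustrative)
S = 60.0      # bit score of a hypothetical HSP

small_db = expect_value(m, 1_000_000, S)        # database of 1e6 residues
large_db = expect_value(m, 1_000_000_000, S)    # database of 1e9 residues

print(f"E in the small database: {small_db:.3g}")
print(f"E in the large database: {large_db:.3g}")
print(f"ratio large/small      : {large_db / small_db:.0f}")

# Each additional bit of score halves the E value.
print(f"E with one more bit    : {expect_value(m, 1_000_000, S + 1):.3g}")
```

The same alignment with the same bit score looks less significant in the larger database, which is why E values from searches against different databases should not be compared directly.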
3. The Results page is broken up into sections.
At the very top is the job summary, which simply shows details about your query and the
database searched. You can find more details about your search by clicking Search
Summary.
a. How many sequences are in the nr database?
b. What sequences are not included in the nr database?
Figure 13. Blastn output search summary.
Next is the Graphic Summary. Scroll your mouse over the coloured bars.
c. What do the coloured bars mean?
d. How does the colour code work?
e. What information is displayed in the box near the top of the graphic summary?
f. What do you notice about the significance values as you move down the graphical
summary?
g. What is the genus and species of the top (best) hit?
h. What happens if you click on one of the entries?
Figure 14. Blastn output graphic summary.
The Descriptions section is next, listing:
Description
Max Score: the alignment bit score.
Total Score: another alignment bit score, which may differ from the Max Score if
your query matched a single database entry in multiple regions.
Query Coverage: the percentage of the query that had similarity to the database hit.
E-value: probably the best measure of hit quality. Smaller numbers mean better
hits, with 0.0 being the best value possible.
Identity: the highest identity found between query and HSP.
Accession: linked to the indicated sequence at NCBI.
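Two of these columns, Identity and Query Coverage, are simple ratios that you can reproduce by hand. The sketch below computes them for a single gapped alignment, assuming identity is counted over all alignment columns and coverage over the full query length; BLAST's own Query Coverage combines all HSPs for a hit, so treat this as a simplified illustration. The function name `hsp_metrics` and the alignment are invented.

```python
def hsp_metrics(aligned_query, aligned_subject, query_length):
    """Percent identity and query coverage for one gapped pairwise alignment.

    The two aligned strings must be the same length, with '-' marking gaps.
    """
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have equal length")
    columns = len(aligned_query)
    # Identical columns: same residue in both rows, and not a gap.
    identical = sum(q == s and q != "-"
                    for q, s in zip(aligned_query, aligned_subject))
    # Query residues taking part in the alignment (gaps excluded).
    query_residues = sum(q != "-" for q in aligned_query)
    identity = 100.0 * identical / columns
    coverage = 100.0 * query_residues / query_length
    return identity, coverage

# Invented alignment of 10 columns taken from a 40-nucleotide query.
identity, coverage = hsp_metrics("ACGT-ACGTA", "ACGTTACCTA", query_length=40)
print(f"identity: {identity:.0f}%   query coverage: {coverage:.1f}%")
```

A short HSP can have a high identity and still cover only a small fraction of the query, so the two columns are best read together.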
i. How many sequence matches are listed for this query sequence? How are they
ordered? (you can sort these segments in other ways, like by identity, score, and
query start position.)
j. What happens if you click the Accession hotlink?
k. What happens if you click the Alignments hotlink?
Figure 15. Blastn output descriptions
Finally we get down to the actual HSP Alignments.
Compare the information presented for the first HSP alignment to the first entry in
the graphical summary and HSP summary.
As you scroll down the alignments, you will see the alignment quality drop.
l. What do the vertical bars ( | ) represent between the Query and the Sbjct
(database sequence)?
m. What do Strand=Plus/Plus and Strand=Plus/Minus mean? Hint: are genes
always in the same direction on a piece of chromosomal DNA?
Go back to the top of the page and click Formatting options. Change the Alignment
View to Query-anchored with dots for identities. Click Reformat and scroll down to
the HSP alignment section.
n. Describe the difference between this format and the previous format. Can you
imagine cases where the different formats might be most useful?
o. Play with these format options to get a feel for what they mean.
Lab Quiz
Question 3
Return the formatting to the original Pairwise format. Go back to the graphical
summary. If there are any low-scoring segments (i.e. green or blue-coded blocks), click
on one.
p. What is its E-value?
q. Does it have a high percent identity? If so, why would BLAST give it such a
poor E-value?
r. Do you think these hits are homologous? Why or why not?
Figure 16. Blastn output alignments.
End of Lab!
Lab 1 Objectives
By the end of Lab 1 (comprising the lab including its boxes, and the lecture), you should:
know how to search for records at NCBI, both using search terms or identifiers (first part
of lab) and GQuery, or using a nucleotide sequence and BLAST;
know the difference between a GenBank accession number, a version number, and a GI
number;
understand the difference between the nucleotide sequence database part of GenBank and
the protein sequence part of it;
know the parts of a GenBank record and be able to switch between sequence formats
(e.g. to FASTA format);
be familiar with the interconnectedness of various NCBI databases and be able to call up
linked records with ease;
be able to use nucleotide BLAST (Blastn) to search GenBank, and be able to interpret the
output (what does the E-value tell you, etc.);
understand the meaning of homologous, orthologous, and paralogous sequences;
be able to use the Help function to address any question you may have with regards to the
NCBI interface (if you have any questions on background material, check in with the
forums for this course on Coursera!).
Do not hesitate to post any questions you might have to the Forum section of the Coursera
website for this course if you do not understand any of the above after reading the relevant
material.
Further Reading
Section I Introduction and Biological Databases in Essential Bioinformatics by Jin Xiong, Cambridge University
Press, 2006. pp 3-27.
SF Altschul, TL Madden, AA Schaffer, J Zhang, Z Zhang, W Miller, and DJ Lipman (1997) Gapped BLAST
and PSI-BLAST: a new generation of protein database search programs. Nucl. Acids Res. 25: 3389-3402.
NM Luscombe, D Greenbaum, M Gerstein (2001) What is bioinformatics? An introduction and overview.
Yearbook of Medical Informatics 2001:83.
CA Kerfeld, KM Scott (2011) Using BLAST to Teach E-value-tionary Concepts. PLoS Biol 9(2):
e1001014. http://dx.doi.org/10.1371/journal.pbio.1001014.
Appendix 1: GenBank Field Qualifiers
From http://www.ncbi.nlm.nih.gov/books/NBK3837/#EntrezHelp.Entrez_Searching_Options
Accession [ACCN]
Contains the unique accession number of the sequence or record, assigned to the nucleotide, protein, structure,
genome record, or PopSet by a sequence database builder. The Structure database accession index contains the PDB
IDs but not the MMDB IDs.
All Fields [ALL]
Contains all terms from all searchable database fields in the database.
Author Name [AUTH]
Contains all authors from all references in the database records. The format is last name space first initial(s), without
punctuation (e.g., marley jf).
EC/RN Number [ECNO]
Number assigned by the Enzyme Commission or Chemical Abstract Service (CAS) to designate a particular enzyme
or chemical, respectively.
Feature Key [FKEY]
Contains the biological features assigned or annotated to the nucleotide sequences and defined in the
DDBJ/EMBL/GenBank Feature Table (http://www.ncbi.nlm.nih.gov/projects/collab/FT/index.html). Not available
for the Protein or Structure databases.
Filter [FILT]
Contains predetermined or filtered subsets of the various databases. These subsets or filters are created by grouping
records that are commonly linked to other GQuery databases or within the same database. For example, the PopSet
database Filter index includes PopSet all, PopSet medline, PopSet nucleotide, and PopSet protein. The PopSet
medline filter includes all PopSet records with links to PubMed; the PopSet nucleotide filter includes all PopSet
records with links to the nucleotide database; and, the PopSet protein filter includes all PopSet records with links to
the protein database. The PopSet all filter includes all PopSet records.
Gene Name [GENE]
Contains the standard and common names of genes found in the database records. This field is not available in
Structure database.
Issue [ISS]
Contains the issue number of the journal in which the data were published.
Journal Name [JOUR]
Contains the name of the journal in which the data were published. Journal names are indexed in the database in
abbreviated form (e.g., J Biol Chem). Journals are also indexed by their ISSNs. Browse the index if you do not
know the ISSN or are not sure how a particular journal name is abbreviated.
Keyword [KYWD]
Contains special index terms from the controlled vocabularies associated with the GenBank, EMBL, DDBJ, SWISS-
Prot, PIR, PRF, or PDB databases. Browse the Keyword indexes of the individual databases to become familiar with
these vocabularies. A Keyword index is not available in the Structure database.
Modification Date [MDAT]
Contains the date that the most recent modification to that record is indexed in GQuery, in the format
YYYY/MM/DD (e.g., 1999/08/05). A year alone, (e.g., 1999) will retrieve all records modified for that year; a year
and month (e.g., 1999/03) retrieves all records modified for that month that are indexed in GQuery.
Molecular Weight [MOLWT]
Molecular weight of a protein, in Daltons (Da), calculated by the method described in the Searching by Molecular
Weight section of the GQuery help document. Note that molecular weight must be entered as a fixed 6 digit field,
filled with leading zeros (not letter O), e.g., 002002 [MOLWT]
Organism [ORGN]
Contains the scientific and common names for the organisms associated with protein and nucleotide sequences.
Page Number [PAGE]
Contains the number of the first journal page of the article in which the data were published.
Primary Accession [PACC]
Contains the primary accession number of the sequence or record, assigned to the nucleotide, protein, structure,
genome record, or PopSet by a sequence database builder. A Primary Accession index is not available in the
Structure database.
Properties [PROP]
Contains properties of the nucleotide or protein sequence. For example, the Nucleotide database's Properties index
includes molecule types, publication status, molecule locations, and GenBank divisions. A Properties index is not
available in the Structure database.
Protein Name [PROT]
Contains the standard names of proteins found in database records. Common names may not be indexed in this field
so it is best to also consider All Fields or Text Words. A Protein Name index is not available in the Structure
database.
Publication Date [PDAT]
Contains the date that records are released into GQuery, in the format YYYY/MM/DD (e.g., 1999/08/05). It is the
date the entry first appeared in GenBank explicitly indexed in GQuery. A year alone, (e.g., 1999) will retrieve all
records for that year; a year and month (e.g., 1999/03) will retrieve all records released into GenBank for that month.
SeqID String [SQID]
Contains the special string identifier, similar to a FASTA identifier, for a given sequence. A SeqID String index is
not available in the Structure database.
Sequence Length [SLEN]
Contains the total length of the sequence. Sequence Length indexes are not available in the Structure or PopSet
databases.
Substance Name [SUBS]
Contains the names of any chemicals associated with this record from the CAS registry and the MEDLINE Name of
Substance field. Substance Name indexes are not available in the Genome or PopSet databases.
Text Word [WORD]
Contains all of the "free text" associated with a record.
Title Word [TITL]
Includes only those words found in the definition line of a record. The definition line summarizes the biology of the
sequence and is carefully constructed by database staff. A standard definition line will include the organism, product
name, gene symbol, molecule type and whether it is a partial or complete cds. Title Word indexes are not available
in the Structure or PopSet databases.
Uid [UID]
Contains the Medline unique identifier for records that contain published references that are linked to PubMed. The
Uid index is not browsable.
Volume [VOL]
Contains the volume number of the journal in which the data were published.
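Several qualifiers in this appendix expect values in a fixed format: [MOLWT] needs six digits padded with leading zeros, and [MDAT] and [PDAT] need dates written as YYYY/MM/DD. A small sketch of how such terms could be formatted in Python follows; the helper names `molwt_term` and `date_term` are hypothetical and the code only builds strings, it does not run a search.

```python
from datetime import date

def molwt_term(daltons):
    """Format a molecular weight as the fixed six-digit, zero-padded [MOLWT] term."""
    if not 0 <= daltons <= 999999:
        raise ValueError("molecular weight must fit in six digits")
    return f"{daltons:06d} [MOLWT]"

def date_term(day, field="MDAT"):
    """Format a date as YYYY/MM/DD followed by a date field qualifier."""
    return f"{day:%Y/%m/%d} [{field}]"

print(molwt_term(2002))
print(date_term(date(1999, 8, 5)))
print(date_term(date(1999, 8, 5), field="PDAT"))
```

Formatting the values in code avoids the common slips mentioned above, such as typing the letter O instead of a zero or dropping the leading zeros.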