Bioinformatics Class Notes
10/6/00
What is Bioinformatics?
In the last few decades, advances in molecular biology and the equipment available for
research in this field have allowed the increasingly rapid sequencing of large portions of
the genomes of several species. In fact, to date, several bacterial genomes, as well as
those of some simple eukaryotes (e.g., Saccharomyces cerevisiae, or baker's yeast) and
more complex eukaryotes (C. elegans and Drosophila) have been sequenced in full. The
Human Genome Project, designed to sequence all 24 of the human chromosomes, is also
progressing and a rough draft was completed in the spring of 2000.
Popular sequence databases, such as GenBank and EMBL, have been growing at
exponential rates. This deluge of information has necessitated the careful storage,
organization and indexing of sequence information. Information science has been applied
to biology to produce the field called bioinformatics. The field includes:
the development of new algorithms and statistics with which to assess relationships
among members of large data sets;
the analysis and interpretation of various types of data including nucleotide and
amino acid sequences, protein domains, and protein structures; and
the development and implementation of tools that enable efficient access and
management of different types of information.
One of the simpler tasks in bioinformatics concerns the creation and maintenance of
databases of biological information. Nucleic acid sequences (and the protein sequences
derived from them) make up the majority of such databases. While the storage and
organization of millions of nucleotides is far from trivial, designing a database and
developing an interface whereby researchers can both access existing information and
submit new entries is only the beginning.
The most pressing tasks in bioinformatics involve the analysis of sequence information.
Computational Biology is the name given to this process, and it involves the following:
ORF prediction and gene identification (see handout for eukaryotic gene
organization)
Database searches for potential protein function or homologues
Protein structure prediction and multiple sequence alignment (conserved regions)
Analysis of potential gene regulatory elements
Gene knockout or inhibition (RNA interference) for phenotypic analysis
Overview of Sequence Analysis (see handout)
While most biological databases contain nucleotide and protein sequence information,
there are also databases which include taxonomic information such as the structural and
biochemical characteristics of organisms. The power and ease of using sequence
information has, however, made it the method of choice in modern analysis.
In the last three decades, contributions from the fields of biology and chemistry have
facilitated an increase in the speed of sequencing genes and proteins. The advent of
cloning technology allowed foreign DNA sequences to be easily introduced into bacteria.
In this way, rapid mass production of particular DNA sequences, a necessary prelude to
sequence determination, became possible. Oligonucleotide synthesis provided researchers
with the ability to construct short fragments of DNA with sequences of their own
choosing. These oligonucleotides could then be used in probing vast libraries of DNA to
extract genes containing that sequence. Alternatively, these DNA fragments could also be
used in polymerase chain reactions to amplify existing DNA sequences or to modify
these sequences. With these techniques in place, progress in biological research increased
exponentially.
For researchers to benefit from all this information, however, two additional things were
required: 1) ready access to the collected pool of sequence information and 2) a way to
extract from this pool only those sequences of interest to a given researcher. Simply
collecting, by hand, all necessary sequence information of interest to a given project from
published journal articles quickly became a formidable task. After collection, the
organization and analysis of this data still remained. It could take weeks to months for a
researcher to search sequences by hand in order to find related genes or proteins.
Computer technology has provided the obvious solution to this problem. Not only can
computers be used to store and organize sequence information into databases, but they
can also be used to analyze sequence data rapidly. The evolution of computing power and
storage capacity has, so far, been able to outpace the increase in sequence information
being created. Theoretical scientists have derived new and sophisticated algorithms
which allow sequences to be readily compared using probability theory. These
comparisons become the basis for determining gene function, developing phylogenetic
relationships and simulating protein models. The physical linking of a vast array of
computers in the 1970's provided a few biologists with ready access to the expanding
pool of sequence information. This web of connections, now known as the Internet, has
evolved and expanded so that nearly everyone has access to this information and the tools
necessary to analyze it.
Searching databases to identify sequences and predicting functions or properties of
predicted proteins
Searching by keyword, accession #, etc.
Searching for homologous sequences
o see BLAST at the NCBI
BLAST (Basic Local Alignment Search Tool) is a set of similarity
search programs designed to explore all of the available sequence
databases regardless of whether the query is protein or DNA.
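To make "local alignment" concrete, here is a minimal Python sketch that finds the best-scoring ungapped stretch shared by two sequences by checking every diagonal exhaustively. It is not the BLAST algorithm, which uses a much faster word-seeding heuristic; the function name and the +1/-1 scoring values are assumptions chosen for illustration.

```python
def best_ungapped_local_alignment(query, subject, match=1, mismatch=-1):
    """Return (score, query_start, subject_start, length) of the best
    ungapped local alignment, found by brute force over all diagonals.

    Illustrative only: the scoring values are assumed, and real search
    tools avoid this exhaustive scan.
    """
    best = (0, 0, 0, 0)
    # A diagonal is a fixed offset between subject and query positions.
    for offset in range(-(len(query) - 1), len(subject)):
        q = max(0, -offset)
        s = max(0, offset)
        running = 0
        seg_q, seg_s = q, s
        while q < len(query) and s < len(subject):
            if running == 0:
                # A new candidate segment starts here.
                seg_q, seg_s = q, s
            running += match if query[q] == subject[s] else mismatch
            if running <= 0:
                running = 0  # drop segments that are no better than nothing
            elif running > best[0]:
                best = (running, seg_q, seg_s, q - seg_q + 1)
            q += 1
            s += 1
    return best


print(best_ungapped_local_alignment("GATTACA", "CCGATTACATT"))
```

The scan costs time proportional to the product of the two sequence lengths, which is why database-scale searches rely on heuristics instead.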
Luckily, in agreement with evolutionary principles, scientific research to date has shown
that all genes share common elements. For many genetic elements, it has been possible to
construct consensus sequences, those sequences best representing the norm for a given
class of organisms (e.g., bacteria, eukaryotes). Common genetic elements include
promoters, enhancers, polyadenylation signal sequences and protein binding sites. These
elements have also been subdivided into subelements.
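As a rough illustration of how a consensus sequence can be derived, the Python sketch below takes the most common base in each column of a set of pre-aligned sites. The example sites are made up for illustration (variants of the bacterial -10 promoter box, whose consensus is TATAAT), and reporting ties as "N" is an assumed convention.

```python
from collections import Counter


def consensus(aligned_sequences):
    """Return the most common residue in each column of aligned sequences.

    Assumes the sequences are already aligned and of equal length.
    A column with a tie for most common residue is reported as "N"
    (an assumed convention for this sketch).
    """
    lengths = {len(seq) for seq in aligned_sequences}
    if len(lengths) != 1:
        raise ValueError("sequences must be aligned to the same length")
    result = []
    for column in zip(*aligned_sequences):
        counts = Counter(column)
        top = max(counts.values())
        winners = [base for base, n in counts.items() if n == top]
        result.append(winners[0] if len(winners) == 1 else "N")
    return "".join(result)


# Hypothetical aligned promoter sites, invented for this example.
sites = ["TATAAT", "TATGAT", "TACAAT", "TATACT", "GATAAT"]
print(consensus(sites))
```

A plain majority vote discards how strongly each position is conserved; a table of per-column base frequencies retains that information.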
Genetic elements share common sequences, and it is this fact that allows mathematical
algorithms to be applied to the analysis of sequence data. A computer program for
finding genes will contain at least the following elements.
see the following at the NCBI
BLAST
What is Similarity Searching?
BLAST Information and Tutorial Links
References for Database Searching
Sample Sequences with some Instruction for Searching the NCBI Database can be
found here
Algorithms for pattern recognition. Probability formulae are used to determine if two
sequences are statistically similar.
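One simple way to ask whether two sequences are more similar than chance would predict is a shuffle (permutation) test, sketched below in Python. This is an illustrative stand-in and not the statistics any particular search program uses; the identity-count score, the function names, and the default of 1000 shuffles are all assumptions.

```python
import random


def identity_score(a, b):
    """Count positions at which two equal-length sequences agree."""
    return sum(x == y for x, y in zip(a, b))


def shuffle_p_value(a, b, trials=1000, seed=0):
    """Estimate how often a shuffled copy of b matches a at least as well
    as the real b does. A small value suggests the similarity is unlikely
    to be due to base composition alone.

    Assumes a and b have the same length; the seed makes runs repeatable.
    """
    rng = random.Random(seed)
    observed = identity_score(a, b)
    letters = list(b)
    hits = 0
    for _ in range(trials):
        rng.shuffle(letters)
        if identity_score(a, letters) >= observed:
            hits += 1
    # Add one to numerator and denominator so the estimate is never zero.
    return (hits + 1) / (trials + 1)


seq = "ACGTTGCAAGCTTAGCCGTA"
print(shuffle_p_value(seq, seq))
```

Shuffling preserves the base composition of the second sequence, so the test measures similarity in the order of the bases and not merely in their frequencies.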
Identification of the location of the transcription start and stop sites. A proper
analysis to locate a genetic locus will usually have already pinpointed at least the
approximate sites of the transcriptional start and stop. Such an analysis is usually
insufficient in determining protein structure. It is the start and end codons for translation
that must be determined with accuracy for prediction of the protein encoded.
Identification of the translation initiation and stop sites. The first translated codon
in a messenger RNA sequence is almost always AUG. While this reduces the number of
candidate codons, the reading frame of the sequence must also be taken into
consideration.
There are six reading frames possible for a given DNA sequence, three on each strand,
that must be considered, unless further information is available. Since genes are usually
transcribed away from their promoters, the definitive location of this element can reduce
the number of possible frames to three. There is no strong consensus among species in
the sequence surrounding translation start codons. Therefore, locating the appropriate
start codon involves finding a frame in which there are no apparent abrupt stop codons.
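The six-frame search described above can be sketched in Python as follows. This is a minimal illustration that assumes an intron-free DNA sequence and the standard stop codons (TAA, TAG, TGA); the function names, the `min_codons` cutoff, and the choice to report coordinates on the scanned strand are assumptions of the sketch.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def open_reading_frames(dna, min_codons=3):
    """List (strand, frame, start, end) for each ATG...stop stretch.

    Scans three frames on each strand. Coordinates are 0-based, end is
    exclusive and includes the stop codon; for the "-" strand they refer
    to the reverse-complemented sequence. min_codons counts the codons
    from ATG up to, but not including, the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            start = None
            for i in range(frame, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first in-frame start codon opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3))
                    start = None  # resume looking for the next ATG
    return orfs


print(open_reading_frames("CCATGAAATTTGGGTAACC"))
```

Raising `min_codons` filters out the short spurious frames; a real gene finder would also weigh signals such as promoters and ribosome binding sites.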
Knowledge of a protein's predicted molecular mass can assist this analysis. Incorrect
reading frames usually predict relatively short peptide sequences. Therefore, it might
seem deceptively simple to ascertain the correct frame. In bacteria, such is frequently the
case. However, eukaryotes add a new obstacle to this process: INTRONS!
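For an intron-free sequence, the molecular mass check mentioned above is simple arithmetic, sketched here in Python. The figure of about 110 daltons per amino acid residue is a common rule-of-thumb average and is an assumption of this sketch, as are the function names.

```python
AVERAGE_RESIDUE_MASS_DA = 110  # rule-of-thumb average; an assumption


def expected_orf_length_nt(protein_mass_kda):
    """Approximate length in nucleotides of the reading frame encoding a
    protein of the given mass, including one stop codon."""
    residues = round(protein_mass_kda * 1000 / AVERAGE_RESIDUE_MASS_DA)
    return (residues + 1) * 3


def predicted_mass_kda(orf_length_nt):
    """Approximate mass of the protein encoded by a reading frame whose
    length includes the stop codon."""
    residues = orf_length_nt // 3 - 1
    return residues * AVERAGE_RESIDUE_MASS_DA / 1000


print(expected_orf_length_nt(50))
print(predicted_mass_kda(1368))
```

A candidate frame whose predicted mass is far below the measured mass of the protein is unlikely to be the correct one.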
Prediction of Protein 3-D Structure. With the completed primary amino acid sequence
in hand, the challenge of modelling the three-dimensional structure of the protein awaits.
This process uses a wide range of data and CPU-intensive computer analysis. Most often,
one is only able to obtain a rough model of the protein, and several conformations of the
protein may exist that are equally probable. The best analyses will utilize data from all
the following sources.
All of this information is used to determine the most probable locations of the atoms of
the protein in space and bond angles. Graphical programs can then use this data to depict
a three-dimensional model of the protein on the two-dimensional computer screen.
Other Bioinformatics Links Discussed or Illustrated in Lecture are Available here
Entrez is a search and retrieval system that integrates information from databases at
NCBI. These databases include nucleotide sequences, protein sequences, macromolecular
structures, whole genomes, and MEDLINE, through PubMed. An Entrez Help Page is
available for the Entrez System. For information on the databases that are included in
Entrez, see the NCBI Database Index Page.
GenBank is the NIH genetic sequence database, an annotated collection of all publicly
available DNA sequences. GenBank (at NCBI), together with the DNA DataBank of
Japan (DDBJ) and the European Molecular Biology Laboratory (EMBL) comprise the
International Nucleotide Sequence Database Collaboration. These three organizations
exchange data on a daily basis. GenBank grows at an exponential rate, with the number of
nucleotide bases doubling approximately every 14 months. Currently, GenBank contains
nearly 3 billion bases from over 47,000 species.
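The doubling figure above implies a simple growth formula, sketched here in Python: the size after t months is the current size times 2 raised to the power t/14. The sketch assumes the 14-month doubling time stays constant, which is an extrapolation and not a claim from the source; the function names are hypothetical.

```python
import math


def projected_bases(current_bases, months_ahead, doubling_months=14):
    """Project database size, assuming a constant doubling time."""
    return current_bases * 2 ** (months_ahead / doubling_months)


def months_until(current_bases, target_bases, doubling_months=14):
    """Months needed to reach a target size at the same doubling time."""
    return math.log2(target_bases / current_bases) * doubling_months


# Starting from roughly 3 billion bases, as quoted above.
print(projected_bases(3e9, 28))
print(months_until(3e9, 24e9))
```

Two doubling periods (28 months) quadruple the total, and three (42 months) multiply it by eight.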
The medical citations available through WWW Entrez include the entirety of the MEDLINE
database, plus citations from journals not indexed by MEDLINE. This set contains
approximately 8.7 million citations. In addition, the full text of many journals' articles
is available through PubMed by means of links to the publishers of those journals.
Many of the PubMed citations in WWW Entrez contain protein or nucleotide sequence
information; these citations are linked to their constituent sequences via Entrez,
permitting you to move easily from a view of the PubMed abstract to one of the sequence(s)
it contains.
In addition, each PubMed citation has been compared to many of the other PubMed citations
in the Entrez database using an algorithm that evaluates the similarity of the text and
MeSH terms found in those citations. The most similar citations, called "neighbors", can
then be viewed through Entrez. This facility permits you to find a large number of
citations that fall into your area of interest once you have found a few relevant
citations, and increases the power of your searching dramatically.
The Protein and Nucleotide databases
The Protein and Nucleotide entries in Entrez have been compiled from a variety of sources,
including GenBank, EMBL, DDBJ, PIR, SWISS-PROT, PRF, and PDB. NCBI has made strong
efforts to cross-reference the sequences in these databases in order to avoid duplication.
Many of these entries are found in PubMed citations in the Entrez database; the PubMed
citations related to any given sequence can be viewed through Entrez, as per above.
The sequences in these databases are compared against one another using the BLAST
algorithm for finding ungapped local alignments. Sequences which seem highly similar to
the one you are viewing can then be retrieved through Entrez, in much the same way as
PubMed citations. This will retrieve most biologically significant similarities, but will
miss a few and will include some chance similarities. Nucleotide neighbors are most
successfully used to build contigs rather than discern biological function.
The MMDB 3D structures database
The NCBI taxonomy database contains the names of all organisms that are
represented in the genetic databases with at least one nucleotide or protein
sequence. There are also direct links to some of the organisms commonly used in
molecular research projects.
The OMIM Database