Bif401 Highlighted Subjective Handouts by BINT - E - HAWA
M.SC ZOOLOGY
BIF401-Bioinformatics-I
These are Highlighted HANDOUTS
MID_TERM Syllabus Highlighted
by BINT_E_HAWA
MUHAMMAD IMRAN
1 Background of bioinformatics
BACKGROUND
Bioinformatics is an interdisciplinary science at the crossroads of biology, mathematics, computer
science, chemistry and physics. With the digitalization of biological information, the door has been
opened wide to the analysis of this information using computer algorithms and software.
We now know that the human genome has roughly 20,000 to 25,000 protein-coding genes, and that these genes code for
thousands of different proteins which perform day-to-day functions in the living cell. Furthermore,
these proteins may take on various post-translational modifications, leading to a very large number
of functionally unique molecules. This presents us with a huge challenge in the identification of genes
and proteins.
EXPERIMENTS IN BIOLOGY
With advances in experimental protocols, we now have several next-generation
instruments and techniques for obtaining digitalized biological information on genes,
proteins and other biomolecules.
DIGITALIZATION OF BIOLOGY
In today’s world, when a biologist performs an experiment in the wet-lab, he or she in fact produces
digital data which is continuously being stored on computer disks. The data may include text,
numbers, symbols or images.
CONCLUSION
The human brain is limited in recalling information from memory. We first have to commit all information
to memory and then recall it. Computers can come to our rescue and overcome these limits on our ability
to memorize and recall, because they can store very large amounts of information, recall it reliably and
process it quickly to produce results.
2 Introduction to bioinformatics
MOTIVATION
Bioinformatics is becoming a popular science for several reasons.
SCOPE OF BIOINFORMATICS
Bioinformatics primarily deals with digitalized biological information as well as data reported from
biology experiments. Computational methods, data processing techniques and algorithms are
employed in addressing the following issues:
Storage of data
Organization of data
Analysis of many experiments
Representation of biological information
ACTIVITIES
In modern biological sciences, bioinformatics is used for activities such as:
CONCLUSION
In Pakistan, the field of biology is undergoing rapid change due to the onset of bioinformatics.
New research and educational programs are being established, opening new doors of
opportunity for future generations.
The need for bioinformatics is rising rapidly as biological data keeps growing and becomes
available online, free of cost.
GenBank illustrates this growth: it held well under a million base pairs when it was launched in 1982, and
it had grown to tens of billions of base pairs by 2002. With this data in our hands, there is an urgent need to interpret
it. For instance, analysis of this data can help us develop an understanding of the
phylogenetic “tree of life”, which consists of:
Bacteria
Archaea
Eucarya
Towards exploring the possible benefits of using bioinformatics, one needs to answer the following
question:
POSSIBLE CONTRIBUTIONS
It can help us organize the large datasets produced by new experimental instruments.
Bioinformatics can help store and process this data as well.
It can provide insights into the meaning of our research results and findings.
Overall, it can help us better understand the paradoxes that define life forms.
CONCLUSION
From gene sequencing to protein sequencing, bioinformatics is providing us with an improved
understanding of the genes, proteins, protein interaction and signaling pathways involved in biological
functioning and disease.
5 Applications of Bioinformatics – I
At first glance, bioinformatics seems to be a very complex and abstract field. How and where can
bioinformatics be applied specifically? How does it improve our fundamental understanding of biological
phenomena? Most importantly, how can its benefits be delivered to society at large?
The answers to these questions are categorized as follows:
GENOMICS
Bioinformatics can help in assembling DNA sequencing data.
It can help in gene finding (markers).
Gene assembly can be performed using bioinformatics tools (nucleotide alignments).
It can help convert gene (DNA) sequence data into the corresponding RNA sequence data.
Also, databases can be generated from such data.
EVOLUTIONARY STUDIES
Evolutionary relationships between different organisms can be derived from data.
Evolutionary distance among species can be computed by using bioinformatics tools.
Phylogenetic trees can be constructed to find relationships between species.
Ancestry can be better understood between several species and organisms.
PROTEOMICS
Bioinformatics can help us in decoding protein sequences.
It can also help us in understanding protein structure.
We can also understand post translational changes in proteins with the help of bioinformatics.
We can better understand the protein-protein interaction in different biological reactions.
It can also help us in generating databases of these sequences and structures.
SYSTEMS BIOLOGY
Bioinformatics can assist us in modelling regulatory mechanisms in gene and protein
networks.
Such models can be analyzed to identify the key regulators in these networks.
Moreover, the models can help evaluate drugs that target these key regulators.
CONCLUSION
Bioinformatics can be applied to the life sciences in many ways. It helps us understand the sequence and
function of biomolecules and their relationships. Recent trends in bioinformatics involve the development
of personalized therapeutics for cancer and diabetes.
6 Applications of Bioinformatics – II
Bioinformatics is applied routinely in many areas, such as genomics, transcriptomics,
proteomics, metabolomics, structural proteomics, drug design, systems biology and the
personalization of medicines.
Beyond these applications, bioinformatics has introduced techniques that allow us to generate
large amounts of biological data and to make use of it. Step by step, the applications of bioinformatics
have grown from the genomic level to the level of entire systems.
SMALL TO BIG
Bioinformatics helps us understand biological systems from small to big, from gene finding to
the prediction of entire systems.
It supports structure determination and the modelling of many biological systems so that we can
understand them better.
It has helped us understand protein-protein interactions in many
biological systems.
It shows how biological processes are interconnected
and how they affect each other.
We are now able to model molecules and genomes at the level of the cell.
Signaling pathways have become much easier to study because of bioinformatics.
The morphology of tissues can now be understood by building models with the help of
bioinformatics tools.
CONCLUSION
Bioinformatics does not just collect, analyze and store data. It processes the data rigorously and
helps validate our hypotheses. In the near future it may help us anticipate which diseases are
likely to emerge and how to tackle them with personalized medicine.
7 Frontiers in Bioinformatics – I
INTRODUCTION
Bioinformatics is a new and emerging field of science with vast opportunities. Innovation in
its tools keeps increasing the scale of biological data that can be handled, but many challenges in the
life sciences remain unsolved, and bioinformatics is contributing new and innovative ideas to address them.
FRONTIER IN GENOMICS
We are now able to sequence whole genomes using next-generation
sequencing (NGS), supported by bioinformatics tools.
We are able to store and analyze massive amounts of biological data (terabyte-sized
files).
We can handle and process these large datasets with relative ease.
Whole genomes can be assembled from sequence reads, and flaws in the assembly can be identified.
FRONTIER IN TRANSCRIPTOMICS
In transcriptomics we are now able to investigate questions that are still unknown or under discussion.
The role of RNA in making proteins, and its dynamics, can now be understood more easily.
The interactions of RNA molecules can be studied using simple models.
FRONTIER IN PROTEOMICS
Bioinformatics is a science full of challenges and opportunities, and it is bringing about a revolution in
biology and in everyday life.
8 Frontiers in Bioinformatics – II
Frontiers in bioinformatics include:
FRONTIER IN PROTEIN STRUCTURE
Bioinformatics helps us understand how proteins fold and how they are processed. It also
helps us understand how proteins interact with each other and how a drug can inhibit or stimulate a
protein.
FRONTIER IN SYSTEM BIOLOGY
It helps us understand a single cell as a whole system: how the organelles, genes, proteins
and metabolites in that cell are interconnected in a single unified system. Bioinformatics also shows us
how such models can be applied to real-time situations.
FRONTIER IN PERSONALIZED MEDICINE
Personalizing medicine to cure a disease precisely is an important goal for this century and for coming
generations. Not every medicine works the same way for every patient, and some affect certain patients badly.
With the help of bioinformatics we are now able to personalize some medicines for some
diseases. Bioinformatics also helps us evaluate these medicines.
CONCLUSION
The 21st century can be called the century of bioinformatics. It will enable us to treat
many diseases more effectively by personalizing drugs for the patient.
Sequences and operations such as alignment and comparison will be covered along with
phylogenetic and RNA structure modelling. Next up we will delve into protein sequences and
structures!
10 Overview of Course Contents – II
Summary
• Protein sequence and structure topics will be dealt with in these modules
• The next set of modules is about homology modelling and systems biology topics!
11 Overview of Course Contents – III
Conclusion
• These contents will give you an initial exposure to the variety of topics in bioinformatics
• After covering these topics, you should have a basic conceptual foundation for further
studies in bioinformatics
INTRODUCTION
All living things are composed of cells, which raises the question: how are cells
made? DNA holds the blueprint for building cells, including the information for producing the
cell’s proteins, which in turn make its carbohydrates and other molecules.
The flow of this information from DNA to protein is described by the “Central Dogma”:
DNA → RNA → Protein.
Proteins are then used in constructing the cell.
DNA
Figure 0.1 DNA Double helix
The DNA molecule is a double helix made of two strands of nucleotides. Each nucleotide consists of a
sugar, a phosphate group and a base, and the bases on opposite strands are held together in base pairs by hydrogen
bonds.
DNA and RNA share the same bases except one: where DNA has T (Thymine), RNA has U
(Uracil).
DNA passes its information to the cell via mRNA, which specifies the order of amino acids according to the coded
information. The amino acids are joined into a protein, and proteins are used to build the cell.
CONCLUSION
According to the central dogma, DNA codes the information for RNA, and RNA directs the making of protein.
Proteins, together with the organelles, make up cells and their systems.
13 Transcription
All cells are made of carbohydrates and proteins. DNA codes the information
from which first RNA and then protein are made.
DNA           --Transcription-->   RNA                   --Translation-->   Protein
Information                        Copy of Information                      Execution of Information
The mechanism above explains the process in a very simple way. DNA encodes the
information, which is copied into mRNA; this step is known as transcription. The mRNA then carries the
information to where it is executed: amino acids are joined to each other in the order specified by the DNA code,
and a protein is formed. This step is known as translation.
A DNA molecule contains only four bases (A, T, C and G), which are repeated thousands of
times. Adenine “A” pairs with Thymine “T”, while Cytosine “C” pairs with Guanine “G”; all
pairings are held together by hydrogen bonds.
Like DNA, RNA contains four bases, but Thymine is replaced with Uracil “U” and RNA is
single stranded.
DNA only stores the coded information for a protein, while RNA helps in making the protein.
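The pairing and transcription rules can be illustrated with a short Python sketch. The function names are hypothetical, and the sketch makes the simplifying assumption that the input is the coding strand, so transcription reduces to replacing T with U.

```python
# Watson-Crick pairing rules for DNA: A pairs with T, C pairs with G.
DNA_COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(dna):
    """Return the base-by-base complement of a DNA strand."""
    return "".join(DNA_COMPLEMENT[base] for base in dna)


def transcribe(coding_strand):
    """Return the mRNA for a coding strand (assumption: T is simply replaced by U)."""
    return coding_strand.replace("T", "U")


dna = "ATGGCC"
print(complement(dna))   # TACCGG
print(transcribe(dna))   # AUGGCC
```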
14 Nucleotides
DNA and RNA molecules are composed of smaller
molecules called nucleotides.
The nucleotide bases are Adenine (A), Cytosine (C), Guanine (G), Thymine (T, in DNA only) and Uracil (U, in RNA only).
DNA is double stranded and RNA is single stranded; the two also differ in
sugar composition.
RNA has ribose sugar and DNA has deoxyribose sugar:
[Figure: ribose (RNA) versus deoxyribose (DNA)]
Adenine and Guanine are collectively called purines, while Cytosine, Uracil and Thymine are called
pyrimidines.
A nucleotide is formed when a phosphate, a nitrogenous base and a sugar come together. If the sugar carries a hydroxyl group (OH) at the 2' position, the molecule is RNA; if
it carries only a hydrogen (H) there, the molecule is DNA, as the figure shows.
CONCLUSION
DNA makes RNA and RNA makes protein. DNA differs from RNA in its
sugar and in one of its bases.
15 Translation
Cells are built of proteins and carbohydrates. Proteins are made by
decoding an mRNA molecule, and this process is called translation.
Translation takes place at the ribosomes of the cell. After reading the information in the mRNA, the ribosome
assembles amino acids drawn from the cytosol, the part of the cytoplasm that is not enclosed within any of
the organelles in the cell.
MECHANISM
At the ribosome, three nucleotides are read at a time from the mRNA. Each set of three nucleotides is called a
codon, and each codon corresponds to a specific amino acid (or to a stop signal).
CONCLUSION
RNA codes for protein: at the ribosome, each codon of three nucleotides codes for a specific amino acid, and
this process is called translation.
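As a minimal sketch of codon-by-codon reading, the Python below translates a short mRNA string. Only a handful of codons from the standard genetic code are included for illustration (the full table has 64 entries), and the function name and the "X" placeholder for unlisted codons are assumptions of this example.

```python
# A small excerpt of the standard genetic code; "*" marks a stop codon.
CODON_TABLE = {
    "AUG": "M",  # methionine (start)
    "UUU": "F",  # phenylalanine
    "GGC": "G",  # glycine
    "GCC": "A",  # alanine
    "AAA": "K",  # lysine
    "UAA": "*", "UAG": "*", "UGA": "*",
}


def translate(mrna):
    """Read the mRNA three nucleotides at a time until a stop codon is reached."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE.get(mrna[i:i + 3], "X")  # "X" = codon not in the excerpt
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("AUGUUUGGCAAAUAA"))  # MFGK
```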
16 Amino Acids
The information in mRNA is decoded at the ribosome in the form of codons, and each codon selects a specific amino
acid. There are 20 different standard amino acids; they are joined into a chain by polymerization, and the chain folds into a
protein structure.
An amino acid contains an amino group (nitrogen and hydrogen), a carboxyl group (carbon and oxygen) and a central carbon
atom that carries a hydrogen and a variable group, R.
When polymerization takes place, a molecule of water is released for each bond formed. If a compound becomes attached to the R group, the
structure of the protein is changed.
The amino acids are joined to each other by peptide bonds, and the resulting chain folds in 3D
to give the protein structure.
18 Using Entrez (GenBank)
GenBank is an online database where researchers can access sequences of DNA, RNA and
proteins.
To find a sequence, we go online to the NCBI GenBank website, which is a public database:
www.ncbi.nlm.nih.gov/genbank
For example, suppose we want to find the sequence for immunoglobulin, the
glycoprotein antibody produced by plasma cells (a type of white blood cell) that acts in immunity.
Sequences can be searched in GenBank by typing any of the following:
Sequence name
ID
Name
Species
Locus
Accession Number
Author
Journal
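The same kind of search can also be expressed as a URL for NCBI's E-utilities `esearch` service. The sketch below only builds the URL and does not contact the server; the helper name `build_search_url`, the choice of the `nucleotide` database and the example query are assumptions made for illustration.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities search service.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_search_url(term, organism=None, db="nucleotide", retmax=5):
    """Build (but do not send) an esearch query, optionally restricted to a species."""
    query = term
    if organism:
        # Entrez field tags such as [Organism] restrict the search to one field.
        query += f" AND {organism}[Organism]"
    return ESEARCH + "?" + urlencode({"db": db, "term": query, "retmax": retmax})


url = build_search_url("immunoglobulin", organism="Homo sapiens")
print(url)
```

Other field tags, such as [Accession], [Author] or [Journal], can be combined in the query string in the same way.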
19 Using Uniprot
UniProt is a public database used to search for protein sequences:
www.Uniprot.org
For example, suppose we want to search for the sequence of ubiquitin, a protein that plays an important
role in the cytosol in recycling proteins. We go online to the website www.Uniprot.org and
the page shown above appears.
We type the name of the protein in the search box and press Enter. The search results
look like this.
20 Comparing Sequences
There are millions of sequences in GenBank and UniProt. What can we learn by comparing them?
By comparing sequences of DNA, RNA and Proteins we can get
For two or more sequences to be identical, they must not only have exactly the same number of
each nucleotide (for DNA/RNA) or amino acid (for proteins); these must also be arranged
in exactly the same order!
CONCLUSION
An exact match means the sequences agree in both the number and the order of their
nucleotides or amino acids. Most comparisons instead find that the majority of positions match, not all of them.
While the genome of each created kind is unique, many animal kinds share some specific types of
genes that are generally similar in DNA sequence. When comparing DNA sequences between animal
taxa, evolutionary scientists often hand-select the genes that are commonly shared and more similar
(conserved), while giving less attention to categories of DNA sequence that are dissimilar. One result
of this approach is that comparing the more conserved sequences allows the scientists to include
more animal taxa in their analysis, giving a broader data set so they can propose a larger
evolutionary tree.
Although these types of genes can be easily aligned and compared, the overall approach is biased
towards evolution. It also avoids the majority of genes and sequences that would give a better
understanding of DNA similarity concepts.
http://www.icr.org/article/common-dna-sequences-evidence-evolution/
1. Global
2. Local
In global pairwise alignment we introduce gaps along the whole of both sequences to find the overall
match. In local pairwise alignment we find the regions where the nucleotides
match best; it is used to find local similarity or to locate a mutation.
Gaps are introduced to stand in for nucleotides that are missing from one sequence relative to the other.
Pairwise Sequence Alignment is used to identify regions of similarity that may indicate functional,
structural and/or evolutionary relationships between two biological sequences (protein or nucleic
acid).
23 Pairwise Sequence Alignment-II
Pairwise alignment helps us find similarities and differences. There are three ways in
which sequences can differ from each other.
Salient Points
• Sliding the sequences past each other
• Aim is to maximize matches
• Gaps (‘.’) could be inserted to account for insertions and deletions
• Gaps (‘.’) may carry a penalty for reducing scores from unreasonable alignments
• Without a gap penalty an exact match and match with gaps will get equal scores
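The effect of the gap penalty can be checked with a small scoring function. This is a sketch under assumed scores (match +1, mismatch -1, gap -2); the function name is hypothetical and the two inputs must already be aligned to equal length, with ‘.’ marking a gap.

```python
def score_alignment(a, b, match=1, mismatch=-1, gap=-2):
    """Score two already-aligned sequences of equal length ('.' marks a gap)."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    for x, y in zip(a, b):
        if x == "." or y == ".":
            score += gap          # a gap in either sequence is penalized
        elif x == y:
            score += match
        else:
            score += mismatch
    return score


print(score_alignment("ACGA", "ACGA"))           # 4: exact match
print(score_alignment("AC.GA", "ACCGA"))         # 2: four matches minus one gap
print(score_alignment("AC.GA", "ACCGA", gap=0))  # 4: no gap penalty, same as exact match
```

With `gap=0` the gapped alignment scores the same as the exact match, which is why a penalty is needed to discourage unreasonable alignments.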
Global alignment - maximizes the number of matches between the query and source sequences
along the entire length of both the sequences.
Local alignment - gives the highest scoring local match between the query and source sequences.
Optimal alignment - one that exhibits the most correspondences between the query and the source
sequences. It is the alignment with the highest score. Biologically meaningful?
Substitutions ACGA
AGGA
Insertions ACGA
ACCGA
Deletions ACGA
AGA
Applying any of the above changes to a sequence increases or
decreases the number of matches and mismatches between the two sequences being compared.
Local and global alignment give different results for the same pair of sequences.
Of the three changes, a substitution directly adds a mismatch, while insertions and deletions are aligned using gaps.
25 Dot Plots
To visualize a sequence alignment we use a method called the dot plot. In this method one
sequence is written along the top and the other down the left side of a dot matrix grid.
A C G C G
Wherever a nucleotide or amino acid in the row matches the one in the column, a dot is placed at that grid position.
Matching regions show up as dots along a diagonal, while isolated dots off the diagonal mark chance matches outside the
alignment.
Figure 0.7 Dot plot diagonal pattern
Dots along a diagonal represent aligned stretches; breaks in the diagonal show where the sequences differ.
In a dot plot the matching nucleotides line up diagonally and represent the sequence
alignment.
When we compare human cytochrome with tuna fish cytochrome, the diagonal
alignment we find is shown in the diagram below.
BENEFITS
Dot plots show the global similarity between two sequences and help us visualize
their alignment. Sequence repeats appear as stacks of parallel diagonals in the plot.
CONCLUSION
Dot plots help us see at a glance where two sequences match and where they differ.
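A dot plot is simple to compute. The sketch below builds the grid for two short sequences and prints it as text; the function names are hypothetical, `*` stands for a dot and `-` for an empty cell.

```python
def dot_plot(top, left):
    """Return a grid with 1 where the residue in `left` (row) equals the one in `top` (column)."""
    return [[1 if a == b else 0 for b in top] for a in left]


def render(top, left):
    """Draw the dot plot as text: '*' marks a match, '-' an empty cell."""
    rows = ["  " + " ".join(top)]
    for residue, row in zip(left, dot_plot(top, left)):
        rows.append(residue + " " + " ".join("*" if cell else "-" for cell in row))
    return "\n".join(rows)


print(render("ACGCG", "ACGCG"))
```

Comparing a sequence with itself gives an unbroken main diagonal; the extra off-diagonal marks come from the repeated C and G residues.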
1. Identity
2. Similarity
Identity is the number of nucleotides or amino acids that match exactly when two
biological sequences are aligned, usually expressed as a percentage of the shorter sequence length.
For example:
1: CATGCTT
2: CATGC
Number of matches = 5
Smaller length = 5
Length of sequence (1) = 7
Length of sequence (2) = 5
Identity = (5 / 5) x 100 = 100%
Similarity is a measure of the resemblance between two different sequences, calculated by an alignment
approach.
In both identity and similarity, gaps (‘.’) are not counted.
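The identity calculation for the example can be written as a few lines of Python. This is a sketch that assumes the two sequences are compared position by position from their first residue, without gaps, and that identity is reported relative to the shorter sequence; the function name is hypothetical.

```python
def percent_identity(a, b):
    """Percentage of exactly matching positions, relative to the shorter sequence."""
    shorter = min(len(a), len(b))
    # zip stops at the end of the shorter sequence, so only overlapping positions are compared.
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / shorter


print(percent_identity("CATGCTT", "CATGC"))  # 100.0
print(percent_identity("CATGCTT", "CTTGC"))  # 80.0
```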
1. Global Alignment
2. Local Alignment
In local alignment we compare a portion of one sequence with a portion of the other.
In global alignment we compare both sequences completely, from end to end.
Local alignment therefore focuses on the highly matching portions of the sequences.
DOMAIN SHUFFLING
Aligned portions of a sequence can occur in varying orders in different sequences, and this is called
domain shuffling.
ADVANTAGES
30 Aligning In-dels
Insertion means addition of amino acids in protein sequence and addition of nucleotides in DNA
sequences.
And deletion means removal of amino acids from protein sequence and removal of nucleotides from
DNA or RNA sequences.
ALIGNING INSERTION
For example we have following two sequences
1: A C T G A C T G 1: A C T G A C T G
2: A C G A C T G 2: A C G A C T G
To align sequence 2 against sequence 1 we first insert a gap where a nucleotide is missing. The same is done for a
deletion: a gap is placed where the nucleotide was deleted from the sequence. Inserting a gap lowers the
alignment score; this negative score is called the gap penalty.
1: A C T G A C T G
2: A C . G A C T
1: A C T G A C T
2: A C G G A C T
1: A C T G A C
2: A C G G A C
CONCLUSION
Insertions and deletions are aligned using gaps, which carry a gap penalty. Mutations are scored with substitution
penalties, and the size of the penalty depends upon the substitution.
Conclusion
To find matches between the nucleotides or amino acids of two sequences we use the dot plot method. But a dot
plot cannot score the insertions, deletions and gaps in the sequences.
To deal with this situation we modify the dot plot.
We represent matching nucleotides with +1, while gaps, substitutions, insertions and deletions
are represented as -1 in the dot plot.
Dynamic programming is an algorithmic technique used commonly in sequence analysis. Dynamic
programming is used when recursion could be used but would be inefficient because it would
repeatedly solve the same sub problems.
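The idea can be sketched with a recursive alignment score that caches its sub-problems. The scores (match +1, mismatch -1, gap -1) and the function names are assumptions of this example; `functools.lru_cache` stores each sub-problem so that it is solved only once.

```python
from functools import lru_cache


def best_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best global alignment score of a and b, plus the number of sub-problems solved."""
    solved = 0

    @lru_cache(maxsize=None)
    def solve(i, j):
        # Best score for aligning the first i residues of a with the first j of b.
        nonlocal solved
        solved += 1  # counted only on a cache miss, i.e. once per distinct sub-problem
        if i == 0:
            return j * gap
        if j == 0:
            return i * gap
        diagonal = solve(i - 1, j - 1) + (match if a[i - 1] == b[j - 1] else mismatch)
        return max(diagonal, solve(i - 1, j) + gap, solve(i, j - 1) + gap)

    return solve(len(a), len(b)), solved


score, solved = best_score("ACGA", "AGGA")
print(score, solved)  # 2 25
```

Without the cache the same (i, j) pairs would be recomputed many times; with it, only (4 + 1) x (4 + 1) = 25 sub-problems are solved for two sequences of length 4.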
An algorithm's cost is measured by the number of steps involved in the comparison. For example, if we compare
two sequences of length n, the matrix has n^2 cells to fill.
Figure 0.11 -1 represents deletions, insertions and gaps, while +1 represents matching nucleotides or
amino acids
Comparing sequences one alignment at a time is a costly and time-consuming process; we minimize the cost with
the help of an algorithm.
Conclusion
• Alignments are represented by diagonals in the dot matrix plot
Figure 11 Needleman-Wunsch pairwise sequence alignment
Alignments are represented by unbroken diagonals in the dot matrix plot. In this way we can create
numerous combinations.
If the sequences are long there will be many diagonal alignments, and in the end we select the
best alignment from all the combinations. For this we use the Needleman-Wunsch algorithm. In the Needleman-Wunsch
algorithm the top-left cell (0, 0) is set to 0, and the first row and first column are filled in before the rest of the matrix.
Moving left to right and top to bottom, the best (highest-scoring) element is selected at each position.
Figure 15 The maximum-score element is selected from the comparison of all three sides
• Top, Left and Diagonal elements are considered to calculate an element in the
matrix
Conclusion
• We are trying to use the best score from Top, Left and Diagonal elements
• This strategy will be very useful in selecting the best combination of alignments!
The top, left and diagonal elements are considered when calculating an element in the matrix. The match, mismatch
and gap penalty scores are applied from all three sides: left (left to right), top (top to bottom) and diagonal.
For example:
DNA, RNA and protein sequences can all be aligned using the Needleman-Wunsch algorithm.
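The matrix-filling step can be sketched as follows. The scores (match +1, mismatch -1, gap -1) and the function name are assumptions of this example, and the first row and column are initialised with cumulative gap penalties, which is the usual choice for a global alignment.

```python
def nw_matrix(a, b, match=1, mismatch=-1, gap=-1):
    """Fill the Needleman-Wunsch score matrix for sequences a (rows) and b (columns)."""
    rows, cols = len(a) + 1, len(b) + 1
    F = [[0] * cols for _ in range(rows)]
    # First column and first row: aligning a prefix against nothing costs one gap per residue.
    for i in range(1, rows):
        F[i][0] = i * gap
    for j in range(1, cols):
        F[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = F[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            top = F[i - 1][j] + gap    # gap in b
            left = F[i][j - 1] + gap   # gap in a
            F[i][j] = max(diagonal, top, left)
    return F


F = nw_matrix("ACGA", "AGGA")
for row in F:
    print(row)
print("global score:", F[-1][-1])  # 2
```

The bottom-right cell holds the score of the best global alignment, here three matches and one mismatch.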
39 Backtracking Alignments
Background
• Top, Left and Diagonal elements are considered to calculate an element in the
matrix
• Match, mismatch and gap penalty is used to compute score from each position!
Introduction
• How can we use the Needleman-Wunsch algorithm to find the optimal alignment?
• The solution lies in a method called the “Traceback”
Conclusion
• After completely calculating the matrix, we need to do a “traceback”
• Traceback is such that we start from the bottom right and select the element from
top, left or diagonal which led to the starting element
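A traceback of this kind can be sketched in Python as below. The block fills the matrix and then walks back from the bottom-right cell; the scores (match +2, mismatch -1, gap -2), the cumulative gap penalties in the first row and column, the tie-breaking order (diagonal, then top, then left) and the function name are all assumptions of this example.

```python
def needleman_wunsch(a, b, match=2, mismatch=-1, gap=-2):
    """Global alignment of a and b; returns (aligned a, aligned b, score)."""
    rows, cols = len(a) + 1, len(b) + 1
    F = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        F[i][0] = i * gap
    for j in range(1, cols):
        F[0][j] = j * gap

    def s(x, y):
        return match if x == y else mismatch

    for i in range(1, rows):
        for j in range(1, cols):
            F[i][j] = max(F[i - 1][j - 1] + s(a[i - 1], b[j - 1]),
                          F[i - 1][j] + gap,
                          F[i][j - 1] + gap)

    # Traceback: start at the bottom right and follow the cell that produced each score.
    i, j = len(a), len(b)
    out_a, out_b = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s(a[i - 1], b[j - 1]):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("."); i -= 1   # gap in b
        else:
            out_a.append("."); out_b.append(b[j - 1]); j -= 1   # gap in a
    return "".join(reversed(out_a)), "".join(reversed(out_b)), F[-1][-1]


top, bottom, score = needleman_wunsch("ACTGACTG", "ACGACTG")
print(top)     # ACTGACTG
print(bottom)  # AC.GACTG
print(score)   # 12
```

Because the walk always begins at the bottom-right cell and ends at the top-left, the result covers both sequences from end to end.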
• But, is it the local or global alignment?
The difference?
• The Needleman-Wunsch algorithm traceback begins from the bottom right element
• Can we start from any position within the alignment matrix to avoid gap regions?
Conclusion
• Traceback strategy allows us to differentiate between a local and a global
alignment
41 Overlap Matches
The dot plot and Needleman-Wunsch are related methods with a small difference. The dot plot helps us
find matching residues in two sequences, while Needleman-Wunsch finds the global
alignment.
If the sequences have leading or trailing regions that do not match anything in the other sequence, we can still use
a global-style alignment rather than a local one, but one that does not penalize the leading or trailing ends. This is an overlap match.
Figure 21 Leading and trailing edge mismatches versus global alignment by gap insertion (stretching)
of sequences
In an overlap match the traceback can start from any cell on the edge of the matrix rather than only from the
bottom-right corner. Such tracebacks help us find the overlaps between aligned sequences.
Figure 23 Traceback in amino acid sequence alignment
Scoring strategy is:
Match = +2, Mismatch = -1, Gap = -2
Sequences are:
Conclusion
• Local Alignments can identify exons which are present in both sequences
• Proteins of different kind and of different species often exhibit local similarities
• Hence, local similarities may indicate ”functional subunits”
Genes contain exons and introns. Exons are the regions retained in the mature transcript; most are expressed as protein,
and they remain more conserved because of their role in making functional proteins.
Introns are noncoding regions that are removed by splicing, and they accumulate mutations more readily
than coding regions. This means a high degree of alignment can be found between two exons.
Local alignment works on small segments of the sequences, which lets us find exons
and, through them, “functional subunits”.
However, the term exon is often misused to refer only to coding sequences for the final protein. This
is incorrect, since many noncoding exons are known in human genes.
(Zhang 1998)
Zhang, M. Q. (1998). "Statistical features of human exons and their flanking regions." Human
molecular genetics 7(5): 919-932.
In global alignment we compare the sequences from end to end, but in local alignment we compare
the sequences in segments.
For global pairwise alignment we use the Needleman-Wunsch algorithm, while for local pairwise alignment
we use the Smith-Waterman algorithm. Its recurrence relation is:
\[
C(i,j) = \max \begin{cases}
C(i-1,\,j-1) + \mathrm{score}(i,j) \\
C(i-1,\,j) + g \\
C(i,\,j-1) + g \\
0
\end{cases}
\]
where score(i, j) is the match or mismatch score for position (i, j) and g is the gap penalty (a negative number).
In the matrix, the top row and the first column are filled with zeros.
Figure 25 The top row and first column are filled with zeros in the Smith-Waterman algorithm
Local alignments can be extracted by starting from a high score and tracing back until reaching 0.
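A local alignment of this kind can be sketched in Python as follows. The scores (match +2, mismatch -1, gap -2), the test sequences and the function name are assumptions of this example; the zero in the `max` keeps scores from going negative, and the traceback runs from the highest-scoring cell until a 0 is reached.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment of a and b; returns (aligned a, aligned b, score)."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]  # top row and first column stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diagonal, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Traceback from the highest score until a cell with score 0 is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("."); i -= 1   # gap in b
        else:
            out_a.append("."); out_b.append(b[j - 1]); j -= 1   # gap in a
    return "".join(reversed(out_a)), "".join(reversed(out_b)), best


print(smith_waterman("TTTACGTCCC", "GGACGTAA"))  # ('ACGT', 'ACGT', 8)
```

Only the shared segment ACGT is reported; the unrelated flanking residues are left out of the alignment.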
46 Repeated Alignments
We can find the best local alignment by using the Smith-Waterman algorithm.
By making some changes to the traceback strategy we can also find repeated sequences.
We use a threshold score “T” for matches, which filters out low-scoring local alignments. Traceback can
then help us find multiple aligned regions.
This threshold scoring method, with some modifications to the Smith-Waterman algorithm, can help us find
many matching stretches of amino acids or DNA.
A slight modification of the Smith-Waterman model can help us find exons as well as the functional units in
any sequence. A match should end at the threshold score, or we should keep track of the maximum
score in the sequence.
Figure 28 Traceback from different sides to find the maximum or threshold (T) score.
Traceback starts from the last element of the row, moves up to the top element of that row and
then jumps to the highest score in the column. This traceback is done twice, and it ends where
it reaches the cell (0, 0).
48 Review of Traceback Strategies for Global, Overlap, Local and Repeated Alignments
Background
• We have seen how biological sequences can be searched and compared using various
recurrence relations and traceback strategies
• Let’s review them!
Conclusion
• Slight modification in the recurrence relation and changes to the traceback strategy can
help compare sequences in a variety of ways
• Dynamic programming solves this problem quickly!
Optimal Alignments
Best Alignment
The scoring scheme used for sequence matches plays a crucial role in producing an optimal alignment. An
optimal alignment should be:
CONCLUSION
Statistically, we can better align any protein or DNA sequence: optimal gaps, penalties, insertions and
deletions can be computed on a statistical basis.
The scores of matches and mismatches are both taken into account during sequence alignment.
For example:
The matrix has both positive and negative scores; matches and mismatches are therefore all
considered along the diagonal pattern.
If we build such scoring matrices from observed matches and mismatches, we can score sequences in
a way that reflects real life.
51 Scoring matrices
Scoring matrices are used when aligning biological sequences. An amino acid or nucleotide is more easily
substituted by another that has a similar chemical nature.
Because amino acids are substituted with many different probabilities, we need flexible scoring, and
scoring matrices provide this flexible scoring during alignment.
To build scoring matrices we analyze which amino acids and nucleotides are substituted in
related gene and protein sequences.
Scoring matrices hold both positive and negative values: positive values for matches and likely substitutions, and
negative values for unlikely mismatches.
When we compare sequences, matches and mismatches occur with different frequencies.
For example.
Based on these frequencies we score the matches and mismatches in a sequence alignment.
53 PAM Matrices
Scoring matrices are a very useful way to score the matches and
mismatches in a sequence alignment.
There are two commonly used types of scoring matrices.
PAM
BLOSUM
PAM means “Point Accepted Mutation”
A point accepted mutation is the substitution of one amino acid in a sequence by another such that
the protein's function remains conserved.
PAM UNIT
One PAM unit is the evolutionary time during which 1% of the amino acids undergo an accepted mutation. If
two sequences diverge by 100 PAM units, it does not mean that they differ at every
position, because the same position can mutate more than once.
Then, PAM1=pii=
PAM ‘n’ = (PAM1)^n, that is, the PAM1 matrix multiplied by itself n times
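The relation between PAM1 and PAM ‘n’ can be illustrated by raising a small matrix to a power. The 2 x 2 matrix below is a made-up toy for a hypothetical two-letter alphabet, not real PAM data (a real PAM1 matrix is 20 x 20); each row gives the probability that a residue stays the same or changes over one PAM unit.

```python
def mat_mul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def mat_pow(M, n):
    """Multiply M by itself n times, starting from the identity matrix."""
    size = len(M)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, M)
    return result


# Toy "PAM1": each residue has a 1% chance of changing per PAM unit (hypothetical values).
PAM1 = [[0.99, 0.01],
        [0.01, 0.99]]

PAM100 = mat_pow(PAM1, 100)
print(round(PAM100[0][0], 2))  # probability a residue is unchanged after 100 PAM units
```

In this toy model more than half of the positions are still unchanged after 100 PAM units, because the same position can mutate more than once, including back to its original residue.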
54 BLOSUM Matrices
BLOSUM matrices can be used to align protein sequences. They were first proposed
in 1992 by Henikoff and Henikoff.
BLOSUM stands for BLOcks SUbstitution Matrix. The matrices are derived from blocks of aligned sequences that
contain no gaps, although they do contain mismatches.
Figure 0.14 Sequence of amino acids which have mismatches but no gaps.
Figure 0.15 Formula for computation of BLOSUM matrices.
Typically used matrices are BLOSUM62 and PAM120. In PAMx, a larger x detects more divergent
sequences.
QVKLFTPLHDKSDHGKYH
MQVKIFTPLHDKS-HGKSH
MQVHLY -PLHDKS-TGKSH
MQVHLF -PLHDKSDTGKSH
Figure 0.16 Multiple sequence alignment
For pair wise alignment we use Dynamic programming but for multiple alignment it would be
very expensive computationally. So solution for this is progressive alignment.
MUHAMMAD IMRAN 42
MSA helps compare several sequences by aligning them. MSA can extract consensus sequences
from several aligned sequences. It can also characterize protein families based on homologous regions.
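Extracting a consensus from aligned sequences can be sketched as below. The three rows are the
equal-length rows adapted from Figure 0.16; the majority-vote rule and the function name are our own
assumptions (real tools may use thresholds or ambiguity codes instead).

```python
from collections import Counter


def consensus(aligned):
    """Return the most common symbol in each column of an alignment.

    Ties are resolved in favour of the symbol seen first in the column.
    """
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*aligned))


rows = [
    "MQVKIFTPLHDKS-HGKSH",
    "MQVHLY-PLHDKS-TGKSH",
    "MQVHLF-PLHDKSDTGKSH",
]
print(consensus(rows))
```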
APPLICATION OF MSA
METHODOLOGY
Pairwise alignment is the alignment of two sequences
CONCLUSION
MSA can help align multiple sequences. Progressive alignment can help perform MSA. Sequences
with >80% similarity need to be removed.
Figure 0.19 CLUSTAL – Online tool
http://www.ebi.ac.uk/Tools/msa/clustalo/
SHORTCOMING OF THIS APPROACH
Dependence upon initial alignments
58 MSA Example
MSA involves progressive alignment of sequences. Doing so many progressive alignments can be
slow.
For example:
Figure 0.23 Progressive alignment following a guide tree
Figure 0.24 Alignment results
MSA can be better performed using clustering strategies followed by alignment of the alignments
later. CLUSTAL is a free online tool that does all of this for us!
59 CLUSTAL
MSA involves progressive alignment of sequences. Doing so many progressive alignments can be
slow. CLUSTALW is an online tool to perform MSA.
It was developed by the European Molecular Biology Laboratory and the European Bioinformatics
Institute. It performs alignment in two modes:
• slow/accurate
• fast/approximate
SCOPE
http://www.genome.jp/tools/clustalw
60 Introduction to BLAST – I
BLAST (“Basic Local Alignment Search Tool”) was developed in 1990 at the National Center for
Biotechnology Information (NCBI), USA. It searches databases for query protein and nucleotide
sequences, and also searches for translational products. It is available online at
www.blast.ncbi.nlm.nih.gov/Blast.cgi
BLAST can be used to search for local alignments of protein and nucleotide sequences. It can perform
searches across species and organisms.
61 Introduction to BLAST – II
The Smith-Waterman algorithm aligns sequences exactly using dynamic programming. BLAST does it in an
approximate way. Hence, BLAST is faster but does not ensure an optimal alignment. BLAST provides
approximate sequence matching. The input to BLAST is a FASTA-formatted sequence and a set of search
parameters.
OUTPUT OF BLAST
Results are shown in HTML, plain text, and XML formats. A table lists the sequence hits found along
with scores. Users can read this table off and evaluate results
62 BLAST Algorithm
BLAST can search sequence databases and identify unknown sequences by comparing them to the
known sequences. This can help identify the parent organism, function and evolutionary history.
For example:
Query sequence: PQGELV
Make a list of all possible words (length 3 for proteins)
PQG (score 15)
QGE (score 9)
GEL (score 12)
ELV (score 10)
Assign scores from BLOSUM62 and keep the words with score > 11: PQG and GEL
Mutate the words such that the score is still > 11
PQG (score 15) is similar to PEG (score 13)
At the end, we get: PQG, GEL and PEG
Find all database sequences that have at least 2 matches among our 3 words: PQG, GEL and PEG.
Find database hits and extend the alignment (High-scoring Segment Pair).
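The word-list steps above can be sketched in a few lines. The scores are copied from the worked
example as given (they are not recomputed here), and the database sequence MPQGELA is a hypothetical
one made up for the illustration. This is only a sketch of the seeding idea, not the actual BLAST
implementation.

```python
def words(query, k=3):
    """List all overlapping words of length k in the query."""
    return [query[i:i + k] for i in range(len(query) - k + 1)]


# Word scores copied from the worked example above (taken as given).
EXAMPLE_SCORES = {"PQG": 15, "QGE": 9, "GEL": 12, "ELV": 10}


def seed_words(query, scores, threshold=11, k=3):
    """Keep only the words whose score is above the threshold."""
    return [w for w in words(query, k) if scores[w] > threshold]


def count_word_hits(db_sequence, word_list):
    """Count how many of the words occur in a database sequence."""
    return sum(1 for w in word_list if w in db_sequence)


seeds = seed_words("PQGELV", EXAMPLE_SCORES)  # ['PQG', 'GEL']
seeds.append("PEG")                           # mutated word from the example
hits = count_word_hits("MPQGELA", seeds)      # hypothetical database sequence
print(seeds, hits)
```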
63 Types of BLAST
There are two main types of BLAST.
Nucleotides
• blastx:
Compares a nucleotide query sequence against a protein sequence database.
Helps find potential translation products of unknown nucleotide sequences
• tblastn:
Compares a protein query sequence against a nucleotide sequence database
Nucleotide sequence dynamically translated into all reading frames
• tblastx:
Compares the six-frame translated proteins of a nucleotide query sequence against the six-frame
translated proteins of a nucleotide sequence database.
64 Summary of BLAST
Step 1: Obtain a query sequence
Figure 0.28 tabulated search results
65 Introduction to FastA-I
To compare two sequences we use pairwise alignment, and to compare many sequences we use multiple
sequence alignment. For local alignment we use the Smith-Waterman algorithm, and for global
alignment we use the Needleman-Wunsch algorithm.
Both local and global alignment are dynamic programming approaches. When many sequences are compared,
this takes time, so we use BLAST, an approximate local alignment search tool. BLAST compares a large
number of sequences quickly. FASTA took a similar approach.
FASTA was developed in 1988. It performs fast alignment and searches databases for query protein and
nucleotide sequences. It was later improved upon in BLAST.
http://www.ebi.ac.uk/Tools/sss/fasta/
66 Introduction to FastA-II
FASTA – Fast Alignment Algorithm. Classical global and local alignment algorithms are time
consuming. FASTA achieves alignment by using short lengths of exact matches.
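The use of short exact matches can be sketched as a lookup table of k-tuples. The sequences, the word
length k=2 and the function names below are illustrative assumptions; this shows only the first
(exact-match) stage, not the full FASTA program.

```python
from collections import defaultdict


def ktuple_index(seq, k):
    """Map every k-tuple in seq to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def diagonal_hits(query, target, k=2):
    """Count exact k-tuple matches on each diagonal (target_pos - query_pos).

    A diagonal with many hits marks a region where the two sequences
    can be aligned without gaps.
    """
    index = ktuple_index(target, k)
    diagonals = defaultdict(int)
    for q in range(len(query) - k + 1):
        for t in index.get(query[q:q + k], []):
            diagonals[t - q] += 1
    return dict(diagonals)


print(diagonal_hits("AAQIV", "MAAQIA"))  # all hits fall on one diagonal
```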
USES OF FASTA
FASTA relies on aligning subsequences of absolute identity. Input to a FASTA search can be in
FASTA, EMBL, GenBank, PIR, NBRF, PHYLIP or UniProt formats.
OUTPUT OF FASTA
Results are output in visual format along with functional prediction. A table lists the sequence
hits found along with their scores. Users can click on each reported match to look at the details.
Figure 0.33 Input to FASTA: Protein Sequence
STEP2: Rescore the local regions using PAM or BLOSUM matrix
STEP4: Create a gapped alignment in a narrow segment and then perform Smith-Waterman
alignment
68 Types of FASTA
There are six types of FASTA:
• fasts35
Compares unordered peptides to a protein sequence database
• fastm35
Compares ordered peptides (or short DNA sequences) to a protein (DNA) sequence database
• fasta35
Scans a protein or DNA sequence library for similar sequences
• fastx35
Compares a translated DNA sequence (6 ORFs) to a protein sequence database
• tfastx35
Compares a protein sequence to a DNA sequence database (6 ORFs)
• fasty35
Compares a DNA sequence (6 ORFs) to a protein sequence database
FASTA performs quick alignments on biological sequences. Several types of FASTA exist which can
assist in comparing DNA/RNA/Protein sequences with each other.
69 Summary of FASTA
FASTA can quickly search sequence databases when given a query sequence. Multiple types of
FASTA exist which assist in aligning DNA/RNA/Protein sequences.
Figure 0.36 Step 2: Choose a type of FASTA
http://fasta.bioch.virginia.edu/fasta_docs/fasta35.shtml
Figure 0.40 Tabulated data
All molecular information on RNA, DNA and proteins needs to be stored and retrieved.
Sequences are obtained from genome sequencing and mass spectrometry.
Structures are obtained from X-Ray Crystallography, Atomic Force Microscopy and Nuclear Magnetic
Resonance Spectroscopy.
Vast amounts of such data exist. Moreover, this data is rapidly accumulating. Online databases are
formed to store and share this data.
OBJECTIVE
Make biological data available to scientists in computer-readable form
For handling, sharing and analysis of the data
The best way to share is to keep this data on the web
Several sequence, structure and molecular interaction databases exist. These are available online on
the web. Users can freely access and download such data
71 Expasy
It is developed by the Swiss Institute of Bioinformatics (SIB). The website provides access to
databases and tools. Proteomics, genomics, phylogeny, systems biology, population genetics,
transcriptomics etc. can be searched. http://www.expasy.org/
Figure 0.41 flowchart
Figure 0.42 prosite scanning section
Figure 0.44 for local use of protein sequencing
72 Uniprot, SwissProt
Both UniProt and SwissProt are online databases for proteins.
Figure 0.46 gene, protein or chemical can be found
The sequence
Molecular mass
Protein sequences from various species and organisms can be found in UniProt. SwissProt is the
manually annotated section of the UniProt database.
Figure 0.50 P0CG47 - UBB_HUMAN
Protein Data Bank provides Cartesian coordinates of each atom in the protein structure. Over
50,000 protein structures are reported and present in this database
STORAGE:
Sequence information is stored digitally
Similarity of sequences
Evolutionary History
75 GenBank
GenBank is maintained by the National Center for Biotechnology Information (NCBI).
It is a database of publicly available nucleotide sequences.
Several sequence, structure and molecular interaction databases exist. These are available online on
the web. Users can freely access and download such data
Because the human brain is limited in remembering and storing information for a long time, we use
online databases to store molecular information.
Ensembl is a genome browser which is used to search the genomes of recorded species.
http://asia.ensembl.org/index.html
Molecular evolution is the process of change in the sequence composition of cellular molecules such
as DNA, RNA, and proteins across generations. The field of molecular evolution uses principles of
evolutionary biology and population genetics to explain patterns in these changes. Genes and Proteins
are modified in this process.
All molecules have an evolutionary history. Phylogenetics is the science of studying evolutionary
relationships. Phylogenetics has led to the creation of relationship trees between various species of
Bacteria, Archaea, and Eukaryota.
(Page and Holmes 2009)
Conclusion
Phylogenetics is the study of extracting evolutionary relationships between species. Sequence
information from each species is used to measure the difference between the species.
Page, R. D. and E. C. Holmes (2009). Molecular evolution: a phylogenetic approach, John Wiley &
Sons.
77 Evolution of sequences
DNA acts as the cellular memory unit, and proteins are the translated products of DNA-coded
information. Evolution is very important for survival in different types of environments. There are
several mechanisms which bring about change, or evolution, in an organism. (Kluger 2015)
Method of Change
DNA gets modified by:
Discussion
Over time, species evolve to adapt to their circumstances. Since the environment and circumstances
may be different for each species, they evolve uniquely. Each cell may encounter unique evolutionary
pressures in the struggle for life. The order in which these pressures are presented to the cells is
also unique. Combinations of evolutionary factors are involved in evolution. The evolutionary
events and their combinations impart relationships between sequences. These relationships are
explored in phylogenetics. Several algorithms exist for finding such relationships.
Kluger, M. J. (2015). Fever: its biology, evolution, and function, Princeton University Press.
Page, R. D. and E. C. Holmes (2009). Molecular evolution: a phylogenetic approach, John Wiley &
Sons.
To understand the concept of evolution we follow some rules. Phylogenetics involves processing
sequence information from different species to find evolutionary relationships.
The output from such studies includes phylogenetic trees.
In the above figure, point A stands for the ancestor. With the passage of time, evolution occurred
and the genome sequences of the organisms changed.
Figure 4 layout of trees
Root node is the ancestor of all other nodes. The direction of evolution is from ancestor to the terminal
nodes.
Conclusion
Phylogenetics specifies evolutionary relationship with the help of trees. Trees can be rooted or
unrooted. Rooted trees can show temporal evolutionary direction.
Figure 7 computation comparison
Conclusion
Rooted and Unrooted trees have their own advantages and disadvantages. Depending on our
requirement, we can choose between them.
Rooted and Unrooted trees can be used to show phylogenetic relationships between sequences.
Several types of algorithms exist which are divided into two classes. There are many methods for
constructing evolutionary trees.
UPGMA (Unweighted Pair Group Method with Arithmetic Mean) is a simple agglomerative (bottom-up)
hierarchical clustering method. The method is generally attributed to Sokal and Michener.
In this method, the two sequences with the shortest evolutionary distance between them are
considered first. These sequences were the last to diverge, and they are represented by the most
recent internal node.
Least Squares Distance Method. Branch lengths represent the “observed” distances between
sequences (i & j).
Find X, Y and Z such that D (i, j) are conserved?
Conclusion
Several methods exist for constructing phylogenetic trees.
Broadly, they belong to objective methods or clustering methods.
We will study UPGMA and Distance Methods.
81 Introduction to UPGMA
Phylogenetic trees can be used to show phylogenetic relationships between sequences. To construct
these trees, several types of algorithms exist which are divided into two classes.
UPGMA: Unweighted Pair – Group Method using arithmetic Averages
$$d_{ZW} = \frac{N_X\, d_{XW} + N_Y\, d_{YW}}{N_X + N_Y}$$
$N_X$ is the number of sequences in cluster X
$$d_{XY} = \frac{1}{N_X N_Y} \sum_{i \in X,\; j \in Y} d_{ij}$$
Methods for constructing trees
• V and B (Calculate),
• V and C,
• V and E, V and F.
Conclusion
UPGMA is a clustering algorithm which can help us compute phylogenetic trees. We will see the
detailed working of this approach in later modules.
82 UPGMA-I
UPGMA has two components to it. These include distance calculations between two clusters and
between two trees.
$$d_{ZW} = \frac{N_X\, d_{XW} + N_Y\, d_{YW}}{N_X + N_Y}$$
$N_X$ is the number of sequences in cluster X
$$d_{XY} = \frac{1}{N_X N_Y} \sum_{i \in X,\; j \in Y} d_{ij}$$
Figure 10 the distance matrix is obtained using pairwise sequence alignment
$$d_{ZW} = \frac{N_X\, d_{XW} + N_Y\, d_{YW}}{N_X + N_Y}$$
$$d_{VB} = \frac{N_A\, d_{AB} + N_D\, d_{DB}}{N_A + N_D} = \frac{1 \cdot 6 + 1 \cdot 6}{1 + 1} = 6$$
• V and B (Calculate),
• V and C,
• V and E,
• V and F.
Conclusion
UPGMA starts with creating clusters of sequences which are the closest. Next, distance is computed
between the new cluster and the remaining sequences. The process is repeated for all sequences.
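The repeated cluster-and-update process can be sketched as a short program. The four sequences A to D
and their distances are hypothetical values chosen for the illustration (they are not the matrix from
the figures), and the function only records the order of merges rather than drawing the tree.

```python
def upgma(names, dist):
    """Cluster sequences with UPGMA.

    dist maps frozenset({a, b}) to the distance between a and b.
    Returns the merges as (cluster_x, cluster_y, distance) in order.
    """
    sizes = {n: 1 for n in names}       # number of sequences per cluster
    d = dict(dist)
    merges = []
    while len(sizes) > 1:
        pair = min(d, key=d.get)        # closest pair of clusters
        x, y = sorted(pair)
        z = f"({x},{y})"                # name of the new cluster Z
        merges.append((x, y, d[pair]))
        nx, ny = sizes.pop(x), sizes.pop(y)
        for w in sizes:                 # distance from every other cluster W to Z
            dxw = d.pop(frozenset((x, w)))
            dyw = d.pop(frozenset((y, w)))
            d[frozenset((z, w))] = (nx * dxw + ny * dyw) / (nx + ny)
        del d[pair]
        sizes[z] = nx + ny
    return merges


# Hypothetical distance matrix for four sequences.
toy = {
    frozenset("AB"): 2, frozenset("AC"): 6, frozenset("AD"): 10,
    frozenset("BC"): 6, frozenset("BD"): 10, frozenset("CD"): 10,
}
for step in upgma(["A", "B", "C", "D"], toy):
    print(step)
```

In this toy case A and B are joined first, then C joins them, and D joins last.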
83 UPGMA-II
UPGMA steps include distance calculations between two clusters and between two trees. We formed
clusters from sequences which had the shortest distance.
Building trees using UPGMA
Combining Clusters: Cluster X + Cluster Y = Cluster Z
Calculate the distance of each cluster (e.g. W) to the new cluster Z
$$d_{ZW} = \frac{N_X\, d_{XW} + N_Y\, d_{YW}}{N_X + N_Y}$$
$N_X$ is the number of sequences in cluster X
Calculating the distance between two trees
Assume we have N sequences
Cluster X has NX sequences, cluster Y has NY sequences
dXY : the evolutionary distance between X and Y
$$d_{XY} = \frac{1}{N_X N_Y} \sum_{i \in X,\; j \in Y} d_{ij}$$
Methods for constructing trees
84 UPGMA-III
UPGMA has two components to it. These include progressive distance calculations between two
clusters or between two trees.
Building trees using UPGMA
Combining Clusters: Cluster X + Cluster Y = Cluster Z. Calculate the distance of each cluster (e.g. W)
to the new cluster Z.
$$d_{ZW} = \frac{N_X\, d_{XW} + N_Y\, d_{YW}}{N_X + N_Y}$$
$N_X$ is the number of sequences in cluster X
$$d_{XY} = \frac{1}{N_X N_Y} \sum_{i \in X,\; j \in Y} d_{ij}$$
V – E becomes a new cluster, let's say W. Now we have to
modify the distance matrix again.
What are the distances between:
W and B, W and C,
W and F?
New matrix
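$$d_{WB} = \frac{N_V\, d_{VB} + N_E\, d_{EB}}{N_V + N_E} = \frac{2 \cdot 6 + 1 \cdot 6}{2 + 1} = 6$$
$$d_{WC} = \frac{N_V\, d_{VC} + N_E\, d_{EC}}{N_V + N_E} = \frac{2 \cdot 8 + 1 \cdot 8}{2 + 1} = 8$$
$$d_{WF} = \frac{N_V\, d_{VF} + N_E\, d_{EF}}{N_V + N_E} = \frac{2 \cdot 6 + 1 \cdot 6}{2 + 1} = 6$$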
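The three updates above can be checked with a one-line function. The cluster sizes and distances are
the ones used in the worked example (V has 2 sequences, E has 1); the function name is our own.

```python
def merged_distance(n_x, d_xw, n_y, d_yw):
    """Distance from cluster W to the new cluster Z = X + Y (size-weighted mean)."""
    return (n_x * d_xw + n_y * d_yw) / (n_x + n_y)


d_wb = merged_distance(2, 6, 1, 6)  # W = V + E, distance to B
d_wc = merged_distance(2, 8, 1, 8)  # distance to C
d_wf = merged_distance(2, 6, 1, 6)  # distance to F
print(d_wb, d_wc, d_wf)
```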
Cluster according to min distance
Conclusion
Now we have formed three clusters. Also, two separate trees have been formed. Next, we need to join
these trees to create a complete tree.
85 UPGMA-IV
Application of UPGMA resulted in formation of two sub-trees. The need now was to join them into a
single tree. Let’s see how that is done.
F – B becomes a new cluster, let's say X. We have to modify the distance matrix yet again. What is the
distance between the trees W and X?
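$$d_{WX} = \frac{1}{N_W N_X} \sum_{i \in W,\; j \in X} d_{ij}$$
$$d_{WX} = \frac{1}{N_W N_X} \left( d_{AB} + d_{AF} + d_{DB} + d_{DF} + d_{EB} + d_{EF} \right)$$
$$d_{WX} = \frac{1}{3 \cdot 2} \left( 6 + 6 + 6 + 6 + 6 + 6 \right) = 6$$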
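The distance between two trees is simply the average of all pairwise distances between their members.
A minimal sketch, using the six distances from the calculation above (W holds A, D and E; X holds B
and F); the function name and the dictionary layout are our own choices.

```python
def tree_distance(cluster_w, cluster_x, d):
    """Average of all pairwise distances d[i, j] with i in W and j in X."""
    total = sum(d[frozenset((i, j))] for i in cluster_w for j in cluster_x)
    return total / (len(cluster_w) * len(cluster_x))


# Pairwise distances used in the worked example.
pairwise = {
    frozenset(("A", "B")): 6, frozenset(("A", "F")): 6,
    frozenset(("D", "B")): 6, frozenset(("D", "F")): 6,
    frozenset(("E", "B")): 6, frozenset(("E", "F")): 6,
}
d_wx = tree_distance(["A", "D", "E"], ["B", "F"], pairwise)
print(d_wx)
```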
X – W becomes a new cluster, let's say Y. We have to modify the distance matrix.
What is the distance between Y and C?
Conclusion
We have now seen how trees are generated and connected. Next, we need to finalize the tree by
adding the last two clusters.
86 UPGMA-V
Application of UPGMA resulted in formation of two sub-trees. The need now was to join them into a
single tree. Let’s see how that is done.
X – W becomes a new cluster, let's say Y. We have to modify the distance matrix. What is the distance
between Y and C?
Conclusion
Un-weighted Pair Group Method using Arithmetic Averages is a clustering method to construct
phylogenetic trees. Non-clustering methods such as Maximum Parsimony may be used for making
trees as well.
Figure 0.1 Ribose sugar has (OH) and Deoxyribose (H)
Because ribose in RNA has two (OH) groups, RNA has a short life span, owing to the repulsion between
the two (OH) groups.
Coding RNA
Non-Coding RNA
Coding RNAs perform their coded function in protein synthesis, and non-coding RNAs help in the
translation process.
TYPES OF RNA
There are many types of RNA according to their functions, such as:
MESSENGER RNA
Messenger RNA makes up only 5-10% of the RNA in a cell. It has a variable sequence and a variable
size, and it carries genetic information from DNA to the ribosomes, where proteins are assembled.
The 5’ end of mRNA is capped with 7-Methyl Guanosine Triphosphate, which helps the ribosomes to
identify the mRNA. The 3’ end of the mRNA carries a poly A tail (around 30-200 adenylate residues),
which helps shield it against 3’ exonucleases.
RNA can form 3D structures (Sarver 2008). Such structural properties help the RNA molecule to
perform different functions.
RNA is composed of sugars, phosphates and bases, and the bases have the ability to form
hydrogen bonds.
‘G’ can also make hydrogen bonds with ‘U’ (Wobble Pair)
Catalytic roles
RNA molecules form many structures for stability and for different functions. “Gibbs Free Energy”
(LANGRIDGE and KOLLMAN 1987) is the free energy available to the RNA molecule for reactions, and
RNA structure formation proceeds towards lower energy. If an RNA can take two structures, we select
the one with the lowest energy state.
http://chemwiki.ucdavis.edu/Core/Physical_Chemistry/Thermodynamics/State_Functions/Free_Energy/Gibbs_Free_Energy
Figure 4 Energy is continuously given out as the RNA molecule folds by pairing complementary
bases
We can calculate the overall energy of an RNA structure by summing up the energies given out during
the process of folding. To account for both the negative (stabilizing) and positive (destabilizing)
energy values, we also factor in the ways in which RNA can be destabilized.
91 Calculating Energies of Folding - An Example
RNA is composed of four nucleotides (A, U, C and G), which are attached to the ribose sugars of the
backbone. These nucleotides form hydrogen bonds with each other. G always bonds with C and A always
bonds with U through hydrogen bonding, and energy is released.
That is why the RNA molecule becomes more stable.
Five nucleotides formed H-bonds. This bond formation released energy (-12.0 kcal/mol). The RNA
molecule took up a 2’ structure and hence became more stable.
The complementary bases of RNA combine to form RNA secondary structures. The simple
nucleotide sequence of an RNA is called its primary structure and is denoted by 1’. When these
nucleotides fold together and form a more complex structure, it is called the secondary structure
and is denoted by 2’.
The preferred structure of RNA is 2’ which has many structural patterns like Helices, Loops, Bulges
and Junctions
Figure 8 RNA sequence extends from its 5’ end to 3’ end. Upon folding, 3’ end may fold on to the 5’
end
The first 2’ RNA structure is called the helix. Unlike the DNA helix, the RNA helix is formed when
the RNA folds onto itself.
The loop of the hairpin must be at least four bases long to avoid steric hindrance with base-pairing
in the stem part of the structure.
Note that a hairpin reverses the chemical direction of the RNA molecule.
Bulges are formed when a double-stranded region cannot form base pairs perfectly. Bulges can be
asymmetric, with a varying number of unpaired bases on one side of the loop. Bulge loops are commonly
found in helical segments of cellular RNAs and are used to measure the helical twist of RNA in
solution. (Tang and Draper 1990)
The fourth type of 2’ RNA structure is the interior loop.
Interior loops are formed by an asymmetric number of unpaired bases on each side of the loop. (Turner,
Sugimoto et al. 1988)
Junctions include two or more double-stranded regions converging to form a closed structure.
The unpaired bases appear as a bulge. (Zuker and Sankoff 1984)
Figure 10 Unpaired bases in two 2’ structures form hydrogen bonds with each other
RNA tertiary structures are formed when unpaired bases in the 2’ regions bond with each other.
Figure 11 Hydrogen bonding formation in open nucleotides.
The unpaired nucleotides of a 2’ structure interact with other unpaired nucleotides and form a third
structure called the tertiary or 3’ structure. For example, the 4 nucleotides in a hairpin loop can
do this.
Unpaired bases in the 3’ structure may pair through an unusual folding called a pseudoknot;
otherwise they remain available for pairing.
Figure 12 pseudoknots
The tertiary or 3’ structure of RNA may form pseudoknots. To detect pseudoknots in an RNA structure
we need a “circular plot”, which is a graphical approach.
97 Experimental Methods for Determining RNA Structures
RNA has 1’, 2’ and 3’ structures. 1’ has simple nucleotide sequence and 2’ has nucleotides folding and
3’ has knots.
For measuring the RNA structure we use X-ray crystallography (Smyth and Martin 2000), which works
according to the principle of diffraction. Crystallized RNA diffracts X-rays which helps estimate atomic
positions
All isotopes that contain an odd number of protons and/or of neutrons (see Isotope) have an intrinsic
magnetic moment and angular momentum, in other words a nonzero spin, while all nuclides with even
numbers of both have a total spin of zero. The most commonly studied nuclei are 1H and 13C, al
Another method to measure RNA structure is atomic force microscopy. In this technique,
a laser connected to a Si3N4 piezoelectric probe scans an RNA sample. It works well in air and liquid
environments.
The third method for measuring RNA structure is Nuclear Magnetic Resonance (NMR) spectroscopy. In
this method, hydrogen atoms in RNA resonate upon placement in a high magnetic field. It works well
without crystallizing the RNA.
http://www.slideshare.net/Oatsmith/13-nuclear-magnetic-resonance-spectroscopy-wade-7th
STORAGE OF STRUCTURES
Reported structures are stored in online databases. Example includes RNA Bricks and RMDB etc.
Bioinformaticians can refer to these databases for RNA structure studies
RNA Bricks is a database of RNA 3D structure motifs and their contacts, both with themselves and with
proteins
Stanford University’s RNA Mapping Database is an archive that contains results of diverse structural
mapping experiments performed on ribonucleic acids.
Maximizing the number of paired nucleotides gives more structure, and we have to select the structure
according to its stability.
The dot plot method for RNA structure prediction is easy. Draw a square and partition it by drawing
gridlines. Put the RNA sequence on the top and left sides of the square. Put a “DOT” wherever the row
and column nucleotides are complementary. For example:
In long RNA sequences, the gaps between runs of complementary nucleotides become the bulges and loops
of the structure.
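The dot plot construction can be sketched as follows. The short sequence GCAU is a made-up example,
and including the G-U wobble pair alongside A-U and G-C is our own choice for the illustration.

```python
# Complementary pairs: Watson-Crick plus the G-U wobble pair (an assumption).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def dot_plot(seq):
    """Return a square grid with '.' where row and column bases can pair."""
    return [["." if (a, b) in PAIRS else " " for b in seq] for a in seq]


seq = "GCAU"
print("  " + seq)
for base, row in zip(seq, dot_plot(seq)):
    print(base + " " + "".join(row))
```

Runs of dots along an anti-diagonal of the grid correspond to stretches that can form a helix.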
• STABILIZING ENERGY
Energy table helps us to find the optimal prediction of structure because energy is released when
complementary nucleotides make bonds.
• DESTABILIZING ENERGY
The remaining unpaired nucleotides destabilize the RNA structure, in the form of hairpin or bulge
structures.
SUM OF ENERGIES
The sum of stabilizing and destabilizing energies helps determine the quality of a 2’ RNA structure.
The 2’ structure with the longest coupled sequences may differ from the one with the lowest energy.
The algorithm computes the energies of all possible 2’ structures, generates combinations of all
computed 2’ structures, and selects the one with the lowest energy.
Zuker’s Algorithm involves computing the stabilizing and destabilizing energies of a 2’ structure. All
possible 2’ structures are generated. The best 2’ structure is selected!
We need to construct all the possible combinations of nucleotides to select the optimal 2’ RNA
structure.
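Selecting the structure with the lowest summed energy can be sketched as below. The two candidate
structures and all energy values (in kcal/mol) are hypothetical numbers made up for the illustration;
real values come from energy tables. This only shows the final selection step, not Zuker's algorithm
itself.

```python
def total_energy(stabilizing, destabilizing):
    """Sum the negative (stabilizing) and positive (destabilizing) energies."""
    return sum(stabilizing) + sum(destabilizing)


# Hypothetical candidates: (energies of paired stacks, energies of loops/bulges).
candidates = {
    "long_stem_big_loop": ([-2.0, -2.0, -2.0, -2.0], [5.0]),  # 4 pairs
    "short_stem_small_loop": ([-3.0, -3.0], [1.5]),           # 2 pairs
}

best = min(candidates, key=lambda name: total_energy(*candidates[name]))
print(best)
```

Here the structure with fewer pairs wins because its overall energy is lower.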
103 Zuker Algorithm – Flowchart
Zuker’s Algorithm involves computing the stabilizing and destabilizing energies of a 2’ structure. It
also computes the overall energy by summing up the positive and negative energies.
From all possible diagonal combinations, the one with the lowest overall energy is selected.
Figure 25 Martinez Algorithm flow chart
In the Martinez algorithm, all the 2’ structures are weighed by their stability and the optimal one
is selected. Monte Carlo methods (or Monte Carlo experiments) are a broad class of computational
algorithms that rely on repeated random sampling to obtain numerical results. They are often used in
physical and mathematical problems and are most useful when it is difficult or impossible to use
other mathematical methods.
The Monte Carlo method does not provide a definitive solution.
The Nussinov-Jacobson (NJ) Algorithm is a Dynamic Programming (DP) strategy to predict optimal RNA 2’
structures, proposed in 1980. It computes the 2’ structure with the most nucleotide coupling.
http://ultrastudio.org/en/Nussinov_algorithm
HOW IT WORKS
Start filling each empty position in matrix by choosing the maximum of 4 scores
 J       1  2  3  4  5  6  7  8  9
 I       G  G  C  A  A  A  U  G  C
 1  G    0
 2  G    0  0
 3  C       0  0
 4  A          0  0
 5  A             0  0
 6  A                0  0
 7  U                   0  0
 8  G                      0  0
 9  C                         0  0
Figure 0.3 for maximum score 4 positions are used in scoring
Scoring Matrix
Matrix Initialization
Scoring method
The 4 different positions to be considered for calculating matrix
The matrix is filled using four different positions: the Left, Bottom, Diagonal, and Left/Bottom
elements. In this way the coupling of all complementary nucleotides is catered for.
There can be many tracebacks. Each traceback is used to make an RNA secondary structure, and the
traceback with the highest number of coupled nucleotides is selected.
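The matrix filling and one traceback can be sketched as below for the sequence GGCAAAUGC used above.
This is a minimal sketch under stated assumptions: only A-U and G-C pairs score 1, and the min_loop
parameter (minimum number of unpaired bases inside a pair, default 0) is our own addition. It returns
one optimal traceback, although several may exist.

```python
WC_PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}


def nussinov(seq, min_loop=0):
    """Fill the score matrix; score[i][j] = max number of pairs in seq[i..j]."""
    n = len(seq)
    score = [[0] * n for _ in range(n)]  # diagonal and sub-diagonal stay 0
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            best = max(score[i + 1][j], score[i][j - 1])    # bottom, left
            if span > min_loop and (seq[i], seq[j]) in WC_PAIRS:
                best = max(best, score[i + 1][j - 1] + 1)    # diagonal + pair
            for k in range(i + 1, j):                        # left/bottom split
                best = max(best, score[i][k] + score[k + 1][j])
            score[i][j] = best
    return score


def traceback(seq, score, i, j, min_loop=0):
    """Recover one set of base pairs (0-based positions) from the matrix."""
    if i >= j:
        return []
    if score[i][j] == score[i + 1][j]:
        return traceback(seq, score, i + 1, j, min_loop)
    if score[i][j] == score[i][j - 1]:
        return traceback(seq, score, i, j - 1, min_loop)
    if (j - i > min_loop and (seq[i], seq[j]) in WC_PAIRS
            and score[i][j] == score[i + 1][j - 1] + 1):
        return [(i, j)] + traceback(seq, score, i + 1, j - 1, min_loop)
    for k in range(i + 1, j):
        if score[i][j] == score[i][k] + score[k + 1][j]:
            return (traceback(seq, score, i, k, min_loop)
                    + traceback(seq, score, k + 1, j, min_loop))
    return []


rna = "GGCAAAUGC"
matrix = nussinov(rna)
pairs = traceback(rna, matrix, 0, len(rna) - 1)
print(matrix[0][-1], pairs)
```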
RNA has three different structures: 1’, 2’ and 3’. There are many algorithms for predicting these
structures, but all of them follow one of two main strategies:
1. Nucleotides stacking
2. Energy minimization
ENERGY BASED ALGORITHM.
Zuker’s Algorithm involves energy minimization. Its updated version is improved and incorporates
phylogenetic information. It also accounts for pseudoknots and accommodates them. This algorithm
helps to predict the structures of RNA based on nucleotides.
NUCLEOTIDES STACKING ALGORITHM.
NJ’s Algorithm comes under this category. It involves maximizing nucleotide pairing. Traceback
helps to find the best 2’ structure.
It predicts the 2’ structure with about 75% accuracy, because several tracebacks may have equal
scores, as each score is calculated from four different positions. To get the best results we need
to combine the stacking and energy minimization methods.
For further improvements in results we take help from:
Sequence comparison
Nucleotide covariance analysis
111 Web Resources: RNA Bricks
To predict the 2’ structure of RNA from its 1’ sequence we use different algorithms such as Zuker’s,
Martinez and N-J. Online tools are also available.
The mfold web server is one of the oldest web servers in computational molecular biology. Mfold is
an upgraded version of Zuker’s algorithm.
MFOLD is computationally expensive and can give results for sequences shorter than 8000
nucleotides.
Figure 31 http://unafold.rna.albany.edu/?q=mfold
Figure 32 http://unafold.rna.albany.edu/?q=mfold/RNA-Folding-Form
Figure 33 http://unafold.rna.albany.edu/?q=mfold/Structure-display-and-free-energy-determination
MFOLD helps fold an RNA nucleotide sequence into its possible 2’ structures. MFOLD gives out
several structures along with their energetic stability!
Short stretches of the 1’ nucleotide sequence fold to form 2’ structures. For example,
CUUCGG occurs in a wide variety of RNAs and mostly forms a stable hairpin loop. So we can make a
list of all likely 2’ structures arising from the 1’ sequence.
Figure 34 http://www.rnasoft.ca/strand/
Figure 35 http://iimcb.genesilico.pl/rnabricks
The RNA 1’ sequence folds to make the RNA 2’ structure. This online database is established for 2’
RNA structures and acts as a dictionary of them.
Figure 0.2 codon (set of three nucleotide) codes for specific amino acid.
Codons select the amino acids, ribosomes make the protein by a polymerization process, and the
resulting chain coils together to form a 3D structure.
Figure 4 Start Codon ATG and Stop Codon TAG, TGA or TAA
In molecular genetics, an open reading frame (ORF) is the part of a reading frame that has the
potential to code for a protein or peptide. An ORF is a continuous stretch of codons that do not contain
a stop codon (usually UAA, UAG or UGA).
https://en.wikipedia.org/wiki/Open_reading_frame
Six reading frames exist in any DNA sequence. The longest ORF is marked, and its first stop codon
marks the end of the protein.
Each codon of 3 nucleotides codes for one amino acid. There is 1 start codon and there are 3 stop
codons. An ORF is selected based on its length: the longest one is taken as the most suitable
candidate for the protein synthesis reaction.
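Searching the six reading frames for the longest ORF can be sketched as below. It uses the start codon
ATG and the stop codons TAA, TAG and TGA named above; the test sequence is made up, and the function
names are our own. The sketch works on DNA letters and does not translate the ORF into protein.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse strand read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def orfs_in_frame(dna, frame):
    """Yield ORFs (ATG up to and including the first stop) in one frame (0, 1, 2)."""
    start = None
    for i in range(frame, len(dna) - 2, 3):
        codon = dna[i:i + 3]
        if start is None and codon == "ATG":
            start = i
        elif start is not None and codon in STOPS:
            yield dna[start:i + 3]
            start = None


def longest_orf(dna):
    """Check 3 forward and 3 reverse frames and return the longest ORF found."""
    candidates = []
    for strand in (dna, reverse_complement(dna)):
        for frame in range(3):
            candidates.extend(orfs_in_frame(strand, frame))
    return max(candidates, key=len, default="")


print(longest_orf("CCATGAAATTTTAGGG"))
```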
Figure 7 ORF extraction flowchart
Both forward and reverse sequences are considered. They may contain many ORFs, and the selection is
based on the ORF that gives the longest protein sequence.
Figure 8 Mechanism
https://en.wikipedia.org/wiki/Edman_degradation
Edman degradation is the cyclic degradation of peptides by Phenyl-iso-thio-cyanate (PhNCS). PhNCS
attaches to the free amino group at the N-terminal residue. In each cycle, 1 amino acid is removed as
a PhNCS derivative.
Figure 9 working of Edman degradation
DRAWBACKS
It is restricted to chains of about 60 residues.
The Edman degradation method helps us to sequence an unknown protein, but it is restricted to
about 60 amino acids.
Proteins can be given a charge (ionized). If moving charges are placed in a magnetic field, they
are deflected, and the radius of their deflected path is proportional to their momentum.
Where:
Figure 11 equation for MS application in protein sequencing
COMPONENTS
Sample Injection
Ionization Source
Mass Analyzer
Ion Detector
Unknown proteins are measured and sequenced, then matched against an existing database. Those that
match are shortlisted.
Separation
Ionization
Mass analysis
Detection
Two methodologies are involved
1. Bottom up proteomics
2. Top down proteomics
Bottom up proteomics measures the peptide masses produced after protein enzymatic digestion. And
Top down proteomics measures the intact proteins followed by peptides after fragmentation.
BOTTOM UP PROTEOMICS
In this methodology the protein complex is treated with site specific enzymes which cleaves them into
amino acid residue and resultant peptides are measured for their masses. One peptide is selected at
one time for processing and when all are processed than protein search engine is used for matches.
TOP DOWN PROTEOMICS
In this methodology proteins are ionized and their masses are measured. One protein is mass selected at a time for fragmentation, and the masses of the resultant fragments are measured.
We can say that bottom up proteomics deals with peptides while top down proteomics can handle the whole protein.
PROTOCOL
1. A sample containing the mixture of proteins from cells and tissues is obtained.
2. An enzyme such as trypsin is used to cleave the proteins.
3. The enzyme cleaves the protein at specific amino acid sites.
4. Several peptides are formed when a protein is cleaved.
5. The number of peptides depends upon the number of sites where the enzyme cleaves the protein. For example, trypsin cleaves the protein after lysine (K) and arginine (R).
6. The mass of each peptide is measured.
7. One peptide is selected at a time.
8. A different enzyme can be used to cleave the protein at different sites.
9. This process continues until all possible peptides have been formed and searched.
10. Peptides are searched against the database and matched.
Shotgun proteomics digests the whole protein mixture first and then compares the peptides with the database. Peptide mass fingerprinting involves protein separation first, followed by analysis of the peptides of a single protein.
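The enzymatic cleavage step of the bottom up protocol can be imitated in silico. The sketch below assumes the commonly used trypsin rule (cleave after K or R, but not when the next residue is P); the function name is hypothetical and missed cleavages are ignored.

```python
# In silico tryptic digestion sketch (assumed rule: cut after K/R, not before P).
def tryptic_digest(protein):
    peptides, current = [], ""
    for i, residue in enumerate(protein):
        current += residue
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(current)   # cleavage site reached
            current = ""
    if current:
        peptides.append(current)       # C-terminal peptide without K/R
    return peptides
```

For example, `tryptic_digest("AGKCDRPEFKGH")` gives three peptides, because the R followed by P is not cleaved under the assumed rule.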
Bottom up proteomics identifies proteins by cleaving them into segments at specific sites, so it is not suitable for measuring intact protein masses directly.
PROTOCOL
1. A sample containing the protein mixture from cells and tissues is obtained.
2. The entire protein mixture is analyzed for masses.
3. A list of masses is obtained.
4. TDP also measures the masses of the post-translationally modified forms of the proteins.
5. After MS1, one protein is selected at a time and fragmented to obtain its peptides.
6. The process is repeated many times.
The comparison is done against protein databases such as UniProt and Swiss-Prot.
TDP therefore measures the masses of intact proteins as well as the mass changes caused by post-translational modifications.
Figure (flowchart): in silico fragmentation of candidate proteins; comparison of theoretical masses with experimental masses; matching of experimental and in silico peak lists; post-translational modifications; protein scores
The flowcharts discussed above can help us arrive at the sequence of the protein in question. Scoring
schemes are required to quantitatively represent the quality of results
125 Protein Ionization Techniques
Protein ionization is used in mass spectrometry based proteomics protocols. Ionization involves the addition of a proton to the protein or the removal of a proton from it. Ionization can therefore increase or decrease the measured mass of the protein or peptide.
SALIENT IONIZATION TECHNIQUES
The salient techniques include Matrix Assisted Laser Desorption Ionization (MALDI) and Electrospray Ionization (ESI).
MALDI
In this technique one proton is added to the protein or peptide, so the molecular weight increases by one and the mass spectrometer reports the molecule at +1.
ESI
ESI adds many protons to the protein or peptide, and the molecular weight increases by the number of protons added. However, it is difficult in ESI to find the molecule with a +1 charge.
EXAMPLE
MS data from MALDI ionization is easier to handle as the product ion masses are mostly at “1+mass”. ESI is more difficult to use as it does not easily give away the +1 charged ion.
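The effect of adding protons can be worked out numerically. The sketch below assumes positive-mode ionization in which each charge comes from one added proton (mass about 1.007276 Da); the function names are illustrative.

```python
# Mass-to-charge sketch for protonated ions (assumption: z added protons).
PROTON_MASS = 1.007276  # Da

def mz(neutral_mass, charge):
    """m/z of a molecule carrying `charge` extra protons."""
    return (neutral_mass + charge * PROTON_MASS) / charge

def neutral_mass(observed_mz, charge):
    """Recover the neutral mass from an observed m/z and a known charge."""
    return observed_mz * charge - charge * PROTON_MASS
```

With MALDI (charge 1) a 1000 Da peptide appears near m/z 1001, while under ESI the same peptide with charge 2 appears near m/z 501, which is why the charge state must be known before the mass can be read off.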
When we ionize a protein, it can be deflected by a magnetic field to an extent that depends on its mass-to-charge ratio, and so the mass of the protein can be measured by mass spectrometry.
Figure 0.67 MS1 Schematic (Image courtesy Wikipedia)
The mass/charge ratio helps us to calculate the mass of a protein. “Mass Select” helps to select a specific ion from MS1 for further analysis.
MS1 reports the intact masses of the peptides.
• SCORING SCHEME
All experimental masses are compared with the theoretical masses of the proteins in the database. Candidates are ranked or scored on the basis of how close their theoretical mass is to the experimental mass.
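A closeness-based ranking of this kind can be sketched as follows. The tolerance value, the function name and the toy database are assumptions made for illustration; real search engines use more elaborate scores.

```python
# Rank database proteins by closeness of theoretical to experimental mass.
def rank_by_mass(experimental_mass, database, tolerance=1.0):
    """database maps protein name -> theoretical mass (Da)."""
    hits = []
    for name, theoretical_mass in database.items():
        error = abs(experimental_mass - theoretical_mass)
        if error <= tolerance:          # keep only candidates within tolerance
            hits.append((name, error))
    return sorted(hits, key=lambda hit: hit[1])   # closest mass first

# Hypothetical database used only for this example
toy_db = {"P1": 10000.0, "P2": 10000.6, "P3": 10500.0}
ranked = rank_by_mass(10000.5, toy_db)
```

Here `ranked` lists P2 before P1, and P3 is dropped because it lies outside the tolerance.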
If several proteins have the same score, the selection is made using another technique, protein fragmentation. We fragment the protein or peptide and ionize the fragments, which lets us measure the fragment masses in the same way as the mass of their precursor.
There are different techniques for protein fragmentation.
If we can measure the masses of the fragments using MS and calculate the theoretical masses of the fragments, then we can award a score on the basis of the similarity between the experimental and theoretical masses.
129 Tandem MS
MS1 measures the masses of intact proteins or peptides. This can be followed up by their fragmentation in the MS chamber.
Tandem MS can be extended to the fragments of an intact molecule. All you need is the MS instrument’s capability to:
(i) Select the fragment’s mass range.
(ii) Fragment the precursor fragment.
Tandem MS helps us to measure the masses of fragments. This makes scoring and protein identification much easier.
In MS1, the molecular weight of the intact sample molecule is measured. The intact molecule is then fragmented in two, and these two fragments are measured in a second round of MS (MS2).
FRAGMENTATION TECHNIQUES AND MOLECULAR WEIGHT
Fragmentation techniques include ECD, CID etc. Fragmentation of the intact molecule splits it into two parts.
FRAGMENT MASS
The fragment masses reported by MS2 depend upon the technique, because each technique splits the protein or peptide at a different location.
Mid_Term Syllabus
Figure 0.89 Masses after Fragmentation by ECD, CID & ETD
Complete here
Experimental mass reported from MS2 is matched with theoretical peptides of candidate proteins (from
DB). Score is awarded on the basis of the closeness between experimental and theoretical masses.
Figure 0.91 Masses after Fragmentation by ECD
Peptide sequence tags (PSTs) are short stretches of peptide sequence that are derived from MS2 data. We can obtain the sequence of a peptide through the variation in fragmentation site.
Fragmentation of precursor proteins or peptides leads to the formation of multiple ions of the same fragment type. However, these fragments vary in their molecular weights because the site of fragmentation varies.
Fragmentation at consecutive sites leads to a mass difference equal to that of a single amino acid. Such consecutive peaks can reveal partial peptide sequence tags.
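The idea of reading a tag from consecutive peaks can be sketched as below. This is a simplified illustration: only neighbouring peaks of a sorted peak list are compared, only a small subset of monoisotopic residue masses is included, and the tolerance and function names are assumptions.

```python
# Peptide sequence tag sketch: map mass gaps between neighbouring peaks to residues.
RESIDUE_MASS = {  # monoisotopic residue masses in Da (subset for illustration)
    "G": 57.02146, "A": 71.03711, "S": 87.03203,
    "V": 99.06841, "Q": 128.05858, "M": 131.04049,
}

def residue_for(delta, tolerance=0.01):
    for residue, mass in RESIDUE_MASS.items():
        if abs(delta - mass) <= tolerance:
            return residue
    return None

def extract_tags(peaks, tolerance=0.01):
    """Return the tags spelled out by runs of consecutive single-residue gaps."""
    peaks = sorted(peaks)
    tags, current = [], ""
    for lighter, heavier in zip(peaks, peaks[1:]):
        residue = residue_for(heavier - lighter, tolerance)
        if residue is not None:
            current += residue          # gap equals one amino acid: extend the tag
        else:
            if current:
                tags.append(current)    # gap matches nothing: close the tag
            current = ""
    if current:
        tags.append(current)
    return tags
```

A peak list whose successive gaps are 131.04049, 128.05858 and 99.06841 Da would therefore yield the tag MQV.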
Figure 24 peptide sequencing tagging
Peptide sequence tags can be extracted from peak list iteratively. A high quality mass spectrum will
produce large number of PSTs. The bigger the peptide sequence tags, the better!
PSTs provide clues of the precursor protein/peptides sequence. Consider that we extract the following
PSTs: M, MQ, QV etc. Search protein sequence database (e.g. Uniprot, Swissprot)
Sample sequence in protein DB
>sp|Q6GZ4X|0X1R_FRG3G Putative transcription factor 0X1R OS=Random virus 3 (isolate Goorha) GN=FV3-0X1R PE=4 SV=1
MAFSAEDVLKEYDRRRRMEALLLSLYYPNDRKLLDYKEWSPPRVQVECPKAPVEWNNPPSEKGLIVGHFSGI
KYKGEKAQASEVDVNKMCCWVSKFKDAMRRYQGIQTCKIPGQVLSDLDAKIKAYNLTVEGVEGFVRYSRVT
KQHVAAFLKELRHSKQYENVNLIHYILTDKRVDIQHLEKDLVKDFKALVESAHRMRQGHMQNVKYILYQLLK
KHGHGPDGPDILTVKTGSMQLYDDSFRKIYTDLGWKFTPL
For all the proteins in the database, we find out which PSTs exist in which proteins. The protein
reporting the most PSTs is more probable to be the precursor protein.
If several proteins report the same number of PSTs, a scoring scheme that also rewards longer PSTs helps to rank them. After extracting the PSTs, we search the entire database for the proteins that contain them.
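A minimal sketch of such a PST search is given below. The scoring rule (sum of the lengths of the matched tags, so that longer tags count more) is an assumption chosen for illustration, and the two short sequences are hypothetical database entries.

```python
# PST database search sketch: longer matching tags contribute more to the score.
def score_proteins(tags, database):
    """database maps protein name -> sequence; returns (name, score), best first."""
    scores = {}
    for name, sequence in database.items():
        scores[name] = sum(len(tag) for tag in tags if tag in sequence)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical entries used only for this example
toy_db = {"protA": "MAFSAEDVLK", "protB": "GHMQNVKYIL"}
ranking = score_proteins(["M", "MQ", "QN"], toy_db)
```

Here protB matches all three tags and scores 5, while protA matches only the single-letter tag and scores 1.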
Additionally, if we include RMSE to the scoring system, then it can highlight better PST matches.
And RMSE is the root mean square error.
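As a small worked sketch, the RMSE between matched experimental and theoretical masses can be computed as follows (the function name and the pairing of values by position are assumptions):

```python
import math

# Root mean square error between paired experimental and theoretical masses.
def rmse(experimental, theoretical):
    if len(experimental) != len(theoretical) or not experimental:
        raise ValueError("need two non-empty lists of equal length")
    squared = [(e - t) ** 2 for e, t in zip(experimental, theoretical)]
    return math.sqrt(sum(squared) / len(squared))
```

A smaller RMSE means the matched peaks lie closer to their theoretical values, so it can be used to favour better PST matches.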
Figure 25 root mean square error
MS1 reports the intact mass of molecules (proteins or peptides) in the sample. Intact mass can be
compared with every protein’s mass in database to identify the molecule in the sample.
In case multiple candidate proteins are reported, MS2 can be performed. MS2 helps measure fragment peptide masses. MS2 data can be used to extract peptide sequence tags.
If the protein identification is still not confirmed, then each experimentally reported MS2 fragment is compared with the in silico spectra of proteins from the database.
Fragmentation techniques determine product ions e.g. ECD -> c/z and CID -> b/y ions etc. With known
fragment types, we can compute the MW of all possible protein fragments
For obtaining all possible theoretical fragments in a protein, we need to compute the MWs of each
fragment individually
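As an illustration for CID-type fragmentation, the sketch below computes singly charged b and y ion masses for a short peptide. The residue mass table is a small subset, the function name is hypothetical, and the constants (proton 1.007276 Da, water 18.010565 Da) are standard monoisotopic values.

```python
# In silico b/y fragment sketch for a peptide (singly charged ions assumed).
RESIDUE_MASS = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "V": 99.06841}
PROTON = 1.007276
WATER = 18.010565

def fragment_ions(peptide):
    masses = [RESIDUE_MASS[residue] for residue in peptide]
    n = len(masses)
    # b ions: N-terminal pieces; y ions: C-terminal pieces (these keep the water)
    b_ions = [sum(masses[:i]) + PROTON for i in range(1, n)]
    y_ions = [sum(masses[-i:]) + WATER + PROTON for i in range(1, n)]
    return {"b": b_ions, "y": y_ions}
```

For a peptide of n residues this gives n-1 b ions and n-1 y ions, one pair per backbone cleavage site.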
Consider a random protein sequence from DB:
Matching experimental fragments with in silico fragments is the final resort in protein search and
identification.
Accumulate the score
With “all possible” fragments in in silico spectrum, and “reported” fragments in experimental spectrum,
we can match and rank.
Scoring scheme should also consider the errors in peak matching
MS1 and MS2 provide us with a host of data towards enabling us in identifying unknown proteins. A
step by step approach combining MW, PSTs and insilico spectral matching is required.
Integrating MW, PST and insilico comparison algorithms in a workflow can help create a composite
protein search engine. A composite scoring system is also required for this search engine.
MW Match score
PST Match score
In vitro & In silico Match score
For overall cumulative score computation:
Composite scoring schemes are needed to combine scores coming in from multiple criteria. The ability
of a scoring scheme to better isolate true positives from false positives is important.
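One simple way to combine the three scores is a weighted average. This is only an assumed form for illustration; the weights and the function name are hypothetical and the handout's own formula may differ.

```python
# Composite score sketch: weighted average of the three match scores.
def composite_score(mw_score, pst_score, spectral_score, weights=(1.0, 1.0, 1.0)):
    w_mw, w_pst, w_spec = weights
    total_weight = w_mw + w_pst + w_spec
    weighted = w_mw * mw_score + w_pst * pst_score + w_spec * spectral_score
    return weighted / total_weight
```

Raising one weight makes that criterion count more, which is one way to tune how well the scheme separates true positives from false positives.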
140 Large Scale Proteomics
Peptide mixtures in bottom up proteomics are very complex. The number of tryptic peptides may reach the order of 300,000–400,000. In whole proteome samples, the protein count may be over 10,000. Experiments have shown that it is difficult or even impossible to analyze all these peptides in a single analysis, as the mass spectrometer is essentially overwhelmed.
Over half a million peptides reported in a typical LSP experiment are redundant.
If we relied on a single unique peptide for each protein, sequence coverage would suffer, so we have to strike a compromise between sequence coverage and sample coverage.
TECHNIQUE
One way forward would be to transfer peptides to the MS chamber in a step-by-step manner. However, this imposes the precondition that no peptide is selected more than once.
3. After the initial MS scan, an MS/MS spectrum from peptide A is obtained by selectively
fragmenting this mass only.
4. Next, a spectrum for peptide B is produced, followed by a recording of the MS/MS spectrum for
peptide C.
5. After these three fragmentation spectra have been obtained, a new MS scan is started.
From this scan, three more peptides A B C are selected for fragmentation and the cycle starts over
again.
The number of MS1/2 scans can be limited by carefully selecting the peptide peaks. Once the intense
peptides are identified, next batch of peptides is chosen for MS2.
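The selection cycle above can be sketched as picking the most intense peaks of each MS scan while skipping those already fragmented. The function name, the choice of three peptides per cycle and the matching tolerance are assumptions for illustration.

```python
# Precursor selection sketch: top-N most intense peaks, never selected twice.
def select_precursors(scan, already_selected, top_n=3, tolerance=0.01):
    """scan is a list of (mz, intensity); already_selected is updated in place."""
    chosen = []
    for mz, intensity in sorted(scan, key=lambda peak: peak[1], reverse=True):
        seen = any(abs(mz - old) <= tolerance for old in already_selected)
        if not seen:
            chosen.append(mz)
            already_selected.append(mz)   # exclude this peptide from later cycles
        if len(chosen) == top_n:
            break
    return chosen

scan = [(500.0, 10), (600.0, 50), (700.0, 30), (800.0, 20), (900.0, 5)]
excluded = []
first_cycle = select_precursors(scan, excluded)
second_cycle = select_precursors(scan, excluded)
```

The first cycle takes the three most intense peaks and the second cycle moves on to the remaining, less intense ones.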
Figure 28 Formats used for proteomics data
OPEN FORMATS:
mzXML (tools.proteomecenter.org/mzXMLViewer.php)
MGF (proteomicsresource.washington.edu/mascot/help/data_file_help.html)
Multiple MS data formats exist. Proprietary formats are implemented in the software that is supplied with the instrument hardware. Open standards also exist, for interoperability etc.
Mass spectrometer outputs data with ionic mass/charge ratios & respective ion intensities.
RAW file is a format in which an instrument outputs data in binary form.
Figure 29 Raw file formats
Multiple RAW file formats are prevailing in the industry. Each vendor has its own unique RAW file
format. You can convert proprietary formats into open formats
http://www.matrixscience.com/help/data_file_help.html
144 Open MS Data Formats
Mass spectrometer outputs data with mass/charge ratios & respective ion intensities. RAW file formats
are specific to each instrument and each vendor has its own unique file format. Once an instrument is
upgraded, data output from the instrument is also changed. Hence the underlying RAW file format
needs to be upgraded as well.
NEED
Proprietary RAW formats are binary formats which are difficult to read and parse. If you have the
software from the maker of the MS then you can read the RAW data file as well.
SOLUTION
mzData was developed by HUPO-PSI
mzXML was developed at the Institute for Systems Biology
To combine them, a joint venture produced mzML
Several software tools exist for converting RAW file formats into open formats. Each open format has its own unique advantages. The mzXML and MGF formats are the most frequently used.
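Because MGF is a plain-text format, it is easy to read with a few lines of code. The sketch below assumes the usual MGF layout (spectra enclosed in BEGIN IONS / END IONS, KEY=VALUE parameter lines, and peak lines of m/z followed by intensity); the function name and the returned structure are illustrative choices.

```python
# Minimal MGF reader sketch (function name and output layout are assumptions).
def parse_mgf(text):
    spectra, current = [], None
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue                                # skip blanks and comments
        if line == "BEGIN IONS":
            current = {"params": {}, "peaks": []}
        elif line == "END IONS":
            if current is not None:
                spectra.append(current)
            current = None
        elif current is not None:
            if "=" in line:                         # parameter line, e.g. PEPMASS=...
                key, value = line.split("=", 1)
                current["params"][key] = value
            else:                                   # peak line: m/z [intensity]
                fields = line.split()
                intensity = float(fields[1]) if len(fields) > 1 else None
                current["peaks"].append((float(fields[0]), intensity))
    return spectra

example = """BEGIN IONS
TITLE=spectrum 1
PEPMASS=500.25
CHARGE=2+
100.1 20.0
200.2 35.5
END IONS
"""
spectra = parse_mgf(example)
```

Each parsed spectrum keeps its header parameters and its list of mass/charge and intensity pairs.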
Figure 33
http://www.matrixscience.com/search_form_select.html
Mascot is the most widely used online search tool for proteomics data. However, it lacks a batch
processing mode. Also, it does not cater for top-down proteomics data.
Kelleher et al. have developed an online top down proteomics search engine, “ProSight PTM”. ProSight PTM searches top down proteomics data and reports the precursor protein.
https://prosightptm.northwestern.edu/about_retriever.html
ProSight PTM is the state of the art in top down proteomics search. Using Prosight PTM,
posttranslational modifications can be accurately identified.
Natural elements occur as multiple isotopes. Isotopes differ in their masses. The abundance of each isotopic variant is unique.
Figure 34 Isotopic variants of natural elements
TYPES OF MASSES
Nominal Mass
Monoisotopic Mass
Average Mass
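The three mass types can be compared with a small calculation. The sketch below uses glycine (C2H5NO2) as the example; the element tables hold standard values for four elements only, and the function name is an illustrative choice.

```python
# Nominal, monoisotopic and average mass of a molecular formula (sketch).
NOMINAL = {"C": 12, "H": 1, "N": 14, "O": 16}                  # integer mass of main isotope
MONOISOTOPIC = {"C": 12.0, "H": 1.00782503,
                "N": 14.0030740, "O": 15.9949146}              # exact mass of main isotope
AVERAGE = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999}  # abundance-weighted mass

def formula_mass(formula, table):
    """formula maps element symbol -> atom count."""
    return sum(table[element] * count for element, count in formula.items())

glycine = {"C": 2, "H": 5, "N": 1, "O": 2}
nominal = formula_mass(glycine, NOMINAL)
monoisotopic = formula_mass(glycine, MONOISOTOPIC)
average = formula_mass(glycine, AVERAGE)
```

For glycine this gives a nominal mass of 75, a monoisotopic mass of about 75.032 and an average mass of about 75.067.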
Figure 36 Detecting Monoisotopic Peaks
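MS1 data reports the isotopic distribution of the intact molecule’s mass. The monoisotopic mass value has to be selected from this mass distribution. It corresponds to the lowest-mass peak of the distribution, in which every atom is present as its lightest isotope; for small peptides this is usually also the most intense peak.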
The first step in protein identification and characterization using mass spectrometry involves intact
protein/peptide mass measurement. Next, we fragment the protein. A protein or peptide backbone may
be fragmented anywhere along the peptide backbone.
This results in formation of two fragments i.e. N-term fragment and C-Term fragment.
For the possible fragments, let’s take an example protein with 100 residues. Such a molecule’s backbone can be fragmented at about 100 different locations (it has 99 peptide bonds). Each cleavage yields two fragments, so the total number of possible fragments is about 200.
TANDEM MS
The masses of these roughly 200 fragments can then be measured by using an MS again. The necessary condition for this measurement is that all the fragments are ionized.
To ensure that all fragments of the precursor molecule are also charged, we can use Electrospray ionization (ESI). ESI induces multiple charges on the intact molecule.
Peptide sequence tags help derive clues about the sequence of precursor proteins/peptides. The short
peptide sequences help us in shortlisting candidate proteins from the database.
MS1 helps measure the intact mass of proteins/peptides. A list of candidate proteins/peptides can be formed by comparing the MS1 mass to the masses of proteins/peptides in the database. MS2, or Tandem MS, is performed after fragmentation of the intact proteins. Peptide sequence tags are extracted from the MS2 data, and the candidate proteins can be further shortlisted by the PSTs.
The next step is exhaustive matching of all MS2 peaks with the theoretical fragments of the candidate proteins. The set of theoretical fragments contains all possibilities of fragmentation.
Theoretical vs experimental fragments comparison helps as the third stage for shortlisting candidate
protein list. This shortlisting will help you arrive at a small number of proteins
Three scoring schemes can be applied to score the match at each stage of protein search. These
scoring elements can be integrated to arrive at an overall candidate protein score.
Comparisons can be performed at various levels of information. These include MS1, MS2, PSTs and
theoretical fragments comparison. Integrated scoring schemes couple these factors.
A comprehensive scoring scheme can combine all the scores. Several optimizations can be undertaken on the scoring scheme to further improve protein identification.
An amino acid has three groups attached to its central (alpha) carbon: a carboxyl group, an amine group and an R group. The R group stands for the side chain, which differs from one amino acid to another.
Figure 0.3 polymerization of amino acids
Amino acids have unique properties such as polarity, charge states and interactions with water. Each of
these properties describes the overall characteristic of an amino acid.
Amino acids have characteristics like polarity, hydrophobicity, and charge states. These characteristics
are governed by the elemental composition of an amino acid’s side chain (R group).
Figure 5 hydrophilic group
Remember
Some Example
Remember these Example
Change in Charge
Upon polymerization of amino acids into polypeptide chains, the charges on the backbone amino and carboxyl groups are neutralized as peptide bonds form. At pH=7, five amino acids carry charged side chains, 2 negatively and 3 positively.
+ve
If pH < pK for an amino acid, the amine side chains gain a proton (H+) and become positively charged,
hence basic.
If pH > pK for an amino acid, the carboxyl side chains lose a proton (H+) and become negatively charged, hence acidic.
Depending on the pH, an amino acid may become charged. This may be positive or negative
depending on the amino acid.
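The dependence of charge on pH can be estimated with the Henderson-Hasselbalch relation. In the sketch below the function names are hypothetical, and the pK values used in the example (about 10.5 for a lysine side chain and about 3.9 for an aspartate side chain) are approximate textbook figures.

```python
# Side-chain charge versus pH (Henderson-Hasselbalch sketch).
def protonated_fraction(pH, pK):
    """Fraction of the group that currently holds its proton."""
    return 1.0 / (1.0 + 10 ** (pH - pK))

def side_chain_charge(pH, pK, kind):
    fraction = protonated_fraction(pH, pK)
    if kind == "basic":        # amine-type group: protonated form carries +1
        return fraction
    if kind == "acidic":       # carboxyl-type group: deprotonated form carries -1
        return -(1.0 - fraction)
    raise ValueError("kind must be 'basic' or 'acidic'")

lysine_like = side_chain_charge(7.0, 10.5, "basic")       # pH < pK: close to +1
aspartate_like = side_chain_charge(7.0, 3.9, "acidic")    # pH > pK: close to -1
```

When pH equals pK exactly half of the groups are protonated, which is why the charge changes most sharply around the pK.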
Figure 10 Aliphatic Amino Acids (Non polar C and H chains)
Side chains also influence some properties. Side chains comprising merely of Carbon and Hydrogen are:
Chemically inert,
Poorly soluble in water
However, side chains containing organic acids are very different. They are chemically reactive and soluble in water. Elemental composition plays a very important role in determining the properties of amino acids. Solubility and reactivity are key factors participating in protein folding.
Amino acids have several properties such as charge state, polarity and hydrophobicity. It is important
to note that the physical size of each amino acid also varies.
EXAMPLE-1: Glycine
Glycine residues increase backbone flexibility because they have no R group (only an H), hence agile.
EXAMPLE-2: Proline
Proline residues reduce the flexibility of polypeptide chains. Proline cis-trans isomerization is often a
rate-limiting step in protein folding.
Figure 12 cis and trans forms of proline
EXAMPLE-3: Cysteine
Cysteines cement together by making disulfide bonds to stabilize 3-D protein structures. In eukaryotes,
disulfide bonds can be found in secreted proteins or extracellular domains.
Figure 13 cystine
Amino acids not only have physical and chemical properties, but also structural properties. These
structural properties are equally important in giving rise to protein structures.
• Alpha Helices
• Beta Sheets
Which are stabilized by H-bonds!
The size and structure of each amino acid is unique. Coupled with their chemical properties, each
amino acid can uniquely contribute in the protein folding process.
Hydrophobic core formed by packed secondary structural elements provides compact, stable core.
Upon establishment of a stable protein core, unstable or reactive groups can be added.
"Functional groups" of a protein are attached to the hydrophobic core framework. The surface or exterior of a protein must have more flexible regions (loops) and polar/charged residues.
The very few hydrophobic "patches" on protein surface are involved in protein-protein interactions. The
active regions in a protein are almost all present on the surface.
Each component of the protein structure has a unique and precise role in the construction of proteins.
Hydrophobic and hydrophilic components have equally useful roles.
The alpha helix is an example of how amino acid chains fold. It is stabilized by H-bonds between every ~4th residue in the backbone. Reactive amino acids are exposed for external interactions.
Proteins are made by polymerization of amino acids on ribosomes, and the properties of proteins are linked to the properties of their amino acids. There are 20 amino acids in nature, each with a different chemical composition, and that is why each protein is different from the others.
Background:
Proteins comprise 20 different amino acids
Amino acids polymerize and form protein molecules
Proteins fold together to take 3D forms
Introduction:
But how does a protein actually fold? The answer is still unknown. Scientists have spent decades trying to find a definite answer to this question, but to no avail.
Folding of Proteins
• After polymerization of amino acids, linear chains are formed. When these chains of amino acids are put in water, the proteins fold spontaneously!
The folded protein molecule should have the lowest possible energy. Anfinsen's dogma (also known as the thermodynamic hypothesis) is a postulate in molecular biology that, at least for small globular proteins, the native structure is determined only by the protein's amino acid sequence. The native structure is a unique, stable and kinetically accessible minimum of free energy.
How do we know the final folding state of a protein?
Proteins fold spontaneously in water. Proteins fold to achieve thermodynamic stability. Proteins fold to
organize themselves for performing functions in cells.
Proteins are like functional machines in the cell, therefore understanding the folding behavior of proteins can help us in designing suitable drugs. If a protein is misfolded, it can lose its function. To study anomalies in structures and to discover newer structural forms, computational algorithms are used.
Background:
Proteins fold spontaneously
Proteins fold to achieve thermodynamic stability
Proteins fold to organize themselves for performing functions in cells
We can study the folding behavior of proteins computationally. First, we collect clues & evidence from experimentally reported structures. We then utilize these observations to analyze unknown structures. The manner in which a newly synthesized chain of amino acids transforms itself into a perfectly folded protein depends both on the intrinsic properties of the amino-acid sequence and on multiple contributing influences from the crowded cellular milieu (Dobson 2003).
Why study folding
Proteins are the functional machines in cells
Dysregulated protein expressions are a major cause of disease
Understanding protein folding helps design suitable drugs
Computational folding of proteins
o If a protein is misfolded, then it can lead to a lack of function in the protein
o To study anomalies in structures and to discover newer structural forms, computational algorithms are used
o How do we study folding, computationally?
o First, we collect clues & evidences from experimentally reported structures
o We utilize these observations to analyze unknown structures
Conclusions
• Given algorithms and procedures to fold a protein, we can fold amino acid chains to
form 3D proteins
• This can help us study misfoldings, interactions between drugs and proteins etc.
Dobson, C. M. (2003). "Protein folding and misfolding." Nature 426(6968): 884-890
Computing protein folding can help us study misfolding, interactions between drugs and proteins etc. However, it is first important to know the number of folding possibilities for a protein.
Let’s assume that each amino acid can take three different conformations: Alpha Helix, Beta Sheet and Loop. We know that proteins comprise 100s of amino acids.
If each amino acid can take 3 different conformations, and its parent protein has 100 amino acids, then there are 3^100 ≈ 5 x 10^47 possible combinations. If it takes 1/10th of a nanosecond (10^-10 s) to try each one, then computing all the folding possibilities will take about 1.6 x 10^30 years.
In fact, it takes a protein less than a second to fold. That is the amazing speed of folding.
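The arithmetic behind these numbers can be checked directly. The sketch below simply repeats the calculation under the same assumptions (3 conformations per residue, 100 residues, 10^-10 seconds per conformation); the variable names are illustrative.

```python
# Folding search-space arithmetic under the assumptions stated in the text.
conformations_per_residue = 3
residues = 100
seconds_per_conformation = 1e-10

total_conformations = conformations_per_residue ** residues      # 3^100
total_seconds = total_conformations * seconds_per_conformation
seconds_per_year = 365.25 * 24 * 3600
total_years = total_seconds / seconds_per_year

print(f"{total_conformations:.2e} conformations, {total_years:.2e} years")
```

The result is on the order of 10^47 conformations and 10^30 years, far longer than the sub-second folding times that are actually observed.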
Figure 0.77 Overall Energy (stability) of the Protein
This is called “Levinthal’s Paradox”. We will try to understand this folding process using experimental
datasets and algorithms. Molecular simulations are also helpful for it.
Electrostatic interactions
van der Waals interactions
Hydrogen bonds
Hydrophobic interactions
Figure 19 Anfinsen’s Experiment
All the information required for folding a protein into its native structure is present within the protein’s
amino acid sequence. The native folded form of protein is thermodynamically most stable as compared
to others
Figure 0.92 Step 2: Arrangement of secondary structures
Figure 20.11 Step 2: Including remaining amino acids and expanding the nucleus
Several models exist for folding a protein given its amino acid sequence. The fundamental requirement
is that the folding process remain spontaneous. There is still no definitive folding hypothesis.
Figure 25 Folding funnel
Figure 28 Cystatin – 3 (C) http://beautifulproteins.blogspot.com/
Protein structures are very complex yet they form spontaneously. We will investigate how to develop
algorithms to predict such structures.
Complex protein structures form spontaneously as a protein folds. A huge variety of protein structures
exist. Each structure is designed to perform a specific function. Interestingly, each protein mega
structure gets built out of only a few sub-structures. Combinations from the SMALL substructure set are
used to construct larger protein structures.
There are several levels of protein structure. Single-letter amino acid codes can be put together linearly to represent a protein sequence. This sequence is also called the primary sequence. The primary sequence can also be referred to as the 1’ structure. Sub-structures are formed as a result of the 1’ structure’s folding.
Folded sub-structures are called secondary protein structures. Secondary structures are also referred to as 2’ structures.
2’ sub-structures are packed together to form super structures. These protein super structures are called tertiary structures. Tertiary structures are also referred to as 3’ structures.
3’ structures represent the complete monomeric protein structure. 3’ structures can combine with other polypeptide units to form a quaternary structure.
Quaternary structures are also called 4’ structures. 4’ structures are exemplified by protein complexes etc.
Protein structures are organized into 1’, 2’, 3’ and 4’ modular conformations. We will investigate how to
develop algorithms to predict these structures
Figure 29 protein folding funnel
Edman Degradation
Tandem Mass Spectrometry
1’ structure databases are essentially protein sequence databases. Examples include Uniprot,
Swissprot amongst several others.
Protein sequences are the primary structures of proteins. The primary or 1’ structure of a protein determines its basic properties, and it lays the foundation for 2’ structures. 2’ structures are also referred to as secondary structures.
Figure 30.15 Types of Secondary Structures – Alpha Helix
Protein sequences fold onto themselves and make H-Bonds to create 2’ structures. Several types of 2’
structures exist. These include Alpha Helices and Beta Sheets.
Properties of Loop
Loops connect helices and sheets
Loops vary in length and 3-D configurations
Loops are mostly located on the surface of proteins
Loops are more tolerant of mutations
Loops are flexible and can adopt multiple conformations
Loops tend to have charged and polar amino acids
Loops are frequently components of active sites
Coils
Secondary structures that are not helices, sheets, or recognizable turns
Disordered regions, but they also appear to play important functional roles
Loops and coils are also secondary structures, which are among the first structures formed when a protein’s amino acid chain folds. Loops and coils are very important 2’ structures in that they form the active sites of proteins.
2’ structures include alpha helices, beta sheets, loops and coils. Upon combination of 2’ structures, a tertiary or 3’ structure is formed. The 3’ structure is the next level of structural organization.
Formation of 3’ structure
Hydrophobic interactions between nonpolar R-groups
Covalent bonds in the form of Disulphide bridges
Combinations of Alpha helices, Beta sheets, coils and loops help form 3’ structures. Covalent bonds,
Hydrogen bonds and hydrophobic interactions enforce the 3’ structure.
Figure 37 Example of Quaternary Structure See how 2’ and 3’ structures come together
Protein folding results in a linear chain of amino acids getting packed into a compact 3D structure. This
leads to a reduction in bond angles from an initial of 180 degrees (protein’s linear form)
Figure 39 Formation of Planar Peptide Bond
The resultant chain gets its own set of attributes and Peptide bond is planar & rigid.
Dihedral Angles
Angle between two planes (i.e. 4 points)!
Considering the middle two points to be aligned (or overlapped), i.e. looking along the line that joins them, the angle between the 1st, overlapped and 4th points forms the dihedral angle.
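The dihedral angle defined by four points can be computed from their coordinates. The sketch below uses the standard atan2 formulation with plain Python tuples; the function names are illustrative, and the same routine could be applied to four consecutive backbone atoms taken from a structure.

```python
import math

# Dihedral angle (in degrees) defined by four 3D points p1-p2-p3-p4.
def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral(p1, p2, p3, p4):
    b1, b2, b3 = _sub(p2, p1), _sub(p3, p2), _sub(p4, p3)
    n1 = _cross(b1, b2)               # normal of the plane through p1, p2, p3
    n2 = _cross(b2, b3)               # normal of the plane through p2, p3, p4
    b2_length = math.sqrt(_dot(b2, b2))
    y = b2_length * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))
```

For instance, the points (1,0,0), (0,0,0), (0,0,1) and (0,1,1) give an angle of 90 degrees, while a fully extended (trans) arrangement gives 180 degrees.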
Proteins fold into 3D structures. Phi and psi angles are taken up as a result of folding. These angles can
be measured towards understanding the protein structure.
φ (phi): the rotation about the bond between the amino group nitrogen and the alpha carbon (N-Cα)
Involves C'-N-Cα-C'
ψ (psi): the rotation about the bond between the alpha carbon and the carboxyl carbon (Cα-C')
Involves N-Cα-C'-N
Data as in (Lovell et al. 2003) showing about 100,000 data points for several amino-acids
A limited range of Phi and psi angles are taken up as a result of folding. This range of angles
constitutes the allowable range of torsion or rotation angles that are taken up by the protein.
C-Alpha atoms are traced to recreate a 3D protein structure. The choice is made while keeping planar
nature of the peptide bond in view. Later we will see how to insert side chains into the visual models as
well.
177 Structure Visualization – II
C-Alphas can be used to construct the backbone of a protein for its visualization. We also need a unit of measurement for expressing the atomic distances. The ångström (Å) is used to express the size of atoms, molecules and extremely small biological structures, the lengths of chemical bonds, and the arrangement of atoms in crystals.
C-Alpha atoms are traced to recreate a 3D protein structure. Distances between C-Alphas are measured in the unit “Angstrom”. A resolution of 1 Å is better than a resolution of 10 Å.
X-Ray Crystallography
Crystallography data gives relative positions of atomic coordinates
The data is obtained from diffractions by the atoms in a protein structure
The coordinates of each atom in x,y and z axis are output
Figure 48 x-ray crystallography
Crystallized proteins are used to determine protein structures. As X-rays diffract from the atoms in a protein, the atomic distances are noted. These distances in 3D are measured in Angstroms.
Figure 49 PDB File Format
The Protein Data Bank (PDB) contains protein structure information. It has atomic coordinates, including those of the C-Alphas, for over 50,000 proteins. Protein structures can be visualized using this information.
Proteins fold into 3D structures. Phi and psi angles are assumed as a result of folding. These angles
can be measured and viewed towards understanding the protein structure. To view a protein, we need
to evaluate the physical location of its atoms. Proteins have Carbon and Nitrogen in their backbone.
CA atomic coordinates
To trace the backbone of a protein, a trace of the CA atoms can be used
Note that CA atoms have the side chains attached to them
CA coordinates can be found in the PDB file
Protein structures can be visualized by tracing the CA atoms. Coordinates of CA atoms can be obtained
from the PDB. Next, we need a tool to plot these coordinates.
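Before plotting, the CA coordinates have to be pulled out of the PDB file. The sketch below assumes the fixed-column ATOM record layout of the PDB format (atom name in columns 13-16, residue name in 18-20, residue number in 23-26, x, y, z in 31-54). The helper names `format_atom` and `parse_ca_atoms` and the coordinates are made up for illustration; a real file would be read line by line from disk.

```python
import math

def format_atom(serial, name, res, chain, resseq, x, y, z):
    """Build one ATOM record in fixed PDB columns (hypothetical helper for sample data)."""
    return (f"ATOM  {serial:>5d} {name:<4s} {res:>3s} {chain}{resseq:>4d}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}")

def parse_ca_atoms(lines):
    """Return (residue name, residue number, (x, y, z)) for every CA atom."""
    trace = []
    for line in lines:
        # Columns are 1-based in the PDB format, Python slices are 0-based.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            res_name = line[17:20].strip()
            res_seq = int(line[22:26])
            xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
            trace.append((res_name, res_seq, xyz))
    return trace

# Made-up coordinates, chosen so that the two CA atoms are 3.8 Angstrom apart.
sample = [
    format_atom(1, " N  ", "MET", "A", 1, -1.458, 0.000, 0.000),
    format_atom(2, " CA ", "MET", "A", 1, 0.000, 0.000, 0.000),
    format_atom(3, " C  ", "MET", "A", 1, 0.550, 1.420, 0.000),
    format_atom(4, " CA ", "ALA", "A", 2, 3.800, 0.000, 0.000),
]

ca_trace = parse_ca_atoms(sample)
d = math.dist(ca_trace[0][2], ca_trace[1][2])  # CA-CA distance in Angstrom
print(ca_trace, round(d, 3))
```

Joining consecutive CA positions with line segments gives the backbone trace that the visualization tools draw.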
Online Tools
Rasmol and CHIME are basic tools for visualizing proteins
Swiss PDB Viewer offers several features such as protein surface view, alignment of several
proteins & modelling secondary structures
PyMOL is a python-script based tool for visualizing the protein structure
Cn3D is another tool which helps us visualize protein structures
It also provides for annotating protein structures
Protein structures are visualized using several online tools. These tools include Rasmol, CHIME, Swiss
PDB Viewer and Cn3D.
CPK: Corey-Pauling-Koltun Diagrams. In CPK diagrams, each atom is represented by a solid sphere.
The radius of each sphere is equal to the atomic van der Waals radius, so the sphere shows the volume of the atom.
http://www.danforthcenter.org/smith/MolView/Over/overview.html
Ribbon Diagrams
Ribbon diagrams are an easy and frequently used technique for representing protein structures.
Structure is represented by the secondary structures (fold) using simple cartoon figures.
They are also called cartoon diagrams
http://www.danforthcenter.org/smith/MolView/Over/overview.html
Figure 53 Colored Sticks Models
http://www.danforthcenter.org/smith/MolView/Over/overview.html
Protein structure visualization can be performed using several atomic representations. These include CPK, Ribbon and Ball & Stick Diagrams.
Minimizing Energy
We know that when a bond is formed between two atoms, energy is released. This leads to a situation where there is less free energy accessible to each atom for further interactions. So, proteins maximize the bonds that can be made between the side chains of their constituent amino acids.
The greater the number of bonds between the amino acids, the more stable a protein becomes.
Figure 54 Energies of Interactions www.ucdavis.edu
Energies of protein structures can be computed by first enumerating the types of interactions
between each atom. Then, accumulating the energy of each interaction towards calculating an
overall energy of a protein.
X-Ray Crystallography
Nuclear Magnetic Resonance (NMR) Spectroscopy
Figure 57 from Diffraction Patterns to Atomic Positions
Upon establishing the atomic positions and distances, we can then check for possible interactions between the different atoms. Atomic distances can help us classify interaction types, e.g. hydrogen bonds, electrostatic and polar interactions.
X-Ray Crystallography data shows that, in a helix, the hydrogen atom of the backbone amino (N-H) group of one amino acid comes close to the oxygen atom of the backbone carbonyl (C=O) group of the amino acid at the 4th neighboring position. Their atomic distance is ~1.9 Å and hence they are considered to be hydrogen bonded.
Figure 60 Carbons (Black) & Nitrogens (Blue): 1-5, 2-6, 3-7…
Helix Formers
Any of the 20 amino acids can be present in the backbone. Do amino acids differ in their preference to form a helix? Yes, “Helix Formers” are generally hydrophobic amino acids (M, A, L…). Alpha Helices are formed by hydrogen bonding between the C=O group of residue i and the N-H group of residue i+4 in the protein backbone.
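The i to i+4 pairing pattern can be listed with a few lines of code. This is a small sketch; the function name `helix_hbond_pairs` is an assumption made for this example, and it simply enumerates which residue positions would be paired in an ideal helix of a given length.

```python
def helix_hbond_pairs(n_residues):
    """Pairs (i, i+4) of 1-based residue positions bonded in an ideal alpha helix."""
    return [(i, i + 4) for i in range(1, n_residues - 3)]

pairs = helix_hbond_pairs(7)
print(pairs)  # [(1, 5), (2, 6), (3, 7)]
```

For a 7-residue helix this reproduces the 1-5, 2-6, 3-7 pattern named in the figure caption.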
This is called a parallel beta sheet.
Beta strands can make hydrogen bonds with each other and organize as beta sheets.
Beta Sheets have different Properties:
Beta Strand
Beta Sheet
Beta Barrel
Beta Sandwiches
Beta Barrels
A Beta Barrel is made of a single large beta sheet that twists and coils upon itself to form a closed structure, in which the first strand is hydrogen bonded to the last strand. Beta strands in beta barrels are typically arranged in an antiparallel fashion. https://en.wikipedia.org/wiki/Beta_barrel
Beta Sandwiches
Beta Sandwiches are made of two beta sheets which are usually twisted and packed so their strands
are aligned.
Figure 63 Illustration of the β-sandwich from Tenascin C (PDB entry: 1TEN).
Beta Sheets are formed by H bonds between 5–10 consecutive amino acids in one portion of the backbone and another 5–10 farther down the backbone. Beta strands may be adjacent (with a loop in between) or far apart, with other structures in between.
Loops are formed by the amino acids that lie between the Alpha Helices and Beta Sheets in a protein backbone.
Loops are variable in length and 3-D conformation. This variability allows loops to join Alpha Helices and Beta Sheets in a variety of ways.
Characteristics
Loops are mostly located on the surface of protein structure
Loops mutate in sequence at a much faster rate than Alpha Helices and Beta Sheets
Loops are flexible and can adopt multiple conformations
Loops dictate the overall structure of protein as they couple Alpha helices and beta sheets
Loop Properties
Loops are mostly comprised of charged and polar amino acids
Loops frequently participate as components of active sites
Types of Loops
Hairpin loops are two amino acids long and join anti-parallel Beta strands
Other Loops may be 3 to 4 amino acids long
Loops fall into various families
Loops are the third type of secondary structure after Alpha helices and Beta sheets. Loops are unique in that they are flexible and of variable length. Loops frequently form part of active sites.
Coils are those secondary structures formed by the protein backbone which are neither helices, sheets nor loops. In fact, coils do not have a consistent classifiable structure. Hence, coils are of random structure and random length.
Proteins have primary, secondary, tertiary & quaternary structures. Each level of protein structure
organization is known to impart specific characteristics to the protein.
Structural features of proteins tend to be more conserved than their sequences. Therefore, it may be useful to look at the secondary/tertiary structures when studying conservation.
Classification
The evolution of protein structures and their hierarchy is not systematized
Hence, we need to classify the function of protein by examining their secondary and tertiary
structures
Motifs (Non-functional combinations of secondary (2°) structures)
Domains are semi-independent functional structures in a protein. They have a stable structure and are over ~40 residues long. A protein may contain multiple domains.
Types of Domains
Alpha Domains
Beta Domains
Alpha/Beta Domains
Alpha + Beta Domains
Alpha & Beta Multi-Domains
Membrane & cell-surface proteins
So, by looking at proteins, we can list the domains present in each protein. Once domains in each
protein are listed, we can classify whole proteins into various types and classes.
Figure 72 Alpha / Beta: Triosephosphate isomerase (1hti)
Various types of domain architectures exist in proteins. Such architectures can be classified into
general structural classes. Databases can be made from classes.
Figure 74 Structural Classes
Class: Similar secondary structure content
Architecture (also called FOLD): Major structural similarity, SSEs in similar arrangement
Topology (Super Family): Probable common ancestry, family membership
Homology (Same Family): Clear evolutionary relationship, pairwise sequence similarity > 30%
CATH classifies proteins by their structural similarity. It also considers the internal organization of the
structural components in proteins.
Proteins are classified into various structural classes. CATH is one such system in which proteins are
organized into classes, architecture, topology and homology.
http://scop.mrc-lmb.cam.ac.uk/scop/
Figure 75 SCOP Classification Statistics
http://scop.mrc-lmb.cam.ac.uk/scop/count.html
FSSP - Family of Structurally Similar Proteins, based on the DALI algorithm. Pclass - Protein
Classification, based on the LOCK and 3Dsearch algorithms.
Proteins are assembled into primary (1°), secondary (2°), tertiary (3°) and quaternary (4°) structures. Protein sequence is less conserved than protein structure. Since protein structure dictates function, comparing two structures can help us evaluate whether the proteins perform the same or a similar function.
Figure 74 C-Alpha atoms in backbone
http://www.danforthcenter.org/smith/MolView/Over/overview.html
PDB coordinates of Alpha Carbons in the protein back bone can be used for comparison. In this way,
whole protein structure or domains etc. can be compared.
http://www.danforthcenter.org/smith/MolView/Over/overview.html
Whole protein structures can be compared by calculating the root mean square deviation (RMSD) between the positions of their Alpha Carbons. The lower the RMSD, the more similar the proteins are.
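As a rough illustration, the RMSD over N matched Alpha Carbon pairs is the square root of the mean squared distance between the pairs. The sketch below assumes the two structures have already been superposed and that the atoms are listed in matching order; the function name `rmsd` and the coordinates are made up for this example.

```python
import math

def rmsd(coords_a, coords_b):
    """RMSD between two equally long lists of (x, y, z) positions, in input units."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))

# Made-up CA traces: the second is the first shifted by 1 Angstrom along z.
trace_1 = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
trace_2 = [(0.0, 0.0, 1.0), (3.8, 0.0, 1.0), (7.6, 0.0, 1.0)]

print(rmsd(trace_1, trace_1))  # identical structures
print(rmsd(trace_1, trace_2))  # every atom displaced by 1 Angstrom
```

In practice the structures are first brought into the best possible overlap by rigid-body superposition, since the RMSD of unaligned coordinates mostly reflects where each structure happens to sit in space.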
Motifs, Domains and Full Proteins can be compared by using rigid-body superposition. Depending on the RMSD, proteins, their motifs and domains can be selectively compared.
Protein structures can be compared in multiple ways. Till now, we can compare proteins by their
motifs, domains and full structures. There are several advanced techniques for this as well.
204 Protein Structure Prediction
Complex protein structures enable proteins to perform complex functions. We know over a million protein sequences but only about 100,000 protein structures. Determining exact protein structures is very difficult. It is difficult to crystallize proteins, and even if we manage to obtain X-ray data for a protein, reconstructing the structure from it is extremely complex.
Since we know so many sequences, they can be used for predicting protein structures. This indeed is
possible and helpful.
The 3D structure of a protein is determined by its amino acid sequence. Note that we only know 100,000 3D protein structures, but 10 times more sequences. For those proteins whose structure is already known, can we examine their amino acid sequences to learn which amino acids favour which structures?
Given an amino acid sequence, look up the propensity table for each amino acid’s propensity for the various secondary (2°) structures. The product of these propensity values gives the overall propensity for the formation of each 2° structure.
You only need to compute propensities for a small number of secondary (2°) structures. The structure with the highest net propensity is the most probable secondary structure to be formed.
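The lookup-and-multiply idea can be sketched as follows. The propensity numbers below are illustrative values for a handful of amino acids, roughly in line with commonly quoted Chou-Fasman tables, and should be treated as assumptions; the function names are also made up for this example. Note that many descriptions of the method average the propensities over a window instead of multiplying them.

```python
from math import prod

# Illustrative (assumed) propensities for a few amino acids only.
PROPENSITY = {
    "A": {"helix": 1.42, "sheet": 0.83},
    "E": {"helix": 1.51, "sheet": 0.37},
    "M": {"helix": 1.45, "sheet": 1.05},
    "L": {"helix": 1.21, "sheet": 1.30},
    "V": {"helix": 1.06, "sheet": 1.70},
    "I": {"helix": 1.08, "sheet": 1.60},
    "Y": {"helix": 0.69, "sheet": 1.47},
    "G": {"helix": 0.57, "sheet": 0.75},
}

def net_propensity(segment, structure):
    """Product of the per-residue propensities for one secondary structure."""
    return prod(PROPENSITY[aa][structure] for aa in segment)

def most_probable_structure(segment):
    """Return the structure with the highest net propensity and all scores."""
    scores = {s: net_propensity(segment, s) for s in ("helix", "sheet")}
    return max(scores, key=scores.get), scores

print(most_probable_structure("AEML"))  # helix-favouring residues
print(most_probable_structure("VIYV"))  # sheet-favouring residues
```

Only two structure types are scored here to keep the table short; turns would be handled the same way with a third column.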
Alpha Helices are formed from 4 contiguous amino acids having an Alpha-Helix propensity over 1.0.
The Alpha-Helix stops if this propensity falls below 1.0. Once Alpha Helices are constructed, and
concluded, the remaining amino acids can be evaluated for Beta sheets and turns etc. Let’s see how
Beta sheets are evaluated using Chou Fasman Algorithm.
http://fasta.bioch.virginia.edu/fasta_www/chofas.htm
Chou Fasman Algorithm helps predict Alpha Helices, Beta Sheets and Turns. The algorithm is based
on statistical occurrence of Amino Acids in known structures.
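The helix rule described above (start a helix at 4 contiguous residues with propensity over 1.0, stop when the propensity falls below 1.0) can be sketched as a simple scan. This follows the simplified rule as stated in these notes; the published algorithm uses a wider nucleation window and window averages. The propensity values and the name `find_helices` are assumptions for illustration.

```python
# Illustrative (assumed) alpha-helix propensities for a few amino acids.
P_ALPHA = {"E": 1.51, "M": 1.45, "A": 1.42, "L": 1.21, "K": 1.16,
           "G": 0.57, "P": 0.57}

def find_helices(seq, p_alpha=P_ALPHA, window=4, threshold=1.0):
    """Return (start, end) index pairs, end exclusive, of predicted helices."""
    helices = []
    i = 0
    while i + window <= len(seq):
        # Nucleation: `window` contiguous residues, all above the threshold.
        if all(p_alpha[aa] > threshold for aa in seq[i:i + window]):
            end = i + window
            # Extension: continue until the propensity falls below the threshold.
            while end < len(seq) and p_alpha[seq[end]] >= threshold:
                end += 1
            helices.append((i, end))
            i = end
        else:
            i += 1
    return helices

sequence = "GGAEMLKGG"
helices = find_helices(sequence)
print([(s, e, sequence[s:e]) for s, e in helices])
```

Residues left outside the predicted helices are the ones that would next be evaluated for beta sheets and turns.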
Beta sheets can be predicted from primary amino acid sequences. Next, we will see the flowchart of
Alpha Helices and Beta Turns.
Now we have reviewed flowcharts for Alpha Helices and Beta Sheets. Next up is the flow chart for Beta
Turns.
214 Chou Fasman Algorithm – Flowchart III
Chou Fasman Algorithm helps predict secondary structures such as Alpha Helices, Beta Sheets and Turns. This module gives a step-by-step flowchart of the entire algorithm.
Alpha helices, beta sheets and turns can be predicted using Chou Fasman Algorithm. This algorithm is
based on statistical analysis of amino acid occurrences in proteins.
Secondary structure propensity values of alpha helix, beta sheet and turns should be recalculated with
the latest protein data sets.
IMPROVEMENTS
Special consideration for:
Nucleation regions
Membrane proteins
Hydrophobic domains
Consider variable coil and loop sizes besides the tetrapeptide turns
Structure Classification
Relationship between protein structure and function
There is a need to classify proteins
Hierarchy of classification
Structure visualization, classification and prediction equip us to perform functional evaluation of proteins. This is important for understanding diseases and designing drugs for treating them.
Which combination of identity and alignment length is best suited for structure prediction?
Background
Conclusions
Good sequence alignment and identity ensures that homology modelling will give
accurate results
Next, what is the workflow for homology modelling?
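Sequence identity and alignment length can be read directly off a pairwise alignment. The sketch below assumes the two sequences are already aligned, with "-" marking gaps; the function name `percent_identity` and the sequences are made up for this example. Definitions of identity differ in what they divide by, and here only columns without gaps are counted.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Return (percent identity, number of gap-free aligned columns)."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    aligned_cols = 0
    identical = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip columns where either sequence has a gap
        aligned_cols += 1
        if a == b:
            identical += 1
    if aligned_cols == 0:
        return 0.0, 0
    return 100.0 * identical / aligned_cols, aligned_cols

target   = "MKT-AYIA"   # hypothetical target sequence, aligned
template = "MKTLAY-G"   # hypothetical template sequence, aligned
identity, length = percent_identity(target, template)
print(round(identity, 1), length)
```

Both numbers matter together: a high identity over a very short alignment is weaker evidence for a usable template than the same identity over a long one.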
Introduction
Conclusions
Next, we will proceed to perform homology modelling
For that there is a seven step procedure which we will see in the next module.
1. Template recognition and initial alignment
2. Alignment correction
3. Backbone generation
4. Loop modeling
5. Side-chain modeling
6. Model optimization
7. Model validation
Conclusions
Homology modelling works in seven steps
It is a repetitive process
Next, we will look at each step in detail!
Conclusions
Now the template and target are selected
Next, we fine-tune the alignment and introduce corrections to resolve the mismatches and gaps
Conclusions
Template recognition and initial alignment
Alignment correction
Backbone generation
Loop modeling
Side-chain modeling
Model optimization
Model validation
Conclusions
The protein backbone is ready!
Next, loops were modelled and used to bridge gaps
Next step, using this backbone and loop choices, place the side-chains
For a given backbone conformation, tyrosine strongly prefers two rotamers, and the real side-chain may fit one of them!
Next….
Conclusions
Now we have minimized large errors
However, smaller errors may still exist
Next step, validate the model that we have constructed!
Limitations of Homology Modelling
Large Bias towards structure of template
Cannot study conformational changes
Cannot elicit new catalytic/binding sites
Conclusions
So how can we overcome such limitations?
Other strategies include: Threading, and Ab Initio Modelling
We will also examine online tools for each
Background
.log : log output from the run.
.B* : model generated in the PDB format.
.D* : progress of optimisation.
.V* : violation profile.
.ini : initial model that is generated.
.rsr : restraints in user format.
.sch : schedule file for the optimisation process.
Automated Modelling Servers
Swiss Model: http://swissmodel.expasy.org//SWISS-MODEL.html
Robetta: http://robetta.bakerlab.org/
3D Jigsaw: http://www.bmm.icnet.uk/servers/3djigsaw/
Phyre: http://www.sbg.bio.ic.ac.uk/phyre/
Conclusions
Homology modelling helps predict protein structures by using prior structural information
Several tools are available to perform homology modelling in a programmatic or automated way!
Introduction
A protein fold is defined by the way the secondary structure elements of the structure are
arranged relative to each other in space.
Common folds include the 4-helix bundle and the TIM barrel.
Introduction
5,000 stable folds in nature
Fold recognition: Finding the best fit of a sequence to a set of candidate folds
Conclusions
Fold recognition or Threading is a technique for predicting protein structures
It is useful in cases where homology modelling fails to predict quality structures
228 iTASSER
Background
Technique for predicting protein structures
Employed when homology modelling cannot predict quality structures
Conclusions
Threading involves “passing” the amino acid sequence through each fold in the
database
The best match is computed using a scoring function
Conclusions
Combinations of secondary structures come together to form the best prediction
Scoring typically involves using a Z-Score function based on energy of a molecule
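A Z-score expresses how far one fold's energy lies from the average over all candidate folds, in units of standard deviation. The sketch below uses made-up energies and hypothetical fold names purely for illustration; real threading programs use their own energy functions and reference distributions.

```python
from statistics import mean, pstdev

def z_scores(energies):
    """Map each fold to (energy - mean) / standard deviation over all folds."""
    values = list(energies.values())
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("all folds scored identically; Z-scores are undefined")
    return {fold: (e - mu) / sigma for fold, e in energies.items()}

# Hypothetical threading energies of one sequence on four candidate folds.
energies = {"fold_A": -120.0, "fold_B": -95.0, "fold_C": -100.0, "fold_D": -85.0}

z = z_scores(energies)
best_fold = min(z, key=z.get)  # lower energy is better, so the most negative Z wins
print(best_fold, round(z[best_fold], 2))
```

A strongly negative Z-score for the best fold suggests that the match stands out from the background of other folds rather than being only marginally better.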
Iterative threading assembly refinement (I-TASSER) server
Software for automated protein structure & function prediction based on the sequence-to-structure-to-function paradigm.
Steps:
1. Starts from the amino acid sequence
2. I-TASSER first generates 3D atomic models from multiple threading alignments and iterative structural assembly simulations.
i. The function of the protein is then inferred by structurally matching the 3D models with other known proteins.
3. Outputs full-length secondary & tertiary structures and functional annotations on ligand-binding sites
4. An estimate of the accuracy of the predictions is provided, based on the confidence score of the modeling
Conclusions
✓ iTASSER helps thread amino acid sequences on fold and secondary structure databases
✓ It also helps predict the function of the output structures
Advantages
Introduction
Conclusion
• 3D-1D methods convert structure and environment information into “profiles”
• Score for each amino acid is computed for each profile
Limitation
Computationally expensive
Suitable for proteins with less than 100 residues
Conclusion
Ab initio methods rely on computing the energies of folded proteins
The protein structures with the lowest energy are deemed as plausible predictions
Background
Rationale
Ab initio methods rely on computing the energies of folded proteins
The protein structures with the lowest energy are declared as plausible predictions
Sometimes not even slightly homologous proteins are available.
This renders homology modelling and threading/fold recognition futile
Also, newer protein structures continue to be discovered every day
These could not have been identified by methods which rely only on matching with available structures
Conclusion
Ab initio methods, in contrast, base their predictions on physical models for these
mechanisms
Energy released during the folding process is computed for predicting structure
2. Define an energy function mapping structures to energy values. We have to minimize this
later!!
3. Solve the computational problem of finding the global minimum.
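The difference between a local and the global energy minimum can be shown with a toy example. The energy function below is invented for illustration: it has two wells over a pair of torsion angles and is not a physical force field, and angle periodicity is ignored. An exhaustive grid search finds the global minimum, while a greedy downhill walk can get trapped in the shallower well.

```python
def toy_energy(phi, psi):
    """Invented two-well energy surface over two torsion angles (degrees)."""
    well_a = (phi + 60) ** 2 + (psi + 45) ** 2            # global minimum, E = 0
    well_b = (phi + 120) ** 2 + (psi - 135) ** 2 + 500    # local minimum, E = 500
    return min(well_a, well_b)

def grid_search(step=15):
    """Evaluate every grid point and return (energy, phi, psi) of the lowest."""
    best = None
    for phi in range(-180, 180, step):
        for psi in range(-180, 180, step):
            e = toy_energy(phi, psi)
            if best is None or e < best[0]:
                best = (e, phi, psi)
    return best

def local_descent(phi, psi, step=15):
    """Greedy walk: move to the best neighbouring grid point until none is lower."""
    current = toy_energy(phi, psi)
    while True:
        neighbours = [(phi + dphi, psi + dpsi)
                      for dphi in (-step, 0, step)
                      for dpsi in (-step, 0, step)]
        best_phi, best_psi = min(neighbours, key=lambda p: toy_energy(*p))
        best = toy_energy(best_phi, best_psi)
        if best >= current:
            return current, phi, psi
        phi, psi, current = best_phi, best_psi, best

global_min = grid_search()
trapped = local_descent(-120, 120)
print(global_min, trapped)
```

Exhaustive search works here only because there are two variables. A real protein has hundreds of torsion angles, which is why finding the global minimum is the hard computational step.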
1. Build an accurate initial model (including energy and forces).
2. Accurately simulate the dynamics of the protein folding process.
3. The native structure will steadily emerge.
Conclusion
✓ Start with an energy function
✓ Fold structures in order to obtain the most stable structure
✓ This structure will have the minimum energy
Background
Conclusion
The protein structure reporting the lowest energy is selected as the optimal structure
How easy is it to compute the “really” lowest energy of a folded protein?
Best Case Energy Function
Clear energy minimum in the native structure
Viable path towards this minimum
Global optimization finds the most stable structure
Background
Advantages
a. Ab Initio methods can fold any target sequence using only physical atomic properties
b. Predictions are mostly accurate and correctly describe the natural folding process
Disadvantages
1. Homology Modelling
2. Fold Recognition
3. Ab Initio Modelling
Conclusion
Homology modelling is performed in cases of high identity and alignment score
For the “Twilight zone”, other strategies are employed
1. Homology Modelling
2. Fold Recognition
3. Ab Initio Modelling
Conclusion
1. For low identity and alignment scores, a “Twilight zone” for structure prediction
exists
1. Homology Modelling
2. Fold Recognition
3. Ab Initio Modelling
Energy Optimization in Ab Initio Modelling
Conclusion
For cases where even the fold libraries do not give any high scoring matches, Ab Initio
strategies can help model the structure
However, this is a complex and computationally expensive process
Molecular Evolution
1. Insertions
2. Deletions
3. Substitutions
Phylogenetic Trees
1. Scaled Trees
2. Unscaled Trees
Phylogenetic Trees
Rooted Trees
Unrooted Trees
Important Concepts
Online tools:
Mascot
Sequest
Prosight PC
Definition of Bioinformatics
Need for Bioinformatics
Areas within Bioinformatics
Bioinformatics as an interdisciplinary area
Need to store, process and analyze biological data
Requirement of newer faster algorithms
Specific areas focused were:
Comparing sequences
Comparing structures
Predicting structures
We studied the basic algorithms for each topic
With evolution and growth of Bioinformatics, newer and better algorithms are now also available!
For advanced study in Genomics, you may take “Computational Genomics” course
Personalized Medicine
BIGDATA
You can make a startup company which manages and processes health BIGDATA!
All it needs is basic software development skills coupled with
The next Google, Facebook and Uber are going to emerge from Health and Bioinformatics
Pharmaceutical companies are investing into bioinformatics human resource development
Jobs Market
Pharmaceutical Giants
Research Centers & Universities
Hospital & Diagnostic IT departments