Bioinformatics Session11
Session 11
TBLASTN, PSI-BLAST, DELTA-BLAST …
Why are they around?
Which BLAST programs have you used so far?
• BLASTn
• BLASTp
• Smart BLAST
• Global Align
• Multiple Alignment
What are they used for?
When are they used? Example…
• BLASTn:
• Nucleotide to Nucleotide search
• BLASTp:
• Protein to Protein search
• BLASTn vs BLASTp: Which is preferred and why?
• Which scoring matrix/scheme is generally used and why?
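As a quick recap, the choice of BLAST program follows from the type of the query and the type of the database. The sketch below is only an illustration: the function name `choose_blast_program` and the labels "nucleotide" and "protein" are made up for this example and are not part of any BLAST tool.

```python
def choose_blast_program(query_type, database_type):
    """Return the BLAST program matching a query/database type pair.

    query_type and database_type are the illustrative labels
    "nucleotide" or "protein" (hypothetical, chosen for this sketch).
    """
    programs = {
        ("nucleotide", "nucleotide"): "blastn",   # DNA query vs DNA database
        ("protein", "protein"): "blastp",         # protein query vs protein database
        ("nucleotide", "protein"): "blastx",      # translated DNA query vs protein database
        ("protein", "nucleotide"): "tblastn",     # protein query vs translated DNA database
    }
    # Note: tblastx (translated DNA vs translated DNA) also takes a
    # nucleotide query and a nucleotide database; it is left out here.
    key = (query_type, database_type)
    if key not in programs:
        raise ValueError(f"unsupported combination: {key}")
    return programs[key]


print(choose_blast_program("protein", "nucleotide"))
```

For a protein query searched against genomic DNA or ESTs, the lookup returns `tblastn`.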
TBLASTN, PSI-BLAST, DELTA-BLAST …
Why are they around?
• BLAST Searching with Multidomain Protein: HIV-1 Pol
• The Gag‐Pol protein of HIV‐1 (NP_057849.4) is a multidomain protein
of 1435 AA residues with protease, reverse transcriptase, and
integrase domains.
Kinds of searches we can perform with such a viral protein:
• Graphical overview: clicking on domains takes you to domain databases
• List of alignments (query-anchored, with dots for identities)
• Figure labels: perfectly conserved residues; rarely substituted residues
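To make the query-anchored view with dots for identities concrete, here is a minimal sketch. It assumes the sequences are already aligned to the same length; the function name `query_anchored` and the toy sequences are invented for illustration and are not real BLAST output.

```python
def query_anchored(query, subjects):
    """Render aligned subjects against the query, with '.' for identities.

    Assumes every subject string is already aligned to the query
    (same length, '-' for gaps). Toy illustration only.
    """
    lines = [query]
    for subject in subjects:
        if len(subject) != len(query):
            raise ValueError("all sequences must have the aligned length")
        row = "".join(
            "." if q == s and q != "-" else s
            for q, s in zip(query, subject)
        )
        lines.append(row)
    return lines


for line in query_anchored("MKVLA", ["MKILA", "MRVLA"]):
    print(line)
```

Columns that show only dots are perfectly conserved among the displayed hits; columns with an occasional letter are rarely substituted.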
• Taxonomy report for a BLASTP search shows
an overview of which species have proteins
matching the HIV‐1 query.
• Most matches are viral, but others include
rabbit, fungal, pig, and insect sequences.
• To learn more about the distribution of Pol proteins
throughout the tree of life, we may further ask what
bacterial proteins are related to the viral HIV‐1 Pol
polyprotein.
[Figure labels: bacterial matches to the HIV-1 ribonuclease H domain; bacterial matches to the HIV-1 integrase core domain]
• This suggests that the ribonuclease H and integrase core domains of HIV‐1 match many dozens of bacterial
proteins.
• You can inspect pairwise alignments to confirm that the viral and bacterial proteins are homologous, often
sharing about 30% amino acid identity over spans of over 150 amino acids.
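When inspecting a pairwise alignment, percent identity can be checked by hand. The sketch below assumes two already-aligned strings with "-" for gaps and divides the identical positions by the full alignment length, which is how the "Identities" fraction is usually reported in BLAST output; the function name `percent_identity` and the sequences are illustrative.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity over the full alignment length (gap columns included).

    Assumes both strings come from the same pairwise alignment,
    so they have equal length and use '-' for gaps.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    identical = sum(
        1 for a, b in zip(aligned_a, aligned_b) if a == b and a != "-"
    )
    return 100.0 * identical / len(aligned_a)


# Toy example: 3 identical residues out of 4 aligned columns.
print(percent_identity("ACDE", "ACDF"))
```

An alignment with about 30% identity over a span of more than 150 residues would be reported by this function as a value near 30.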
BLAST searching HIV-1 pol against human
sequences
A general strategy for “Find-a-gene project” to practice BLAST
[Flowchart: BLASTX nr or BLASTP nr, then inspect the output]
2) Perform a TBLASTN search against a DNA database consisting of genomic DNA or ESTs.
Include the output of that BLAST search in your document.
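TBLASTN compares a protein query against a DNA database that is translated in all six reading frames. The sketch below illustrates only that translation step, using the standard genetic code; it is not how BLAST is implemented, and the function names and the toy sequence are assumptions for this example.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; unknown codons (e.g. with N) become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")
        for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Return translations for frames +1, +2, +3, -1, -2, -3."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for frame, protein in six_frame_translations("ATGGCCTAA").items():
    print(frame, protein)
```

A protein query can match any of the six translations, which is why TBLASTN can find coding regions in unannotated genomic DNA or ESTs.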
3) You need to distinguish between a perfect match to your query (not “novel”), a near match (might be “novel”, depending on the results), and a nonhomologous result.
Gather information about this “novel” protein
• At a minimum, identify the protein sequence of the “novel”
protein as displayed in the BLAST results.
• Propose a name for the novel protein (e.g., “Krishnazoa globin”),
and report the species from which it derives.
• It is very unlikely (but still possible) that you will find a novel gene
from an organism such as S. cerevisiae, human, or mouse,
because those genomes have already been thoroughly
annotated.
• It is more likely that you will discover a new gene in a genome
that is currently being sequenced, such as those of bacteria, mosses, or
protozoa.
4) Use the DNA sequence of the EST to perform a BLASTX search against the nonredundant (nr) database.
As an alternative strategy, take the encoded protein sequence and use it as a query in a BLASTP search of the
nonredundant (nr) database at NCBI.
Demonstrate that this gene, and its
corresponding protein, are novel
For the purposes of this course, “novel” is defined as follows.
• If there is a 100% identity match to a protein in the database from the same
species, then your protein is NOT novel (even if the match is to a protein with
a name such as “unknown”).
• If the best match is to a protein with < 100% identity to your query, then it is
likely that your protein is novel and you have succeeded.
• If there is a match with 100% identity but to a different species than the one
you started with, then you have succeeded in finding a novel gene.
• If there are no database matches to the original query from step (1), this
indicates that you have found a DNA/protein that is not homologous to the
original query. You should start over.
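The course definition of "novel" can be written as a small decision function. This is only a sketch of the rules above: the function name `classify_candidate`, its parameters, and the returned labels are invented for illustration, and the inputs are assumed to be read off the best hit of the BLASTX or BLASTP search against nr.

```python
def classify_candidate(best_hit_identity, same_species, matches_original_query=True):
    """Apply the course rules for deciding whether a candidate is novel.

    best_hit_identity: percent identity of the best nr hit (0 to 100).
    same_species: True if the best hit is from the candidate's species.
    matches_original_query: False if the nr results contain no match
        to the original query from step (1).
    All names here are hypothetical, chosen for this sketch.
    """
    if not matches_original_query:
        # Not homologous to the original query.
        return "start over"
    if best_hit_identity >= 100.0:
        if same_species:
            # Already in the database, even if named "unknown".
            return "not novel"
        # Identical protein, but known only from another species.
        return "novel"
    # Best match is below 100% identity.
    return "likely novel"


print(classify_candidate(100.0, same_species=True))
print(classify_candidate(87.5, same_species=False))
```

The first call corresponds to a protein that is already annotated in the same species; the second corresponds to a near match that is likely novel.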
Confirm that the novel protein is related to your
query
• Generate a multiple sequence alignment with your novel
protein, your original query protein, and a group of other
members of this family.
• Use at least 5 to 10 proteins in the multiple sequence
alignment; a reasonable maximum is 30.