Bioinformatics Session 11


Bioinformatics (BIO213)

Session 11
TBLASTN, PSI-BLAST, DELTA-BLAST …
Why are they around?
Which BLAST programs have you used so far?
• BLASTn
• BLASTp
• Smart BLAST
• Global Align
• Multiple Alignment
What are they used for?
When are they used? Example…
• BLASTn:
• Nucleotide to Nucleotide search
• BLASTp:
• Protein to Protein search
• BLASTn vs BLASTp: Which is preferred and why?
• Which scoring matrix/scheme is generally used and why?
TBLASTN, PSI-BLAST, DELTA-BLAST …
Why are they around?
• BLAST Searching with Multidomain Protein: HIV-1 Pol
• The Gag‐Pol protein of HIV‐1 (NP_057849.4) is a multidomain protein
of 1435 AA residues with protease, reverse transcriptase, and
integrase domains.
Kinds of searches we can perform with
such a viral protein:
• Graphical overview: clicking on a domain takes you to the domain databases.
• List of alignments (query-anchored, with dots for identities).
[Figure labels: perfectly conserved residues; rarely substituted residues]
• Taxonomy report for a BLASTP search shows
an overview of which species have proteins
matching the HIV‐1 query.
• Most matches are viral, but others include
rabbit, fungal, pig, and insect sequences.
• To learn more about the distribution of Pol proteins
throughout the tree of life, we may further ask what
bacterial proteins are related to the viral HIV‐1 Pol
polyprotein.

• Repeat the BLASTP search with NP_057849 as the query, but limit the search to “Bacteria”.
BLASTP searching HIV-1 pol against bacterial proteins
[Figure: bacterial matches to the HIV-1 retropepsin and reverse transcriptase domains, to the HIV-1 ribonuclease H domain, and to the HIV-1 integrase core domain]

• This suggests that the ribonuclease H and integrase core domains of HIV-1 match many dozens of bacterial proteins.
• You can inspect pairwise alignments to confirm that the viral and bacterial proteins are homologous, often sharing about 30% amino acid identity across spans of more than 150 amino acids.
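As a rough illustration of what "percent identity over a span" means, the sketch below counts identical columns in a gapped pairwise alignment and divides by the alignment length. The two aligned rows are invented toy strings, not real HIV-1 or bacterial sequence, and the function name is hypothetical.

```python
def percent_identity(aligned_query: str, aligned_subject: str) -> float:
    """Percent of alignment columns where both rows carry the same residue.

    Both inputs are rows of one pairwise alignment, so they must have the
    same length; '-' marks a gap and never counts as an identity.
    """
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_query:
        raise ValueError("alignment is empty")
    identical = sum(
        1
        for q, s in zip(aligned_query, aligned_subject)
        if q == s and q != "-"
    )
    return 100.0 * identical / len(aligned_query)


# Toy alignment rows (made up for illustration only).
query_row = "MKV-LAGT"
subject_row = "MRVALSGT"
print(percent_identity(query_row, subject_row))  # 5 identical columns out of 8
```

The span matters as much as the percentage: 30% identity across more than 150 aligned residues is far stronger evidence of homology than the same percentage across a dozen residues.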
BLAST searching HIV-1 pol against human sequences
Question: are there human homologs of the HIV-1 pol protein?
Query: HIV-1 Pol
Program: BLASTP
Database: human nr (nonredundant)
Matches: many human proteins share significant identity.
BLAST searching HIV-1 pol against human sequences
Question: are there human RNA transcripts corresponding to HIV-1 pol?
Query: HIV-1 Pol
Program: TBLASTN
Database: human ESTs
Matches: many human genes are actively transcribed to generate transcripts homologous to HIV-1 pol.

TBLASTN and TBLASTX help when searching across highly diverged species.
PSI-BLAST and DELTA-BLAST serve the same purpose.
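TBLASTN compares a protein query against a nucleotide database that is translated in all six reading frames, which is why it can find protein-level similarity in DNA or EST sequences that carry no protein annotation. The sketch below shows only that translation step, using the standard genetic code. It is not how BLAST is implemented internally, the function names are hypothetical, and the DNA string is a short toy example.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons from position 0; unknown codons become 'X'."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(dna: str) -> dict:
    """Translate both strands in three frames each (+1..+3 and -1..-3)."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


# Short toy DNA string used only to show the six frames.
for name, protein in six_frame_translations("ATGGTGCATCTGACT").items():
    print(name, protein)
```

Stop codons appear as `*` in the output; a frame that encodes a real protein segment usually runs for a long stretch without them.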
Using BLAST for gene discovery: FIND-A-GENE

• A common problem in biology is finding a new gene.
• Traditionally, genes and proteins were identified using the techniques of molecular biology and biochemistry.
• Such experimental biology approaches will always remain essential but have practical limitations.
• Bioinformatics approaches can also be useful to provide evidence for
the existence of new genes.
• For our purposes, a “new” gene is a DNA sequence in a database that has not been annotated (described).
• You may want to find new genes for many reasons:
• Can you think of a few?
A general strategy for “Find-a-gene project” to practice BLAST

Start with the sequence of a known protein → TBLASTN → Inspect the output → BLASTX nr or BLASTP nr

E.g., start with human beta globin (NP_000509) to search for a novel globin gene.
2) Perform a TBLASTN search against a DNA database consisting of genomic DNA or ESTs.
Include the output of that BLAST search in your document.
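The search in this step can be run on the NCBI web page. For those working with a local installation of NCBI BLAST+, the sketch below assembles an equivalent `tblastn` command line. It only builds and prints the command and does not run it. The file names, the database name `my_est_db`, and the E-value cutoff are placeholders chosen for illustration, and the helper function is hypothetical.

```python
import shlex


def build_tblastn_command(query_fasta, database, output_path, evalue=1e-5):
    """Assemble a BLAST+ tblastn call (protein query vs. nucleotide database).

    -outfmt 6 requests tab-separated output, one line per hit.
    All argument values passed in here are placeholders.
    """
    return [
        "tblastn",
        "-query", query_fasta,
        "-db", database,
        "-evalue", str(evalue),
        "-outfmt", "6",
        "-out", output_path,
    ]


# Hypothetical file and database names.
command = build_tblastn_command("beta_globin.fasta", "my_est_db", "tblastn_hits.tsv")
print(shlex.join(command))
```

To actually run the search, the list could be passed to `subprocess.run(command, check=True)`, assuming BLAST+ is installed and the nucleotide database has already been built.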
3) You need to distinguish between a perfect match to your query (not “novel”), a near match (might be “novel”, depending on the results), and a nonhomologous result.
Gather information about this “novel” protein
• At a minimum, identify the protein sequence of the “novel”
protein as displayed in the BLAST results.
• Propose a name for the novel protein (e.g., “Krishnazoa globin”),
and report the species from which it derives.
• It is very unlikely (but still possible) that you will find a novel gene
from an organism such as S. cerevisiae, human, or mouse,
because those genomes have already been thoroughly
annotated.
• It is more likely that you will discover a new gene in a genome
that is currently being sequenced, such as those of bacteria, mosses, or
protozoa.
4)
• Use the DNA sequence of the EST and perform a BLASTX query against the nonredundant (nr) database.
• As an alternative strategy, take the encoded protein sequence and use it as a query in a BLASTP search of the nonredundant (nr) database at NCBI.
Demonstrate that this gene, and its
corresponding protein, are novel
For the purposes of this course, “novel” is defined as follows.
• If there is a 100% identity match to a protein in the database from the same
species, then your protein is NOT novel (even if the match is to a protein with
a name such as “unknown”).
• If the best match is to a protein with < 100% identity to your query, then it is
likely that your protein is novel and you have succeeded.
• If there is a match with 100% identity but to a different species than the one
you started with, then you have succeeded in finding a novel gene.
• If there are no database matches to the original query from step (1), this
indicates that you have found a DNA/protein that is not homologous to the
original query. You should start over.
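The four rules above can be written as a small decision function. This is only a sketch of the course definition: the function name, argument names, and returned labels are hypothetical, and the inputs are assumed to be read by hand from the top BLASTX or BLASTP hit against nr.

```python
def classify_candidate(best_identity: float,
                       same_species: bool,
                       matches_query_family: bool) -> str:
    """Apply the course definition of "novel" to the best nr database match.

    best_identity        -- percent identity of the best match (0 to 100)
    same_species         -- True if that match comes from the candidate's species
    matches_query_family -- True if the database matches include the family of
                            the original query from step (1)
    """
    if not matches_query_family:
        # The candidate is not homologous to the starting protein.
        return "not homologous: start over"
    if best_identity >= 100.0 and same_species:
        # Already in the database for this species, even if named "unknown".
        return "not novel"
    if best_identity >= 100.0:
        # Identical protein, but only known from a different species.
        return "novel"
    return "likely novel"


print(classify_candidate(100.0, True, True))
print(classify_candidate(87.5, True, True))
print(classify_candidate(100.0, False, True))
print(classify_candidate(40.0, False, False))
```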
Confirm that the novel protein is a hit for your query
• Generate a multiple sequence alignment with your novel
protein, your original query protein, and a group of other
members of this family.
• A multiple sequence alignment typically uses at least 5 to 10
proteins, with a reasonable maximum of 30.
