Structure and Function of SARS-CoV-2 Spike Protein: A Multiple Sequence Alignment (MSA) Study
INTRODUCTION
SARS-CoV-2: The ongoing COVID-19 pandemic has been one of the most serious pandemics in
modern history, infecting 22.2 million people worldwide and causing 783,000 deaths as of
August 19, 2020. COVID-19 is caused by SARS-CoV-2, a virus belonging to the family
Coronaviridae (coronaviruses). These viruses have a single-stranded RNA genome and are
characterized by the “corona” of protein spikes projecting from the viral envelope.
Figure 1: Electron microscope image of avian coronavirus particles, showing the characteristic
spike proteins as a series of club-like projections surrounding the main viral capsule. (Source:
CDC Public Health Image Library)
NCBI: The National Center for Biotechnology Information (NCBI) is part of the National
Institutes of Health (NIH) and maintains a collection of public databases. These databases
are primarily used to store and distribute sequence information for genes, genomes and
proteins. Scientists from around the world can deposit and retrieve this information and use
it to perform their own experiments and data analyses.
BLAST: BLAST stands for Basic Local Alignment Search Tool. It is an algorithm for quickly and
efficiently searching the millions of sequence entries in the NCBI databases and retrieving those
which have the highest similarity to an input sequence.
• Obtain genome, gene, and protein data from the NCBI public database
• Use BLAST to obtain sequences of genes/proteins similar to a reference
• Align multiple gene/protein sequences to determine conserved features within a gene
family
MATERIALS
o SeaView http://doua.prabi.fr/software/seaview
EXPERIMENTAL PROCEDURES
A. Reference File Preparation
2. Use the search box at the top of the screen to search for “SARS-CoV-2”. This will search
all the NCBI databases for information about the novel coronavirus. The databases on
the search page are organized into six categories:
a. Literature: Scientific publications
b. Genes: Summaries of published information about specific genes
c. Proteins: Protein sequence data and related info
d. Genomes: Genome sequence data and related info
e. Genetics: Documentation of known variants of genes
f. PubChem: Biochemical info
3. In the “Genes” category, click on “Gene” to view the results from the Gene database.
This contains organized entries for individual genes from various organisms.
4. Select “surface glycoprotein” in the search results. This is the gene encoding the
coronavirus spike or S protein. This will bring you to a page summarizing the
information about the gene in all the NCBI databases:
5. Scroll down until you find the section titled “NCBI Reference Sequences (RefSeq)”.
These are the reference nucleotide and protein sequences for the gene. The “Genomic”
subsection contains the reference nucleotide sequence from the genome, while the
“mRNA and Protein(s)” section contains the reference protein sequence.
6. To obtain the nucleotide sequence, click on “FASTA” next to the “Download” heading.
This will take you to a page showing the nucleotide sequence of the gene in FASTA
format.
a. To download the sequence, in the upper right-hand corner of the page click on
“send to”, select “file”, make sure the selected format is “FASTA”, then click
“create file”. This will download a file called “sequence.fasta”.
7. To obtain the protein sequence, go back to the gene page and click on the first link under
the “mRNA and Protein(s)” heading. This is the accession number for the protein
sequence and will take you to the protein database entry for the gene.
a. Repeat step 6a on this page to download the sequence. Make sure the format is
“FASTA” before you download.
b. Rename the resulting file to something more informative before you continue.
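Before moving on, it can help to confirm that a downloaded file really contains the record you expect. The following is a minimal sketch in Python (standard library only) of reading FASTA-formatted text; the function name `parse_fasta` and the toy record are illustrative assumptions, not part of the NCBI tools.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header line closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines are joined into one string
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Toy record; the header and sequence are made up for illustration.
example = ">toy_record example sequence\nATGTTTGTT\nTTTCTTGTT\n"
for name, seq in parse_fasta(example):
    print(name, len(seq))
```

To inspect a real download, the file contents could be passed in instead of the toy string, for example the text read from “sequence.fasta”.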
B. BLAST
Four different basic local alignment search tools are available:
• Nucleotide BLAST – searches nucleotide database
• Protein BLAST – searches protein sequence database
• blastx - translates a nucleotide sequence into protein and searches against the protein
database
• tblastn – compares a protein sequence against the nucleotide database after translating
each database entry in all six reading frames
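The translation step that blastx and tblastn rely on can be sketched in a few lines of Python. This is only an illustration of codon-by-codon translation with the standard genetic code, not the BLAST implementation: it handles one forward reading frame at a time, whereas the BLAST programs consider all six frames (three forward, three on the reverse complement). The names `CODON_TABLE` and `translate` are assumptions made for this example.

```python
from itertools import product

# Standard genetic code, written in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(nucleotides, frame=0):
    """Translate one forward reading frame (0, 1 or 2); '*' marks a stop codon."""
    seq = nucleotides.upper().replace("U", "T")  # accept RNA or DNA letters
    protein = []
    for i in range(frame, len(seq) - 2, 3):
        # Codons containing ambiguous bases (e.g. N) are reported as X.
        protein.append(CODON_TABLE.get(seq[i:i + 3], "X"))
    return "".join(protein)


print(translate("ATGTTTGTTTTT"))
```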
For more information about each of the options in the figure above, refer to Box 1 below.
You can also specify a subrange of the entered sequence to search using the
“Query Subrange” options and give the search a descriptive title using the
“Job Title” option.
Box 1 continued
Program Selection
Allows you to choose the algorithm used to perform the search. There are a number of
variations on the BLAST algorithm which are optimized for different purposes; for
example, PSI-BLAST, a type of protein BLAST, searches the database iteratively to
collect more distantly related sequences that match the input pattern of protein
features. In most cases, the default options (MEGABLAST for nucleotide, blastp for
protein) are best for quickly searching the databases for similar sequences.
13. The first part of the screen simply gives details of the search as well as options for
filtering the results. The actual results are shown in the second part of the screen:
14. The important columns to note from the figure in Step 13 are:
• Description tells you the name of the gene/protein as well as the organism it comes
from.
• Query Cover tells you what percentage of the query sequence is included in the alignment
with the result, also called the subject.
• E value is the number of matches with an equal or better score that would be expected
by chance in a database of this size – smaller numbers mean the result is a more
significant match.
• Per. Ident. (percent identity) tells you what percentage of the aligned bases/amino
acids are identical between the query and the subject.
You may note that all or almost all of the results of your BLAST come from SARS-CoV-2
genomes and are 100% identical to your query. This is not very useful for gaining
insight into the structure or evolution of the spike protein, so you will need to filter out
SARS-CoV-2 from your results.
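To make the Query Cover and Per. Ident. columns more concrete, here is a minimal Python sketch of how such values can be computed for a single pairwise alignment. It assumes that percent identity is counted over all alignment columns (gaps included) and that query cover is taken from one aligned region; the values shown on the BLAST results page may combine several aligned regions, so treat this as an approximation. The function names are hypothetical.

```python
def percent_identity(query_aln, subject_aln):
    """Identical columns as a percentage of all alignment columns."""
    if len(query_aln) != len(subject_aln):
        raise ValueError("aligned sequences must have the same length")
    # A column counts as a match only if both letters agree and are not gaps.
    matches = sum(q == s and q != "-" for q, s in zip(query_aln, subject_aln))
    return 100.0 * matches / len(query_aln)


def query_cover(query_length, align_start, align_end):
    """Percentage of the query spanned by one alignment (1-based, inclusive)."""
    return 100.0 * (align_end - align_start + 1) / query_length


# Toy alignment: 4 of the 6 columns are identical.
print(round(percent_identity("ACGT-A", "ACCTTA"), 1))
# A 100-base query aligned from position 11 to position 60.
print(query_cover(100, 11, 60))
```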
15. Go back to the BLAST page and enter “SARS-CoV-2” into the “organism” box in the
“Choose Search Set” section, and make sure the “exclude” box next to it is checked.
Then click “BLAST” again to redo the BLAST search.
16. Scroll through your results and un-check any that are labeled “synthetic construct”,
“recombinant”, “clone” or similar. These are artificial nucleotide sequences which will
not be useful for our data analyses.
17. In the top bar, select “Download”, then “FASTA (aligned sequences)”. This will
download all the selected sequences into a single FASTA file called “seqdump.txt”.
Again, rename this to something more informative before continuing.
18. Go back to the main BLAST page, but now select “Protein BLAST”. The options for the
protein BLAST are very similar to those for the nucleotide BLAST (see Box 1). Note: the
search can take anywhere from 3 to 7 minutes, but the page will show a loading screen that
refreshes every few seconds while it is working.
19. Repeat the appropriate steps above with your protein sequence file.
When you download the protein file, select “FASTA (complete sequences)” instead of
“FASTA (aligned sequences)” to download the entire protein sequence for each record
instead of just the aligned portion. (We did not do this for the nucleotide sequences
because they mostly come from genome data, so downloading the complete sequences
would have downloaded the entire genome of each sample rather than just our desired
gene!)
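If an artificial sequence slips through the manual un-checking in Step 16, it can also be removed from the downloaded file afterwards. The sketch below is one possible way to do this in Python, assuming that artificial records can be recognized by keywords in their FASTA header lines. The keyword list mirrors the labels named in Step 16; the function name `filter_fasta` and the toy records are illustrative assumptions. Matching is by simple substring, which is crude, so check what was removed.

```python
EXCLUDE = ("synthetic construct", "recombinant", "clone")


def filter_fasta(text, exclude=EXCLUDE):
    """Return FASTA text without records whose header contains an excluded word."""
    kept, keep_current = [], False
    for line in text.splitlines():
        if line.startswith(">"):
            # Decide once per record, based on its header line.
            keep_current = not any(word in line.lower() for word in exclude)
        if keep_current:
            kept.append(line)
    return "\n".join(kept) + ("\n" if kept else "")


# Toy records; headers and sequences are made up for illustration.
toy = (
    ">A bat coronavirus\nACGT\n"
    ">B synthetic construct\nGGGG\n"
    ">C Recombinant clone 7\nTTTT\n"
    ">D pangolin coronavirus\nCCCC\n"
)
print(filter_fasta(toy))
```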
We will be using Multiple Alignment by Fast Fourier Transform (MAFFT), an online alignment
program, to perform multiple sequence alignments.
a. Open your sequence files from the previous steps using a text editing program
(Notepad, TextEdit, etc. NOT Word). The FASTA file format can be read in a text
editing program but you may have to manually select 'open with' to do so.
b. Copy your SARS-CoV-2 nucleotide sequence and paste it at the beginning of the
collection of nucleotide sequences you obtained in the last section. Save the file.
c. Repeat Step 20b with your SARS-CoV-2 protein sequence and your collection of
protein sequences from the previous section.
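The copy-and-paste in the sub-steps above can also be done programmatically, which avoids accidentally altering a sequence in the text editor. This is a minimal sketch in Python; the function names are hypothetical, and any file names passed to `prepend_reference` would be whatever you renamed your own downloads to.

```python
from pathlib import Path


def combine_fasta(reference_text, collection_text):
    """Return FASTA text with the reference record(s) placed first."""
    # strip() removes stray blank lines so records stay directly adjacent.
    return reference_text.strip() + "\n" + collection_text.strip() + "\n"


def prepend_reference(reference_path, collection_path, output_path):
    """Write a new file containing the reference followed by the collection."""
    combined = combine_fasta(
        Path(reference_path).read_text(), Path(collection_path).read_text()
    )
    Path(output_path).write_text(combined)


# Toy demonstration with in-memory text instead of real files.
print(combine_fasta(">ref\nAAA\n", ">hit1\nAAC\n\n"))
```

Writing to a separate output file rather than overwriting the collection keeps the original BLAST download intact in case the step needs to be repeated.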
22. Select “Choose File” and upload the nucleotide sequences that you just edited.
23. Make sure all the options in the second box have the “same as input” option selected.
This minimizes formatting differences between the input and output files.
24. Click “Submit” and wait for the alignment to finish. This should take no more than a
few minutes. Once the alignment is processed, you will be redirected to a page that
looks like this:
25. The default output format for MAFFT is a format called “Clustal”. To download a
FASTA formatted file containing the alignment, select “Fasta format” at the top of the
page. (Note that as usual, the file will have a strange name by default and should be
renamed before you continue.)
26. Open your preferred alignment viewer (e.g. AliView or SeaView) and drag the file you
just downloaded into the window. You should see something like this:
Note that each nucleotide is displayed in a different color, so you can easily see which
sequences are aligned and which are not.
27. Repeat Steps 22-26 using your protein sequences. When you open the protein alignment
in SeaView, it looks something like this:
Note that the amino acids in this alignment are colored based on their properties. Make
sure to pay attention to the actual text of the alignment to spot mutations.
DATA ANALYSIS
Nucleotide Sequences:
• From your BLAST results (on the NCBI website), which coronavirus sequence is most
similar to SARS-CoV-2?
• Find this sequence in your alignment. How many mutations are present between it and
SARS-CoV-2?
o Hint: In SeaView you can change the order of sequences by highlighting the
sequence name, then ctrl-click on another sequence name to move the
highlighted sequence next to it.
• You should have at least one SARS coronavirus sequence in your alignment. Compare
this sequence to SARS-CoV-2, as you did for the previous sequence.
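Counting mutations by eye is error-prone for long sequences. As a cross-check, the comparison can be sketched in Python as below. It assumes the two sequences have been copied out of the same alignment (so they have equal length and use “-” for gaps) and reports substitutions separately from gap positions; the function name `count_differences` is an illustrative assumption.

```python
def count_differences(aligned_a, aligned_b):
    """Return (substitutions, gap_positions) between two aligned sequences."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must come from the same alignment")
    substitutions = gaps = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" and b == "-":
            continue  # column is a gap in both sequences; nothing to compare
        if a == "-" or b == "-":
            gaps += 1  # insertion or deletion relative to the other sequence
        elif a != b:
            substitutions += 1
    return substitutions, gaps


# Toy alignment: one gap column and one substitution in the last column.
print(count_differences("ATG-CTA", "ATGACTT"))
```

The same function works for protein alignments, since it only compares letters column by column.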
Protein Sequences:
• Repeat the analysis of the nucleotide sequences with your protein data.
o Are your results the same?
o Why or why not?
• Perform a literature search to find the domains of the Coronavirus Spike protein.
Identify the domains in your alignment.
o Which domain is the most/least highly conserved? Why do you think this is?
o Based on your reading, which mutation(s) in SARS-CoV-2 do you think has the
greatest impact on the function of the protein? Explain your reasoning.
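One simple way to compare conservation between domains is to score each alignment column by the fraction of sequences that share its most common residue, then average the scores over each domain. The Python sketch below illustrates this under stated assumptions: all sequences come from the same alignment and have equal length, gaps are not counted as residues, and the domain boundaries are given as alignment column indices that you would take from your own literature search. The function names and the toy alignment are hypothetical.

```python
from collections import Counter


def column_conservation(alignment):
    """Per column: fraction of sequences carrying the most common residue."""
    scores = []
    for column in zip(*alignment):
        residues = [c for c in column if c != "-"]  # gaps are not residues
        top = Counter(residues).most_common(1)[0][1] if residues else 0
        scores.append(top / len(column))
    return scores


def mean_conservation(scores, start, end):
    """Average score over alignment columns start..end-1 (0-based)."""
    window = scores[start:end]
    return sum(window) / len(window)


# Toy protein alignment of three sequences; made up for illustration.
toy_alignment = ["MFVF", "MFLF", "MY-F"]
scores = column_conservation(toy_alignment)
print([round(s, 2) for s in scores])
print(round(mean_conservation(scores, 0, 2), 2))
```

A score of 1.0 means every sequence has the same residue in that column, so domains with averages close to 1.0 are the most highly conserved under this measure.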