Whole Bioinfo Record

BT17513 – BIOINFORMATICS LABORATORY
Name : ESHA A
Reg. no : 200401037
CONTENTS:
EXPT NO.  DATE  EXPERIMENT NAME  PAGE NO.  SIGN
1 BASIC COMMANDS
2 FILE COMMANDS
3 DIRECTORY COMMANDS
4 GENBANK
5 UNIPROT
6 BLAST
7 FASTA
8 PROTEIN DATA BANK
9 CLUSTAL W
10 PHYLOGENY
11 KEGG
12 EMBOSS
13 CATH
14 PFAM
15 SWISS MODEL
16 UCSC BROWSER
17 R PROGRAM
EX NO.: 1 BASIC UNIX COMMANDS
DATE:
AIM: To verify the basic Unix Commands and filters.
BASIC UNIX COMMANDS:
The list of basic commands used in Unix are as follows:
1. DATE COMMAND:
Display the server date and time
Syntax :$date
Input: $date
Output: Sat, Jul 15, 8:52:15 IST 2023
2. CALENDAR COMMAND:
Display the calendar of a particular month or year.
Syntax: $cal <month name> <year> or $cal <year>
Input: $cal Aug 2023
Output:
    August 2023
Su Mo Tu We Th Fr Sa
       1  2  3  4  5
 6  7  8  9 10 11 12
13 14 15 16 17 18 19
20 21 22 23 24 25 26
27 28 29 30 31
3. ECHO COMMAND:
It is used to print on the screen whatever text is typed on the command line.
Syntax: $echo <text to be displayed on the screen>
Eg: $echo Welcome to Bioinformatics Laboratory 2023
Input: $ echo hello
Output: hello
4. BANNER COMMAND:
It is used to print the message in large letters to give the impression of a banner.
Syntax: $banner <text>
Input: $banner BIOINFORMATICS
Output: -bash: banner: command not found
5. WHO COMMAND:
Display the information about all users who have logged into the system currently.
Syntax: $who
Output: vostro@dell ~
6. WHO AM I COMMAND:
It gives the login details of the current session, i.e., the user name, terminal name, and date and time of login.
Syntax: $who am i
Output: vostro
vostro@dell ~
7. EXIT COMMAND:
It is used to log out from the user session.
Syntax: $exit
Output: The Cygwin window was closed.
8. CLEAR COMMAND:
It is used to clear the screen.
Syntax: $clear
Output: The screen was cleared.
9. LOGNAME COMMAND:
It is used to display the current user name.
Syntax: $logname
10. ID COMMAND:
The id command is used to display the numerical values that correspond to your login name,
i.e., every valid UNIX user is assigned a login name, a user ID and a group ID.
Syntax: $id
Output: uid=197609(vostro) gid=197121(None) groups=197121(None),545(Users),4(INTERACTIVE),66049(CONSOLE LOGON),11(Authenticated Users),15(This Organization),113(Local account),4095(CurrentSession),66048(LOCAL),262154(NTLM Authentication),401408(Medium Mandatory Level)
11. TTY COMMAND:

The tty (teletype) command is used to know the name of the terminal that we are using.
Syntax: $tty
Output: /dev/pty0
12. UNAME COMMAND:

Display the name of the operating system.
Syntax: $uname
Output: CYGWIN_NT-10.0-22621
RESULT:
Thus, the basic UNIX and filter commands are verified.
EX NO.: 2 FILE COMMANDS
DATE:
AIM:
To create the file(s) and verify the file handling commands.
FILE COMMANDS:
The file commands used in Unix are as follows:
1. TOUCH COMMAND: Creates file(s) of zero-byte size.
Syntax: $touch <filename>
Input: $touch biotech
2. CAT COMMAND: Creates a file with data and displays the data in a file.
Syntax: $cat > <filename> (create), $cat <filename> (display)
Input: $cat > testfile
Welcome to BT
(press Ctrl+D to end input)
Output: $cat testfile
Welcome to BT
3. CP COMMAND: It is used to copy the contents of one file to another file, or to copy a file
from one folder to another folder.
Syntax: $cp <source filename> <destination filename>
Input: $cp testfile newfile
Output: Welcome to BT (contents of newfile; cp itself prints nothing on success)
4. MV COMMAND: It is used to rename the file(s).
Syntax: $mv <oldname> <newname>
Input: $mv testfile newfile
Output: Welcome to BT (contents of newfile; mv itself prints nothing on success)
5. RM COMMAND: It is used to remove the file(s).
Syntax: $rm <filename(s)>
Input: $rm testfile
Output: The file no longer exists.
6. FILE COMMAND: It is used to determine the type of the file.
Syntax: $file <file name(s)>
Input: $file testfile
Output: testfile: ASCII text
7. WC COMMAND: This command is used to display the number of lines, number of words
and number of characters in a file.
Syntax: $wc <file>
Input: $wc testfile
Output: 1 7 31 testfile
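The three counts reported by wc can be reproduced with a short Python sketch. The function name wc_counts and the sample text are illustrative assumptions and are not part of the experiment; for plain ASCII text the character count equals the byte count that wc prints by default.

```python
def wc_counts(text):
    """Return (lines, words, characters) roughly the way `wc` counts them."""
    lines = text.count("\n")   # wc counts newline characters
    words = len(text.split())  # whitespace-separated tokens
    chars = len(text)          # equals the byte count for ASCII text
    return lines, words, chars


# Hypothetical sample text, not the contents of the file used above.
sample = "Welcome to the Bioinformatics Laboratory\n"
print(wc_counts(sample))
```

The order of the returned values follows the order of the columns in the wc output: lines, words, characters.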
RESULT:
Thus, the files are created and file handling commands are verified.
EX NO.: 3 DIRECTORY COMMANDS
DATE:
AIM: To create the directories and verify the directory commands.
DIRECTORY COMMANDS: The directory commands used in Unix are as follows:
1. PWD COMMAND: It is used to know the current working directory.
Syntax: $pwd
Input: $pwd
Output: /home/91730
2. MKDIR COMMAND: It is used to create an empty directory.
Syntax: $mkdir <directory name>
Input: $mkdir file1
3. CD COMMAND: It is used to move from one directory to another directory.
Syntax: $cd <directory name>
Input: $cd file1
Output: 91730@Desktop-EMB97310 ~/file1
4. RMDIR COMMAND: It is used to remove the directory only if it is empty.
Syntax: $rmdir <directory name>
Input: $rmdir bioinfo
5. MV COMMAND: It is used to rename the directory.
Syntax: $mv <old directoryname> <new directoryname>
Input: $mv file1 testfile
6. LS COMMAND: It is used to view the contents of the directory.
Syntax: $ls
Input: $ls
Output: All files in the directory are listed in alphabetical order.
RESULT: Thus, all directory commands are verified.

EX NO.: 4 GENBANK
DATE:
AIM:
To retrieve the gene sequence of Oxidase from GENBANK database and to interpret the results.
DESCRIPTION:
The GenBank sequence database is an open-access, annotated collection of all publicly available
nucleotide sequences and their protein translations. This database is produced at the National Center for
Biotechnology Information (NCBI) as part of the International Nucleotide Sequence Database
Collaboration (INSDC). GenBank and its collaborators receive sequences produced in laboratories
throughout the world from more than 100,000 distinct organisms. GenBank continues to grow at an
exponential rate, doubling every 18 months. Release 155, produced in August 2006, contained over 65 billion
nucleotide bases in more than 61 million sequences. GenBank is built by direct submissions from individual
laboratories as well as bulk submissions from large-scale sequencing centres. Direct submissions are
made to GenBank using BankIt, which is a web-based form, or the stand-alone submission program
Sequin. Upon receipt of a sequence submission, the GenBank staff assign an accession number to the
sequence and perform quality assurance checks. The submissions are then released to the public database,
where the entries are retrievable by Entrez or downloadable by FTP. Bulk submissions of Expressed
Sequence Tag (EST), Sequence Tagged Site (STS), Genome Survey Sequence (GSS) and High-Throughput
Genomic Sequence (HTGS) data are most often submitted by large-scale sequencing centres. The GenBank
direct submissions group also processes complete microbial genome sequences.
PROCEDURE:
1. The homepage of GENBANK database was opened and the PROTEIN database was selected
from the list of databases provided.
2. The name of the protein whose sequence is to be retrieved was entered in the textbox titled
ENTER THE KEYWORD present at the top of the page.
3. The boolean operators AND, NOT, OR and the limits were used to narrow down the search
process in order to search the protein of interest.
4. On pressing the search button, the result page displays the links of the proteins related to the
protein of interest.
5. The results were observed and interpreted.
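The way the Boolean operators in step 3 narrow a search can be sketched in Python. This is only an illustration of how a query string is assembled: the helper build_query and the example terms are assumptions, and [Organism] is used here as an Entrez-style field tag rather than as the exact query entered in this experiment.

```python
def build_query(include, any_of=(), exclude=()):
    """Combine search terms with the Boolean operators AND, OR and NOT."""
    parts = [" AND ".join(include)]          # every term here must match
    if any_of:
        # at least one of these alternatives must match
        parts.append("(" + " OR ".join(any_of) + ")")
    query = " AND ".join(parts)
    for term in exclude:                      # records matching these are dropped
        query += " NOT " + term
    return query


# Hypothetical search terms for illustration only.
print(build_query(["oxidase", "Pseudomonas fluorescens[Organism]"],
                  exclude=["partial"]))
```

AND narrows the result set, OR widens it, and NOT removes unwanted records, which is why combining them shortens the list of links returned in step 4.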
OBSERVATION:
Locus: OV986001
Base pairs: 6722400 bp
Shape: Circular
Beginning & end of the gene: 1 to 6722400
Beginning & end of introns & exons(if present): Nil
Definition: Pseudomonas fluorescens SBW25 genome assembly, chromosome: 1.
Reference: 1, 2, 3
Journals: [1] Max Planck Institute for Evolutionary Biology, Microbial Population Biology (Rainey Lab),
August-Thienemann-Strasse 2, 24306 Ploen, Germany.
Taxon: 216595
Protein ID: CAI2794342.1
INTERPRETATION:
The gene sequence of Oxidase was searched in the nucleotide database; its features were retrieved and
studied. The locus is OV986001, the sequence comprises 6722400 base pairs and its shape is
circular. The sequence begins at 1 and ends at 6722400. Introns and exons are absent, since the
record retrieved is a bacterial chromosome, in which genes generally lack introns. GenBank defines it as
Pseudomonas fluorescens SBW25 genome assembly, chromosome: 1. The references cited are 1, 2
and 3. The protein ID for oxidase is CAI2794342.1 and the taxon ID is 216595.
RESULT:
Thus, the gene sequence of Oxidase was retrieved from Genbank and the results were interpreted.
EX NO.: 5 UNIPROT
DATE:
AIM:
To retrieve the sequence of Hemoglobin from the UNIPROT database and to interpret the results.
DESCRIPTION:
The Universal Protein Resource (UniProt) is a comprehensive resource for protein sequence and
annotation data. The UniProt databases are the UniProt Knowledgebase (UniProtKB), the UniProt Reference
Clusters (UniRef) and the UniProt Archive (UniParc). The UniProt Metagenomic and Environmental Sequences
(UniMES) database is a repository specifically developed for metagenomic and environmental data.
UniProt is a collaboration between the European Bioinformatics Institute (EBI), the Swiss Institute of
Bioinformatics (SIB) and the Protein Information Resource (PIR). The UniProt Knowledgebase is the
central hub for the collection of functional information on proteins, with accurate, consistent and rich
annotation. In addition to capturing the core data mandatory for each UniProtKB entry, as much annotation
information as possible is added. This includes widely accepted biological ontologies, classifications, and clear
indications of the quality of annotation in the form of evidence attribution of experimental and
computational data. UniProtKB consists of two sections: a section containing manually annotated
records with information extracted from the literature and curator-evaluated computational analysis, and a
section with computationally analyzed records that await full manual annotation. For the sake of continuity
and name recognition, the two sections are referred to as UniProtKB/Swiss-Prot and UniProtKB/TrEMBL.
PROCEDURE:
1. The uniprot website [uniprot.org] was opened.
2. The name of the protein to be studied was entered in the textbox titled ENTER
THE KEYWORD present at the top of the page.
3. The Boolean operators AND, NOT, OR and the limits were used to narrow down
the search process in order to search the protein of interest.
4. On pressing the search button, the result page displays the links of the proteins
related to the protein of interest.
5. The most appropriate protein is selected and the information pertaining to
the following criteria are noted :
a. NAMES AND ORIGIN:
i. Protein name:
ii. Gene names:
iii. Organism:
iv. Taxonomic identifier:
b. PROTEIN ATTRIBUTES:
i. Sequence length:
ii. Sequence status:
iii. Sequence processing:
iv. Protein existence:
c. GENERAL ANNOTATIONS:
i. Function:
ii. Sequence similarities:
d. ONTOLOGIES:
i. Biological process:
ii. Coding sequence diversity:
iii. Ligands:
iv. Molecular function:
e. SEQUENCE ANNOTATION (FEATURES):
i. Metal binding sites:
ii. Natural variation sites:
iii. Sequence conflict sites:
f. SECONDARY STRUCTURES:
i. Helices:
ii. Turns:
iii. Sheets:
OBSERVATION:
Sequence of Hemoglobin subunit epsilon (Homo sapiens)
NAMES AND ORIGIN:
1) Protein name: Hemoglobin subunit epsilon
2) Gene names: HBE1
3) Organism: Homo sapiens ( Human )
4) Taxonomic identifier: 9606 NCBI
PROTEIN ATTRIBUTES:
1) Sequence length: 147 amino acids
2) Sequence status: UniProtKB reviewed (Swiss-Prot)
3) Sequence processing: The displayed sequence is further processed into a mature form.
4) Protein existence: Evidence at protein level
GENERAL ANNOTATIONS:
1) Function: The epsilon chain is a beta-type chain of early mammalian embryonic
hemoglobin.
2) Sequence similarities: Belongs to the globin family.
ONTOLOGIES:
1) Biological process: Oxygen transport, Transport
2) Ligands: Heme, Iron, Metal-binding
3) Molecular function: Heme binding
SEQUENCE ANNOTATION (FEATURES):
1) Metal binding sites: Fe of heme b, distal binding residue (position 64); Fe of heme b,
proximal binding residue (position 93)
2) Natural variation sites: Nil
3) Sequence conflict sites: 143
SECONDARY STRUCTURES:
1) Helices: 6-18, 24-35, 37-41, 44-46, 52-57, 59-76, 87-94, 102-119, 120-122, 125-142
2) Turns: 21-23, 83-86, 95-97
3) Sheets: Nil
INTERPRETATION:
Hemoglobin subunit epsilon of Homo sapiens (human) is a 147-amino-acid protein encoded by the gene
HBE1; its taxonomic identifier is 9606 (NCBI). The UniProtKB entry is reviewed (Swiss-Prot), and the
displayed sequence is further processed into a mature form. The protein belongs to the globin family and
functions in oxygen transport. Its metal binding sites are iron (heme distal ligand) at position 64 and iron
(heme proximal ligand) at position 93. There are no natural variation sites, and the sequence conflict site is 143.
RESULT:
Thus, the sequence of Hemoglobin subunit epsilon (human) was retrieved from the UNIPROT database
and the result was interpreted.
EX NO.: 6 BLAST
DATE:
AIM:
To find the sequences similar to Cellulase (Aspergillus aculeatus) from Protein Data Bank using BLAST.
DESCRIPTION:
The Basic Local Alignment Search Tool (BLAST) is an algorithm for comparing biological sequence
information, such as the amino acid sequences of different proteins or the nucleotides of DNA sequences. A BLAST
search enables a researcher to compare a query sequence with a library or database of sequences and
identify library sequences that resemble the query sequence above a certain threshold. The BLAST program
has been designed for speed, with a minimal sacrifice of sensitivity to distant sequence relationships. The
scores assigned in a BLAST search have a well-defined statistical interpretation, making real matches easier
to distinguish from random background hits. BLAST uses a heuristic algorithm that seeks local as opposed to
global alignments and is therefore able to detect relationships among sequences that share only isolated
regions of similarity. There are many types of BLAST available from the BLAST web page. Choosing the right
one depends on the type of sequence you are searching (long or short, nucleotide or protein) and the desired
database. They are:
1. blastp: protein BLAST; searches a protein database using a protein query.
2. PSI-BLAST: Position-Specific Iterated BLAST.
3. blastx: searches a protein database using a translated nucleotide query.
4. blastn: nucleotide BLAST; searches a nucleotide database using a nucleotide query.
5. tblastx: searches a translated nucleotide database using a translated nucleotide query.
6. tblastn: searches a translated nucleotide database using a protein query.
PROCEDURE:
1. Open the BLAST search page (http://blast.ncbi.nlm.nih.gov/Blast.cgi).

2. Select the appropriate program: “protein blast”.
3. If the accession number is given, enter it in the box titled “enter accession
number”. If only the name of the protein is known, obtain its sequence in FASTA format
(for example, in a text editor), then copy and paste the FASTA sequence into the box titled
“enter FASTA sequence”.
4. Under the database tool box, choose “UniProtKB/Swiss-Prot (swissprot)”.
5. Now click BLAST.
6. In the search result page, the first panel displays the query ID, molecular type and
query length.
7. Below this panel the sequence alignment is shown by red bars. The query sequence
aligns with other similar protein sequences in the database that range from 0 to over
4000 amino acids in length.
8. The accession numbers and descriptions of the sequences are listed in decreasing order of BLAST
score.
9. Sequences with a high score and a low E-value are preferred.
10. As we scroll down, the “alignment” section shows the actual amino acid
sequences aligned against our sequence.
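Steps 8 and 9 amount to ranking hits by score and E-value, which can be sketched in Python as follows. The Hit class and rank_hits function are illustrative assumptions; the first hit uses the score, E-value and identity recorded in the observation of this experiment, while the second hit is entirely made up for comparison.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    accession: str
    bit_score: float
    e_value: float
    identities: int    # number of identical aligned positions
    align_length: int  # alignment length, including gaps

    @property
    def percent_identity(self):
        return 100.0 * self.identities / self.align_length


def rank_hits(hits):
    """Best hits first: highest bit score, ties broken by lowest E-value."""
    return sorted(hits, key=lambda h: (-h.bit_score, h.e_value))


hits = [
    Hit("HYPOTHETICAL_1", 54.3, 2e-10, 40, 150),  # invented hit for illustration
    Hit("P22669.1", 122.0, 4e-33, 86, 222),       # values recorded in this experiment
]

for hit in rank_hits(hits):
    print(hit.accession, hit.bit_score, hit.e_value, round(hit.percent_identity))
```

An identity of 86/222 corresponds to about 38.7%, which BLAST reports rounded as 39%.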
OBSERVATION:
Protein name: Cellulase
Source: Aspergillus aculeatus
Organism: Aspergillus aculeatus
Authors: Ooi, T., Shinmyo, A., Okada, H., Murao, S., Kawaguchi, T. and Arai, M.
Title: Complete nucleotide sequence of a gene coding for Aspergillus aculeatus cellulase.
Journal: Nucleic Acids Res. 18 (19), 5884 (1990)
Query id: lcl|Query_76352
Molecular type: amino acid
Query length: 216
No. of hits: 2
Blast score: 122 bits (305)
Query coverage: 97%
E-value: 4e-33
Identity: 86/222 (39%)
Accession no: P22669.1
INPUT SEQUENCE:
BLAST OUTPUT:
INTERPRETATION:
The protein assessed using BLAST is Cellulase from the organism Aspergillus aculeatus.
Cellulase is an enzyme that helps break down cellulose into monosaccharides. The query length is
216 amino acids and the RID is WXUFDUPS016. The search returned 2 hits with a BLAST score of 122 bits.
The query ID is lcl|Query_76532.
RESULT:
Thus, the sequences similar to Cellulase of Aspergillus aculeatus were analysed using BLAST.
EX NO.: 7 FASTA
DATE:
AIM:
To find the sequences homologous and non-homologous to Globin from PIR1 database using
FASTA.
DESCRIPTION:
FASTA is a protein and DNA sequence alignment software package first described by David J. Lipman
and William R. Pearson in 1985 in the article “Rapid and sensitive protein similarity searches”. The
original program, FASTP, was designed for protein sequence similarity searching. FASTA, described in
1988, added the ability to do DNA-to-DNA searches and translated protein-to-DNA searches, and also
provided a more sophisticated shuffling program for evaluating statistical significance. There are several
programs in this package that allow the alignment of protein sequences and DNA sequences. The current
FASTA package contains programs for protein-protein, DNA-DNA, protein-translated DNA (with
frameshifts) and ordered and unordered peptide searches. Recent versions of the FASTA package include special
translated search algorithms that correctly handle frameshift errors (which six-frame translated searches
do not handle very well) when comparing nucleotide to protein sequence data. In addition to rapid
heuristic search methods, the FASTA package provides SSEARCH, an implementation of the optimal
Smith-Waterman algorithm. A major focus of the package is the calculation of accurate similarity statistics so
that biologists can judge whether an alignment is likely to have occurred by chance or whether it can
be used to infer homology. The FASTA package is available from fasta.bioch.virginia.edu. A web
interface for submitting sequences and running FASTA searches against online databases is also
available from the European Bioinformatics Institute (EBI).
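The Smith-Waterman local alignment mentioned above can be illustrated with a minimal Python sketch. It is a simplification under stated assumptions: it uses a flat match/mismatch score and a linear gap penalty instead of the BLOSUM50 matrix and open/extension penalties used in this experiment, it returns only the best local score without the traceback, and the function name and default scores are chosen for illustration.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of sequences a and b (linear gap penalty)."""
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1]
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = H[i - 1][j] + gap    # gap in sequence b
            left = H[i][j - 1] + gap  # gap in sequence a
            # the floor of 0 lets an alignment restart anywhere (local alignment)
            H[i][j] = max(0, diagonal, up, left)
            best = max(best, H[i][j])
    return best


# Hypothetical toy sequences sharing the local region "ACG".
print(smith_waterman_score("TTACGTTT", "GGACGGG"))
```

Because every cell is floored at zero, unrelated flanking regions do not lower the score of a well-conserved local region, which is what distinguishes local from global alignment.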
PROCEDURE:
1. Open the FASTA page (http://fasta.bioch.virginia.edu/fasta_www2/fasta_www.cgi).
2. Select the appropriate program: protein-protein FASTA.
3. Choose either accession number or FASTA sequence to enter in the box titled “query sequence”.
4. Now press “search database”; do not choose “compare sequences”.
5. The search page will provide the information about length of query
sequence, databases, score, number of residues in the library sequences and
statistics.
6. Sequences are aligned in the decreasing order of scores.
7. As we scroll down, the “alignment” section shows the actual amino acid sequences aligned
against our sequence.
8. Homology and non-homology are inferred from domain relationships.
OBSERVATION:
1. PROTEIN NAME: Globin
2. SEQUENCES IN PIR1 DATABASE: 13144
3. SCORING MATRIX: BLOSUM50
4. GAP PENALTIES: open = -10; extension = -2
FASTA RESULTS:
1. HIGH SCORE HOMOLOGOUS SEQUENCE:

a. PROTEIN NAME: Extracellular giant hemoglobin major globin subunit
b. ORGANISM: Oligobrachia mashikoi
c. LENGTH: 163 amino acid residues
d. SEQUENCE ID: Q7M418
e. SCORE: 134.6 bits
f. BOUNDARIES: 1-163
g. E VALUE: 4.8e-31
h. % IDENTITY: 45.4%
i. %SIMILARITY: 74.2%
2. HIGH SCORE NON-HOMOLOGOUS SEQUENCE:
a. PROTEIN NAME: Extracellular globin (Tylorrhynchus heterochaetus)
b. SEQUENCE ID: PP02219
c. SCORE: 134.6
d. E VALUE: 4.9E-31
e. LENGTH: 139 amino acid residues
INTERPRETATION:
The query sequence is a globin derived from the organism Sabella spallanzanii, with a length
of 165 amino acids. The scoring matrix used is BLOSUM50, with gap penalties of -10 (open) and -2
(extension). The best alignment was found with the Extracellular giant hemoglobin major globin subunit
from Oligobrachia mashikoi, which has a length of 163 amino acids and sequence ID Q7M418. Its E value
is 4.8e-31 and the percent identity is 45.4%. The high-score non-homologous sequence, Extracellular globin
from Tylorrhynchus heterochaetus, has the sequence ID PP02219. It has a score of 134.6, an E value of
4.8E-31 and a length of 139 amino acids.
FASTA OUTPUT:
RESULT:
The given sequence of interest was analysed using FASTA and the results were interpreted.
EX NO.: 8 PROTEIN DATA BANK
DATE:
AIM:
To retrieve the structure of the protein Cellulase from Aspergillus aculeatus and to view it in
RASMOL.
DESCRIPTION:
The Protein Data Bank (PDB) is a large repository that contains the 3D structural data of large
biomolecules such as proteins and nucleic acids. The data, typically obtained by X-ray
crystallography or NMR spectroscopy and submitted by biochemists and biologists from around
the world, can be accessed at no charge on the internet. The PDB is overseen by an organization
called the Worldwide Protein Data Bank (wwPDB). The PDB is a key resource in areas of structural biology,
such as structural genomics. Most major scientific journals and some funding agencies, such as the
NIH in the USA, now require scientists to submit their structure data to the PDB. If the contents of
the PDB are thought of as primary data, then there are hundreds of derived (i.e. secondary) databases
that categorize the data differently. For example, both SCOP and CATH categorize structures
according to the type of structure and assumed evolutionary relations; GO categorizes structures
based on genes.
PROCEDURE:
1. Open the PDB website.

2. Type the protein name in the text box (or) type the PDB ID.
3. Press the search button and the result page will be displayed.
4. Choose the appropriate structure by double clicking the PDB ID.
5. A PDB web page will be displayed with details about the structure.
6. Download the structure file from the right hand corner of the web page.
7. Save the file as PDB file.
8. Open the RASMOL viewer to view the downloaded structure.
OBSERVATION:
● PROTEIN NAME: Cellulase
● ALTERNATIVE NAME: Endo-1,4-beta-glucanase
● GENE NAME: GUN_ASPAC
● FUNCTION: Hydrolyzes 1,4 linkages in beta-D-glucans; endohydrolysis of
beta-D-glycosidic linkages in cellulose and lichenin.
● BIOLOGICAL PROCESSES: Cellulose catabolic process, cellulase activity
● TAXONOMIC IDENTIFIER: 5053 [NCBI]
● TYPE OF MOLECULE: crystal structure
● ORGANISM: Aspergillus aculeatus
● PROTEIN ATTRIBUTES:
i. Sequence length: 237 aa
ii. Sequence status: UniProtKB reviewed (complete)
iii. Sequence processing: The displayed sequence is
further processed into a mature form.
● PDB ID: 3B7M
INTERPRETATION:
The protein cellulase was searched in the PDB database. The protein plays a major role
in hydrolyzing cellulose and lichenin, and hydrolyzes 1,4 linkages in beta-D-glucans. The PDB ID of
the polypeptide was found to be 3B7M and the length of the polypeptide was found to be 237
amino acids.
RESULT:
Thus, the structure of the protein cellulase was retrieved from protein data bank (PDB) and its 3D
structure was viewed using RASMOL viewer.
EX NO.: 9 CLUSTAL W
DATE:
AIM:
To align sequences of luciferase from various species and to find out the similarity between those
sequences.
DESCRIPTION:
Multiple alignments of protein sequences are important tools in studying sequences. The basic
information they provide is identification of conserved sequence regions. This is very useful in
designing experiments to test and modify the function of specific proteins, in predicting the
function and structure of proteins, and in identifying new members of protein families. Sequences
can be aligned across their entire length (global alignment) or only in certain regions (local
alignment). This is true for pairwise and multiple alignments. Global alignments need to use gaps
(representing insertions/deletions) while local alignments can avoid them, aligning regions between
gaps. ClustalW is a fully automatic program for global multiple alignment of DNA and protein
sequences. The alignment is progressive and considers the sequence redundancy. Trees can also be
calculated from multiple alignments. The program has some adjustable parameters with reasonable
defaults.
PROCEDURE:
1. ClustalW site was opened (http://www.ebi.ac.uk/Tools/msa/clustalw2)
2. In the empty dialogue box found on the home page the given FASTA sequences
were entered to check the similarities between the sequences.
3. Entering e-mail id (if you want the results e-mailed to you) was optional.
4. The multiple sequence alignment option and the alignment type were set according
to the necessary parameters.
5. Submit button was clicked to start alignment reading.
6. When viewing your results, these are the consensus symbols used by ClustalW:
a. “*” means that the residues or nucleotides in that column are identical in all
sequences in the alignment.
b. “:” means that conserved substitutions have been observed.
c. “.” means that semi-conserved substitutions are observed.
7. The button that displays Show Colors in the result page was clicked to see the
alignment results in colour code based on amino acid properties.
8. View Alignment File was clicked to see the alignment on a larger scale.
9. Clustal Omega (http://www.ebi.ac.uk/Tools/msa/clustalo/) was also used and the
results were compared.
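How the “*” symbol in step 6 is derived from an alignment can be sketched in Python. This is a partial illustration only: it marks fully identical columns and leaves every other column blank, and it does not attempt the “:” and “.” symbols, which depend on the amino acid similarity groups ClustalW uses. The function name and the toy alignment are assumptions.

```python
def identity_line(aligned):
    """Return a consensus line with '*' under columns that are identical in all sequences."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must all have the same length")
    marks = []
    for column in zip(*aligned):  # walk the alignment column by column
        identical = "-" not in column and len(set(column)) == 1
        marks.append("*" if identical else " ")
    return "".join(marks)


# Hypothetical toy alignment; '-' denotes a gap.
alignment = ["MKT-A", "MKSLA", "MKTLA"]
for seq in alignment:
    print(seq)
print(identity_line(alignment))
```

Runs of “*” in the consensus line point to the conserved regions that the description identifies as the main information a multiple alignment provides.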
RESULT:
The given sequences of luciferase were evaluated by using ClustalW / Clustal Omega
and the similarities between those sequences were found.
EX NO.: 10 PHYLOGENY
DATE:
AIM:
To study the phylogenetic relationships of nucleotide and protein sequence(s) by using
the online tool PHYLOGENY and to do distance and parsimony analysis.
DESCRIPTION:
Phylogenetic analyses are central to many research areas in biology and typically involve the
identification of homologous sequences, their multiple alignment, the phylogenetic reconstruction
and the graphical representation of the inferred tree. The Phylogeny.fr platform transparently chains
programs to automatically perform these tasks. It is primarily designed for biologists with no
experience in phylogeny, but can also meet the needs of specialists; the former will find up-to-date
tools chained in a phylogeny pipeline to analyze their data in a simple and robust way, while
the specialists will be able to easily build and run sophisticated analyses. Phylogeny.fr offers three
main modes. The ‘One Click’ mode targets non-specialists and provides a ready-to-use pipeline
chaining programs with recognized accuracy and speed: MUSCLE for multiple alignment, PhyML
for tree building, and TreeDyn for tree rendering. All parameters are set up to suit most studies, and
users only have to provide their input sequences to obtain a ready-to-print tree.
PROCEDURE:
1. Retrieve the protein/nucleotide sequences for different organisms from GenBank (FASTA format).
2. Align these sequences by using ClustalW.
3. Open phylogeny.fr and choose the “A la Carte” mode.
4. The workflow is as follows: 1) Multiple sequence alignment 2) Alignment curation 3)
Construction of phylogenetic tree and 4) Visualization of tree.
5. Multiple alignment: MUSCLE; Alignment curation: Gblocks; Construction of phylogenetic
tree: parsimony
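The distance analysis named in the aim starts from pairwise distances between the aligned sequences. A minimal Python sketch of the simplest such measure, the p-distance (proportion of differing sites), is given below. It is not the method implemented by Phylogeny.fr; the function names, the decision to skip gapped columns and the toy sequences are all assumptions made for illustration.

```python
def p_distance(seq_a, seq_b):
    """Proportion of differing sites, ignoring columns with a gap in either sequence."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must come from the same alignment")
    compared = differing = 0
    for x, y in zip(seq_a, seq_b):
        if x == "-" or y == "-":
            continue  # gapped columns carry no comparable residue pair
        compared += 1
        if x != y:
            differing += 1
    return differing / compared if compared else 0.0


def distance_matrix(aligned):
    """Pairwise p-distances for a dict mapping name -> aligned sequence."""
    names = list(aligned)
    return {(a, b): p_distance(aligned[a], aligned[b]) for a in names for b in names}


# Hypothetical toy alignment, not the tubulin data of this experiment.
toy = {"taxon_A": "MKTLLA", "taxon_B": "MKSLLA", "taxon_C": "MRS-LV"}
for pair, dist in distance_matrix(toy).items():
    print(pair, round(dist, 3))
```

Distance-based tree methods such as neighbor joining take a matrix of this kind as input, whereas parsimony works directly on the aligned characters.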
OBSERVATION:
INTERPRETATION:
Protein name: Tubulin
No. of organisms chosen: 6
1. Miliammina fusca
2. Rhabdammina cornuta
3. Allogromia laticollaris
4. Reticulomyxa filosa
5. Gammaproteobacteria bacterium
6. Lithomelissa setosa
The tree shows that Allogromia laticollaris and Lithomelissa setosa are closely related, and that
Miliammina fusca and Reticulomyxa filosa exhibit a closer relationship. Gammaproteobacteria
bacterium shows a remote relationship with the compared organisms.
The phylogenetic tree was constructed based on the parsimony method and neighbor joining
method for the protein tubulin from the six organisms: Miliammina fusca,
Rhabdammina cornuta, Allogromia laticollaris, Reticulomyxa filosa, Lithomelissa setosa and
Gammaproteobacteria bacterium. Based on the phylogenetic tree constructed, closely related and
distantly related species are found out.
RESULT:
The phylogenetic tree was constructed and the evolutionary relationships between various
organisms were observed.
EX NO.: 11 KEGG
DATE:
AIM: To learn about protein pathway analysis using KEGG database
THEORY: KEGG stands for Kyoto Encyclopedia of Genes and Genomes. KEGG is a database
resource for understanding high-level functions and utilities of the biological system, such as the cell,
the organism and the ecosystem, from molecular-level information, especially large-scale molecular
datasets generated by genome sequencing and other high-throughput experimental technologies.
PROCEDURE:
1. Download the FASTA sequences of a disease-related protein (Alzheimer's) from a single
organism in the form of a list file from UniProt.
2. Open the KEGG database and click on Tools in the toolbar; go to Convert ID and
paste the UniProt ID in the box displayed.
3. Change the settings to UniProt, select gene and execute.
4. The ID conversion result will appear, including the “KO” value, which must be noted down.
5. Click on Tools, select Mapper and then Reconstruct. Paste the gene name and KO value
and click execute.
6. A chart will appear with the disease name. By clicking on the disease we can get its
pathway, the type of gene responsible and its role.
OBSERVATION:
Name: Alzheimer disease
Organism: Homo sapiens (human)
1. GENETIC INFORMATION PROCESSING:
Chromosome:
03082 ATP-dependent chromatin remodeling
2. ENVIRONMENTAL INFORMATION PROCESSING:
04330 Notch signaling:
K04505 PSEN1
K04522 PSEN2
3. CELLULAR PROCESSES:
Transport and catabolism:
04145 Phagosome
K10062 COLEC12
4. ORGANISMAL SYSTEMS:
04380 osteoclast differentiation:
K07992 TYROBP
K14378 TREM2
5. HUMAN DISEASES:
Neurodegenerative diseases:
05010 Alzheimer disease
K04505 PSEN1
K04520 APP
K04522 PSEN2
Description:
Alzheimer disease (AD) is a chronic disorder that slowly destroys neurons and causes serious cognitive
disability. AD is associated with senile plaques and neurofibrillary tangles (NFTs). Amyloid-beta (Abeta), a
major component of senile plaques, has various pathological effects on cell and organelle function. To date
genetic studies have revealed four genes that may be linked to autosomal dominant or familial early onset AD
(FAD). These four genes include: amyloid precursor protein (APP), presenilin 1 (PS1), presenilin 2 (PS2) and
apolipoprotein E (ApoE). All mutations associated with APP and PS proteins can lead to an increase in the
production of Abeta peptides, specifically the more amyloidogenic form, Abeta42. It was proposed that Abeta
forms Ca2+ permeable pores and binds to and modulates multiple synaptic proteins, including NMDAR,
mGluR5 and VGCC, leading to the overfilling of neurons with calcium ions. Consequently, cellular Ca2+
disruptions will lead to neuronal apoptosis, autophagy deficits, mitochondrial abnormality, defective
neurotransmission, impaired synaptic plasticity and neurodegeneration in AD. FAD-linked PS1 mutation
downregulates the unfolded protein response and leads to vulnerability to ER stress.
Pathway map:
INTERPRETATION:
For the human organism, the prefix keyword “hum” was used, according to which the organism-related
pathways are visualized. The keyword was obtained with the help of the “organism” icon. The Alzheimer
disease pathway was then entered along with the requested criteria, and the structure of the pathways and
the gene sequences were obtained.
RESULTS: The Alzheimer disease in the organism Homo sapiens has been interpreted successfully with the
data for the pathway identifier.
EX NO.: 12 EMBOSS
DATE:
AIM: To use EMBOSS programs to perform sequence analysis for specific DNA and protein sequences.
DESCRIPTION: EMBOSS (European Molecular Biology Open Software Suite) is an open-source package
of sequence analysis tools. The software covers a wide range of functionality and can handle data in a variety
of formats. Extensive libraries are provided with the package, allowing users to develop and release their own
software. EMBOSS also integrates a range of currently available packages and tools for sequence analysis,
such as BLAST and ClustalW. A Java-based graphical interface (Jemboss) is also available. EMBOSS contains
around 150 programs (applications).
These are just some of the areas covered:
● Sequence alignment.
● Rapid database searching with sequence patterns.
● Protein motif identification, including domain analysis.
● Nucleotide sequence pattern analysis, for example to identify CpG islands or repeats.
● Codon usage analysis for small genomes.
PROGRAMS USED FOR NUCLEIC ACID AND PROTEIN SEQUENCE ANALYSIS:

1. Plotorf: Plots potential open reading frames in a nucleotide sequence.
2. Restrict: Reports restriction enzyme cleavage sites in a nucleotide sequence.
3. Transeq: Translates nucleic acid sequences.
4. Eprimer3: Picks PCR primers and hybridization oligos.
5. Backtranseq: Back-translates a protein sequence to a nucleotide sequence.
6. Dan: Calculates nucleic acid melting temperature.
7. Palindrome: Finds inverted repeats in nucleotide sequence(s).
8. Antigenic: Predicts potentially antigenic regions of a protein sequence,
using the method of Kolaskar and Tongaonkar.
9. Garnier: Predicts protein secondary structure.
10. Helixturnhelix: Reports nucleic acid binding motifs in a protein sequence.
11. Iep: Calculates the isoelectric point of a protein.
12. Needle: Global pairwise sequence alignment (Needleman-Wunsch).
13. Water: Local pairwise sequence alignment (Smith-Waterman).
14. Emma: Multiple sequence alignment (ClustalW wrapper).
15. CpGplot: Identifies and plots CpG islands in nucleotide sequence(s).
16. Dotmatcher: Draws a threshold dotplot of two sequences.
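To make the Transeq entry in the list above concrete, the following is a minimal Python sketch of translating a nucleotide sequence in one forward reading frame with the standard genetic code. It is not EMBOSS code: the function name `translate`, the `frame` parameter and the toy input sequence are assumptions made only for illustration.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered over T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna, frame=1):
    """Translate a DNA/RNA string in forward frame 1, 2 or 3.

    Stop codons are written as '*'; codons containing ambiguous
    bases are written as 'X'. Trailing bases that do not fill a
    whole codon are ignored.
    """
    seq = dna.upper().replace("U", "T")
    start = frame - 1
    protein = []
    for i in range(start, len(seq) - 2, 3):
        protein.append(CODON_TABLE.get(seq[i:i + 3], "X"))
    return "".join(protein)


# Hypothetical toy sequence, not taken from the cMyc record.
print(translate("ATGGCCTAA"))           # MA*
print(translate("ATGGCCTAA", frame=2))  # WP
```

Running the same function for frames 1 to 3 on a sequence and on its reverse complement gives the six-frame view that ORF tools such as Plotorf build on.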
PROCEDURE:
1. The EMBOSS programs were accessed from the website http://emboss.bioinformatics.nl.
2. “Wossname”, available on the homepage of the website, was used to look up the
functions of the different programs listed on the left side of the webpage.
3. The “Sort alphabetically” command was used to arrange the programs in
alphabetical order.
4. The programs executed were Plotorf, Restrict, Transeq, Eprimer3, Backtranseq,
Dan, Palindrome, Antigenic, digest, elipop, emast, ememe, enetnglyc, enetoglyc,
enetphos, etmhmm, garnier, helixturnhelix, iep and psiphi.
5. For every program, the protein/DNA sequence was typed into the dialogue box,
uploaded from the computer or retrieved from a database.
6. The RUN button was clicked to execute the program and the results
were observed.
OBSERVATION:
Homo sapiens cMyc
Input sequence:
1. Plotorf:
2. TRANSEQ:
3. RESTRICT:
4. Backtranseq:
5. DAN:
6. PALINDROME:
Palindromes of: NC_000008.11
Sequence length is: 7518
Start at position: 1
End at position: 7518
Minimum length of Palindromes is: 10
Maximum length of Palindromes is: 100
Maximum gap between elements is: 100
Number of mismatches allowed in Palindrome: 0
Palindromes:
1642 ccgtctccgg 1651
     ||||||||||
1699 ggcagaggcc 1690
5863 tatagtaccta 5873
     |||||||||||
5890 atatcatggat 5880
7. Antigenic:
8. GARNIER:
9. IEP:
IEP of NC_000008.11 from 1 to 7518
Isoelectric Point = 4.3702
pH Bound Charge
1.00 1948.00 1.00
1.50 1947.99 0.99
2.00 1947.97 0.97
2.50 1947.92 0.92
3.00 1947.79 0.79
3.50 1947.54 0.54
4.00 1947.22 0.22
4.50 1946.92 -0.08
5.00 1946.42 -0.58
5.50 1945.06 -1.94
6.00 1940.84 -6.16
6.50 1927.64 -19.36
7.00 1887.11 -59.89
7.50 1769.59 -177.41
8.00 1478.71 -468.29
8.50 973.09 -973.91
9.00 467.56 -1479.44
9.50 176.92 -1770.08
10.00 59.65 -1887.35
10.50 19.27 -1927.73
11.00 6.13 -1940.87
11.50 1.94 -1945.06
12.00 0.62 -1946.38
12.50 0.19 -1946.81
13.00 0.06 -1946.94
13.50 0.02 -1946.98
14.00 0.01 -1946.99
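The isoelectric point is the pH at which the net charge is zero, and in the table above the Charge column changes sign between pH 4.00 and pH 4.50. As a rough check on the reported value, the sketch below interpolates linearly between those rows. The function name `zero_crossing` is an assumption for illustration, and linear interpolation is a simplification: iep itself solves for the zero of the charge curve, so a small difference from 4.3702 is expected.

```python
# (pH, net charge) rows copied from the iep table above.
ROWS = [(3.50, 0.54), (4.00, 0.22), (4.50, -0.08), (5.00, -0.58)]


def zero_crossing(rows):
    """Return the pH where the charge crosses zero, by linear interpolation.

    Rows must be sorted by increasing pH. Returns None if the charge
    never changes from positive to negative.
    """
    for (ph_lo, q_lo), (ph_hi, q_hi) in zip(rows, rows[1:]):
        if q_lo == 0:
            return ph_lo
        if q_lo > 0 > q_hi:
            # Fraction of the interval at which the straight line hits zero.
            fraction = q_lo / (q_lo - q_hi)
            return ph_lo + (ph_hi - ph_lo) * fraction
    return None


estimate = zero_crossing(ROWS)
print(f"interpolated pI is about {estimate:.2f}")  # about 4.37
```

The estimate agrees with the reported isoelectric point of 4.3702 to two decimal places.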
INTERPRETATION:
Organism name: Homo sapiens
Gene name: cMyc
1. Plotorf: Many ORFs are shown in 3 frames; the largest one, in the second frame of the forward
strand, is the most probable ORF.
2. Restrict: The restriction site is CCGC; the recognition sequences and the 5’ and 3’ cut sites are shown.
3. Transeq: The nucleotide sequence is translated to a protein sequence.
4. Backtranseq: The protein sequence is translated back to a nucleotide sequence.
5. Dan:
a. Accession no: NC_000008.11
b. Hit count: 7499
c. The starting and ending of the nucleotide strand are at 1 and 7499 bp
6. Palindrome:
a. Accession no: NC_000008.11
b. Length of the nucleotide: 7518
c. Palindrome sequence starts at 1 bp and ends at 7518 bp
7. Garnier:
The protein secondary structure details are given as:
a. Sequence: NC_000008.11 1 to 7518
b. Hitcount: 1216
c. DCH=0, DCS=0
8. IEP:
The isoelectric point was calculated for
NC_000008.11 from 1 to 7518:
Isoelectric Point = 4.3702
RESULT:
Thus, different EMBOSS functions were performed on the cMyc gene.
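The two hits in the PALINDROME observation are inverted repeats: each lower arm, read on the forward strand, is the reverse complement of the upper arm. The following minimal Python sketch checks this for the reported arms. The helper names are assumptions for illustration, and the check covers only the zero-mismatch case used in this run; it does not search a sequence for new palindromes.

```python
# Complement map for lowercase DNA, as printed in the palindrome report.
COMPLEMENT = str.maketrans("acgt", "tgca")


def reverse_complement(seq):
    """Return the reverse complement of a lowercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def is_inverted_repeat(upper_arm, lower_arm_as_printed):
    """Check one palindrome hit with zero mismatches allowed.

    The report prints the lower arm from its higher to its lower
    position, so it is reversed here to get the forward-strand order.
    """
    lower_forward = lower_arm_as_printed[::-1]
    return reverse_complement(lower_forward) == upper_arm


# Arms copied from the PALINDROME output (positions 1642-1651/1690-1699
# and 5863-5873/5880-5890).
hits = [
    ("ccgtctccgg", "ggcagaggcc"),
    ("tatagtaccta", "atatcatggat"),
]
results = [is_inverted_repeat(upper, lower) for upper, lower in hits]
print(results)  # [True, True]
```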
EX NO.: 13 CATH
DATE:
AIM:
To find the protein 3D structure, evolution, function and conserved regions using the CATH database.
THEORY:
CATH stands for Class, Architecture, Topology and Homologous superfamily. The CATH database is a free,
publicly available online resource that provides information on the evolutionary relationships of protein
domains. It was created in the mid-1990s by Professor Christine Orengo and colleagues, and continues to be
developed by the Orengo group at University College London.
PROCEDURE:
1. Search for ‘CATH’ in Google and open the first link, https://www.cathdb.info/
2. Click the ‘protein function’ tab.
3. Paste the FASTA sequence in the space provided and click GO.
4. Repeat the search for the 3D structure, evolution and conserved regions.
5. Observe and interpret the results.
OBSERVATION:
ER membrane protein complex subunit 9 (Homo sapiens)
Input sequence:
MATCHING DOMAINS:
3D STRUCTURE:
CONSERVED REGIONS:
INTERPRETATION:
The protein EMC9 is from the organism Homo sapiens. It has a sequence of 208 amino acids.
Protein function: The functional annotations of the protein are identified.
3D structure: The significant CATH structural domains that match the protein sequence are identified.
Conserved regions: Highly conserved residues are distinguished from non-conserved residues on the basis of
the conservation score.
RESULT:
Thus, the 3D structure, evolution, function and conserved regions of EMC9 were studied using the CATH
database and the results were interpreted.
EX NO.: 14 PFAM
DATE:
AIM:
To get familiar with the Pfam database tools.
THEORY:
Pfam is a large collection of protein families and domains, each represented by multiple sequence alignments
and hidden Markov models (HMMs); the alignments are constructed semi-automatically using the HMMs.
Pfam can be used to view the domain organization of proteins, multiple alignments, protein domain
architectures, protein structures and species distributions. Proteins are generally composed of one or more
functional regions, commonly termed domains. Different combinations of domains give rise to the diverse
range of proteins found in nature. The identification of domains that occur within proteins can therefore
provide insights into their function. Pfam also generates higher-level groupings of related entries, known as
clans. A clan is a collection of Pfam entries which are related by similarity of sequence, structure or
profile-HMM. The data presented for each entry is based on the UniProt Reference Proteomes, but information
on individual UniProtKB sequences can still be found by entering the protein accession. Pfam full alignments
are available from searching a variety of databases, either to provide different accessions (e.g. all UniProt
and NCBI GI) or different levels of redundancy.
PROCEDURE:
1. Open the Pfam website https://pfam.xfam.org/
2. Type the accession number of the required protein from the UniProt database in the
search box named ‘jump to’ and click ‘Go’.
3. Enter the accession number individually for the following:
a) Sequence search
b) View a pfam entry
c) View a clan
d) View a sequence
e) View a structure
f) Keyword search
OBSERVATION:
Protein: ER membrane protein complex subunit 9
Accession number: Q9Y3B6
Sequence length: 208 aa
Sequence search:
View a clan:
View a sequence:
INTERPRETATION:
The data were obtained for the protein ER membrane protein complex subunit 9 from Homo sapiens. It has
a sequence of 208 aa and the accession number Q9Y3B6. The sequence search provides details about the
significant and insignificant Pfam matches. Viewing a clan helps to identify the number of sequences and
species that are related.
RESULT:
Thus, the Pfam database tools were studied and the results were interpreted.
EX NO.: 15 SWISS MODEL
DATE:
AIM:
To model EMC9 with SWISS-MODEL using both the automated and alignment modes.
DESCRIPTION:
SWISS-MODEL (http://swissmodel.expasy.org) is a server for automated comparative modeling of
three-dimensional (3D) protein structures. It pioneered the field of automated modeling starting in 1993 and is
the most widely used free web-based automated modeling facility today. In 2002 the server computed 120 000
user requests for 3D protein models. SWISS-MODEL provides several levels of user interaction through its
World Wide Web interface: in the 'first approach mode' only an amino acid sequence of a protein
is submitted to build a 3D model. Template selection, alignment and model building are done completely
automatically by the server. In the 'alignment mode', the modeling process is based on a user-defined
target-template alignment. Complex modeling tasks can be handled with the 'project mode' using
DeepView (Swiss-PdbViewer), an integrated sequence-to-structure workbench. All models are sent back via
email with a detailed modeling report. WhatCheck analyses and ANOLEA evaluations are provided optionally.
The reliability of SWISS-MODEL is continuously evaluated in the EVA-CM project. The SWISS-MODEL
server is under constant development to improve the successful implementation of expert knowledge into an
easy-to-use server. Here, the automated and alignment modes were used to model the 3D structure of the
target protein, EMC9.
PROCEDURE:
1. The sequence of the target protein (EMC9) was obtained from http://www.uniprot.org/ by searching in the
Protein Knowledgebase. The sequence file was downloaded in FASTA format.
2. The template for modeling the target was chosen by using BLAST
(http://blast.ncbi.nlm.nih.gov/Blast.cgi) to find a homologous protein whose structure is known. “Protein
BLAST” was chosen, with Protein Data Bank proteins (pdb) as the database, since it contains only
experimentally resolved structures. The protein with the best BLAST score was chosen and the corresponding
PDB file was downloaded.
3. The target and template sequences were aligned using ClustalW2
(http://www.ebi.ac.uk/Tools/msa/clustalw2/) and the aligned sequences were saved in ClustalW format.
4. To build the 3D structure of the target, the SWISS-MODEL website was opened
(http://swissmodel.expasy.org/). Alignment Mode was chosen, the target-template alignment file was
uploaded, 1U19.A was chosen as the template and the query was submitted. The result
webpage of SWISS-MODEL provided several pieces of information to evaluate the model obtained. The result
was downloaded as a zipped file.
5. Model quality was evaluated using the ANOLEA mean force potential, the GROMOS empirical force
field energy and QMEAN to estimate the local quality of the predicted structure. Other available tools are
WhatCheck, PROCHECK and the Ramachandran plot, which analyze the stereochemistry of protein models and
template structures.
6. MolProbity was used to get summary statistics of the protein model (http://molprobity.biochem.duke.edu/).
7. The model.pdb file was uploaded to the MolProbity server. The “Add Hydrogens” tool was used to
add hydrogens to the model.
8. “Start adding >” was chosen to run the Reduce program, allowing it to test Asn/Gln/His flips.
9. The “Regenerate H,…” button was chosen to move on to a flip-report page, followed by “Continue >” to
return to the MolProbity main page. “Analyze all-atom contacts and geometry” was chosen and then “Run” with
default settings, so that summary statistics could be viewed to judge the quality of the model.
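Template choice in steps 2 and 3 depends on how similar the target and template are once aligned. As a hedged illustration, the Python sketch below computes percent identity from two already-aligned sequences. The function name, the toy alignment and the choice of denominator (columns where neither sequence has a gap) are assumptions; other tools define percent identity with different denominators, so values may not match BLAST or ClustalW reports exactly.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over alignment columns without a gap in either row."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == gap or b == gap:
            continue  # skip gapped columns
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Hypothetical toy alignment, not taken from the EMC9 ClustalW output.
target = "MKT-AYI"
template = "MKSLAYV"
identity = percent_identity(target, template)
print(f"{identity:.1f}% identity")  # 66.7% identity
```

Here 4 of the 6 ungapped columns match, which gives 66.7%.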
OBSERVATION:
INPUT SEQUENCE:
>sp|Q9Y3B6|EMC9_HUMAN ER membrane protein complex subunit 9 OS=Homo sapiens OX=9606
GN=EMC9 PE=1 SV=3
MGEVEISALAYVKMCLHAARYPHAAVNGLFLAPAPRSGECLCLTDCVPLFHSHLALSVML
EVALNQVDVWGAQAGLVVAGYYHANAAVNDQSPGPLALKIAGRIAEFFPDAVLIMLDNQK
LVPQPRVPPVIVLENQGLRWVPKDKNLVMWRDWEESRQMVGALLEDRAHQHLVDFDCHLD
DIRQDWTNQRLNTQITQWVGPTNGNGNA
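A quick sanity check on a FASTA record before submitting it is to parse the header and count the residues. The following minimal Python sketch does this for the EMC9 record; the function name `parse_fasta` is an assumption, the parser handles a single record only, and it keeps only uppercase one-letter codes so that stray characters introduced by copying are ignored.

```python
FASTA = """>sp|Q9Y3B6|EMC9_HUMAN ER membrane protein complex subunit 9 OS=Homo sapiens OX=9606 GN=EMC9 PE=1 SV=3
MGEVEISALAYVKMCLHAARYPHAAVNGLFLAPAPRSGECLCLTDCVPLFHSHLALSVML
EVALNQVDVWGAQAGLVVAGYYHANAAVNDQSPGPLALKIAGRIAEFFPDAVLIMLDNQK
LVPQPRVPPVIVLENQGLRWVPKDKNLVMWRDWEESRQMVGALLEDRAHQHLVDFDCHLD
DIRQDWTNQRLNTQITQWVGPTNGNGNA
"""


def parse_fasta(text):
    """Parse a single-record FASTA string into (header, sequence)."""
    header = None
    chunks = []
    for line in text.strip().splitlines():
        line = line.strip()
        if line.startswith(">"):
            header = line[1:]
        elif line:
            # Keep only uppercase residue letters; drop copy artifacts.
            chunks.append("".join(c for c in line if c.isupper()))
    return header, "".join(chunks)


header, sequence = parse_fasta(FASTA)
accession = header.split("|")[1]  # UniProt headers are db|accession|entry name
print(accession, len(sequence))   # Q9Y3B6 208
```

The count of 208 residues matches the sequence length recorded for Q9Y3B6 in the Pfam observation.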
BLAST OUTPUT:
CLUSTAL – W OUTPUT:
SWISS MODEL OUTPUT:
INTERPRETATION: The EMC9 molecule was modeled using the SWISS-MODEL program. From the analysis, it
was inferred that there are four ligands and that the oligomeric state is heteromer. The SMTL ID is 6ww7.1.4.
RESULT: EMC9 was modeled, various parameters were studied and the results were interpreted.
EX NO.: 16 UCSC BROWSER
DATE:
AIM:
To get familiar with the Genome Browser, UCSC.
THEORY:
The UCSC Genome Browser is an online and downloadable genome browser hosted by the University of
California, Santa Cruz. The UCSC Genome Bioinformatics home page provides links to the Genome Browser
application and a variety of other useful tools: BLAT (Kent et al., 2002), for quickly mapping sequences to a
genome assembly; the Table Browser (Karolchik et al., 2004; Fujita et al., 2011), for viewing and
manipulating the data underlying the Genome Browser; the Gene Sorter (Kent et al., 2005), for exploring
relationships (expression, homology, etc.) among groups of genes; VisiGene, for browsing through a large
collection of in situ mouse and frog images to examine expression patterns; the Proteome Browser (Hsu et al.,
2005), for viewing information about a selected protein; an in silico PCR tool for rapidly searching a sequence
database with a pair of PCR primers; and Genome Graphs, a tool for viewing quantities plotted along
chromosomes.
PROCEDURE:
1. Search for “UCSC genome browser” in Google and open the first link
available, https://genome.ucsc.edu/index.html
2. Click on the Asia-specific browser.
3. Type the organism name, chromosome position or gene symbol and click Search.
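Step 3 accepts a chromosome position as one kind of search term. As an illustration, the Python sketch below parses a position written as chrom:start-end with optional thousands separators and reports the length of the region. The function name, the regular expression and the example coordinates are assumptions, and the length calculation assumes 1-based inclusive coordinates as displayed in the browser's position box.

```python
import re

# chrom:start-end, where start and end may contain thousands separators.
POSITION_RE = re.compile(r"^(chr[\w.]+):([\d,]+)-([\d,]+)$")


def parse_position(text):
    """Split a 'chrN:start-end' string into (chrom, start, end)."""
    match = POSITION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a chrom:start-end position: {text!r}")
    chrom, start, end = match.groups()
    start = int(start.replace(",", ""))
    end = int(end.replace(",", ""))
    if start > end:
        raise ValueError("start must not exceed end")
    return chrom, start, end


# Hypothetical coordinates, not taken from the baboon search in this record.
chrom, start, end = parse_position("chr1:1,000-2,000")
length = end - start + 1  # both ends are included
print(chrom, start, end, length)  # chr1 1000 2000 1001
```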
OBSERVATION:
INTERPRETATION:
The search terms for the organism of interest were entered in the search boxes on the title page, and that
organism was then used as the reference for the results.
Figure 1 shows the title page of the browser with the fields needed for the search. The organism chosen was
baboon; other sequences can be selected from the “view sequences” tab of the organism description, as the
user requires.
After the details were entered in the position box, the search led to the page shown in Figure 2a, which lists
the reference sequences on the site that are similar to the query.
RESULT:
The reference sequences for the organism baboon were observed and interpreted successfully.
EX NO.: 17 R PROGRAMMING
DATE:
GENERAL MATHEMATICAL COMMANDS

1. APPROXIMATION:
2. TRIGONOMETRIC:
3. ARITHMETIC AND TABULATION:

CONDITIONAL AND LOGARITHMIC
1. LOGARITHMIC:
2. FIND THE OUTPUT OF THE FOLLOWING:

PACKAGES
1. dplyr:
To install the dplyr package, type the following command.
install.packages("dplyr")
To load the dplyr package, type the command below.
library(dplyr)
Important functions in dplyr package
2. oligoClasses
To install this package, start R (version "4.0"), enter the following in an “R script” and click “Run”:
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("oligoClasses")
3. BioMedR
To install the BioMedR package in R
install.packages("BioMedR")
4. ggplot2
To install the ggplot2 package in R
install.packages("ggplot2")
To load the package
library(ggplot2)
