Biological Databases
WHAT IS A DATABASE ?
A collection of data that, like a book, has a table of contents (an index), new editions (periodic updates) and links with other db (cross-references).
Includes also associated tools (software) necessary for db access, db updating, db information insertion, db information deletion.
DATABASES
A database can be viewed as layers: information system, query system, storage system, data.
Data: GenBank flat file, PDB file, interaction record, title of a book, book
Storage system: boxes, Oracle, MySQL, PC binary files, Unix text files, bookshelves
Query system: a list you look at, a catalogue, indexed files, SQL, grep
Information system: the UBC library, Google, Entrez, SRS
TYPES OF DATABASE
Databases can be classified by:
The nature of the information being stored (e.g. sequences or structures)
The manner of data storage (whether in flat files or in tables)
Easy to manage: all the entries are visible at the same time !
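The two manners of storage can be compared with a small sketch. The records, the two-letter line tags and the table layout below are hypothetical, chosen only to illustrate the idea: the same three entries are written once as a flat text file that is scanned in the manner of grep, and once as a table that is queried with SQL.

```python
import sqlite3

# Hypothetical records: (accession, organism, sequence).
RECORDS = [
    ("X00001", "Saccharomyces cerevisiae", "ATGGCTTAA"),
    ("X00002", "Escherichia coli", "ATGAAATGA"),
    ("X00003", "Saccharomyces cerevisiae", "ATGCCCTAG"),
]


def to_flat_file(records):
    """Serialize records as tagged text lines; '//' terminates each entry."""
    lines = []
    for acc, organism, seq in records:
        lines += [f"AC   {acc}", f"OS   {organism}", f"SQ   {seq}", "//"]
    return "\n".join(lines) + "\n"


def grep_flat_file(text, needle):
    """Flat-file query: scan every entry and return accessions of matching ones."""
    hits = []
    for entry in text.split("//\n"):
        if needle in entry:
            for line in entry.splitlines():
                if line.startswith("AC   "):
                    hits.append(line[5:])
    return hits


def query_table(records, organism):
    """Table query: load the records into an in-memory SQLite table and use SQL."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE entry (acc TEXT PRIMARY KEY, organism TEXT, seq TEXT)"
    )
    conn.executemany("INSERT INTO entry VALUES (?, ?, ?)", records)
    rows = conn.execute(
        "SELECT acc FROM entry WHERE organism = ? ORDER BY acc", (organism,)
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]


flat_hits = grep_flat_file(to_flat_file(RECORDS), "Saccharomyces")
table_hits = query_table(RECORDS, "Saccharomyces cerevisiae")
```

Both queries return the same accessions here. The flat file keeps every entry visible as text, while the table lets the storage system do the searching.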
BIOLOGICAL DATABASE
A biological database is a collection of data that is organized so that its contents can easily be accessed, managed, and updated. The activity of preparing a database can be divided into:
Collection of data in a form which can be easily accessed
Making it available to a multi-user system (always available for the user)
Explosive growth in biological data.
Data (sequences, 3D structures, 2D gel analysis, MS analysis, etc.) are no longer published in a conventional manner, but directly submitted to databases.
Databases are essential tools for biological research, as classical publications used to be.
Databases in general can be classified into primary, secondary and composite databases.
A primary database contains information on the sequence or structure alone. Examples include Swiss-Prot and PIR for protein sequences, GenBank and DDBJ for nucleotide sequences, and the Protein Data Bank for protein structures.
Nucleic acid: EMBL, GenBank, DDBJ
Protein: PIR
SEQUENCE DATABASES
Primary DNA: DDBJ/EMBL/GenBank
Primary protein: GenPept/TrEMBL
Curated DB: RefSeq (genomic, mRNA and protein); Swiss-Prot & PIR -> UniProt (protein)
The principal DNA sequence databases are DDBJ, EMBL and GenBank, which exchange data on a daily basis to ensure comprehensive coverage at each of the sites.
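[Figure: data exchange among the three sites. NIH/NCBI maintains GenBank (queried through Entrez), EBI maintains EMBL (queried through SRS), and NIG/CIB maintains DDBJ (queried through getentry). Each site accepts submissions and updates.]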
WHAT IS GENBANK?
GenBank is the NIH genetic sequence database of all publicly available DNA and derived protein sequences, with annotations describing the biological information these records contain.
http://www.ncbi.nlm.nih.gov/Genbank/GenbankOverview.html
[Figure: a GenBank record, made up of a header followed by the DNA sequence.]
PIR
(THE PROTEIN INFORMATION RESOURCE)
The Protein Sequence Database was developed at the National Biomedical Research Foundation (NBRF) in the early 1960s by Margaret Dayhoff. In its current form the PIR is split into four sections:
PIR1 contains fully classified and annotated entries.
PIR2 includes preliminary entries, which have not been thoroughly reviewed and may contain redundancy.
PIR3 contains unverified entries, which have not been thoroughly reviewed.
SWISS-PROT
SWISS-PROT is a protein sequence database that was produced collaboratively by the Department of Medical Biochemistry at the University of Geneva and the EMBL. After 1994 the collaboration moved to EMBL's UK outstation, the EBI. In April 1998 it moved to the Swiss Institute of Bioinformatics (SIB).
A SWISS-PROT entry incorporates:
Function of the protein
Post-translational modification
Domains and sites
Secondary structure
Quaternary structure
Similarities to other proteins
Diseases associated with deficiencies in the protein
Sequence conflicts, variants, etc.
SWISS-PROT
Simplified entry, with placeholders for the longer sections:
ID   CYS3_YEAST     STANDARD;      PRT;   393 AA.
AC   P31373;
DT   01-JUL-1993 (REL. 26, CREATED)
DE   CYSTATHIONINE GAMMA-LYASE (EC 4.4.1.1) (GAMMA-CYSTATHIONASE).
GN   CYS3 OR CYI1 OR STR1 OR YAL012W OR FUN35.
OS   TAXONOMY
OC   SACCHAROMYCETACEAE; SACCHAROMYCES.
RX   CITATION
CC   -!- CATALYTIC ACTIVITY: L-CYSTATHIONINE + H(2)O = L-CYSTEINE +
CC       NH(3) + 2-OXOBUTANOATE.
CC   -!- COFACTOR: PYRIDOXAL PHOSPHATE.
CC   -!- PATHWAY: FINAL STEP IN THE TRANS-SULFURATION PATHWAY SYNTHESIZING
CC       L-CYSTEINE FROM L-METHIONINE.
CC   -!- SUBUNIT: HOMOTETRAMER.
CC   -!- SUBCELLULAR LOCATION: CYTOPLASMIC.
CC   -!- SIMILARITY: BELONGS TO THE TRANS-SULFURATION ENZYMES FAMILY.
CC   --------------------------------------------------------------------------
CC   DISCLAIMER
CC   --------------------------------------------------------------------------
DR   DATABASE cross-reference
KW   CYSTEINE BIOSYNTHESIS; LYASE; PYRIDOXAL PHOSPHATE.
FT   INIT_MET      0      0
FT   BINDING     203    203       PYRIDOXAL PHOSPHATE (BY SIMILARITY).
SQ   SEQUENCE   393 AA;  42411 MW;  55BA2771 CRC32;
     TLQESDKFAT KAIHAGEHVD VHGSVIEPIS LSTTFKQSSP ANPIGTYEYS RSQNPNRENL
     ERAVAALENA QYGLAFSSGS ATTATILQSL PQGSHAVSIG DVYGGTHRYF TKVANAHGVE
     TSFTNDLLND LPQLIKENTK LVWIETPTNP TLKVTDIQKV ADLIKKHAAG QDVILVVDNT
     FLSPYISNPL NFGADIVVHS ATKYINGHSD VVLGVLATNN KPLYERLQFL QNAIGAIPSP
     FDAWLTHRGL KTLHLRVRQA ALSANKIAEF LAADKENVVA VNYPGLKTHP NYDVVLKQHR
     DALGGGMISF RIKGGAEAAS KFASSTRLFT LAESLGGIES LLEVPAVMTH GGIPKEAREA
     SGVFDDLVRI SVGIEDTDDL LEDIKQALKQ ATN
//
Full entry:
ID   CYS3_YEAST     STANDARD;      PRT;   393 AA.
AC   P31373;
DT   01-JUL-1993 (REL. 26, CREATED)
DT   01-JUL-1993 (REL. 26, LAST SEQUENCE UPDATE)
DT   01-NOV-1995 (REL. 32, LAST ANNOTATION UPDATE)
DE   CYSTATHIONINE GAMMA-LYASE (EC 4.4.1.1) (GAMMA-CYSTATHIONASE).
GN   CYS3 OR CYI1 OR STR1 OR YAL012W OR FUN35.
OS   SACCHAROMYCES CEREVISIAE (BAKER'S YEAST).
OC   EUKARYOTA; FUNGI; ASCOMYCOTA; HEMIASCOMYCETES; SACCHAROMYCETALES;
OC   SACCHAROMYCETACEAE; SACCHAROMYCES.
RN   [1]
RP   SEQUENCE FROM N.A., AND PARTIAL SEQUENCE.
RX   MEDLINE; 92250430. [NCBI, ExPASy, Israel, Japan]
RA   ONO B.-I., TANAKA K., NAITO K., HEIKE C., SHINODA S., YAMAMOTO S.,
RA   OHMORI S., OSHIMA T., TOH-E A.;
RT   "Cloning and characterization of the CYS3 (CYI1) gene of
RT   Saccharomyces cerevisiae.";
RL   J. BACTERIOL. 174:3339-3347(1992).
RN   [2]
RP   SEQUENCE FROM N.A., AND CHARACTERIZATION.
RC   STRAIN=DBY939;
RX   MEDLINE; 93328685. [NCBI, ExPASy, Israel, Japan]
RA   YAMAGATA S., D'ANDREA R.J., FUJISAKI S., ISAJI M., NAKAMURA K.;
RT   "Cloning and bacterial expression of the CYS3 gene encoding
RT   cystathionine gamma-lyase of Saccharomyces cerevisiae and the
RT   physicochemical and enzymatic properties of the protein.";
RL   J. BACTERIOL. 175:4800-4808(1993).
RN   [3]
RP   SEQUENCE FROM N.A.
RC   STRAIN=S288C / AB972;
RX   MEDLINE; 93289814. [NCBI, ExPASy, Israel, Japan]
RA   BARTON A.B., KABACK D.B., CLARK M.W., KENG T., OUELLETTE B.F.F.,
RA   STORMS R.K., ZENG B., ZHONG W.W., FORTIN N., DELANEY S., BUSSEY H.;
RT   "Physical localization of yeast CYS3, a gene whose product resembles
RT   the rat gamma-cystathionase and Escherichia coli cystathionine gamma-
RT   synthase enzymes.";
RL   YEAST 9:363-369(1993).
RN   [4]
RP   SEQUENCE FROM N.A.
RC   STRAIN=S288C / AB972;
RX   MEDLINE; 93209532. [NCBI, ExPASy, Israel, Japan]
RA   OUELLETTE B.F.F., CLARK M.W., KENG T., STORMS R.K., ZHONG W.W.,
RA   ZENG B., FORTIN N., DELANEY S., BARTON A.B., KABACK D.B., BUSSEY H.;
RT   "Sequencing of chromosome I from Saccharomyces cerevisiae: analysis
RT   of a 32 kb region between the LTE1 and SPO7 genes.";
RL   GENOME 36:32-42(1993).
RN   [5]
RP   SEQUENCE OF 1-18, AND CHARACTERIZATION.
RX   MEDLINE; 93289817. [NCBI, ExPASy, Israel, Japan]
RA   ONO B.-I., ISHII N., NAITO K., MIYOSHI S.-I., SHINODA S., YAMAMOTO S.,
RA   OHMORI S.;
RT   "Cystathionine gamma-lyase of Saccharomyces cerevisiae: structural
RT   gene and cystathionine gamma-synthase activity.";
RL   YEAST 9:389-397(1993).
CC   -!- CATALYTIC ACTIVITY: L-CYSTATHIONINE + H(2)O = L-CYSTEINE +
CC       NH(3) + 2-OXOBUTANOATE.
CC   -!- COFACTOR: PYRIDOXAL PHOSPHATE.
CC   -!- PATHWAY: FINAL STEP IN THE TRANS-SULFURATION PATHWAY SYNTHESIZING
CC       L-CYSTEINE FROM L-METHIONINE.
CC   -!- SUBUNIT: HOMOTETRAMER.
CC   -!- SUBCELLULAR LOCATION: CYTOPLASMIC.
CC   -!- SIMILARITY: BELONGS TO THE TRANS-SULFURATION ENZYMES FAMILY.
CC   --------------------------------------------------------------------------
CC   This SWISS-PROT entry is copyright. It is produced through a collaboration
CC   between the Swiss Institute of Bioinformatics and the EMBL outstation -
CC   the European Bioinformatics Institute. There are no restrictions on its
CC   use by non-profit institutions as long as its content is in no way
CC   modified and this statement is not removed. Usage by and for commercial
CC   entities requires a license agreement (See http://www.isb-sib.ch/announce/
CC   or send an email to [email protected]).
CC   --------------------------------------------------------------------------
DR   EMBL; L05146; AAC04945.1; -. [EMBL / GenBank / DDBJ] [CoDingSequence]
DR   EMBL; L04459; AAA85217.1; -. [EMBL / GenBank / DDBJ] [CoDingSequence]
DR   EMBL; D14135; BAA03190.1; -. [EMBL / GenBank / DDBJ] [CoDingSequence]
DR   PIR; S31228; S31228.
DR   YEPD; 5280; -.
DR   SGD; L0000470; CYS3. [SGD / YPD]
DR   PFAM; PF01053; Cys_Met_Meta_PP; 1.
DR   PROSITE; PS00868; CYS_MET_METAB_PP; 1.
DR   DOMO; P31373.
DR   PRODOM [Domain structure / List of seq. sharing at least 1 domain]
DR   PROTOMAP; P31373.
DR   PRESAGE; P31373.
DR   SWISS-2DPAGE; GET REGION ON 2D PAGE.
KW   CYSTEINE BIOSYNTHESIS; LYASE; PYRIDOXAL PHOSPHATE.
FT   INIT_MET      0      0
FT   BINDING     203    203       PYRIDOXAL PHOSPHATE (BY SIMILARITY).
SQ   SEQUENCE   393 AA;  42411 MW;  55BA2771 CRC32;
     TLQESDKFAT KAIHAGEHVD VHGSVIEPIS LSTTFKQSSP ANPIGTYEYS RSQNPNRENL
     ERAVAALENA QYGLAFSSGS ATTATILQSL PQGSHAVSIG DVYGGTHRYF TKVANAHGVE
     TSFTNDLLND LPQLIKENTK LVWIETPTNP TLKVTDIQKV ADLIKKHAAG QDVILVVDNT
     FLSPYISNPL NFGADIVVHS ATKYINGHSD VVLGVLATNN KPLYERLQFL QNAIGAIPSP
     FDAWLTHRGL KTLHLRVRQA ALSANKIAEF LAADKENVVA VNYPGLKTHP NYDVVLKQHR
     DALGGGMISF RIKGGAEAAS KFASSTRLFT LAESLGGIES LLEVPAVMTH GGIPKEAREA
     SGVFDDLVRI SVGIEDTDDL LEDIKQALKQ ATN
//
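Because every line of a SWISS-PROT entry starts with a two-letter line code, the format is easy to read by program. The following is a minimal sketch, not an official parser: it assumes the layout shown in the entry above (code in columns 1-2, content from column 6, sequence lines starting with blanks, `//` ending an entry). The names `parse_entries` and `keywords` are illustrative, and the sample sequence is deliberately truncated to one line.

```python
from collections import defaultdict

# A shortened copy of the CYS3_YEAST entry; the sequence is truncated on purpose.
SAMPLE = """\
ID   CYS3_YEAST     STANDARD;      PRT;   393 AA.
AC   P31373;
DE   CYSTATHIONINE GAMMA-LYASE (EC 4.4.1.1) (GAMMA-CYSTATHIONASE).
KW   CYSTEINE BIOSYNTHESIS; LYASE; PYRIDOXAL PHOSPHATE.
SQ   SEQUENCE   393 AA;  42411 MW;  55BA2771 CRC32;
     TLQESDKFAT KAIHAGEHVD VHGSVIEPIS
//
"""


def parse_entries(text):
    """Yield one dict per entry, mapping line code -> list of line contents."""
    entry = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines between entries
        if line.startswith("//"):
            if entry:
                yield dict(entry)
            entry = defaultdict(list)
            continue
        code, content = line[:2], line[5:].rstrip()
        if code == "  ":
            # Sequence lines carry no line code; store them without spaces.
            code, content = "seq", content.replace(" ", "")
        entry[code].append(content)


def keywords(entry):
    """Split the KW line(s) of a parsed entry into individual keywords."""
    joined = " ".join(entry.get("KW", [])).rstrip(".")
    return [kw.strip() for kw in joined.split(";") if kw.strip()]


entries = list(parse_entries(SAMPLE))
first = entries[0]
sequence = "".join(first["seq"])
```

Grouping lines by code keeps multi-line fields such as CC or DR together, which is usually the first step before extracting cross-references or comments.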
UNIPROT
A new protein sequence database that is the result of a merger of SWISS-PROT and PIR. It will be the annotated, curated protein sequence database. Data in UniProt are primarily derived from coding sequence annotations in EMBL (GenBank/DDBJ) nucleic acid sequence data. UniProt is a flat-file database just like EMBL and GenBank. Its flat-file format is SwissProt-like, or EMBL-like.
TREMBL
TrEMBL (Translated EMBL) was created in 1996 as a computer-annotated protein sequence database supplementing the SWISS-PROT Protein Sequence Data Bank. TrEMBL contains the translations of all coding sequences (CDS) present in the EMBL Nucleotide Sequence Database that are not yet integrated in SWISS-PROT. TrEMBL can be considered a preliminary section of SWISS-PROT. SWISS-PROT accession numbers have been assigned to all TrEMBL entries that should eventually be upgraded to standard SWISS-PROT quality.
TrEMBL has two sections:
SP-TrEMBL contains entries that will eventually be incorporated into SWISS-PROT, but that have not been manually annotated.
REM-TrEMBL contains sequences that are not destined to be included in SWISS-PROT (e.g. peptides of fewer than 8 aa, and synthetic sequences).
NRL-3D
The NRL-3D database is produced by PIR from sequences extracted from the Brookhaven Protein Data Bank (PDB). Titles, biological sources, bibliographic references and MEDLINE references are included, together with secondary structure, active site, binding site and modified site annotations, and details of experimental methods. It is a valuable resource, as it makes the sequence information in the PDB available both for keyword interrogation and for similarity searches.
There are also many specialized protein databases for specific families or groups of proteins.
Examples: YPD (yeast proteins), AMSDb (antibacterial peptides), GPCRDB (7 TM receptors), IMGT (immune system), etc.
PDB
Protein Data Bank
Protein and nucleic acid (NA) 3D structures
Sequence present
YAFFF
Main record types: HEADER, COMPND, SOURCE, AUTHOR, DATE, JRNL, REMARK, SEQRES, ATOM (coordinates)
Example entry (1DGC):
HEADER    LEUCINE ZIPPER                          15-JUL-93   1DGC
COMPND    GCN4 LEUCINE ZIPPER COMPLEXED WITH SPECIFIC
COMPND   2 ATF/CREB SITE DNA
SOURCE    GCN4: YEAST (SACCHAROMYCES CEREVISIAE); DNA: SYNTHETIC
AUTHOR    T.J.RICHMOND
REVDAT   1   22-JUN-94 1DGC    0
JRNL        AUTH   P.KONIG,T.J.RICHMOND
JRNL        TITL   THE X-RAY STRUCTURE OF THE GCN4-BZIP BOUND TO
JRNL        TITL 2 ATF/CREB SITE DNA SHOWS THE COMPLEX DEPENDS ON DNA
JRNL        TITL 3 FLEXIBILITY
JRNL        REF    J.MOL.BIOL.                   V. 233   139 1993
JRNL        REFN   ASTM JMOBAK  UK ISSN 0022-2836                  0070
REMARK   1
REMARK   2
REMARK   2 RESOLUTION. 3.0  ANGSTROMS.
REMARK   3
REMARK   3 REFINEMENT.
REMARK   3   PROGRAM                    X-PLOR
REMARK   3   AUTHORS                    BRUNGER
REMARK   3   R VALUE                    0.216
REMARK   3   RMSD BOND DISTANCES        0.020  ANGSTROMS
REMARK   3   RMSD BOND ANGLES           3.86   DEGREES
REMARK   3
REMARK   3   NUMBER OF REFLECTIONS      3296
REMARK   3   RESOLUTION RANGE           10.0 - 3.0  ANGSTROMS
REMARK   3   DATA CUTOFF                3.0    SIGMA(F)
REMARK   3   PERCENT COMPLETION         98.2
REMARK   3
REMARK   3   NUMBER OF PROTEIN ATOMS                 456
REMARK   3   NUMBER OF NUCLEIC ACID ATOMS            386
REMARK   4
REMARK   4 GCN4: TRANSCRIPTIONAL ACTIVATOR OF GENES ENCODING FOR AMINO
REMARK   4 ACID BIOSYNTHETIC ENZYMES.
REMARK   5
REMARK   5 AMINO ACIDS NUMBERING (RESIDUE NUMBER) CORRESPONDS TO THE
REMARK   5 281 AMINO ACIDS OF INTACT GCN4.
REMARK   6
REMARK   6 BZIP SEQUENCE 220 - 281 USED FOR CRYSTALLIZATION.
REMARK   7
REMARK   7 MODEL FROM AMINO ACIDS 227 - 281 SINCE AMINO ACIDS 220 -
REMARK   7 226 ARE NOT WELL ORDERED.
REMARK   8
REMARK   8 RESIDUE NUMBERING OF NUCLEOTIDES:
REMARK   8  5'  T   G   G   A   G   A   T   G   A   C   G   T   C   A   T   C   T   C   C
REMARK   8     -10 -9  -8  -7  -6  -5  -4  -3  -2  -1   1   2   3   4   5   6   7   8   9
REMARK   9
REMARK   9 THE ASYMMETRIC UNIT CONTAINS ONE HALF OF PROTEIN/DNA
REMARK   9 COMPLEX PER ASYMMETRIC UNIT.
REMARK  10
REMARK  10 MOLECULAR DYAD AXIS OF PROTEIN DIMER AND PALINDROMIC HALF
REMARK  10 SITES OF THE DNA COINCIDES WITH CRYSTALLOGRAPHIC TWO-FOLD
REMARK  10 AXIS. THE FULL PROTEIN/DNA COMPLEX CAN BE OBTAINED BY
REMARK  10 APPLYING THE FOLLOWING TRANSFORMATION MATRIX AND
REMARK  10 TRANSLATION VECTOR TO THE COORDINATES X Y Z:
REMARK  10
REMARK  10       0 -1  0     X     117.32     X SYMM
REMARK  10      -1  0  0     Y  +  117.32  =  Y SYMM
REMARK  10       0  0 -1     Z      43.33     Z SYMM
SEQRES   1 A   62  ILE VAL PRO GLU SER SER ASP PRO ALA ALA LEU LYS ARG
SEQRES   2 A   62  ALA ARG ASN THR GLU ALA ALA ARG ARG SER ARG ALA ARG
SEQRES   3 A   62  LYS LEU GLN ARG MET LYS GLN LEU GLU ASP LYS VAL GLU
SEQRES   4 A   62  GLU LEU LEU SER LYS ASN TYR HIS LEU GLU ASN GLU VAL
SEQRES   5 A   62  ALA ARG LEU LYS LYS LEU VAL GLY GLU ARG
SEQRES   1 B   19    T   G   G   A   G   A   T   G   A   C   G   T   C
SEQRES   2 B   19    A   T   C   T   C   C
HELIX    1   A ALA A  228  LYS A  276  1
CRYST1   58.660   58.660   86.660  90.00  90.00  90.00 P 41 21 2     8
ORIGX1      1.000000  0.000000  0.000000        0.00000
ORIGX2      0.000000  1.000000  0.000000        0.00000
ORIGX3      0.000000  0.000000  1.000000        0.00000
SCALE1      0.017047  0.000000  0.000000        0.00000
SCALE2      0.000000  0.017047  0.000000        0.00000
SCALE3      0.000000  0.000000  0.011539        0.00000
ATOM      1  N   PRO A 227      35.313 108.011  15.140  1.00 38.94
ATOM      2  CA  PRO A 227      34.172 107.658  15.972  1.00 39.82
[Figure: the structure displayed in RasMol.]
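The ATOM records at the end of a PDB file are fixed-width: each field sits in set columns, so they are read by slicing rather than by splitting on spaces. The sketch below assumes the standard PDB column layout (serial in columns 7-11, atom name 13-16, residue name 18-20, chain 22, residue number 23-26, x/y/z in 31-54, occupancy 55-60, temperature factor 61-66). The `Atom`, `format_atom` and `parse_atom` names are illustrative; the two atoms are the ones listed in the 1DGC entry.

```python
import math
from typing import NamedTuple


class Atom(NamedTuple):
    serial: int
    name: str
    res_name: str
    chain: str
    res_seq: int
    x: float
    y: float
    z: float
    occupancy: float
    b_factor: float


def format_atom(a: Atom) -> str:
    """Write an Atom as a fixed-width ATOM record (66 columns)."""
    # Names shorter than four characters start in column 14, not 13.
    name = a.name if len(a.name) == 4 else f" {a.name:<3}"
    return (
        f"ATOM  {a.serial:>5} {name} {a.res_name:>3} {a.chain}{a.res_seq:>4}    "
        f"{a.x:8.3f}{a.y:8.3f}{a.z:8.3f}{a.occupancy:6.2f}{a.b_factor:6.2f}"
    )


def parse_atom(line: str) -> Atom:
    """Read an ATOM record by column position (0-based slices)."""
    return Atom(
        serial=int(line[6:11]),
        name=line[12:16].strip(),
        res_name=line[17:20].strip(),
        chain=line[21],
        res_seq=int(line[22:26]),
        x=float(line[30:38]),
        y=float(line[38:46]),
        z=float(line[46:54]),
        occupancy=float(line[54:60]),
        b_factor=float(line[60:66]),
    )


# The two ATOM records shown in the 1DGC entry.
n_atom = Atom(1, "N", "PRO", "A", 227, 35.313, 108.011, 15.140, 1.00, 38.94)
ca_atom = Atom(2, "CA", "PRO", "A", 227, 34.172, 107.658, 15.972, 1.00, 39.82)

records = [format_atom(n_atom), format_atom(ca_atom)]
parsed = [parse_atom(r) for r in records]

# Distance between the N and CA atoms, in angstroms.
n_ca_distance = math.dist(parsed[0][5:8], parsed[1][5:8])
```

Viewers such as RasMol read the same coordinates to draw the structure; here they are only used to compute one bond length.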
Composite databases and their primary sources:
OWL: SWISS-PROT, PIR, GenBank, NRL-3D
MIPSX: PIR, MIPS, NRL-3D, SWISS-PROT, EMBL translation, GenBank translation, Kabat (immuno), PseqIP
SPTrEMBL *: SWISS-PROT, SPTrEMBL, TrEMBLnew
SECONDARY DATABASES
Secondary databases are those which contain reports of analyses of the sequences in the primary sources. They are either manually curated (e.g. PROSITE, Pfam) or automatically generated (e.g. ProDom, DOMO). Some depend on the method used to detect whether a protein belongs to a particular domain/family (patterns, profiles, HMMs).
Secondary db    Primary source        Information
PROSITE         SWISS-PROT            Patterns (regular expressions)
PROSITE         SWISS-PROT            Profiles (weighted matrices)
PRINTS          OWL and SWISS-PROT    Aligned motifs (fingerprints)
Pfam            SWISS-PROT            HMM (hidden Markov models)
BLOCKS
IDENTIFY
PROSITE
Created in 1988 (SIB). This was the first secondary db. It contains fully annotated functional domains, based on two methods: patterns and profiles.
It helps to determine to which family of proteins a new sequence might belong, or which domain(s) or functional site(s) it may contain.
The process used to derive patterns involves the construction of a multiple alignment and manual inspection to identify the conserved regions.
Aug 2000: contains 1064 documentation entries that describe 1424 different patterns, rules and profiles/matrices.
Each entry holds the pattern/profile with the list of all matches in the parent version of SWISS-PROT, together with its documentation.
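A PROSITE pattern is essentially a regular expression written in its own notation, so it can be translated and run with any regex engine. The sketch below covers only the core of the syntax as commonly documented: `-` separates positions, `x` is any residue, `[...]` lists allowed residues, `{...}` lists forbidden ones, `(n)` or `(n,m)` is a repeat, and `<` / `>` anchor the ends. The function names are illustrative, and rarer constructs are not handled. The example pattern is the N-glycosylation site `N-{P}-[ST]-{P}`; the test sequence is made up.

```python
import re

_ELEMENT = re.compile(r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Zx])(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):      # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        match = _ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = match.groups()
        if core == "x":                  # any residue
            core = "."
        elif core.startswith("{"):       # forbidden residues
            core = "[^" + core[1:-1] + "]"
        repeat = ""
        if low:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


def find_matches(pattern: str, sequence: str) -> list:
    """Return 0-based start positions of all (possibly overlapping) matches."""
    regex = prosite_to_regex(pattern)
    return [m.start() for m in re.finditer("(?=" + regex + ")", sequence)]


glyco_regex = prosite_to_regex("N-{P}-[ST]-{P}.")
glyco_hits = find_matches("N-{P}-[ST]-{P}.", "AANGSAANPSA")
```

A short pattern such as this one matches very many unrelated proteins, which is why the way matches are classified matters.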
Matches are classified as:
True positive: matched sequences which are related
True negative: unmatched sequences which are unrelated
False positive: an unrelated match
False negative: a correct match that fails completely to be diagnosed
[Figure: a PROSITE entry, showing the diagnostic performance and the list of matches.]
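The four categories can be counted directly once it is known which sequences truly belong to the family and which ones the pattern matched. The sketch below uses made-up identifiers (P1 to P6) and a hypothetical helper, `diagnostic_counts`; it is an illustration of the bookkeeping, not of how PROSITE itself compiles its statistics.

```python
def diagnostic_counts(all_ids, family_ids, matched_ids):
    """Count true/false positives/negatives for one pattern.

    all_ids     -- every sequence in the searched database
    family_ids  -- sequences known to belong to the family
    matched_ids -- sequences matched by the pattern
    """
    all_ids, family_ids, matched_ids = set(all_ids), set(family_ids), set(matched_ids)
    return {
        "TP": len(matched_ids & family_ids),            # related and matched
        "FP": len(matched_ids - family_ids),            # unrelated but matched
        "FN": len(family_ids - matched_ids),            # related but missed
        "TN": len(all_ids - family_ids - matched_ids),  # unrelated and not matched
    }


# Hypothetical database of six sequences, three of them true family members.
counts = diagnostic_counts(
    all_ids={"P1", "P2", "P3", "P4", "P5", "P6"},
    family_ids={"P1", "P2", "P3"},
    matched_ids={"P1", "P2", "P4"},
)
```

Here P3 is a false negative (a family member the pattern misses) and P4 a false positive (an unrelated sequence that happens to match).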
PROFILE
Variable regions between the conserved motifs also contain information. A discriminator, termed a profile, is used to indicate where insertions and deletions (INDELs) are allowed, what types of residue are allowed at what positions, and where the most conserved regions are. Profiles (weight matrices) provide a sensitive means of detecting distant sequence relationships.
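The weight-matrix idea can be illustrated with a much simplified sketch. Real profiles also carry insertion and deletion penalties; the version below handles only an ungapped motif, uses a uniform background over the 20 amino acids and a pseudocount of 1, and scores each window of a sequence by summing log-odds values. These choices, the aligned motif and the test sequence are all assumptions made for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_weight_matrix(aligned, pseudocount=1.0):
    """Build a log-odds weight matrix from equal-length, ungapped motif instances."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all motif instances must have the same length")
    background = 1.0 / len(AMINO_ACIDS)          # uniform background (assumption)
    total = len(aligned) + pseudocount * len(AMINO_ACIDS)
    matrix = []
    for column in range(length):
        counts = Counter(seq[column] for seq in aligned)
        matrix.append({
            aa: math.log2((counts[aa] + pseudocount) / total / background)
            for aa in AMINO_ACIDS
        })
    return matrix


def best_window(matrix, sequence):
    """Slide the matrix along the sequence; return (best score, 0-based start)."""
    width = len(matrix)
    scored = [
        (sum(matrix[i][sequence[start + i]] for i in range(width)), start)
        for start in range(len(sequence) - width + 1)
    ]
    return max(scored)


# Hypothetical aligned motif instances and a hypothetical query sequence.
motif = ["GKST", "GKSS", "GKTT", "AKST"]
matrix = build_weight_matrix(motif)
score, start = best_window(matrix, "MAAGKSTLLV")
```

Unlike a regular expression, the matrix gives a graded score, so a window that differs from the consensus at one position can still score well.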
PRINTS
Compendium of protein motif fingerprints.
Most protein families are characterized by several conserved motifs.
Fingerprint: a set of motif(s) (simple or composite, such as multidomains) that acts as a signature of family membership.
True family members exhibit all elements of the fingerprint, while subfamily members may possess only a part.
BLOCKS
The Blocks Database contains multiple alignments of conserved regions in protein families. The database can be searched by e-mail and World Wide Web (WWW) servers (http://blocks.fhcrc.org/help) to classify protein and nucleotide sequences.
Blocks are multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins. Block Searcher, Get Blocks and Block Maker are aids to detection and verification of protein sequence homology. They compare a protein or DNA sequence to a database of protein blocks, retrieve blocks, and create new blocks, respectively. The BLOCKS Database is based on InterPro entries with sequences from SWISS-PROT and TrEMBL, and with cross-references to PROSITE and/or PRINTS and/or SMART and/or PFAM and/or ProDom entries.
BLOCKS DATABASE
The blocks for the Blocks Database are made automatically by looking for the most highly conserved regions in groups of proteins documented in InterPro. The blocks created by Block Maker are created in the same manner as the blocks in the Blocks Database but with sequences provided by the user. Results are reported in a multiple sequence alignment format without calibration and in the standard Block format for searching.
FORMAT OF A BLOCK
ID   short_identifier; BLOCK
AC   block_number; distance from previous block = (min,max)
DE   description
BL   xxx motif; width=w; seqs=s; 99.5%=n1; strength=n2
.
.
.
//
ID line: starts a block entry and contains a short identifier for the group of sequences from which the block was made. If the block was taken from InterPro, it will be the InterPro group ID. The identifier is terminated by a semi-colon, and the word "BLOCK" indicates the entry type.
AC line: contains the block number, a seven-character group number for sequences from which the block was made.
DE line: contains a description of the group of sequences from which the block was made.
BL line: contains information about the block:
xxx = the amino acids in the spaced triplet found by MOTIF upon which the block is based.
w = width of the sequence segments (columns) in the block.
s = number of sequence segments (rows) in the block.
n1 = raw calibration score; 99.5th percentile score of true negative sequences. Raw search scores are normalized by dividing by this score and multiplying by 1000.
n2 = median normalized score of known true positive sequences as documented in InterPro.
Following the BL line are lines for each sequence with a segment in the block. The segments may be clustered, with clusters separated by blank lines. Each segment line contains a sequence identifier, the offset from the beginning of the sequence to the block in parentheses, the sequence segment, and a weight for the segment. The weights are normalized so that the most distant segment has a weight of 100.
// line: terminates a block entry.
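The BL line and the score normalization described above can be sketched in a few lines. The regular expression follows the BL layout given in the format description; the sample line and all its numbers are invented for illustration, and `parse_bl_line` and `normalize_score` are hypothetical helper names rather than part of any Blocks software.

```python
import re

_BL = re.compile(
    r"BL\s+(\S+)\s+motif;\s*width=(\d+);\s*seqs=(\d+);"
    r"\s*99\.5%=(\d+);\s*strength=(\d+)"
)


def parse_bl_line(line):
    """Extract the fields of a BL line into a dict."""
    match = _BL.match(line)
    if match is None:
        raise ValueError(f"not a BL line: {line!r}")
    motif, width, seqs, calibration, strength = match.groups()
    return {
        "motif": motif,                   # xxx: spaced triplet found by MOTIF
        "width": int(width),              # w: columns in the block
        "seqs": int(seqs),                # s: rows in the block
        "calibration": int(calibration),  # n1: 99.5th percentile of true negatives
        "strength": int(strength),        # n2: median normalized true-positive score
    }


def normalize_score(raw_score, calibration):
    """Normalize a raw search score: divide by n1 and multiply by 1000."""
    return raw_score / calibration * 1000


# Invented example line; the values do not come from a real block.
block = parse_bl_line("BL   AAA motif; width=20; seqs=12; 99.5%=800; strength=1200")
normalized = normalize_score(1200, block["calibration"])
```

With this scaling, a normalized score of 1000 corresponds to the 99.5th percentile of true negatives, so scores well above 1000 are the interesting ones.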
PFAM
Each family has the following data:
A seed alignment, which is a hand-edited multiple alignment representing the family.
Hidden Markov Models (HMMs) derived from the seed alignment, which can be used to find new members of the domain and also to take a set of sequences and realign them to the model. One HMM is in ls mode (global); the other is in fs mode (local).
A full alignment, which is an automatic alignment of all the examples of the domain, using the two HMMs to find and then align the sequences.
Annotation, which contains a brief description of the domain, links to other databases and some Pfam-specific data recording how the family was constructed.
Pfam comprises two parts: (1) Pfam-A families, which are manually annotated and consist of a representative seed alignment, hidden Markov models (HMMs), and a full alignment of all sequences that score above the curated threshold; and (2) Pfam-B families, automatically generated clusters of similar sequence regions not matched by Pfam-A, which often indicate the presence of a domain. Many of the Pfam-A families are arranged into a hierarchical classification, termed clans. You can access and download the Pfam data via the website at http://pfam.sanger.ac.uk
The data and additional features are accessible via the four websites
1. Go to the PROSITE site.
2. Under "Tools for PROSITE" choose ScanProsite.
3. Paste the sequence below into the box and tick the option "Exclude patterns with a high probability of occurrence" (finding very common patterns will not tell you much about your protein).
MWAPRCRRFWSRWEQVAALLLLLLLLGVPPRSLALPPIRYSHAGICPNDMNPNLWVDAQSTCRRECETDQECETYEKCCPNVCGT KSCVAARYMDVKGKKGPVGMPKEATCDHFMCLQQGSECDIWDGQPVCKCKDRCEKEPSFTCASDGLTYYNRCYMDAEACSKGIT LAVVTCRYHFTWPNTSPPPPETTMHPTTASPETPELDMAAPALLNNPVHQSVTMGETVSFCDVVGRPRPEITWEKQLEDRENVVM RPNHVRGNVVVTNIAQLVIYNAQLQDAGIYTCTARNVAGVLRADFPLSVVRGHQAAATSESSPNGTAFPAAELKPPDSEDCGEEQT RWHFDAQANNCLTFTFGHCHRNLNHFETYEACMLACMSGPLAACSLPALQGPCKAYAPRWAYNSQTGQCQSFVYGGCEGNGNN FESREACEESPFPRGNQRCRACKPRQKLVTSFCRSDFVILGRVSELTEEPDSGRALVTVDEVLKDEKMGLKFLGQEPLEVTLLHVD WACPCPNVTVSEMPLIIMGEVDGGMAMLRPDSFVGASSARRVRKLREVMHKKTCDVLKEFLGLH
4. Start the scan. Which motifs are found?
3. Paste the sequence into the query sequence window and adjust the options as necessary. You won't need to specify advanced options, but you should choose a program and database. For simplicity, use e.g. the UniProtKB database.
4. Run the search and identify the protein. Use the link provided to see the UniProtKB/SWISS-PROT report.
4. Search Pfam.
1. Which domains are found?
2. What may be the function of this protein?