Biological Databases

WHAT IS A DATABASE ?
A collection of data that is:
structured
searchable (index) -> table of contents
updated periodically (release) -> new edition
cross-referenced (hyperlinks) -> links with other db
A database also includes the associated tools (software) needed for db access, db updating, db information insertion and db information deletion.
DATABASES
A database can be viewed as four layers, from what the user sees down to the raw data:

Layer                Examples
Information system   the UBC library, Google, Entrez, SRS
Query system         a list you look at, a catalogue, indexed files, SQL, grep
Storage system       boxes, bookshelves, Oracle, MySQL, PC binary files, Unix text files
Data                 GenBank flat file, PDB file, interaction record, title of a book, book
TYPES OF DATABASE
There are many different database types, depending both on:
the nature of the information being stored (e.g. sequences or structures)
the manner of data storage (whether in flat files or in tables)
DATABASES: A SIMPLE EXAMPLE

Accession number: 1
First Name: Amos
Last Name: Bairoch
Course: DEA=oct-nov-dec 2000
http://expasy4.expasy.ch/people/amos.html
//
Accession number: 2
First Name: Laurent
Last Name: Falquet
Course: EMBnet=sept 2000;DEA=oct-nov-dec 2000;
//
Accession number: 3
First Name: Marie-Claude
Last Name: Blatter Garin
Course: EMBnet=sept 2000;DEA=oct-nov-dec 2000;
http://expasy4.expasy.ch/people/Marie-Claude.Blatter-Garin.html
//
Introduction To Database Teacher Database (ITDTdb) (flat file, 3 entries)
Easy to manage: all the entries are visible at the same time !
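Reading such a flat file takes only a few lines of code. The following is a minimal sketch, assuming each field sits on its own "Key: value" line and each entry ends with a "//" line; the function name parse_flat_file and the lower-cased keys are illustrative choices, not part of any real tool.

```python
def parse_flat_file(text):
    """Split a '//'-terminated flat file into a list of dictionaries."""
    entries, current = [], {}
    for raw in text.splitlines():
        line = raw.strip()
        if line == "//":                      # end-of-entry marker
            if current:
                entries.append(current)
            current = {}
        elif line.startswith("http"):         # bare URL line, no key
            current["url"] = line
        elif ":" in line:
            key, value = line.split(":", 1)
            current[key.strip().lower()] = value.strip()
    return entries


ITDTDB = """\
Accession number: 1
First Name: Amos
Last Name: Bairoch
Course: DEA=oct-nov-dec 2000
http://expasy4.expasy.ch/people/amos.html
//
Accession number: 2
First Name: Laurent
Last Name: Falquet
Course: EMBnet=sept 2000;DEA=oct-nov-dec 2000;
//
"""

teachers = parse_flat_file(ITDTDB)
for entry in teachers:
    print(entry["accession number"], entry["first name"], entry["last name"])
```

Every query has to scan the whole file, which is fine for three entries but is the reason larger flat-file databases rely on separate index files.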
DATABASES: A SIMPLE EXAMPLE (CONT.)

Relational database (table file):

Teacher    Accession number   Education
Amos       1                  Biochemistry
Laurent    2                  Biochemistry
M-Claude   3                  Biochemistry

Course   Date               Involved teachers
DEA      Oct-nov-dec 2000   1,3
EMBnet   Sept 2000          2,3
Easier to manage; choice of the output
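The "choice of the output" can be illustrated with a small relational sketch. It uses Python's built-in sqlite3 module and an in-memory database; the table and column names (teacher, course, teaches) are assumptions made for this example, and the rows are copied from the two tables above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE teacher (accession INTEGER PRIMARY KEY, name TEXT, education TEXT);
CREATE TABLE course  (name TEXT PRIMARY KEY, date TEXT);
-- link table: one row per (course, involved teacher) pair
CREATE TABLE teaches (course TEXT REFERENCES course(name),
                      teacher INTEGER REFERENCES teacher(accession));
""")
conn.executemany("INSERT INTO teacher VALUES (?, ?, ?)", [
    (1, "Amos", "Biochemistry"),
    (2, "Laurent", "Biochemistry"),
    (3, "M-Claude", "Biochemistry"),
])
conn.executemany("INSERT INTO course VALUES (?, ?)", [
    ("DEA", "Oct-nov-dec 2000"),
    ("EMBnet", "Sept 2000"),
])
conn.executemany("INSERT INTO teaches VALUES (?, ?)", [
    ("DEA", 1), ("DEA", 3), ("EMBnet", 2), ("EMBnet", 3),
])


def teachers_of(course):
    """Return the names of the teachers involved in one course."""
    rows = conn.execute(
        "SELECT t.name FROM teacher t "
        "JOIN teaches x ON x.teacher = t.accession "
        "WHERE x.course = ? ORDER BY t.accession",
        (course,),
    )
    return [name for (name,) in rows]


print(teachers_of("EMBnet"))
```

The same data can be returned per course, per teacher or per date simply by changing the query, without touching how the rows are stored.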
BIOLOGICAL DATABASE
A biological database is a collection of data organized so that its contents can easily be accessed, managed, and updated. Preparing a database can be divided into two activities:
Collecting the data in a form that can be easily accessed
Making the data available on a multi-user system (always available to the user)
WHY BIOLOGICAL DATABASES ?
Explosive growth in biological data
Data (sequences, 3D structures, 2D gel analysis, MS analysis, etc.) are no longer published in the conventional manner, but submitted directly to databases
Databases are essential tools for biological research, as classical publications used to be
Databases in general can be classified into primary, secondary and composite databases.
PRIMARY SEQUENCE DATABASES
A primary database contains information on the sequence or structure alone. Examples include SWISS-PROT and PIR for protein sequences, GenBank and DDBJ for nucleotide sequences, and the Protein Data Bank for protein structures.
Nucleic acid: EMBL, GenBank, DDBJ
Protein: PIR, MIPS, SWISS-PROT, TrEMBL, NRL-3D
SEQUENCE DATABASES
Primary DNA: DDBJ/EMBL/GenBank
Primary protein: GenPept/TrEMBL
Curated DB: RefSeq (genomic, mRNA and protein); Swiss-Prot & PIR -> UniProt (protein)
NUCLEIC ACID SEQUENCE DATABASES
The principal DNA sequence databases are DDBJ, EMBL and GenBank, which exchange data on a daily basis to ensure comprehensive coverage at each of the sites.
[Figure: the three databases exchange data with each other; each one receives its own submissions and updates.]

Database   Host institute   Retrieval tool
GenBank    NCBI (NIH)       Entrez
EMBL       EBI (EMBL)       SRS
DDBJ       CIB (NIG)        getentry
WHAT IS GENBANK?
GenBank is the NIH genetic sequence database of all publicly available DNA and derived protein sequences, with annotations describing the biological information these records contain.
http://www.ncbi.nlm.nih.gov/Genbank/GenbankOverview.html
Benson et al., 2004, Nucleic Acids Res. 32:D23-D26

GENBANK FLAT FILE (GBFF)

LOCUS       MUSNGH       1803 bp    mRNA            ROD       29-AUG-1997
DEFINITION  Mouse neuroblastoma and rat glioma hybridoma cell line NG108-15
            cell TA20 mRNA, complete cds.
ACCESSION   D25291
NID         g1850791
KEYWORDS    neurite extension activity; growth arrest; TA20.
SOURCE      Murinae gen. sp. mouse neuroblastma-rat glioma hybridoma
            cell_line:NG108-15 cDNA to mRNA.
  ORGANISM  Murinae gen. sp.
            Eukaryotae; mitochondrial eukaryotes; Metazoa; Chordata;
            Vertebrata; Mammalia; Eutheria; Rodentia; Sciurognathi; Muridae;
            Murinae.
REFERENCE   1  (sites)
  AUTHORS   Tohda,C., Nagai,S., Tohda,M. and Nomura,Y.
  TITLE     A novel factor, TA20, involved in neuronal differentiation: cDNA
            cloning and expression
  JOURNAL   Neurosci. Res. 23 (1), 21-27 (1995)
  MEDLINE   96064354
REFERENCE   3  (bases 1 to 1803)
  AUTHORS   Tohda,C.
  TITLE     Direct Submission
  JOURNAL   Submitted (18-NOV-1993) to the DDBJ/EMBL/GenBank databases.
            Chihiro Tohda, Toyama Medical and Pharmaceutical University,
            Research Institute for Wakan-yaku, Analytical Research Center
            for Ethnomedicines; 2630 Sugitani, Toyama, Toyama 930-01, Japan
            (E-mail:[email protected], Tel:+81-764-34-2281(ex.2841),
            Fax:+81-764-34-5057)
COMMENT     On Feb 26, 1997 this sequence version replaced gi:793764.
FEATURES             Location/Qualifiers
     source          1..1803
                     /organism="Murinae gen. sp."
                     /note="source origin of sequence, either mouse or rat, has
                     not been identified"
                     /db_xref="taxon:39108"
                     /cell_line="NG108-15"
                     /cell_type="mouse neuroblastma-rat glioma hybridoma"
     misc_signal     156..163
                     /note="AP-2 binding site"
     GC_signal       647..655
                     /note="Sp1 binding site"
     TATA_signal     694..701
     gene            748..1311
                     /gene="TA20"
     CDS             748..1311
                     /gene="TA20"
                     /function="neurite extensiion activity and growth arrest
                     effect"
                     /codon_start=1
                     /db_xref="PID:d1005516"
                     /db_xref="PID:g793765"
                     /translation="MMKLWVPSRSLPNSPNHYRSFLSHTLHIRYNNSLFISNTHLSRR
                     KLRVTNPIYTRKRSLNIFYLLIPSCRTRLILWIIYIYRNLKHWSTSTVRSHSHSIYRL
                     RPSMRTNIILRCHSYYKPPISHPIYWNNPSRMNLRGLLSRQSHLDPILRFPLHLTIYY
                     RGPSNRSPPLPPRNRIKQPNRIKLRCR"
     polyA_site      1803
BASE COUNT      507 a    458 c    311 g    527 t
ORIGIN
        1 tcagtttttt tttttttttt tttttttttt tttttttttt tttttttttg ttgattcatg
       61 tccgtttaca tttggtaagt tcacaggcct cagtcaacac aattggactg ctcaggaaat
      121 cctccttggt gaccgcagta tacttggcct atgaacccaa gccacctatg gctaggtagg
      181 agaagctcaa ctgtagggct gactttggaa gagaatgcac atggctgtat cgacatttca
      241 catggtggac ctctggccag agtcagcagg ccgagggttc tcttccgggc tgctccctca
      301 ctgcttgact ctgcgtcagt gcgtccatac tgtgggcgga cgttattgct atttgccttc
      361 cattctgtac ggcattgcct ccatttagct ggagagggac agagcctggt tctctagggc
      421 gtttccattg gggcctggtg acaatccaaa agatgagggc tccaaacacc agaatcagaa
      481 ggcccagcgt atttgtaaaa acaccttctg gtgggaatga atggtacagg ggcgtttcag
      541 gacaaagaac agcttttctg tcactcccat gagaaccgtc gcaatcactg ttccgaagag
      601 gaggagtcca gaatacacgt gtatgggcat gacgattgcc cggagagagg cggagcccat
      661 ggaagcagaa agacgaaaaa cacacccatt atttaaaatt attaaccact cattcattga
      721 cctacctgcc ccatccaaca tttcatcatg atgaaacttt gggtcccttc taggagtctg
      781 cctaatagtc caaatcatta caggtctttt cttagccata cactacacat cagatacaat
      841 aacagccttt tcatcagtaa cacacatttg tcgagacgta aattacgggt gactaatccg
      901 atatatacac gcaaacggag cctcaatatt ttttatttgc ttattccttc atgtcggacg
      961 aggcttatat tatggatcat atacatttat agaaacctga aacattggag tacttctact
     1021 gttcgcagtc atagccacag catttatagg ctacgtcctt ccatgaggac aaatatcatt
     1081 ctgaggtgcc acagttatta caaacctcct atcagccatc ccatatattg gaacaaccct
     1141 agtcgaatga atttgagggg gcttctcagt agacaaagcc accttgaccc gattcttcgc
     1201 tttccacttc atcttaccat ttattatcgc ggccctagca atcgttcacc tcctcttcct
     1261 ccacgaaaca ggatcaaaca acccaacagg attaaactca gatgcagata aaattccatt
     1321 tcacccctac tatacatcaa agatatccta ggtatcctaa tcatattctt aattctcata
     1381 accctagtat tatttttccc agacatacta ggagacccag acaactacat accagctaat
     1441 ccactaaaca ccccacccca tattaaaccc gaatgatatt tcctatttgc atacgccatt
     1501 ctacgctcaa tccccaataa actaggaggt gtcctagcct taatcttatc tatcctaatt
     1561 ttagccctaa tacctttcct tcatacctca aagcaacgaa gcctaatatt ccgcccaatc
     1621 acacaaattt tgtactgaat cctagtagcc aacctactta tcttaacctg aattgggggc
     1681 caaccagtag acacccattt attatcattg gccaactagc ctccatctca tacttctcaa
     1741 tcatcttaat tcttatacca atctcaggaa ttatcgaaga caaaatacta aaattatatc
     1801 cat
//
Sections of the record: header (title, taxonomy, citation), features (including the amino acid sequence of the CDS), DNA sequence.
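Because every GenBank line starts with a keyword in a fixed position, the main sections can be pulled apart with plain string handling. The sketch below is only an illustration under simplifying assumptions: it reads the LOCUS name, the ACCESSION and the sequence after ORIGIN, and ignores everything else. The function name parse_genbank is made up for this example, and the sample record is a shortened excerpt of D25291 (first sequence line only).

```python
def parse_genbank(text):
    """Extract locus name, accession and sequence from one GenBank record."""
    record = {"locus": None, "accession": None, "sequence": ""}
    in_origin = False
    for line in text.splitlines():
        if line.startswith("//"):            # end of record
            break
        if in_origin:
            # sequence lines: position number followed by blocks of 10 bases
            record["sequence"] += "".join(ch for ch in line if ch.isalpha())
        elif line.startswith("LOCUS"):
            record["locus"] = line.split()[1]
        elif line.startswith("ACCESSION"):
            record["accession"] = line.split()[1]
        elif line.startswith("ORIGIN"):
            in_origin = True
    return record


SAMPLE = """\
LOCUS       MUSNGH       1803 bp    mRNA            ROD       29-AUG-1997
ACCESSION   D25291
ORIGIN
        1 tcagtttttt tttttttttt tttttttttt tttttttttt tttttttttg ttgattcatg
//
"""

rec = parse_genbank(SAMPLE)
print(rec["locus"], rec["accession"], len(rec["sequence"]))
```

For real work an established parser is the safer choice; this sketch only shows why the flat-file layout is easy to process line by line.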
PROTEIN SEQUENCE DATABASES

PIR, MIPS, SWISS-PROT, TrEMBL, NRL-3D
PIR (THE PROTEIN INFORMATION RESOURCE)
The protein sequence database was developed at the National Biomedical Research Foundation (NBRF) in the early 1960s by Margaret Dayhoff. In its current form, PIR is split into four sections:

PIR1: fully classified and annotated entries
PIR2: preliminary entries, which have not been thoroughly reviewed and may contain redundancy
PIR3: unverified entries, which have not been thoroughly reviewed
PIR4: entries that fall into one of four categories:

1. Conceptual translations of artefactual sequences
2. Conceptual translations of sequences that are not transcribed or translated
3. Protein sequences or conceptual translations that are extensively genetically engineered
4. Sequences that are not genetically encoded and not produced on ribosomes
SWISS-PROT
SWISS-PROT is a protein sequence database that was produced collaboratively by the Department of Medical Biochemistry at the University of Geneva and the EMBL.
After 1994 the collaboration moved to EMBL's UK outstation, the EBI. In April 1998, it moved to the Swiss Institute of Bioinformatics (SIB).
SWISS-PROT
SWISS-PROT entries incorporate information on:
Function of the protein
Post-translational modifications
Domains and sites
Secondary structure
Quaternary structure
Similarities to other proteins
Diseases associated with deficiencies in the protein
Sequence conflicts, variants, etc.
SWISS-PROT
ID   CYS3_YEAST     STANDARD;      PRT;   393 AA.
AC   P31373;
DT   01-JUL-1993 (REL. 26, CREATED)
DE   CYSTATHIONINE GAMMA-LYASE (EC 4.4.1.1) (GAMMA-CYSTATHIONASE).
GN   CYS3 OR CYI1 OR STR1 OR YAL012W OR FUN35.
OS   TAXONOMY
OC   SACCHAROMYCETACEAE; SACCHAROMYCES.
RX   CITATION
CC   -!- CATALYTIC ACTIVITY: L-CYSTATHIONINE + H(2)O = L-CYSTEINE +
CC       NH(3) + 2-OXOBUTANOATE.
CC   -!- COFACTOR: PYRIDOXAL PHOSPHATE.
CC   -!- PATHWAY: FINAL STEP IN THE TRANS-SULFURATION PATHWAY SYNTHESIZING
CC       L-CYSTEINE FROM L-METHIONINE.
CC   -!- SUBUNIT: HOMOTETRAMER.
CC   -!- SUBCELLULAR LOCATION: CYTOPLASMIC.
CC   -!- SIMILARITY: BELONGS TO THE TRANS-SULFURATION ENZYMES FAMILY.
CC   --------------------------------------------------------------------------
CC   DISCLAIMER
CC   --------------------------------------------------------------------------
DR   DATABASE cross-reference
KW   CYSTEINE BIOSYNTHESIS; LYASE; PYRIDOXAL PHOSPHATE.
FT   INIT_MET      0      0
FT   BINDING     203    203       PYRIDOXAL PHOSPHATE (BY SIMILARITY).
SQ   SEQUENCE   393 AA;  42411 MW;  55BA2771 CRC32;
     TLQESDKFAT KAIHAGEHVD VHGSVIEPIS LSTTFKQSSP ANPIGTYEYS RSQNPNRENL
     ERAVAALENA QYGLAFSSGS ATTATILQSL PQGSHAVSIG DVYGGTHRYF TKVANAHGVE
     TSFTNDLLND LPQLIKENTK LVWIETPTNP TLKVTDIQKV ADLIKKHAAG QDVILVVDNT
     FLSPYISNPL NFGADIVVHS ATKYINGHSD VVLGVLATNN KPLYERLQFL QNAIGAIPSP
     FDAWLTHRGL KTLHLRVRQA ALSANKIAEF LAADKENVVA VNYPGLKTHP NYDVVLKQHR
     DALGGGMISF RIKGGAEAAS KFASSTRLFT LAESLGGIES LLEVPAVMTH GGIPKEAREA
     SGVFDDLVRI SVGIEDTDDL LEDIKQALKQ ATN
ID   CYS3_YEAST     STANDARD;      PRT;   393 AA.
AC   P31373;
DT   01-JUL-1993 (REL. 26, CREATED)
DT   01-JUL-1993 (REL. 26, LAST SEQUENCE UPDATE)
DT   01-NOV-1995 (REL. 32, LAST ANNOTATION UPDATE)
DE   CYSTATHIONINE GAMMA-LYASE (EC 4.4.1.1) (GAMMA-CYSTATHIONASE).
GN   CYS3 OR CYI1 OR STR1 OR YAL012W OR FUN35.
OS   SACCHAROMYCES CEREVISIAE (BAKER'S YEAST).
OC   EUKARYOTA; FUNGI; ASCOMYCOTA; HEMIASCOMYCETES; SACCHAROMYCETALES;
OC   SACCHAROMYCETACEAE; SACCHAROMYCES.
RN   [1]
RP   SEQUENCE FROM N.A., AND PARTIAL SEQUENCE.
RX   MEDLINE; 92250430. [NCBI, ExPASy, Israel, Japan]
RA   ONO B.-I., TANAKA K., NAITO K., HEIKE C., SHINODA S., YAMAMOTO S.,
RA   OHMORI S., OSHIMA T., TOH-E A.;
RT   "Cloning and characterization of the CYS3 (CYI1) gene of
RT   Saccharomyces cerevisiae.";
RL   J. BACTERIOL. 174:3339-3347(1992).
RN   [2]
RP   SEQUENCE FROM N.A., AND CHARACTERIZATION.
RC   STRAIN=DBY939;
RX   MEDLINE; 93328685. [NCBI, ExPASy, Israel, Japan]
RA   YAMAGATA S., D'ANDREA R.J., FUJISAKI S., ISAJI M., NAKAMURA K.;
RT   "Cloning and bacterial expression of the CYS3 gene encoding
RT   cystathionine gamma-lyase of Saccharomyces cerevisiae and the
RT   physicochemical and enzymatic properties of the protein.";
RL   J. BACTERIOL. 175:4800-4808(1993).
RN   [3]
RP   SEQUENCE FROM N.A.
RC   STRAIN=S288C / AB972;
RX   MEDLINE; 93289814. [NCBI, ExPASy, Israel, Japan]
RA   BARTON A.B., KABACK D.B., CLARK M.W., KENG T., OUELLETTE B.F.F.,
RA   STORMS R.K., ZENG B., ZHONG W.W., FORTIN N., DELANEY S., BUSSEY H.;
RT   "Physical localization of yeast CYS3, a gene whose product resembles
RT   the rat gamma-cystathionase and Escherichia coli cystathionine gamma-
RT   synthase enzymes.";
RL   YEAST 9:363-369(1993).
RN   [4]
RP   SEQUENCE FROM N.A.
RC   STRAIN=S288C / AB972;
RX   MEDLINE; 93209532. [NCBI, ExPASy, Israel, Japan]
RA   OUELLETTE B.F.F., CLARK M.W., KENG T., STORMS R.K., ZHONG W.W.,
RA   ZENG B., FORTIN N., DELANEY S., BARTON A.B., KABACK D.B., BUSSEY H.;
RT   "Sequencing of chromosome I from Saccharomyces cerevisiae: analysis
RT   of a 32 kb region between the LTE1 and SPO7 genes.";
RL   GENOME 36:32-42(1993).
RN   [5]
RP   SEQUENCE OF 1-18, AND CHARACTERIZATION.
RX   MEDLINE; 93289817. [NCBI, ExPASy, Israel, Japan]
RA   ONO B.-I., ISHII N., NAITO K., MIYOSHI S.-I., SHINODA S., YAMAMOTO S.,
RA   OHMORI S.;
RT   "Cystathionine gamma-lyase of Saccharomyces cerevisiae: structural
RT   gene and cystathionine gamma-synthase activity.";
RL   YEAST 9:389-397(1993).
CC   -!- CATALYTIC ACTIVITY: L-CYSTATHIONINE + H(2)O = L-CYSTEINE +
CC       NH(3) + 2-OXOBUTANOATE.
CC   -!- COFACTOR: PYRIDOXAL PHOSPHATE.
CC   -!- PATHWAY: FINAL STEP IN THE TRANS-SULFURATION PATHWAY SYNTHESIZING
CC       L-CYSTEINE FROM L-METHIONINE.
CC   -!- SUBUNIT: HOMOTETRAMER.
CC   -!- SUBCELLULAR LOCATION: CYTOPLASMIC.
CC   -!- SIMILARITY: BELONGS TO THE TRANS-SULFURATION ENZYMES FAMILY.
CC   --------------------------------------------------------------------------
CC   This SWISS-PROT entry is copyright. It is produced through a collaboration
CC   between the Swiss Institute of Bioinformatics and the EMBL outstation
CC   the European Bioinformatics Institute. There are no restrictions on its
CC   use by non-profit institutions as long as its content is in no way
CC   modified and this statement is not removed. Usage by and for commercial
CC   entities requires a license agreement (See http://www.isb-sib.ch/announce/
CC   or send an email to [email protected]).
CC   --------------------------------------------------------------------------
DR   EMBL; L05146; AAC04945.1; -. [EMBL / GenBank / DDBJ] [CoDingSequence]
DR   EMBL; L04459; AAA85217.1; -. [EMBL / GenBank / DDBJ] [CoDingSequence]
DR   EMBL; D14135; BAA03190.1; -. [EMBL / GenBank / DDBJ] [CoDingSequence]
DR   PIR; S31228; S31228.
DR   YEPD; 5280; -.
DR   SGD; L0000470; CYS3. [SGD / YPD]
DR   PFAM; PF01053; Cys_Met_Meta_PP; 1.
DR   PROSITE; PS00868; CYS_MET_METAB_PP; 1.
DR   DOMO; P31373.
DR   PRODOM [Domain structure / List of seq. sharing at least 1 domain]
DR   PROTOMAP; P31373.
DR   PRESAGE; P31373.
DR   SWISS-2DPAGE; GET REGION ON 2D PAGE.
KW   CYSTEINE BIOSYNTHESIS; LYASE; PYRIDOXAL PHOSPHATE.
FT   INIT_MET      0      0
FT   BINDING     203    203       PYRIDOXAL PHOSPHATE (BY SIMILARITY).
SQ   SEQUENCE   393 AA;  42411 MW;  55BA2771 CRC32;
     TLQESDKFAT KAIHAGEHVD VHGSVIEPIS LSTTFKQSSP ANPIGTYEYS RSQNPNRENL
     ERAVAALENA QYGLAFSSGS ATTATILQSL PQGSHAVSIG DVYGGTHRYF TKVANAHGVE
     TSFTNDLLND LPQLIKENTK LVWIETPTNP TLKVTDIQKV ADLIKKHAAG QDVILVVDNT
     FLSPYISNPL NFGADIVVHS ATKYINGHSD VVLGVLATNN KPLYERLQFL QNAIGAIPSP
     FDAWLTHRGL KTLHLRVRQA ALSANKIAEF LAADKENVVA VNYPGLKTHP NYDVVLKQHR
     DALGGGMISF RIKGGAEAAS KFASSTRLFT LAESLGGIES LLEVPAVMTH GGIPKEAREA
     SGVFDDLVRI SVGIEDTDDL LEDIKQALKQ ATN
//
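Each SWISS-PROT line begins with a two-letter line code (ID, AC, DE, KW, SQ and so on), which makes the entry straightforward to split by field. The sketch below is a minimal illustration, assuming the line code occupies the first two columns and sequence lines start with blanks; parse_swissprot is a hypothetical helper, and the sample is a shortened excerpt of the CYS3_YEAST entry.

```python
from collections import defaultdict


def parse_swissprot(entry):
    """Group the lines of one entry by line code; return (fields, sequence)."""
    fields = defaultdict(list)
    sequence = []
    for line in entry.splitlines():
        if line.startswith("//"):             # entry terminator
            break
        code, content = line[:2], line[2:].strip()
        if code == "  ":                      # sequence data lines are indented
            sequence.append(content.replace(" ", ""))
        elif code.strip():
            fields[code].append(content)
    return dict(fields), "".join(sequence)


SAMPLE = """\
ID   CYS3_YEAST     STANDARD;      PRT;   393 AA.
AC   P31373;
KW   CYSTEINE BIOSYNTHESIS; LYASE; PYRIDOXAL PHOSPHATE.
SQ   SEQUENCE   393 AA;  42411 MW;  55BA2771 CRC32;
     TLQESDKFAT KAIHAGEHVD
//
"""

fields, seq = parse_swissprot(SAMPLE)
keywords = [k.strip() for k in fields["KW"][0].rstrip(".").split(";")]
print(fields["AC"], keywords, seq)
```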
UNIPROT
UniProt is a new protein sequence database resulting from the merger of SWISS-PROT and PIR. It will be the annotated, curated protein sequence database.
Data in UniProt are primarily derived from coding sequence annotations in EMBL (GenBank/DDBJ) nucleic acid sequence data.
UniProt is a flat-file database, just like EMBL and GenBank. Its flat-file format is SWISS-PROT-like (that is, EMBL-like).
TREMBL
TrEMBL (Translated EMBL) was created in 1996 as a computer-annotated protein sequence database supplementing the SWISS-PROT Protein Sequence Data Bank. TrEMBL contains the translations of all coding sequences (CDS) present in the EMBL Nucleotide Sequence Database that are not yet integrated in SWISS-PROT. TrEMBL can be considered a preliminary section of SWISS-PROT. SWISS-PROT accession numbers have already been assigned to all TrEMBL entries that should eventually be upgraded to standard SWISS-PROT quality.
TREMBL
TrEMBL has two main sections:

SP-TrEMBL contains entries that will eventually be incorporated into SWISS-PROT, but that have not yet been manually annotated.
REM-TrEMBL contains sequences that are not destined to be included in SWISS-PROT (e.g. peptides of fewer than 8 amino acids, and synthetic sequences).
NRL-3D

The NRL-3D database is produced by PIR from sequences extracted from the Brookhaven Protein Data Bank (PDB). Titles, biological sources, bibliographic references and MEDLINE references are included, together with secondary structure, active site, binding site and modified site annotations, details of experimental methods, etc. It is a valuable resource, as it makes the sequence information in the PDB available both for keyword interrogation and for similarity searches.
There are also many specialized protein databases for specific families or groups of proteins.
Examples:
YPD (yeast proteins), AMSDb (antibacterial peptides), GPCRDB (7 TM receptors), IMGT (immune system), etc.
PDB
Protein Data Bank
Protein and nucleic acid 3D structures
Sequence present
YAFFF (yet another flat-file format)
PDB
Main record types: HEADER, COMPND, SOURCE, AUTHOR, REVDAT (date), JRNL, REMARK, SEQRES, ATOM (coordinates)

HEADER    LEUCINE ZIPPER                          15-JUL-93   1DGC
COMPND    GCN4 LEUCINE ZIPPER COMPLEXED WITH SPECIFIC
COMPND   2 ATF/CREB SITE DNA
SOURCE    GCN4: YEAST (SACCHAROMYCES CEREVISIAE); DNA: SYNTHETIC
AUTHOR    T.J.RICHMOND
REVDAT   1   22-JUN-94 1DGC    0
JRNL        AUTH   P.KONIG,T.J.RICHMOND
JRNL        TITL   THE X-RAY STRUCTURE OF THE GCN4-BZIP BOUND TO
JRNL        TITL 2 ATF/CREB SITE DNA SHOWS THE COMPLEX DEPENDS ON DNA
JRNL        TITL 3 FLEXIBILITY
JRNL        REF    J.MOL.BIOL.                   V. 233   139 1993
JRNL        REFN   ASTM JMOBAK  UK ISSN 0022-2836                 0070
REMARK   1
REMARK   2
REMARK   2 RESOLUTION. 3.0  ANGSTROMS.
REMARK   3
REMARK   3 REFINEMENT.
REMARK   3   PROGRAM                    X-PLOR
REMARK   3   AUTHORS                    BRUNGER
REMARK   3   R VALUE                    0.216
REMARK   3   RMSD BOND DISTANCES        0.020  ANGSTROMS
REMARK   3   RMSD BOND ANGLES           3.86   DEGREES
REMARK   3
REMARK   3   NUMBER OF REFLECTIONS      3296
REMARK   3   RESOLUTION RANGE           10.0 - 3.0  ANGSTROMS
REMARK   3   DATA CUTOFF                3.0    SIGMA(F)
REMARK   3   PERCENT COMPLETION         98.2
REMARK   3
REMARK   3   NUMBER OF PROTEIN ATOMS            456
REMARK   3   NUMBER OF NUCLEIC ACID ATOMS       386
REMARK   4
REMARK   4 GCN4: TRANSCRIPTIONAL ACTIVATOR OF GENES ENCODING FOR AMINO
REMARK   4 ACID BIOSYNTHETIC ENZYMES.
REMARK   5
REMARK   5 AMINO ACIDS NUMBERING (RESIDUE NUMBER) CORRESPONDS TO THE
REMARK   5 281 AMINO ACIDS OF INTACT GCN4.
REMARK   6
REMARK   6 BZIP SEQUENCE 220 - 281 USED FOR CRYSTALLIZATION.
REMARK   7
REMARK   7 MODEL FROM AMINO ACIDS 227 - 281 SINCE AMINO ACIDS 220 -
REMARK   7 226 ARE NOT WELL ORDERED.
REMARK   8
REMARK   8 RESIDUE NUMBERING OF NUCLEOTIDES:
REMARK   8  5'  T  G  G  A  G  A  T  G  A  C  G  T  C  A  T  C  T  C  C
REMARK   8    -10 -9 -8 -7 -6 -5 -4 -3 -2 -1  1  2  3  4  5  6  7  8  9
REMARK   9
REMARK   9 THE ASYMMETRIC UNIT CONTAINS ONE HALF OF PROTEIN/DNA
REMARK   9 COMPLEX PER ASYMMETRIC UNIT.
REMARK  10
REMARK  10 MOLECULAR DYAD AXIS OF PROTEIN DIMER AND PALINDROMIC HALF
REMARK  10 SITES OF THE DNA COINCIDES WITH CRYSTALLOGRAPHIC TWO-FOLD
REMARK  10 AXIS. THE FULL PROTEIN/DNA COMPLEX CAN BE OBTAINED BY
REMARK  10 APPLYING THE FOLLOWING TRANSFORMATION MATRIX AND
REMARK  10 TRANSLATION VECTOR TO THE COORDINATES X Y Z:
REMARK  10
REMARK  10      0 -1  0     X     117.32     X SYMM
REMARK  10     -1  0  0     Y  +  117.32  =  Y SYMM
REMARK  10      0  0 -1     Z      43.33     Z SYMM
SEQRES   1 A   62  ILE VAL PRO GLU SER SER ASP PRO ALA ALA LEU LYS ARG
SEQRES   2 A   62  ALA ARG ASN THR GLU ALA ALA ARG ARG SER ARG ALA ARG
SEQRES   3 A   62  LYS LEU GLN ARG MET LYS GLN LEU GLU ASP LYS VAL GLU
SEQRES   4 A   62  GLU LEU LEU SER LYS ASN TYR HIS LEU GLU ASN GLU VAL
SEQRES   5 A   62  ALA ARG LEU LYS LYS LEU VAL GLY GLU ARG
SEQRES   1 B   19    T   G   G   A   G   A   T   G   A   C   G   T   C
SEQRES   2 B   19    A   T   C   T   C   C
HELIX    1   A ALA A  228  LYS A  276  1
CRYST1   58.660   58.660   86.660  90.00  90.00  90.00 P 41 21 2     8
ORIGX1      1.000000  0.000000  0.000000        0.00000
ORIGX2      0.000000  1.000000  0.000000        0.00000
ORIGX3      0.000000  0.000000  1.000000        0.00000
SCALE1      0.017047  0.000000  0.000000        0.00000
SCALE2      0.000000  0.017047  0.000000        0.00000
SCALE3      0.000000  0.000000  0.011539        0.00000
ATOM      1  N   PRO A 227      35.313 108.011  15.140  1.00 38.94
ATOM      2  CA  PRO A 227      34.172 107.658  15.972  1.00 39.82
ATOM    842  C5    C B   9      57.692 100.286  22.744  1.00 29.82
ATOM    843  C6    C B   9      58.128 100.193  21.465  1.00 30.63
TER     844        C B   9
MASTER       46    0    0    1    0    0    0    6  842    2    0    7
END
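PDB records are fixed-width: the meaning of a value depends on its column, not on whitespace separation. The sketch below is a minimal illustration that reads the commonly used ATOM columns by slicing, assuming the standard column layout of the PDB coordinate section (serial in columns 7-11, atom name in 13-16, residue name in 18-20, chain in 22, residue number in 23-26, x/y/z in 31-54, occupancy in 55-60, temperature factor in 61-66). The function name parse_atom_record is invented for this example; the two sample lines are taken from the 1DGC entry.

```python
def parse_atom_record(line):
    """Parse one fixed-width PDB ATOM line into a dictionary."""
    if not line.startswith("ATOM"):
        raise ValueError("not an ATOM record")
    return {
        "serial": int(line[6:11]),
        "atom_name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "occupancy": float(line[54:60]),
        "b_factor": float(line[60:66]),
    }


ATOM_LINES = [
    "ATOM      1  N   PRO A 227      35.313 108.011  15.140  1.00 38.94",
    "ATOM    842  C5    C B   9      57.692 100.286  22.744  1.00 29.82",
]

atoms = [parse_atom_record(line) for line in ATOM_LINES]
for atom in atoms:
    print(atom["chain"], atom["res_name"], atom["res_seq"], atom["atom_name"],
          atom["x"], atom["y"], atom["z"])
```

Slicing by column rather than calling split() matters because adjacent coordinate fields can touch when values are large or negative.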
[Figure: 3D structure displayed with RasMol]
COMPOSITE PROTEIN SEQUENCE DATABASE

A composite database amalgamates a variety of different primary sources. A single composite is easier and much more efficient to search than each source in turn. Examples: NRDB, OWL, MIPSX, SWISS-PROT+TrEMBL.
COMPOSITE PROTEIN SEQUENCE DB

Different composite databases use different primary sources and different redundancy criteria in their amalgamation procedures. Sources are listed in order of redundancy priority:

Composite db   Primary sources (in priority order)
NRDB           PDB, SWISS-PROT, PIR, GenPept, SP update, GenPept update
OWL            SWISS-PROT, PIR, GenBank, NRL-3D
MIPSX          PIR, MIPS, NRL-3D, SWISS-PROT, EMBL translation, GenBank translation, Kabat (immuno), PseqIP
SPTrEMBL *     SWISS-PROT, SPTrEMBL, TrEMBLnew

* Also called SWall at EBI
SWIR: SPTrEMBL + Wormpep
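The idea of amalgamating sources by priority can be sketched in a few lines. This is an illustration only: it assumes the simplest possible redundancy criterion (two entries are redundant when their sequences are identical), whereas each real composite database applies its own rules. The function name, database labels and sequences below are toy values invented for the example.

```python
def merge_non_redundant(sources):
    """Merge sources given in priority order, keeping one entry per sequence.

    sources: list of (db_name, {entry_id: sequence}) pairs, highest priority first.
    Returns {sequence: (db_name, entry_id)} for the retained entries.
    """
    retained = {}
    for db_name, entries in sources:
        for entry_id, sequence in entries.items():
            # setdefault keeps the first (highest-priority) occurrence
            retained.setdefault(sequence.upper(), (db_name, entry_id))
    return retained


# Toy data: the second source repeats one sequence and adds a new one.
sources = [
    ("SWISS-PROT", {"SP_1": "MKLWVPSRS", "SP_2": "TLQESDKFAT"}),
    ("PIR",        {"PIR_9": "MKLWVPSRS", "PIR_7": "ACDVTLVV"}),
]

composite = merge_non_redundant(sources)
for sequence, (db_name, entry_id) in composite.items():
    print(db_name, entry_id, sequence)
```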
SECONDARY DATABASES
Secondary databases contain the results of analyses of the sequences in the primary sources. They are either manually curated (e.g. PROSITE, Pfam) or automatically generated (e.g. ProDom, DOMO). Some depend on the method used to detect whether a protein belongs to a particular domain or family (patterns, profiles, HMMs).
SECONDARY DATABASE
Secondary db   Primary source        Stored information
PROSITE        SWISS-PROT            Patterns (regular expressions)
PROSITE        SWISS-PROT            Profiles (weighted matrices)
PRINTS         OWL and SWISS-PROT    Aligned motifs (fingerprints)
Pfam           SWISS-PROT            HMMs (hidden Markov models)
BLOCKS         PROSITE/PRINTS        Aligned motifs
IDENTIFY       BLOCKS/PRINTS         Fuzzy regular expressions
PROSITE
Created in 1988 (SIB), PROSITE was the first secondary database. It contains fully annotated functional domains, based on two methods: patterns and profiles.
It helps to determine to which family of proteins a new sequence might belong, or which domain(s) or functional site(s) it may contain.
The process used to derive patterns involves the construction of a multiple alignment and manual inspection to identify the conserved regions.
Aug 2000: contains 1064 documentation entries that describe 1424 different patterns, rules and profiles/matrices.
Entries are deposited in PROSITE in two distinct files:
Patterns/profiles, with the lists of all matches in the parent version of SWISS-PROT
Documentation
STRUCTURE OF PROSITE ENTRIES
Entries are deposited in PROSITE in two distinct files:

Patterns: the patterns, with lists of all the matches in the parent version of SWISS-PROT.
Documentation: details of the characterized family and, where known, a description of the biological role of the chosen motif, with a supporting bibliography.
DETERMINING SIGNIFICANCE OF DATABASE MATCHES

True positive: a match to a sequence that is related
True negative: an unrelated sequence that is correctly not matched
False positive: a match to an unrelated sequence
False negative: a related sequence (a correct match) that fails to be diagnosed
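The four categories can be computed with set operations once the true family members and the matched sequences are known. The sketch below is illustrative: the function name classify_matches and the identifiers are toy values, not taken from any database release.

```python
def classify_matches(all_ids, family_ids, matched_ids):
    """Sort database sequences into the four match categories."""
    everything, family, matched = set(all_ids), set(family_ids), set(matched_ids)
    return {
        "true_positive": family & matched,            # related and matched
        "false_positive": matched - family,           # matched but unrelated
        "false_negative": family - matched,           # related but missed
        "true_negative": everything - family - matched,
    }


# Toy example: five sequences, three true family members, three matches.
result = classify_matches(
    all_ids=["s1", "s2", "s3", "s4", "s5"],
    family_ids=["s1", "s2", "s3"],
    matched_ids=["s1", "s2", "s4"],
)
for category, ids in result.items():
    print(category, sorted(ids))
```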
PROSITE (PATTERN): EXAMPLE

ID   EPO_TPO; PATTERN.
AC   PS00817;
DT   OCT-1993 (CREATED); NOV-1995 (DATA UPDATE); JUL-1998 (INFO UPDATE).
DE   Erythropoietin / thrombopoeitin signature.
PA   P-x(4)-C-D-x-R-[LIVM](2)-x-[KR]-x(14)-C.
NR   /RELEASE=38,80000;
NR   /TOTAL=14(14); /POSITIVE=14(14); /UNKNOWN=0(0); /FALSE_POS=0(0);
NR   /FALSE_NEG=0; /PARTIAL=1;
CC   /TAXO-RANGE=??E??; /MAX-REPEAT=1;
CC   /SITE=3,disulfide; /SITE=11,disulfide;
DR   P48617, EPO_BOVIN , T; P33707, EPO_CANFA , T; P33708, EPO_FELCA , T;
DR   P01588, EPO_HUMAN , T; P07865, EPO_MACFA , T; Q28513, EPO_MACMU , T;
DR   P07321, EPO_MOUSE , T; P49157, EPO_PIG   , T; P29676, EPO_RAT   , T;
DR   P33709, EPO_SHEEP , T; P42705, TPO_CANFA , T; P40225, TPO_HUMAN , T;
DR   P40226, TPO_MOUSE , T; P49745, TPO_RAT   , T;
DR   P42706, TPO_PIG   , P;
DO   PDOC00644;
//
The NR lines report the diagnostic performance of the pattern; the DR lines give the list of matches.
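A PROSITE pattern such as the PA line of PS00817 is essentially a regular expression written in its own notation. The sketch below converts the common pattern elements into a Python regular expression; it is a simplified illustration that covers only single residues, x, [..] alternatives, {..} exclusions, (n) and (n,m) repeats and the < and > anchors. The function name prosite_to_regex is made up, and the test sequence is synthetic, built by hand to satisfy the pattern.

```python
import re

ELEMENT = re.compile(
    r"(<?)([A-Zx]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?(>?)"
)


def prosite_to_regex(pattern):
    """Translate a PROSITE pattern (PA line content) into a Python regex."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        start, core, low, high, end = m.groups()
        if core == "x":                       # any residue
            core = "."
        elif core.startswith("{"):            # any residue except those listed
            core = "[^" + core[1:-1] + "]"
        repeat = ""
        if low:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(("^" if start else "") + core + repeat + ("$" if end else ""))
    return "".join(parts)


regex = prosite_to_regex("P-x(4)-C-D-x-R-[LIVM](2)-x-[KR]-x(14)-C.")
print(regex)

# Synthetic sequence constructed to contain one occurrence of the pattern.
synthetic = "AA" + "P" + "AAAA" + "CD" + "A" + "R" + "LV" + "A" + "K" + "A" * 14 + "C" + "GG"
match = re.search(regex, synthetic)
print(match.start(), match.end())
```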
PROFILE
Variable regions between the conserved motifs also contain information. A discriminator termed a profile is used to indicate where insertions and deletions (INDELs) are allowed, what types of residue are allowed at what positions, and where the most conserved regions are.
Profiles (weight matrices) provide a sensitive means of detecting distant sequence relationships, where few residues are conserved.
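The core of profile scoring is a position-specific lookup: each column of the matrix assigns a score to each residue, and the scores along a window are summed. The sketch below illustrates only that step, under strong simplifying assumptions: no insertions or deletions, a fixed default score for residues not listed, and a three-column toy matrix whose values are borrowed from the first three match positions of the BTB profile (PS50097). The function name best_window_score and the default of -5 are choices made for this example; real profile searches use the full matrix with gap penalties.

```python
def best_window_score(sequence, profile, default=-5):
    """Slide an ungapped profile along a sequence; return (best score, start)."""
    width = len(profile)
    best = None
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(column.get(residue, default)
                    for column, residue in zip(profile, window))
        if best is None or score > best[0]:
            best = (score, start)
    return best


# One dictionary per profile position (a few residues only, for brevity).
toy_profile = [
    {"C": 28, "S": 0},
    {"D": 53, "E": 15, "N": 21},
    {"V": 24, "I": 16, "L": 7},
]

print(best_window_score("AACDVKK", toy_profile))
print(best_window_score("AACEIKK", toy_profile))
```

The second call shows the point of a weight matrix: a window with conservative substitutions (E for D, I for V) still scores well above the background, whereas an exact pattern would simply fail to match.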
PROSITE (PROFILE): EXAMPLE

PROSITE: PS50097 ID BTB; MATRIX. AC PS50097; DT DEC-1999 (CREATED); DEC-1999 (DATA UPDATE); DEC-1999 (INFO UPDATE). DE BTB domain profile. MA /GENERAL_SPEC: ALPHABET='ABCDEFGHIKLMNPQRSTVWYZ'; LENGTH=67; MA /DISJOINT: DEFINITION=PROTECT; N1=6; N2=62; MA /NORMALIZATION: MODE=1; FUNCTION=LINEAR; R1=.9751; R2=.02068202; TEXT='-LogE'; MA /CUT_OFF: LEVEL=0; SCORE=363; N_SCORE=8.5; MODE=1; TEXT='!'; MA /CUT_OFF: LEVEL=-1; SCORE=267; N_SCORE=6.5; MODE=1; TEXT='?'; MA /DEFAULT: D=-20; I=-20; B1=-50; E1=-50; MI=-105; MD=-105; IM=-105; DM=-105; MM=1; M0=-2; MA /I: B1=0; BI=-105; BD=-105; MA /M: SY='C'; M=-6,-10,28,-14,-9,-15,-20,-14,-19,-15,-17,-14,-8,-19,-14,-15,0,0,-9,-32,-17,-12; MA /M: SY='D'; M=-16,41,-28,53,15,-34,-11,-1,-33,0,-27,-25,21,-11,0,-8,2,-6,-26,-38,-19,7; MA /M: SY='V'; M=2,-23,-8,-28,-24,-1,-24,-25,16,-20,7,6,-20,-25,-23,-20,-10,-4,24,-23,-9,-24; MA /M: SY='T'; M=-2,-13,-18,-19,-13,-7,-24,-19,6,-8,-2,1,-11,-17,-11,-10,-1,10,10,-24,-6,-13; MA /M: SY='L'; M=-11,-30,-22,-33,-24,15,-32,-23,25,-29,35,17,-26,-27,-23,-22,-24,-9,16,-17,3,-24; MA /M: SY='V'; M=0,-11,-18,-13,-10,-12,-20,-13,1,-6,-4,2,-10,-19,-6,-7,-4,-2,8,-25,-9,-9; MA /M: SY='V'; M=1,-25,-3,-29,-25,-2,-26,-26,17,-22,10,7,-23,-25,-23,-22,-11,-3,24,-27,-10,-25; MA /M: SY='D'; M=-6,7,-26,8,7,-25,6,-7,-27,0,-23,-17,8,-13,0,-3,3,-6,-23,-27,-17,3; MA /I: I=-5; MI=0; IM=0; DM=-15; MD=-15; MA /M: SY='G'; M=-6,8,-27,8,-3,-27,22,-7,-30,-8,-26,-19,10,-14,-8,-9,2,-9,-24,-28,-21,-6;
MA /M: SY='K'; M=-7,-4,-23,-4,7,-23,-13,-2,-21,10,-18,-9,-3,-12,7,9,-2,-4,-16,-25,-12,6;

MA /M: SY='E'; M=-8,-6,-21,-8,1,-15,-21,-7,-7,-1,-10,-5,-3,-14,0,-1,-2,-2,-6,-26,-9,-1; MA /M: SY='F'; M=-12,-28,-22,-34,-26,31,-31,-21,18,-26,16,9,-22,-27,-27,-21,-20,-9,14,-6,13,-26;
PROSITE (PROFILE): EXAMPLE (CONT.)

MA /M: SY='T'; M=-3,3,-16,1,-3,-18,-12,-9,-20,-6,-19,-15,2,-7,-6,-6,10,15,-13,-27,-12,-5; MA /M: SY='G'; M=-1,1,-25,2,-9,-26,31,-12,-32,-10,-26,-18,4,-17,-12,-10,1,-12,-24,-25,-22,-11;
MA /M: SY='E'; M=-9,3,-24,4,13,-25,-16,-1,-24,13,-21,-13,3,-9,6,13,-3,-6,-20,-27,-13,8;

MA /M: SY='I'; M=-6,-21,-18,-25,-21,-2,-29,-21,21,-21,14,10,-19,-24,-17,-19,-13,-3,19,-23,-3,-20; MA /M: SY='E'; M=-4,3,-23,3,4,-18,-11,-7,-17,-1,-18,-13,3,-9,-1,-5,1,-4,-14,-25,-11,1; MA /M: SY='I'; M=-8,-25,-23,-27,-20,1,-30,-21,21,-20,18,12,-22,-18,-18,-18,-18,-7,16,-21,-1,-20; MA /M: SY='P'; M=-6,0,-24,2,1,-22,-13,-8,-21,-2,-23,-15,1,14,-4,-7,3,2,-19,-31,-18,-3; MA /M: SY='E'; M=-7,1,-27,4,11,-24,-15,-4,-19,2,-18,-11,0,-1,6,-1,-2,-6,-19,-25,-14,7; MA /I: E1=0; IE=-105; DE=-105; NR /RELEASE=39,87397; NR /TOTAL=46(44); /POSITIVE=45(43); /UNKNOWN=1(1); /FALSE_POS=0(0); NR /FALSE_NEG=0; /PARTIAL=0; CC /TAXO-RANGE=??E?V; /MAX-REPEAT=2; DR O14867, BAC1_HUMAN, T; P97302, BAC1_MOUSE, T; P97303, BAC2_MOUSE, T; DR P41182, BCL6_HUMAN, T; P41183, BCL6_MOUSE, T; Q01295, BRC1_DROME, T; DR Q01296, BRC2_DROME, T; Q01293, BRC3_DROME, T; Q28068, CALI_BOVIN, T; DR Q13939, CALI_HUMAN, T; Q08605, GAGA_DROME, T; Q01820, GCL1_DROME, T; DR P10074, HKR3_HUMAN, T; Q04652, KELC_DROME, T; P42283, LOLL_DROME, T; DR P42284, LOLS_DROME, T; O14682, PI10_HUMAN, T; Q05516, PLZF_HUMAN, T; DR O43791, SPOP_HUMAN, T; P42282, TTKA_DROME, T; P17789, TTKB_DROME, T; DR P21073, VA55_VACCC, T; P24768, VA55_VACCV, T; P21037, VC02_VACCC, T; DR P17371, VC02_VACCV, T; P32228, VC04_SPVKA, T; P32206, VC13_SPVKA, T; DR P21013, VF03_VACCC, T; P24357, VF03_VACCV, T; P22611, VMT8_MYXVL, T; DR P08073, VMT9_MYXVL, T; O43167, Y441_HUMAN, T; Q10225, YAZ4_SCHPO, T; DR P40560, YIA1_YEAST, T; P34324, YKV2_CAEEL, T; P34371, YLJ8_CAEEL, T;
DR P34568, YNV5_CAEEL, T; P41886, YPT9_CAEEL, T; Q09563, YR47_CAEEL, T;

DR Q10017, YSW1_CAEEL, T; Q13105, Z151_HUMAN, T; Q60821, Z151_MOUSE, T;
PRINTS
A compendium of protein motif fingerprints.
Most protein families are characterized by several conserved motifs.
Fingerprint: a set of motif(s) (simple or composite, such as multidomains) that forms the signature of family membership.
True family members exhibit all elements of the fingerprint, while subfamily members may possess only a part.
BLOCKS
The Blocks Database contains multiple alignments of conserved regions in protein families. The database can be searched by e-mail and World Wide Web (WWW) servers (http://blocks.fhcrc.org/help) to classify protein and nucleotide sequences.
Blocks are multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins. Block Searcher, Get Blocks and Block Maker are aids to the detection and verification of protein sequence homology. They compare a protein or DNA sequence to a database of protein blocks, retrieve blocks, and create new blocks, respectively. The BLOCKS Database is based on InterPro entries with sequences from SWISS-PROT and TrEMBL and with cross-references to PROSITE, PRINTS, SMART, PFAM and/or ProDom entries.
BLOCKS DATABASE
The blocks for the Blocks Database are made automatically by looking for the most highly conserved regions in groups of proteins documented in InterPro. The blocks created by Block Maker are created in the same manner as the blocks in the Blocks Database but with sequences provided by the user. Results are reported in a multiple sequence alignment format without calibration and in the standard Block format for searching.
FORMAT OF A BLOCK
ID short_identifier; BLOCK
AC block_number; distance from previous block = (min,max)
DE description
BL xxx motif; width=w; seqs=s; 99.5%=n1; strength=n2
sequence_id (offset) sequence_segment sequence_weight
.
.
//
ID line starts a block entry and contains a short identifier for the group of sequences from which the block was made. If the block was taken from InterPro, it will be the InterPro group ID. The identifier is terminated by a semicolon, and the word "BLOCK" indicates the entry type.
AC line contains the block number, a seven-character group number for the sequences from which the block was made.
DE line contains a description of the group of sequences from which the block was made.
BL line describes the block itself:
xxx = the amino acids in the spaced triplet found by MOTIF upon which the block is based.
w = width of the sequence segments (columns) in the block.
s = number of sequence segments (rows) in the block.
n1 = raw calibration score; the 99.5th percentile score of true negative sequences. Raw search scores are normalized by dividing by this score and multiplying by 1000.
n2 = median normalized score of known true positive sequences as documented in InterPro.
Following the BL line are lines for each sequence with a segment in the block. The segments may be clustered, with clusters separated by blank lines. Each segment line contains a sequence identifier, the offset from the beginning of the sequence to the block in parentheses, the sequence segment, and a weight for the segment. The weights are normalized so that the most distant segment has a weight of 100.
// line terminates a block entry.
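The normalization rule for search scores (divide the raw score by n1, then multiply by 1000) can be written out as a short Python sketch. The BL line below follows the layout given in the format description, but its values (motif ADG, width 20, 12 sequences, n1 = 800, n2 = 1200) are invented for illustration, and the function names are hypothetical.

```python
import re

def parse_bl_line(line):
    """Read the fields of a BL line laid out as in the format description."""
    m = re.match(
        r"BL\s+(\S+)\s+motif;\s*width=(\d+);\s*seqs=(\d+);"
        r"\s*99\.5%=(\d+);\s*strength=(\d+)",
        line,
    )
    if m is None:
        raise ValueError("not a BL line in the expected layout: " + line)
    return {
        "motif": m.group(1),
        "width": int(m.group(2)),   # w: columns in the block
        "seqs": int(m.group(3)),    # s: rows in the block
        "n1": int(m.group(4)),      # raw calibration score (99.5th percentile)
        "n2": int(m.group(5)),      # median normalized true-positive score
    }

def normalize_score(raw_score, n1):
    """Normalize a raw search score: divide by n1, multiply by 1000."""
    return raw_score / n1 * 1000

# Invented BL line; the numbers are not taken from any real block.
bl = parse_bl_line("BL ADG motif; width=20; seqs=12; 99.5%=800; strength=1200")
print(bl)
print(normalize_score(1000, bl["n1"]))
```

With this scaling, a normalized score of 1000 corresponds to the 99.5th percentile of true negative sequences, so a hit can be compared against both that cutoff and the block's strength value n2.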
PFAM
Each family has the following data:
A seed alignment, which is a hand-edited multiple alignment representing the family.
Hidden Markov models (HMMs) derived from the seed alignment, which can be used to find new members of the domain and to realign a set of sequences to the model. One HMM is in ls (global) mode, the other in fs (local) mode.
A full alignment, which is an automatic alignment of all the examples of the domain, using the two HMMs to find and then align the sequences.
Annotation, which contains a brief description of the domain, links to other databases, and some Pfam-specific data recording how the family was constructed.
Pfam comprises two parts: (1) Pfam-A families, which are manually annotated and consist of a representative seed alignment, hidden Markov models (HMMs), and a full alignment of all sequences that score above the curated threshold; and (2) Pfam-B families, automatically generated clusters of similar sequence regions not matched by Pfam-A that often indicate the presence of a domain. Many of the Pfam-A families are arranged into a hierarchical classification, termed clans. You can access and download the Pfam data via the website at http://pfam.sanger.ac.uk
The data and additional features are accessible via four websites:
http://www.sanger.ac.uk/Software/Pfam/
http://pfam.wustl.edu
http://pfam.jouy.inra.fr/
http://Pfam.cgb.ki.se/
EXERCISE 1 TEXT SEARCH

1. Go to EXPASY. Click "UniProt Knowledgebase (Swiss-Prot and TrEMBL)" and then search for human cochlin. Notice that there is a wealth of information about this protein. Furthermore, there are many links to sequence analysis tools (some of which you will learn later) and some other nice features. Note that this is merely a graphical display of the original UniProtKB/SwissProt database entry (which is in text).
2. Try to answer all of the questions below.
   1. In which year was the NMR structure of the LCCL domain determined?
   2. Where is the protein expressed?
   3. Which diseases are associated with the protein?
EXERCISE 2 BLAST SEARCH

1. Go to EXPASY. Click "UniProt Knowledgebase (Swiss-Prot and TrEMBL)" and then BLAST.
2. Copy the following human amino acid sequence.

MSTAVLENPGLGRKLSDFGQETSYIEDNCNQNGAISLIFSLKEEVGALAKVLRLFEENDVNLTHIESRPSRLKKDEYEFFTHLDK RSLPALTNIIKILRHDIGATVHELSRDKKKDTVPWFPRTIQELDRFANQILSYGAELDADHPGFKDPVYRARRKQFADIAYNYRH GQPIPRVEYMEEEKKTWGTVFKTLKSLYKTHACYEYNHIFPLLEKYCGFHEDNIPQLEDVSQFLQTCTGFRLRPVAGLLSSRDF LGGLAFRVFHCTQYIRHGSKPMYTPEPDICHELLGHVPLFSDRSFAQFSQEIGLASLGAPDEYIEKLATIYWFTVEFGLCKQGD SIKAYGAGLLSSFGELQYCLSEKPKLLPLELEKTAIQNYTVTEFQPLYYVAESFNDAKEKVRNFAATIPRPFSVRYDPYTQRIEVL DNTQQLKILADSINSEIGILCSALQKIK
3. Paste the sequence into the query sequence window and adjust the options as necessary. You won't need to specify advanced options, but you should choose a program and a database. For simplicity, use the UniProtKB database, for example.
4. Run the search and identify the protein. Use the link provided to see the UniProtKB/SWISS-PROT report.
EXERCISE 2 BLAST SEARCH (CONT.)

5. Now, try to answer all of the questions below.
   1. What is the SWISS-PROT primary accession number?
   2. What is the common name of the protein?
   3. What is the gene called?
   4. In which year was the crystal structure of the catalytic domain determined? Name the first author.
   5. Does the enzyme require a co-factor to function? If so, what?
   6. Name the most common disease that arises as a result of deficiency of this enzyme.
   7. How many amino acid residues are there in the protein?
   8. What is the molecular weight of the protein?
EXERCISE 3 DOMAIN SEARCH

1. Go to the PROSITE site.
2. Under "Tools for PROSITE" choose ScanProsite.

3. Paste the sequence below into the box and tick the option "Exclude patterns with a high probability of occurrence" (finding very common patterns will not tell you much about your protein).
MWAPRCRRFWSRWEQVAALLLLLLLLGVPPRSLALPPIRYSHAGICPNDMNPNLWVDAQSTCRRECETDQECETYEKCCPNVCGTKSCVAARYMDVKGKKGPVGMPKE ATCDHFMCLQQGSECDIWDGQPVCKCKDRCEKEPSFTCASDGLTYYNRCYMDAEACSKGITLAVVTCRYHFTWPNTSPPPPETTMHPTTASPETPELDMAAPALLNNPV HQSVTMGETVSFCDVVGRPRPEITWEKQLEDRENVVMRPNHVRGNVVVTNIAQLVIYNAQLQDAGIYTCTARNVAGVLRADFPLSVVRGHQAAATSESSPNGTAFPAAEL KPPDSEDCGEEQTRWHFDAQANNCLTFTFGHCHRNLNHFETYEACMLACMSGPLAACSLPALQGPCKAYAPRWAYNSQTGQCQSFVYGGCEGNGNNFESREACEESP FPRGNQRCRACKPRQKLVTSFCRSDFVILGRVSELTEEPDSGRALVTVDEVLKDEKMGLKFLGQEPLEVTLLHVDWACPCPNVTVSEMPLIIMGEVDGGMAMLRPDSFVG ASSARRVRKLREVMHKKTCDVLKEFLGLH
4. Start the scan.

Which are the motifs that are found?
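To get a feel for what a pattern scan does, the sketch below translates a pattern written in PROSITE pattern syntax into a Python regular expression and searches a sequence with it. This is only an illustration under simplifying assumptions, not the ScanProsite implementation: the function names are hypothetical, overlapping matches are not reported, and the toy sequence is invented. The short pattern N-{P}-[ST]-{P} is an example of a pattern with a high probability of occurrence, since four loosely constrained positions match by chance in many proteins.

```python
import re

def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    regex = pattern.rstrip(".")
    regex = regex.replace("-", "")                      # element separators
    regex = regex.replace("x", ".")                     # x = any residue
    regex = regex.replace("{", "[^").replace("}", "]")  # {P} = any residue but P
    regex = regex.replace("(", "{").replace(")", "}")   # (2,4) = repeat count
    regex = regex.replace("<", "^").replace(">", "$")   # sequence start / end
    return regex

def scan(sequence, pattern):
    """Return (start, matched_text) per non-overlapping match, 1-based start."""
    compiled = re.compile(prosite_to_regex(pattern))
    return [(m.start() + 1, m.group()) for m in compiled.finditer(sequence)]

# Invented toy sequence built so that both patterns occur once.
toy = "MKNASAC" + "AA" + "C" + "AAA" + "L" + "A" * 8 + "H" + "AAA" + "H"

print(prosite_to_regex("C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H"))
print(scan(toy, "N-{P}-[ST]-{P}"))
print(scan(toy, "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H"))
```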
EXERCISE 4 DOMAIN SEARCH

1. Go to the Pfam site.
2. Click Search by protein name or sequence.
3. Paste the sequence below into the box and choose Both Global and Fragment Pfam search.
MWAPRCRRFWSRWEQVAALLLLLLLLGVPPRSLALPPIRYSHAGICPNDMNPNLWVDAQSTCRRECETDQECETYEKCCPNVCGTKSCVAARYMDVKGKKGPVGMPKE ATCDHFMCLQQGSECDIWDGQPVCKCKDRCEKEPSFTCASDGLTYYNRCYMDAEACSKGITLAVVTCRYHFTWPNTSPPPPETTMHPTTASPETPELDMAAPALLNNPV HQSVTMGETVSFCDVVGRPRPEITWEKQLEDRENVVMRPNHVRGNVVVTNIAQLVIYNAQLQDAGIYTCTARNVAGVLRADFPLSVVRGHQAAATSESSPNGTAFPAAEL KPPDSEDCGEEQTRWHFDAQANNCLTFTFGHCHRNLNHFETYEACMLACMSGPLAACSLPALQGPCKAYAPRWAYNSQTGQCQSFVYGGCEGNGNNFESREACEESP FPRGNQRCRACKPRQKLVTSFCRSDFVILGRVSELTEEPDSGRALVTVDEVLKDEKMGLKFLGQEPLEVTLLHVDWACPCPNVTVSEMPLIIMGEVDGGMAMLRPDSFVG ASSARRVRKLREVMHKKTCDVLKEFLGLH
4. Search Pfam.
1. Which domains are found?
2. What may be the function of this protein?
