Biological Databases: DR Z Chikwambi Biotechnology


Biological Databases

Dr Z Chikwambi
Biotechnology
Objectives of lecture:
By the end of this lecture you are expected to understand and be able to do the following:

1. Describe databases.
2. Describe the features of a database.
3. Describe an ideal database.
4. Describe the purposes of databases.
5. Describe how databases are integrated and supported by resource portals.
6. Describe problems associated with databases.
DATABASES
Database or databank?

Initially
• Databank (in UK)
• Database (in the USA)

Solution

• The abbreviation db
What is a Database?

A structured collection of data held in computer storage, especially one that incorporates software to make it accessible in a variety of ways; by extension, any large collection of information.

Database management: the organization and manipulation of data in a database.

Database management system (DBMS): a software package that provides all the functions required for database management.

Database system: a database together with a database management system.


What is a database?
• A collection of data
– structured
– searchable (index) -> table of contents
– updated periodically (release) -> new edition
– cross-referenced (hyperlinks) -> links with other db

• Also includes the associated tools (software) necessary for access, updating, information insertion, information deletion…

• Data storage management: flat files, relational databases…
What are Databases?
• A database is a structured collection of information.
– A database consists of basic units called records or entries.
– Each record consists of fields, which hold predefined data related to the record.
– For example, a protein database would have one record per protein, with protein properties as fields (e.g., name of protein, length, amino-acid sequence, …)
Database: a « flat file » example
Flat-file database (« flat file », 3 entries):

Accession number: 1
First Name: Amos
Last Name: Bairoch
Course: Pottery 2000; Pottery 2001;
//
Accession number: 2
First Name: Dan
Last Name: Graur
Course: Pottery 2000; Pottery 2001; Ballet 2001; Ballet 2002;
//
Accession number: 3
First Name: John
Last Name: Travolta
Course: Ballet 2001; Ballet 2002;
//
• Easy to manage: all the entries are visible at the same time!
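A flat file like the one above can be read with a few lines of code. The following is a minimal sketch in Python, assuming every line has the form "Field: value" and every entry ends with a "//" line. The function names parse_flat_file and courses are hypothetical helpers made up for this illustration.

```python
def parse_flat_file(text):
    """Split a flat file into entries; each entry is a dict of field -> value."""
    entries = []
    current = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line == "//":  # end-of-entry marker
            if current:
                entries.append(current)
            current = {}
            continue
        field, _, value = line.partition(":")
        current[field.strip()] = value.strip()
    if current:  # tolerate a missing final "//"
        entries.append(current)
    return entries


def courses(entry):
    """Return the semicolon-separated Course field as a list."""
    return [c.strip() for c in entry["Course"].split(";") if c.strip()]


FLAT_FILE = """\
Accession number: 1
First Name: Amos
Last Name: Bairoch
Course: Pottery 2000; Pottery 2001;
//
Accession number: 2
First Name: Dan
Last Name: Graur
Course: Pottery 2000; Pottery 2001; Ballet 2001; Ballet 2002;
//
Accession number: 3
First Name: John
Last Name: Travolta
Course: Ballet 2001; Ballet 2002;
//
"""

entries = parse_flat_file(FLAT_FILE)
for entry in entries:
    print(entry["Accession number"], entry["Last Name"], courses(entry))
```

Note that the course list has to be split by hand: a flat file stores everything as text, so any structure inside a field is the reader's problem.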
• A Database
– Can be thought of as a large table, where the rows represent records and the columns represent fields.

Accession  Name                  Length  Sequence  Enzyme
QA001      MTGA                  243     MYQWI…    yes
QA002      Ribosomal protein L9  267     MAAPV…    no
QA003      Flagellin             374     GSSIL…    no
QA004      GDPMH                 157     MFLRQ…    yes

(Each column is a field; each row is a record.)

Accession Numbers: Unique identifiers of the database records.
Database: a « relational » example
Relational database (« table file »):

Teacher  Accession number  Education
Amos     1                 Biochemistry
Dan      2                 Genetics
John     3                 Scientology

Course                 Year        Involved teachers
Advanced Pottery       2000; 2001  1; 2
Ballet for Fat People  2001; 2002  2; 3
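The relational example can be tried out with SQLite, which ships with Python. This is only a sketch under stated assumptions: the table and column names (teacher, course, course_teacher) are invented for illustration, the multi-valued "Involved teachers" cell is stored as one row per course/teacher pair, and the years are left out because the slide does not say which teacher taught in which year.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.executescript("""
CREATE TABLE teacher (
    accession INTEGER PRIMARY KEY,
    name      TEXT,
    education TEXT
);
CREATE TABLE course (
    course_id INTEGER PRIMARY KEY,
    title     TEXT
);
-- link table: one row per (course, teacher) pair
CREATE TABLE course_teacher (
    course_id INTEGER REFERENCES course(course_id),
    accession INTEGER REFERENCES teacher(accession)
);
""")

conn.executemany("INSERT INTO teacher VALUES (?, ?, ?)", [
    (1, "Amos", "Biochemistry"),
    (2, "Dan", "Genetics"),
    (3, "John", "Scientology"),
])
conn.executemany("INSERT INTO course VALUES (?, ?)", [
    (1, "Advanced Pottery"),
    (2, "Ballet for Fat People"),
])
conn.executemany("INSERT INTO course_teacher VALUES (?, ?)", [
    (1, 1), (1, 2),  # Advanced Pottery: teachers 1 and 2
    (2, 2), (2, 3),  # Ballet for Fat People: teachers 2 and 3
])


def teachers_of(title):
    """Return the names of the teachers involved in a course."""
    rows = conn.execute(
        """
        SELECT t.name
        FROM teacher AS t
        JOIN course_teacher AS ct ON ct.accession = t.accession
        JOIN course AS c ON c.course_id = ct.course_id
        WHERE c.title = ?
        ORDER BY t.accession
        """,
        (title,),
    ).fetchall()
    return [name for (name,) in rows]


print(teachers_of("Ballet for Fat People"))
```

The accession number plays the role of the key that links the two tables, which is what the cross-references in the slide's second table express.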
Ideal Minimal Content of an Entry
in a Sequence Database
• Sequence
• Accession number (AC)
• Taxonomic data
• References
• Annotation/Curation
• Keywords
• Cross-references
• Documentation
• Source of data
Why biological databases?
• Exponential growth in biological data.

• Data (genomic sequences, 3D structures, 2D gel analysis, MS analysis, microarrays…) are no longer published in a conventional manner, but directly submitted to databases.

• Essential tools for biological research. The only way to publish massive amounts of data without using all the paper in the world.
Primary & Secondary Databases
• Primary
– Sequence information
– Structural information

• Secondary
– Information derived from primary databases
– Signature sequences, domains, motifs, pathways, diseases

• Integrated
Categories of databases for Life Sciences
• Sequences (DNA, protein)
• Genomics
• Mutation/polymorphism
• Protein domain/family
• Proteomics (2D gel, Mass Spectrometry)
• 3D structure
• Metabolic networks
• Regulatory networks
• Bibliography
• Expression (Microarrays,…)
• Specialized
Databases: Examples
• Integrated systems of databases:
» E.g., NCBI (Protein, Nucleotide, Gene, OMIM, BLAST etc.)

• Protein Databases:
» E.g., ExPASy (SwissProt + TrEMBL)

• Protein structures:
» E.g., PDB and PDBsum

• Pathway databases:
» E.g., KEGG (Kyoto Encyclopedia of Genes and Genomes)

Databases: Protein sequences
• SWISS-PROT: created in 1986 (Amos Bairoch) http://www.expasy.org/sprot/

• TrEMBL: created in 1996; complement to SWISS-PROT; derived from EMBL CDS translations (« proteomic » version of EMBL)

• PIR-PSD: Protein Information Resource http://pir.georgetown.edu/

• GenPept: « proteomic » version of GenBank

• Many specialized protein databases for specific families or groups of proteins.

– Examples: AMSDb (antibacterial peptides), GPCRDB (7 TM receptors), IMGT (immune system), YPD (Yeast), etc.
Sequence-based Database Searching
• Basic Assumptions:
• Sequences of homologous genes/proteins diverge over time, even though their structure and/or function changes little.
• Significant sequence similarity is taken as evidence of potential structural/functional similarity or common evolutionary origin.
• Starting from a well-characterised protein, we can infer the function of an unknown sequence at the gene or protein sequence level.
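To make "sequence similarity" concrete, the sketch below computes percent identity between two sequences that are assumed to be already aligned (same length, "-" marking gaps). The two short sequences are toy data invented for illustration, and percent_identity is a hypothetical helper. Real search tools such as BLAST use scoring matrices and significance statistics rather than this simple count.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percentage of identical residues over the ungapped aligned columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            continue  # skip columns where either sequence has a gap
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


# Toy aligned pair (invented): 7 ungapped columns, 6 of them identical.
query = "MFLRQ-AV"
subject = "MFIRQGAV"
print(round(percent_identity(query, subject), 1))
```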
Text-based Database Searching: e.g. PubMed
• Terminology: query, hit, fields, logical/Boolean operator.
• General principles:
1. All main databases provide a convenient tool for text-based searching.
2. We can search for key/query words in specific fields.
3. We can search more than one database at a time.
4. We can impose additional limits, such as modification date.

E.g. PubMed
– Contains entries for more than 11 million abstracts of scientific publications.
– Enables users to do keyword searches, provides links to a selection of full articles, and has text-mining capabilities, e.g. it provides links to related articles and GenBank entries, among others.
– Searching PubMed efficiently requires some skill. For example, searching with the keyword “interleukin” returns 108,366 matches.
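A broad keyword can be narrowed by restricting it to a field and combining terms with a Boolean operator. The sketch below only builds the query string and a request URL for the NCBI E-utilities esearch service; it does not contact the server. The helper names build_term and esearch_url are hypothetical, and the second search word ("receptor") is an arbitrary choice for illustration.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities search service.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_term(field_terms, operator="AND"):
    """Combine (word, field tag) pairs into one Boolean query string."""
    parts = [f"{word}[{tag}]" for word, tag in field_terms]
    return f" {operator} ".join(parts)


def esearch_url(term, db="pubmed", retmax=20):
    """Build (but do not send) an esearch request URL."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term, "retmax": retmax})


# Restrict both words to the title field ("ti") and require both to be present.
term = build_term([("interleukin", "ti"), ("receptor", "ti")])
url = esearch_url(term)
print(term)
print(url)
```

Pasting the printed term into the PubMed search box should behave the same way as the URL, since both express the same fielded Boolean query.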
The OMIM (Online Mendelian Inheritance in Man)

– Online Mendelian Inheritance in Man (OMIM) is a database that catalogues all the known diseases with a genetic component and, when possible, links them to the relevant genes in the human genome. It also provides references for further research and tools for genomic analysis of a catalogued gene.
• Genes and genetic disorders
• Edited by a team at Johns Hopkins
• Updated daily
Searching OMIM
• Search Fields
– Name of trait, e.g., hypertension
– Cytogenetic location, e.g., 1p31.6
– Inheritance, e.g., autosomal dominant
– Gene, e.g., coagulation factor VIII

OMIM search tags

Field              Tag
All Fields         [ALL]
Allelic Variant    [AV] or [VAR]
Chromosome         [CH] or [CHR]
Clinical Synopsis  [CS] or [CLIN]
Gene Map           [GM] or [MAP]
Gene Name          [GN] or [GENE]
Reference          [RE] or [REF]
Start working: Task

Open: UniProtKB
Search for nitrogen-fixing genes

How many genes are there?
How many have been reviewed?
Categorize the genes into types/functions.
What is Google Scholar?

Enables you to search specifically for scholarly literature, including peer-reviewed papers, theses, books, preprints, abstracts and technical reports from all broad areas of research.
What other DATA can we retrieve from the record?

There are approximately 286,730,369,256 sequence records in the traditional GenBank divisions as of 2011.
EMBL: The Genome divisions
http://www.ebi.ac.uk/genomes/

Schizosaccharomyces pombe strain 972h- complete genome

GOBASE: resource for organelle genomes
http://megasun.bch.umontreal.ca/gobase/
Single Nucleotide Polymorphisms (SNPs) Database
Within a database, the format needs to be consistent.
A SwissProt entry, in FASTA format:

>sp|P01588|EPO_HUMAN ERYTHROPOIETIN PRECURSOR - Homo sapiens (Human).
MGVHECPAWLWLLLSLLSLPLGLPVLGAPPRLICDSRVLERYLLEAKEAENITTGCAEHC
SLNENITVPDTKVNFYAWKRMEVGQQAVEVWQGLALLSEAVLRGQALLVNSSQPWEPLQL
HVDKAVSGLRSLTTLLRALGAQKEAISPPDAASAAPLRTITADTFRKLFRVYSNFLRGKL
KLYTGEACRTGDR
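Because the format is consistent, an entry like this one can be read by a short program. The sketch below is a minimal FASTA reader in Python, assuming a ">" header line followed by one or more sequence lines, and a SwissProt-style header of the form db|accession|entry_name followed by a description. The helper names parse_fasta and split_swissprot_header are made up for this illustration.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def split_swissprot_header(header):
    """Split 'db|accession|entry_name description' into its parts."""
    identifier, _, description = header.partition(" ")
    db, accession, entry_name = identifier.split("|")
    return {"db": db, "accession": accession,
            "entry_name": entry_name, "description": description}


FASTA_TEXT = """\
>sp|P01588|EPO_HUMAN ERYTHROPOIETIN PRECURSOR - Homo sapiens (Human).
MGVHECPAWLWLLLSLLSLPLGLPVLGAPPRLICDSRVLERYLLEAKEAENITTGCAEHC
SLNENITVPDTKVNFYAWKRMEVGQQAVEVWQGLALLSEAVLRGQALLVNSSQPWEPLQL
HVDKAVSGLRSLTTLLRALGAQKEAISPPDAASAAPLRTITADTFRKLFRVYSNFLRGKL
KLYTGEACRTGDR
"""

header, sequence = parse_fasta(FASTA_TEXT)[0]
info = split_swissprot_header(header)
print(info["accession"], info["entry_name"], len(sequence))
```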
Why Databases?

• The purpose of databases is not merely to collect and organize data, but to allow intelligent data retrieval.
• A query is a method to retrieve information from the database.
• The organization of each record into predetermined fields allows us to run queries on fields.
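As a rough illustration of querying on fields, the sketch below stores the four example protein records from the earlier table as Python dictionaries and selects records by a condition on their fields. The function name query and the lower-case field names are assumptions made for this example, not part of any real database interface.

```python
# The example records from the protein table, one dict per record.
RECORDS = [
    {"accession": "QA001", "name": "MTGA", "length": 243, "enzyme": True},
    {"accession": "QA002", "name": "Ribosomal protein L9", "length": 267, "enzyme": False},
    {"accession": "QA003", "name": "Flagellin", "length": 374, "enzyme": False},
    {"accession": "QA004", "name": "GDPMH", "length": 157, "enzyme": True},
]


def query(records, predicate):
    """Return the accession numbers of all records satisfying the predicate."""
    return [record["accession"] for record in records if predicate(record)]


# All enzymes
print(query(RECORDS, lambda r: r["enzyme"]))
# Enzymes shorter than 200 residues (two field conditions combined with AND)
print(query(RECORDS, lambda r: r["enzyme"] and r["length"] < 200))
```

Without predetermined fields, the same questions could only be answered by free-text searching, which cannot tell a length of 157 from any other occurrence of "157".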
Databases on the Internet
• Biological databases often have web interfaces, which allow users to send queries to the databases.
• Some databases can be accessed by different web servers, each offering a different interface.

[Diagram: User -> (request) -> Web server -> (query) -> Database server; Database server -> (result) -> Web server -> (web page) -> User]


Database Download

• Nearly all biological databases are available for download
» As simple text (flat) files.

• A local version of the database allows one greater freedom in processing the data.
• Processing data in files requires some computer-programming skills.
» Perl is a scripting language commonly used for extraction and analysis of data from files.
The “Perfect” Database
1. Comprehensive, but easy to search.

2. Annotated, but not “too annotated”.

3. A simple, easy to understand structure.

4. Cross-referenced.

5. Minimum redundancy.

6. Easy retrieval of data.


Problems with General Sequence Databases
• Databases that strive for encyclopedic completeness are now so huge as to be close to unmanageable.

1. Redundancy (nothing is ever removed, whether correct or wrong).

2. Inadequate sequence quality.
– old sequences
– partially annotated sequences
– inconsistent & outdated annotations (submitter annotation)
– erroneous sequences, low-quality sequences
– contaminations
– anonymous sequences

– Annotation and management of databases are therefore very important.
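One simple form of redundancy is the same sequence stored under several accession numbers. The sketch below groups entries with identical sequences, keeping the first accession as the representative. The accession numbers and sequences are invented toy data, collapse_identical is a hypothetical helper, and only exact duplicates are treated as redundant; curated databases use far more careful rules.

```python
from collections import defaultdict


def collapse_identical(records):
    """Group (accession, sequence) pairs that share exactly the same sequence.

    Returns a list of (representative, other_accessions, sequence) tuples,
    in order of first appearance.
    """
    groups = defaultdict(list)
    for accession, sequence in records:
        groups[sequence.upper()].append(accession)  # ignore letter case
    return [(accs[0], accs[1:], seq) for seq, accs in groups.items()]


# Toy entries (invented): X001 and X003 carry the same sequence.
ENTRIES = [
    ("X001", "MYQWIAG"),
    ("X002", "MAAPVKL"),
    ("X003", "myqwiag"),
]

for representative, duplicates, seq in collapse_identical(ENTRIES):
    print(representative, duplicates, seq)
```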
THE NCBI
http://www.ncbi.nlm.nih.gov/
http://www.ncbi.nlm.nih.gov/About/glance/organizational.html
• National Center for Biotechnology Information.
• The heart of bioinformatics.
• It is like an integrated library and resource portal for bioinformatics databases and data analysis.
• Houses the following:
– Genome sequencing data
• In GenBank.
– An index of biomedical research articles
• In PubMed Central and PubMed.
– Other information relevant to biotechnology.
– All these databases are available online through the Entrez search engine.
Database Resource Portals

• ExPASy is a bioinformatics resource portal.
• Operated by the Swiss Institute of Bioinformatics (SIB).
– Particularly the SIB Web Team.
• It is an extensible and integrative portal accessing many scientific resources, databases and software tools.
Submitting Information to Databases
• You can submit your sequences to respective
databases through their sequence submission
interfaces:
– DNA sequence data
– Protein sequence data
– Proteomics data
– Microarray data
– Protein 3D structures
– Etc.
END
