Biological Databases: DR Z Chikwambi Biotechnology
Databases
Dr Z Chikwambi
Biotechnology
Objectives of lecture:
By the end of the lecture you are expected to understand
and describe the following:
1. What databases are.
2. The features of a database.
3. An ideal database.
4. The purposes of databases.
5. How databases are integrated and
supported by resource portals.
6. Problems associated with
databases.
DATABASES
Database or databank?
Initially
• Databank (in UK)
• Database (in the USA)
Solution
• The abbreviation db
What is a Database?
What are Databases?
• A database is a structured collection of
information.
– A database consists of basic units called records or
entries.
– Each record consists of fields, which hold
predefined data related to the record.
– For example, a protein database would have
protein sequences as records and protein
properties as fields (e.g., name of protein, length,
amino-acid sequence, …)
Database: a « flat file » example
Flat-file database (« flat file, 3 entries »):
Accession number: 1
First Name: Amos
Last Name: Bairoch
Course: Pottery 2000; Pottery 2001;
//
Accession number: 2
First Name: Dan
Last Name: Graur
Course: Pottery 2000; Pottery 2001; Ballet 2001; Ballet 2002;
//
Accession number: 3
First Name: John
Last Name: Travolta
Course: Ballet 2001; Ballet 2002;
//
• Easy to manage: all the entries are visible at the same time!
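The flat-file layout above can be read by a very small program. The following is a minimal sketch, not a standard tool: the function names `parse_flat_file` and `courses` are hypothetical, and the sample text assumes uniform field names and ";" as the only separator in the Course field.

```python
# Minimal sketch: parse a "//"-terminated flat file into records.
# Assumption: every line is "Field name: value" and "//" closes a record.
FLAT_FILE = """\
Accession number: 1
First Name: Amos
Last Name: Bairoch
Course: Pottery 2000; Pottery 2001;
//
Accession number: 2
First Name: Dan
Last Name: Graur
Course: Pottery 2000; Pottery 2001; Ballet 2001; Ballet 2002;
//
Accession number: 3
First Name: John
Last Name: Travolta
Course: Ballet 2001; Ballet 2002;
//
"""


def parse_flat_file(text):
    """Return a list of records; each record maps field name -> value."""
    records, current = [], {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line == "//":  # end-of-record marker
            records.append(current)
            current = {}
            continue
        name, _, value = line.partition(":")
        current[name.strip()] = value.strip()
    return records


def courses(record):
    """Split the multi-valued Course field on ';'."""
    return [c.strip() for c in record["Course"].split(";") if c.strip()]


records = parse_flat_file(FLAT_FILE)
# Find everyone enrolled in any Ballet course.
ballet = [
    r["Last Name"]
    for r in records
    if any(c.startswith("Ballet") for c in courses(r))
]
print(len(records), ballet)
```

Note that answering even a simple question ("who takes Ballet?") means scanning every entry, which is one reason flat files become hard to manage as they grow.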
• A Database
– Can be thought of as a large table, where the rows
represent records and the columns represent fields.
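The "large table" view can be tried out with Python's built-in `sqlite3` module. This is only an illustrative sketch: the table name `protein`, its columns, and the two entries (accessions, names and sequences) are made up for the example and do not come from any real database.

```python
import sqlite3

# In-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Columns are the fields; each inserted row is one record.
conn.execute(
    "CREATE TABLE protein ("
    "accession TEXT PRIMARY KEY, name TEXT, length INTEGER, sequence TEXT)"
)

# Hypothetical entries, invented for illustration only.
entries = [
    ("X0001", "hypothetical protein A", "MKTAYIAK"),
    ("X0002", "hypothetical protein B", "MSLVT"),
]
conn.executemany(
    "INSERT INTO protein VALUES (?, ?, ?, ?)",
    [(acc, name, len(seq), seq) for acc, name, seq in entries],
)

# Query by field: names of all proteins longer than 6 residues.
long_names = [
    name
    for (name,) in conn.execute(
        "SELECT name FROM protein WHERE length > 6 ORDER BY accession"
    )
]
print(long_names)
```

Unlike the flat file, the table can be queried by any field without writing a custom parser.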
• Secondary
– Information derived from primary databases
– Signature sequences, domains, motifs, pathways, diseases
• Integrated
Categories of databases for Life Sciences
• Sequences (DNA, protein)
• Genomics
• Mutation/polymorphism
• Protein domain/family
• Proteomics (2D gel, Mass Spectrometry)
• 3D structure
• Metabolic networks
• Regulatory networks
• Bibliography
• Expression (Microarrays,…)
• Specialized
Databases: Examples
• Integrated systems of databases:
» E.g., NCBI (Protein, Nucleotide, Gene, OMIM, BLAST etc.)
• Protein Databases:
» E.g., ExPASy (SwissProt + TrEMBL)
• Protein structures:
» E.g., PDB and PDBsum
• Pathway databases:
» E.g., KEGG (Kyoto Encyclopedia of Genes and Genomes)
Databases: Protein sequences
• SWISS-PROT: created in 1986 (Amos Bairoch) http://www.expasy.org/sprot/
E.g., PubMed
– Contains entries for more than 11 million abstracts of scientific
publications.
– Enables user to do keyword searches, provides links to a selection of
full articles, and has text mining capabilities, e.g. provides links to
related articles, and GenBank entries, among others.
– Searching PubMed efficiently requires some skill. For example,
searching with the single keyword “interleukin” returns 108,366 matches.
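One way to narrow a broad keyword search is to restrict the term to a field and combine it with other terms. The sketch below only builds the search URL for NCBI's E-utilities `esearch` service and sends no request; the helper name `pubmed_search_url` is hypothetical, and the choice of `[Title]` and `review[Publication Type]` as filters is just an example.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# NCBI E-utilities search endpoint (no request is sent in this sketch).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pubmed_search_url(keyword, field=None, extra_terms=(), retmax=20):
    """Build an esearch URL for PubMed.

    keyword     -- the search word, e.g. "interleukin"
    field       -- optional PubMed field tag, e.g. "Title"
    extra_terms -- further terms joined with the Boolean operator AND
    retmax      -- maximum number of record IDs to return
    """
    term = f"{keyword}[{field}]" if field else keyword
    for extra in extra_terms:
        term += f" AND {extra}"
    query = urlencode({"db": "pubmed", "term": term, "retmax": retmax})
    return f"{ESEARCH}?{query}"


# Broad search: the bare keyword matches in every field.
broad = pubmed_search_url("interleukin")

# Narrower search: keyword in the title only, review articles only.
narrow = pubmed_search_url(
    "interleukin", field="Title", extra_terms=["review[Publication Type]"]
)
print(broad)
print(narrow)
```

The same field-tagged term can be typed directly into the PubMed search box.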
The OMIM (Online Mendelian
Inheritance in Man)
OMIM search tags
Start working: Task
Open: UniProtKB
Search for nitrogen-fixing genes
What is Google Scholar?
What other DATA can we retrieve from the record?
There are approximately
286,730,369,256 sequence
records in the traditional
GenBank divisions as of 2011.
EMBL: The Genome divisions
http://www.ebi.ac.uk/genomes/
http://megasun.bch.umontreal.ca/gobase/
Single Nucleotide Polymorphisms
(SNPs) Database
Within a database, the format needs to be
consistent.
A SwissProt entry, in Fasta format:
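Because the format is consistent, a FASTA entry can be split into its parts mechanically. The sketch below assumes the UniProt-style header layout `>db|accession|entry name description`; the entry itself (accession `P00000`, name `EXMPL_HUMAN`, and the sequence) is a made-up placeholder, and `parse_fasta_entry` is a hypothetical helper name.

```python
# Made-up FASTA entry in the UniProt header style, for illustration only.
FASTA = """\
>sp|P00000|EXMPL_HUMAN Example protein OS=Homo sapiens OX=9606 GN=EXMPL PE=1 SV=1
MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ
APILSRVGDGTQDNLSGAEKAVQVKVKALPDAQ
"""


def parse_fasta_entry(text):
    """Split one FASTA entry into header fields and the joined sequence."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("a FASTA entry must start with a '>' header line")
    header = lines[0][1:]  # drop the leading '>'
    ids, _, description = header.partition(" ")
    database, accession, entry_name = ids.split("|")
    return {
        "database": database,        # "sp" marks a Swiss-Prot entry
        "accession": accession,
        "entry_name": entry_name,
        "description": description,
        "sequence": "".join(lines[1:]),  # sequence lines joined into one string
    }


entry = parse_fasta_entry(FASTA)
print(entry["accession"], entry["entry_name"], len(entry["sequence"]))
```

If entries in the same database used different header layouts, a parser like this would fail, which is why a consistent format matters.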
4. Cross-referenced.
5. Minimum redundancy.
THE NCBI
http://www.ncbi.nlm.nih.gov/
http://www.ncbi.nlm.nih.gov/About/glance/organizational.html
• National Center for Biotechnology Information.
• The heart of bioinformatics.
• It acts as an integrated library and resource portal for
bioinformatics databases and data analysis.
• Houses the following:
– Genome sequencing data
• In GenBank.
– An index of biomedical research articles
• In PubMed Central and PubMed.
– Other information relevant to biotechnology.
– All these databases are available online through the Entrez
search engine.
Database Resource Portals