Entrez


Databases: 1

• An organized set of data held in a computer, especially one that is accessible in various ways.

Why Databases
• A major goal in developing databases is to provide efficient and user-friendly access to the stored data.

[Diagram labels: Book Shelf, Book; Data, Databases (software for data storage), Retrieval System]
Retrieval Systems of NCBI: 2

Entrez
• Searches distinct health sciences databases.

SRS (Sequence Retrieval System)
• Information indexing and retrieval system designed for libraries with flat file format.
• (EMBL nucleotide)

GetEntry
• DDBJ flat file search system by accession number.
ENTREZ: 4

• First distributed on CD-ROM by NCBI in 1991.

• Text-based search and retrieval system of NCBI for databases like PubMed, Protein Structures, Complete Genomes, Taxonomy, and many others.

• Key feature is that it can integrate information, which comes from cross-referencing between NCBI databases based on preexisting and logical relationships between individual entries.
Continue… 5

• This is highly convenient: users do not have to visit multiple databases located in disparate places.

For Example:
• In a nucleotide sequence page, one may find cross-referencing links to the translated protein sequence, genome mapping data, related PubMed literature, and protein structures if available.
Entrez basic retrieval links and tools: 6
BLAST: 7

• The Basic Local Alignment Search Tool (BLAST) compares primary biological sequence information, such as the amino-acid sequences of proteins or the nucleotides of DNA and/or RNA sequences.
VAST: 8

• The Vector Alignment Search Tool (VAST) is a computer algorithm developed at NCBI and used to identify similar protein 3-dimensional structures.
Phylogeny tool: 11
• Generates a common tree for a set of taxa.
• How to retrieve data regarding phylogenetic relationships via Entrez at NCBI:
1) Search Google for "NCBI Tree Viewer".
Text-based database searching: 15

Boolean Search
• Provides a way of generating precise queries that produce well-defined sets of results. AND, OR, and NOT are the Boolean operators used.

• Broadening the search: if a search produces no useful entries, change or remove terms.

• Narrowing the search: if a search produces too many entries, change or add terms.
Text-based searching: 16

Boolean operators
• Used to perform complex queries in a database.
• They join a series of keywords using the logical terms AND, OR, and NOT to indicate relationships between the keywords used in a search.

AND   Results must contain both terms.
OR    Results contain either term or both.
NOT   Excludes results containing the term that follows NOT.

Example: promoters OR response elements NOT human AND mammals.
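
The operator semantics can be sketched with Python sets. This is only an illustration: the record IDs and the toy index are invented, and the operators are assumed to be applied left to right, as PubMed documents for queries without parentheses.

```python
# Toy index: each term maps to the set of record IDs that mention it.
# Terms come from the example query; the IDs are invented for illustration.
index = {
    "promoters": {1, 2, 3},
    "response elements": {3, 4},
    "human": {2, 4, 5},
    "mammals": {1, 2, 3, 4},
}

def op_and(a, b):
    """AND: records present in both sets."""
    return a & b

def op_or(a, b):
    """OR: records present in either set or both."""
    return a | b

def op_not(a, b):
    """NOT: records in a that do not appear in b."""
    return a - b

# promoters OR response elements NOT human AND mammals,
# evaluated left to right (an assumption, see the text above).
step1 = op_or(index["promoters"], index["response elements"])
step2 = op_not(step1, index["human"])
result = op_and(step2, index["mammals"])
print(sorted(result))  # [1, 3]
```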
Continue… 17

Parentheses
• Used to force a particular order of evaluation, similar to mathematical statements.
• Enclosing individual concepts in parentheses changes the default order of evaluation.
• Items contained within parentheses are executed first.

Example:
• gene AND (acid OR base).
If multiple terms are entered without operators, they are automatically ANDed together.
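
A small sketch of why grouping matters, again using invented record IDs and assuming left-to-right evaluation when no parentheses are given:

```python
# Invented record-ID sets for the three terms in the example.
gene = {1, 2, 3}
acid = {2, 5}
base = {3, 6}

# gene AND acid OR base, applied left to right: (gene AND acid) OR base
left_to_right = (gene & acid) | base

# gene AND (acid OR base): the parenthesized part is evaluated first
grouped = gene & (acid | base)

print(sorted(left_to_right))  # [2, 3, 6]
print(sorted(grouped))        # [2, 3]
```

Record 6 mentions "base" but not "gene", so it appears only in the ungrouped result.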
Proximity searching: 18

• Finds terms that appear within a specified number of words of each other, in any order. The closer the terms are, the higher the document appears in the results list.

• Uses the NEAR, ADJ, and SAME operators.

• To search for a multiword term or phrase, place quotes around it. A protein name, gene name, or gene symbol can also be used directly.
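
The idea behind a proximity match can be sketched as follows. The function name `within_distance` and the sample sentence are hypothetical, and this is not how any particular database implements NEAR or ADJ; it only shows the "within N words, in any order" test.

```python
def within_distance(text, term_a, term_b, max_words):
    """Return True if term_a and term_b occur within max_words words
    of each other in text, in any order (single-word terms only)."""
    words = [w.strip(".,;:()").lower() for w in text.split()]
    pos_a = [i for i, w in enumerate(words) if w == term_a.lower()]
    pos_b = [i for i, w in enumerate(words) if w == term_b.lower()]
    return any(abs(i - j) <= max_words for i in pos_a for j in pos_b)

sentence = "The promoter region lies upstream of the human insulin gene."
print(within_distance(sentence, "promoter", "gene", 3))   # False (8 words apart)
print(within_distance(sentence, "gene", "promoter", 10))  # True
```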
Continued… 19

• To search for authors, their names must be entered in a particular format: {Last name} {initials}
• No punctuation.
• Only author fields will be searched in the database.
• Searches can be further limited by adding [AUTHOR] to the query string.
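
A minimal sketch of turning a name into that format. The helper `author_query` is hypothetical, and stripping all punctuation from the last name is an assumption based on the "no punctuation" rule above.

```python
import re

def author_query(last_name, given_names):
    """Build '{Last name} {initials}[AUTHOR]' with no punctuation."""
    # One initial per given name; split on spaces, periods, and hyphens.
    parts = [p for p in re.split(r"[\s.\-]+", given_names) if p]
    initials = "".join(p[0].upper() for p in parts)
    # Drop punctuation from the last name (assumption, see lead-in).
    last = re.sub(r"[^\w\s]", "", last_name).strip()
    return f"{last} {initials}[AUTHOR]"

print(author_query("Smith", "John A."))  # Smith JA[AUTHOR]
```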
Continue… 20

Accession numbers or sequence identification numbers

• Can be searched, but specific formats are required (direct retrieval of the full sequence record), e.g.
CAA79696
NP_778203
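
As a rough illustration of "specific formats", the sketch below recognizes only the two shapes shown above: three letters plus five digits, and a two-letter prefix with an underscore. The patterns, labels, and the `classify` helper are assumptions for illustration; real accession formats are more varied.

```python
import re

# Patterns cover only the two example shapes; an optional ".version" is allowed.
PATTERNS = {
    "GenBank-style protein accession": re.compile(r"^[A-Z]{3}\d{5}(\.\d+)?$"),
    "RefSeq-style accession": re.compile(r"^[A-Z]{2}_\d+(\.\d+)?$"),
}

def classify(term):
    """Return a label for the first pattern that matches term."""
    for label, pattern in PATTERNS.items():
        if pattern.match(term):
            return label
    return "not recognized"

print(classify("CAA79696"))   # GenBank-style protein accession
print(classify("NP_778203"))  # RefSeq-style accession
```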

• To find a match to an exact phrase, enclose it in quotation marks, e.g.
"contactin associated protein"
"duchenne muscular dystrophy"
Truncation: 21

Wildcard

• The character * appended to a search term makes the search less specific.
• It finds all terms that begin with a given string of text.
Example:
To look for all authors whose last name begins with Zav, search using Zav*.
• Only end-truncation is supported.
• Wildcards will only consider the first 150 matches to the string.
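
End-truncation can be pictured as prefix matching against an index of terms, capped at 150 matches. The function `expand_wildcard` and the author list are invented for illustration; the actual index lookup is not described in the slides.

```python
def expand_wildcard(term, index_terms, limit=150):
    """Expand an end-truncated term such as 'Zav*' against index_terms,
    keeping at most `limit` matches in alphabetical order."""
    if not term.endswith("*") or "*" in term[:-1]:
        raise ValueError("only end-truncation is supported, e.g. 'Zav*'")
    prefix = term[:-1].lower()
    matches = sorted(t for t in index_terms if t.lower().startswith(prefix))
    return matches[:limit]

authors = ["Zavala", "Zavaleta", "Zavodny", "Zhang", "Azavedo"]
print(expand_wildcard("Zav*", authors))  # ['Zavala', 'Zavaleta', 'Zavodny']
```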
Continue… 22

• Molecular weights can be searched in the following formats:

1) {weight}[Molecular Weight]
2) {weight minimum}:{weight maximum}[Molecular Weight]

• Other searches:
1) Accession numbers [ACCN]
2) Sequence length [SLEN]
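
The field formats above can be assembled with two small helpers. `field_term` and `range_term` are hypothetical names, the numeric values are arbitrary, and using the min:max range form with [SLEN] is an assumption carried over from the molecular weight format.

```python
def field_term(value, field):
    """Format a single value with a field tag, e.g. 50000[Molecular Weight]."""
    return f"{value}[{field}]"

def range_term(minimum, maximum, field):
    """Format a min:max range with a field tag."""
    if minimum > maximum:
        raise ValueError("minimum must not exceed maximum")
    return f"{minimum}:{maximum}[{field}]"

print(field_term(50000, "Molecular Weight"))         # 50000[Molecular Weight]
print(range_term(40000, 60000, "Molecular Weight"))  # 40000:60000[Molecular Weight]
print(field_term("CAA79696", "ACCN"))                # CAA79696[ACCN]
print(range_term(100, 500, "SLEN"))                  # 100:500[SLEN]
```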
23

Practical Example
Text-based Database Searching: 24

How to?
1) Basic
2) Advanced Method 1 (do a separate search for each term or phrase and combine searches using History)
3) Advanced Method 2 (stack your query one step at a time, i.e. iterative searching, using Preview/Index)
4) Complex Boolean Query (used often)
https://www.ncbi.nlm.nih.gov/Class/MLACourse/Modules/Entrez/index.html
Basic: 25

• I need to retrieve human nucleotide sequences associated with colon cancer.
• Just enter search terms without specifying search fields, other limits, or Boolean operators.

All databases
Advanced Method 1: 27

• Do a separate search for each term or phrase and combine searches using History.

• Limits and History.
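
Combining earlier searches by number can be modelled as below. The class `SearchHistory` is a hypothetical stand-in, not NCBI code; it assumes History entries are referenced as #1, #2, and so on, and it simply expands those references into the stored query text.

```python
import re

class SearchHistory:
    """Toy model of a search history with numbered entries."""

    def __init__(self):
        self._queries = []

    def add(self, query):
        """Store a query and return its history number (1-based)."""
        self._queries.append(query)
        return len(self._queries)

    def expand(self, expression):
        """Replace each #N in expression with the stored query N."""
        def lookup(match):
            return f"({self._queries[int(match.group(1)) - 1]})"
        return re.sub(r"#(\d+)", lookup, expression)

history = SearchHistory()
n1 = history.add("colon cancer")
n2 = history.add("human[Organism]")
combined = history.expand(f"#{n1} AND #{n2}")
print(combined)  # (colon cancer) AND (human[Organism])
```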
Advanced Method 2: 30

Stacks the query one step at a time (iterative searching) using Preview/Index.

Title: Colon cancer
Organism: Humans
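
Iterative stacking amounts to adding one field-tagged term per step. The helper `stack` is hypothetical, and the [Title] and [Organism] tags are taken from the field names on this slide.

```python
def stack(query, term, field, operator="AND"):
    """Add one field-tagged term to an existing query string."""
    tagged = f"{term}[{field}]"
    if not query:
        return tagged
    return f"({query}) {operator} {tagged}"

query = ""
query = stack(query, "colon cancer", "Title")
query = stack(query, "humans", "Organism")
print(query)  # (colon cancer[Title]) AND humans[Organism]
```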
Complex Boolean query: 31

Boolean Operators

Developed on 31 Oct 2007


Search builder: 33
Shortcut method: 34

• This method restricts the search to specific subsets of records, such as those from a specific organism, molecule type, or source database, i.e. Facets/Filters/Limits.
Facets/ Filters/ Limits: 35
36

Non-redundancy
How can I download sequence records to a file on my computer?
37

• Click the Send to menu that appears at the upper right of document summaries or record views.
• Select the File radio button.
• Then choose the desired format from the pull-down list.
• Click the Create File button to save the records.
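
For scripted downloads, a comparable request can be made through NCBI's E-utilities `efetch` endpoint instead of the Send to menu. The sketch below only builds the request URL and does not contact NCBI; the helper `efetch_url`, the choice of the protein database, and the FASTA format are assumptions for illustration.

```python
from urllib.parse import urlencode

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def efetch_url(ids, db="protein", rettype="fasta", retmode="text"):
    """Build an efetch URL for the given record identifiers."""
    params = {
        "db": db,
        "id": ",".join(ids),   # comma-separated list of identifiers
        "rettype": rettype,    # record format, e.g. fasta
        "retmode": retmode,    # plain text output
    }
    return f"{EFETCH}?{urlencode(params)}"

url = efetch_url(["CAA79696", "NP_778203"])
print(url)
```

The printed URL can be opened in a browser or passed to any HTTP client to save the records to a file.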
Facets/Filters: 38

1) Organism
2) Molecule type (limits the results to a particular molecule type)
3) Source database (limits the results to a particular database)
Molecule types: 39

cRNA (anti-sense RNA)
• A short section of a gene or other DNA element that is used to hybridize to a cDNA.

Non-coding RNA (ncRNA)
• RNA molecule that is not translated into a protein.
• The DNA sequence from which a functional non-coding RNA is transcribed is often called an RNA gene.
• Abundant and functionally important types of non-coding RNAs include transfer RNAs (tRNAs) and ribosomal RNAs (rRNAs), as well as small RNAs.
Source databases: 42

INSDC

• The International Nucleotide Sequence Database Collaboration (INSDC) is a long-standing foundational initiative that operates between DDBJ, EMBL-EBI and NCBI.
• It covers the spectrum of data from raw reads, through alignments and assemblies, to functional annotation, enriched with contextual information relating to samples and experimental configurations.
Continue… 43

Third Party Annotation (TPA)

• It is a sequence derived or assembled from primary sequence data currently found in the DDBJ/EMBL/GenBank International Nucleotide Sequence Database.

• It can be genomic or mRNA sequence and can be assembled or derived from primary genomic and/or mRNA sequences.
How do I change the format, number, or sorting order of records displayed? 44
Formats: 45

Abstract Syntax Notation One (ASN.1)

• NCBI uses ASN.1 for the storage and retrieval of data such as nucleotide and protein sequences, structures, genomes, PubMed records, and more.

GenInfo Identifier (GI numbers)

• A GI number is a simple series of digits assigned consecutively to each sequence record processed by NCBI. Each time a sequence record is changed, it is assigned a new GI number.
Additional filters: 46
