1 197–200
ABSTRACT

The Blocks Database contains multiple alignments of conserved regions in protein families. The database can be searched by e-mail and World Wide Web (WWW) servers (http://blocks.fhcrc.org/help) to classify protein and nucleotide sequences.

INTRODUCTION

Many known proteins can be grouped into families according to functional and sequence similarities. The similarity of the proteins across the sequences in each family is far from uniform. While some regions are clearly conserved, others display little sequence similarity. Often the conserved regions are crucial to the protein’s function, for example enzymatic catalytic sites. Such conserved regions can be used to probe an uncharacterized sequence to indicate its function (1).

The description of a protein family by its conserved regions focuses on the family’s characteristic and distinctive sequence features, thus reducing noise. Databases of conserved features of protein families can be utilized to classify sequences from proteins, cDNAs and genomic DNA (2–5). An example is the Blocks Database (3), which consists of ungapped multiple alignments of short regions, called ‘blocks’ (6). The database was constructed from sequences of protein families using a fully automated method. Searching the Blocks database with a sequence query allows detection of one or more blocks representing a family.

Block determination

A best set of blocks representing each protein group is found automatically by the two-step PROTOMAT system (3). The first step incorporates a motif finder. Currently we use the MOTIF algorithm (7): MOTIF exhaustively evaluates spaced triplets of amino acids that are common to multiple sequences. We have also implemented a Gibbs sampling motif finder that iteratively optimizes random ‘seeds’ for blocks (8). The MOTIF and Gibbs algorithms generate similar block sets for the sequences used in the Blocks Database (9). The second step of the PROTOMAT system combines and refines the original blocks and assembles a best set of blocks that is consistently found in most of the sequences in the group. An example of a best set of blocks for the iron-containing alcohol dehydrogenase family is presented in Figure 1. The two-step procedure is repeated for each protein group and the results are concatenated to make a database of blocks.

Current database version

Version 8.0 of the Blocks Database consists of 2884 blocks based on 770 protein families documented in PROSITE 12.0 (5), which is keyed to Swiss-Prot 29 (10). PROSITE also supplies the documentation for each family. The distributions of number of blocks and number of sequences per family are shown in Figure 2 for BLOCKS 8.0.

Figure 2. Statistics of Blocks Database version 8.0. (A) Number of blocks per family. (B) Number of sequences per family.

Searching the Blocks database

The BLIMPS (Blocks IMProved Searcher) program searches the Blocks Database (9). BLIMPS transforms each block into a position specific scoring matrix (PSSM), sometimes called a profile (11). Each PSSM column corresponds to a block position and contains values based on the amino acid frequencies in each position.

To prevent domination of the PSSM by a large subgroup of related sequences, each sequence segment in a block is weighted using position-based sequence weights (12). To reduce the effect of small sequence samples, the amino acid frequencies in each PSSM position (observed counts) are supplemented with artificial ‘pseudo-counts’. Currently we model pseudo-counts on amino acid substitution probabilities (13; SH and JGH, unpublished results).

BLIMPS compares a query sequence with a block by sliding the PSSM over the sequence (nucleotide sequences are translated in all the frames into six amino acid sequences). For every alignment, each sequence position receives the value of its amino acid in the aligned PSSM column. These scores are summed to obtain the score of the sequence segment. This is repeated with all blocks in the database, and the top scores are saved. In addition to searching a sequence against a database of blocks, BLIMPS can search a block against a database of sequences.

Block calibration

In order to recognize scores representing genuine relationships, it is necessary to know what scores are expected by chance alone. To accomplish this, each block is calibrated by searching it against the Swiss-Prot sequence database. Two scores specific to the block are noted: the score at the 99.5% level of the true negative scores and the median of the true positive scores (14). True positive scores are scores of blocks optimally aligned with their known family members and all other scores are assumed to be true negatives.
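The search and calibration steps described above can be illustrated with a minimal Python sketch. It is not the BLIMPS implementation: the PSSM representation (a list of per-column dictionaries), the function names `best_alignment` and `calibrate`, the zero score for residues missing from a column, and the nearest-rank rule used to pick the 99.5% level are all assumptions made for illustration, and the toy values are invented.

```python
from statistics import median


def best_alignment(pssm, sequence):
    """Slide an ungapped PSSM along a sequence and return (score, start)
    of the highest-scoring alignment, or None if the sequence is too short.

    pssm is a list of columns; each column maps an amino acid letter to
    its value. Residues absent from a column score 0 (an assumption).
    """
    width = len(pssm)
    best = None
    for start in range(len(sequence) - width + 1):
        segment = sequence[start:start + width]
        # Each position receives the value of its residue in the aligned
        # column; the segment score is the sum of those values.
        score = sum(column.get(residue, 0)
                    for column, residue in zip(pssm, segment))
        if best is None or score > best[0]:
            best = (score, start)
    return best


def calibrate(true_negative_scores, true_positive_scores, per_mille=995):
    """Return (score at the 99.5% level of the true negatives,
    median of the true positives).

    The nearest-rank percentile is an assumption; the text does not say
    how the 99.5% level is computed. Integer arithmetic avoids
    floating-point rounding in the rank.
    """
    ranked = sorted(true_negative_scores)
    rank = -(-per_mille * len(ranked) // 1000)  # ceiling division
    level_score = ranked[max(0, rank - 1)]
    return level_score, median(true_positive_scores)


# Toy three-column PSSM with invented values.
toy_pssm = [{"A": 2, "C": -1}, {"G": 3}, {"A": 1, "T": 4}]
toy_hit = best_alignment(toy_pssm, "CAGTAGA")

# Invented score lists standing in for a database search.
toy_calibration = calibrate(range(1, 1001), [1200, 1500, 1800])
```

In this sketch a block would be calibrated by calling `best_alignment` once per database sequence, splitting the resulting scores into known family members and all others, and passing the two lists to `calibrate`.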
Blocks vary in width and conservation and hence their search scores are variable too. In order to compare scores from different blocks the scores need to be normalized. The 99.5% scores are used to standardize the raw search scores. Each raw score is divided by the 99.5% score of the blocks and multiplied by 1000. Therefore, any standardized score above 1000 is a result better than all but the top 0.5% of the true negatives.

The median of standardized scores for true positive alignments is termed ‘strength’. Strong blocks are more effective than weak blocks (standardized scores <1100) at separating true positives from true negatives.

Figure 1. Blocks database format. Each block entry is divided into header and sequences parts. The header part consists of four lines. The ID (identification) line contains the block’s family short description and identifies the entry as a block type. The AC (accession) line gives the block’s accession code and the minimal and maximal distances of the block from the previous block or the protein N′ end. The block accession code is made up of the letters ‘BL’ followed by the family PROSITE accession number and the individual block’s letter code (A for first block, B for second etc.; blocks from single block families have no letter suffix). The DE (description) line contains the long description of the family. The short and long descriptions are taken from PROSITE. The BL (block) line gives the spaced triplet motif of the block, the block’s width, number of sequences, 99.5%-level raw score and strength score (median standardized score of known true positive sequences). Each sequence line contains the sequence Swiss-Prot name, the start position of the segment, the sequence segment and the sequence weight (100 being most distant). Segments that are <80% similar are separated by blank lines. Each block entry ends with a ‘//’ line. The block entries are sorted by their accession codes, each family’s blocks grouped together and ordered. The figure shows the three blocks of the iron-containing alcohol dehydrogenase family from BLOCKS 8.0.

Interpreting a search result

The Blocks database can be searched with a sequence query using the BLIMPS program on our e-mail and WWW servers. As an example, BLOCKS 8.0 was searched with a bacterial dichlorocatechol oxidase (Swiss-Prot TFDF_ALCEU) as a
Nucleic Acids Research, 1996, Vol. 24, No. 1