
© 1996 Oxford University Press. Nucleic Acids Research, 1996, Vol. 24, No. 1, 197–200

The Blocks database—a system for protein classification
Shmuel Pietrokovski, Jorja G. Henikoff and Steven Henikoff1,*
Fred Hutchinson Cancer Research Center, 1124 Columbia Street, Seattle, WA 98104, USA and 1Howard Hughes
Medical Institute, Fred Hutchinson Cancer Research Center, Seattle, WA 98104, USA

Received September 5, 1995; Accepted September 19, 1995

* To whom correspondence should be addressed

ABSTRACT

The Blocks Database contains multiple alignments of conserved regions in
protein families. The database can be searched by e-mail and World Wide
Web (WWW) servers (http://blocks.fhcrc.org/help) to classify protein and
nucleotide sequences.

INTRODUCTION

Many known proteins can be grouped into families according to functional
and sequence similarities. The similarity of the proteins across the
sequences in each family is far from uniform. While some regions are
clearly conserved, others display little sequence similarity. Often the
conserved regions are crucial to the protein’s function, for example
enzymatic catalytic sites. Such conserved regions can be used to probe an
uncharacterized sequence to indicate its function (1).

The description of a protein family by its conserved regions focuses on
the family’s characteristic and distinctive sequence features, thus
reducing noise. Databases of conserved features of protein families can be
utilized to classify sequences from proteins, cDNAs and genomic DNA (2–5).
An example is the Blocks Database (3), which consists of ungapped multiple
alignments of short regions, called ‘blocks’ (6). The database was
constructed from sequences of protein families using a fully automated
method. Searching the Blocks database with a sequence query allows
detection of one or more blocks representing a family.

Block determination

A best set of blocks representing each protein group is found
automatically by the two-step PROTOMAT system (3). The first step
incorporates a motif finder. Currently we use the MOTIF algorithm (7):
MOTIF exhaustively evaluates spaced triplets of amino acids that are
common to multiple sequences. We have also implemented a Gibbs sampling
motif finder that iteratively optimizes random ‘seeds’ for blocks (8). The
MOTIF and Gibbs algorithms generate similar block sets for the sequences
used in the Blocks Database (9). The second step of the PROTOMAT system
combines and refines the original blocks and assembles a best set of
blocks that is consistently found in most of the sequences in the group.
An example of a best set of blocks for the iron-containing alcohol
dehydrogenase family is presented in Figure 1. The two-step procedure is
repeated for each protein group and the results are concatenated to make a
database of blocks.

Current database version

Version 8.0 of the Blocks Database consists of 2884 blocks based on 770
protein families documented in PROSITE 12.0 (5), which is keyed to
Swiss-Prot 29 (10). PROSITE also supplies the documentation for each
family. The distributions of number of blocks and number of sequences per
family are shown in Figure 2 for BLOCKS 8.0.

Figure 2. Statistics of Blocks Database version 8.0. (A) Number of blocks
per family. (B) Number of sequences per family.

Searching the Blocks database

The BLIMPS (Blocks IMProved Searcher) program searches the Blocks Database
(9). BLIMPS transforms each block into a position specific scoring matrix
(PSSM), sometimes called a profile (11). Each PSSM column corresponds to a
block position and contains values based on the amino acid frequencies in
each position.

To prevent domination of the PSSM by a large subgroup of related
sequences, each sequence segment in a block is weighted using
position-based sequence weights (12). To reduce the effect of small
sequence samples, the amino acid frequencies in each PSSM position
(observed counts) are supplemented with artificial ‘pseudo-counts’.
Currently we model pseudo-counts on amino acid substitution probabilities
(13; SH and JGH, unpublished results).

BLIMPS compares a query sequence with a block by sliding the PSSM over the
sequence (nucleotide sequences are translated in all the frames into six
amino acid sequences). For every alignment, each sequence position
receives the value of its amino acid in the aligned PSSM column. These
scores are summed to obtain the score of the sequence segment. This is
repeated with all blocks in the database, and the top scores are saved. In
addition to searching a sequence against a database of blocks, BLIMPS can
search a block against a database of sequences.

Block calibration

In order to recognize scores representing genuine relationships, it is
necessary to know what scores are expected by chance alone. To accomplish
this, each block is calibrated by searching it against the Swiss-Prot
sequence database. Two scores specific to the block are noted: the score
at the 99.5% level of the true negative scores and the median of the true
positive scores (14).
True positive scores are scores of blocks optimally aligned with
their known family members and all other scores are assumed to
be true negatives.
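
The calibration step, together with the standardization rule this paper uses (raw score divided by the block's 99.5% score, times 1000), can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the BLIMPS implementation: the function names are hypothetical, the scores are made up, and the 99.5% level is taken with a nearest-rank percentile, a choice the text does not specify.

```python
import math
from statistics import median


def percentile_nearest_rank(values, pct):
    """Score at the given percentile (nearest-rank rule; an assumption)."""
    if not values:
        raise ValueError("need at least one score")
    ordered = sorted(values)
    # Multiply before dividing so that 99.5 * n / 100 stays exact for round n.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def calibrate_block(true_negative_scores, true_positive_scores):
    """Return (99.5%-level raw score, median raw true-positive score)."""
    level_995 = percentile_nearest_rank(true_negative_scores, 99.5)
    return level_995, median(true_positive_scores)


def standardize(raw_score, level_995):
    """Scale a raw score so the 99.5% true-negative level maps to 1000."""
    return 1000.0 * raw_score / level_995


# Made-up raw scores, for illustration only.
true_negatives = list(range(1, 1001))   # 1000 scores: 1, 2, ..., 1000
true_positives = [1200, 1500, 1100]

level_995, tp_median = calibrate_block(true_negatives, true_positives)
strength = standardize(tp_median, level_995)  # standardized true-positive median
```

With these made-up numbers the 99.5% level is 995, so a raw score of 995 standardizes to exactly 1000 and the standardized true-positive median lies above 1000.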
Blocks vary in width and conservation and hence their search scores are
variable too. In order to compare scores from different blocks the scores
need to be normalized. The 99.5% scores are used to standardize the raw
search scores. Each raw score is divided by the 99.5% score of the block
and multiplied by 1000. Therefore, any standardized score above 1000 is a
result better than all but the top 0.5% of the true negatives.

The median of standardized scores for true positive alignments is termed
‘strength’. Strong blocks are more effective than weak blocks
(standardized scores <1100) at separating true positives from true
negatives.

Figure 1. Blocks database format. Each block entry is divided into header
and sequences parts. The header part consists of four lines. The ID
(identification) line contains the block’s family short description and
identifies the entry as a block type. The AC (accession) line gives the
block’s accession code and the minimal and maximal distances of the block
from the previous block or the protein N-terminus. The block accession
code is made up of the letters ‘BL’ followed by the family PROSITE
accession number and the individual block’s letter code (A for first
block, B for second etc.; blocks from single block families have no letter
suffix). The DE (description) line contains the long description of the
family. The short and long descriptions are taken from PROSITE. The BL
(block) line gives the spaced triplet motif of the block, the block’s
width, number of sequences, 99.5%-level raw score and strength score
(median standardized score of known true positive sequences). Each
sequence line contains the sequence Swiss-Prot name, the start position of
the segment, the sequence segment and the sequence weight (100 being most
distant). Segments that are <80% similar are separated by blank lines.
Each block entry ends with a ‘//’ line. The block entries are sorted by
their accession codes, each family’s blocks grouped together and ordered.
The figure shows the three blocks of the iron-containing alcohol
dehydrogenase family from BLOCKS 8.0.

Interpreting a search result

The Blocks database can be searched with a sequence query using the BLIMPS
program on our e-mail and WWW servers. As an example, BLOCKS 8.0 was
searched with a bacterial dichlorocatechol oxidase (Swiss-Prot TFDF_ALCEU)
as a
protein query sequence. The search output for the first three hits is
shown in Figure 3.

Figure 3. Block Search output, showing the first three hits. See
discussion in text.

The three best alignments in the entire search are with the blocks of the
iron-containing alcohol dehydrogenase family. All three blocks align with
the query sequence in the same order as the sequences represented in the
blocks, that is, A→B→C. This is most easily seen in the block map. This
map also shows that the distances between the three blocks representing
this family fit the distances between the segments of the query that align
with these blocks. For example, the distance between A and B varies from
43 to 73 in known members of this family and is 41 for the query.
Therefore, the query might be a member of this family. Additional evidence
concerning a family relationship comes from examination of the alignment
of each query segment with the closest single member of the family. The
segment aligning with Block A is closest to the segment of ADHE_ECOLI in
the block. The other two segments align best with a different member of
this family (ADH2_ZYMMO).

Intuitively, it seems unlikely that three high scoring blocks would align
with correct distances in between by chance alone. But how unlikely?
First, the alignment with the top ranking Block C (scoring 1171) probably
did not occur by chance, because such a score was seen at the 99.33
percentile level of searches with randomized queries (15). Secondly,
Blocks A and B were detected independently of the C (anchor) block. The
probability of detection of these two additional blocks by chance can be
estimated based on the rank of each block alignment, the sizes of the
query sequence and the database, and the observed distances between blocks
[see (15) for further details]. This estimate is about 3 in ten million
(‘P<2.7e-07 for BL00913A BL00913B in support of BL00913C’). The two
independent measures, percentile and P estimate, can be combined to
provide a confidence level of less than once in 7000 searches. We conclude
that the query is a member of the iron-containing alcohol dehydrogenase
family.

Examining the blocks and the PROSITE documentation of the family we see
that Block C contains histidine residues that are probably important for
binding the ferrous ion(s) required for the enzyme activity (16). Blocks
can be viewed graphically on our WWW server (9) as sequence logos (17).
Logos display the different amino acids in each position, the conservation
of each position and of each amino acid. In the logo for Block C (Fig. 4),
conserved residues are easily seen. Note that the invariant glycine in
position 17 in the block is substituted by alanine in the query sequence;
this illustrates the flexibility of the search system.

The second and third hits illustrate chance alignments. Both hits rank
below the 60th percentile. The second hit is a marginal multiple block
hit. Even though the top ten block hits in a search are reported, one
should be increasingly cautious about block alignments with low
percentiles. Note also that the P estimates for blocks in support become
less meaningful as one goes down the list, and that no P estimates are
reported for single block hits.

Figure 4. A sequence logo. Block BL00913C, shown in Figure 1, was
converted to a PSSM. For every column of the PSSM, each amino acid value
was represented as a letter in the stack. The vertical scale shows the
conservation, in bits, of the amino acids, which are shaded according to
their properties.
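
The bit scale of a sequence logo, as used for Block C in Figure 4, can be illustrated with a short Python sketch. It follows the standard logo definition (17): the information content of a column is log2(20) minus its Shannon entropy, and each letter's height is its frequency times that information. Several things here are assumptions: the function name is hypothetical, the input is taken to be plain amino acid frequencies rather than the PSSM values the server actually draws from, and the small-sample correction is omitted.

```python
import math


def logo_column(freqs, alphabet_size=20):
    """Return {residue: letter height in bits} for one block position.

    freqs maps residues to relative frequencies that sum to 1.
    """
    entropy = -sum(p * math.log2(p) for p in freqs.values() if p > 0)
    information = math.log2(alphabet_size) - entropy  # total stack height
    return {aa: p * information for aa, p in freqs.items() if p > 0}


# An invariant position versus a position split evenly between two residues.
invariant = logo_column({"G": 1.0})
mixed = logo_column({"G": 0.5, "A": 0.5})
```

An invariant position reaches the full log2(20) bits (about 4.32), while an even split between two residues loses exactly one bit of stack height, which is why conserved positions stand out in the logo.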

OTHER USES OF THE BLOCKS DATABASE

The automated construction and extensive data in the Blocks database make
it suitable for uses other than protein classification. The local
alignments of sequence segments provided data for the BLOSUM series of
amino acid substitution matrices (18). These matrices performed very well
in sequence database searches (19,20). The Blocks database was also used
to test and compare different methods for weighting sequences to reduce
redundancy (12).

Many blocks are made up of sequence segments with known functions such as
ligand binding regions, catalytic domains and transmembranal domains (SP
unpublished observations). This can be a resource for research on specific
domains. For example, in studying protein nucleotide binding sites one can
search for block families annotated as having such sites or for blocks
containing the known signature of the sites. The blocks found can help
refine the signature and even reveal unannotated sites.

OTHER SEARCHABLE DATABASES OF PROTEIN FAMILIES

PROSITE is a compilation of specific sites, patterns and profiles found in
protein sequences (5). PRINTS (4), ProDom (21) and SBASE (22) are
databases of protein motifs and domains. PRINTS and SBASE have cross
references to the Blocks database. All these databases find conserved
regions by different methods and may include different groups of proteins.
Therefore, different databases can provide complementary information.

ACCESS

Anonymous FTP

Location  Address                        Directory
USA       ncbi.nlm.nih.gov               /repository/blocks
UK        ftp.ebi.ac.uk                  /pub/databases/blocks
Israel    bioinformatics.weizmann.ac.il  /pub/databases/blocks
Japan     ftp.nig.ac.jp                  /pub/db/blocks

The Blocks database is distributed as a flat text file containing the
individual block entries. The NCBI site also includes the software that we
developed to construct and utilize the Blocks Database, including the
BLIMPS search program. The BlockSearch program developed by R. Fuchs for
fast block searches (23) can be found at the UK site in directories
pub/software/unix and pub/software/vax.

E-mail servers

[email protected] (for searching the Blocks Database)
[email protected] (for making blocks from user supplied protein sequences)

Send the word ‘help’ in the subject line or as the only word in the
message body to obtain help files from both servers. Queries or more
information about the BLOCKS database can be obtained by sending an email
to: [email protected].

WWW

http://blocks.fhcrc.org
This site offers Blocks database searches, block retrievals, block logos,
block construction, help files and related bibliography.

ACKNOWLEDGEMENTS

SP is a Howard Hughes Medical Institute Fellow of the Life Sciences
Research Foundation. This work is supported by a grant from the NIH
(GM29009).

REFERENCES

1 Bork, P., Ouzounis, C. and Sander, C. (1994) Curr. Opin. Struct. Biol., 4, 393–403.
2 Smith, R.F. and Smith, T.F. (1990) Proc. Natl. Acad. Sci. USA, 87, 118–122.
3 Henikoff, S. and Henikoff, J.G. (1991) Nucleic Acids Res., 19, 6565–6572.
4 Attwood, T.K., Beck, M.E., Bleasby, A.J. and Parry-Smith, D.J. (1994) Nucleic Acids Res., 22, 3590–3596.
5 Bairoch, A. and Bucher, P. (1994) Nucleic Acids Res., 22, 3583–3589.
6 Posfai, J., Bhagwat, A.S., Posfai, G. and Roberts, R.J. (1989) Nucleic Acids Res., 17, 2421–2435.
7 Smith, H.O., Annau, T.M. and Chandrasegaran, S. (1990) Proc. Natl. Acad. Sci. USA, 87, 826–830.
8 Lawrence, C.E., Altschul, S.F., Boguski, M.S., Liu, J.S., Neuwald, A.F. and Wootton, J.C. (1993) Science, 262, 208–214.
9 Henikoff, S., Henikoff, J.G., Alford, W.J. and Pietrokovski, S. (1995) Gene, 163, GC 17–26.
10 Bairoch, A. and Boeckmann, B. (1994) Nucleic Acids Res., 22, 3578–3580.
11 Gribskov, M., McLachlan, A.D. and Eisenberg, D. (1987) Proc. Natl. Acad. Sci. USA, 84, 4355–4358.
12 Henikoff, S. and Henikoff, J.G. (1994) J. Mol. Biol., 243, 574–578.
13 Tatusov, R.L., Altschul, S.F. and Koonin, E.V. (1994) Proc. Natl. Acad. Sci. USA, 91, 12091–12095.
14 Henikoff, S. and Henikoff, J.G. (1995) Methods Enzymol., in press.
15 Henikoff, S. and Henikoff, J.G. (1994) Genomics, 19, 97–107.
16 Cabiscol, E., Aguilar, J. and Ros, J. (1994) J. Biol. Chem., 269, 6592–6597.
17 Schneider, T.D. and Stephens, R.M. (1990) Nucleic Acids Res., 18, 6097–6100.
18 Henikoff, S. and Henikoff, J.G. (1992) Proc. Natl. Acad. Sci. USA, 89, 10915–10919.
19 Henikoff, S. and Henikoff, J.G. (1993) Proteins, 17, 49–61.
20 Pearson, W.R. (1995) Protein Sci., 4, 1145–1160.
21 Sonnhammer, E.L. and Kahn, D. (1994) Protein Sci., 3, 482–492.
22 Pongor, S., Hatsagi, Z., Degtyarenko, K., Fabian, P., Skerl, V., Hegyi, H., Murvai, J. and Bevilacqua, V. (1994) Nucleic Acids Res., 22, 3610–3615.
23 Fuchs, R. (1994) CABIOS, 10, 79–80.
