

Published online 05 October 2008
Nucleic Acids Research, 2009, Vol. 37, Database issue D233–D238
doi:10.1093/nar/gkn663

The Carbohydrate-Active EnZymes database (CAZy): an expert resource for Glycogenomics
Brandi L. Cantarel, Pedro M. Coutinho, Corinne Rancurel, Thomas Bernard,
Vincent Lombard and Bernard Henrissat*
Architecture et Fonction des Macromolécules Biologiques, UMR6098, CNRS, Universités Aix-Marseille I & II,
163 Avenue de Luminy, 13288 Marseille, France

Downloaded from https://academic.oup.com/nar/article/37/suppl_1/D233/1003505 by guest on 01 December 2023


Received September 15, 2008; Accepted September 19, 2008

*To whom correspondence should be addressed. Tel: +33 4 91 82 55 87; Fax: +33 491 26 67 20; Email: [email protected]
Correspondence may also be addressed to Pedro M. Coutinho. Email: [email protected]

© 2008 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

ABSTRACT

The Carbohydrate-Active Enzyme (CAZy) database is a knowledge-based resource specialized in the enzymes that build and break down complex carbohydrates and glycoconjugates. As of September 2008, the database describes the present knowledge on 113 glycoside hydrolase, 91 glycosyltransferase, 19 polysaccharide lyase, 15 carbohydrate esterase and 52 carbohydrate-binding module families. These families are created based on experimentally characterized proteins and are populated by sequences from public databases with significant similarity. Protein biochemical information is continuously curated based on the available literature and structural information. Over 6400 proteins have assigned EC numbers and 700 proteins have a PDB structure. The classification (i) reflects the structural features of these enzymes better than their sole substrate specificity, (ii) helps to reveal the evolutionary relationships between these enzymes and (iii) provides a convenient framework to understand mechanistic properties. This resource has been available for over 10 years to the scientific community, contributing to information dissemination and providing a transversal nomenclature to glycobiologists. More recently, this resource has been used to improve the quality of functional predictions of a number of genome projects by providing expert annotation. The CAZy resource resides at URL: http://www.cazy.org/.

INTRODUCTION

Due to the extreme variety of monosaccharide structures, to the variety of intersugar linkages and to the fact that virtually all types of molecules can be glycosylated (from sugars themselves, to proteins, lipids, nucleic acids, antibiotics, etc.), the glycoconjugates, oligo- and polysaccharides on which a large variety of enzymes act probably constitute one of the most structurally diverse sets of substrates on Earth. Collectively designated as Carbohydrate-Active enZymes (CAZymes), these enzymes build and break down complex carbohydrates and glycoconjugates for a large body of biological roles (collectively studied under the term Glycobiology). Therefore, CAZymes usually have to perform their function with high specificity. Because carbohydrate diversity (1) exceeds by far the number of protein folds, CAZymes have evolved from a limited number of progenitors by acquiring novel specificities at substrate and product level. Such a dizzying array of substrates and enzymes makes CAZymes a particularly challenging subject for experimental characterization and for functional annotation in genomes.

Nearly 20 years ago, the first foundation for a family classification of CAZymes was seen in an effort that classified cellulases into several distinct families based on amino-acid sequence similarity (2). Soon after, the family classification system based on protein sequence and structure similarities was extended to all known glycoside hydrolases (2–4), and subsequently extended to all CAZymes involved in the synthesis, degradation and modification of glycoconjugates. The classification of CAZymes has been made available on the web since September 1998. Because they are based on amino-acid sequence similarities, these classifications correlate with enzyme mechanisms and protein fold more than enzyme specificity. Consequently, these families are used to conservatively classify proteins of uncharacterized function whose only known feature is sequence similarity to an experimentally characterized enzyme, avoiding overprediction of enzyme activities.

At present, CAZy covers approximately 300 protein families in the following classes of enzyme activities:

(1) Glycoside hydrolases (GHs), including glycosidases and transglycosidases (3–5). These enzymes constitute 113 protein families that are responsible for the hydrolysis and/or transglycosylation of glycosidic bonds. GH-coding genes are abundant and present in the vast majority of genomes, and GHs correspond to almost half (presently about 47%) of the enzymes classified in CAZy. Because of their widespread importance for biotechnological and biomedical applications, GHs constitute so far the best biochemically characterized set of enzymes present in the CAZy database.
(2) Glycosyltransferases (GTs). These are the enzymes responsible for the biosynthesis of glycosidic bonds from phospho-activated sugar donors (6–8). They form over 90 sequence-based families, are present in virtually every organism and represent about 41% of CAZy at present.
(3) Polysaccharide lyases (PLs) cleave the glycosidic bonds of uronic acid-containing polysaccharides by a β-elimination mechanism (6). They are presently found in 19 families in CAZy (7), corresponding to only about 1.5% of CAZy content. Many PLs have biotechnological and biomedical applications and, despite their small overall number, they are among the CAZymes with the highest proportion of biochemically characterized examples present in the database.
(4) Carbohydrate esterases (CEs). They remove ester-based modifications present in mono-, oligo- and polysaccharides and thereby facilitate the action of GHs on complex polysaccharides. Presently described in 15 families (7), CEs represent roughly 5% of CAZy entries. As the specificity barrier between carbohydrate esterases and other esterase activities is low, it is likely that the sequence-based classification incorporates some enzymes that may act on non-carbohydrate esters.
(5) Carbohydrate-binding modules (CBMs). These are autonomously folding and functioning protein fragments that have no enzymatic activity per se but are known to potentiate many of the enzyme activities described above by targeting them to, and promoting a prolonged interaction with, the substrate. CBMs are most often associated with the other carbohydrate-active enzyme catalytic modules in the same polypeptide and can target different substrate forms depending on different structural characteristics (9,10). However, occasionally they can be present in isolated or tandem forms not coupled with an enzyme. Roughly 7% of CAZy entries contain at least one CBM. CBMs are presently classified in 52 families in CAZy (7).

In addition to protein families that are well curated by the CAZy database, CAZymes are known to contain domains not acting on carbohydrates, including other enzymes (such as proteases, myosin motors or phosphatases) and a variety of protein–protein or protein–cell wall binding domains (cohesins, SLHs, TPR, etc.).

The CAZy family classification system covers all taxonomic groups, and provides the ground for a common nomenclature for CAZymes across different glycobiologists (11,12), who are generally specialized only in some specific groups of organisms. Day-to-day inspection of new enzyme characterizations reported in the literature regularly led and continues to lead to the definition of new enzyme families. Significantly, the CAZy families, originally created following hydrophobic cluster analysis in the 1990s from the very limited number of sequences available (2–6) and later complemented by BLAST- and HMMER-based sequence similarity approaches, are globally surviving the challenge of time in spite of a hundred-fold increase in the number of sequences.

DATABASE CONTENT

The CAZy database contains information from (i) sequence annotations from publicly available sources, namely the NCBI, including taxonomical, sequence and reference information, (ii) family classification and (iii) known functional information. These data allow the exploration of an enzyme (CAZyme), all CAZymes in an organism or a CAZy protein family. New family members and biochemical information extracted from the literature are added regularly, following careful inspection. Newly released three-dimensional (3D) structures and genomes are analyzed as they are released by public databases. Daily update releases from GenBank, which form the bulk of sequence additions to the database (8), are complemented by weekly PDB releases (13). Presently only genomes released through these GenBank releases are analyzed regularly, whereas protein predictions from other genomes are analyzed upon request as part of collaborative efforts (vide infra).

Another feature of CAZy is that the number of families, the family-associated information and content are continuously updated. When new families are created, previously released genomes and sequences in public databases are reanalyzed to take the new family into account and to ensure completeness in sequence description. Internally, curators include and maintain all referenced biochemical and other characterization data from the literature and the analysis of full sets of protein sequences present in a single genome. Because of this continuous effort of data addition, new families are frequently added and reflect the advances in experimental characterization of CAZymes. New families are exclusively created based on the availability of at least one biochemically characterized member for which a sequence is available and the information is published in peer-reviewed scientific literature. This sequence then serves as a seed for the family, which is gradually extended with sequences that share statistically significant similarity.

Only functional assignments based on experimental data are included in the CAZy database, by the association of EC numbers with protein sequences. Inferred functional assignments are therefore not included. Experimental data are ideally a direct enzyme analysis, but could also include indirect evidence such as gene knockout experiments with extensive characterization. Because there is a shortage of EC numbers relative to the number of functions characterized experimentally, some incomplete EC numbers such as 3.2.1.-, 2.4.1.-, 2.4.2.- and 2.4.99.- are also included in the database. In addition, as the functions described in CAZy are only of an enzymatic nature, additional and complementary binding and inhibitory functions known to be associated with several CAZy proteins will be curated and explored in the near future.

SEMI-AUTOMATIC MODULAR ASSIGNMENT

Carbohydrate-active enzymes can exhibit a modular structure (Figure 1), where a module can be defined as a structural and functional unit (7,14). Each family in CAZy depends on the definition of a common segment in each full sequence that ultimately contains the catalytic or binding module. The definition of the limits of the component modules within the sequence depends on available information derived from a combination of different approaches:

(1) protein 3D structures,
(2) reported deletion studies and
(3) protein-sequence analysis and comparisons.

Different sequence comparison tools are used to define enzyme families, particularly gapped BLAST (9) and HMMER (10) using hidden Markov models (HMMs) made from each family. All the sequences corresponding to the catalytic and binding modules of carbohydrate-active enzymes are excised from the full protein sequence and grouped in a BLAST library. Positive hits against this ‘high quality’ library are entered into the database by trained curators following a manual check on a daily basis; a small number of sequences with high-identity (>85%) ungapped alignments to previously examined sequences are entered automatically.

A new layer dealing with the analysis of whole protein sets issued from genomes has been introduced recently. Modular annotation has in fact been applied to genome data released by the NCBI, with over 750 genomes analyzed. Approximately 1–3% of the proteins encoded by a typical genome correspond to CAZymes (10,11). In addition to publicly released sequences, annotation of proteins in recently sequenced genomes prior to full release is regularly performed by the CAZy team in collaboration with scientists from all over the globe.

Figure 1. Examples of modular carbohydrate-active enzymes. (a) Cellobiohydrolase I from Hypocrea jecorina (SP P00725); (b) alginate lyase from Sphingomonas sp. A1 (GB BAB03312.1); (c) xylanase from Cellulomonas fimi (GB CAA54145.1); (d) xylanase D/lichenase from Ruminococcus flavefaciens (GB CAB51934.1); (e) chitin synthase from Emericella nidulans (GB BAA21714.1); (f) cyclic β-1-3-glucan synthase from Bradyrhizobium japonicum (GB AAC62210.1).

MANUAL FUNCTIONAL ANALYSIS

All too often, functional annotation methods employed during whole genome annotation are erroneous and lack consistent language (12,15). While sequence similarity to genes annotated by GO or best BLAST hits can be a good starting point for assignment to pathways or possible general functions, such as serine/threonine kinase, many automatic functional assignments are unfortunately much more specific. This is particularly true in the case of CAZymes, since related families of the latter group together enzymes of widely differing specificity.

The CAZy database employs practices that aim to eliminate the problems with automatic annotation. Biochemical characterization of new proteins from the literature is used to create new protein families, to annotate their referring entries and to update family descriptions (6). These biochemical assignments are also employed to help the manual curator estimate the likely general functions and add descriptions that indicate which enzymatically characterized proteins are related to new sequences. Inclusion of reference data compiled by communities centered on model organisms is considered for the future. Bibliographic references are included in CAZy through a specific layer that holds over 16 000 different references. These references were extracted automatically from individual accessions using ProFAL (16) and about one-third was entered manually.

When functional predictions are made, they arise from manual curation by examination of closely related sequences; when biochemical information is not available, as is the case for many genome projects, very general functional tags are used to convey the general functions of a family. Recently, we have begun further breaking down families into subfamilies in the hope of grouping proteins by specificity using sequence similarity. This would allow us to give more insights into possible functions. This new classification can also give insights into conserved active sites and active-site specificity when comparing biochemically characterized enzymes. Currently subfamily assignments are available publicly only for GH13 (14), GH1, GH2 and GH5 (released with this publication). This effort will be continued, with many more subfamilies being incorporated into the CAZy knowledge base in the future. Subfamilies identify subgroups of sequences that are more homogeneous in their functional properties. Most identified subfamilies are monospecific. If polyspecific, the functional variability is low and typically limited to two or three EC activities. There, the known subfamily functions often share a substrate or product. Furthermore, rational enzyme engineering may be used to switch the functions for several cases (data not shown). Subfamilies also open the door for further enzymatic characterization (a few subfamilies still have no known activity) or for the identification of meaningful targets for structural characterization.

LARGE-SCALE ANALYSIS AND COLLABORATION

Internal CAZy tools, such as our semi-automatic modular assignment, presently allow the analysis of a larger number of sequences than a few years ago, making it possible to perform large-scale analyses, such as the annotation of CAZyme systems in genomes and metagenomic investigations of the breakdown of complex carbohydrates. A typical genome analysis begins with the assignment of protein models to one or several CAZy families (depending on the number of CAZy modules present within the sequence). This family assignment is then followed by the prediction of general functional classes using a manual examination of alignments to closely related sequences, taking care to identify the retention of active-site residues. Once a genome is categorized by family and functional classes, gene content analysis is used to give insights into how newly sequenced organisms might be similar to or different from closely related species. Differences in genome content, i.e. relative family size, might reflect the relative diversity or complexity of the inherent biological processes (17) and therefore the biology of the compared species. For example, differences suggesting a more pronounced pectin metabolism in ‘dicot’ Arabidopsis versus ‘monocot’ rice have been noted (17), and expected differences in cell-wall metabolism between the short-lived annual Arabidopsis and the long-lived poplar tree have been suggested (18). With the advent of a variety of post-genomic techniques, a new vision of the CAZymes as significant components of carbohydrate-based systems now emerges. Examples include N- and O-glycosylation of proteins, starch metabolism, and biosynthesis of the cell wall and its subcomponents. Geisler-Lee et al. (19) have combined bioinformatics and transcriptome analysis of various poplar and Arabidopsis tissues and organs and have shown that CAZyme transcripts are particularly abundant in wood tissues.

NEW FEATURES

In addition to a website facelift, the new CAZy website comes with a host of new features. Primarily, we now offer users the ability to search the CAZy site for information by GenBank protein accession number, family or organism rather than navigate long static pages as prior to 12/31/2008 (Figure 2). The new site also includes pages describing new releases, new genomes and other new features. In addition, tools developed in the lab are available for interactive use.

FUTURE TRENDS

The CAZy database is a fluid database, always changing and growing as additional data become available.

Figure 2. (A) Once a search is performed, such as for a protein accession (P00275), the resulting page indicates the modular families that compose that protein. (B) Upon clicking the resulting links provided in A, users are directed to a page about the family, which gives a listing of all annotated members.
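The search described under NEW FEATURES and illustrated in Figure 2 (lookup by protein accession, by family or by organism) can be pictured as a small in-memory index. The Python sketch below is only an illustration under stated assumptions and is not the CAZy implementation: the class and method names, the record layout, and the DEMO_ accessions, organisms and module assignments are all invented for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Entry:
    """One protein record (hypothetical layout, not the CAZy schema)."""
    accession: str              # protein accession number
    organism: str               # source organism
    families: Tuple[str, ...]   # module families in order along the sequence


class CazyIndex:
    """Toy index supporting the three lookups named in the text."""

    def __init__(self, entries):
        self.by_accession = {}
        self.by_family = defaultdict(list)
        self.by_organism = defaultdict(list)
        for entry in entries:
            self.by_accession[entry.accession] = entry
            self.by_organism[entry.organism].append(entry.accession)
            # A family may occur more than once in one protein (tandem
            # modules); list the protein only once per family.
            for family in dict.fromkeys(entry.families):
                self.by_family[family].append(entry.accession)

    def families_of(self, accession):
        """Modular families composing a protein (cf. Figure 2A)."""
        entry = self.by_accession.get(accession)
        return list(entry.families) if entry else []

    def members_of(self, family):
        """All annotated members of a family (cf. Figure 2B)."""
        return list(self.by_family.get(family, []))

    def proteins_in(self, organism):
        """All indexed proteins from one organism."""
        return list(self.by_organism.get(organism, []))


# Invented demonstration records; none of these are real annotations.
entries = [
    Entry("DEMO_0001", "Organism A", ("GH7", "CBM1")),
    Entry("DEMO_0002", "Organism A", ("GH5",)),
    Entry("DEMO_0003", "Organism B", ("CBM1", "GH5", "CBM1")),
]
index = CazyIndex(entries)
print(index.families_of("DEMO_0001"))
print(index.members_of("CBM1"))
print(index.proteins_in("Organism A"))
```

In this sketch, families_of corresponds to the view in panel A of Figure 2 and members_of to the family page in panel B; keeping one dictionary per search key is a design choice made for brevity, not a statement about how the website stores its data.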

In the last 2 years, the number of sequences in CAZy has nearly doubled and the number of available genomes is over 750. We believe this trend will continue in the coming years. Unfortunately, while sequencing is ever more rapid, progress in structural information and biochemical characterization is much slower. The amount of biochemical data has grown by only 8% over the last 2 years (Figure 3). This means that the gap is widening between available sequences and biochemically characterized enzymes, making better methods for high-throughput biochemical characterization advantageous.

As stated previously, we are actively pursuing the classification of subfamilies within each family. This further level of classification is important, for instance, to identify key residues or motifs important to define specificity. Finally, we hope to offer soon a page to submit sequences for a sequence similarity search and keyword search on our website.

AVAILABILITY ON THE WEB

The CAZy database is available at www.cazy.org. Information about selected families is available through the website and at www.cazypedia.org. Software from the group is available at www.cazy.org/tools.

[Figure 3: plot of Number/1000 (y-axis, 0 to 80) against Year (x-axis, 1999 to 2007) for three series: Entries, w/ EC #s, w/ PDBs.]

Figure 3. The number of proteins containing CAZy modules was noted in December of the years 1999–2007. Within this set (open circle), the number of enzymatically characterized proteins (triangle) and those with solved structures (open diamond) were also counted. In December 2007, <10% of proteins in CAZy were characterized enzymatically and <1% had a solved structure. In 8 years, the number of sequences has increased 14-fold, while the number of enzymatic and structural characterizations has nearly doubled. Therefore, the proportion of proteins with functional and structural information is decreasing rapidly unless high-throughput functional efforts are made in this category of enzymes.

FUNDING

The authors wish to thank the Departement des Sciences de la Vie of CNRS for a 2-year funding grant to B.L.C. and Novozymes for a contract supporting V.L.

Conflict of interest statement. P.M.C. is affiliated to Université de Provence (Aix-Marseille I) and B.H. and C.R. are members of CNRS.

REFERENCES

1. Laine,R.A. (1994) A calculation of all possible oligosaccharide isomers both branched and linear yields 1.05 × 10(12) structures for a reducing hexasaccharide: the Isomer Barrier to development of single-method saccharide sequencing or synthesis systems. Glycobiology, 4, 759–767.
2. Henrissat,B., Claeyssens,M., Tomme,P., Lemesle,L. and Mornon,J.P. (1989) Cellulase families revealed by hydrophobic cluster analysis. Gene, 81, 83–95.
3. Henrissat,B. (1991) A classification of glycosyl hydrolases based on amino acid sequence similarities. Biochem. J., 280 (Pt 2), 309–316.
4. Henrissat,B. and Bairoch,A. (1993) New families in the classification of glycosyl hydrolases based on amino acid sequence similarities. Biochem. J., 293 (Pt 3), 781–788.
5. Henrissat,B. and Bairoch,A. (1996) Updating the sequence-based classification of glycosyl hydrolases. Biochem. J., 316 (Pt 2), 695–696.
6. Yip,V.L. and Withers,S.G. (2006) Breakdown of oligosaccharides by the process of elimination. Curr. Opin. Chem. Biol., 10, 147–155.
7. Coutinho,P.M. and Henrissat,B. (1999) Carbohydrate-active enzymes: an integrated database approach. In Gilbert,H.J., Davies,G., Henrissat,H. and Svensson,B. (eds), Recent Advances in Carbohydrate Bioengineering. The Royal Society of Chemistry, Cambridge, pp. 3–12.
8. Benson,D.A., Karsch-Mizrachi,I., Lipman,D.J., Ostell,J. and Wheeler,D.L. (2004) GenBank: update. Nucleic Acids Res., 32, D23–D26.
9. Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.
10. Eddy,S.R. (1995) Multiple alignment using hidden Markov models. In Proc. Intl Conf. Intel. Syst. Molec. Biol. ISMB, 3, 114–120.
11. Davies,G.J., Gloster,T.M. and Henrissat,B. (2005) Recent structural insights into the expanding world of carbohydrate-active enzymes. Curr. Opin. Struct. Biol., 15, 637–645.
12. Doerks,T., Bairoch,A. and Bork,P. (1998) Protein annotation: detective work for function prediction. Trends Genet., 14, 248–250.
13. Bourne,P.E., Addess,K.J., Bluhm,W.F., Chen,L., Deshpande,N., Feng,Z., Fleri,W., Green,R., Merino-Ott,J.C., Townsend-Merino,W. et al. (2004) The distribution and query systems of the RCSB Protein Data Bank. Nucleic Acids Res., 32, D223–D225.
14. Stam,M.R., Danchin,E.G., Rancurel,C., Coutinho,P.M. and Henrissat,B. (2006) Dividing the large glycoside hydrolase family 13 into subfamilies: towards improved functional annotations of alpha-amylase-related proteins. Protein Eng. Des. Sel., 19, 555–562.
15. Gilks,W.R., Audit,B., De Angelis,D., Tsoka,S. and Ouzounis,C.A. (2002) Modeling the percolation of annotation errors in a database of protein sequences. Bioinformatics (Oxford, England), 18, 1641–1649.
16. Couto,F.M., Silva,J.M. and Coutinho,P.M. (2003) ProFAL: PROtein Functional Annotation through Literature. In Pimentel,E., Brisaboa,N.R. and Gomez,J. (eds), Proceedings of the 8th Conference on Software Engineering and Databases, Alicante, Spain, pp. 747–756.
17. Yokoyama,R., Rose,J.K. and Nishitani,K. (2004) A surprising diversity and abundance of xyloglucan endotransglucosylase/hydrolases in rice. Classification and expression analysis. Plant Physiol., 134, 1088–1099.
18. Tuskan,G.A., Difazio,S., Jansson,S., Bohlmann,J., Grigoriev,I., Hellsten,U., Putnam,N., Ralph,S., Rombauts,S., Salamov,A. et al. (2006) The genome of black cottonwood, Populus trichocarpa (Torr. & Gray). Science (New York, NY), 313, 1596–1604.
19. Geisler-Lee,J., Geisler,M., Coutinho,P.M., Segerman,B., Nishikubo,N., Takahashi,J., Aspeborg,H., Djerbi,S., Master,E., Andersson-Gunneras,S. et al. (2006) Poplar carbohydrate-active enzymes. Gene identification and expression analyses. Plant Physiol., 140, 946–962.
