Academia.eduAcademia.edu

A little semantic web goes a long way in biology

2005, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

We show how state-of-the-art Semantic Web technology can be used in e-Science, in particular to automate the classification of proteins in biology. We show that the resulting classification was of comparable quality to one performed by a human expert, and how investigations using the classified data even resulted in the discovery of significant information that had previously been overlooked, leading to the identification of a possible drug-target.

A Little Semantic Web Goes a Long Way in Biology K. Wolstencroft and A. Brass and I. Horrocks and P. Lord and and U. Sattler and R. Stevens and D. Turi School of Computer Science, University of Manchester, UK Abstract. We show how state-of-the-art Semantic Web technology can be used in e-Science, in particular to automate the classification of proteins in biology. We show that the resulting classification was of comparable quality to one performed by a human expert, and how investigations using the classified data even resulted in the discovery of significant information that had previously been overlooked, leading to the identification of a possible drug-target. 1 Introduction Semantic Web research has seen impressive strides in the development of languages, tools, and other infrastructure. In particular, the OWL ontology language, the Protégé ontology editor, and OWL reasoning tools such as FaCT++ and Racer are now in widespread use. In this paper, we report on an application of Semantic Web technology in the domain of biology, where an OWL ontology and an OWL classification tool called the Instance Store were used to automate the classification of protein data. We show that the resulting classification was of comparable quality to one performed by a human expert, and how investigations using the classified data even resulted in the discovery of information that had previously been overlooked. While this example focuses on a particular protein family and a particular set of model organisms, the technique should be applicable to other protein families, and to data from any sequenced genome—in fact we believe that similar techniques should be applicable to a wide range of investigations in biology, and in e-Science more generally. If this proves to be the case, then Semantic Web technology is set to have a major impact on e-Science. Background and Motivation The volume of genomic data is increasing at a seemingly exponential rate. In particular, high throughput technology has enabled the generation of large quantities of DNA sequence information. However, this sequence data needs further analysis before it is useful to most biologists. This process, called annotation, augments the raw DNA sequence, and the protein sequence derived from it, with significant quantities of additional information describing the biological context. One important process during annotation is the classification of proteins into different families. This is an important step in understanding the molecular 2 Authors Suppressed Due to Excessive Length biology of an organism. Attempts to automate this procedure have, however, not generally matched the gold-standard set by human experts. Human expert classification has been more accurate because their expertise allows them to recognise the properties that are sufficient, for example, to place an individual protein into a specific subfamily class. Automated methods have, in contrast, often failed to achieve the same level of specificity. Our goal, therefore, was to improve the precision of automatic protein classification, and bring it up to the same level as that achieved by human experts. Overview of Our Technique Given a set of proteins, each with a (partial) description of its properties, the objective is to find, for each of these proteins, the most specific protein family classes that it is an instance of. To describe protein family classes, we use an OWL-DL ontology; this enables us to specify necessary and sufficient conditions for a protein to be an instance of a given protein class. The ontology models the biology community’s view of the current knowledge of protein classification. We then take protein data derived using standard bioinformatics analysis tools, translate this data into OWL-DL instance descriptions that use terms from the ontology, and use the Instance Store to classify these instances. Empirical Evaluation We have tested our system using data sets from both the human and Aspergillus fumigatus (a pathogenic fungus) genomes. We found that our automatic classification process performed at least as well as a human expert: it allows a fast and repeatable classification process, and the explicit representation of human expert knowledge means that there is a clear and explicit evidence base for the classification. Moreover, the precise and methodical classification of the data led to the discovery of new information about these proteins, including a protein subclass that seems to be specific to pathogenic fungi, and could thus be an important drug-target for pharmaceutical investigations. 2 Science and Technology In this section, we describe the biology problem we have tackled and the Semantic Web technology that we used to achieve an appropriate solution. 2.1 Classifying Proteins The process of annotation follows the “central dogma” of molecular biology. In broad outline, this process consists of the following steps: firstly DNA is sequenced; then genes are identified in this DNA; the DNA is then translated into a protein sequence; the proteins are then analysed and annotated with information useful for further biological investigation. As the majority of the functions of a cell are carried out by proteins, it is the protein sequence that most biologists are interested in. Proteins are classified into families which reflect the functions they carry out in the cell, and often give clear indications as to A Little Semantic Web Goes a Long Way in Biology 3 the biological processes that they are involved in. It is this classification, along with other and diverse kinds of information, which makes up the annotation of a protein and makes the large data sets manageable, enabling biologists to perform more thorough investigations. In the last decade, various steps of this process have been automated, and thus their speed has increased enormously. Sequencing of whole genomes1 is now routine. Gene discovery is technically challenging, but responds well to the increasing availability of CPU cycles. However, this still leaves a large number of protein sequences—approximately 30,000 in the human genome, more or less in other species—generally too many for the individual biologist to cope with. The automation of the annotation process has, however, lagged behind advances in other parts of this process. To date, automated approaches have proven to be quicker than human expert annotation, but the level of detail is often reduced [26, 6]. As a consequence, much of the data about proteins are not annotated with the accurate, specific information necessary, and thus useful resources for further biological dicovery remain untapped. In this investigation, we have used one protein family, the protein phosphatase family, as a case study to demonstrate a new, ontology-based method for automated annotation. This method was designed to combine the speed of automated annotation with the detailed knowledge of expert annotation. Protein phosphatases are a large and varied protein family. Together with another family, the protein kinases, they are critically involved in controlling the activity of many other proteins, thereby forming an essential part of the feedback control mechanism within the cell. Given this pivotal role, it is perhaps unsurprising that many protein phosphatases have been implicated in various diseases of great medical importance, including diabetes, cancer, and neurodegenerative conditions. Phosphatases are therefore a major subject of medical and pharmaceutical research. In general, proteins are relatively modular and comprise of a number of different protein domains. It is often possible to computationally determine the protein domains present in a protein from its sequence. For many protein families, including the protein phosphatases, it is possible to classify their members based on the protein domains they are composed of. To avoid confusion with interpretation domains or the domain of a property, for the remainder of this paper, we use “p-domain” for protein domain. The composition of different p-domains suggests the specific function of a protein, but individual p-domains often have specific and separate functions within this. For example, an enzyme will have a catalytic p-domain that performs the catalysis on the substrate molecule, but it will also contain structural pdomains and binding p-domains that ensure that the substrate can interact with the catalytic p-domain. Therefore, a specific combination of p-domains is required for a protein to function correctly. In some cases, the presence of a certain p-domain is diagnostic for membership in a particular protein family, i.e., some p-domains only occur in a single protein family. If a protein contains 1 A genome is the entirety of DNA in a cell. 4 Authors Suppressed Due to Excessive Length one of these diagnostic p-domains, it must belong to that particular family. For example, the protein tyrosine kinase catalytic p-domain is diagnostic for the tyrosine kinases. Most protein families are, however, defined by a non-trivial combination of p-domains. For example, as you descend the hierarchical structure, extra pdomains (and therefore more specific functional properties) are observed in the protein class definitions. For example, an R5 phosphatase is a type of classical receptor tyrosine phosphatase. As a tyrosine phosphatase, it contains at least one phosphatase catalytic domain and, as a receptor tyrosine phosphatase, it contains a transmembrane region. The R5 type actually contains two catalytic domains and a fibronectin domain, identifying it as an instance of even more specific subclasses. Identifying the p-domain composition of a protein is, therefore, a first step towards its classification. There are databases describing functional p-domains, for example, PROSITE [17], SMART [20] and InterPro [23], and these databases come with specific tools, such as InterPro Scan, which can report the presence of these p-domains in a novel protein sequence; however, bioinformaticians are usually required to perform the analysis that places a protein (with its set of p-domains) into a particular protein family. The whole process of classifying proteins from a genome can be accomplished with the following steps: 1. Given a genome, we extract DNA gene sequences, which we then translate into the set of protein sequences. If we are interested in a particular protein family, we can sub-select sequences containing p-domains diagnostic of that family. 2. On each of the extracted proteins, we use InterPro Scan to determine its p-domain composition. 3. For each of these compositions, we identify the protein family or subfamily to which it belongs by comparing them to the available biological knowledge. The final step currently requires the most human analysis and expert knowledge. Manual classification methods are carried out by protein family experts to interpret this data and use their expert knowledge to classify proteins to a fine-grained level. To the best of our knowledge, no automated method has yet been able to replicate this expert level of detail and precision. 2.2 Ontologies and the Instance Store Ontologies, with their intuitive taxonomic structure and class based semantics, are widely used in domains like bio- and medical-informatics, where there is a tradition of establishing taxonomies of terms. The recent W3C recommendation of OWL2 as the language of choice for web ontologies also underlines the long term vision that ontologies will play a central role in the Semantic Web. Most importantly, as shown in [4], most of the available OWL ontologies can be captured in OWL-DL—a subset of OWL for which highly optimised Description Logic [2] reasoners can be used to support ontology design and deployment. 2 See http://www.w3.org/2004/OWL/ or [11]. A Little Semantic Web Goes a Long Way in Biology 5 Unfortunately, existing reasoners (and tools), while successful in dealing with the (relatively small and static) class level information in ontologies, fail when presented with the large volumes of instance level data often required by realistic applications, hampering the use of reasoning over ontologies beyond the class level. The system we have used—the instance Store (iS) [14]—addresses this problem using a hybrid database/reasoner architecture: a relational database is used to persist instances, while a class level (“TBox” in Description Logic terms) reasoner is used to infer ontological information about the classes they belong to; moreover, part of this ontological information is also persisted in the database. The iS currently only supports a rather limited form of reasoning about individuals: it takes an ontology (without instances), a set of axioms asserting class-instance relationships, and answers queries asking for all the instances of a class description. The classes in both axioms and queries can be arbitrarily complex OWL-DL descriptions, and a DL reasoner is used to ensure that all instances (explicit and implicit) of the query concept are returned. In the remainder of this paper, we use “class-level ontology” for an ontology in which no instances occur. From a theoretical perspective, this might seem un-interesting; the iS is, however, able to deal with much larger numbers of individuals than would be possible using a standard DL reasoner. More importantly, this kind of reasoning turns out to be useful in a range of applications, in particular those such as the one presented here where a domain model is used to structure and classify large data sets.3 There is a long tradition of coupling databases to knowledge representation systems in order to perform reasoning, most notably the work in [5]. However, in the iS, we do not use the standard approach of associating a table (or view) with each class and property. Instead, we have a fixed and relatively simple schema that is independent of the structure of the ontology and of the instance data. The iS is, therefore, agnostic about the provenance of data, and uses a new, dedicated database for each ontology (although the schema is always the same). The basic functionality of the iS system are illustrated in Figure 1. At startup, the initialise method is called with a relational database, an OWL-DL class reasoner such as Racer [9] or Fact++ [30], and a class-level OWL-DL ontology. The method creates the schema for the database if needed (i.e., if the iS is new), parses the ontology, and loads it into the reasoner. To populate the iS, the addAssertion method is called repeatedly. Each assertion states that an instance (identified by a URI) belongs to class (which is an arbitrary OWL-DL description). Once the iS has been populated with some—possibly millions of— instances, it can be queried using the retrieve method. A query again consists of an arbitrary (possibly complex) OWL-DL class description; the result is the set of all instances belonging to the query class, and is returned by retrieve as a set of URIs. The iS uses database queries to return individuals that are “obviously” instances of the query class, and to identify those instances where 3 The iS was initially developed for use in a Web Service registry application, where it was used to classify and retrieve (large numbers of) descriptions of web services. 6 Authors Suppressed Due to Excessive Length the DL reasoner is needed in order to determine if they form part of the answer set. initialise(database: Database, reasoner: OWLReasoner, ontology: OWLOntology) addAssertion(instance: URI, class: OWLDescription) retrieve(query: OWLDescription): Set hURIi Fig. 1. The iS API 3 Description of the Experiments Undertaken The method we present could be applicable in general to many protein families, but to demonstrate the technique and the fine-grained classification possible, we present the analysis of one family, the protein phosphatases, in the human and Aspergillus fumigatus (a pathogenic fungus) genomes. We have combined automated reasoning techniques [9, 14] with elements of a service-oriented architecture [27, 19] to produce a system to automatically extract and classify the set of protein phosphatase from an organism.4 Figure 2 shows the components in our protein classification experiment. An OWL classlevel ontology describes the protein phosphatase family, and this ontology is pre-loaded into the Instance Store. Protein instance data is extracted from the protein set of a genome, and the p-domain composition is determined using InterPro Scan. These p-domain compositions are then translated into OWL descriptions and compared to the OWL definitions for protein family classes using the Instance Store which, in turn, uses a DL reasoner (Racer in this case), to classify each such instance. For each protein, it returns the most specific classes from our ontology that this protein could be found to be an instance of. In the remainder of this section, we will describe the relevant components of this architecture in more detail, and explain the outcomes of this experiment from a biology perspective. In the next section, we describe the experience gained and lessons learnt from a computer science perspective. 3.1 The Ontology In this section, we describe how we capture the expert knowledge for phosphatase classification in an OWL-DL ontology. All the data used for developing our ontology comes from peer-reviewed literature from protein phosphatase experts. The family of human protein phosphatases has been well characterised experimentally, and detailed reviews of the classification and family composition are 4 Due to the relatively small test-set used, the case study reported here could have been carried out using Racer [9] only, i.e., without the iS. However, larger sets of protein data will necessitate the use of iS or a similar tool. A Little Semantic Web Goes a Long Way in Biology 7 Fig. 2. The Ontology Classification System Architecture. available [1, 7, 18]. These reviews represent the current community knowledge of the relevant biology. If, in the future, new subfamilies are discovered, the ontology can easily be changed to reflect these changes in knowledge; we will comment on this in Section 4. Fortunately for this application, there are precise rules,5 based on p-domain composition, for protein family membership, and we can express these rules as class definitions in an OWL-DL ontology. The use of an ontology to capture the understanding of p-domain composition enables the automation of the final analysis step which had previously required human intervention, thus allowing for full automation of the complete process. In biology, the use of ontologies to capture human knowledge of a particular domain and to answer complex queries is becoming well established [8, 28]. Less well established is the use of reasoning systems for data interpretation. In this study, we present a method which makes use of ontology reasoning and illustrates the advantages of such an approach. The ontology was developed in OWL-DL using the Protégé editor, and currently contains 80 classes and 39 properties. Part of the subsumption hierarchy inferrred from these descriptions can be seen in the left-hand panel of Figure 3, which shows the OWL ontology in the Protégé editor. 5 We use “rules” here in a completely informal way. 8 Authors Suppressed Due to Excessive Length Fig. 3. A screenshot of the phosphatase ontology in the OWL ontology editor Protégé. More precisely, for each class of phosphatase, this ontology contains a (necessary and sufficient) definition. For this family of proteins, this definition is, in most cases, a conjunction of p-domain compositions, i.e., a typical case of a phosphatase class definition looks as follows, where Xi are p-domains: If a Y protein contains at least n1 p-domains of type X1 and . . . and at least nm p-domains of type Xm , then this protein also belongs to class Z. For example, receptor tyrosine phosphatases can contain one or two phosphatase catalytic p-domains. In some cases, Xi is a disjunction of p-domains. P-domains come with a rather “flat” structure, i.e., only few p-domains are specialisations of others. Clearly, “counting” statements such as the one above go beyond the expressive power of OWL since they would require (the OWL equivalent of) qualified cardinality restrictions [10], whereas OWL only provides unqualified cardinality restrictions through its restriction(U minCardinality(n)) and restriction(U maxCardinality(n)) constructs. In contrast, this kind of expressive means was provided by DAML+OIL [15], i.e., we could have defined the above mentioned receptor tyrosine phosphatases using the expression IntersectionOf( Restriction(contains minCardinality(1) PhCatalDoms) Restriction(contains maxCardinality(2) PhCatalDoms) A Little Semantic Web Goes a Long Way in Biology 9 To overcome this problem, we devised a work-around in OWL. For each Xi that we would have liked to use in a qualified number restriction, we introduced a subproperty containsX i of contains, and set the range of containsX i to the class Xi . In addition, we added sub-property assertions so that the hierarchy of newly introduced properties containsX i reflects the class hierarchy of the classes Xi used. This work-around is well-known,6 but it is not always correct. That is, assume there are two ontologies, one with qualified number restrictions and one that resulted from the application of this work around to it. Then there are cases where the first one implies a subsumption relationship between two classes, whereas the second one does not imply this subsumption. Similarly, a class may be unsatisfiable w.r.t. the first one, but satisfiable w.r.t. the second. However, in the special case of our experiment, this work around was correct, even though we are not going to prove this here. We will comment more on this in Section 4 and 5. Having captured the expert knowledge in this way, we are left with the problem of dealing with the potentially very large numbers of protein instances that need to be classified according to the corresponding ontology. This requirement motivated our use of the iS. 3.2 The Data Sets This study focuses on the previously identified and described human phosphatases [1, 24], and the less well characterised A.fumigatus protein phosphatases. The human phosphatases, having been carefully hand-classified, form a control group for our automated protein phosphatase classification. Previous classification of human phosphatases by biological experts provides a substantial test-set for our approach. If the iS can classify the characterised proteins (at least) as well as human experts, then this would increase our confidence when using our method on unknown genomes. The A.fumigatus genome falls between these extremes, and thus offers a unique insight into the comparison between the automated method and the manual. The A.fumigatus genome has been sequenced, and annotation is currently underway by a team of human experts [22]. We have considered 118 human phosphatases and 45 from A.fumigatus. Pre-Screening Isolation of the protein phosphatase sequences from the protein set of the genome was achieved by screening for diagnostic phosphatase motifs, i.e. for specific patterns. These are 1. the protein tyrosine phosphatase active site motif H-C-X(5)-R 2. the protein serine/threonine phos-phatase motif [LIVMN]-[KR]-G-N-H-E 3. the protein phosphatase C signature motif [LIVMFY]-[LIVMFYA]-[GSAC][LIVM]-[FYC]-D-G-H-[GAV]. The EMBOSS program, PatMatDB [25] was used to perform the pre-screening process. Performing an InterPro Scan on every protein sequence from the 6 See, e.g., http://www.cs.vu.nl/∼guus/public/qcr.html 10 Authors Suppressed Due to Excessive Length genome would also have isolated the protein phosphatase sequences, but each InterPro Scan can take several minutes. PatMatDB can screen the whole genome in the time taken to run one InterPro Scan, so we decided to use InterPro Scan only for the detailed analysis of each sequence identified as being a protein phosphatase. 3.3 Queries Asked and Results The purposes of the human and A.fumigatus studies were different. The human study was a proof of concept to demonstrate the effectiveness of the automated method. The A.fumigatus study was more focused towards biological discovery. For the human phosphatases, we were interested in comparing the automated classification with the human expert classification. Therefore, we browsed the class hierarchy of our phosphatase ontology and, for each class, we retrieved those proteins for which the iS inferred that this class was the most specific one. We were also interested in identifying instances that did not fit any of the ontology class definitions (i.e., whose most specific class was the top class). For the A.fumigatus phosphatases, we browsed the class hierarchy in a similar way but, as the phosphatases from this organism were less well characterised, we were paricularly interested in the differences between the human and A.fumigatus set, i.e., we were interested in finding classes that had instances of the human proteins, but not of the A.fumigatus proteins, and vice versa. All these queries could be answered easily and quickly using the iS. The results of this experiment were three-fold. Firstly, we found that the automated classification of the human protein phosphatases performed as well as the manual classification by phosphatase experts. Since the same protein instances were used in the automated and manual studies, we could compare these two classifications, and it turned out that both classifications put almost all phosphatases into the same place in the class hierarchy. This evidence shows proof of concept, and suggests that the automated approach could be used to solve the current annotation bottleneck. Secondly, in the few cases where the automatic and the manual classification differed, detailed investigations by a domain expert revealed that the automatic one was actually “more correct”: we discovered two proteins for which no appropriate class was available, i.e. they were classified by the automatic classification as instances of the top class. This discovery lead to a modification of the ontology, and thus of the expert knowledge on proteins. One of these phosphatases was DUSP10 (Dual specificity phosphatase 10). It was found to contain an extra p-domain, a disintegrin. The p-domain is not found in any other protein phosphatase and poses interesting questions about possible protein functions to the biologists. Our automated classification method was able to find these mis-classifications because the iS applied the expert knowledge systematically and consistently. Thirdly, the automated classification of the A.fumigatus phosphatases revealed large differences from the human phospahtases. Not only were there fewer individual proteins, but whole subfamilies were missing. Some of these differences can be attributed to the differences in the two organisms. Many phosphatases in A Little Semantic Web Goes a Long Way in Biology 11 the human classification were tissue-specific variations of tissue-types that do not occur in A.fumigatus. Since A.fumigatus is pathogenic to humans, these differences are important avenues of investigation for potential drug targets. The most intersting discovery in the A.fumigatus data set was the identification of a novel type of calcineurin phosphatase, i.e., again, a phosphatase that was classified automatically only as an instance of the top class. Calcineurin is well conserved throughout evolution and performs the same function in all organisms. However, in A.fumigatus, it contains an extra functional domain. Further bioinformatics analyses revealed that this extra domain also occurs in other pathogenic fungus species, but in no other organisms, suggesting a specific functional role for this extra p-domain. Previous studies have identified divergences in the mechanism of action of calcineurin in pathogenic fungi as being linked to virulence, so this protein is an intersting drug-target for future study. 4 Lessons Learnt As we have seen, we have successfully used Semantic Web technology in a bioinformatics application. Besides finding new protein families that are of interest to biologists, we have shown that automated classification can indeed compete with manual classification, and is sometimes even superior. Our approach to automated classification combines the advantages of speed of the automated methods and accuracy of human expert classification, the latter being due to the fact the we captured the expert knowledge in an OWL ontology. The combination of the two, namely speed and expert knowledge, provides a quick and efficient method for classifying proteins on a genomic scale, and offers a solution to the current annotation bottleneck. Our approach was made possible by the development of state-of-the-art Semantic Web technology, such as the OWL ontology language, the Protégé OWL ontology editor, the OWL Instance Store, and the Racer OWL reasoner; this technology did not emerge overnight, but is based on decades of research in logic-based knowledge representation and reasoning. Although neither Racer nor the iS support all of OWL-DL,7 these tools proved more than adequate for our experiment. In contrast, a limitation in the expressive power of OWL-DL did cause considerable problems: the lack of qualified number restrictions (also called qualified cardinality restrictions). In order to overcome this limitation we had to employ a work around, and verify that this work around, even though not correct in general, was correct for our ontology and instance data. This work around introduced a significant overhead, and was only possible through a close co-operation between the biologists and computer scientists. We thus cannot recommend such an approach in general. Additionally, we observe that, from a theoretical and 7 Racer does not support individual names in complex class descriptions (so-called nominals—see [16]), and the current version of iS does not support role assertions between individuals. 12 Authors Suppressed Due to Excessive Length practical perspective, this work around should not be necessary since (a) reasoners such as Racer and Fact [9, 13] support qualified number restrictions, (b) for all Description Logics we are aware of that support (unqualified) number restrictions, the worst-case complexity of reasoning remains the same when they are extended with qualified number restrictions (see, e.g., [29]), and (c) the latest version of Protégé-OWL is supporting qualified number restrictions. Hence we can, in the future, run similar experiments without having to resort to this work around, provided that we are willing to diverge from the current OWL standard. The ability to run such experiments is of considerable importance since there is a wealth of unannotated and partially annotated data in the public domain, to which we plan to apply our approach. New genomes are being sequenced continually, and some existing genomes have not been annotated to any degree of detail. Now that the ontology system architecture is in place, new proteins can be quickly and successfully classified as members of protein phosphatase subfamilies. Development of other ontologies, would enable the application of this technique to some of the 1,000’s of other protein families. This paper demonstrates a proof of concept for the automated classification of proteins using automated reasoning technologies. From a study involving a single protein family and two species, we were able to identify a new protein subclass. As this class of protein appears to be specific to pathogenic fungi, it is potentially useful for further pharmaceutical investigations. Automated reasoning over instance data has therefore enabled us to generate new hypotheses which will require signficiant further laboratory experimentation, which, in turn, will potentially improve our understanding of protein phosphorylation. Finally, we would like to point out that the ontology definitions are produced from expert protein family knowledge. Therefore, they reflect what is currently known in the research community, and are made explicit in a machineunderstandable format, namely OWL-DL. This has several important consequences. Firstly, the construction of such an ontology can help in the development of a consensus from within the community [3], and even if the community fails to agree on a single ontology, automated classification could be used to enable “parallel” alternative annotations. Secondly, if the community knowledge of the protein family changes, the ontology can easily be altered, and the protein instances can be re-classified accordingly. Lastly, if the definitions are based on what is known, proteins that do not fit into any of the defined classes are easily identified, making the discovery of new protein subfamilies possible. 5 Outlook and Future Work Our plans for future work are manifold. Basically, we want to do more “automated” biology, but we are thereby pushing the current state-of-art in logic-based knowledge representation, automated reasoning, and Semantic Web technology. Within this section, we only discuss three of the related issues. Firstly, we observe that a protein is a sequence of amino acids, and thus sequences can be seen as strings over a twenty letter alphabet since there are A Little Semantic Web Goes a Long Way in Biology 13 only twenty amino acids. In our current ontology, we do not capture this sequence information, and thus cannot answer queries related to these sequences. From a biology perspective, however, queries such as “give me all proteins whose amino acid sequence contains an M followed by some arbitrary sub string, which is then followd by a N EN ” would be really valuable. From a computer science perspective, we could easily express (and query over) these strings using a simple form of concrete domains, so-called datatypes [21, 12]. However, the datatypes currently available in OWL do not provide predicates that compare a given string with a regular expression, a comparison that would reflect the above example query. Secondly, we are currently concerned with a single class of components of an organism, namely the proteins. In the future, we want to use the available technology to automate investigations into their interaction, and also represent and reason about larger structures such as genomes and cells. We could easily model interactions between proteins using a property interact to make statements such as “proteins of class X only interact with proteins of class Y”. However, we would also need to make statements on an instance level such as “this protein instance interacts with that protein instance”, which is possible in OWL-DL, but goes beyond the capabilities of the current iS. We are currently extending the iS to handle statements of this kind, and we will see if this extension is able to cope with the large volumes of data that will be needed in biology applications.8 Thirdly, we will “roll back” the work-around we used to cope with the absence of qualified number restrictions, both in our ontology and in the instance data, instead using the form of qualified number restrictions provided by Protégé, Racer, and the iS. This will greatly enhance the interpretability of the current ontology and also make its extension to other families of proteins more straightforward. Acknowledgements This work was funded by an MRC PhD studentship and myGrid e-science project, University of Manchester with the UK e-science programme EPSRC grant GR/R67743. Preliminary sequence data was obtained from The Institute for Genomic Research website at http://www.tigr.org from Dr Jane MabeyGilsenan. Sequencing of A.fumigatus was funded by the National Institute of Allergy and Infectious Disease U01 AI 48830 to David Denning and William Nierman, the Wellcome Trust, and Fondo de Investicagiones Sanitarias. References 1. A. Alonso, J. Sasin, N. Bottini, I. Friedberg, I. Friedberg, A. Osterman, A. Godzik, T. Hunter, J. Dixon, and T. Mustelin. Protein tyrosine phosphatases in the human genome. Cell, 117(6):699–711, Jun 11 2004. 8 Racer can already handle such statements, but can only deal with a relatively small number of individuals. 14 Authors Suppressed Due to Excessive Length 2. Franz Baader, Diego Calvanese, Deborah McGuinness, Daniele Nardi, and Peter F. Patel-Schneider, editors. The Description Logic Handbook: Theory, Implementation and Applications. Cambridge University Press, 2003. 3. M. Bada, D. Turi, R. McEntire, and R. Stevens. Using Reasoning to Guide Annotation with Gene Ontology Terms in GOAT. SIGMOD Record (special issue on data engineering for the life sciences), June 2004. 4. Sean Bechhofer and Raphael Volz. Patching syntax in OWL ontologies. In Proc. of the 3rd International Semantic Web Conference (ISWC), 2004. 5. Alexander Borgida and Ronald J. Brachman. Loading data into description reasoners. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 217–226, 1993. 6. K. Carter, A. Oka, G. Tamiya, and M. I. Bellgard. Bioinformatics issues for automating the annotation of genomic sequences. Genome Inform Ser Workshop Genome Inform, 12:204–11, 2001. 7. P. T. Cohen. Novel protein serine/threonine phosphatases: variety is the spice of life. Trends Biochem Sci, 22(7):245–51, July 1997. 8. Gene Ontology Consortium. Gene ontolgy: Tool for the unification of biology. Nature Genetics, 25(1):25–29, 2000. 9. Volker Haarslev and Ralf Möller. RACER system description. In Proceedings of the International Joint Conference on Automated Reasoning (IJCAR-01), volume 2083 of Lecture Notes in Artificial Intelligence, pages 701–705. Springer-Verlag, 2001. 10. B. Hollunder and F. Baader. Qualifying number restrictions in concept languages. In Proceedings of the Second International Conference on the Principles of Knowledge Representation and Reasoning (KR-91), pages 335–346, Boston, MA, USA, 1991. 11. I. Horrocks, P. F. Patel-Schneider, and F. van Harmelen. From SHIQ and RDF to OWL: The making of a web ontology language. Journal of Web Semantics, 1(1), 2003. 12. I. Horrocks and U. Sattler. Ontology reasoning in the shoq(d) description logic. In Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence, 2001. 13. Ian Horrocks. Using an expressive description logic: FaCT or fiction? In Proceedings of the Sixth International Conference on the Principles of Knowledge Representation and Reasoning (KR-98), pages 636–647, 1998. 14. Ian Horrocks, Lei Li, Daniele Turi, and Sean Bechhofer. The instance store: DL reasoning with large numbers of individuals. In Proc. of the 2004 Description Logic Workshop (DL 2004), 2004. available at CEUR, www.ceur.org, see also instancestore.man.ac.uk. 15. Ian Horrocks, Peter F. Patel-Schneider, and Frank van Harmelen. Reviewing the design of DAML+OIL: An ontology language for the semantic web. In Proc. of the 18th Nat. Conf. on Artificial Intelligence (AAAI 2002), pages 792–797. AAAI Press, 2002. 16. Ian Horrocks and Ulrike Sattler. A tableaux decision procedure for SHOIQ. In Proc. of the 19th Int. Joint Conf. on Artificial Intelligence (IJCAI 2005), 2005. To appear. 17. N. Hulo, C. J. Sigrist, V. Le Saux, P. S. Langendijk-Genevaux, L. Bordoli, A. Gattiker, E. De Castro, P. Bucher, and A. Bairoch. Recent improvements to the prosite database. Nucleic Acids Res, 32(Database issue):D134–7, Jan 1 2004. 18. P. J. Kennelly. Protein phosphatases–a phylogenetic perspective. Chem Rev, 101(8):2291–312, August 2001. A Little Semantic Web Goes a Long Way in Biology 15 19. K.Wolstencroft, P.Lord, L.Tabernero, A.Brass, and R.Stevens. Intelligent classification of proteins using an ontology. Submitted, 2005. 20. I. Letunic, R. R. Copley, S. Schmidt, F. D. Ciccarelli, T. Doerks, J. Schultz, C. P. Ponting, and P. Bork. Smart 4.0: towards genomic data integration. Nucleic Acids Res, 32(Database issue):D142–4, Jan 1 2004. 21. C. Lutz. Description logics with concrete domains—a survey. In Advances in Modal Logics Volume 4. World Scientific Publishing Co. Pte. Ltd., 2003. 22. J. E. Mabey, M. J. Anderson, P. F. Giles, C. J. Miller, T. K. Attwood, N. W. Paton, E. Bornberg-Bauer, G. D. Robson, S. G. Oliver, and D. W. Denning. Cadre: the central aspergillus data repository. Nucleic Acids Res, 32(Database issue):D401–5, Jan 1 2004. 23. N. J. Mulder, R. Apweiler, T. K. Attwood, A. Bairoch, A. Bateman, D. Binns, P. Bradley, P. Bork, P. Bucher, L. Cerutti, R. Copley, E. Courcelle, U. Das, R. Durbin, W. Fleischmann, J. Gough, D. Haft, N. Harte, N. Hulo, D. Kahn, A. Kanapin, M. Krestyaninova, D. Lonsdale, R. Lopez, I. Letunic, M. Madera, J. Maslen, J. McDowall, A. Mitchell, A. N. Nikolskaya, S. Orchard, M. Pagni, C. P. Ponting, E. Quevillon, J. Selengut, C. J. Sigrist, V. Silventoinen, D. J. Studholme, R. Vaughan, and C. H. Wu. Interpro, progress and status in 2005. Nucleic Acids Res, 33(Database issue):D201–5, Jan 1 2005. 24. T. Mustelin, T. Vang, and N. Bottini. Protein tyrosine phosphatases and the immune response. Nat Rev Immunol, 5(1):43–57, January 2005. 25. P. Rice, I. Longden, and A. Bleasby. Emboss: the european molecular biology open software suite. Trends Genet, 16(6):276–7, June 2000. 26. T. F. Smith and X. Zhang. The challenges of genome sequence annotation or ”the devil is in the details”. Nat Biotechnol, 15(12):1222–3, November 1997. 27. R.D. Stevens, H.J. Tipney, C.J. Wroe, T.M. Oinn, M. Senger, P.W. Lord, C.A. Goble, A. Brass, and M. Tassabehji. Exploring Williams Beuren Syndrome Using MyGrid. In Bioinformatics, volume 20, pages i303–310, 2004. Intelligent Systems for Molecular Biology (ISMB) 2004. 28. Robert Stevens, Chris Wroe, Phillip Lord, and Carole Goble. Ontologies in bioinformatics. In Stefan Staab and Rudi Studer, editors, Handbook on Ontologies, pages 635–657. Springer, 2003. 29. S. Tobies. Complexity Results and Practical Algorithms for Logics in Knowledge Representation. PhD thesis, RWTH Aachen, 2001. electronically available at http: //www.bth.rwth-aachen.de/ediss/ediss.html. 30. Dmitry Tsarkov and Ian Horrocks. Efficient reasoning with range and domain constraints. In Proceedings of the 2004 Description Logic Workshop (DL 2004). CEUR, 2004. Available from ceur-ws.org.