
Recent Advances in Research on Radiofrequency Fields and Health

2001, Journal of Toxicology and Environmental Health Part B: Critical Reviews

BioRAT: Extracting Biological Information from Full-length Papers

David P.A. Corney, Bernard F. Buxton, William B. Langdon and David T. Jones*
Bioinformatics Unit, Department of Computer Science, University College London, Gower Street, London, UK, WC1E 6BT

Abstract

Motivation: Converting the vast quantity of free-format text found in journals into a concise, structured format makes the researcher’s quest for information easier. Recently, several information extraction systems have been developed that attempt to simplify the retrieval and analysis of biological and medical data. Most of this work has used the abstract alone, owing to the convenience of access and the quality of data. Abstracts are generally available through central collections with easy direct access (e.g. PubMed). The full-text papers contain more information, but are distributed across many locations (e.g. publishers’ web sites, journal web sites, and local repositories), making access more difficult. In this paper, we present BioRAT, a new information extraction (IE) tool, specifically designed to perform biomedical IE, and which is able to locate and analyse both abstracts and full-length papers. BioRAT is a Biological Research Assistant for Text mining, and incorporates a document search ability with domain specific IE.

Results: We show first, that BioRAT performs as well as existing systems, when applied to abstracts; and second, that significantly more information is available to BioRAT through the full-length papers than via the abstracts alone. Typically, less than half of the available information is extracted from the abstract, with the majority coming from the body of each paper. Overall, BioRAT recalled 20.31% of the target facts from the abstracts with 55.07% precision, and achieved 43.6% recall with 51.25% precision on full-length papers.

Availability: The software and documentation can be found at http://bioinf.cs.ucl.ac.uk/biorat.
Contact: [email protected]; [email protected]

* To whom correspondence should be addressed.

1 Introduction

The rapid and ongoing growth in the number of biological and medical publications means that researchers can no longer read more than a small proportion of the literature in their field. Yet interesting and useful information, relevant to the researcher, could appear in papers they have not read and therefore be missed entirely. Accompanying this growth in literature is the increasing proportion of electronically available papers, as most publishers now produce on-line versions of their journals. But while this may ease access, there is still a vast quantity that a researcher may feel they should read, with no concomitant increase in their ability to do so.

Information retrieval helps researchers to find papers, but it still leaves a large amount of reading to be done. Information extraction (IE) goes one stage further, and analyses the papers on behalf of the researcher. IE systems achieve this by identifying semantic structures in the text, and in so doing, distill an entire document down to the key facts.

BioRAT can be regarded as a research assistant that is given a query and, autonomously, finds a set of papers, reads them, and highlights the most relevant facts in each. BioRAT uses natural language processing techniques and domain-specific knowledge to search for patterns in documents, with the aim of identifying interesting facts. These facts can then be extracted to produce a database of information, which has a higher “information density” than a pile of papers. This is similar to an information extraction system that has recently been developed by Blaschke and Valencia (2001, 2002), and which will be discussed in more detail below.

There have been several attempts to apply IE techniques to scientific papers, but these have used only the abstract of each paper. Example applications include protein-protein interactions (Thomas, Milward, Ouzounis, Pulman & Carroll 2000); using machine learning to classify biological relationships (Craven & Kumlien 1999); and protein structure and residues (Gaizauskas, Demetriou, Artymiuk & Willett 2003). Abstracts are readily available in large numbers (e.g. through PubMed [1]), are available in plain text, and typically have no superscript or subscript characters, no footnotes, and so forth. This avoids potential difficulties in interpreting unusual symbols, Greek letters etc. However, the abstract is only a summary of the paper in question; the full text will typically include more detail that may be of direct interest to the reader. BioRAT is designed to extract information both from abstracts and from full text, and in this work, we use BioRAT to compare information extraction from abstracts and from full-length papers.

[1] http://www.ncbi.nlm.nih.gov/entrez/

A “challenge evaluation” has recently been proposed, to encourage researchers to focus on a particular task, allowing a direct comparison of their systems. As described by Yeh, Hirschman & Morgan (2003), full-length articles were used in the challenge, after they had been manually “cleaned” to convert Greek letters, super- and sub-scripts, and italics, into marked-up plain text. Furthermore, a list of genes mentioned in each paper was also available to entrants. While necessary for a formal evaluation, such resources are not generally available to text mining systems, so we have not used them here. Also, Yeh et al. (2003) state that PDF papers “were not suitable for processing by most text mining systems”, and so the contest was limited to those papers that were available in HTML format. BioRAT uses the full-length paper whenever it is available, instead of just the abstract, and uses PDF files directly from the Internet. PDF is one of the most widely used formats for research papers on the Internet.

The rest of this paper is organised as follows.
In the next section, we describe the BioRAT system, and its key components. We then discuss two experiments which evaluate BioRAT using the DIP database, and discuss the results. We follow the advice of Blaschke & Valencia (2001), who specifically recommend the use of DIP as a “realistic scenario for the comparison of IE systems”.

2 System Outline

We designed BioRAT to give people with no IE experience a powerful tool to help them locate and analyse research papers. The system therefore combines tools to locate papers, to download full-length papers, to extract information from papers, and to design templates to allow this extraction. Typically, the user enters a query into BioRAT, which is then passed on to PubMed. The user is then presented with a list of papers, from which they can choose to download abstracts or, where available, full-length papers. Having obtained some text, the user can then apply some pre-existing templates, or create their own. In either case, the templates match patterns in the text that contain “useful” information, which is extracted for display to the user, and for possible incorporation into a database. Figure 1 shows screenshots of this process.

Figure 1: Screen shots showing BioRAT in use. (a) Finding papers on the web: the document search interface. The user enters a query at the top, and BioRAT accesses PubMed via the Internet. A list of matching titles (with date of publication, author etc.) is shown on the left, and the user can select any item to view the abstract, on the right, or to download the full-length paper. (b) Designing a template: the BioRAT template design component. The user can view a document, select target words (or a phrase) from it, and then define a template in terms of parts of speech, gazetteer headings, word stems, or the words themselves. Gazetteers can also be viewed and edited through the same interface. (c) Extracting information: the results from templates designed to recognise protein-protein interactions. Four interactions are shown, with the context quoted from the source text. A command line interface is also available.

2.1 Web Spidering

One distinctive feature of BioRAT is that it automatically locates and acquires full length papers wherever possible, instead of just using abstracts. It does this via the Internet, by following a series of hyperlinks to find each target paper. To find a particular paper, BioRAT starts with a URL (web address) provided by the PubMed database. It then goes to that web page and identifies the hyperlinks there, and recursively follows links until it finds the target paper, in PDF format. This is downloaded and converted to a text-only version, ready for the IE engine.

Finding the target paper is non-trivial for such a tool. The URL provided by PubMed (and ultimately, by the journal publishers) does not point to the paper itself, but rather to a web page from which the paper can be accessed. The spider’s task is to find the target paper by following a series of hyperlinks. The system works by downloading the web page, identifying and evaluating all the links in it, and iteratively following the highest-scoring link, with scores based on simple keyword matching. Once a PDF file has been located and downloaded, it is converted into plain text for later analysis. To ensure that the correct paper has been identified, and that the text conversion process has succeeded, the first part of the plain text file is compared with the corresponding abstract obtained directly from PubMed, using a fuzzy string matching routine.

BioRAT only attempts to locate and download PDF papers, as this is by far the most widely used format. While some have suggested a move towards using XML for distributing research papers (Murray-Rust & Rzepa 2002), papers in this format are not generally available to biological researchers. It is also unlikely that existing archives would be marked-up manually.
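The link-following and verification steps described in this section can be illustrated with a short Python sketch. It is not BioRAT's implementation: the keyword weights, the function names, the hop limit, the similarity threshold and the example URLs are all assumptions made for illustration, and page downloading is replaced by a caller-supplied function so that the logic can be run without network access.

```python
from difflib import SequenceMatcher

# Hypothetical keyword weights; the paper only says that link scores are
# "based on simple keyword matching".
KEYWORDS = {"pdf": 3, "full": 2, "text": 1, "reprint": 1}


def score_link(url, anchor_text):
    """Score a hyperlink by the keywords found in its URL and anchor text."""
    haystack = (url + " " + anchor_text).lower()
    return sum(weight for word, weight in KEYWORDS.items() if word in haystack)


def find_pdf(start_url, get_links, max_hops=5):
    """Iteratively follow the highest-scoring unvisited link until a PDF is reached.

    get_links(url) must return a list of (url, anchor_text) pairs for a page.
    Returns the PDF URL, or None if no PDF is found within max_hops.
    """
    url = start_url
    visited = {url}
    for _ in range(max_hops):
        if url.lower().endswith(".pdf"):
            return url
        candidates = [link for link in get_links(url) if link[0] not in visited]
        if not candidates:
            return None
        url = max(candidates, key=lambda link: score_link(*link))[0]
        visited.add(url)
    return url if url.lower().endswith(".pdf") else None


def abstract_coverage(paper_text, abstract, slack_words=200):
    """Fraction of the abstract's words that align, in order, with the start of the paper."""
    abstract_words = abstract.lower().split()
    head = paper_text.lower().split()[: len(abstract_words) + slack_words]
    if not abstract_words:
        return 0.0
    matcher = SequenceMatcher(None, abstract_words, head, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(abstract_words)


def looks_like_same_paper(paper_text, abstract, threshold=0.8):
    """Fuzzy check that converted PDF text begins with the expected abstract."""
    return abstract_coverage(paper_text, abstract) >= threshold


# A toy "web site" standing in for a publisher's pages.
SITE = {
    "http://example.org/abs/1": [
        ("http://example.org/about", "About"),
        ("http://example.org/full/1", "Full text"),
    ],
    "http://example.org/full/1": [
        ("http://example.org/abs/1", "Abstract"),
        ("http://example.org/1.pdf", "PDF"),
    ],
}

pdf_url = find_pdf("http://example.org/abs/1", lambda u: SITE.get(u, []))
```

Comparing word sequences rather than characters keeps the check tolerant of extra material before the abstract (title, author list) while still rejecting a paper whose opening text shares little with the PubMed abstract.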
Having obtained some relevant documents, the system then attempts to extract interesting facts from them.

2.2 Information extraction engine

Information extraction (IE) is a key part of BioRAT’s functionality. The aim of IE is to extract from a set of documents the key facts about prespecified types of events, objects and relationships. These facts are then used automatically to populate a database. This can then be used to ease on-line access.

The heart of BioRAT is an IE engine, based on the GATE toolbox [2], produced at Sheffield University (Cunningham, Maynard, Bontcheva & Tablan 2002). GATE is a general purpose text engineering system, whose modular and flexible design allows us to use it to create a more specialised biological IE system.

One issue common to any biological information extraction system is that many protein and gene names are easily mistaken for common words. For example, the Swiss-Prot database includes entries with names “mice”, “was” and “alpha”, as well as 26 single-letter gene names. The problem is to distinguish whether the word “was” refers to a gene or is simply the past tense of the verb “to be”, for example. Sometimes, this can be resolved by considering the case of the letters, but this is not reliable. Instead, BioRAT uses GATE to label words according to their parts of speech, and then applies a filter that rejects determiners, verbs etc. as not being proteins. This part-of-speech filtering is one possible advantage of BioRAT.

Two components of GATE that must be modified for our domain-specific application are gazetteers and templates, which we shall now discuss in turn.

2.2.1 Gazetteers

One task in IE is “named entity recognition”, which aims to identify key items within text. For example, we may want to identify words that are people’s names, company names, proteins, genes and so on. Once identified, these words or phrases can then be matched by the templates. One simple approach that we adopt is to use a gazetteer. A gazetteer is a list of words identifying members of a particular category. For example, one gazetteer may list names of proteins, while another lists names of people.

BioRAT incorporates gazetteers from three sources, namely MeSH [3], Swiss-Prot [4], and hand-made lists. The top two levels of the MeSH hierarchy contain a total of approximately 120 entries, each of which was used to define a separate gazetteer. Each of the almost 22,000 entries in MeSH was extracted and added to the appropriate gazetteer(s). Further gazetteers were derived from Swiss-Prot. Each entry from Swiss-Prot describes a single protein, but proteins often have many synonyms, all of which are included in the relevant gazetteer. Also, some authors refer to proteins in terms of the genes that encode them, so the gene names were also extracted, and used to create another gazetteer.

To supplement these two sources, two further gazetteers were created by hand. These consisted of words that covered concepts of interest that were not already in other gazetteers. One consisted of 30 words describing the interaction of proteins (e.g. “bind”, “downregulate”, “interact” and so on). The other consisted of a few further synonyms of proteins not already covered by the other gazetteers. These hand-made gazetteers were initially created following domain expert advice, and subsequently modified as required.

2.2.2 Templates

A template is a representation of a text pattern that allows us to extract information automatically. It consists of a number of predefined slots to be filled by the system from information contained in the text. One of the simplest templates from BioRAT is:

“interaction of” (PROTEIN 1) “and” (PROTEIN 2)

Here, “PROTEIN 1” and “PROTEIN 2” are slots to be filled with names of proteins, as defined by a gazetteer. The contextual phrase (“interaction of”) is a fixed string: only phrases containing those exact words will be matched by this particular template.
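The way such a template combines fixed strings with gazetteer-backed slots can be sketched in a few lines of Python. This is only an illustration of the idea: BioRAT itself is built on GATE, and the data structures, function name and two-entry toy gazetteer below are assumptions rather than BioRAT's actual representation.

```python
import re

# Toy gazetteer for illustration only; BioRAT's protein gazetteers are
# derived from Swiss-Prot and are far larger.
GAZETTEERS = {"PROTEIN": {"Pex7p", "Pex13p"}}

# The template "interaction of" (PROTEIN 1) "and" (PROTEIN 2), written as a
# sequence of literal words and gazetteer slots.
TEMPLATE = [
    ("literal", "interaction"),
    ("literal", "of"),
    ("slot", "PROTEIN"),
    ("literal", "and"),
    ("slot", "PROTEIN"),
]


def apply_template(sentence, template=TEMPLATE, gazetteers=GAZETTEERS):
    """Return one tuple of slot fillers for every place the template matches."""
    tokens = re.findall(r"\w+", sentence)
    results = []
    for start in range(len(tokens) - len(template) + 1):
        window = tokens[start:start + len(template)]
        filled = []
        for (kind, value), token in zip(template, window):
            if kind == "literal":
                if token != value:  # fixed strings must match exactly
                    break
            elif token in gazetteers[value]:  # slot: word must be in the gazetteer
                filled.append(token)
            else:
                break
        else:
            # Every element of the template matched: record the completed template.
            results.append(tuple(filled))
    return results


matches = apply_template(
    "Genetic evidence for the interaction of Pex7p and Pex13p is provided"
)
```

Each returned tuple corresponds to one completed template. A word that is absent from the gazetteer leaves the slot unfilled, so the whole match fails, which mirrors the protein identification failures analysed later in the paper.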
For example, the template shown would identify the sentence ‘Genetic evidence for the interaction of Pex7p and Pex13p is provided...’ and extract from it the interaction (Pex7p ↔ Pex13p) [5].

[2] General Architecture for Text Engineering: http://gate.ac.uk/
[3] Medical Subject Headings: http://www.nlm.nih.gov/mesh/
[4] http://www.expasy.org/
[5] We use the format “X ↔ Y” to represent any form of interaction between two proteins, X and Y.

A slightly more complicated template is:

(EXPRESSION) “of” (PROTEIN 1) (WORD)? (WORD)? (WORD)? (“by” | “to” | “with”) (PROTEIN 2) “and” (PROTEIN 3)

Here, “EXPRESSION” refers to a gazetteer containing words relating to protein expression and interaction, such as “bind” and “inhibit”. The slot (WORD)? is a wildcard that matches any word, but is optional, so the sequence (WORD)? (WORD)? (WORD)? matches between zero and three consecutive words of any type. As before, the three (PROTEIN x) slots match protein names, and the quoted strings must be matched exactly. The | character is a logical “OR”. For example, this template matches part of the sentence “Specific binding of Rna15 in complex with Hrp1 and Rna14 creates a polymerase pause site...”, and identifies two interactions: (Rna15 ↔ Hrp1) and (Rna15 ↔ Rna14), with the expression type “binding”.

As with comparable IE systems, such as those mentioned in the introduction, the templates in BioRAT are written by hand. There have been attempts at automatic template creation (Collier 1998), but these have not been broadly applicable. Although template design takes time and requires some practice, it does allow the user to maintain full control over what information is extracted, and allows experts to incorporate their knowledge within the system. Because of this, BioRAT incorporates a template design tool, designed to allow ordinary users to create their own templates with little effort, as discussed in the next section.

BioRAT produces data in XML format, which can be readily imported into existing database query systems. The same data is produced simultaneously as HTML and as a comma-separated list, for viewing in applications such as a browser or a spreadsheet, if that is more convenient for the user. Each record in the resulting database represents a single completed template.

2.3 Template design tool

One feature that BioRAT shares with several other text mining systems is the need for a set of templates to be developed for each task. This is often a time consuming process that requires expertise in both text mining and the problem domain. BioRAT includes a template design tool with a graphical user interface, which allows non-expert users to develop templates without having to learn a complex new language. To use it, the user first selects a document that is then displayed. The user can then click on individual words in that document, whose properties are then shown on the screen. The properties used are: part-of-speech tag; gazetteer headings; the word stem; and the word itself. The user can click on these properties to append them to the current template pattern, along with various wildcard and boolean options, and build up a sequence of terms. This can then be applied as a template to the current document, and the results displayed. The user can then cycle between editing the template and viewing the results, until satisfied. Once saved, the template can then be applied to a large set of papers using the main BioRAT template matching interface. Alternatively, the user can select an entire phrase, and the system will create a default template based on that phrase, which the user can subsequently edit as required. The tool can also be used to view and edit gazetteers.

3 Using DIP to evaluate BioRAT

Having described the BioRAT system, and considered the documents that it can be used to analyse, we now turn to a particular study to test the usefulness of the system. For this, we used the Database of Interacting Proteins (DIP [6]; Xenarios, Salwinski, Duan, Higney, Kim & Eisenberg 2002). Blaschke & Valencia (2001) recommend using DIP as a way of evaluating biological IE systems, because it represents a realistic problem of practical interest to biological researchers. IE researchers can use their systems to extract protein-protein interactions, and then compare these with the records in DIP. This does not rely on the interpretation of the authors, and so gives greater confidence in the results. By re-creating (a manageable subset of) DIP, we can calculate the recall and precision of different systems, and compare the results. The recall (or “sensitivity”) is the fraction of target records that the IE system correctly re-creates [7]. Precision is a measure of how much of the output of an IE system is correct, and is defined as the ratio of the number of correct positive predictions to the total number of positive predictions made [8].

[6] http://dip.doe-mbi.ucla.edu
[7] $\text{Recall} = \frac{\#\text{true positives}}{\#\text{true positives} + \#\text{false negatives}}$
[8] $\text{Precision} = \frac{\#\text{true positives}}{\#\text{true positives} + \#\text{false positives}}$

Each record in DIP defines a pair of proteins that interact with each other, and provides citations of papers that describe the interaction. Proteins are defined by entry keys to Swiss-Prot, GenBank or PIR. For simplicity, we only consider DIP records containing two Swiss-Prot identifiers.

For each experiment, we started by selecting a subset of DIP. BioRAT can analyse papers rapidly, typically taking just a few seconds to complete its analysis of each abstract. However, for our experiments, the results need to be manually checked in order to calculate the recall and precision rates, and this time-consuming task forces us to limit the targets to a manageable subset of DIP. Having selected some DIP records, as detailed below, we then used BioRAT to process the corresponding papers, using both the abstract and full-text versions. We manually compared the predictions made by BioRAT to the source DIP records to measure the recall. For each record in DIP, we searched through the output of BioRAT corresponding to the same paper, and checked to see if the interaction mentioned in DIP had been identified. Similarly, we measured the precision by manually counting how many of the records produced by BioRAT were correctly extracted from the text.

Throughout this work, we used the January 2003 version (“dip20030105”) of DIP, the March 2003 version of Swiss-Prot, and the 2003 edition of MeSH.

4 Experiments

4.1 Comparison with SUISEKI

In this section, we compare BioRAT with the existing SUISEKI information extraction engine described by Blaschke & Valencia (2001, 2002). We compare the performance of BioRAT to that of their system by measuring the recall of BioRAT on a sample of papers from DIP that were also used by Blaschke & Valencia (2001). This provides a suitable benchmark for BioRAT.

The SUISEKI system, like BioRAT, uses gazetteers derived from Swiss-Prot and DIP to identify protein names. To extract information, it uses “frames”, which are similar to BioRAT’s templates in that they define patterns of language that form the basis for IE. However, the frames in SUISEKI make less use of linguistic knowledge, and more use of statistics. For example, the frames in SUISEKI distinguish between nouns and verbs, but do not recognise conjunctions, adjectives or any other parts of speech. Also, they count the number of words occurring in a phrase, and favour short phrases over long ones.

There were 389 DIP records that were used by Blaschke & Valencia (2001) and that refer to two Swiss-Prot records. These 389 DIP records relate to 229 PubMed citations. We applied BioRAT to all 229 abstracts, and then analysed the results by hand. We used a total of 19 templates, initially derived from the SUISEKI frames and subsequently modified by hand; and 127 gazetteers, derived from MeSH and other sources, as described earlier. The templates and gazetteers used here can be accessed from the same website as the BioRAT software, http://bioinf.cs.ucl.ac.uk/biorat. Initial trials revealed weaknesses in both the templates and the gazetteers, which were subsequently improved.

Table 1 shows the recall from these abstracts by BioRAT, namely 20.31%. This is a similar recall to that achieved by SUISEKI. The results can be compared with the larger study reported by Blaschke & Valencia (2002), where 190 DIP interactions were correctly detected, from a possible set of 851 interactions, giving a recall score of 22.33%.

We can compare the “abstract” results (Table 1) to the results in Blaschke & Valencia (2002), if we assume the results follow a binomial distribution. Our recall rate of 20.31% from 389 trials gives a variance of $\sigma^2 = 389 \times 0.2031 \times (1 - 0.2031) = 62.96$ and hence a standard deviation of $\sigma = 7.934$. Blaschke & Valencia quote a recall of 190 cases from 851 trials, giving a recall rate of $190/851 = 0.2233$. If they had achieved the same rate on our smaller sample, we would expect them to achieve $389 \times 0.2233 = 86.86$ successes. This is within one standard deviation of our success score (79), so we can say that both systems are performing with approximately the same recall.

              BioRAT              SUISEKI
Result        Cases   Percent     Cases   Percent
Match            79     20.31       190     22.33
No match        310     79.69       661     77.67
Totals          389    100.00       851    100.00

Table 1: Comparison of BioRAT and SUISEKI on recall from abstracts. BioRAT results from 389 DIP records, derived from 229 abstracts. SUISEKI results from 851 DIP records, derived from 514 abstracts. The former set of records is a subset of the latter.

4.2 Abstracts vs. Full-length papers

In the second experiment, we want to assess the benefits of using the full-length version of a paper, rather than just the abstract. Clearly, one would expect to extract more information from the full paper, than just the abstract. However, obtaining full-length papers requires extra time and resources, in terms of locating and downloading them, processing the extra text, storing extra files, and so on. If the gain in recall is small, this may not be worth the extra effort. Also, we need to discover whether the conversion of PDF papers to text loses too much information, such as Greek letters and super-/sub-script information, and to discover the effect on precision of having a lot of extra text.

We took a random sample of 211 DIP records, based on 130 different documents, where full text and abstract are both available. We used BioRAT to extract protein-protein interactions from both, and then compared the results. We were, of course, limited to articles that are available electronically. For example, this excluded most papers that were published before the mid-1990s, when most journals were paper-only. Also, the experiments described here were carried out using computers at UCL, and so we could only access full-length papers from journals to which UCL subscribes, or which are freely available.

Table 2 shows the results. The information extraction rate obtained from full-length papers was 43.6%, with more than half of the information coming from the body of the paper, and the rest from the abstract. This clearly shows the benefit of locating and analysing the full text of a paper, rather than restricting information extraction to just the abstract.
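The binomial reasoning used to compare recall figures in these experiments can be checked numerically. The sketch below is a plain Python calculation, not part of BioRAT; the function names are illustrative, and the counts are those reported in this section (79 of 389 records recalled from abstracts by BioRAT, 190 of 851 by SUISEKI, and 38 of 211 from abstracts against 92 of 211 from full-length papers).

```python
import math


def binomial_sd(trials, rate):
    """Standard deviation of the number of successes in a binomial distribution."""
    return math.sqrt(trials * rate * (1.0 - rate))


def deviations_apart(trials, observed, other):
    """Distance from `observed` successes to `other`, in units of the binomial
    standard deviation implied by the observed success rate."""
    sd = binomial_sd(trials, observed / trials)
    return (other - observed) / sd


# Abstracts only: BioRAT recalled 79 of 389 records. SUISEKI's rate (190/851)
# applied to the same 389 records gives its expected number of successes.
suiseki_expected = 389 * 190 / 851
z_suiseki = deviations_apart(389, 79, suiseki_expected)

# Abstracts versus full text: 38 of 211 records were recalled from abstracts,
# and 38 + 54 = 92 of 211 from full-length papers.
z_full_text = deviations_apart(211, 38, 38 + 54)

print(f"sd (389 abstracts) = {binomial_sd(389, 79 / 389):.3f}")
print(f"SUISEKI vs BioRAT on abstracts: {z_suiseki:.2f} standard deviations")
print(f"full text vs abstracts: {z_full_text:.2f} standard deviations")
```

A difference of less than one standard deviation supports treating the two systems as equivalent on abstracts, whereas the gap between abstract and full-text recall is many standard deviations wide.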
Using a similar binomial analysis to that described earlier, we can also test whether this improvement is significantly better than the information extracted from just the abstracts. The standard p deviation of the recall from the abstracts is σ = 211 × 0.1800 × (1 − 0.1800) = 5.582. Thus the recall score using full-length papers is more than seven standard deviations below the recall score using full-length papers, clearly a significant result. Result Match in abstract Match in full text (but not in abstract ) No match Totals Cases 38 54 Percent 18.00 25.60 119 211 56.40 100.00 Result Correct Protein id Template Totals Full-length Cases Percent 205 51.25 119 29.75 76 19.00 400 100.00 Table 3: Precision analysis. Here, “correct” refers to records where the interaction information was extracted correctly from the text, regardless of whether that interaction is in DIP. “Template” refers to failures caused by imperfect templates, and “protein id” refers to failures to recognise proteins. the papers was correctly extracted, whether or not the information is in DIP. In order to understand where BioRAT fails, we analysed the output when BioRAT failed to extract the correct information from the documents, also shown in Table 3. Around two-thirds of the mistakes are caused by failure to identify the correct proteins. Each protein is typically known by several different names, and may also be referred to by its associated gene, which itself may have several distinct names. Furthermore, long names may be abbreviated by the authors, producing further non-standard ways of referring to the protein. The gazetteer used in these experiments included more than 230,000 gene names and more than 99,000 protein names, but still failed to recognise a large number of proteins. One example of this protein identification failure comes from DIP “edge” record DIP:43E. 
The corresponding Swiss-Prot entry (P15172) refers to the protein “Myoblast determination protein 1”, and lists synonyms “Myogenic factor 3” and “Myf-3”, with gene names “MYOD1” and “MYF3”. However, the paper in question9 refers repeatedly to “MYOD”. While this is clearly the same protein, a slightly different abbreviation has been used by the author compared to those included in Swiss-Prot. The gazetteer used by BioRAT is derived principally from Swiss-Prot, and so BioRAT failed to recognise this protein, and hence failed to extract this interaction. Most of the remaining failures are due to imperfections in the set of templates used by BioRAT. Although these errors could no doubt be reduced by improving the templates, there is no clear way to achieve this without a significant manual effort, even with BioRAT’s template design tool. Thus template design remains a major issue in information extraction research (Cowie & Wilks 2000). Even when BioRAT fails to extract the relevant information, it may still highlight the correct piece of text. For example, DIP record DIP:800E defines an interac- Table 2: Recall results from 211 DIP records, derived from 130 full-length papers. The total recall from fulllength papers is 18.0 + 25.6 = 43.6%. 4.3 Abstracts Cases Percent 239 55.07 125 28.80 70 16.13 434 100.00 Precision Having analysed the recall of BioRAT in the previous sections, we now turn to precision. In our experiments, precision is somewhat harder to measure than recall, because we need an estimate of the number of false positives. If a record produced by BioRAT is not found in DIP, it could be that a) it is a false-positive example, reducing the precision of BioRAT; or b) the record is missing from DIP. The latter case consists of interactions that are mentioned in papers, but have not (yet) been added to DIP. For the first experiment described earlier, BioRAT produced 434 interaction records, derived from 229 abstracts. 
We manually re-analysed these records with no reference to DIP; instead, we counted how many of BioRAT’s predictions were correctly extracted from the text, and what sort of mistakes it made. We repeated this for a sample of 400 from the total of over 10,000 records produced in the analysis of the corresponding full-length papers. Table 3 shows the results. Around half of all records produced by BioRAT are correct, in the sense that the information contained in [...]

(Footnote 9: PMID 9184158.)

[...] interaction between proteins p53 and UBE2I. BioRAT failed to identify this interaction, but did extract this sentence (PMID 8921390):

    Since the tumor suppressor protein p53 and a newly identified ubiquitin-like protein (UBL1) are implicated in the RAD51/RAD52 complex..., we further tested their associations with UBE2I.

Note that BioRAT correctly identified the above sentence as defining the interaction between RAD51 and RAD52, even though it missed the target interaction.

4.4 Example output

From PubMed ID 9012827, BioRAT found the interaction (Swi6 ↔ Hrr25), which corresponds to the DIP “edge” record DIP:250E. BioRAT quoted the following sentence:

    Swi6 was also phosphorylated by Hrr25 kinase immunoprecipitated from yeast extracts with a HA tag (Fig. 2B).

A similar, but slightly more complex, template can recognise two interactions at once. The following sentence (PMID 11689698) correctly led BioRAT to produce two records, for the interactions (Pcf11 ↔ Rna14) and (Pcf11 ↔ Rna15):

    Since Pcf11 interacts simultaneously with Rna14 and Rna15, its role in vivo may also be to stabilize their interaction.

A less successful example comes from this sentence:

    Many interactions between nucleoporins and nuclear transport receptors have already been identified; however, we were unable to detect a biochemical interaction between Cse1p and Nup2p.

BioRAT incorrectly predicted that Cse1p interacted with Nup2p, whereas the text is less conclusive.

4.5 Speed and memory

The time it takes BioRAT to analyse a piece of text depends on the size of the text, the size of the gazetteers, and the complexity of the templates. In the work described here, BioRAT typically took 3-5 seconds to analyse each abstract, and 6-10 minutes to analyse each full-length paper, running on a standard desktop PC (a single 1.7 GHz CPU), and used around 500 MB of RAM. Given that each paper can be analysed independently, large-scale applications of text mining lend themselves well to distributed processing, although we have yet to use BioRAT in that way. BioRAT can also be used from the command line, allowing non-interactive batch processing, and potentially reducing the impact of a slow execution time for full-length papers. Since it is written in Java, BioRAT can be run on almost any platform, and has been tested successfully under Linux, Solaris, MacOS and MS Windows.

5 Discussion

As expected, the density of “interesting” facts found in the abstract is much higher than the corresponding density in the full text. This is at least in part because full-length papers include background discussion, a description of the method, references and so on. While these are necessary to set the work in context, and to provide supporting evidence, they may not contain the kind of information that BioRAT is attempting to extract.

Figure 2: Location of information extracted from full-length papers (x-axis: location in document, %; y-axis: number of records found). Location 0% is the start of the paper; location 100% is the end. The peak on the left corresponds to the abstract; the larger peak in the middle corresponds to the results and discussion sections of the source papers.

Figure 2 is one view of information density. It shows the location of each fact extracted from the set of full-length papers used earlier. Because different journals divide papers into sections in different ways, we only consider the location of the information relative to the entire paper. The peak on the left shows that a lot of information is found at the start of the paper, corresponding approximately to the title and abstract. The dip in the graph around 10-30% shows that relatively little information is extracted from the next part of the paper, typically the introduction and methods sections. There is another, larger peak around 40-80%, corresponding to the results and discussion sections, which contain a large amount of relevant information, before the graph tails off towards the end of the paper, which is typically a citation list. Note that many interactions were found more than once, through repetition within or between papers, and the graph shows the location of all the extracted information, including duplicates.

These peaks show where most of the information has been extracted from, but the troughs are also of interest. Even the least informative parts of papers still contain considerable amounts of information. This strongly suggests that the entire paper should be analysed, wherever possible, and not just a few selected sections. Although the task is different, this contrasts with the behaviour of some of the teams described by Yeh et al. (2003), who restricted analysis to certain sections of the papers.

Even when BioRAT (or any other IE system) fails to find a particular relationship, or incorrectly predicts a relationship not mentioned in the text, it is quite possible that it has found an interesting part of an interesting document. In this way, using IE to guide a literature search is perfectly feasible, even if the recall and precision are a long way from the ideal 100%.

The template design tool allows biological researchers, with no text mining experience, to design, test and use a sophisticated template-based information extraction system.
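
To make the notion of a template concrete, the following is a minimal sketch in Python. It is not BioRAT's actual template format, which is not specified in this section. It assumes a tiny hand-made gazetteer (`PROTEINS`) and a single hypothetical pattern (`TEMPLATE`), written as a regular expression of the form "<protein> interacts [adverb] with <protein> [and <protein>]", and applies it to one of the example sentences quoted in Section 4.4.

```python
import re

# Hypothetical gazetteer of protein names (an assumption for illustration;
# a real gazetteer would be far larger).
PROTEINS = {"Pcf11", "Rna14", "Rna15", "Swi6", "Hrr25"}

# Hypothetical template: "<A> interacts [adverb] with <B> [and <C>]".
TEMPLATE = re.compile(
    r"(?P<a>\w+) interacts(?: \w+)? with (?P<b>\w+)(?: and (?P<c>\w+))?"
)


def extract_interactions(sentence):
    """Return (protein, protein) pairs matched by the template."""
    records = []
    for match in TEMPLATE.finditer(sentence):
        first = match.group("a")
        if first not in PROTEINS:
            continue  # subject is not a known protein name
        # One template match can yield up to two interaction records.
        for partner in (match.group("b"), match.group("c")):
            if partner in PROTEINS:
                records.append((first, partner))
    return records


sentence = (
    "Since Pcf11 interacts simultaneously with Rna14 and Rna15, "
    "its role in vivo may also be to stabilize their interaction."
)
print(extract_interactions(sentence))
# prints [('Pcf11', 'Rna14'), ('Pcf11', 'Rna15')]
```

A surface pattern of this kind has no notion of negation, so a comparably naive template for "interaction between X and Y" would also fire on a sentence such as "we were unable to detect a biochemical interaction between Cse1p and Nup2p", which is the kind of false positive reported in Section 4.4.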
The flexibility of template design allows BioRAT to be applied to a wide range of problems without a large overhead, in contrast to many comparable systems, which require both biological and text mining expertise to be used fully. The results that BioRAT produces can be stored and retrieved using a variety of interfaces, easing the user’s access to the information. Furthermore, BioRAT provides quotes from the source texts, and links directly to the source papers and related databases. In this way, BioRAT behaves like a virtual research assistant, guiding the user towards interesting papers.

6 Conclusions

In this paper, we have presented BioRAT, an information extraction system specially designed to process biological research papers. A distinguishing feature of BioRAT is that it uses full-length papers, rather than being limited to abstracts as previous studies have been. The recall and precision performance of BioRAT was assessed by use of the DIP database of protein-protein interactions, and the recall was compared with that of a previous system, SUISEKI, which processed only the abstracts. The recall performance of BioRAT on the abstracts alone (20%) was similar to that of SUISEKI. Overall, BioRAT achieved 43% recall and over 50% precision on full-length papers. Extra time is required to obtain the full-length papers, and there are difficulties in converting them into a usable plain text format. However, these costs are outweighed by the fact that more than twice as much relevant information can then be extracted automatically.

Acknowledgments

This work was sponsored by GlaxoSmithKline.

References

Blaschke, C. & Valencia, A. (2001), ‘Can bibliographic pointers for known biological data be found automatically? Protein interactions as a case study’, Comparative and Functional Genomics 2, 196–206.

Blaschke, C. & Valencia, A. (2002), ‘The frame-based module of the SUISEKI information extraction system’, IEEE Intelligent Systems 17(2), 14–20.

Collier, R. (1998), Automatic template creation for information extraction, PhD thesis, Department of Computer Science, University of Sheffield.

Cowie, J. & Wilks, Y. (2000), Information extraction, in R. Dale, H. Moisl & H. Somers, eds, ‘Handbook of Natural Language Processing’, Marcel Dekker, New York.

Craven, M. & Kumlien, J. (1999), Constructing biological knowledge-bases by extracting information from text sources, in ‘Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology’, Germany, pp. 77–86.

Cunningham, H., Maynard, D., Bontcheva, K. & Tablan, V. (2002), GATE: A framework and graphical development environment for robust NLP tools and applications, in ‘Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL’02)’, Philadelphia, USA.

Gaizauskas, R., Demetriou, G., Artymiuk, P. & Willett, P. (2003), ‘Protein structures and information extraction from biological texts: The PASTA system’, Journal of Bioinformatics 19(1), 135–143.

Murray-Rust, P. & Rzepa, H. S. (2002), ‘STMML. A markup language for scientific, technical and medical publishing’, Data Science 1(2), 1–65.

Thomas, J., Milward, D., Ouzounis, C., Pulman, S. & Carroll, M. (2000), Automatic extraction of protein interactions from scientific abstracts, in ‘Pacific Symposium on Biocomputing 5’, pp. 538–549.

Xenarios, I., Salwinski, L., Duan, X., Higney, P., Kim, S. & Eisenberg, D. (2002), ‘DIP: The Database of Interacting Proteins. A research tool for studying cellular networks of protein interactions’, Nucleic Acids Research 30(1), 303–305.

Yeh, A., Hirschman, L. & Morgan, A. (2003), ‘Evaluation of text data mining for database curation: lessons learned from the KDD Challenge Cup’, Bioinformatics 19(Suppl 1), i331–i339.