BioRAT: Extracting Biological Information from Full-length Papers
David P.A. Corney, Bernard F. Buxton, William B. Langdon and David T. Jones*
Bioinformatics Unit, Department of Computer Science, University College London, Gower Street, London, UK, WC1E 6BT
Abstract
Motivation: Converting the vast quantity of free-format text found in journals into a concise, structured format
makes the researcher’s quest for information easier. Recently, several information extraction systems have been
developed that attempt to simplify the retrieval and analysis of biological and medical data. Most of this work
has used the abstract alone, owing to the convenience of access and the quality of data. Abstracts are generally
available through central collections with easy direct access (e.g. PubMed). The full-text papers contain more
information, but are distributed across many locations (e.g. publishers’ web sites, journal web sites, and local
repositories), making access more difficult.
In this paper, we present BioRAT, a new information extraction (IE) tool, specifically designed to perform
biomedical IE, and which is able to locate and analyse both abstracts and full-length papers. BioRAT is a
Biological Research Assistant for Text mining, and incorporates a document search ability with domain specific
IE.
Results: We show first, that BioRAT performs as well as existing systems, when applied to abstracts; and second,
that significantly more information is available to BioRAT through the full-length papers than via the abstracts
alone. Typically, less than half of the available information is extracted from the abstract, with the majority
coming from the body of each paper. Overall, BioRAT recalled 20.31% of the target facts from the abstracts with
55.07% precision, and achieved 43.6% recall with 51.25% precision on full-length papers.
Availability: The software and documentation can be found at http://bioinf.cs.ucl.ac.uk/biorat.
Contact:
[email protected];
[email protected]
* To whom correspondence should be addressed.
1 Introduction
The rapid and ongoing growth in the number of biological and medical publications means that researchers
can no longer read more than a small proportion of the literature in their field. Yet interesting and useful
information, relevant to the researcher, could appear in papers they have not read and therefore be missed
entirely. Accompanying this growth in literature is the increasing proportion of electronically available papers,
as most publishers now produce on-line versions of their journals. But while this may ease access, there is still
a vast quantity that a researcher may feel they should read, with no concomitant increase in their ability to do
so.
Information retrieval helps researchers to find papers, but it still leaves a large amount of reading to be done.
Information extraction (IE) goes one stage further, and analyses the papers on behalf of the researcher. IE
systems achieve this by identifying semantic structures in the text, and in so doing, distill an entire document
down to the key facts.
BioRAT can be regarded as a research assistant that is given a query and, autonomously, finds a set of papers,
reads them, and highlights the most relevant facts in each. BioRAT uses natural language processing techniques
and domain-specific knowledge to search for patterns in documents, with the aim of identifying interesting
facts. These facts can then be extracted to produce a database of information, which has a higher “information
density” than a pile of papers. This is similar to an information extraction system that has recently been
developed by Blaschke and Valencia (2001, 2002), and which will be discussed in more detail below.
There have been several attempts to apply IE techniques to scientific papers, but these have used only the
abstract of each paper. Example applications include protein-protein interactions (Thomas, Milward, Ouzounis,
Pulman & Carroll 2000); using machine learning to classify biological relationships (Craven & Kumlien 1999);
and protein structure and residues (Gaizauskas, Demetriou, Artymiuk & Willett 2003).
Abstracts are readily available in large numbers (e.g. through PubMed, http://www.ncbi.nlm.nih.gov/entrez/),
are available in plain text, and typically have no superscript or subscript characters, no footnotes, and so
forth. This avoids potential difficulties in interpreting unusual symbols, Greek letters etc. However, the
abstract is only a summary of the paper in question; the full text will typically include more detail that may
be of direct interest to the reader. BioRAT is designed to extract information both from abstracts and from
full text, and in this work, we use BioRAT to compare information extraction from abstracts and from
full-length papers.
A “challenge evaluation” has recently been proposed, to encourage researchers to focus on a particular task,
allowing a direct comparison of their systems. As described by Yeh, Hirschman & Morgan (2003), full-length
articles were used in the challenge, after they had been manually “cleaned” to convert Greek letters, super- and
sub-scripts, and italics, into marked-up plain text. Furthermore, a list of genes mentioned in each paper was
also available to entrants. While necessary for a formal evaluation, such resources are not generally available to
text mining systems, so we have not used them here.
Also, Yeh et al. (2003) state that PDF papers “were not
suitable for processing by most text mining systems”,
and so the contest was limited to those papers that were
available in HTML format. BioRAT uses the full-length
paper whenever it is available, instead of just the abstract, and uses PDF files directly from the Internet.
PDF is one of the most widely used formats for research
papers on the Internet.
The rest of this paper is organised as follows. In
the next section, we describe the BioRAT system, and
its key components. We then discuss two experiments
which evaluate BioRAT using the DIP database, and
discuss the results. We follow the advice of Blaschke
& Valencia (2001), who specifically recommend the use
of DIP as a “realistic scenario for the comparison of IE
systems”.
2 System Outline
We designed BioRAT to give people with no IE experience a powerful tool to help them locate and analyse
research papers. The system therefore combines tools
to locate papers, to download full-length papers, to extract information from papers, and to design templates
to allow this extraction.
Typically, the user enters a query into BioRAT, which
is then passed on to PubMed. The user is then presented with a list of papers, from which they can choose
to download abstracts or, where available, full-length
papers. Having obtained some text, the user can then
apply some pre-existing templates, or create their own.
In either case, the templates match patterns in the text
that contain “useful” information, which is extracted
for display to the user, and for possible incorporation
into a database. Figure 1 shows screenshots of this process.
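As a rough illustration of what applying a template involves, the sketch below approximates the simplest template given later in Section 2.2.2, “interaction of” (PROTEIN 1) “and” (PROTEIN 2), with a regular expression. This is not BioRAT's implementation, which is built on GATE; the tiny protein list and the names PROTEINS, build_template and extract are hypothetical and chosen only for this example.

```python
import re

# Hypothetical miniature gazetteer; BioRAT's real gazetteers are far larger.
PROTEINS = ["Pex7p", "Pex13p", "Rna14", "Rna15", "Hrp1"]


def build_template(proteins):
    """Compile a regex for: "interaction of" (PROTEIN 1) "and" (PROTEIN 2)."""
    # Longest names first, so a long name is preferred over a shorter prefix.
    names = sorted(proteins, key=len, reverse=True)
    alternatives = "|".join(re.escape(name) for name in names)
    pattern = (
        rf"\binteraction of\s+(?P<p1>{alternatives})"
        rf"\s+and\s+(?P<p2>{alternatives})\b"
    )
    return re.compile(pattern)


def extract(text, template):
    """Return one (protein 1, protein 2) record per template match."""
    return [(m.group("p1"), m.group("p2")) for m in template.finditer(text)]


sentence = "Genetic evidence for the interaction of Pex7p and Pex13p is provided."
records = extract(sentence, build_template(PROTEINS))
print(records)
```

Each tuple plays the role of one completed template, i.e. one candidate database record. A regular expression over raw strings is only an approximation here, because it ignores the part-of-speech and word-stem information that the real templates can use.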
Figure 1: Screen shots showing BioRAT in use. (a) Finding papers on the web: the document search interface.
The user enters a query at the top, and BioRAT accesses PubMed via the Internet. A list of matching titles
(with date of publication, author etc.) is shown on the left, and the user can select any item to view the
abstract, on the right, or to download the full-length paper. (b) Designing a template: the BioRAT template
design component. The user can view a document, select target words (or a phrase) from it, and then define
a template in terms of parts of speech, gazetteer headings, word stems, or the words themselves. Gazetteers
can also be viewed and edited through the same interface. (c) Extracting information: the results from
templates designed to recognise protein-protein interactions. Four interactions are shown, with the context
quoted from the source text. A command line interface is also available.
2.1 Web Spidering
One distinctive feature of BioRAT is that it automatically locates and acquires full length papers wherever
possible, instead of just using abstracts. It does this
via the Internet, by following a series of hyperlinks to
find each target paper. To find a particular paper,
BioRAT starts with a URL (web address) provided by
the PubMed database. It then goes to that web page
and identifies the hyperlinks there, and recursively follows links until it finds the target paper, in PDF format.
This is downloaded and converted to a text-only version,
ready for the IE engine.
Finding the target paper is non-trivial for such a tool.
The URL provided by PubMed (and ultimately, by the
journal publishers) does not point to the paper itself,
but rather to a web page from which the paper can be
accessed. The spider’s task is to find the target paper
by following a series of hyperlinks.
The system works by downloading the web page, identifying and evaluating all the links in it, and iteratively
following the highest-scoring link, with scores based on
simple keyword matching. Having located and downloaded a PDF file, it is converted into plain text for
later analysis. To ensure that the correct paper has been
identified, and that the text conversion process has succeeded, the first part of the plain text file is compared
with the corresponding abstract obtained directly from
PubMed, using a fuzzy string matching routine.
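The spidering strategy described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not BioRAT's actual code: the keyword weights, the hop limit, the 0.8 similarity threshold, the toy in-memory "web" and all function names are hypothetical, and Python's difflib stands in for whatever fuzzy string matching routine BioRAT uses.

```python
from difflib import SequenceMatcher

# Hypothetical keyword weights; the paper only states that link scores
# are based on simple keyword matching.
KEYWORD_SCORES = {"pdf": 3, "full text": 2, "reprint": 1}


def score_link(anchor_text, href):
    """Score a hyperlink by weighted keyword hits in its text and URL."""
    haystack = f"{anchor_text} {href}".lower()
    return sum(w for kw, w in KEYWORD_SCORES.items() if kw in haystack)


def find_pdf(start_url, fetch_links, max_hops=5):
    """Iteratively follow the highest-scoring link until a PDF URL is found."""
    url, visited = start_url, set()
    for _ in range(max_hops):
        if url.lower().endswith(".pdf"):
            return url
        visited.add(url)
        candidates = [
            (score_link(text, href), href)
            for text, href in fetch_links(url)
            if href not in visited
        ]
        if not candidates:
            return None
        best_score, best_href = max(candidates)
        if best_score == 0:
            return None  # no link looks promising, so give up
        url = best_href
    return url if url.lower().endswith(".pdf") else None


def looks_like_same_paper(pdf_text, abstract, threshold=0.8):
    """Fuzzy-compare the start of the converted text with the abstract."""
    head = pdf_text[: len(abstract)].lower()
    return SequenceMatcher(None, head, abstract.lower()).ratio() >= threshold


# Toy stand-in for the web: page URL -> list of (anchor text, link target).
TOY_WEB = {
    "http://example.org/article/1": [
        ("Abstract", "http://example.org/article/1/abstract"),
        ("Full text options", "http://example.org/article/1/full"),
    ],
    "http://example.org/article/1/full": [
        ("Download PDF", "http://example.org/article/1/paper.pdf"),
        ("HTML version", "http://example.org/article/1/html"),
    ],
}

found = find_pdf("http://example.org/article/1", lambda u: TOY_WEB.get(u, []))
print(found)
```

In a real spider, fetch_links would download and parse each page. The check in looks_like_same_paper is deliberately simplified: it assumes the converted text begins with the abstract, whereas a real paper usually has a title and author list first.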
BioRAT only attempts to locate and download PDF
papers, as this is by far the most widely used format.
While some have suggested a move towards using XML
for distributing research papers (Murray-Rust & Rzepa
2002), papers in this format are not generally available
to biological researchers. It is also unlikely that existing
archives would be marked-up manually.
Having obtained some relevant documents, the system
then attempts to extract interesting facts from them.
2.2 Information extraction engine
Information extraction (IE) is a key part of BioRAT’s functionality. The aim of IE is to extract from a set
of documents the key facts about prespecified types of events, objects and relationships. These facts are then
used automatically to populate a database. This can then be used to ease on-line access.
The heart of BioRAT is an IE engine, based on the GATE toolbox [2], produced at Sheffield University
(Cunningham, Maynard, Bontcheva & Tablan 2002). GATE is a general purpose text engineering system,
whose modular and flexible design allows us to use it to create a more specialised biological IE system. One
issue common to any biological information extraction system is that many protein and gene names are easily
mistaken for common words. For example, the Swiss-Prot database includes entries with names “mice”, “was”
and “alpha”, as well as 26 single-letter gene names. The problem is to distinguish whether the word “was”
refers to a gene or is simply the past tense of the verb “to be”, for example. Sometimes, this can be resolved by
considering the case of the letters, but this is not reliable. Instead, BioRAT uses GATE to label words
according to their parts of speech, and then applies a filter that rejects determiners, verbs etc. as not being
proteins. This provides one possible advantage of BioRAT.
Two components of GATE that must be modified for our domain-specific application are gazetteers and
templates, which we shall now discuss in turn.
2.2.1 Gazetteers
One task in IE is “named entity recognition”, which aims to identify key items within text. For example, we
may want to identify words that are people’s names, company names, proteins, genes and so on. Once
identified, these words or phrases can then be matched by the templates. One simple approach that we adopt
is to use a gazetteer. A gazetteer is a list of words identifying members of a particular category. For example,
one gazetteer may list names of proteins, while another lists names of people. BioRAT incorporates gazetteers
from three sources, namely MeSH [3], Swiss-Prot [4], and hand-made lists.
The top two levels of the MeSH hierarchy contain a total of approximately 120 entries, each of which was used
to define a separate gazetteer. Each of the almost 22,000 entries in MeSH was extracted and added to the
appropriate gazetteer(s). Further gazetteers were derived from Swiss-Prot. Each entry from Swiss-Prot
describes a single protein, but proteins often have many synonyms, all of which are included in the relevant
gazetteer. Also, some authors refer to proteins in terms of the genes that encode them, so the gene names were
also extracted, and used to create another gazetteer.
To supplement these two sources, two further gazetteers were created by hand. These consisted of words that
covered concepts of interest that were not already in other gazetteers. One consisted of 30 words describing
the interaction of proteins (e.g. “bind”, “downregulate”, “interact” and so on). The other consisted of a few
further synonyms of proteins not already covered by the other gazetteers. These hand-made gazetteers were
initially created following domain expert advice, and subsequently modified as required.
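The combination of gazetteer lookup and part-of-speech filtering described in this section can be sketched as below. This is an illustrative sketch only: in BioRAT the tags come from GATE, whereas here the tokens are assumed to arrive already tagged with Penn Treebank-style labels, and the toy gazetteer, the rejected tag prefixes and the function name tag_proteins are hypothetical.

```python
# Tag prefixes treated as "cannot be a protein": verbs (VB*) and
# determiners (DT). Assumed Penn Treebank-style tags, for illustration only.
REJECTED_TAG_PREFIXES = ("VB", "DT")

# Toy protein gazetteer, stored in lower case. It deliberately contains
# "was" and "mice", which the paper cites as real Swiss-Prot entry names.
PROTEIN_GAZETTEER = {"was", "mice", "alpha", "swi6", "hrr25"}


def tag_proteins(tagged_tokens, gazetteer=PROTEIN_GAZETTEER):
    """Return gazetteer hits that survive the part-of-speech filter."""
    hits = []
    for word, pos in tagged_tokens:
        if word.lower() not in gazetteer:
            continue  # not a known protein name
        if pos.startswith(REJECTED_TAG_PREFIXES):
            continue  # e.g. "was" used as a verb, not as a gene name
        hits.append(word)
    return hits


# Pre-tagged tokens; the tags are supplied by hand for this example.
sentence = [
    ("Swi6", "NNP"), ("was", "VBD"), ("also", "RB"),
    ("phosphorylated", "VBN"), ("by", "IN"), ("Hrr25", "NNP"),
    ("kinase", "NN"),
]
proteins = tag_proteins(sentence)
print(proteins)
```

Here “was” is in the gazetteer but is tagged as a verb, so it is rejected, while “Swi6” and “Hrr25” are kept.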
[2] General Architecture for Text Engineering: http://gate.ac.uk/
[3] Medical Subject Headings: http://www.nlm.nih.gov/mesh/
[4] http://www.expasy.org/
2.2.2 Templates
A template is a representation of a text pattern that allows us to extract information automatically. It consists
of a number of predefined slots to be filled by the system from information contained in the text. One of the
simplest templates from BioRAT is:
“interaction of” (PROTEIN 1)
“and” (PROTEIN 2)
Here, “PROTEIN 1” and “PROTEIN 2” are slots to be filled with names of proteins, as defined by a
gazetteer. The contextual phrase (“interaction of”) is a fixed string: only phrases containing those exact words
will be matched by this particular template. For example, the template shown would identify the sentence
‘Genetic evidence for the interaction of Pex7p and Pex13p is provided...’ and extract from it the interaction
(Pex7p ↔ Pex13p). We use the format “X ↔ Y” to represent any form of interaction between two proteins,
X and Y.
A slightly more complicated template is:
(EXPRESSION) “of” (PROTEIN 1)
(WORD)? (WORD)? (WORD)?
(“by” | “to” | “with”)
(PROTEIN 2) “and” (PROTEIN 3)
Here, “EXPRESSION” refers to a gazetteer containing words relating to protein expression and interaction,
such as “bind” and “inhibit”. The slot (WORD)? is a wildcard that matches any word, but is optional, so
the sequence (WORD)? (WORD)? (WORD)? matches between zero and three consecutive words of any type.
As before, the three (PROTEIN x) slots match protein names, and the quoted strings must be matched exactly.
The | character is a logical “OR”. For example, this template matches part of the sentence “Specific binding of
Rna15 in complex with Hrp1 and Rna14 creates a polymerase pause site...”, and identifies two interactions:
(Rna15 ↔ Hrp1) and (Rna15 ↔ Rna14), with the expression type “binding”.
As with comparable IE systems, such as those mentioned in the introduction, the templates in BioRAT are
written by hand. There have been attempts at automatic template creation (Collier 1998), but these have
not been broadly applicable. Although template design takes time and requires some practice, it does allow the
user to maintain full control over what information is extracted, and allows experts to incorporate their
knowledge within the system. Because of this, BioRAT incorporates a template design tool, designed to allow
ordinary users to create their own templates with little effort, as discussed in the next section.
BioRAT produces data in XML format, which can be readily imported into existing database query systems.
The same data is produced simultaneously as HTML and as a comma-separated list, for viewing in
applications such as a browser or a spreadsheet, if that is more convenient for the user. Each record in the
resulting database represents a single completed template.
2.3 Template design tool
One feature that BioRAT shares with several other text mining systems is the need for a set of templates to be
developed for each task. This is often a time consuming process that requires expertise in both text mining
and the problem domain. BioRAT includes a template design tool with a graphical user interface, which allows
non-expert users to develop templates without having to learn a complex new language. To use it, the user first
selects a document that is then displayed. The user can then click on individual words in that document, whose
properties are then shown on the screen. The properties used are: part-of-speech tag; gazetteer headings; the
word stem; and the word itself. The user can click on these properties to append them to the current template
pattern, along with various wildcard and boolean options, and build up a sequence of terms. This can then
be applied as a template to the current document, and the results displayed. The user can then cycle between
editing the template and viewing the results, until satisfied. Once saved, the template can then be applied to
a large set of papers using the main BioRAT template matching interface. Alternatively, the user can select
an entire phrase, and the system will create a default template based on that phrase, which the user can
subsequently edit as required. The tool can also be used to view and edit gazetteers.
3 Using DIP to evaluate BioRAT
Having described the BioRAT system, and considered the documents that it can be used to analyse, we now
turn to a particular study to test the usefulness of the system. For this, we used the Database of Interacting
Proteins (DIP, http://dip.doe-mbi.ucla.edu; Xenarios, Salwinski, Duan, Higney, Kim & Eisenberg 2002).
Blaschke & Valencia (2001) recommend using DIP as a way of evaluating biological IE systems, because it
represents a realistic problem of practical interest to biological researchers. IE researchers can use their
systems to extract protein-protein interactions, and then compare these with the records in DIP. This does not
rely on the interpretation of the authors, and so gives greater confidence in the results. By re-creating (a
manageable subset of) DIP, we can calculate the recall and precision of different systems, and compare the
results. The recall (or “sensitivity”) is the fraction of target records that the IE system correctly re-creates:
$\text{Recall} = \frac{\#\text{true positives}}{\#\text{true positives} + \#\text{false negatives}}$
Precision is a measure of how much of the output of an IE system is correct, and is defined as the ratio of the
number of correct positive predictions to the total number of positive predictions made:
$\text{Precision} = \frac{\#\text{true positives}}{\#\text{true positives} + \#\text{false positives}}$
Each record in DIP defines a pair of proteins that interact with each other, and provides citations of papers
that describe the interaction. Proteins are defined by entry keys to Swiss-Prot, GenBank or PIR. For simplicity,
we only consider DIP records containing two Swiss-Prot identifiers.
For each experiment, we started by selecting a subset of DIP. BioRAT can analyse papers rapidly, typically
taking just a few seconds to complete its analysis of each abstract. However, for our experiments, the results
need to be manually checked in order to calculate the recall and precision rates, and this time-consuming task
forces us to limit the targets to a manageable subset of DIP.
Having selected some DIP records, as detailed below,
we then used BioRAT to process the corresponding papers, using both the abstract and full-text versions. We
manually compared the predictions made by BioRAT
to the source DIP records to measure the recall. For
each record in DIP, we searched through the output of
BioRAT corresponding to the same paper, and checked
to see if the interaction mentioned in DIP had been identified. Similarly, we measured the precision by manually
counting how many of the records produced by BioRAT
were correctly extracted from the text. Throughout this
work, we used the January 2003 version (“dip20030105”)
of DIP, the March 2003 version of Swiss-Prot, and the
2003 edition of MeSH.
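The recall and precision measures used throughout the evaluation can be computed as in the short sketch below. The counts are the ones reported in Tables 1 to 3 of this paper; the function names are hypothetical, and in the actual study the counting was done by hand rather than by a script.

```python
def recall_percent(true_positives, false_negatives):
    """Recall: percentage of target records that were correctly re-created."""
    return 100 * true_positives / (true_positives + false_negatives)


def precision_percent(true_positives, false_positives):
    """Precision: percentage of produced records that are correct."""
    return 100 * true_positives / (true_positives + false_positives)


# Recall on abstracts (Table 1): 79 matches, 310 DIP records not matched.
abstract_recall = recall_percent(79, 310)

# Recall on full-length papers (Table 2): 38 + 54 matches, 119 not matched.
full_text_recall = recall_percent(38 + 54, 119)

# Precision (Table 3): correct records versus all other records produced.
abstract_precision = precision_percent(239, 434 - 239)
full_text_precision = precision_percent(205, 400 - 205)

print(f"abstracts:   recall {abstract_recall:.2f}%, precision {abstract_precision:.2f}%")
print(f"full-length: recall {full_text_recall:.2f}%, precision {full_text_precision:.2f}%")
```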
4 Experiments
4.1 Comparison with SUISEKI
In this section, we compare BioRAT with the existing SUISEKI information extraction engine described
by Blaschke & Valencia (2001, 2002). We compare the performance of BioRAT to that of their system by
measuring the recall of BioRAT on a sample of papers from DIP that were also used by Blaschke & Valencia
(2001). This provides a suitable benchmark for BioRAT.
The SUISEKI system, like BioRAT, uses gazetteers derived from Swiss-Prot and DIP to identify protein
names. To extract information, it uses “frames”, which are similar to BioRAT’s templates in that they define
patterns of language that form the basis for IE. However, the frames in SUISEKI make less use of linguistic
knowledge, and more use of statistics. For example, the frames in SUISEKI distinguish between nouns and
verbs, but do not recognise conjunctions, adjectives or any other parts of speech. Also, they count the number
of words occurring in a phrase, and favour short phrases over long ones.
There were 389 DIP records that were used by Blaschke & Valencia (2001) and that refer to two Swiss-Prot
records. These 389 DIP records relate to 229 PubMed citations. We applied BioRAT to all 229 abstracts, and
then analysed the results by hand.
We used a total of 19 templates, initially derived from the SUISEKI frames and subsequently modified
by hand; and 127 gazetteers, derived from MeSH and other sources, as described earlier. The templates and
gazetteers used here can be accessed from the same website as the BioRAT software,
http://bioinf.cs.ucl.ac.uk/biorat. Initial trials revealed weaknesses in both the templates and the gazetteers,
which were subsequently improved.
Table 1 shows the recall from these abstracts by BioRAT, namely 20.31%. This is a similar recall to
that achieved by SUISEKI. The results can be compared with the larger study reported by Blaschke & Valencia
(2002), where 190 DIP interactions were correctly detected, from a possible set of 851 interactions, giving a
recall score of 22.33%.
We can compare the “abstract” results (Table 1) to the results in Blaschke & Valencia (2002), if we assume
the results follow a binomial distribution. Our recall rate of 20.31% from 389 trials gives a variance of
$\sigma^2 = 389 \times 0.2031 \times (1 - 0.2031) = 62.96$ and hence a standard deviation of $\sigma = 7.934$.
Blaschke & Valencia (2002) quote a recall of 190 cases from 851 trials, giving a recall rate of
190/851 = 0.2233. If they had achieved the same rate on our smaller sample, we would expect them to achieve
389 × 0.2233 = 86.86 successes. This is within one standard deviation of our success score of 79, so we can
say that both systems are performing with approximately the same recall.
Result      BioRAT               SUISEKI
            Cases    Percent     Cases    Percent
Match       79       20.31       190      22.33
No match    310      79.69       661      77.67
Totals      389      100.00      851      100.00
Table 1: Comparison of BioRAT and SUISEKI on recall from abstracts. BioRAT results from 389 DIP records,
derived from 229 abstracts. SUISEKI results from 851 DIP records, derived from 514 abstracts. The former set
of records is a subset of the latter.
4.2 Abstracts vs. Full-length papers
In the second experiment, we want to assess the benefits of using the full-length version of a paper, rather
than just the abstract. Clearly, one would expect to extract more information from the full paper, than just
the abstract. However, obtaining full-length papers requires extra time and resources, in terms of locating and
downloading them, processing the extra text, storing extra files, and so on. If the gain in recall is small, this
may not be worth the extra effort. Also, we need to
discover whether the conversion of PDF papers to text
loses too much information, such as Greek letters and
super-/sub-script information, and to discover the effect
on precision of having a lot of extra text.
We took a random sample of 211 DIP records, based on 130 different documents, where full text and abstract
are both available. We used BioRAT to extract protein-protein interactions from both, and then compared the
results. We were, of course, limited to articles that are available electronically. For example, this excluded most
papers that were published before the mid-1990s, when most journals were paper-only. Also, the experiments
described here were carried out using computers at UCL, and so we could only access full-length papers from
journals to which UCL subscribes, or which are freely available.
Table 2 shows the results. The information extraction rate obtained from full-length papers was 43.6%,
with more than half of the information coming from the
body of the paper, and the rest from the abstract. This
clearly shows the benefit of locating and analysing the
full text of a paper, rather than restricting information
extraction to just the abstract.
Using a similar binomial analysis to that described earlier, we can also test whether this improvement over the
information extracted from just the abstracts is significant. The standard deviation of the recall from the
abstracts is $\sigma = \sqrt{211 \times 0.1800 \times (1 - 0.1800)} = 5.582$. Thus the recall score using
abstracts alone (38 matches) is more than seven standard deviations below the recall score using full-length
papers (38 + 54 = 92 matches), clearly a significant result.
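Both binomial comparisons in this section can be reproduced with a few lines of code. The sketch below uses only counts stated in the text and tables, but it works with exact fractions (for example 79/389 rather than the rounded 0.2031), so the last digit of an intermediate value can differ slightly from the figures quoted above. The function names are hypothetical.

```python
import math


def binomial_sd(trials, rate):
    """Standard deviation of the number of successes in a binomial model."""
    return math.sqrt(trials * rate * (1 - rate))


def sd_distance(observed, reference, trials, rate):
    """Distance between two success counts, in standard deviations."""
    return abs(observed - reference) / binomial_sd(trials, rate)


# Experiment 1: BioRAT found 79 of 389 records; SUISEKI's rate of 190/851
# would correspond to about 86.9 successes on the same 389 records.
sd_abstracts = binomial_sd(389, 79 / 389)
suiseki_expected = 389 * 190 / 851
gap_vs_suiseki = sd_distance(79, suiseki_expected, 389, 79 / 389)

# Experiment 2: 38 of 211 records found in abstracts versus 38 + 54 = 92
# found when the full text is used.
sd_full_text_sample = binomial_sd(211, 38 / 211)
gap_abstract_vs_full = sd_distance(38 + 54, 38, 211, 38 / 211)

print(f"experiment 1: sd = {sd_abstracts:.3f}, gap = {gap_vs_suiseki:.2f} sd")
print(f"experiment 2: sd = {sd_full_text_sample:.3f}, gap = {gap_abstract_vs_full:.2f} sd")
```

A gap below one standard deviation in the first experiment is consistent with the two systems having similar recall, while the gap in the second experiment is far larger than seven standard deviations.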
Result
Match in abstract
Match in full text
(but not in abstract )
No match
Totals
Cases
38
54
Percent
18.00
25.60
119
211
56.40
100.00
Result
Correct
Protein id
Template
Totals
Full-length
Cases Percent
205
51.25
119
29.75
76
19.00
400
100.00
Table 3: Precision analysis. Here, “correct” refers to
records where the interaction information was extracted
correctly from the text, regardless of whether that interaction is in DIP. “Template” refers to failures caused by
imperfect templates, and “protein id” refers to failures
to recognise proteins.
the papers was correctly extracted, whether or not the
information is in DIP. In order to understand where
BioRAT fails, we analysed the output when BioRAT
failed to extract the correct information from the documents, also shown in Table 3. Around two-thirds of the
mistakes are caused by failure to identify the correct
proteins. Each protein is typically known by several different names, and may also be referred to by its associated gene, which itself may have several distinct names.
Furthermore, long names may be abbreviated by the authors, producing further non-standard ways of referring
to the protein. The gazetteer used in these experiments
included more than 230,000 gene names and more than
99,000 protein names, but still failed to recognise a large
number of proteins.
One example of this protein identification failure
comes from DIP “edge” record DIP:43E. The corresponding Swiss-Prot entry (P15172) refers to the protein
“Myoblast determination protein 1”, and lists synonyms
“Myogenic factor 3” and “Myf-3”, with gene names
“MYOD1” and “MYF3”. However, the paper in question9 refers repeatedly to “MYOD”. While this is clearly
the same protein, a slightly different abbreviation has
been used by the author compared to those included in
Swiss-Prot. The gazetteer used by BioRAT is derived
principally from Swiss-Prot, and so BioRAT failed to
recognise this protein, and hence failed to extract this
interaction.
Most of the remaining failures are due to imperfections in the set of templates used by BioRAT. Although
these errors could no doubt be reduced by improving the
templates, there is no clear way to achieve this without
a significant manual effort, even with BioRAT’s template design tool. Thus template design remains a major issue in information extraction research (Cowie &
Wilks 2000).
Even when BioRAT fails to extract the relevant information, it may still highlight the correct piece of text.
For example, DIP record DIP:800E defines an interac-
Table 2: Recall results from 211 DIP records, derived
from 130 full-length papers. The total recall from fulllength papers is 18.0 + 25.6 = 43.6%.
4.3
Abstracts
Cases Percent
239
55.07
125
28.80
70
16.13
434
100.00
Precision
Having analysed the recall of BioRAT in the previous
sections, we now turn to precision. In our experiments,
precision is somewhat harder to measure than recall,
because we need an estimate of the number of false positives. If a record produced by BioRAT is not found in
DIP, it could be that a) it is a false-positive example,
reducing the precision of BioRAT; or b) the record is
missing from DIP. The latter case consists of interactions that are mentioned in papers, but have not (yet)
been added to DIP.
For the first experiment described earlier, BioRAT
produced 434 interaction records, derived from 229 abstracts. We manually re-analysed these records with no
reference to DIP but instead we counted how many of
BioRAT’s predictions were correctly extracted from the
text, and what sort of mistakes it made. We repeated
this for a sample of 400 from the total of over 10,000
records produced in the analysis of the corresponding
full-length papers. Table 3 shows the results.
Around half of all records produced by BioRAT are
correct, in the sense that the information contained in
9 PMID
6
9184158
tion between proteins p53 and UBE2I. BioRAT failed to BioRAT in that way. BioRAT can also be used from
identify this interaction, but did extract this sentence10 : the command line, allowing non-interactive batch processing, and potentially reducing the impact of a slow
Since the tumor suppresser protein p53 and a
execution time for full-length papers. Since it is written
newly identified ubiquitin-like protein (UBL1)
in Java, BioRAT can be run on almost any platform,
are implicated in the RAD51/RAD52 comand has been tested successfully under Linux, Solaris,
plex. . . , we further tested their associations
MacOS and MS Windows.
with UBE2I.
Note that BioRAT correctly identified the above sen5 Discussion
tence as defining the interaction between RAD51 and
RAD52, even though it missed the target interaction.
4.4 Example output
From PubMed ID 9012827, BioRAT found the interaction (Swi6 ↔ Hrr25), which corresponds to the DIP “edge” record DIP:250E. BioRAT quoted the following sentence:

    Swi6 was also phosphorylated by Hrr25 kinase immunoprecipitated from yeast extracts with a HA tag (Fig. 2B).

A similar, but slightly more complex, template can recognise two interactions at once. The following sentence (footnote 11) correctly led BioRAT to produce two records, for the interactions (Pcf11 ↔ Rna14) and (Pcf11 ↔ Rna15):

    Since Pcf11 interacts simultaneously with Rna14 and Rna15, its role in vivo may also be to stabilize their interaction.

A less successful example comes from this sentence:

    Many interactions between nucleoporins and nuclear transport receptors have already been identified; however, we were unable to detect a biochemical interaction between Cse1p and Nup2p.

BioRAT incorrectly predicted that Cse1p interacted with Nup2p, whereas the text is less conclusive.

4.5 Speed and memory
The time it takes BioRAT to analyse a piece of text depends on the size of the text, the size of the gazetteers, and the complexity of the templates. In the work described here, BioRAT typically took 3-5 seconds to analyse each abstract, and 6-10 minutes to analyse each full-length paper, running on a standard desktop PC (a single 1.7 GHz CPU), and used around 500 MB of RAM. Given that each paper can be analysed independently, large-scale applications of text mining lend themselves well to distributed processing, although we have yet to use

10 PMID 8921390
11 PMID 11689698

As expected, the density of “interesting” facts found in the abstract is much higher than the corresponding density in the full text. This is at least in part because full-length papers include background discussion, a description of the method, references and so on. While these are necessary to set the work in context, and to provide supporting evidence, they may not contain the kind of information that BioRAT is attempting to extract.

[Figure 2 plot: number of records found (y-axis, 0 to 250) against location in document (x-axis, 0 to 100%).]
Figure 2: Location of information extracted from full-length papers. Location 0% is the start of the paper; location 100% is the end. The peak on the left corresponds to the abstract; the larger peak in the middle corresponds to the results and discussion sections of the source papers.

Figure 2 is one view of information density. It shows the location of each fact extracted from the set of full-length papers used earlier. Because different journals divide papers into sections in different ways, we only consider the location of the information relative to the entire paper. The peak on the left shows that a lot of information is found at the start of the paper, corresponding approximately to the title and abstract. The dip in the graph around 10-30% shows that relatively little information is extracted from the next part of the paper, typically the introduction and methods sections. There is another, larger peak around 40-80%, corresponding to the results and discussion sections, which contain a large amount of relevant information, before tailing off towards the end of the paper, which is typically a citation list. Note that many interactions were found more than once, through repetition within or between papers, and the graph shows the location of all the extracted information, including duplicates.
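The relative-location measure behind Figure 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text does not say whether BioRAT measures position in characters, tokens or sentences, so the sketch assumes character offsets, and the function names, the 10% bin width and the example offsets are all hypothetical rather than taken from BioRAT or from the data in the paper.

```python
from collections import Counter


def relative_location(offset, doc_length):
    """Position of a fact as a percentage of the way through the document."""
    if doc_length <= 0:
        raise ValueError("doc_length must be positive")
    return 100.0 * offset / doc_length


def location_histogram(facts, bin_width=10):
    """Count facts per location bin.

    `facts` is an iterable of (offset, doc_length) pairs, one per extracted
    fact. Duplicates are counted, as in Figure 2.
    """
    counts = Counter()
    for offset, doc_length in facts:
        pct = relative_location(offset, doc_length)
        # Clamp so that a fact at exactly 100% falls into the last bin.
        bin_start = min(int(pct // bin_width) * bin_width, 100 - bin_width)
        counts[bin_start] += 1
    return dict(sorted(counts.items()))


# Hypothetical offsets, for illustration only (not data from the paper).
example_facts = [(50, 1000), (480, 1000), (2500, 5000), (5000, 5000)]
histogram = location_histogram(example_facts)
```

Normalising each offset by the length of its own document is what allows papers of different lengths, and with different section structures, to be pooled into a single plot.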
These peaks show where most of the information was
extracted from, but the troughs are also of interest.
Even the least informative parts of papers still contain
considerable amounts of information. This strongly suggests that the entire paper should be analysed, wherever
possible, and not just a few selected sections. Although
the task is different, this contrasts with the behaviour of
some of the teams described by Yeh et al. (2003), who
restricted analysis to certain sections of the papers.
Even when BioRAT (or any other IE system) fails to
find a particular relationship, or incorrectly predicts a
relationship not mentioned in the text, it is quite possible that it has found an interesting part of an interesting
document. In this way, using IE to guide a literature
search is perfectly feasible, even if the recall and precision are a long way from the ideal 100%.
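For reference, recall and precision over a set of extracted interactions can be computed as in the following minimal Python sketch. It assumes that an interaction is an unordered pair of protein names and that an extracted pair counts as correct only if it appears in the target set. The helper names and the toy sets below are hypothetical; they reuse protein names from the examples above purely for illustration and are not the evaluation data.

```python
def interaction(protein_a, protein_b):
    """Represent an undirected interaction as an unordered pair."""
    return frozenset((protein_a, protein_b))


def precision_recall(extracted, target):
    """Return (precision, recall) for a set of extracted interactions.

    precision = correct extractions / all extractions
    recall    = correct extractions / all target interactions
    """
    true_positives = len(extracted & target)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(target) if target else 0.0
    return precision, recall


# Toy sets for illustration only (not the evaluation data from the paper).
target = {
    interaction("Swi6", "Hrr25"),
    interaction("Pcf11", "Rna14"),
    interaction("Pcf11", "Rna15"),
    interaction("RAD51", "RAD52"),
}
extracted = {
    interaction("Hrr25", "Swi6"),   # same pair, opposite order
    interaction("Pcf11", "Rna14"),
    interaction("Cse1p", "Nup2p"),  # a false positive
}
precision, recall = precision_recall(extracted, target)
```

In this toy case two of the three extracted pairs are correct and two of the four target pairs are found, so precision is 2/3 and recall is 1/2.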
The template design tool allows biological researchers,
with no text mining experience, to design, test and use
a sophisticated template-based information extraction
system. This flexibility allows BioRAT to be applied
to a wide range of problems without a large overhead,
in contrast to many comparable systems, which require
both biological and text mining expertise for them to be
used fully.
The results that BioRAT produces can be stored and
retrieved using a variety of interfaces, easing the user’s
access to the information. BioRAT also
provides quotes from the source texts, and links directly
to the source papers and related databases. In this way,
BioRAT behaves like a virtual research assistant, guiding the user towards interesting papers.
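As an illustration of what such a stored result might contain, the sketch below models one extracted record together with its supporting quote and a link back to the source paper. The class and field names are hypothetical and do not describe BioRAT's actual storage format, and the URL pattern is the current public PubMed address scheme, used here as an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InteractionRecord:
    """One extracted interaction, with the sentence that supports it."""

    protein_a: str
    protein_b: str
    pmid: str   # PubMed ID of the source paper
    quote: str  # sentence quoted from the source text

    def pubmed_url(self):
        # Assumed URL pattern for linking back to the source paper.
        return f"https://pubmed.ncbi.nlm.nih.gov/{self.pmid}/"


# The Swi6/Hrr25 example from Section 4.4, expressed as a record.
record = InteractionRecord(
    protein_a="Swi6",
    protein_b="Hrr25",
    pmid="9012827",
    quote=(
        "Swi6 was also phosphorylated by Hrr25 kinase immunoprecipitated "
        "from yeast extracts with a HA tag (Fig. 2B)."
    ),
)
```

Keeping the quote alongside the pair lets a reader check an extraction against its source without opening the paper, which matters when precision is well below 100%.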
6 Conclusions
In this paper, we have presented BioRAT, an information extraction system specially designed to process biological research papers. A distinguishing feature of BioRAT is that it uses full-length papers, rather than being limited to abstracts as previous studies have been. The recall and precision performance of BioRAT was assessed by use of the DIP database of protein-protein interactions, and the recall was compared with that of a previous system, SUISEKI, which processed only the abstracts. The recall performance of BioRAT on the abstracts alone (20%) was similar to that of SUISEKI. Overall, BioRAT achieved 43% recall and over 50% precision on full-length papers. Extra time is required to obtain the full-length papers, and there are difficulties in converting them into a usable plain text format. However, these costs are outweighed by the fact that more than twice as much relevant information can then be extracted automatically.

Acknowledgments
This work was sponsored by GlaxoSmithKline.

References
Blaschke, C. & Valencia, A. (2001), ‘Can bibliographic pointers for known biological data be found automatically? Protein interactions as a case study’, Comparative and Functional Genomics 2, 196–206.
Blaschke, C. & Valencia, A. (2002), ‘The frame-based module of the SUISEKI information extraction system’, IEEE Intelligent Systems 17(2), 14–20.
Collier, R. (1998), Automatic template creation for information extraction, PhD thesis, Department of Computer Science, University of Sheffield.
Cowie, J. & Wilks, Y. (2000), Information extraction, in R. Dale, H. Moisl & H. Somers, eds, ‘Handbook of Natural Language Processing’, Marcel Dekker, New York.
Craven, M. & Kumlien, J. (1999), Constructing biological knowledge-bases by extracting information from text sources, in ‘Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology’, Germany, pp. 77–86.
Cunningham, H., Maynard, D., Bontcheva, K. & Tablan, V. (2002), GATE: A framework and graphical development environment for robust NLP tools and applications, in ‘Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL’02)’, Philadelphia, USA.
Gaizauskas, R., Demetriou, G., Artymiuk, P. & Willett, P. (2003), ‘Protein structures and information extraction from biological texts: The PASTA system’, Journal of Bioinformatics 19(1), 135–143.
Murray-Rust, P. & Rzepa, H. S. (2002), ‘STMML. A markup language for scientific, technical and medical publishing’, Data Science 1(2), 1–65.
Thomas, J., Milward, D., Ouzounis, C., Pulman, S. & Carroll, M. (2000), Automatic extraction of protein interactions from scientific abstracts, in ‘Pacific Symposium on Biocomputing 5’, pp. 538–549.
Xenarios, I., Salwinski, L., Duan, X., Higney, P., Kim, S. & Eisenberg, D. (2002), ‘DIP: The Database of Interacting Proteins. A research tool for studying cellular networks of protein interactions’, Nucleic Acids Research 30(1), 303–305.
Yeh, A., Hirschman, L. & Morgan, A. (2003), ‘Evaluation of text data mining for database curation: lessons learned from the KDD Challenge Cup’, Bioinformatics 19(Suppl 1), i331–i339.