Efficient and Accurate Entity Recognition for Biomedical Text
Fabio Rinaldi1, Lenz Furrer1, Marco Basaldella2
1 University of Zurich, 2 Università degli Studi di Udine
Abstract- This short paper briefly presents an efficient implementation of a named entity recognition system for biomedical entities, which is also available as a web service. The approach is based on a dictionary-based entity recognizer combined with a machine-learning classifier which acts as a filter. We evaluated the efficiency of the approach through participation in the TIPS challenge (BioCreative V.5), where it obtained the best results among participating systems. We separately evaluated the quality of entity recognition and linking, using a manually annotated corpus as a reference (CRAFT), where we obtained state-of-the-art results.
Keywords- named entity recognition; text mining; machine learning; natural language processing.
I. INTRODUCTION
Named entity recognition is most often tackled with knowledge-based approaches (using dictionaries) or example-based approaches (machine learning). Currently the best results are obtained using supervised machine-learning based systems. For extracting chemical names, [9] describes how two CRF classifiers are trained on a corpus of journal abstracts, using different features and model parameters. The approach in [10] also tackles chemical name extraction with CRF, partly using the same software basis as the previous one. For tagging gene names, [14] describes another supervised sequence-labeling approach, using a CRF classifier.
There is growing interest in hybrid systems combining machine learning and dictionary approaches such as the one described in [1], which obtains interesting performance on chemical entity recognition in patent texts.
In the field of entity linking, dictionary-based methods are predominant, since the prediction of arbitrary identifiers cannot be modeled in a generalized way. In [6], the authors explore ways to improve established information retrieval techniques for matching protein names and other biochemical entities against ontological resources. The TaggerOne system [8] uses a joint model for tackling NER and linking at the same time – yet another example of a hybrid system that combines machine learning and dictionaries.
II. DATA
The Colorado Richly Annotated Full Text (CRAFT) corpus [2, 16] has been built specifically for evaluating these kinds of systems. It consists of 67 full-text articles that have been manually annotated with respect to chemicals, genes, proteins, cell types, cellular components, biological processes, molecular functions, organisms, and biological sequences. In total, the available articles are annotated with over 100,000 concepts.
For our experiments, we used all terminology resources that were distributed with the corpus (which means all annotated entities, except those grounded using Entrez Gene). We regarded species and higher taxonomic ranks (genus, order, phylum etc.) from both cellular organisms and viruses as a common entity type “organism”. Also, we combined the two non-physical entity types (biological processes and molecular functions) into a single class.
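The entity-type grouping described in the Data section (taxonomic ranks merged into “organism”, biological processes and molecular functions merged into one class) can be sketched as a simple lookup. The label strings below are illustrative assumptions; they are not the type names used in CRAFT or in OGER.

```python
# Hypothetical fine-grained labels; the real CRAFT/OGER type names may differ.
COARSE_TYPE = {
    "species": "organism",
    "genus": "organism",
    "order": "organism",
    "phylum": "organism",
    "biological_process": "process_or_function",
    "molecular_function": "process_or_function",
}


def coarse_type(fine_type):
    """Map a fine-grained label to its merged class; other labels pass through."""
    return COARSE_TYPE.get(fine_type, fine_type)
```

Labels that are not listed (for example a chemical type) are returned unchanged, which mirrors the fact that only these two groups were merged.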
III. METHODS
The OntoGene group has developed an approach for biomedical entity recognition based on dictionary lookup and flexible matching. This approach has been used in several competitive evaluations of biomedical text mining technologies, often obtaining top-ranked results [12, 13, 11]. Recently, the core parts of the pipeline have been implemented in a more efficient framework using Python [4] and are now developed under the name OGER (OntoGene’s Entity Recognizer). These improvements proved effective in the BioCreative V.5 shared task [7]: in the technical interoperability and performance of annotation servers (TIPS) task, our system achieved the best results in four out of six evaluation metrics [5].
OGER offers a flexible web API for performing dictionary-based NER. It accepts a range of input formats and provides the annotated terms along with identifiers in various output formats. We run an instance of OGER as a permanent web service which is accessible through an API and a web user interface.1
For the experiments with the CRAFT corpus, we used the ontologies on which the original annotation was based. We used those resources to compile a non-hierarchical dictionary with 1.26 million terms pointing to 864,000 concept identifiers.
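A non-hierarchical dictionary of this kind can be pictured as a flat mapping from terms to sets of concept identifiers: synonyms share an identifier, and an ambiguous term carries several. The following is a minimal sketch under that assumption. The function name, the input format and the identifiers are hypothetical placeholders, and the lowercasing of keys is a simplification; this is not OGER's internal data structure.

```python
from collections import defaultdict


def compile_dictionary(entries):
    """Build a flat term -> {concept IDs} lookup from (term, concept_id) pairs."""
    lookup = defaultdict(set)
    for term, concept_id in entries:
        # Lowercased keys are a simplification made for this sketch.
        lookup[term.lower()].add(concept_id)
    return dict(lookup)


# Placeholder entries for illustration only (not real ontology identifiers).
entries = [
    ("fish", "TAXON:0001"),
    ("fish", "TAXON:0002"),
    ("Fishes", "TAXON:0001"),
]
lookup = compile_dictionary(entries)
```

Because several terms can point to the same concept, the number of terms in such a table is expected to exceed the number of distinct identifiers, as in the figures quoted above.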
The input documents were tokenized with a simple method
based on character class, which collapsed spelling variants such
as “SRC 1”, “SRC-1”, and “SRC1” to a common form.
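The effect of character-class tokenization on such spelling variants can be illustrated with a short sketch. The regular expression and function name are assumptions made for this example and are not taken from the OGER code base: runs of letters and runs of digits become tokens, while whitespace and punctuation only separate them.

```python
import re

# Runs of letters or runs of digits become tokens; everything else
# (spaces, hyphens, other punctuation) only separates tokens.
TOKEN = re.compile(r"[^\W\d_]+|\d+")


def tokenize(text):
    """Split text into letter runs and digit runs (character-class tokenization)."""
    return TOKEN.findall(text)


variants = ["SRC 1", "SRC-1", "SRC1"]
# All three variants yield the same token sequence.
forms = {tuple(tokenize(v)) for v in variants}
```

With this scheme, dictionary terms and document text only need to be compared as token sequences, so the three variants match the same dictionary entry.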
1 https://pub.cl.uzh.ch/projects/ontogene/oger/
All tokens were then converted to lowercase and stemmed, except for acronyms that collide with a word from general language (e.g. “WAS”). As a further normalization step, Greek letters were expanded to their letter name in Latin spelling, e.g. “α” → “alpha”.
In order to improve the system’s accuracy, we added a machine-learning filter to remove spurious matches. We used an approach based on neural networks (NN), as they were the best performing algorithm in our previous experiments described in [3]. Training is performed using 10-fold cross validation on 47 articles; the evaluation is thus performed on 20 documents only. The features used are mostly shape-based (character count, capitalization), but some include linguistic information (POS, stem) or domain knowledge (frequent pre-/suffixes).
IV. RESULTS
We examined our system in two separate evaluations. We first considered the performance of NER proper, i.e. we regarded only offset spans and the (coarse) entity type of each annotation produced by each system, ignoring concept identifiers. We then evaluated the correctness of the selected concept identifiers. To this end, we augmented the ML-based output with concept identifiers taken from the dictionary-based pre-annotations, which enabled us to draw a fair comparison to previous work in entity linking on the CRAFT corpus.
A. Named Entity Recognition
We have compiled very detailed results for different configurations and for each entity category; however, the brevity of this paper allows us to present only aggregated results. The OGER pipeline alone (without filtering) delivers an overall 66% recall score with a precision of 59% over all the entity types considered. Adding the NN-based filtering module, recall drops to 60%, with an increase in precision to 86%, leading to a very competitive F-score of 70%.
TABLE 1: Performance of our system in entity recognition (top) and entity linking (a.k.a. concept recognition, bottom), compared to the best results reported in [15].
Task                System               Precision  Recall  F1
Entity recognition  OGER                 0.59       0.66    0.62
Entity recognition  OGER+NN              0.86       0.60    0.70
Entity linking      OGER                 0.32       0.52    0.40
Entity linking      OGER+NN              0.51       0.49    0.50
Entity linking      cTakes Dict. Lookup  0.51       0.43    0.47
B. Concept Recognition
We chose a simple strategy to reintroduce the concept identifiers provided by OGER into the output of the ML systems, based on the intersection of the original annotations from OGER’s output (which include identifiers) and the annotations left after applying the NN-based filter. We did not resolve ambiguous annotations; instead, multiple identifiers could be returned for the same span. While having no disambiguation at all is arguably a deficiency for an entity linking system, it is not imperative that each and every ambiguity is reduced to a single choice. This is particularly true when evaluating against CRAFT, which contains a number of reference annotations with multiple concept identifiers. For example, in PMID: 16504143, PMCID: 1420314, the term “fish” (occurring in the last paragraph of the Discussion section) is assigned six different taxonomic ranks.
This simple strategy allows the system to reach a precision of 51% with a recall of 49% in concept recognition. Compared with the results of several previous systems reported in [15], who carried out a series of experiments using the same dataset, our results are already state-of-the-art.
Please note that the results reported by [15] are not perfectly comparable to the ones we obtained, since the former were tested on the whole CRAFT corpus, while our approach was evaluated on 20 documents only (since we used the remaining documents to train our system). Still, the comparison shows that even a relatively simple approach is sufficient to transform our NER pipeline into an entity linking system with reasonable quality. This is particularly true for the OGER+NN configuration, where both precision and recall are as good as or better than the figures for all the reported systems.
V. CONCLUSION AND FUTURE WORK
In this paper, we presented an efficient, high-quality system for biomedical entity recognition and linking (OGER). We evaluated both processing speed and annotation quality in a series of in-domain experiments using the CRAFT corpus. OGER’s scalability and efficiency were also demonstrated in the recently held TIPS task of the BioCreative V.5 challenge. For NER, we used an NN classifier, which acted as a postfilter of the dictionary annotations. The combined system achieved competitive results in entity recognition and state-of-the-art results in entity linking over the selected evaluation corpus (CRAFT).
Currently we expose via web API only the OGER service (entity recognition and linking) but without disambiguation. As a next step in this research activity, we intend to make available a second web service including disambiguation. At the same time we are performing additional experiments aimed at improving the quality of the disambiguation step.
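For reference, the F-scores reported for entity recognition and entity linking are the harmonic mean of precision and recall. The sketch below recomputes them from the rounded precision and recall values given in the text; the function and variable names are chosen for this example only. Since the inputs are already rounded to two decimals, a recomputed score can differ from the reported one in the last digit (0.86 and 0.60, for instance, give roughly 0.707 against a reported 0.70).

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) as stated in the Results section.
reported = {
    "NER, OGER": (0.59, 0.66, 0.62),
    "NER, OGER+NN": (0.86, 0.60, 0.70),
    "Linking, OGER": (0.32, 0.52, 0.40),
    "Linking, OGER+NN": (0.51, 0.49, 0.50),
}

recomputed = {name: round(f1(p, r), 3) for name, (p, r, _) in reported.items()}
```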
REFERENCES
[1] Saber A Akhondi, Ewoud Pons, Zubair Afzal, Herman van Haagen, Benedikt FH Becker, Kristina M Hettne, Erik M van Mulligen, and Jan A Kors. Chemical entity recognition in patents by combining dictionary-based and statistical approaches. Database, 2016:baw061, 2016.
[2] Michael Bada, Miriam Eckert, Donald Evans, Kristin Garcia, Krista Shipley, Dmitry Sitnikov, William A Baumgartner, K Bretonnel Cohen, Karin Verspoor, Judith A Blake, et al. Concept annotation in the CRAFT corpus. BMC bioinformatics, 13(1):1, 2012.
[3] Marco Basaldella, Lenz Furrer, Nico Colic, Tilia R Ellendorff, Carlo Tasso, and Fabio Rinaldi. Using a hybrid approach for entity recognition in the biomedical domain. In Proceedings of the 7th International Symposium on Semantic Mining in Biomedicine (SMBM 2016), 2016.
[4] Nicola Colic. Dependency parsing for relation extraction in biomedical literature. Master’s thesis, University of Zurich, Switzerland, 2016.
[5] Lenz Furrer and Fabio Rinaldi. OGER: OntoGene’s entity recogniser in the BeCalm TIPS task. In Proceedings of the BioCreative V.5 Challenge Evaluation Workshop, pages 175–182, 2017.
[6] Tudor Groza and Karin Verspoor. Assessing the impact of case sensitivity and term information gain on biomedical concept recognition. PloS one, 10(3):e0119091, 2015.
[7] Martin Krallinger, Martin Pérez-Pérez, Gael Pérez-Rodríguez, Aitor Blanco-Míguez, Florentino Fdez-Riverola, Salvador Cappella-Gutierrez, Anália Lourenço, and Alfonso Valencia. The BioCreative V.5/BeCalm evaluation workshop: tasks, organization, sessions and topics. In Proceedings of the BioCreative V.5 Challenge Evaluation Workshop, pages 8–10, 2017.
[8] Robert Leaman and Zhiyong Lu. TaggerOne: joint named entity recognition and normalization with semi-Markov Models. Bioinformatics, 32(18):2839, 2016.
[9] Robert Leaman, Chih-Hsuan Wei, and Zhiyong Lu. tmChem: a high performance approach for chemical named entity recognition and normalization. J. Cheminformatics, 7(S-1):S3, 2015.
[10] Tsendsuren Munkhdalai, Meijing Li, Khuyagbaatar Batsuren, Hyeon Ah Park, Nak Hyeon Choi, and Keun Ho Ryu. Incorporating domain knowledge in chemical and biomedical named entity recognition with word representations. Journal of Cheminformatics, 7(1):S9, 2015.
[11] Fabio Rinaldi, Simon Clematide, Hernani Marques, Tilia Ellendorff, Raul Rodriguez-Esteban, and Martin Romacker. OntoGene web services for biomedical text mining. BMC Bioinformatics, 15(14), 2014.
[12] Fabio Rinaldi, Thomas Kappeler, Kaarel Kaljurand, Gerold Schneider, Manfred Klenner, Simon Clematide, Michael Hess, Jean-Marc von Allmen, Pierre Parisot, Martin Romacker, and Therese Vachon. OntoGene in BioCreative II. Genome Biology, 9(Suppl 2):S13, 2008.
[13] Fabio Rinaldi, Gerold Schneider, Kaarel Kaljurand, Simon Clematide, Therese Vachon, and Martin Romacker. OntoGene in BioCreative II.5. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 7(3):472–480, 2010.
[14] Golnar Sheikhshab, Elizabeth Starks, Aly Karsan, Anoop Sarkar, and Inanc Birol. Graph-based semi-supervised gene mention tagging. In Proceedings of the 15th Workshop on Biomedical Natural Language Processing, pages 27–35, Berlin, Germany, 2016. Association for Computational Linguistics.
[15] Eugene Tseytlin, Kevin Mitchell, Elizabeth Legowski, Julia Corrigan, Girish Chavan, and Rebecca S Jacobson. NOBLE – Flexible concept recognition for large-scale biomedical natural language processing. BMC bioinformatics, 17(1):1, 2016.
[16] Karin Verspoor, Kevin Bretonnel Cohen, Arrick Lanfranchi, Colin Warner, Helen L. Johnson, Christophe Roeder, Jinho D. Choi, Christopher Funk, Yuriy Malenkiy, Miriam Eckert, Nianwen Xue, William A. Baumgartner, Michael Bada, Martha Palmer, and Lawrence E. Hunter. A corpus of full-text journal articles is a robust evaluation tool for revealing differences in performance of biomedical natural language processing tools. BMC Bioinformatics, 13(1):207, 2012.