OpeNER: Open Polarity Enhanced Named Entity Recognition
Named Entity Disambiguation or NED) to feed the feature-based opinion mining systems with relevant information with respect to the entities about which the opinions are being expressed.

Currently there are many companies offering Content Analytics and Social Internet Mining services for the purposes of Opinion Mining and Reputation Management. A high percentage of Small and Medium Enterprises (SMEs) are active, offering niche solutions to specific segments of the market and/or domains. However, acquiring or developing the base qualifying technologies required to enter the market is an expensive undertaking that redirects the already limited resources of SMEs away from offering the products and services that the market demands.

The main goal of the OpeNER project² is the reuse and repurposing of existing language resources and data sets to provide a set of underlying technologies to the broader community. OpeNER will focus on the provision of a supplementary sentiment lexicon with culturally normalised and graduated values. NERC will also be addressed, leveraging Linked Data with the aim of disambiguating the entity types recognised for the languages covered in the project: Spanish, English, French, German, Dutch and Italian. In the first year the project will focus on a generic application domain, and it will later be adapted to the Tourism domain.

This will be achieved in conjunction with an End User Advisory Board (EUAB) composed of European Tourism Promotion Agencies, an online Tourism Portal and other interested parties. Furthermore, OpeNER will employ proven software from the Open Source community and develop an online development community, thus ensuring long-term viability beyond the project timeframe. In that way the benefits of the project will be adopted and extended to new domains and languages. OpeNER will strive to make the tools and techniques resulting from the project available under Open Source or Hybrid Licenses.

2 Objectives

OpeNER aims to provide enterprise and society with base technologies for Cross-lingual Named Entity Recognition and Classification and Sentiment Analysis through the reuse of existing resources and the open development of complementary technologies. The key objectives of the project are the following: (i) Repurposing and/or development of existing language resources and generation of a reference generic multilingual sentiment lexicon with cultural normalisation and scales; (ii) An extension lexicon for the Tourist sector in several languages (Spanish, Dutch, German, Italian, English and French); (iii) Named Entity Resolution (NERC, NED and Coreference) in the same set of target languages as the Sentiment Lexicon, which is extensible to other languages by leveraging multilingual resources such as Wikipedia and Linked Data³ resources such as DBpedia⁴, etc.; (iv) Development and open availability of validated reference Sentiment and Opinion Mining techniques and tools based on the results of the project; (v) Evaluation and Application of the project results in the cloud, principally in the tourism sector, with leading SMEs in the sector and with the support of several stakeholders as part of the End User Advisory Board (EUAB); (vi) Research and trialling of models that will ensure that the project results are self-sustainable and economically viable in the long term; (vii) Achievement of the project's objectives by repurposing and leveraging existing state-of-the-art and established language resources.

3 Work Plan

In order to optimise the value of OpeNER technology, all the requirements along the value chain for the development and the exploitation of the project's objectives are directly represented in the project's Consortium, formed by 6 institutions from Italy, The Netherlands and Spain, with Vicomtech-IK4 as coordinator. The OpeNER Work Plan is structured in 8 Work Packages (WP), and can be divided into three blocks. Although we first describe every WP, we will henceforth focus on the aspects most relevant to SEPLN, namely those related to Work Packages 4-7:

1. Management, Dissemination and Exploitation: WP1 and WP8 led by Vicomtech-IK4. As this is an SME-oriented project, the Dissemination and Exploitation of results will go beyond

² http://www.opener-project.org
³ http://linkeddata.org
⁴ http://dbpedia.org
hotel reviews we will be looking at features such as room service, cleanliness, etc.

As we are working with 6 languages, it would be convenient, where possible, to use one solution, one tool and one methodology to provide most of the NLP annotation, including not only NERC, Coreference and NED, but also language identification, tokenisation, POS tagging, lemmatisation, and parsing. Otherwise, maintaining so many different tools for every annotation process would be far too cumbersome to provide easy-to-use integrated NLP pipelines in a virtual machine. Thus, every NLP component (except Opinion Mining) in the English and Spanish pipelines is being developed using the Apache OpenNLP API⁷ for supervised Machine Learning based linguistic annotators: Sentence Segmentation, Tokenisation, Part of Speech tagging, NERC and Constituent Parsing. The Consortium is training new models for every component using the usual general domain datasets, such as the CoNLL and Evalita datasets for NERC, Penn Treebank WSJ for English POS and Constituent Parsing, and Ancora⁸ for Spanish POS and Constituent Parsing. Furthermore, lemmatisation is performed by looking up word forms and POS tags in a dictionary for each language.

With respect to coreference, the Stanford multi-pass sieve system (Lee et al., 2013) is being re-implemented in such a manner that it facilitates its adaptation to other languages. The coreference module takes as input KAF containing POS and NERC annotated tokens and a constituent parse tree. The sieves are implemented in such a way that the only requirement for adapting to another language is to change the POS and Parsing tagsets and a number of static dictionaries that contain information such as demonyms, gender, number, etc. Finally, the NED systems are being adopted from the English DBpedia Spotlight⁹ (Mendes et al., 2011), which is based on DBpedia and Wikipedia for disambiguation of Named Entities.

5 Concluding Remarks

This paper presents OpeNER, a European project that will provide completely 'off-the-box' usable language analysis tool chains for six languages to make sense out of unstructured text via Natural Language Processing. These chains will easily be incorporated by SMEs in their workflow for applications such as Reputation Management and Information Access. Furthermore, in the second year of the project the toolchain will be adapted to process texts from the Tourist domain. To this end, the project will also investigate how the performance of the OpeNER toolchains can be improved by inter-relating the various layers of linguistic annotation.

Acknowledgments

The research leading to these results has received funding from the European Research Council under the European Union's Seventh Framework Programme (FP/2007-2013)/ERC Grant Agreement n. 296451.

References

Bosma, W., P. Vossen, A. Soroa, G. Rigau, M. Tesconi, A. Marchetti, M. Monachini, and C. Aliprandi. 2009. KAF: a generic semantic annotation format. In Proceedings of the Generative Lexicon (GL2009) Workshop on Semantic Annotation, Pisa, Italy.

Gretzel, Ulrike and Kyung Hyan Yoo. 2008. Use and impact of online travel reviews. In Information and communication technologies in tourism 2008. Springer, pages 35–46.

Hu, M. and B. Liu. 2004. Mining and summarizing customer reviews. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 168–177.

Lee, Heeyoung, Angel Chang, Yves Peirsman, Nathanael Chambers, Mihai Surdeanu, and Dan Jurafsky. 2013. Deterministic coreference resolution based on entity-centric, precision-ranked rules. Computational Linguistics, pages 1–54, January.

Mendes, P. N., M. Jakob, A. García-Silva, and C. Bizer. 2011. DBpedia Spotlight: Shedding light on the web of documents. In Proceedings of the 7th International Conference on Semantic Systems, pages 1–8.

Pang, B. and L. Lee. 2008. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1–2):1–135.

⁷ http://opennlp.apache.org/
⁸ http://clic.ub.edu/corpus/es/ancora
⁹ https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki