Procesamiento del Lenguaje Natural, Issue 51, September 2013, pp. 215-218. Received 30-04-2013, revised 16-06-2013, accepted 21-06-2013.
OpeNER: Open Polarity Enhanced Named Entity Recognition

OpeNER: Named Entity Recognition with Polarity
Rodrigo Agerri (IXA NLP Group, UPV/EHU, [email protected])
Montse Cuadros, Seán Gaines (HSLT, IP department, Vicomtech-IK4, {mcuadros,sgaines}@vicomtech.org)
German Rigau (IXA NLP Group, UPV/EHU, [email protected])
Summary: A large number of companies currently offer services for content analysis
and data mining of social networks, with the aim of performing opinion analysis and
reputation management. A high percentage of small and medium-sized enterprises (SMEs)
offer solutions specific to an industrial sector or domain. However, acquiring the
basic technology needed to offer such services is too complex and represents too high
an extra cost for their limited resources. The goal of the European project OpeNER is
the reuse and development of components and resources for language processing that
provide the technology needed for industrial and/or academic use.
Key words: Named Entity Recognition and Disambiguation, Coreference, Sentiment
Analysis
Abstract: Currently there are many companies offering Content Analytics and
Social Internet Mining services for the purposes of Opinion Mining and Reputation
Management. A high percentage of Small and Medium Enterprises (SMEs) are
active offering niche solutions to specific segments of the market and/or domains.
However, acquiring or developing the base qualifying technologies required to enter
the market is an expensive undertaking that redirects the already limited resources
of SMEs away from offering products and services that the market demands. The
main goal of the OpeNER European project is the reuse and repurposing of existing
language resources and data sets to provide a set of underlying technologies to the
broader industrial and academic community.
Keywords: Named Entity Recognition and Disambiguation, Coreference, Opinion
Mining
ISSN 1135-5948 © 2013 Sociedad Española Para el Procesamiento del Lenguaje Natural

1 Introduction

Customer reviews and ratings on the Internet are of increasing importance in the
evaluation of products and services by potential customers. In certain sectors, they
are even becoming a fundamental variable in the purchase decision. A recent Forrester
study showed that more than 30% of Internet users have evaluated products online, and
that 70% of those studied end-user generated reviews (footnote 1:
http://www.bazaarvoice.com/resources/stats). Furthermore, another study by Complete
Incorporated for the Tourist Domain showed that more than 80% of users preferred other
users’ opinions in order to make their buying decisions. In fact, it has been concluded
that 97% of Internet users have read and been influenced by other users’ opinions while
planning a trip (Gretzel and Yoo, 2008). Obviously, this trend will continue with the
growth of Social Media and access to Information and Communication Technologies (ICT).
Consumers tend to trust the opinion of other consumers, especially those with prior
experience of a product or service, rather than company marketing (see footnote 1). The
role of user comments is of particular importance when there is little differentiation
between the products and services on offer. Therefore, there is an objective need to
manage and understand the knowledge conveyed by opinions.

Opinion Mining consists of extracting and analysing, from unstructured text, opinions
about products, people, events, institutions, etc. (Pang and Lee, 2008). In other words,
the goal is to know “who” is speaking about “what”, “when” and in “what sense” (Hu and
Liu, 2004). More specifically, OpeNER will stress the importance of providing a good
Named Entity Resolution system (Named Entity Recognition or NERC, Coreference and
Named Entity Disambiguation or NED) to feed the feature-based opinion mining systems
with relevant information with respect to the entities about which the opinions are
being expressed.

Currently there are many companies offering Content Analytics and Social Internet
Mining services for the purposes of Opinion Mining and Reputation Management. A high
percentage of Small and Medium Enterprises (SMEs) are active offering niche solutions
to specific segments of the market and/or domains. However, acquiring or developing the
base qualifying technologies required to enter the market is an expensive undertaking
that redirects the already limited resources of SMEs away from offering the products
and services that the market demands.

The main goal of the OpeNER project (footnote 2: http://www.opener-project.org) is the
reuse and repurposing of existing language resources and data sets to provide a set of
underlying technologies to the broader community. OpeNER will focus on the provision of
a supplementary sentiment lexicon with culturally normalised and graduated values. NERC
will also be addressed leveraging Linked Data with the aim of disambiguating the entity
types recognised for the languages covered in the project: Spanish, English, French,
German, Dutch and Italian. In the first year the project will be focused on a generic
application domain, and later adapted to the Tourism domain.

This will be achieved in conjunction with an End User Advisory Board (EUAB) composed of
European Tourism Promotion Agencies, an online Tourism Portal and other interested
parties. Furthermore, OpeNER will employ proven software from the Open Source community
and develop an online development community, thus ensuring long term viability beyond
the project timeframe. In that way the benefits of the project will be adopted and
extended to new domains and languages. OpeNER will strive to make the tools and
techniques resulting from the project available under Open Source or Hybrid Licenses.

2 Objectives

OpeNER aims to provide enterprise and society with base technologies for Cross-lingual
Named Entity Recognition and Classification and Sentiment Analysis through the reuse of
existing resources and the open development of complementary technologies. The key
objectives of the project are the following: (i) Repurposing and/or developing of
existing language resources and generation of a reference generic multilingual sentiment
lexicon with cultural normalisation and scales; (ii) An extension lexicon for the Tourist
sector in several languages (Spanish, Dutch, German, Italian, English and French); (iii)
Named Entity Resolution (NERC, NED and Coreference) in the same set of target languages
as the Sentiment Lexicon, which is extensible to other languages by leveraging
multilingual resources such as Wikipedia and Linked Data (footnote 3:
http://linkeddata.org) resources such as DBpedia (footnote 4: http://dbpedia.org), etc.;
(iv) Development and open availability of validated reference Sentiment and Opinion
Mining techniques and tools based on the results of the project; (v) Evaluation and
Application of the project results in the cloud, principally in the tourism sector, with
leading SMEs in the sector and with the support of several stakeholders as part of the
End User Advisory Board (EUAB); (vi) Research and trialling of models that will ensure
that the project results are self-sustainable and economically viable in the long term;
(vii) Achievement of the project's objectives by repurposing and leveraging existing
state of the art and established language resources.

3 Work Plan

In order to optimise the value of OpeNER technology, all the requirements along the
value chain for the development and the exploitation of the project’s objectives are
directly represented in the project’s Consortium, formed by 6 institutions from Italy,
The Netherlands and Spain, with Vicomtech-IK4 as coordinator. The OpeNER Work Plan is
structured in 8 Work Packages (WP), and can be divided in three blocks. Although we
first describe every WP, we will henceforth focus on the most relevant aspects to SEPLN,
namely, those related to work packages 4-7:

1. Management, Dissemination and Exploitation: WP1 and WP8 led by Vicomtech-IK4. As
this is an SME oriented project, the Dissemination and Exploitation of results will go
beyond
the publication of scientific articles. It shall also include industrial dissemination
and exploitation.

2. System Design and Deployment: WP2 and WP7 led by Synthema and Olery respectively. In
order to ensure the future exploitation of the project by SMEs, the system design and
deployment is crucial. Both Synthema and Olery have experience in software integration
for industry related applications and products.

3. NLP and Web techniques: WP3 (University of the Basque Country, UPV/EHU), WP4
(National Research Council of Italy, CNR), WP5 (Free University of Amsterdam, VUA) and
WP6 (Vicomtech-IK4). Focused on Opinion Mining (WP5) and Named Entity Resolution (WP3)
and any other basic NLP (WP6) and Web tools (WP4) required to perform those tasks.

OpeNER will provide language analysis tool chains for several languages to help
researchers and companies make sense out of unstructured text via Natural Language
Processing. It will consist of easy to install, improve and configure components to: (i)
Detect the language of a text; (ii) Determine polarity of texts (sentiment analysis) and
analysis of feature-based opinions; and (iii) Detect and classify the entities named in
the texts and link them together via Named Entity Recognition, Coreference and Named
Entity Disambiguation (e.g. President Obama or The Hilton Hotel). Besides the individual
components, guidelines exist on how to add languages and how to adjust components for
specific situations and topics. The following section will describe the English and
Spanish OpeNER toolchains.

4 OpeNER NLP Pipelines

An OpeNER tool chain or pipeline consists of a broad mix of technologies glued together
using Ruby. The prerequisites for running an OpeNER tool chain are the following: a
GNU/Linux or Unix type operating system (including Mac OS), Ruby 1.9.3+, Python 2.7+,
Java 1.7+, Perl 5+. Every part of the OpeNER tool chain has individual dependencies,
most of which are included in the components themselves. The OpeNER architecture
consists of several building blocks called components, which can be used to build a tool
chain called a configuration.

A component consists of a kernel, which can be for example a POS tagger implemented in
Java, and a glue in Ruby to connect with other components. Figure 1 represents a
possible flow of information between several components. Each component is configured
to take the information it requires to perform a specific analysis from the previous
module. KAF (Bosma et al., 2009) (footnote 5: http://www.kyoto-project.eu) is used as
inter-component representation between the components. Each of the tool chains built is
then deployed via Cloud Computing services such as Amazon Elastic Compute Cloud
(Amazon EC2; footnote 6: http://aws.amazon.com/ec2/).

As described in section 3, the NLP focus of OpeNER is on Named Entity Resolution (NERC,
Coreference and NED) and Opinion Mining. The overall objective of Named Entity
Resolution is to be able to recognise, classify and link every mention of a specific
named entity in a text. A named entity can be mentioned using a great variety of surface
forms (Barack Obama, President Obama, Mr. Obama, B. Obama, etc.) and the same surface
form can refer to a variety of named entities: for example, the form ‘San Juan’ can be
used to ambiguously refer to many toponyms, persons, a saint, etc. (e.g. see
http://en.wikipedia.org/wiki/San_Juan). Furthermore, it is possible to refer to a named
entity by means of anaphoric pronouns and coreferent expressions such as ‘he’, ‘her’,
‘their’, ‘I’, ‘the 35 year old’, etc. Therefore, in order to provide an adequate
comprehensive account of named entities in text it is necessary to recognise the mention
of a named entity, to classify it as a type (e.g. person, location, etc.), to link it or
disambiguate it to a specific entity, and to resolve every form of mentioning or
co-referring to the same entity in a text.

The Opinion Mining approach in OpeNER consists of three levels: (i) generation of
polarity lexicons from WordNets for each language; (ii) development of polarity systems
at document and sentence level based on those lexicons; and (iii) feature-based opinion
mining based on supervised classification and feature extraction.
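
The component-and-configuration design described in this section, together with the lexicon-based polarity level of the Opinion Mining approach, can be illustrated with a minimal sketch. It is not OpeNER code: the plain dictionary merely stands in for a KAF document, and the component names, the toy lexicon and its values are all assumptions made for illustration.

```python
import re
from typing import Callable, Dict, List

# A plain dict stands in for the KAF document passed between components.
Doc = Dict[str, object]
Component = Callable[[Doc], Doc]

# Toy polarity lexicon (hypothetical values, not an OpeNER resource).
POLARITY_LEXICON = {"clean": 1.0, "friendly": 1.0, "dirty": -1.0, "noisy": -1.0}


def tokenise(doc: Doc) -> Doc:
    """Add a 'tokens' layer with lower-cased word tokens."""
    doc["tokens"] = re.findall(r"\w+", str(doc["text"]).lower())
    return doc


def document_polarity(doc: Doc) -> Doc:
    """Add a 'polarity' layer: the sum of lexicon values over the tokens."""
    tokens: List[str] = doc["tokens"]  # layer produced by the previous component
    doc["polarity"] = sum(POLARITY_LEXICON.get(t, 0.0) for t in tokens)
    return doc


def run_configuration(text: str, components: List[Component]) -> Doc:
    """Run the components in order; each one reads layers added before it."""
    doc: Doc = {"text": text}
    for component in components:
        doc = component(doc)
    return doc


if __name__ == "__main__":
    result = run_configuration(
        "The room was clean but the street was noisy and the staff friendly.",
        [tokenise, document_polarity],
    )
    print(result["tokens"][:4], result["polarity"])
```

In this sketch, extending the chain (for instance with a sentence-level polarity step) only requires appending another function to the list, which loosely mirrors how a configuration is assembled from components.
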
For hotel reviews we will be looking at features such as room service, cleanliness, etc.

As we are working with 6 languages, it would be convenient, where possible, to use one
solution, one tool and one methodology to provide most of the NLP annotation, including
not only NERC, Coreference and NED, but also language identification, tokenisation, POS
tagging, lemmatisation, and parsing. Otherwise, maintaining so many different tools for
every annotation process would be far too cumbersome to provide easy-to-use integrated
NLP pipelines in a virtual machine. Thus, every NLP component (except Opinion Mining) in
the English and Spanish pipelines is being developed using the Apache OpenNLP API
(footnote 7: http://opennlp.apache.org/) for supervised Machine Learning based
linguistic annotators: Sentence Segmentation, Tokenisation, Part of Speech tagging, NERC
and Constituent Parsing. The Consortium is training new models for every component using
the usual general domain datasets such as CoNLL and Evalita datasets for NERC, Penn
Treebank WSJ for English POS and Constituent Parsing, and Ancora (footnote 8:
http://clic.ub.edu/corpus/es/ancora) for Spanish POS and Constituent Parsing.
Furthermore, lemmatisation is performed by looking up the word form and POS tag in a
dictionary for each language.

With respect to coreference, the Stanford multi-sieve pass system (Lee et al., 2013) is
being re-implemented in such a manner that it facilitates its adaptation to other
languages. The coreference module takes KAF containing POS and NERC annotated tokens and
a constituent parse tree as input. The sieves are implemented in a way that the only
requirement to adapt to another language is to change the POS and Parsing tagsets and a
number of static dictionaries that contain information such as demonyms, gender, number,
etc. Finally, the NED systems are being adopted from the English DBpedia Spotlight
(footnote 9: https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki) (Mendes et al.,
2011), which is based on DBpedia and Wikipedia for disambiguation of Named Entities.

5 Concluding Remarks

This paper presents OpeNER, a European project that will provide completely
‘off-the-box’ usable language analysis tool chains for six languages to make sense out
of unstructured text via Natural Language Processing. These chains will easily be
incorporated by SMEs in their workflow for applications such as Reputation Management
and Information Access. Furthermore, in the second year of the project the toolchain
will be adapted to process texts from the Tourist domain. To this purpose, the project
will also investigate how the performance of the OpeNER toolchains can be improved by
inter-relating the various layers of linguistic annotation with each other.

Acknowledgments

The research leading to these results has received funding from the European Research
Council under the European Union’s Seventh Framework Programme (FP/2007-2013)/ERC Grant
Agreement n. 296451.

References

Bosma, W., P. Vossen, A. Soroa, G. Rigau, M. Tesconi, A. Marchetti, M. Monachini, and
C. Aliprandi. 2009. KAF: a generic semantic annotation format. In Proceedings of the
Generative Lexicon (GL2009) Workshop on Semantic Annotation, Pisa, Italy.

Gretzel, Ulrike and Kyung Hyan Yoo. 2008. Use and impact of online travel reviews. In
Information and communication technologies in tourism 2008. Springer, pages 35-46.

Hu, M. and B. Liu. 2004. Mining and summarizing customer reviews. In Proceedings of the
tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pages
168-177.

Lee, Heeyoung, Angel Chang, Yves Peirsman, Nathanael Chambers, Mihai Surdeanu, and Dan
Jurafsky. 2013. Deterministic coreference resolution based on entity-centric,
precision-ranked rules. Computational Linguistics, pages 1-54, January.

Mendes, P. N., M. Jakob, A. García-Silva, and C. Bizer. 2011. DBpedia Spotlight:
Shedding light on the web of documents. In Proceedings of the 7th International
Conference on Semantic Systems, pages 1-8.

Pang, B. and L. Lee. 2008. Opinion mining and sentiment analysis. Foundations and Trends
in Information Retrieval, 2(1-2):1-135.
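
Returning to the dictionary-based lemmatisation mentioned in Section 4, a lookup keyed on word form and POS tag could look roughly as follows. This is only a sketch: the dictionary entries, the Penn-style tag names and the fallback to the lower-cased form are illustrative assumptions, not OpeNER resources or documented OpeNER behaviour.

```python
from typing import Dict, List, Tuple

# Hypothetical per-language dictionary keyed on (lower-cased word form, POS tag).
LEMMA_DICT: Dict[Tuple[str, str], str] = {
    ("rooms", "NNS"): "room",
    ("was", "VBD"): "be",
    ("left", "VBD"): "leave",   # verb reading
    ("left", "JJ"): "left",     # adjective reading
}


def lemmatise(tagged_tokens: List[Tuple[str, str]]) -> List[str]:
    """Return one lemma per (form, POS) pair via dictionary lookup.

    Unknown pairs fall back to the lower-cased form (an assumption of this sketch).
    """
    lemmas = []
    for form, pos in tagged_tokens:
        key = (form.lower(), pos)
        lemmas.append(LEMMA_DICT.get(key, form.lower()))
    return lemmas


if __name__ == "__main__":
    print(lemmatise([("She", "PRP"), ("left", "VBD"), ("the", "DT"),
                     ("left", "JJ"), ("rooms", "NNS")]))
```

Keying on the POS tag as well as the form is what lets the two readings of an ambiguous form such as "left" map to different lemmas.
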