AI-KU: Using Co-Occurrence Modeling for Semantic Similarity
Osman Başkaya
Artificial Intelligence Laboratory
Koç University, Istanbul, Turkey
[email protected]
Abstract
In this paper, we describe our unsupervised method, submitted to the Cross-Level Semantic Similarity task in SemEval 2014, that computes the semantic similarity between two text fragments of different sizes. Our method models each text fragment using the co-occurrence statistics of either the words that occur in it or their likely substitutes. The co-occurrence modeling step provides a dense, low-dimensional embedding for each fragment, which allows us to calculate semantic similarity using various similarity metrics. Although our current model ignores syntactic information, we achieved promising results and outperformed all baselines.
1 Introduction
Semantic similarity is a measure that specifies the
similarity of one text’s meaning to another’s. Semantic similarity plays an important role in various Natural Language Processing (NLP) tasks such
as textual entailment (Berant et al., 2012), summarization (Lin and Hovy, 2003), question answering
(Surdeanu et al., 2011), text classification (Sebastiani, 2002), word sense disambiguation (Schütze,
1998) and information retrieval (Park et al., 2005).
There are three main approaches to computing
the semantic similarity between two text fragments.
The first approach uses Vector Space Models (see Turney and Pantel (2010) for an overview), where each text is represented as a bag of words.
The similarity between two text fragments can then
be computed with various metrics such as cosine
similarity. Sparsity of the input is the key problem for these models. Therefore, later works such as Latent Semantic Indexing and
Topic Models (Blei et al., 2003) overcome this problem by reducing the dimensionality of the model through latent variables. The second approach blends various lexical and syntactic
features and attacks the problem through machine
learning models. The third approach is based on
word-to-word similarity alignment (Pilehvar et al.,
2013; Islam and Inkpen, 2008).
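To make the first approach concrete, the following minimal sketch (illustrative only; it is not one of our submitted systems) represents two texts as bag-of-words count vectors and scores them with cosine similarity:

    # Bag-of-words vector space model with cosine similarity.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    texts = ["the dog bites the man", "a dog bit a man yesterday"]
    vectors = CountVectorizer().fit_transform(texts)  # sparse count vectors
    print(cosine_similarity(vectors[0], vectors[1])[0, 0])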
The Cross-Level Semantic Similarity (CLSS) task in SemEval 2014 (Jurgens et al., 2014; http://alt.qcri.org/semeval2014/task3/) provides an evaluation framework to assess similarity methods for texts of different volumes (i.e., lexical levels). Unlike previous SemEval and *SEM tasks, which compared texts of similar volume, this task consists of four subtasks (paragraph2sentence, sentence2phrase, phrase2word, and word2sense) that investigate the performance of systems on pairs of texts of different sizes.
A system should report the similarity score of a
given pair, ranging from 4 (two items have very
similar meanings and the most important ideas,
concepts, or actions in the larger text are represented in the smaller text) to 0 (two items do not
mean the same thing and are not on the same topic).
In this paper, we describe our two unsupervised
systems that are based on co-occurrence statistics
of words. The only difference between the two systems is the input they use. The first system uses the words directly (after lemmatization, stop-word removal, and the exclusion of non-alphanumeric characters), while the second system utilizes the most likely substitutes suggested by a 4-gram language model for each observed word position (i.e., each context). We participated in two of the subtasks: paragraph2sentence and sentence2phrase.
The remainder of the paper proceeds as follows. Section 2 explains the preprocessing step, the difference between the two systems, the co-occurrence modeling, and how we calculate the similarity between two texts once co-occurrence modeling has been done. Section 3 discusses the results of our systems and compares them with those of the other participants. Section 4 discusses the findings and concludes with plans for future work.

2 Algorithm

This section explains the preprocessing steps applied to the data and the details of our two systems. The code to replicate our work can be found at https://github.com/osmanbaskaya/semeval14-task3. Both systems rely on co-occurrence statistics. The slight difference between the two is that the first uses the words that occur in the given text fragment (e.g., a paragraph or a sentence), whereas the second employs co-occurrence statistics on 100 substitute samples for each word within the given text fragment.

2.1 Data Preprocessing

The two AI-KU systems can be distinguished by their inputs. One uses the raw input words, whereas the other uses the words' likely substitutes according to a language model.

AI-KU1: This system uses the words that occur in the text. All words are transformed into their lowercase equivalents; lemmatization (carried out with Stanford CoreNLP, which transforms a word into its canonical or base form) and stop-word removal are performed, and non-alphanumeric characters are excluded. Table 1 displays the pairs for the following sentence, an instance from the paragraph2sentence test set:

    "Choosing what to buy with a $35 gift card is a hard decision."

    Type-ID    Lemma
    Sent-33    choose
    Sent-33    buy
    Sent-33    gift
    Sent-33    card
    Sent-33    hard
    Sent-33    decision

Table 1: Instance id-word pairs for a given sentence.

Note that the input we used to model co-occurrence statistics consists of all such pairs for each fragment in a given subtask.
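A minimal sketch of this preprocessing is given below. It assumes NLTK stand-ins for lemmatization and the stop-word list (we actually used Stanford CoreNLP), so its output approximates, but does not exactly reproduce, Table 1.

    # Sketch of AI-KU1 preprocessing: lowercase, strip non-alphanumeric
    # characters, drop stop words, lemmatize, and emit (instance-id, lemma)
    # pairs.  NLTK here is a stand-in for the Stanford CoreNLP lemmatizer.
    import re
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer

    STOP = set(stopwords.words("english"))
    LEMMATIZER = WordNetLemmatizer()

    def fragment_pairs(instance_id, text):
        pairs = []
        for token in text.lower().split():
            token = re.sub(r"[^a-z0-9]", "", token)  # drop non-alphanumerics
            if token and token not in STOP:
                # pos="v" crudely handles verb forms such as "choosing"
                pairs.append((instance_id, LEMMATIZER.lemmatize(token, pos="v")))
        return pairs

    print(fragment_pairs("Sent-33",
                         "Choosing what to buy with a $35 gift card is a hard decision."))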
AI-KU2: Previously, the utilization of high-probability substitutes and their co-occurrence statistics achieved notable performance on the Word Sense Induction (WSI) (Baskaya et al., 2013) and Part-of-Speech Induction (Yatbaz et al., 2012) problems. AI-KU2 represents each context of a word by the 100 most likely substitutes suggested by a 4-gram language model we built from ukWaC (Ferraresi et al., 2008), a 2-billion-word web-gathered corpus available at http://wacky.sslmit.unibo.it. Since the S-CODE algorithm works with discrete input, for each context we sample 100 substitute words with replacement using their probabilities. Table 2 illustrates the contexts and the substitutes of each context using a bigram language model. No lemmatization, stop-word removal, or lowercase transformation was performed.
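The sampling step can be sketched as follows; the substitute distribution below is a toy bigram example in the spirit of Table 2, not the output of our actual 4-gram model.

    # Sketch of AI-KU2 input generation: for each word position, sample 100
    # substitutes with replacement from the language model's substitute
    # distribution, and pair each sample with the instance id.
    import random

    def substitute_pairs(instance_id, substitute_dists, k=100, seed=42):
        rng = random.Random(seed)
        pairs = []
        for dist in substitute_dists:  # one distribution per word position
            words, probs = zip(*dist)
            pairs.extend((instance_id, s)
                         for s in rng.choices(words, weights=probs, k=k))
        return pairs

    # Toy distribution for a single position (cf. the "dog" row of Table 2).
    dists = [[("cat", 0.007), ("dog", 0.005), ("animal", 0.002), ("wolve", 0.001)]]
    print(substitute_pairs("Sent-33", dists, k=5))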
2.2 Co-Occurrence Modeling

This subsection explains the unsupervised method we employed to model co-occurrence statistics: the Co-occurrence Data Embedding (CODE) method (Globerson et al., 2007) and its spherical extension (S-CODE) proposed by Maron et al. (2010). Unlike in our WSI work, where the co-occurrence modeling step produced an embedding for each word, in this task we model each text unit, such as a paragraph, a sentence, or a phrase, to obtain an embedding for each instance.

The input data for the S-CODE algorithm consist of instance-id and word pairs for each word in the text unit in the first system (Table 1 illustrates the pairs for a single text fragment), and of instance-ids paired with the 100 substitute samples of each word in the text in the second system. In the initial step, S-CODE places all instance-ids and words (or substitutes, depending on the system) randomly on an n-dimensional sphere. If two different instances have the same word or substitute, these two instances attract one another; otherwise, they repel each other. When S-CODE converges, instances that have similar words or substitutes are located close to one another, and the rest are distant from each other.

AI-KU1: Based on the training-set performance for various n (i.e., the number of dimensions in the S-CODE algorithm), we picked n = 100 for both subtasks.

AI-KU2: We picked n to be 200 and 100 for the paragraph2sentence and sentence2phrase subtasks, respectively.
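S-CODE optimizes a particular likelihood over co-occurrence data (see Globerson et al. (2007) and Maron et al. (2010) for the actual objective); the toy sketch below is not that algorithm, only an illustration of the attract/repel intuition on the unit sphere, with made-up pairs and update rules.

    # Toy illustration of the attract/repel intuition behind S-CODE (not the
    # actual algorithm): instance-ids and words start at random points on an
    # n-dimensional unit sphere; each observed (instance, word) pair pulls
    # the two points together, and a randomly drawn word pushes the instance
    # away.
    import numpy as np

    def unit(v):
        return v / np.linalg.norm(v)

    def toy_sphere_embed(pairs, n=10, lr=0.05, epochs=100, seed=0):
        rng = np.random.default_rng(seed)
        words = sorted({w for _, w in pairs})
        X = {i: unit(rng.standard_normal(n)) for i, _ in pairs}
        Y = {w: unit(rng.standard_normal(n)) for w in words}
        for _ in range(epochs):
            for i, w in pairs:
                X[i] = unit(X[i] + lr * Y[w])            # attract observed pair
                Y[w] = unit(Y[w] + lr * X[i])
                noise = words[rng.integers(len(words))]
                X[i] = unit(X[i] - 0.2 * lr * Y[noise])  # repel a random word
        return X, Y

    # Instances that share words end up nearby on the sphere:
    X, _ = toy_sphere_embed([("s1", "dog"), ("s1", "bite"), ("s2", "dog"),
                             ("s2", "bite"), ("s3", "gift"), ("s3", "card")])
    print(X["s1"] @ X["s2"], X["s1"] @ X["s3"])  # the first value should be larger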
    Word     Context         Substitutes
    the      <s> __ dog      The (0.12), A (0.11), If (0.02), As (0.07), Stray (0.001), ..., w_n (0.02)
    dog      the __ bites    cat (0.007), dog (0.005), animal (0.002), wolve (0.001), ..., w_n (0.01)
    bites    dog __ .        runs (0.14), bites (0.13), catches (0.04), barks (0.001), ..., w_n (0.01)

Table 2: Contexts and substitute distributions when a bigram language model is used. The blank marks the target word's position; w_n and n denote an arbitrary word in the vocabulary and the vocabulary size, respectively.

    System    Pearson    Spearman
    AI-KU1    0.671      0.676
    AI-KU2    0.542      0.531
    LCS       0.499      0.602
    lch       0.584      0.596
    lin       0.568      0.562
    JI        0.613      0.644

Table 3: Paragraph2sentence subtask scores for the training data. Subscripts in the AI-KU systems specify the run number.
    System    Pearson    Spearman
    AI-KU1    0.607      0.568
    AI-KU2    0.620      0.579
    LCS       0.500      0.582
    lch       0.484      0.491
    lin       0.492      0.470
    JI        0.465      0.465

Table 4: Sentence2phrase subtask scores for the training data.
2.3 Similarity Calculation

When S-CODE converges, there is an n-dimensional embedding for each textual-level instance (e.g., paragraph, sentence, or phrase). We can use a similarity metric to calculate the similarity between these embeddings. For this task, systems should report only the similarity between the two instances in a given cross-level pair. We used cosine similarity; this score is the final similarity for two instances, and no further processing (e.g., scaling) was done.
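Concretely, the per-pair score is a single cosine between the two instance embeddings:

    # Similarity of a cross-level pair: cosine of the two instance embeddings.
    # On the unit sphere the norms are 1, so this reduces to a dot product.
    import numpy as np

    def pair_similarity(emb_a, emb_b):
        return float(np.dot(emb_a, emb_b) /
                     (np.linalg.norm(emb_a) * np.linalg.norm(emb_b)))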
3 Evaluation Results

Two correlation metrics were used to evaluate the systems in this task: Pearson correlation and Spearman's rank correlation. Pearson correlation tests the degree of similarity between the system's similarity ratings and the gold-standard ratings, while Spearman's rank correlation measures the degree of similarity between the two rankings that these ratings induce.
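Both metrics are standard; with illustrative ratings, they can be computed as follows:

    # Evaluating a system against gold ratings with the two task metrics.
    from scipy.stats import pearsonr, spearmanr

    gold = [4.0, 3.0, 2.5, 1.0, 0.0]     # illustrative gold ratings
    system = [3.6, 3.1, 2.0, 1.2, 0.3]   # illustrative system scores
    print(pearsonr(gold, system)[0], spearmanr(gold, system)[0])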
Since our method is unsupervised, we tried to enrich the data with ukWaC; however, this enrichment did not work well on the training data. Therefore, the reported scores were obtained using only the training and test data provided by the organizers.

Tables 3 and 4 show the scores for the paragraph2sentence and sentence2phrase subtasks on the training data, respectively. These tables contain the best individual scores for the performance metrics, the Normalized Longest Common Substring (LCS) baseline given by the task organizers, and three additional baselines: lin (Lin, 1998), lch (Leacock and Chodorow, 1998), and the Jaccard Index (JI). lin uses the information content (Resnik, 1995) of the least common subsumer of concepts A and B. Information content (IC) indicates the specificity of a concept; the least common subsumer of concepts A and B is the most specific concept from which both A and B are inherited. lin similarity is the ratio of twice the IC of the least common subsumer to the sum of the ICs of both concepts, i.e., lin = 2 * IC(lcs) / (IC(A) + IC(B)), where lcs denotes the least common subsumer of A and B. lch, in turn, is a score denoting how similar two concepts are, calculated using the shortest path that connects the concepts and the maximum depth of the taxonomy in which they occur; the exact formulation is -log(L/2d), where L is the shortest path length and d is the taxonomy depth (see Pedersen et al. (2004) for further details of these measures).

These two baselines were calculated as follows. First, we tagged the words across all textual levels using the Stanford Part-of-Speech Tagger (Toutanova and Manning, 2000). After tagging, we found the WordNet 3.0 (Miller and Fellbaum, 1998) synsets of each word that matched its part of speech. For each synset of a word in the shorter textual unit (e.g., a sentence is shorter than a paragraph), we calculated the lin/lch measure against every synset of every word in the longer textual unit and picked the highest score; we then averaged these scores over all words to obtain the similarity of a pair. Finally, the Jaccard Index baseline simply calculates the number of words two cross-level texts have in common (intersection), normalized by the total number of words (union). Tables 5 and 6 show the AI-KU runs on the test data.
    System      Pearson    Spearman
    Best        0.837      0.821
    2nd Best    0.834      0.820
    3rd Best    0.826      0.817
    AI-KU1      0.732      0.727
    AI-KU2      0.698      0.700
    LCS         0.527      0.613
    lch         0.629      0.627
    lin         0.612      0.601
    JI          0.640      0.687

Table 5: Paragraph2sentence subtask scores for the test data. Best indicates the best correlation score for the subtask. LCS stands for Normalized Longest Common Substring. Subscripts in the AI-KU systems specify the run number.

    System      Pearson    Spearman
    Best        0.777      0.642
    2nd Best    0.771      0.760
    3rd Best    0.760      0.757
    AI-KU1      0.680      0.646
    AI-KU2      0.617      0.612
    LCS         0.562      0.626
    lch         0.526      0.544
    lin         0.501      0.498
    JI          0.540      0.555

Table 6: Sentence2phrase subtask scores for the test data.
Paragraph2Sentence: Both systems outperformed all the baselines on both metrics. The best Pearson score in this subtask was 0.837, while our systems achieved 0.732 and 0.698; the Spearman scores are similar. These scores are promising, since our current unsupervised systems rely on a bag-of-words approach and do not utilize any syntactic information.

Sentence2Phrase: In this subtask, the AI-KU systems outperformed all the baselines, with the exception of AI-KU2, which performed slightly worse than LCS on the Spearman metric. The scores of both systems and baselines were lower than in the paragraph2sentence subtask, since smaller textual units (such as phrases) make the problem more difficult.

4 Conclusion

In this work, we introduced two unsupervised systems that utilize co-occurrence statistics and represent textual units as dense, low-dimensional embeddings. Although the current systems are based on a bag-of-words approach and discard syntactic information, they achieved promising results in both the paragraph2sentence and sentence2phrase subtasks. For future work, we will extend our algorithm by adding syntactic information (e.g., dependency-parsing output) to the co-occurrence modeling step.
References
Osman Baskaya, Enis Sert, Volkan Cirik, and Deniz Yuret. 2013. AI-KU: Using substitute vectors and co-occurrence modeling for word sense induction and disambiguation. In Proceedings of the Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Seventh International Workshop on Semantic Evaluation (SemEval 2013), pages 300–306.

Jonathan Berant, Ido Dagan, and Jacob Goldberger. 2012. Learning entailment relations by global graph structure optimization. Computational Linguistics, 38(1):73–111.

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent dirichlet allocation. The Journal of Machine Learning Research, 3:993–1022.

Adriano Ferraresi, Eros Zanchetta, Marco Baroni, and Silvia Bernardini. 2008. Introducing and evaluating ukWaC, a very large web-derived corpus of English. In Proceedings of the 4th Web as Corpus Workshop (WAC-4).

Amir Globerson, Gal Chechik, Fernando Pereira, and Naftali Tishby. 2007. Euclidean embedding of co-occurrence data. Journal of Machine Learning Research, 8(10).

Aminul Islam and Diana Inkpen. 2008. Semantic text similarity using corpus-based word similarity and string similarity. ACM Transactions on Knowledge Discovery from Data (TKDD), 2(2):10.

David Jurgens, Mohammad Taher Pilehvar, and Roberto Navigli. 2014. SemEval-2014 Task 3: Cross-level semantic similarity. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), Dublin, Ireland.

Claudia Leacock and Martin Chodorow. 1998. Combining local context and WordNet similarity for word sense identification. WordNet: An Electronic Lexical Database, 49(2):265–283.

Chin-Yew Lin and Eduard Hovy. 2003. Automatic evaluation of summaries using n-gram co-occurrence statistics. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Volume 1, pages 71–78.

Dekang Lin. 1998. An information-theoretic definition of similarity. In ICML, volume 98, pages 296–304.

Yariv Maron, Michael Lamar, and Elie Bienenstock. 2010. Sphere embedding: An application to part-of-speech induction. In Advances in Neural Information Processing Systems 23, pages 1567–1575.

George Miller and Christiane Fellbaum. 1998. WordNet: An electronic lexical database. MIT Press.

Eui-Kyu Park, Dong-Yul Ra, and Myung-Gil Jang. 2005. Techniques for improving web retrieval effectiveness. Information Processing & Management, 41(5):1207–1223.

Ted Pedersen, Siddharth Patwardhan, and Jason Michelizzi. 2004. WordNet::Similarity: Measuring the relatedness of concepts. In Demonstration Papers at HLT-NAACL 2004, pages 38–41.

Mohammad Taher Pilehvar, David Jurgens, and Roberto Navigli. 2013. Align, disambiguate and walk: A unified approach for measuring semantic similarity. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL 2013).

Philip Resnik. 1995. Using information content to evaluate semantic similarity in a taxonomy. arXiv preprint cmp-lg/9511007.

Hinrich Schütze. 1998. Automatic word sense discrimination. Computational Linguistics, 24(1):97–123.

Fabrizio Sebastiani. 2002. Machine learning in automated text categorization. ACM Computing Surveys (CSUR), 34(1):1–47.

Mihai Surdeanu, Massimiliano Ciaramita, and Hugo Zaragoza. 2011. Learning to rank answers to non-factoid questions from web collections. Computational Linguistics, 37(2):351–383.

Kristina Toutanova and Christopher D. Manning. 2000. Enriching the knowledge sources used in a maximum entropy part-of-speech tagger. In Proceedings of the 2000 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, pages 63–70.

Peter D. Turney and Patrick Pantel. 2010. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37:141–188.

Mehmet Ali Yatbaz, Enis Sert, and Deniz Yuret. 2012. Learning syntactic categories using paradigmatic representations of word context. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 940–951.