
Keyword Extraction from a Single Document
using Word Co-occurrence Statistical Information

Yutaka Matsuo
National Institute of Advanced Industrial Science and Technology
Aomi 2-41-6, Koto-ku, Tokyo 135-0064, Japan
[email protected]

Mitsuru Ishizuka
University of Tokyo
Hongo 7-3-1, Bunkyo-ku, Tokyo 113-8656, Japan
[email protected]

Abstract

We present a new keyword extraction algorithm that applies to a single document without using a corpus. Frequent terms are extracted first; then a set of co-occurrences between each term and the frequent terms, i.e., occurrences in the same sentences, is generated. The co-occurrence distribution shows the importance of a term in the document as follows: if the probability distribution of co-occurrence between term a and the frequent terms is biased toward a particular subset of the frequent terms, then term a is likely to be a keyword. The degree of bias of the distribution is measured by the χ²-measure. Our algorithm shows performance comparable to tfidf without using a corpus.

Introduction

Keyword extraction [1] is an important technique for document retrieval, Web page retrieval, document clustering, summarization, text mining, and so on. By extracting appropriate keywords, we can easily choose which document to read or learn the relations among documents. A popular algorithm for indexing is the tfidf measure, which extracts keywords that appear frequently in a document but do not appear frequently in the remainder of the corpus.

Recently, numerous documents have been made available electronically. Domain-independent keyword extraction, which does not require a large corpus, has many applications. For example, if one encounters a new Web page, one might like to know its contents quickly by some means, e.g., highlighted keywords. If one wants to know the main assertion of a paper at hand, one would want to have some keywords. In these cases, keyword extraction without a corpus of the same kind of documents is very useful. Word count (Luhn 1957) is sometimes sufficient for a document overview; however, a more powerful tool is desirable.

This paper explains a keyword extraction algorithm based solely on a single document. First, frequent terms [2] are extracted. Co-occurrences of a term and the frequent terms are counted. If a term appears selectively with a particular subset of the frequent terms, the term is likely to have an important meaning. The degree of bias of the co-occurrence distribution is measured by the χ²-measure. We show that our keyword extraction performs well without the need for a corpus.

This paper is organized as follows. The next section describes our main idea of keyword extraction. We then detail the algorithm, followed by evaluation and discussion. Finally, we summarize our contributions and conclude the paper.

[1] The term "keyword extraction" is used in the context of text mining, for example (Rajman & Besancon 1998). A comparable research topic is called "automatic term recognition" in the context of computational linguistics and "automatic indexing" or "automatic keyword extraction" in the information retrieval research field.
[2] A term is a word or a word sequence. (We do not intend to limit the meaning in a terminological sense.) A word sequence is written as a phrase.

Copyright (c) 2003, American Association for Artificial Intelligence (www.aaai.org). All rights reserved.

Term Co-occurrence and Importance

A document consists of sentences. In this paper, a sentence is considered to be a set of words separated by a stop mark (".", "?" or "!"). Moreover, it includes the title of the document, section titles, and captions. Two terms in a sentence are considered to co-occur once; that is, we see each sentence as a "basket," ignoring term order and grammatical information except when extracting word sequences.

We can obtain frequent terms by counting term frequencies. Let us take the very famous paper by Alan Turing (Turing 1950) as an example. Table 1 shows the top ten frequent terms (denoted as G) and their probabilities of occurrence, normalized so that the sum is 1 (i.e., normalized relative frequency). Next, a co-occurrence matrix is obtained by counting the frequencies of pairwise term co-occurrence, as shown in Table 2. For example, term a and term b co-occur in 30 sentences in the document. Let N denote the number of different terms in the document. While the term co-occurrence matrix is an N × N symmetric matrix, Table 2 shows only a part of the whole, an N × 10 matrix. We do not define the diagonal components here.

Assuming that term w appears independently of the frequent terms G, the distribution of co-occurrence of term w and the frequent terms is similar to the unconditional distribution of occurrence of the frequent terms shown in Table 1. Conversely, if term w has a semantic relation with a particular set of terms g ∈ G, the co-occurrence of term w and g is greater than expected, and the distribution becomes biased.
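As a concrete illustration of the "basket" counting described above, the sketch below builds sentence baskets, the frequent-term set G, and sentence-level co-occurrence counts like those in Tables 1 and 2. It is only a rough sketch: the whitespace tokenization and the function names are my own assumptions, and the stemming, stop-word removal, and phrase extraction used in the paper's actual algorithm (described later) are omitted here.

```python
import re
from collections import Counter

def sentence_baskets(text):
    """Split the document at stop marks ('.', '?', '!') into sentence 'baskets' of terms."""
    return [s.split() for s in re.split(r"[.?!]", text.lower()) if s.split()]

def frequent_terms(baskets, k=10):
    """The top-k frequent terms, denoted G in the paper (cf. Table 1)."""
    counts = Counter(t for b in baskets for t in b)
    return [t for t, _ in counts.most_common(k)]

def cooccurrence_counts(baskets, G):
    """freq[w][g]: number of sentences in which term w co-occurs with frequent term g (cf. Table 2)."""
    G = set(G)
    freq = {}
    for b in baskets:
        terms = set(b)                      # two terms in a sentence co-occur at most once
        for w in terms:
            row = freq.setdefault(w, Counter())
            for g in terms & G:
                if g != w:
                    row[g] += 1
    return freq
```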



Table 1: Frequency and probability distribution.

Frequent term   a      b      c      d      e      f      g      h      i      j      Total
Frequency       203    63     44     44     39     36     35     33     30     28     555
Probability     0.366  0.114  0.079  0.079  0.070  0.065  0.063  0.059  0.054  0.050  1.0

a: machine, b: computer, c: question, d: digital, e: answer, f: game, g: argument, h: make, i: state, j: number

Table 2: A co-occurrence matrix.

      a    b    c    d    e    f    g    h    i    j   Total
a     -   30   26   19   18   12   12   17   22    9   165
b    30    -    5   50    6   11    1    3    2    3   111
c    26    5    -    4   23    7    0    2    0    0    67
d    19   50    4    -    3    7    1    1    0    4    89
e    18    6   23    3    -    7    1    2    1    0    61
f    12   11    7    7    7    -    2    4    0    0    50
g    12    1    0    1    1    2    -    5    1    0    23
h    17    3    2    1    2    4    5    -    0    0    34
i    22    2    0    0    1    0    1    0    -    7    33
j     9    3    0    4    0    0    0    0    7    -    23
...
u     6    5    5    3    3   18    2    2    1    0    45
v    13   40    4   35    3    6    1    0    0    2   104
w    11    2    2    1    1    0    1    4    0    0    22
x    17    3    2    1    2    4    5    0    0    0    34

u: imitation, v: digital computer, w: kind, x: make
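The comparison illustrated in Figures 1 and 2 below can be derived directly from these counts: a term's normalized co-occurrence distribution over the frequent terms is set against the unconditional distribution of Table 1. A minimal sketch, reusing the hypothetical baskets, G, and freq from the previous sketch:

```python
from collections import Counter

def unconditional_distribution(baskets, G):
    """p_g: relative frequency of each frequent term, normalized to sum to 1 (cf. Table 1)."""
    G_set = set(G)
    counts = Counter(t for b in baskets for t in b if t in G_set)
    total = sum(counts.values())
    return {g: counts[g] / total for g in G}

def cooccurrence_distribution(freq, w, G):
    """Normalized distribution of term w's co-occurrence with the frequent terms,
    the kind of curve plotted against 'unconditional' in Figures 1 and 2."""
    row = freq.get(w, {})
    n_w = sum(row.get(g, 0) for g in G)
    return {g: (row.get(g, 0) / n_w if n_w else 0.0) for g in G}
```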
Figure 1: Co-occurrence probability distribution of the terms "kind", "make", and the frequent terms.

Figure 2: Co-occurrence probability distribution of the terms "imitation", "digital computer", and the frequent terms.

Figures 1 and 2 show the co-occurrence probability distributions of several terms and the frequent terms. In the figures, the unconditional distribution of the frequent terms is shown as "unconditional". A general term such as "kind" or "make" is used relatively impartially with each frequent term, while a term such as "imitation" or "digital computer" co-occurs especially with particular terms. These biases are derived from semantic, lexical, or other relations between two terms. Thus, a term with co-occurrence biases may have an important meaning in a document. In this example, "imitation" and "digital computer" are important terms, as we all know: in this paper, Turing proposed an "imitation game" to replace the question "Can machines think?"

Therefore, the degree of bias of co-occurrence can be used as a surrogate of term importance. However, if term frequency is small, the degree of bias is not reliable. For example, assume term w1 appears only once and co-occurs only with term a, once (probability 1.0). At the other extreme, assume term w2 appears 100 times and co-occurs only with term a, 100 times (again with probability 1.0). Intuitively, w2 seems more reliably biased. In order to evaluate the statistical significance of biases, we use the χ² test, which is very common for evaluating biases between expected frequencies and observed frequencies. For each term, the frequency of co-occurrence with the frequent terms is regarded as a sample value; the null hypothesis is that "occurrence of the frequent terms G is independent of occurrence of term w," which we expect to reject.

We denote the unconditional probability of a frequent term g ∈ G as the expected probability p_g, and the total number of co-occurrences of term w and the frequent terms G as n_w. The frequency of co-occurrence of term w and term g is written as freq(w, g). The statistical value χ² is defined as

    \chi^2(w) = \sum_{g \in G} \frac{(\mathrm{freq}(w, g) - n_w p_g)^2}{n_w p_g}.        (1)

If χ²(w) > χ²_α, the null hypothesis is rejected with significance level α. The term n_w p_g represents the expected frequency of co-occurrence, and (freq(w, g) − n_w p_g) represents the difference between the expected and observed frequencies. Therefore, a large χ²(w) indicates that the co-occurrence of term w shows strong bias. In this paper, we use the χ²-measure as an index of bias, not for tests of hypotheses.

Table 3 shows terms with high χ² values and terms with low χ² values in Turing's paper. Generally, terms with large χ² are relatively important in the document; terms with small χ² are relatively trivial.

Table 3: Terms with high χ² value.

Rank   χ²      Term               Frequency
1      593.7   digital computer   31
2      179.3   imitation game     16
3      163.1   future              4
4      161.3   question           44
5      152.8   internal            3
6      143.5   answer             39
7      142.8   input signal        3
8      137.7   moment              2
9      130.7   play                8
10     123.0   output             15
...
553    0.8     Mr.                 2
554    0.8     sympathetic         2
555    0.7     leg                 2
556    0.7     chess               2
557    0.6     Pickwick            2
558    0.6     scan                2
559    0.3     worse               2
560    0.1     eye                 2

(We set the top ten frequent terms as G.)

In summary, our algorithm first extracts frequent terms as a "standard"; then it extracts terms with high deviation from the standard as keywords.
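Equation (1) can be computed directly from the quantities already introduced. Below is a minimal sketch, again with hypothetical helper names; it implements the plain measure of Equation (1), before the refinements described in the next section.

```python
def chi_square(freq, w, G, p):
    """Equation (1): bias of term w's co-occurrence with the frequent terms G.

    freq[w][g] -- number of sentences in which w co-occurs with g
    p[g]       -- unconditional (expected) probability of frequent term g
    """
    row = freq.get(w, {})
    n_w = sum(row.get(g, 0) for g in G)          # total co-occurrences of w with G
    if n_w == 0:
        return 0.0
    score = 0.0
    for g in G:
        expected = n_w * p[g]                    # expected co-occurrence frequency
        score += (row.get(g, 0) - expected) ** 2 / expected
    return score

# Candidate keywords ranked by the score (sketch):
# keywords = sorted(freq, key=lambda t: chi_square(freq, t, G, p), reverse=True)
```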



Algorithm Description and Improvement

This section details the precise algorithm description and improvements to the algorithm based on preliminary experiments.

Calculation of χ² values

A document consists of sentences of various lengths. If a term appears in a long sentence, it is likely to co-occur with many terms; if a term appears in a short sentence, it is less likely to co-occur with other terms. We therefore consider the length of each sentence and revise our definitions. We denote

- p_g as (the sum of the total number of terms in sentences where g appears) divided by (the total number of terms in the document),
- n_w as the total number of terms in sentences where w appears.

Again, n_w p_g represents the expected frequency of co-occurrence; however, its value is now more refined.

A term co-occurring with a particular term g ∈ G has a high χ² value. However, such terms are sometimes adjuncts of term g and not important terms. For example, in Table 3, the terms "future" and "internal" co-occur selectively with the frequent term "state," because these terms are used in the forms "future state" and "internal state." Though the χ² values for these terms are high, "future" and "internal" themselves are not important. If "state" were not a frequent term, the χ² values of these terms would diminish rapidly. We use the following function to measure the robustness of bias values; it subtracts the maximal term from the χ² value:

    \chi'^2(w) = \chi^2(w) - \max_{g \in G} \left\{ \frac{(\mathrm{freq}(w, g) - n_w p_g)^2}{n_w p_g} \right\}.        (2)

Clustering of Terms

A co-occurrence matrix is originally an N × N matrix, from which the columns corresponding to the frequent terms are extracted for calculation. We ignore the remaining columns, i.e., co-occurrence with low-frequency terms, because it is difficult to estimate precise probabilities of occurrence for low-frequency terms.

To improve the quality of the extracted keywords, it is very important to select a proper set of columns from the co-occurrence matrix. The set of columns is preferably orthogonal: if terms g1 and g2 appear together very often, the co-occurrence of terms w and g1 might imply the co-occurrence of w and g2. Thus, term w will have a high χ² value; this is very problematic. It is straightforward to extract an orthogonal set of columns; however, to prevent the matrix from becoming too sparse, we instead cluster terms (i.e., columns).

Many studies address term clustering. Two major approaches (Hofmann & Puzicha 1998) are:

Similarity-based clustering: If terms w1 and w2 have similar distributions of co-occurrence with other terms, w1 and w2 are considered to belong to the same cluster.
Pairwise clustering: If terms w1 and w2 co-occur frequently, w1 and w2 are considered to belong to the same cluster.

Table 4 shows an example of two (transposed) columns extracted from a co-occurrence matrix. Similarity-based clustering centers upon the boldface figures and pairwise clustering focuses on the italic figures.

Table 4: Two transposed columns.

     a    b    c    d    e    f    g    h    i    j   ...
c   26    5    -    4   23    7    0    2    0    0   ...
e   18    6   23    3    -    7    1    2    1    0   ...

By similarity-based clustering, terms with the same role, e.g., "Monday," "Tuesday," ..., or "build," "establish," and "found," are clustered (Pereira, Tishby, & Lee 1993). In our preliminary experiment, when applied to a single document, similarity-based clustering groups paraphrases, as well as a phrase and its component (e.g., "digital computer" and "computer"). The similarity of two distributions is measured statistically by the Kullback-Leibler divergence or the Jensen-Shannon divergence (Dagan, Lee, & Pereira 1999). On the other hand, pairwise clustering yields relevant terms in the same cluster, such as "doctor," "nurse," and "hospital" (Tanaka & Iwasaki 1996). The frequency of co-occurrence or mutual information can be used to measure the degree of relevance (Church & Hanks 1990; Dunning 1993).

Our algorithm uses both types of clustering. First we cluster terms by a similarity measure (using Jensen-Shannon divergence); subsequently, we apply pairwise clustering (using mutual information). Table 5 shows an example of term clustering. Proper clustering of the frequent terms results in an appropriate χ² value for each term. [3]

[3] Here we do not take the size of the clusters into account. Balancing the clusters may improve the algorithm's performance.

Table 5: Clustering of the top 49 frequent terms.

C1: game, imitation, imitation game, play, programme
C2: system, rules, result, important
C3: computer, digital, digital computer
C4: behaviour, random, law
C5: capacity, storage
C6: question, answer
...
C26: human
C27: state
C28: learn

Below, co-occurrence of a term and a cluster means co-occurrence of the term and any term in the cluster.
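The two-stage clustering described above might be sketched as follows. The thresholds are the ones given in the next subsection (0.95 × log 2 for the Jensen-Shannon criterion and log 2.0 for mutual information). Note that the paper states the Jensen-Shannon criterion as "divergence above the threshold"; depending on the exact formulation the authors use, that may correspond to a similarity above 0.95 × log 2, i.e., a standard JS divergence below 0.05 × log 2, which is the interpretation taken here. The helper names and the pairwise test are my own assumptions, not the authors' implementation.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence between two co-occurrence distributions
    (dicts mapping frequent terms to probabilities)."""
    def kl(a, m):
        return sum(a[k] * math.log(a[k] / m[k]) for k in a if a[k] > 0)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in set(p) | set(q)}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def pointwise_mi(n_both, n1, n2, n_sentences):
    """Pointwise mutual information of two terms from sentence-level counts:
    n_both sentences contain both terms; n1 and n2 contain each term."""
    if n_both == 0:
        return float("-inf")
    return math.log(n_both * n_sentences / (n1 * n2))

def in_same_cluster(dist1, dist2, n_both, n1, n2, n_sentences,
                    js_threshold=0.05 * math.log(2), mi_threshold=math.log(2.0)):
    """Merge two frequent terms if their co-occurrence distributions are very close
    (similarity-based clustering) or if they co-occur strongly (pairwise clustering)."""
    return (js_divergence(dist1, dist2) < js_threshold
            or pointwise_mi(n_both, n1, n2, n_sentences) > mi_threshold)
```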



Algorithm

The algorithm is as follows; the thresholds were determined by preliminary experiments.

1. Preprocessing: Stem words by the Porter algorithm (Porter 1980) and extract phrases based on the APRIORI algorithm (Fürnkranz 1998). Discard stop words included in the stop list used in the SMART system (Salton 1988). [4]

2. Selection of frequent terms: Select the top frequent terms up to 30% of the number of running terms, N_total.

3. Clustering of frequent terms: Cluster a pair of terms whose Jensen-Shannon divergence is above the threshold (0.95 × log 2). Cluster a pair of terms whose mutual information is above the threshold (log 2.0). The obtained clusters are denoted as C.

4. Calculation of expected probability: Count the number of terms co-occurring with c ∈ C, denoted as n_c, to yield the expected probability p_c = n_c / N_total.

5. Calculation of χ² value: For each term w, count the co-occurrence frequency with c ∈ C, denoted as freq(w, c). Count the total number of terms in the sentences including w, denoted as n_w. Calculate the χ² value following (2).

6. Output keywords: Show a given number of terms having the largest χ² values.

[4] In this paper, we use both nouns and verbs, because verbs or verb+noun combinations are sometimes important for illustrating the content of a document. Of course, we could apply our algorithm to nouns only.

Table 6 shows the result for Turing's paper. Important terms are extracted regardless of their frequencies.

Table 6: Improved results of terms with high χ² value.

Rank   χ²      Term                     Frequency
1      380.4   digital computer          63
2      259.7   storage capacity          11
3      202.5   imitation game            16
4      174.4   machine                  203
5      132.2   human mind                 2
6      94.1    universality               6
7      93.7    logic                     10
8      82.0    property                  11
9      77.1    mimic                      7
10     77.0    discrete-state machine    17
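Steps 4 and 5, combined with Equation (2), might be sketched as follows. This is not the authors' C++ implementation: the helper names are mine, preprocessing and clustering are taken as given, and the exact counting conventions (e.g., whether freq(w, c) counts sentences or term tokens) follow my reading of the paper.

```python
def chi_square_robust(sentences, w, clusters, n_total):
    """Sketch of steps 4-5 and Equation (2).

    sentences -- list of term lists, one per sentence (already stemmed, stop words removed)
    w         -- candidate term
    clusters  -- clusters of frequent terms, each a set of terms (the set C)
    n_total   -- total number of running terms in the document
    """
    # n_w: total number of terms in the sentences that include w (revised definition).
    sentences_with_w = [s for s in sentences if w in s]
    n_w = sum(len(s) for s in sentences_with_w)
    if n_w == 0:
        return 0.0

    contributions = []
    for c in clusters:
        # p_c = n_c / N_total, where n_c counts the terms in sentences where the cluster appears (step 4).
        n_c = sum(len(s) for s in sentences if c & set(s))
        p_c = n_c / n_total
        # freq(w, c): sentences in which w co-occurs with any term of cluster c.
        f = sum(1 for s in sentences_with_w if c & (set(s) - {w}))
        expected = n_w * p_c
        if expected > 0:
            contributions.append((f - expected) ** 2 / expected)

    if not contributions:
        return 0.0
    # Equation (2): subtract the maximal contribution for robustness.
    return sum(contributions) - max(contributions)
```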
Evaluation

For information retrieval, index terms are usually evaluated by their retrieval performance, namely recall and precision. However, we claim that our algorithm is useful when a corpus is not available, due to the cost or time required to collect documents, or in situations where document collection is infeasible.

Twenty authors of technical papers in artificial intelligence research participated in the experiment. For each author, we showed keywords of his/her paper extracted by tf (term frequency), tfidf [5], KeyGraph [6] (Ohsawa, Benson, & Yachida 1998), and our algorithm. All these methods are equally equipped with word stemming, elimination of stop words, and extraction of phrases. The top 15 terms from each method were extracted, gathered, and shuffled. The authors were then asked to check the terms which they considered important in the paper. [7] Precision is calculated as the ratio of the checked terms to the 15 terms derived by each method. Furthermore, the authors were asked to select five (or more) terms which they considered indispensable for the paper. Coverage of each method was calculated as the ratio of the indispensable terms included in its 15 terms to all the indispensable terms. [8]

[5] The corpus is 166 papers in JAIR (Journal of Artificial Intelligence Research) from Vol. 1 in 1993 to Vol. 14 in 2001. The idf is defined by log(D/df(w)) + 1, where D is the number of all documents and df(w) is the number of documents including w.
[6] This term-weighting algorithm, which has recently been used to analyze a variety of data in the context of Chance Discovery, requires only a single document.
[7] Keywords are sometimes attached to a paper; however, they are not defined in a consistent way. Therefore, we employ author-based evaluation.
[8] It is desirable to have the indispensable-term list beforehand. However, it is very demanding for authors to provide a keyword list without seeing a term list. In our experiment, we allowed authors to add any terms in the paper to the indispensable-term list (even if they were not derived by any of the methods).

Table 7: Precision and coverage for 20 technical papers.

                  tf     KeyGraph   ours   tfidf
Precision         0.53   0.42       0.51   0.55
Coverage          0.48   0.44       0.62   0.61
Frequency index   28.6   17.3       11.5   18.1

Table 8: Results with respect to phrases.

                        tf     KeyGraph   ours   tfidf
Ratio of phrases        0.11   0.14       0.33   0.33
Precision w/o phrases   0.42   0.36       0.42   0.45
Recall w/o phrases      0.39   0.36       0.46   0.54

Results are shown in Table 7. For each method, precision was around 0.5. However, coverage using our method exceeds that of tf and KeyGraph and is comparable to that of tfidf; both tf and tfidf select terms which appear frequently in the document (although tfidf also considers frequencies in other documents). Our method, on the other hand, can extract keywords even if they do not appear frequently. The frequency index in the table shows the average frequency of the top 15 terms: terms extracted by tf appear about 28.6 times on average, while terms extracted by our method appear only 11.5 times. Therefore, our method can detect "hidden" keywords. We can use the χ² value as a priority criterion for keywords, because the precision of the top 10 terms by our method is 0.52, that of the top 5 is 0.60, and that of the top 2 is as high as 0.72. Though our method detects keywords consisting of two or more words well, it remains nearly comparable to tfidf even if such phrases are discarded, as shown in Table 8.

The computational time of our method is shown in Figure 3. The system is implemented in C++ on a Linux machine with a Celeron 333 MHz CPU. Computational time increases approximately linearly with respect to the number of terms; the process completes in a few seconds if the given number of terms is less than 20,000.

Figure 3: Number of total terms and computational time (time in seconds versus the number of terms in a document).
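For reference, the precision and coverage figures reported in Table 7 amount to the following simple computation per method and per paper; this is a sketch, and the variable names are mine rather than the authors'.

```python
def precision_and_coverage(top15, checked, indispensable):
    """Author-based evaluation of one method on one paper (cf. Table 7).

    top15         -- the 15 terms the method extracted
    checked       -- terms the author marked as important
    indispensable -- terms the author marked as indispensable
    """
    precision = len(set(top15) & set(checked)) / len(top15)
    coverage = len(set(indispensable) & set(top15)) / len(indispensable)
    return precision, coverage
```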



Discussion and Related Works

Co-occurrence has long attracted interest in computational linguistics. Pereira, Tishby, & Lee (1993) clustered terms according to their distribution in particular syntactic contexts. Tanaka & Iwasaki (1996) use co-occurrence matrices of two languages to translate an ambiguous term. From a probabilistic point of view, Dagan, Lee, & Pereira (1999) describe a method for estimating the probability of previously unseen word combinations.

Weighting a term by occurrence dates back to the 1950s and the study by Luhn (1957). More elaborate measures of term occurrence have been developed (Sparck-Jones 1972; Noreault, McGill, & Koll 1977), essentially by counting term frequencies. Kageura & Umino (1996) summarized five groups of weighting measures: (i) a word which appears in a document is likely to be an index term; (ii) a word which appears frequently in a document is likely to be an index term; (iii) a word which appears only in a limited number of documents is likely to be an index term for these documents; (iv) a word which appears relatively more frequently in a document than in the whole database is likely to be an index term for that document; (v) a word which shows a specific distributional characteristic in the database is likely to be an index term for the database. Our algorithm corresponds to approach (v). Nagao, Mizutani, & Ikeda (1976) used the χ² value to calculate word weights using a corpus. Our method uses a co-occurrence matrix instead of a corpus, enabling keyword extraction using only the document itself.

In the context of text mining, discovering keywords or relations among keywords is an important topic (Feldman et al. 1998; Rajman & Besancon 1998). The general purpose of knowledge discovery is to extract implicit, previously unknown, and potentially useful information from data. Our algorithm can be considered a text mining tool in that it extracts important terms even if they are rare.

Conclusion

In this paper, we developed an algorithm to extract keywords from a single document. The main advantages of our method are its simplicity, requiring no corpus, and its high performance, comparable to tfidf. As more electronic documents become available, we believe our method will be useful in many applications, especially for domain-independent keyword extraction.

References

Church, K. W., and Hanks, P. 1990. Word association norms, mutual information, and lexicography. Computational Linguistics 16(1):22.
Dagan, I.; Lee, L.; and Pereira, F. 1999. Similarity-based models of word cooccurrence probabilities. Machine Learning 34(1):43.
Dunning, T. 1993. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics 19(1):61.
Feldman, R.; Fresko, M.; Kinar, Y.; Lindell, Y.; Liphstat, O.; Rajman, M.; Schler, Y.; and Zamir, O. 1998. Text mining at the term level. In Proceedings of the Second European Symposium on Principles of Data Mining and Knowledge Discovery, 65.
Fürnkranz, J. 1998. A study using n-gram features for text categorization. Technical Report OEFAI-TR-98-30, Austrian Research Institute for Artificial Intelligence.
Hofmann, T., and Puzicha, J. 1998. Statistical models for co-occurrence data. Technical Report AIM-1625, Massachusetts Institute of Technology.
Kageura, K., and Umino, B. 1996. Methods of automatic term recognition. Terminology 3(2):259.
Luhn, H. P. 1957. A statistical approach to mechanized encoding and searching of literary information. IBM Journal of Research and Development 1(4):390.
Nagao, M.; Mizutani, M.; and Ikeda, H. 1976. An automated method of the extraction of important words from Japanese scientific documents. Transactions of Information Processing Society of Japan 17(2):110.
Noreault, T.; McGill, M.; and Koll, M. B. 1977. A Performance Evaluation of Similarity Measure, Document Term Weighting Schemes and Representations in a Boolean Environment. London: Butterworths.
Ohsawa, Y.; Benson, N. E.; and Yachida, M. 1998. KeyGraph: Automatic indexing by co-occurrence graph based on building construction metaphor. In Proceedings of the Advanced Digital Library Conference.
Pereira, F.; Tishby, N.; and Lee, L. 1993. Distributional clustering of English words. In Proceedings of the 31st Meeting of the Association for Computational Linguistics, 183–190.
Porter, M. F. 1980. An algorithm for suffix stripping. Program 14(3):130.
Rajman, M., and Besancon, R. 1998. Text mining – knowledge extraction from unstructured textual data. In Proceedings of the 6th Conference of the International Federation of Classification Societies.
Salton, G. 1988. Automatic Text Processing. Addison-Wesley.
Sparck-Jones, K. 1972. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation 28(5):111.
Tanaka, K., and Iwasaki, H. 1996. Extraction of lexical translations from non-aligned corpora. In Proceedings of the 16th International Conference on Computational Linguistics, 580.
Turing, A. M. 1950. Computing machinery and intelligence. Mind 59:433.
