Keyword Extraction From A Single Document Using Word Co-Occurrence Statistical Information
Figure 1: Co-occurrence probability distribution of the terms “kind”, “make”, and frequent terms.

conditional”. A general term such as “kind” or “make” is used relatively impartially with each frequent term, while a term such as “imitation” or “digital computer” co-occurs especially with particular terms. These biases derive from semantic, lexical, or other relations between two terms. Thus, a term with co-occurrence biases may have an important meaning in a document. In this example, “imitation” and “digital computer” are important terms, as we all know: in this paper, Turing proposed an “imitation game” to replace the question “Can machines think?”

Therefore, the degree of co-occurrence bias can be used as a surrogate for term importance. However, if term frequency is small, the degree of bias is not reliable. For example, assume term w1 appears only once and co-occurs only with term a, once (probability 1.0). At the other extreme, assume term w2 appears 100 times and co-occurs only with term a, 100 times (probability 1.0). Intuitively, w2 seems more reliably biased. In order to evaluate the statistical significance of biases, we use the χ² test.

If χ²(w) > χ²_α, the null hypothesis is rejected with significance level α. The term n_w p_g represents the expected frequency of co-occurrence, and (freq(w, g) − n_w p_g) represents the difference between the expected and observed frequencies. Therefore, a large χ²(w) indicates that the co-occurrence of term w shows strong bias. In this paper, we use the χ² measure as an index of bias, not for tests of hypotheses.

Table 3 shows terms with high χ² values and terms with low χ² values in Turing’s paper. Generally, terms with large χ² are relatively important in the document, while terms with small χ² are relatively trivial.

In summary, our algorithm first extracts frequent terms as a “standard”; it then extracts terms with high deviation from the standard as keywords.

Algorithm Description and Improvement

This section gives a precise description of the algorithm and presents improvements based on preliminary experiments.

Calculation of χ² values

A document consists of sentences of various lengths. If a term appears in a long sentence, it is likely to co-occur with
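As an illustration of the χ² bias measure summarized above, the sketch below is a minimal, hypothetical implementation, not the authors’ code. It assumes the score takes the form χ²(w) = Σ_g (freq(w, g) − n_w p_g)² / (n_w p_g) over the set G of frequent terms, with sentence-level co-occurrence counts, n_w taken as the total co-occurrences of w with terms in G, and p_g estimated as g’s share of those co-occurrences; the function and variable names are invented for this example.

```python
from collections import Counter

def chi2_bias(sentences, frequent_terms):
    """Chi-squared co-occurrence bias scores (illustrative sketch).

    sentences: list of tokenized sentences (lists of terms)
    frequent_terms: the set G of frequent terms used as the "standard"
    """
    G = set(frequent_terms)
    cooc = Counter()  # freq(w, g): sentence-level co-occurrence counts
    n = Counter()     # n_w: total co-occurrences of w with terms in G
    for sent in sentences:
        terms = set(sent)
        for w in terms:
            for g in terms & G:
                if w != g:
                    cooc[(w, g)] += 1
                    n[w] += 1
    # Assumed estimate of p_g: g's share of all co-occurrences involving G.
    total = sum(n[g] for g in G) or 1
    p = {g: n[g] / total for g in G}
    scores = {}
    for w in n:
        s = 0.0
        for g in G:
            if w == g:
                continue
            expected = n[w] * p[g]  # n_w * p_g
            if expected > 0:
                s += (cooc[(w, g)] - expected) ** 2 / expected
        scores[w] = s
    return scores
```

On a toy corpus, a term that always co-occurs with one particular frequent term receives a high score, while a term spread evenly across the frequent terms scores near zero, matching the intuition about biased versus impartial co-occurrence.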