
Suppose D is a textual document, and

K = < k1, ..., kN >

represents a set of terms contained in the document. For instance:

D = "What a wonderful day, isn't it?"
K = <"wonderful","day">

My objective is to see if document D talks about all the words in K as a whole. For instance:

D = "The Ebola in Africa is spreading at high speed"
K = <"Ebola","Africa">

is a case in which D is strongly related to K, while:

D = "NEWS 1: Ebola is a dangerous disease that is causing thousands of deaths. Many governments are taking precautions to prevent its spread. NEWS 2: population in Africa is increasing."
K = <"Ebola","Africa">

is a case in which D is not related to K, since "Ebola" and "Africa" are mentioned at different points of the document, in separate sentences, and are not related to each other.

How can I synthesize this concept of "relatedness" of D to K? Is there some technique in the state of the art which can be exploited?

Thanks.

2 Answers

3

A vector space model is probably what you're looking for.

You could turn D into the same format as K, a list of words, e.g. <"What", "a", "wonderful", "day", "isn't", "it">. This step is performed by a tokenizer.

After this you could remove words that carry little meaning on their own, like "and", "the", "it" etc. These are called stop words, and they are stored in a stop list.

You should also convert all words to lower case (or even upper case), so that "What" and "what" are not classed as different words.

After this, the document can be expressed as a list of words and their frequencies, i.e. a term-frequency vector (take a look at an inverted index).

Finally, calculate the cosine similarity between the document (D) and the query (K), both expressed as term-frequency vectors.
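The steps above could be sketched roughly as follows in Python. This is only an illustration: the `STOP_WORDS` set is a tiny made-up stop list, the regex tokenizer is a simplification, and the function names `tokenize` and `cosine_similarity` are my own, not from any library.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop list (an assumption; real stop lists are much longer).
STOP_WORDS = {"a", "an", "and", "the", "it", "is", "in", "at", "of", "what", "isn't"}

def tokenize(text):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def cosine_similarity(doc_tokens, query_tokens):
    """Cosine of the angle between two term-frequency vectors."""
    doc_tf, query_tf = Counter(doc_tokens), Counter(query_tokens)
    # Counter returns 0 for missing terms, so only query terms contribute.
    dot = sum(doc_tf[t] * query_tf[t] for t in query_tf)
    doc_norm = math.sqrt(sum(v * v for v in doc_tf.values()))
    query_norm = math.sqrt(sum(v * v for v in query_tf.values()))
    norm = doc_norm * query_norm
    return dot / norm if norm else 0.0

D = "What a wonderful day, isn't it?"
K = ["wonderful", "day"]
score = cosine_similarity(tokenize(D), [k.lower() for k in K])
print(score)
```

With this stop list, D reduces to exactly the two terms in K, so the score comes out at (approximately) the maximum value of 1.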

1

There are two possible approaches to this problem: a simple one that applies only to this particular case, and a more general one.

Particular solution: I noticed that you have paragraph markers within your documents, namely "NEWS 1:", "NEWS 2:" and so on. You can treat the content within these markers as your indexing units, which will enable you to get retrieval scores for these paragraphs. As a post-retrieval step, you can compute a document-level retrieval score by aggregating (average or max) the individual paragraph scores.
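A minimal sketch of this idea, assuming the markers always look like "NEWS <number>:". The per-unit score used here (the fraction of query terms found in the unit) is just a stand-in for whatever retrieval score your engine returns, and all function names are hypothetical.

```python
import re

def split_units(document):
    """Split a document on 'NEWS <n>:' markers (the marker pattern is an assumption)."""
    parts = re.split(r"NEWS\s+\d+\s*:", document)
    return [p.strip() for p in parts if p.strip()]

def unit_score(unit, terms):
    """Fraction of query terms occurring in the unit (stand-in for a real retrieval score)."""
    words = set(re.findall(r"[a-z]+", unit.lower()))
    return sum(t.lower() in words for t in terms) / len(terms)

def document_score(document, terms, aggregate=max):
    """Aggregate the per-unit scores into one document-level score."""
    scores = [unit_score(u, terms) for u in split_units(document)]
    return aggregate(scores) if scores else 0.0

D = ("NEWS 1: Ebola is a dangerous disease that is causing thousands of deaths. "
     "Many governments are taking precautions to prevent its spread. "
     "NEWS 2: population in Africa is increasing.")
K = ["Ebola", "Africa"]
print(document_score(D, K))
```

Here no single unit contains both terms, so the document scores lower than a document whose single unit mentions both "Ebola" and "Africa". To aggregate by average instead of max, pass something like `aggregate=lambda s: sum(s) / len(s)`.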

General solution:

Consider the proximity between query terms. If a document is about the Ebola disease in Africa, you are more likely to find the terms Ebola and Africa in close proximity than far apart. Lucene supports positional indexing, and these positions can be used in the retrieval score computation with the help of a proximity-aware query parser.

This is something that web search engines use extensively.
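To illustrate the proximity idea outside of Lucene, here is a small sketch that computes the length (in tokens) of the shortest window of the document containing every query term. This is not how Lucene scores proximity internally; the function name `min_span` and the simple regex tokenization are assumptions made for the example.

```python
import re

def min_span(document, terms):
    """Length in tokens of the shortest window containing every term,
    or None if at least one term does not occur in the document."""
    tokens = re.findall(r"[a-z]+", document.lower())
    wanted = {t.lower() for t in terms}
    last_seen = {}  # most recent position of each query term
    best = None
    for i, tok in enumerate(tokens):
        if tok in wanted:
            last_seen[tok] = i
            if len(last_seen) == len(wanted):
                # Shortest window ending at i that covers all terms.
                span = i - min(last_seen.values()) + 1
                if best is None or span < best:
                    best = span
    return best

K = ["Ebola", "Africa"]
near = min_span("The Ebola in Africa is spreading at high speed", K)
far = min_span(
    "NEWS 1: Ebola is a dangerous disease that is causing thousands of deaths. "
    "Many governments are taking precautions to prevent its spread. "
    "NEWS 2: population in Africa is increasing.", K)
print(near, far)
```

The first document yields a much smaller span than the second, which matches the intuition that its terms are related. A score such as `1 / span` could then be combined with the ordinary retrieval score.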
