Information Retrieval Solutions Manual

Mpendulo Mamba | 105998416 | [email protected]
Chapter 1
Solutions for chapter 1 exercises
Exercise 1.1
Draw the inverted index that would be built for the following document collection. (See Figure 1.3 for an example.)
Doc 1: new home sales top forecasts
Doc 2: home sales rise in july
Doc 3: increase in home sales in july
Doc 4: july new home sales rise
My Answer:
terms postings lists
forecasts 1
home 1 -> 2 -> 3 -> 4
in 2 -> 3
increase 3
july 2 -> 3 -> 4
new 1 -> 4
rise 2 -> 4
sales 1 -> 2 -> 3 -> 4
top 1
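The postings lists for Exercise 1.1 can be reproduced with a short script. This is a minimal sketch, assuming whitespace tokenization and lowercasing; the function name build_inverted_index is made up for illustration.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of docIDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    # Sort the terms alphabetically and the docIDs numerically.
    return {term: sorted(ids) for term, ids in sorted(index.items())}

docs = {
    1: "new home sales top forecasts",
    2: "home sales rise in july",
    3: "increase in home sales in july",
    4: "july new home sales rise",
}

index = build_inverted_index(docs)
for term, postings in index.items():
    print(term, " -> ".join(str(d) for d in postings))
```

Each term is printed with its postings list in the same "1 -> 2 -> 3" notation used in the answer.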
Exercise 1.2
Draw the term-document incidence matrix for this document collection. Draw the inverted
index representation for this collection
Doc 1: breakthrough drug for schizophrenia
Doc 2: new schizophrenia drug
Doc 3: new approach for treatment of schizophrenia
Doc 4: new hopes for schizophrenia patients
My Answer:
terms/doc Doc 1 Doc 2 Doc 3 Doc 4
approach 0 0 1 0
breakthrough 1 0 0 0
drug 1 1 0 0
for 1 0 1 1
hopes 0 0 0 1
new 0 1 1 1
of 0 0 1 0
patients 0 0 0 1
schizophrenia 1 1 1 1
treatment 0 0 1 0
terms postings lists
approach 3
breakthrough 1
drug 1 -> 2
for 1 -> 3 -> 4
hopes 4
new 2 -> 3 -> 4
of 3
patients 4
schizophrenia 1 -> 2 -> 3 -> 4
treatment 3
Exercise 1.3
For the document collection shown in Exercise 1.2, what are the returned results for
these queries:
a. schizophrenia AND drug
Solution
Here we use the term-document incidence matrix to perform Boolean retrieval for the given query.
For the terms schizophrenia and drug, we take the rows (vectors) that indicate which documents each term appears in:
schizophrenia - 1 1 1 1
drug - 1 1 0 0
Doing a bitwise AND of the two term vectors gives
1 1 1 1 AND 1 1 0 0 = 1 1 0 0
The result vector 1 1 0 0 gives Doc 1 and Doc 2 as the documents in which both schizophrenia and drug are present.
b. for AND NOT (drug OR approach)
Term vectors
for - 1 0 1 1
drug - 1 1 0 0
approach - 0 0 1 0
First we do a bitwise OR of drug and approach, which gives
1 1 0 0 OR 0 0 1 0 = 1 1 1 0
Then we apply NOT to 1 1 1 0 (i.e. to drug OR approach), which gives 0 0 0 1
Then we AND 1 0 1 1 (i.e. for) with 0 0 0 1 (i.e. NOT (drug OR approach)), which gives 0 0 0 1
Thus the only document that satisfies for AND NOT (drug OR approach) is Doc 4.
These exercises illustrate the Boolean retrieval model for matching query terms against a given list of documents.
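The two queries of Exercise 1.3 can be checked mechanically. The sketch below assumes the incidence vectors are stored as Python lists of 0/1 values, one entry per document; the helper names are illustrative only.

```python
# Incidence vectors from the Exercise 1.2 matrix (positions are Doc 1..Doc 4).
matrix = {
    "schizophrenia": [1, 1, 1, 1],
    "drug":          [1, 1, 0, 0],
    "for":           [1, 0, 1, 1],
    "approach":      [0, 0, 1, 0],
}

def v_and(a, b):
    return [x & y for x, y in zip(a, b)]

def v_or(a, b):
    return [x | y for x, y in zip(a, b)]

def v_not(a):
    return [1 - x for x in a]

def matching_docs(vector):
    """Turn a result vector into a list of 1-based document numbers."""
    return [i + 1 for i, bit in enumerate(vector) if bit]

# a. schizophrenia AND drug
query_a = v_and(matrix["schizophrenia"], matrix["drug"])
# b. for AND NOT (drug OR approach)
query_b = v_and(matrix["for"], v_not(v_or(matrix["drug"], matrix["approach"])))

print(matching_docs(query_a))  # [1, 2]
print(matching_docs(query_b))  # [4]
```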
Exercise 1.4
For the queries below, can we still run through the intersection in time O(x+y), where x and y are the lengths of the postings lists for Brutus and Caesar? If not, what can we achieve?
a) Brutus AND NOT Caesar
Solution:
Time is O(x+y). Instead of collecting documents that occur in both postings lists, collect those that occur in the first one and not in the second.
b) Brutus OR NOT Caesar
Solution:
Time is O(N) (where N is the total number of documents in the collection), assuming we need to return a complete list of all documents satisfying the query. This is because the length of the results list is only bounded by N, not by the length of the postings lists.
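For part (a), the modified merge can be sketched as follows. This is only an illustration, assuming postings are sorted Python lists of docIDs; the two example lists are made up and are not part of the exercise.

```python
def and_not(p1, p2):
    """Return the docIDs in sorted list p1 that do not occur in sorted list p2.

    Each step advances at least one pointer, so the running time is O(x + y).
    """
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] < p2[j]:
            answer.append(p1[i])  # p1[i] cannot occur later in p2
            i += 1
        elif p1[i] > p2[j]:
            j += 1
        else:
            i += 1                # present in both lists: exclude it
            j += 1
    answer.extend(p1[i:])         # p2 is exhausted, the rest of p1 qualifies
    return answer

# Illustrative postings lists (assumed values).
brutus = [1, 2, 4, 11, 31, 45, 173, 174]
caesar = [1, 2, 4, 5, 6, 16, 57, 132]
print(and_not(brutus, caesar))  # [11, 31, 45, 173, 174]
```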
Exercise 1.5
Extend the postings merge algorithm to arbitrary Boolean query formulas. What is its time
complexity? For instance, consider:
(Brutus OR Caesar) AND NOT (Antony OR Cleopatra) Can we always merge in linear
time? Linear in what? Can we do better than this?
My Answer:
For the query, assume the postings lists of the four terms have lengths s1, s2, s3 and s4, respectively. The left-hand side of the query (Brutus OR Caesar) takes O(s1 + s2) time, and the right-hand side (Antony OR Cleopatra) takes O(s3 + s4). From the previous exercise, the AND NOT operation takes O(x + y) time, so the total time complexity is O(s1 + s2 + s3 + s4), i.e. linear in the total length of the postings lists involved.
In conclusion, A AND B, A OR B and A AND NOT B each take O(x + y) time, but A OR NOT B takes O(N) time.
Exercise 1.6
We can use distributive laws for AND and OR to rewrite queries.
a. Show how to rewrite the query in Exercise 1.5 into disjunctive normal form using
the distributive laws.
b. Would the resulting query be more or less efficiently evaluated than the original
form of this query?
c. Is this result true in general or does it depend on the words and the contents of
the document collection?
a.
(Brutus OR Caesar) AND NOT (Antony OR Cleopatra)
= (Brutus OR Caesar) AND NOT Antony AND NOT Cleopatra
= (Brutus AND (NOT Antony) AND (NOT Cleopatra)) OR (Caesar AND (NOT Antony) AND (NOT Cleopatra))
b. The resulting query would be more efficiently evaluated than the original form of this query.
c. It depends on the words and the contents of the document collection.
Exercise 1.7
Recommend a query processing order for
(tangerine OR trees) AND (marmalade OR skies) AND (kaleidoscope OR eyes)
given the following postings list sizes:
Term Postings size
eyes 213312
kaleidoscope 87009
marmalade 107913
skies 271658
tangerine 46653
trees 316812
Solution:
Using the conservative estimate of the length of the union of postings lists, the recommended order is:
(kaleidoscope OR eyes) (300,321) AND (tangerine OR trees) (363,465) AND (marmalade OR skies) (379,571)
However, depending on the actual distribution of postings, (tangerine OR trees) may well be longer than (marmalade OR skies), because the two components of the former are more asymmetric. For example, the union of 11 and 9990 is expected to be longer than the union of 5000 and 5000 even though the conservative estimate predicts otherwise.
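The conservative estimate used above is simply the sum of the two postings list sizes in each OR group. A small sketch of the ordering step, with the variable names chosen here for illustration:

```python
# Postings list sizes given in Exercise 1.7.
sizes = {
    "eyes": 213312,
    "kaleidoscope": 87009,
    "marmalade": 107913,
    "skies": 271658,
    "tangerine": 46653,
    "trees": 316812,
}

groups = [
    ("tangerine", "trees"),
    ("marmalade", "skies"),
    ("kaleidoscope", "eyes"),
]

def estimate(group):
    """Upper bound on the size of an OR group: the sum of its list sizes."""
    return sum(sizes[term] for term in group)

# Process the AND of the groups in increasing order of estimated size.
order = sorted(groups, key=estimate)
for group in order:
    print(" OR ".join(group), estimate(group))
```

The estimate is an upper bound only; as noted in the solution, the true union sizes can rank differently.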
Exercise 1.8
If the query is:
friends AND romans AND (NOT countrymen)
how could we use the frequency of countrymen in evaluating the best query evaluation
order? In particular, propose a way of handling negation in determining the order of query
processing.
My Answer:
We need to know the total number of documents N and the number of documents E that contain countrymen. The postings for NOT countrymen then have size N - E, and we use that size when deciding the processing order.
Exercise 1.9
For a conjunctive query, is processing postings lists in order of size guaranteed to be
optimal? Explain why it is, or give an example where it isn't.
My Answer:
It is not necessarily the best processing order. Suppose there are three terms whose postings lists have sizes s1 = 100, s2 = 105 and s3 = 110, that the intersection of lists 1 and 2 has size 100, and that the intersection of lists 1 and 3 is empty. Processing in the order s1, s2, s3 costs 100 + 105 + 100 + 110 = 415. Processing in the order s1, s3, s2 costs only 100 + 110 + 0 + 0 = 210, because the intermediate result is empty and the final merge stops immediately.
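The arithmetic in this counterexample can be replayed with a toy cost model. The sketch assumes each merge costs x + y (the two list lengths) and costs nothing once one operand is empty; the concrete docID ranges are invented just to produce the stated list and intersection sizes.

```python
def merge_cost(a, b):
    """Cost of one intersection under the O(x + y) model; 0 if either side is empty."""
    return len(a) + len(b) if a and b else 0

def conjunctive_cost(lists):
    """Total cost of intersecting the lists left to right."""
    result = lists[0]
    total = 0
    for nxt in lists[1:]:
        total += merge_cost(result, nxt)
        result = result & nxt
    return total

# Assumed docID sets: |l1| = 100, |l2| = 105, |l3| = 110,
# l1 is contained in l2, and l1 and l3 are disjoint.
l1 = set(range(100))
l2 = set(range(105))
l3 = set(range(1000, 1110))

print(conjunctive_cost([l1, l2, l3]))  # 415
print(conjunctive_cost([l1, l3, l2]))  # 210
```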
Exercise 1.10
Write out a postings merge algorithm for an x OR y query.
My Answer
OR(p1, p2)
  answer <- ()
  while p1 != NIL and p2 != NIL
    if docID(p1) < docID(p2)
      ADD(answer, docID(p1))
      p1 <- next(p1)
    else if docID(p2) < docID(p1)
      ADD(answer, docID(p2))
      p2 <- next(p2)
    else
      ADD(answer, docID(p1))
      p1 <- next(p1)
      p2 <- next(p2)
  while p1 != NIL
    ADD(answer, docID(p1))
    p1 <- next(p1)
  while p2 != NIL
    ADD(answer, docID(p2))
    p2 <- next(p2)
  return answer
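A runnable version of the OR merge, as a sketch that assumes postings are sorted Python lists rather than linked lists. The function name union is illustrative. Every docID from either list is added exactly once, and the leftovers of whichever list is longer are appended at the end.

```python
def union(p1, p2):
    """Merge two sorted postings lists into the sorted list for p1 OR p2."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] < p2[j]:
            answer.append(p1[i])
            i += 1
        elif p2[j] < p1[i]:
            answer.append(p2[j])
            j += 1
        else:
            answer.append(p1[i])  # same docID in both lists: add it once
            i += 1
            j += 1
    answer.extend(p1[i:])  # at most one of these two tails is non-empty
    answer.extend(p2[j:])
    return answer

# Postings for "drug" and "for" from Exercise 1.2.
print(union([1, 2], [1, 3, 4]))  # [1, 2, 3, 4]
```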
Exercise 1.11
This has already been answered above; refer to the earlier solutions.
Exercise 1.12
Assume a biword index. Give an example of a document which will be returned for a query
of New York University but is actually a false positive which should not be returned.
My Answer:
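The query is split into the biwords "New York" AND "York University". New York University is a specific phrase, so any document that contains both biwords without containing the full phrase is returned as a false positive. For example, "I moved from New York to study at York University" contains both biwords but never mentions New York University.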
Shown below is a portion of a positional index in the format: term: doc1: position1,
position2, ...; doc2: position1, position2, ...; etc.
angels: 2: 36,174,252,651; 4: 12,22,102,432; 7: 17;
fools: 2: 1,17,74,222; 4: 8,78,108,458; 7: 3,13,23,193;
fear: 2: 87,704,722,901; 4: 13,43,113,433; 7: 18,328,528;
in: 2: 3,37,76,444,851; 4: 10,20,110,470,500; 7: 5,15,25,195;
rush: 2: 2,66,194,321,702; 4: 9,69,149,429,569; 7: 4,14,404;
to: 2: 47,86,234,999; 4: 14,24,774,944; 7: 199,319,599,709;
tread: 2: 57,94,333; 4: 15,35,155; 7: 20,320;
where: 2: 67,124,393,1001; 4: 11,41,101,421,431; 7: 16,36,736;
Exercise 1.13
Which document(s) if any match each of the following queries, where each expression
within quotes is a phrase query?
a. "fools rush in"
b. "fools rush in" AND "angels fear to tread"
My Answer:
a. For the first query the words must appear consecutively and in order: fools first, followed by rush, and finally in. Doc 2, doc 4 and doc 7 all satisfy this.
b. For the second query, the phrase on the right of the AND matches only doc 4, so the overall result is doc 4.
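The phrase matches can be verified against the positional index with a short script. This is a simple sketch, not an optimized positional merge: it assumes the index is held as nested dictionaries and checks, for each start position of the first word, that every later word occurs at the next consecutive position. The function name phrase_matches is made up for illustration.

```python
# Positional index from the exercise: term -> {docID: [positions]}.
index = {
    "angels": {2: [36, 174, 252, 651], 4: [12, 22, 102, 432], 7: [17]},
    "fools":  {2: [1, 17, 74, 222], 4: [8, 78, 108, 458], 7: [3, 13, 23, 193]},
    "fear":   {2: [87, 704, 722, 901], 4: [13, 43, 113, 433], 7: [18, 328, 528]},
    "in":     {2: [3, 37, 76, 444, 851], 4: [10, 20, 110, 470, 500], 7: [5, 15, 25, 195]},
    "rush":   {2: [2, 66, 194, 321, 702], 4: [9, 69, 149, 429, 569], 7: [4, 14, 404]},
    "to":     {2: [47, 86, 234, 999], 4: [14, 24, 774, 944], 7: [199, 319, 599, 709]},
    "tread":  {2: [57, 94, 333], 4: [15, 35, 155], 7: [20, 320]},
}

def phrase_matches(index, phrase):
    """Return {docID: [start positions]} for an exact phrase."""
    terms = phrase.split()
    docs = set(index[terms[0]])
    for term in terms[1:]:
        docs &= set(index[term])
    matches = {}
    for doc in sorted(docs):
        starts = [
            p for p in index[terms[0]][doc]
            if all(p + k in index[t][doc] for k, t in enumerate(terms[1:], start=1))
        ]
        if starts:
            matches[doc] = starts
    return matches

a = phrase_matches(index, "fools rush in")
b = phrase_matches(index, "angels fear to tread")
print(a)                        # {2: [1], 4: [8], 7: [3, 13]}
print(b)                        # {4: [12]}
print(sorted(set(a) & set(b)))  # [4]
```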
Exercise 1.14
Consider the following fragment of a positional index with the format: word: document: position, position, ...; document: position, ...
Gates: 1: 3; 2: 6; 3: 2,17; 4: 1;
IBM: 4: 3; 7: 14;
Microsoft: 1: 1; 2: 1,21; 3: 3; 5: 16,22,51;
The /k operator, word1 /k word2 finds occurrences of word1 within k words of word2 (on either side), where k is a positive integer argument. Thus k = 1 demands that word1 be adjacent to word2.
1. Describe the set of documents that satisfy the query Gates /2 Microsoft.
2. Describe each set of values for k for which the query Gates /k Microsoft returns a different set of documents as the answer.
My Answer:
1. By definition, doc 1 and doc 3 satisfy the query Gates /2 Microsoft.
2. The answer changes at k = 2 and at k = 5, giving three sets of values: k = 1, 2 <= k <= 4, and k >= 5. See the following table:
k result
1 doc3
2 doc1,doc3
3 doc1,doc3
4 doc1,doc3
5 doc1,doc2,doc3
>5 doc1,doc2,doc3
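The table can be checked by brute force over the position lists. A minimal sketch, assuming the index fragment is stored as nested dictionaries; within_k is a name chosen here and compares every pair of positions rather than doing a linear positional merge.

```python
# Positional index fragment from Exercise 1.14 (IBM is not needed for this query).
index = {
    "Gates":     {1: [3], 2: [6], 3: [2, 17], 4: [1]},
    "Microsoft": {1: [1], 2: [1, 21], 3: [3], 5: [16, 22, 51]},
}

def within_k(index, word1, word2, k):
    """Documents in which some occurrence of word1 is within k words of word2."""
    docs = sorted(set(index[word1]) & set(index[word2]))
    return [
        d for d in docs
        if any(abs(p1 - p2) <= k for p1 in index[word1][d] for p2 in index[word2][d])
    ]

for k in (1, 2, 4, 5, 20):
    print(k, within_k(index, "Gates", "Microsoft", k))
```

The smallest distances are 2 in doc 1, 5 in doc 2 and 1 in doc 3, which is why the result changes exactly at k = 2 and k = 5.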
Exercise 1.19
In the permuterm index, each permuterm vocabulary term points to the original vocabulary
term(s) from which it was derived. How many original vocabulary terms can there be in the
postings list of a permuterm vocabulary term?
My Answer:
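One. Every permuterm vocabulary term is a rotation of exactly one original term with $ appended (for example, each rotation of hello$ points back only to hello).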
Exercise 1.20
Write down the entries in the permuterm index dictionary that are generated by the term
mama.
My Answer:
mama$ , ama$m , ma$ma , a$mam , $mama
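The rotations can be generated mechanically. A small sketch, with permuterm_rotations as an illustrative name:

```python
def permuterm_rotations(term):
    """Return every rotation of term + '$', starting with the unrotated form."""
    s = term + "$"
    return [s[i:] + s[:i] for i in range(len(s))]

print(permuterm_rotations("mama"))
# ['mama$', 'ama$m', 'ma$ma', 'a$mam', '$mama']
```

A term of length n always yields n + 1 permuterm entries.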
Exercise 1.21
If you wanted to search for s*ng in a permuterm wildcard index, what key(s) would one do
the lookup on?
My Answer:
ng$s*
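The lookup key comes from rotating the query so that the wildcard ends up at the end. The sketch below assumes a query with exactly one *; both function names are made up for illustration.

```python
def permuterm_key(query):
    """Rotate a single-wildcard query X*Y into the permuterm lookup key Y$X*."""
    if query.count("*") != 1:
        raise ValueError("this sketch handles exactly one wildcard")
    prefix, suffix = query.split("*")
    return suffix + "$" + prefix + "*"

def term_matches(term, query):
    """Check a term by testing whether one of its rotations starts with the key."""
    key = permuterm_key(query)[:-1]  # drop the trailing '*'
    s = term + "$"
    return any((s[i:] + s[:i]).startswith(key) for i in range(len(s)))

print(permuterm_key("s*ng"))         # ng$s*
print(term_matches("sing", "s*ng"))  # True
print(term_matches("sang", "s*ng"))  # True
print(term_matches("ring", "s*ng"))  # False
```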
Exercise 1.22
Refer to Figure 3.4; it is pointed out in the caption that the vocabulary terms in the
postings are lexicographically ordered. Why is this ordering useful?
Exercise 1.23
Consider again the query fi*mo*er from Section 3.2.1. What Boolean query on a bigram index would be generated for this query? Can you think of a term that matches the permuterm query in Section 3.2.1, but does not satisfy this Boolean query?
My Answer:
The Boolean query is $f AND fi AND mo AND er AND r$. The term filibuster matches the permuterm query (er$fi*) but does not satisfy the Boolean query, because it does not contain the bigram mo.
Exercise 1.24
Give an example of a sentence that falsely matches the wildcard query mon*h if the search
were to simply use a conjunction of bigrams.
My Answer:
Monday hash
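The false match can be demonstrated with a short check. This sketch assumes each word is padded with $ at both ends before bigrams are taken, that the text is lowercased, and that the "conjunction of bigrams" is tested against the bigrams of the whole sentence. The function names are illustrative; fnmatchcase from the standard library serves as the reference wildcard matcher.

```python
from fnmatch import fnmatchcase

def word_bigrams(word):
    padded = "$" + word + "$"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}

def query_bigrams(wildcard):
    """Bigrams required by a wildcard query, e.g. mon*h -> $m, mo, on, h$."""
    grams = set()
    for piece in ("$" + wildcard + "$").split("*"):
        grams |= {piece[i:i + 2] for i in range(len(piece) - 1)}
    return grams

def sentence_bigrams(sentence):
    grams = set()
    for word in sentence.lower().split():
        grams |= word_bigrams(word)
    return grams

def bigram_conjunction_matches(sentence, wildcard):
    return query_bigrams(wildcard) <= sentence_bigrams(sentence)

def some_word_matches(sentence, wildcard):
    return any(fnmatchcase(w, wildcard) for w in sentence.lower().split())

print(sorted(query_bigrams("mon*h")))                      # ['$m', 'h$', 'mo', 'on']
print(bigram_conjunction_matches("Monday hash", "mon*h"))  # True
print(some_word_matches("Monday hash", "mon*h"))           # False
```

Monday supplies $m, mo and on, and hash supplies h$, so the conjunction is satisfied even though no single word matches mon*h.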
Chapter 2
Exercise 2.1
Are the following statements true or false?
a. In a Boolean retrieval system, stemming never lowers precision.
b. In a Boolean retrieval system, stemming never lowers recall.
c. Stemming increases the size of the vocabulary.
d. Stemming should be invoked at indexing time but not while processing a query.
a. False
b. True
c. False
d. False
Exercise 2.2
Exercise 2.3
The following pairs of words are stemmed to the same form by the Porter stemmer.
Which pairs would you argue shouldn't be conflated? Give your reasoning.
a. abandon/abandonment
b. absorbency/absorbent
c. marketing/markets
d. university/universe
e. volume/volumes
c. marketing/markets should not be conflated
d. university/universe should not be conflated
Exercise 2.4
For the Porter stemmer rule group shown in (2.1):
a. What is the purpose of including an identity rule such as SS -> SS?
b. Applying just this rule group, what will the following words be stemmed to?
circus canaries boss
c. What rule should be added to correctly stem pony?
d. The stemming for ponies and pony might seem strange. Does it have a deleterious
effect on retrieval? Why or why not?
a. Imagine the rule SS -> SS was not in the rule group. A word such as caress would then match the rule S -> (delete the final s) and be stemmed to cares. With SS -> SS present, the longer suffix ss matches first, so the stemmer maps caress to caress and stops. The rule does no rewriting itself; its purpose is to protect words ending in ss, which improves the accuracy of the stemmer. You can see this when the algorithm is tested. Look at the word list [ridiculousness, caress]:
Case 1. Rule SS -> SS in the algorithm.
Stemming:
caress (Step 1a) -> caress OK
ridiculousness (Step 2) -> ridiculous (Step 4) -> ridicul OK
Success rate: 100%
Case 2. Rule SS -> SS not in the algorithm.
Stemming:
caress (Step 1a, rule S ->) -> cares FAIL
ridiculousness (Step 1a, rule S ->) -> ridiculousnes FAIL
Success rate: 0%
So although the rule looks like a formalism, removing it changes the result for every word ending in ss.
Exercise 2.5
Why are skip pointers not useful for queries of the form x OR y?
Because an OR query must return every document that appears in either list, every posting has to be visited. There are no postings that can be skipped over.
Exercise 2.6
We have a two-word query. For one term the postings list consists of the following 16
entries:
[4,6,10,12,14,16,18,20,22,32,47,81,120,122,157,180]
and for the other it is the one entry postings list:
[47].
Work out how many comparisons would be done to intersect the two postings lists with the following two strategies. Briefly justify your answers:
a. Using standard postings lists
b. Using postings lists stored with skip pointers, with a skip length of √P, as suggested in Section 2.3
(i) using standard postings lists
Ans.
11
(Compare 47 with entries 4 through 47)
(ii) using postings lists stored with skip pointers, with a skip length of √length, as recommended in class. Briefly justify your answer.
Ans.
6
(Compare 47 with entries 4, 14, 22, 120, 32, 47)
(iii) Explain briefly how skip pointers could be made to work if we wanted to
make use of gamma-encoding on the gaps between successive docIDs.
Ans.
Use absolute encodings rather than gap encodings for the target of skips.
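The two comparison counts in parts (i) and (ii) can be traced with a simplified search for a single target docID. This is a sketch under assumptions, not the full intersection algorithm: skip pointers are assumed to sit at every fourth entry (indices 0, 4, 8, 12, since √16 = 4), each entry is counted once when it is first compared with the target, and both function names are made up for illustration.

```python
import math

def linear_examined(postings, target):
    """Entries compared with target when scanning without skip pointers."""
    examined = []
    for doc_id in postings:
        examined.append(doc_id)
        if doc_id >= target:
            break
    return examined

def skip_examined(postings, target, skip):
    """Entries compared with target when skip pointers are placed every `skip` entries."""
    i = 0
    examined = [postings[0]]
    while postings[i] < target:
        j = i + skip
        if i % skip == 0 and j < len(postings):
            examined.append(postings[j])   # look at the skip target
            if postings[j] <= target:
                i = j                      # follow the skip pointer
                continue
        i += 1                             # otherwise step to the next entry
        if i == len(postings):
            break
        examined.append(postings[i])
    return examined

long_list = [4, 6, 10, 12, 14, 16, 18, 20, 22, 32, 47, 81, 120, 122, 157, 180]
skip = math.isqrt(len(long_list))  # 4

print(len(linear_examined(long_list, 47)))  # 11
print(skip_examined(long_list, 47, skip))   # [4, 14, 22, 120, 32, 47]
```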
Exercise 2.7
Consider a postings intersection between this postings list, with skip pointers:
3 5 9 15 24 39 60 68 75 81 84 89 92 96 97 100 115
and the following intermediate result postings list (which hence has no skip pointers):
3 5 89 95 97 99 100 101
Trace through the postings intersection algorithm in Figure 2.10 (page 37).
a. How often is a skip pointer followed (i.e., p1 is advanced to skip(p1))?
b. How many postings comparisons will be made by this algorithm while intersecting the two lists?
c. How many postings comparisons would be made if the postings lists are intersected without the use of skip pointers?
a. 1 time, 24 -> 75
b. 18
3=3 5=5 9<89 15<89 24<89 75<89 92>89 81<89 84<89 89=89 95>92 95<115 95<96 97>96 97=97 99<100 100=100 101<115
c. 19
3=3 5=5 89>9 89>15 89>24 89>39 89>60 89>68 89>75 89>81 89>84 89=89 95>92 95<96 97>96 97=97 99<100 100=100 101<115
Exercise 2.8
{Already answered in 1.12}
Exercise 2.9
Shown below is a portion of a positional index in the format: term: doc1: <position1, position2, ...>; doc2: <position1, position2, ...>; etc.
angels: 2: <36,174,252,651>; 4: <12,22,102,432>; 7: <17>;
fools: 2: <1,17,74,222>; 4: <8,78,108,458>; 7: <3,13,23,193>;
fear: 2: <87,704,722,901>; 4: <13,43,113,433>; 7: <18,328,528>;
in: 2: <3,37,76,444,851>; 4: <10,20,110,470,500>; 7: <5,15,25,195>;
rush: 2: <2,66,194,321,702>; 4: <9,69,149,429,569>; 7: <4,14,404>;
to: 2: <47,86,234,999>; 4: <14,24,774,944>; 7: <199,319,599,709>;
tread: 2: <57,94,333>; 4: <15,35,155>; 7: <20,320>;
where: 2: <67,124,393,1001>; 4: <11,41,101,421,431>; 7: <16,36,736>;
Which document(s), if any, match each of the following queries, where each expression within quotes is a phrase query?
a. "fools rush in"
b. "fools rush in" AND "angels fear to tread"
a. "fools rush in" occurs in document 2 at position 1; document 4 at position 8; document 7 at positions 3 and 13
b. "angels fear to tread" occurs in document 4 at position 12, so "fools rush in" AND "angels fear to tread" matches document 4