Chapter 6 - Exercises


Introduction to Information Retrieval

Chapter 6

Exercise 6.10
Consider the table of term frequencies for 3 documents denoted Doc1, Doc2, Doc3 in Figure 6.9.
Compute the tf-idf weights for the terms car, auto, insurance, best, for each document, using the idf
values from Figure 6.8.

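Figure 6.9: tf values (N = 806,791)

term        Doc1   Doc2   Doc3
car           27      4     24
auto           3     33      0
insurance      0     33     29
best          14      0     17

Figure 6.8: idf values

term        idf
car         1.65
auto        2.08
insurance   1.62
best        1.5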

Solution

term        Doc1    Doc2    Doc3
car         44.55    6.6    39.6
auto         6.24   68.64    0
insurance    0      53.46   46.98
best        21       0      25.5
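
The table above is just tf multiplied by idf, term by term. A minimal sketch of that arithmetic follows; the names `TF`, `IDF` and `tf_idf` are illustrative, and the tf and idf values are the ones that appear in the worked tables of this document.

```python
# tf values per term for Doc1, Doc2, Doc3 (as used in the worked tables)
TF = {
    "car":       [27, 4, 24],
    "auto":      [3, 33, 0],
    "insurance": [0, 33, 29],
    "best":      [14, 0, 17],
}
# idf values per term (Figure 6.8)
IDF = {"car": 1.65, "auto": 2.08, "insurance": 1.62, "best": 1.5}


def tf_idf(tf, idf):
    """Return tf * idf for every term and document, rounded to 2 decimals."""
    return {t: [round(c * idf[t], 2) for c in counts] for t, counts in tf.items()}


weights = tf_idf(TF, IDF)
for term, row in weights.items():
    print(f"{term:<10}", row)
```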

==================================================================================
Exercise 6.15
Recall the tf-idf weights computed in Exercise 6.10. Compute the Euclidean normalized document
vectors for each of the documents, where each vector has four components, one for each of the four
terms.

Solution

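term            Doc1    Doc2    Doc3
car             44.55    6.6    39.6
auto             6.24   68.64    0
insurance        0      53.46   46.98
best            21       0      25.5
length of d_i   49.65   87.25   66.52

Dividing each document's tf-idf weights by its length gives the normalized
vectors, with components in the order [car, auto, insurance, best]:

doc1 = [0.8974, 0.1257, 0, 0.4230]
doc2 = [0.0756, 0.7867, 0.6127, 0]
doc3 = [0.5953, 0, 0.7062, 0.3833]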
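
The normalization can be checked with a few lines of code. This is a sketch only: `TFIDF` and `l2_normalize` are illustrative names, and the input values are the tf-idf weights from Exercise 6.10.

```python
import math

# tf-idf weights in the order [car, auto, insurance, best]
TFIDF = {
    "doc1": [44.55, 6.24, 0.0, 21.0],
    "doc2": [6.6, 68.64, 53.46, 0.0],
    "doc3": [39.6, 0.0, 46.98, 25.5],
}


def l2_normalize(vec):
    """Return (Euclidean length, vector divided by that length)."""
    length = math.sqrt(sum(x * x for x in vec))
    return length, [x / length for x in vec]


normalized = {}
for name, vec in TFIDF.items():
    length, unit = l2_normalize(vec)
    normalized[name] = unit
    print(name, round(length, 2), [round(x, 4) for x in unit])
```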
==================================================================================
Exercise 6.17
With term weights as computed in Exercise 6.15, rank the three documents by computed score for the
query "car insurance", for each of the following cases of term weighting in the query:
1. The weight of a term is 1 if present in the query, 0 otherwise.
2. Euclidean normalized idf.

Solution
1. q = [1, 0, 1, 0] //[car, auto, insurance, best]
score(q, doc1) = 0.8974 //[0.8974*1 + 0.1257*0 + 0*1 + 0.4230*0]
score(q, doc2) = 0.6883 //[0.0756*1 + 0.7867*0 + 0.6127*1 + 0*0]
score(q, doc3) = 1.3015 //[0.5953*1 + 0*0 + 0.7062*1 + 0.3833*0]
Ranking: doc3, doc1, doc2

2. q = [0.4778, 0.6024, 0.4692, 0.4344] //[car, auto, insurance, best]

term        tf(t,q)   idf    normalized idf
car         1         1.65   0.4778
auto        0         2.08   0.6024
insurance   1         1.62   0.4692
best        0         1.5    0.4344
length of idf vector: 3.453

score(q, doc1) = 0.6883 //[0.8974*0.4778 + 0.1257*0.6024 + 0*0.4692 + 0.4230*0.4344]
score(q, doc2) = 0.7975 //[0.0756*0.4778 + 0.7867*0.6024 + 0.6127*0.4692 + 0*0.4344]
score(q, doc3) = 0.7823 //[0.5953*0.4778 + 0*0.6024 + 0.7062*0.4692 + 0.3833*0.4344]
Ranking: doc2, doc3, doc1
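
Both rankings can be reproduced with a short dot-product sketch. It follows this solution's reading of case 2, in which all four idf values are normalized together regardless of whether the term occurs in the query; `DOCS`, `dot` and `rank` are illustrative names.

```python
import math

# Normalized document vectors from Exercise 6.15: [car, auto, insurance, best]
DOCS = {
    "doc1": [0.8974, 0.1257, 0.0, 0.4230],
    "doc2": [0.0756, 0.7867, 0.6127, 0.0],
    "doc3": [0.5953, 0.0, 0.7062, 0.3833],
}
IDF = [1.65, 2.08, 1.62, 1.5]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def rank(query):
    """Return document names sorted by descending score for the query vector."""
    return sorted(DOCS, key=lambda name: dot(query, DOCS[name]), reverse=True)


# Case 1: weight 1 for query terms ("car", "insurance"), 0 otherwise
q_binary = [1, 0, 1, 0]

# Case 2: Euclidean-normalized idf over all four terms (this solution's reading)
idf_length = math.sqrt(sum(w * w for w in IDF))
q_idf = [w / idf_length for w in IDF]

print(rank(q_binary))
print(rank(q_idf))
```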
==================================================================================
Exercise 6.19
Compute the vector space similarity between the query “digital cameras” and the document “digital
cameras and video cameras” by filling out the empty columns in Table 6.1. Assume N = 10,000,000,
logarithmic term weighting (wf columns) for query and document, idf weighting for the query only and
cosine normalization for the document only. Treat "and" as a stop word. Enter term counts in the tf
columns. What is the final similarity score?
Solution

Similarity score = 1.56+1.56 = 3.12
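
The filled-in table is not reproduced here, so the following sketch recomputes the score from scratch. Note the assumption: the document frequencies in `DF` (digital 10,000; video 100,000; cameras 50,000) are recalled from Table 6.1 of the textbook and do not appear in this text. The names `wf` and `similarity` are illustrative.

```python
import math

N = 10_000_000
# ASSUMPTION: df values as recalled from Table 6.1; they are not given in this text.
DF = {"digital": 10_000, "video": 100_000, "cameras": 50_000}


def wf(tf):
    """Logarithmic term weighting: 1 + log10(tf) for tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0


def similarity(query_tf, doc_tf):
    # Query: wf * idf, no normalization
    q = {t: wf(query_tf.get(t, 0)) * math.log10(N / DF[t]) for t in DF}
    # Document: wf only, cosine-normalized
    d = {t: wf(doc_tf.get(t, 0)) for t in DF}
    length = math.sqrt(sum(w * w for w in d.values()))
    return sum(q[t] * d[t] / length for t in DF)


score = similarity(
    {"digital": 1, "cameras": 1},              # "digital cameras"
    {"digital": 1, "video": 1, "cameras": 2},  # "and" dropped as a stop word
)
print(round(score, 2))
```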


==================================================================================
Exercise 6.23
Refer to the tf and idf values for four terms and three documents in Exercise 6.10. Compute the two top
scoring documents on the query "best car insurance" for each of the following weighting schemes: (i)
nnn.atc; (ii) ntc.atc.

Figure 6.9: tf values (N = 806,791) and Figure 6.8: idf values, as given for Exercise 6.10.

Solution

(i) nnn.atc

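nnn weights for documents (raw tf, no idf, no normalization)

Each product below is the document's tf times the atc query weight, in the
order car, auto, insurance, best. The query weights implied by these products
are approximately car 0.56, auto 0.353, insurance 0.55, best 0.51.

Score(q, doc1) = 15.12 + 1.06 + 0 + 7.14 = 23.32
Score(q, doc2) = 2.24 + 11.65 + 18.15 + 0 = 32.04
Score(q, doc3) = 13.44 + 0 + 15.95 + 8.67 = 38.06
Ranking: doc3, doc2, doc1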
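
A sketch of the nnn.atc computation, using the tf and idf values from the worked tables in this document. To match the numbers above, augmented tf is applied to every term, so "auto" (tf 0 in the query) still receives 0.5 before the idf factor. The names `atc`, `DOC_TF` and `QUERY_TF` are illustrative, and small differences from the hand-rounded scores are expected.

```python
import math

# Term order: [car, auto, insurance, best]
IDF = [1.65, 2.08, 1.62, 1.5]
DOC_TF = {
    "doc1": [27, 3, 0, 14],
    "doc2": [4, 33, 33, 0],
    "doc3": [24, 0, 29, 17],
}
QUERY_TF = [1, 0, 1, 1]  # "best car insurance"


def atc(tf, idf):
    """Augmented tf (a), idf (t), cosine normalization (c)."""
    max_tf = max(tf)
    weights = [(0.5 + 0.5 * f / max_tf) * w for f, w in zip(tf, idf)]
    length = math.sqrt(sum(x * x for x in weights))
    return [x / length for x in weights]


query = atc(QUERY_TF, IDF)
# nnn document weights are the raw tf values
scores = {name: sum(q * f for q, f in zip(query, tf)) for name, tf in DOC_TF.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print([round(q, 3) for q in query], ranking)
```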

..................................................................................................................
(ii) ntc.atc
ntc weight for doc1

ntc weight for doc2

ntc weight for doc3

ntc.atc

Score(q, doc1) = 0.762
Score(q, doc2) = 0.657
Score(q, doc3) = 0.916
Ranking: doc3, doc1, doc2
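
For ntc.atc the documents are weighted by tf times idf and cosine-normalized, while the query keeps the atc weights. The sketch below is one way to check the three scores; `ntc` and `atc` are illustrative helper names, and the last digit may differ slightly from the hand-rounded values.

```python
import math

# Term order: [car, auto, insurance, best]
IDF = [1.65, 2.08, 1.62, 1.5]
DOC_TF = {
    "doc1": [27, 3, 0, 14],
    "doc2": [4, 33, 33, 0],
    "doc3": [24, 0, 29, 17],
}
QUERY_TF = [1, 0, 1, 1]  # "best car insurance"


def cosine_normalize(vec):
    length = math.sqrt(sum(x * x for x in vec))
    return [x / length for x in vec]


def ntc(tf, idf):
    """Natural tf (n), idf (t), cosine normalization (c)."""
    return cosine_normalize([f * w for f, w in zip(tf, idf)])


def atc(tf, idf):
    """Augmented tf (a), idf (t), cosine normalization (c)."""
    max_tf = max(tf)
    return cosine_normalize([(0.5 + 0.5 * f / max_tf) * w for f, w in zip(tf, idf)])


query = atc(QUERY_TF, IDF)
scores = {
    name: sum(q * d for q, d in zip(query, ntc(tf, IDF)))
    for name, tf in DOC_TF.items()
}
ranking = sorted(scores, key=scores.get, reverse=True)
print({name: round(s, 3) for name, s in scores.items()}, ranking)
```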

..................................................................................................................
tf-idf weights   Doc1    Doc2    Doc3
car              44.55    6.6    39.6
auto              6.24   68.64    0
insurance         0      53.46   46.98
best             21       0      25.5
ntc.ltn weight for doc1

            query                  doc1                              product
term        w(tf)  idf   tf-idf    tf   idf   tf-idf   normalized    w
car         1      1.65  1.65      27   1.65  44.55    0.8974        1.4807
auto        0      2.08  0          3   2.08   6.24    0.1257        0
insurance   1      1.62  1.62       0   1.62   0       0             0
best        1      1.5   1.5       14   1.5   21       0.4230        0.6345
length of doc1 tf-idf vector: 49.65
ntc.ltn weight for doc2

            query                  doc2                              product
term        w(tf)  idf   tf-idf    tf   idf   tf-idf   normalized    w
car         1      1.65  1.65       4   1.65   6.6     0.0756        0.1247
auto        0      2.08  0         33   2.08  68.64    0.7867        0
insurance   1      1.62  1.62      33   1.62  53.46    0.6127        0.9926
best        1      1.5   1.5        0   1.5    0       0             0
length of doc2 tf-idf vector: 87.25
ntc.ltn weight for doc3

            query                  doc3                              product
term        w(tf)  idf   tf-idf    tf   idf   tf-idf   normalized    w
car         1      1.65  1.65      24   1.65  39.6     0.5953        0.9822
auto        0      2.08  0          0   2.08   0       0             0
insurance   1      1.62  1.62      29   1.62  46.98    0.7062        1.144
best        1      1.5   1.5       17   1.5   25.5     0.3833        0.575
length of doc3 tf-idf vector: 66.52

ntc.ltn products

term        doc1     doc2     doc3
car         1.4807   0.1247   0.9822
auto        0        0        0
insurance   0        0.9926   1.144
best        0.6345   0        0.575
Score       2.1152   1.1173   2.7012

Score(q, doc1) = 1.4807 + 0.6345 = 2.1152
Score(q, doc2) = 0.1247 + 0.9926 = 1.1173
Score(q, doc3) = 0.9822 + 1.144 + 0.575 = 2.7012
Ranking: doc3, doc1, doc2
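
The ntc.ltn products can be recomputed from the normalized document vectors in the tables above. In this sketch the query uses logarithmic tf times idf with no normalization; `ltn` and `DOCS_NTC` are illustrative names.

```python
import math

# Term order: [car, auto, insurance, best]
IDF = [1.65, 2.08, 1.62, 1.5]
# Normalized (ntc) document vectors taken from the tables above
DOCS_NTC = {
    "doc1": [0.8974, 0.1257, 0.0, 0.4230],
    "doc2": [0.0756, 0.7867, 0.6127, 0.0],
    "doc3": [0.5953, 0.0, 0.7062, 0.3833],
}
QUERY_TF = [1, 0, 1, 1]  # "best car insurance"


def ltn(tf, idf):
    """Logarithmic tf (l), idf (t), no normalization (n)."""
    return [(1 + math.log10(f)) * w if f > 0 else 0.0 for f, w in zip(tf, idf)]


query = ltn(QUERY_TF, IDF)
scores = {
    name: sum(q * d for q, d in zip(query, vec)) for name, vec in DOCS_NTC.items()
}
ranking = sorted(scores, key=scores.get, reverse=True)
print({name: round(s, 4) for name, s in scores.items()}, ranking)
```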
==================================================================================
