IR Notes

Introduction to Information Retrieval
http://informationretrieval.org

Boolean retrieval
Outline
1 Introduction
2 Inverted index
3 Processing Boolean queries
4 Query optimization
5 Course overview
Unstructured data in 1650: Shakespeare

Unstructured data in 1650
Term-document incidence matrix

            Anthony and   Julius   The       Hamlet   Othello   Macbeth   ...
            Cleopatra     Caesar   Tempest
Anthony     1             1        0         0        0         1
Brutus      1             1        0         1        0         0
Caesar      1             1        0         1        1         1
mercy       1             0        1         1        1         1
worser      1             0        1         1        1         0
Incidence vectors
0/1 vectors and result of bitwise operations

            Anthony and   Julius   The       Hamlet   Othello   Macbeth   ...
            Cleopatra     Caesar   Tempest
Anthony     1             1        0         0        0         1
Brutus      1             1        0         1        0         0
Caesar      1             1        0         1        1         1
Calpurnia   0             1        0         0        0         0
Cleopatra   1             0        0         0        0         0
mercy       1             0        1         1        1         1
worser      1             0        1         1        1         0
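The bitwise evaluation can be sketched in Python. This is a minimal illustration, not from the slides; the query Brutus AND Caesar AND NOT Calpurnia and all variable names are assumptions chosen to match the matrix above.

# A minimal sketch: evaluating a Boolean query with 0/1 incidence vectors
# represented as Python integers used as bit vectors.
plays = ["Anthony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]

# Rows of the incidence matrix above, written as bit strings (one bit per play).
vec = {
    "Brutus":    0b110100,
    "Caesar":    0b110111,
    "Calpurnia": 0b010000,
}

# Example query (assumed here): Brutus AND Caesar AND NOT Calpurnia.
mask = (1 << len(plays)) - 1                      # limit NOT to the 6 plays
result = vec["Brutus"] & vec["Caesar"] & (~vec["Calpurnia"] & mask)

# Report which plays match (bit 0 of the bit string is the leftmost column).
matches = [p for i, p in enumerate(plays) if result >> (len(plays) - 1 - i) & 1]
print(matches)   # ['Anthony and Cleopatra', 'Hamlet']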
Answers to query

Bigger collections

Can’t build the incidence matrix
Inverted Index

Caesar    →  1 → 2 → 4 → 5 → 6 → 16 → 57 → 132 → ...
Calpurnia →  2 → 31 → 54 → 101
...
(dictionary)     (postings)
Tokenization and preprocessing

Doc 1. I did enact Julius Caesar: I was killed i’ the Capitol; Brutus killed me.
  =⇒  doc 1: i did enact julius caesar i was killed i’ the capitol brutus killed me
Doc 2. So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.
  =⇒  doc 2: so let it be with caesar the noble brutus hath told you caesar was ambitious
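A minimal Python sketch of this preprocessing step, assuming a simple "lowercase and split on non-letters" rule rather than the course's exact tokenizer; the regex and document strings are illustrative assumptions.

import re

docs = {
    1: "I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.",
}

def tokenize(text: str) -> list[str]:
    # Lowercase, then keep letter runs plus an optional trailing apostrophe (for "i'").
    return re.findall(r"[a-z]+'?", text.lower())

for doc_id, text in docs.items():
    print(doc_id, tokenize(text))
# 1 ['i', 'did', 'enact', 'julius', 'caesar', 'i', 'was', 'killed', "i'", 'the', ...]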
Generate postings

Doc 1. i did enact julius caesar i was killed i’ the capitol brutus killed me
Doc 2. so let it be with caesar the noble brutus hath told you caesar was ambitious
  =⇒  one (term, docID) pair per token, in document order:

term      docID        term       docID
i         1            so         2
did       1            let        2
enact     1            it         2
julius    1            be         2
caesar    1            with       2
i         1            caesar     2
was       1            the        2
killed    1            noble      2
i’        1            brutus     2
the       1            hath       2
capitol   1            told       2
brutus    1            you        2
killed    1            caesar     2
me        1            was        2
                       ambitious  2
Sort postings

Sort the (term, docID) pairs by term (then by docID):

term      docID        term       docID
ambitious 2            julius     1
be        2            killed     1
brutus    1            killed     1
brutus    2            let        2
capitol   1            me         1
caesar    1            noble      2
caesar    2            so         2
caesar    2            the        1
did       1            the        2
enact     1            told       2
hath      2            was        1
i         1            was        2
i         1            with       2
i’        1            you        2
it        2
Create postings lists, determine document frequency

term        doc. freq.   postings list
ambitious   1            → 2
be          1            → 2
brutus      2            → 1 → 2
capitol     1            → 1
caesar      2            → 1 → 2
did         1            → 1
enact       1            → 1
hath        1            → 2
i           1            → 1
i’          1            → 1
it          1            → 2
julius      1            → 1
killed      1            → 1
let         1            → 2
me          1            → 1
noble       1            → 2
so          1            → 2
the         2            → 1 → 2
told        1            → 2
was         2            → 1 → 2
with        1            → 2
you         1            → 2
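The three steps above can be sketched in a few lines of Python. This is a minimal illustration under assumptions (whitespace tokenization, small in-memory dictionaries), not the course's indexing code.

from collections import defaultdict

docs = {
    1: "i did enact julius caesar i was killed i' the capitol brutus killed me",
    2: "so let it be with caesar the noble brutus hath told you caesar was ambitious",
}

# 1. Generate postings: one (term, docID) pair per token.
pairs = [(term, doc_id) for doc_id, text in docs.items() for term in text.split()]

# 2. Sort by term, then by docID.
pairs.sort()

# 3. Create postings lists; repeated occurrences within a document collapse to one entry.
postings = defaultdict(list)
for term, doc_id in pairs:
    if not postings[term] or postings[term][-1] != doc_id:
        postings[term].append(doc_id)

for term in sorted(postings):
    print(f"{term:<10} df={len(postings[term])}  ->  {postings[term]}")
# brutus     df=2  ->  [1, 2]
# caesar     df=2  ->  [1, 2]
# ...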
Split the result into dictionary and postings file

Brutus    →  1 → 2 → 4 → 11 → 31 → 45 → 173 → 174
Caesar    →  1 → 2 → 4 → 5 → 6 → 16 → 57 → 132 → ...
Calpurnia →  2 → 31 → 54 → 101
(dictionary)     (postings file)
Outline
1 Introduction
2 Inverted index
3 Processing Boolean queries
4 Query optimization
5 Course overview
Simple conjunctive query (two terms)
Intersecting two postings lists

Intersect(p1, p2)
  answer ← ( )
  while p1 ≠ nil and p2 ≠ nil
  do if docID(p1) = docID(p2)
       then Add(answer, docID(p1))
            p1 ← next(p1)
            p2 ← next(p2)
     else if docID(p1) < docID(p2)
       then p1 ← next(p1)
       else p2 ← next(p2)
  return answer
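A minimal Python sketch of the Intersect algorithm above, using plain lists of sorted docIDs instead of linked postings lists; the example lists are the Brutus and Caesar postings from the earlier slide.

def intersect(p1: list[int], p2: list[int]) -> list[int]:
    answer = []
    i, j = 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

print(intersect([1, 2, 4, 11, 31, 45, 173, 174], [1, 2, 4, 5, 6, 16, 57, 132]))
# [1, 2, 4]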
Query processing: Exercise

france → 1 → 2 → 3 → 4 → 5 → 7 → 8 → 9 → 11 → 12 → 13 → 14 → 15
paris  → 2 → 6 → 10 → 12 → 14
lear   → 12 → 15

Compute the hit list for ((paris AND NOT france) OR lear)
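One way to check the exercise is with simple set operations; this is a minimal sketch (a real system would walk the sorted postings lists with a merge, as in Intersect).

france = {1, 2, 3, 4, 5, 7, 8, 9, 11, 12, 13, 14, 15}
paris = {2, 6, 10, 12, 14}
lear = {12, 15}

hits = (paris - france) | lear        # (paris AND NOT france) OR lear
print(sorted(hits))                   # [6, 10, 12, 15]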
Boolean retrieval model: Assessment

The Boolean retrieval model can answer any query that is a Boolean expression.
  Boolean queries are queries that use and, or and not to join query terms.
  Views each document as a set of terms.
  Is precise: Document matches condition or not.
Primary commercial retrieval tool for 3 decades
Many professional searchers (e.g., lawyers) still like Boolean queries.
  You know exactly what you are getting.
Many search systems you use are also Boolean: spotlight, email, intranet etc.
Commercially successful Boolean retrieval: Westlaw
Introduction to Information Retrieval
http://informationretrieval.org
Outline
1 Recap
2 Documents
3 Terms
  General + Non-English
  English
4 Skip pointers
5 Phrase queries
Definitions

Normalization

Normalization: Other languages
Tokenization: Recall construction of inverted index

Input:
  Friends, Romans, countrymen. So let it be with Caesar . . .
Output:
  friend roman countryman so . . .
Each token is a candidate for a postings entry. What are valid tokens to emit?
Exercises

In June, the dog likes to chase the cat in the barn. – How many word tokens? How many word types?

Why tokenization is difficult
Tokenization problems: One word or two? (or several)

Numbers

3/20/91
20/3/91
Mar 20, 1991
B-52
100.2.86.144
(800) 234-2333
800.234.2333
Older IR systems may not index numbers . . .
. . . but generally it’s a useful feature.
Google example
Chinese: No whitespace
Ambiguous segmentation in Chinese

The two characters can be treated as one word meaning ‘monk’ or as a sequence of two words meaning ‘and’ and ‘still’.
Outline
1 Recap
2 Documents
3 Terms
  General + Non-English
  English
4 Skip pointers
5 Phrase queries
Case folding

Stop words

More equivalence classing

Lemmatization

Stemming

Porter algorithm
Porter stemmer: A few rules

Rule           Example
SSES → SS      caresses → caress
IES  → I       ponies   → poni
SS   → SS      caress   → caress
S    →         cats     → cat
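A minimal Python sketch of just these four rules, applied longest matching suffix first; this is an assumption-laden illustration of one step, not the full Porter algorithm.

RULES = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]

def stem_step(word: str) -> str:
    # Apply the first (longest) suffix rule that matches; otherwise leave the word alone.
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word

for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", stem_step(w))
# caresses -> caress, ponies -> poni, caress -> caress, cats -> cat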
Three stemmers: A comparison

Sample text: Such an analysis can reveal features that are not easily
visible from the variations in the individual genes and can lead to a
picture of expression that is more biologically transparent and accessible
to interpretation

Porter stemmer: such an analysi can reveal featur that ar not easili visibl
from the variat in the individu gene and can lead to a pictur of express
that is more biolog transpar and access to interpret

Lovins stemmer: such an analys can reve featur that ar not eas vis from
th vari in th individu gen and can lead to a pictur of expres that is mor
biolog transpar and acces to interpres

Paice stemmer: such an analys can rev feat that are not easy vis from
the vary in the individ gen and can lead to a pict of express that is mor
biolog transp and access to interpret
Does stemming improve effectiveness?

Exercise: What does Google do?

Stop words
Normalization
Tokenization
Lowercasing
Stemming
Non-latin alphabets
Umlauts
Compounds
Numbers
Introduction to Information Retrieval
http://informationretrieval.org
Outline
1 Recap
2 Why ranked retrieval?
3 Term frequency
4 tf-idf weighting
Ranked retrieval

Problem with Boolean search: Feast or famine

Feast or famine: No problem in ranked retrieval

Scoring as the basis of ranked retrieval

Query-document matching scores
Take 1: Jaccard coefficient

jaccard(A, B) = |A ∩ B| / |A ∪ B|, a number between 0 and 1.
Jaccard coefficient: Example

What is the query-document match score that the Jaccard coefficient computes for:
Query: “ides of March”
Document: “Caesar died in March”
jaccard(q, d ) = 1/6
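A minimal Python sketch reproducing the 1/6 result above; whitespace tokenization and lowercasing are assumptions.

def jaccard(query: str, document: str) -> float:
    # Jaccard coefficient between the term sets of query and document.
    q, d = set(query.lower().split()), set(document.lower().split())
    return len(q & d) / len(q | d)

print(jaccard("ides of March", "Caesar died in March"))   # 0.1666... = 1/6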
What’s wrong with Jaccard?
Outline
1 Recap
2 Why ranked retrieval?
3 Term frequency
4 tf-idf weighting
Binary incidence matrix

            Anthony and   Julius   The       Hamlet   Othello   Macbeth   ...
            Cleopatra     Caesar   Tempest
Anthony     1             1        0         0        0         1
Brutus      1             1        0         1        0         0
Caesar      1             1        0         1        1         1
Cleopatra   1             0        0         0        0         0
mercy       1             0        1         1        1         1
worser      1             0        1         1        1         0
Count matrix

            Anthony and   Julius   The       Hamlet   Othello   Macbeth   ...
            Cleopatra     Caesar   Tempest
Anthony     157           73       0         0        0         1
Brutus      4             157      0         2        0         0
Cleopatra   57            0        0         0        0         0
mercy       2             0        3         8        5         8
worser      2             0        1         1        1         5
Bag of words model

Term frequency tf
Instead of raw frequency: Log frequency weighting

The log frequency weight of term t in d:

    wt,d = 1 + log10 tft,d   if tft,d > 0,   otherwise 0

tft,d → wt,d : 0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc.

Score for a document-query pair: sum over terms t in both q and d:

    tf-matching-score(q, d ) = Σt∈q∩d (1 + log tft,d )

The score is 0 if none of the query terms is present in the document.
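A minimal Python sketch of the log-frequency matching score above; the example query and document are illustrative assumptions.

import math
from collections import Counter

def tf_matching_score(query: str, document: str) -> float:
    # Sum (1 + log10 tf) over query terms that occur in the document.
    tf = Counter(document.lower().split())
    return sum(1 + math.log10(tf[t]) for t in query.lower().split() if tf[t] > 0)

print(tf_matching_score("caesar brutus", "i did enact julius caesar i was killed"))
# 1.0  (caesar occurs once: 1 + log10(1) = 1; brutus is absent)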
Exercise
Outline
1 Recap
2 Why ranked retrieval?
3 Term frequency
4 tf-idf weighting
Frequency in document vs. frequency in collection

Desired weight for rare terms

Desired weight for frequent terms

Document frequency
idf weight

We define the idf weight of term t as follows:

    idft = log10 (N / dft )

(N is the number of documents in the collection.) idft is a measure of the informativeness of the term.
Examples for idf

term        dft         idft
calpurnia   1           6
animal      100         4
sunday      1000        3
fly         10,000      2
under       100,000     1
the         1,000,000   0
Effect of idf on ranking

Collection frequency vs. Document frequency

word        collection frequency   document frequency
insurance   10440                  3997
try         10422                  8760
tf-idf weighting

    wt,d = (1 + log tft,d ) · log (N / dft )
           (tf-weight)        (idf-weight)

Best known weighting scheme in information retrieval
Alternative names: tf.idf, tf x idf
Summary: tf-idf

Assign a tf-idf weight for each term t in each document d:

    wt,d = (1 + log tft,d ) · log (N / dft )

The tf-idf weight . . .
. . . increases with the number of occurrences within a document. (term frequency)
. . . increases with the rarity of the term in the collection. (inverse document frequency)
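A minimal Python sketch of the tf-idf weight defined above; the example numbers are illustrative assumptions.

import math

def tfidf_weight(tf: int, df: int, n_docs: int) -> float:
    # (1 + log tf) * log(N/df), with weight 0 for terms absent from the document.
    if tf == 0 or df == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(n_docs / df)

# e.g. a term occurring twice in a document, in 100 of 1,000,000 documents:
print(tfidf_weight(tf=2, df=100, n_docs=1_000_000))   # (1 + 0.30) * 4 ≈ 5.2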
Exercise: Term, collection and document frequency

Quantity               Symbol   Definition
term frequency         tft,d    number of occurrences of t in d
document frequency     dft      number of documents in the collection that t occurs in
collection frequency   cft      total number of occurrences of t in the collection
Outline
1 Recap
2 Why ranked retrieval?
3 Term frequency
4 tf-idf weighting
Binary incidence matrix

            Anthony and   Julius   The       Hamlet   Othello   Macbeth   ...
            Cleopatra     Caesar   Tempest
Anthony     1             1        0         0        0         1
Brutus      1             1        0         1        0         0
Caesar      1             1        0         1        1         1
Cleopatra   1             0        0         0        0         0
mercy       1             0        1         1        1         1
worser      1             0        1         1        1         0
Count matrix

            Anthony and   Julius   The       Hamlet   Othello   Macbeth   ...
            Cleopatra     Caesar   Tempest
Anthony     157           73       0         0        0         1
Brutus      4             157      0         2        0         0
Cleopatra   57            0        0         0        0         0
mercy       2             0        3         8        5         8
worser      2             0        1         1        1         5
Binary → count → weight matrix

[tf-idf weight matrix; columns: Anthony and Cleopatra, Julius Caesar, The Tempest, Hamlet, Othello, Macbeth, ...]

Each document is now represented as a real-valued vector of tf-idf weights ∈ R|V|.
Documents as vectors

Queries as vectors

How do we formalize vector space similarity?
Why distance is a bad idea

[Plot of the documents d1: “Ranks of starving poets swell”, d2: “Rich poor gap grows”, and the query q: [rich poor] in the two-dimensional “rich”/“poor” term space]
Use angle instead of distance

From angles to cosines

Cosine

Length normalization
Cosine similarity between query and document

    cos(q, d ) = sim(q, d ) = q · d / (|q| |d|)
               = Σi=1..|V| qi di / ( √(Σi=1..|V| qi²) · √(Σi=1..|V| di²) )
Cosine for normalized vectors
Cosine similarity illustrated

[Plot of the unit vectors v(q), v(d1), v(d2), v(d3) in the “rich”/“poor” term space]
Cosine: Example

term        SaS   PaP   WH
affection   115   58    20
jealous     10    7     11
gossip      2     0     6
wuthering   0     0     38

How similar are these novels? SaS: Sense and Sensibility, PaP: Pride and Prejudice, WH: Wuthering Heights
Cosine: Example

log frequency weighting            log frequency weighting & cosine normalization

term        SaS    PaP    WH       term        SaS     PaP     WH
affection   3.06   2.76   2.30     affection   0.789   0.832   0.524
jealous     2.00   1.85   2.04     jealous     0.515   0.555   0.465
gossip      1.30   0      1.78     gossip      0.335   0.0     0.405
wuthering   0      0      2.58     wuthering   0.0     0.0     0.588

cos(SaS,PaP) ≈ 0.789 ∗ 0.832 + 0.515 ∗ 0.555 + 0.335 ∗ 0.0 + 0.0 ∗ 0.0 ≈ 0.94
cos(SaS,WH) ≈ 0.79
cos(PaP,WH) ≈ 0.69
Why do we have cos(SaS,PaP) > cos(SaS,WH)?
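A minimal Python sketch reproducing the numbers above: length-normalize the log-frequency vectors, then take dot products. The vector layout (affection, jealous, gossip, wuthering) is the assumption.

import math

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

sas = normalize([3.06, 2.00, 1.30, 0.0])   # affection, jealous, gossip, wuthering
pap = normalize([2.76, 1.85, 0.0, 0.0])
wh  = normalize([2.30, 2.04, 1.78, 2.58])

def cos(u, v):
    return sum(a * b for a, b in zip(u, v))

print(round(cos(sas, pap), 2), round(cos(sas, wh), 2), round(cos(pap, wh), 2))
# 0.94 0.79 0.69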
Computing the cosine score

CosineScore(q)
  float Scores[N] = 0
  float Length[N]
  for each query term t
  do calculate wt,q and fetch postings list for t
     for each pair (d, tft,d ) in postings list
     do Scores[d] += wt,d × wt,q
  Read the array Length
  for each d
  do Scores[d] = Scores[d] / Length[d]
  return Top K components of Scores[]
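A minimal Python sketch of CosineScore under assumptions: postings are stored as {term: {docID: w_td}} with precomputed document weights, query weights w_tq are supplied per term, and doc_lengths holds the document vector lengths.

import heapq

def cosine_score(query_weights, postings, doc_lengths, k=10):
    scores = {}
    for t, w_tq in query_weights.items():
        # Accumulate the per-term contributions, as in the inner loop above.
        for d, w_td in postings.get(t, {}).items():
            scores[d] = scores.get(d, 0.0) + w_td * w_tq
    for d in scores:
        scores[d] /= doc_lengths[d]          # cosine (length) normalization
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])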
Components of tf-idf weighting

Term frequency:
  n (natural)         tft,d
  l (logarithm)       1 + log(tft,d )
  a (augmented)       0.5 + (0.5 × tft,d ) / maxt(tft,d )
  L (log ave)         (1 + log(tft,d )) / (1 + log(avet∈d(tft,d )))

Document frequency:
  n (no)              1
  t (idf)             log (N / dft )
  p (prob idf)        max{0, log ((N − dft ) / dft )}

Normalization:
  n (none)            1
  c (cosine)          1 / √(w1² + w2² + . . . + wM²)
  u (pivoted unique)  1/u

Best known combination of weighting options. Default: no weighting.
tf-idf example
tf-idf example: lnc.ltn

Query: “best car insurance”. Document: “car insurance auto insurance”.

             query                                  document                              product
word         tf-raw  tf-wght  df      idf  weight   tf-raw  tf-wght  weight  n’lized
auto         0       0        5000    2.3  0        1       1        1       0.52       0
best         1       1        50000   1.3  1.3      0       0        0       0          0
car          1       1        10000   2.0  2.0      1       1        1       0.52       1.04
insurance    1       1        1000    3.0  3.0      2       1.3      1.3     0.68       2.04

Key to columns: tf-raw: raw (unweighted) term frequency, tf-wght: logarithmically weighted
term frequency, df: document frequency, idf: inverse document frequency, weight: the final
weight of the term in the query or document, n’lized: document weights after cosine
normalization, product: the product of final query weight and final document weight

Document length: √(1² + 0² + 1² + 1.3²) ≈ 1.92, so 1/1.92 ≈ 0.52 and 1.3/1.92 ≈ 0.68.

Final similarity score between query and document: Σi wqi · wdi = 0 + 0 + 1.04 + 2.04 = 3.08

Questions?
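A minimal Python sketch of the lnc.ltn computation above; the idf values are taken from the table, everything else (function and variable names) is an illustrative assumption.

import math

def log_tf(tf):
    return 1 + math.log10(tf) if tf > 0 else 0.0

terms = ["auto", "best", "car", "insurance"]
query_tf = {"best": 1, "car": 1, "insurance": 1}
doc_tf = {"car": 1, "insurance": 2, "auto": 1}
idf = {"auto": 2.3, "best": 1.3, "car": 2.0, "insurance": 3.0}

# ltn query weights: logarithmic tf, idf, no normalization.
q_w = {t: log_tf(query_tf.get(t, 0)) * idf[t] for t in terms}
# lnc document weights: logarithmic tf, no idf, cosine normalization.
d_w = {t: log_tf(doc_tf.get(t, 0)) for t in terms}
length = math.sqrt(sum(w * w for w in d_w.values()))
d_w = {t: w / length for t, w in d_w.items()}

print(round(sum(q_w[t] * d_w[t] for t in terms), 2))
# 3.07 (the slide rounds the normalized weights to 0.52 and 0.68 first and gets 3.08)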
Summary: Ranked retrieval in the vector space model

Take-away today
Introduction to Information Retrieval
http://informationretrieval.org
Why is ranking so important?

Empirical investigation of the effect of ranking

Importance of ranking: Summary
Outline
1 Recap
2 Why rank?
3 More on cosine
4 The complete search system
5 Implementation of ranking

Complete search system