Information Retrieval
Introducing ranked retrieval
Ch. 6
Ranked retrieval
Thus far, our queries have all been Boolean.
Documents either match or don’t.
Good for expert users with precise understanding of their needs and the
collection.
Also good for applications: Applications can easily consume 1000s of results.
Not good for the majority of users.
Most users are incapable of writing Boolean queries (or they are capable, but think it's too much work).
Most users don’t want to wade through 1000s of results.
This is particularly true of web search.
Ch. 6
When a system produces a ranked result set, large result sets are not an issue
Indeed, the size of the result set is not an issue
We just show the top k ( ≈ 10) results
We don’t overwhelm the user
We wish to return in order the documents most likely to be useful to the
searcher
How can we rank-order the documents in the collection with respect to a
query?
Assign a score – say in [0, 1] – to each document
This score measures how well document and query “match”.
Ch. 6
Jaccard doesn't consider term frequency (how many times a term occurs in a document)
Rare terms in a collection are more informative than frequent terms
Jaccard doesn’t consider this information
We need a more sophisticated way of normalizing for length
We'll use
$$\frac{|A \cap B|}{\sqrt{|A \cup B|}}$$
instead of $|A \cap B| / |A \cup B|$ (Jaccard) for length normalization.
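A minimal sketch comparing the two overlap measures, assuming simple whitespace tokenization; the query and documents are made up for illustration:

```python
def jaccard(query_terms, doc_terms):
    """|A ∩ B| / |A ∪ B| -- pure set overlap, ignores term frequency."""
    a, b = set(query_terms), set(doc_terms)
    return len(a & b) / len(a | b) if a | b else 0.0

def overlap_sqrt_norm(query_terms, doc_terms):
    """|A ∩ B| / sqrt(|A ∪ B|) -- the damped length normalization mentioned above."""
    a, b = set(query_terms), set(doc_terms)
    return len(a & b) / (len(a | b) ** 0.5) if a | b else 0.0

query = "ides of march".split()
short_doc = "caesar died in march".split()
long_doc = ("caesar was warned about the ides of march "
            "but went to the senate anyway").split()

for doc in (short_doc, long_doc):
    print(jaccard(query, doc), overlap_sqrt_norm(query, doc))
```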
Term frequency weighting
Sec. 6.2
Binary term-document incidence matrix (1 if the play contains the word, 0 otherwise):

            Antony and Cleopatra   Julius Caesar   The Tempest   Hamlet   Othello   Macbeth
Antony                1                  1              0           0        0         1
Brutus                1                  1              0           1        0         0
Caesar                1                  1              0           1        1         1
Calpurnia             0                  1              0           0        0         0
Cleopatra             1                  0              0           0        0         0
mercy                 1                  0              1           1        1         1
worser                1                  0              1           1        1         0
Term-document count matrix (number of occurrences of each word in each play):

            Antony and Cleopatra   Julius Caesar   The Tempest   Hamlet   Othello   Macbeth
Antony              157                 73              0           0        0         0
Brutus                4                157              0           1        0         0
Caesar              232                227              0           2        1         1
Calpurnia             0                 10              0           0        0         0
Cleopatra            57                  0              0           0        0         0
mercy                 2                  0              3           5        5         1
worser                2                  0              1           1        1         0
The bag of words representation
[Figure: a document represented as a bag of words, i.e., as a list of (word, count) pairs such as recommend 1, laugh 1, happy 1, ...]
Bag of words model
Vector representation doesn’t consider the ordering of words in a document
"John is quicker than Mary" and "Mary is quicker than John" have the same vectors
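A quick check of this order-invariance, using Python's collections.Counter as one possible bag-of-words representation:

```python
from collections import Counter

# Bag-of-words representation: word order is discarded, only counts remain.
d1 = "John is quicker than Mary"
d2 = "Mary is quicker than John"

bow1 = Counter(d1.lower().split())
bow2 = Counter(d2.lower().split())

print(bow1)            # Counter({'john': 1, 'is': 1, 'quicker': 1, 'than': 1, 'mary': 1})
print(bow1 == bow2)    # True: identical vectors despite different word order
```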
Log-frequency weighting
The log frequency weight of term t in d is
$$w_{t,d} = \begin{cases} 1 + \log_{10}\mathrm{tf}_{t,d}, & \text{if } \mathrm{tf}_{t,d} > 0 \\ 0, & \text{otherwise} \end{cases}$$
0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc.
Score for a document-query pair: sum over terms $t$ in both $q$ and $d$:
$$\text{score}(q,d) = \sum_{t \in q \cap d} \left(1 + \log_{10}\mathrm{tf}_{t,d}\right)$$
The score is 0 if none of the query terms is present in the document.
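A small sketch of log-frequency weighting and the resulting score; the document and query below are made up for illustration:

```python
import math
from collections import Counter

def log_tf(tf):
    """Log-frequency weight: 1 + log10(tf) if tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

def score(query_terms, doc_terms):
    """Sum of log-tf weights over terms shared by query and document."""
    tf = Counter(doc_terms)
    return sum(log_tf(tf[t]) for t in set(query_terms) if t in tf)

doc = "to be or not to be that is the question".split()
print(score(["be", "question", "sleep"], doc))  # 'sleep' is absent, contributes 0
```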
(Inverse) Document frequency weighting
Sec. 6.2.1
Document frequency
Rare terms are more informative than frequent terms
Recall stop words
Consider a term in the query that is rare in the collection (e.g., arachnocentric)
A document containing this term is very likely to be relevant to the query arachnocentric
→ We want a high weight for rare terms like arachnocentric.
Sec. 6.2.1
idf weight
The document frequency $\mathrm{df}_t$ is the number of documents in the collection that contain $t$. The idf (inverse document frequency) weight of $t$ is
$$\mathrm{idf}_t = \log_{10}(N/\mathrm{df}_t)$$
where $N$ is the number of documents in the collection.
We use $\log_{10}(N/\mathrm{df}_t)$ instead of $N/\mathrm{df}_t$ to "dampen" the effect of idf.
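A small sketch of the idf computation; the collection size N = 1,000,000 and the document frequencies below are illustrative values, not data from this deck:

```python
import math

def idf(N, df):
    """idf_t = log10(N / df_t); rarer terms get higher weights."""
    return math.log10(N / df)

N = 1_000_000  # assumed collection size for illustration
for term, df in [("calpurnia", 1), ("animal", 100),
                 ("sunday", 1_000), ("fly", 10_000),
                 ("under", 100_000), ("the", 1_000_000)]:
    print(f"{term:10s} df={df:>9,d}  idf={idf(N, df):.0f}")
```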
Effect of idf on ranking
Question: Does idf have an effect on ranking for one-term queries, like "iPhone"?
idf has no effect on ranking one-term queries
idf affects the ranking of documents for queries with at least two terms
For the query capricious person, idf weighting makes occurrences of capricious
count for much more in the final document ranking than occurrences of person.
Sec. 6.2.1
Which word is a better search term (and should get a higher weight)?
tf-idf weighting
Sec. 6.2.2
The tf-idf weight of a term is the product of its tf weight and its idf weight:
$$w_{t,d} = (1 + \log_{10}\mathrm{tf}_{t,d}) \times \log_{10}(N/\mathrm{df}_t)$$
It increases with the number of occurrences of the term within a document and with the rarity of the term in the collection.
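A sketch of tf-idf scoring under these definitions; the collection size, document frequencies, and term counts below are invented for illustration:

```python
import math

def tf_idf(tf, df, N):
    """tf-idf weight: (1 + log10 tf) * log10(N / df); 0 if the term is absent."""
    if tf == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(N / df)

def score(query_terms, doc_tf, df, N):
    """Score(q, d): sum of tf-idf weights over terms occurring in both q and d."""
    return sum(tf_idf(doc_tf.get(t, 0), df[t], N)
               for t in set(query_terms) if t in doc_tf and t in df)

# Hypothetical collection statistics and per-document counts.
N = 1_000_000
df = {"caesar": 10_000, "calpurnia": 10}
doc_tf = {"caesar": 3, "calpurnia": 1}
print(score(["caesar", "calpurnia"], doc_tf, df, N))
```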
Sec. 6.3
Documents as vectors
Each document is now represented as a $|V|$-dimensional vector of weights: terms are the axes of the space, and documents are points (vectors) in this space.
Queries as vectors
Key idea 1: Do the same for queries: represent them as vectors in the space
Key idea 2: Rank documents according to their proximity to the query in this
space
proximity = similarity of vectors
proximity ≈ inverse of distance
Recall: We do this because we want to get away from the you’re-either-in-or-
out Boolean model
Instead: rank more relevant documents higher than less relevant documents
Sec. 6.3
Length normalization
Dividing a vector by its L2 norm, $\|x\|_2 = \sqrt{\sum_i x_i^2}$, makes it a unit (length) vector (on the surface of the unit hypersphere)
Effect on two documents d and d′ (d appended to itself): they have identical vectors after length-normalization.
Long and short documents now have comparable weights
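A minimal sketch of L2 length normalization; the vectors are toy weight vectors, and appending d to itself is simulated by doubling every component (as happens with raw tf counts):

```python
import math

def l2_normalize(vec):
    """Divide a vector by its L2 norm, giving a unit-length vector."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec

d = [3.0, 4.0, 0.0]
d_doubled = [2 * x for x in d]       # stands in for d appended to itself
print(l2_normalize(d))               # [0.6, 0.8, 0.0]
print(l2_normalize(d_doubled))       # identical after normalization
```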
Sec. 6.3
cosine(query,document)
$$\cos(\vec{q},\vec{d}) \;=\; \frac{\vec{q}\cdot\vec{d}}{|\vec{q}|\,|\vec{d}|} \;=\; \frac{\vec{q}}{|\vec{q}|}\cdot\frac{\vec{d}}{|\vec{d}|} \;=\; \frac{\sum_{i=1}^{|V|} q_i d_i}{\sqrt{\sum_{i=1}^{|V|} q_i^2}\;\sqrt{\sum_{i=1}^{|V|} d_i^2}}$$
The numerator is the dot product of $\vec{q}$ and $\vec{d}$; the middle form is the dot product of the corresponding unit vectors. Here $q_i$ is the tf-idf weight of term $i$ in the query and $d_i$ is the tf-idf weight of term $i$ in the document.
For length-normalized vectors, cosine similarity is simply the dot product (or scalar product):
$$\cos(\vec{q},\vec{d}) = \vec{q}\cdot\vec{d} = \sum_{i=1}^{|V|} q_i d_i$$
for $\vec{q}$, $\vec{d}$ length-normalized.
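A small cosine-similarity sketch over toy weight vectors, checking that normalizing first and then taking the dot product gives the same value:

```python
import math

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def cosine(q, d):
    """cos(q, d) = (q · d) / (|q| |d|)."""
    return sum(qi * di for qi, di in zip(q, d)) / (norm(q) * norm(d))

def normalize(v):
    n = norm(v)
    return [x / n for x in v]

q = [1.0, 0.0, 1.0]   # toy query weight vector
d = [2.0, 1.0, 0.0]   # toy document weight vector

# For length-normalized vectors the dot product alone gives the cosine.
dot_of_units = sum(qi * di for qi, di in zip(normalize(q), normalize(d)))
print(round(cosine(q, d), 4), round(dot_of_units, 4))  # identical values
```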
Cosine similarity illustrated
Sec. 6.3
Cosine similarity example: how similar are the novels SaS (Sense and Sensibility), PaP (Pride and Prejudice), and WH (Wuthering Heights)?

Term frequencies (counts):

term         SaS   PaP   WH
affection    115    58   20
jealous       10     7   11
gossip         2     0    6
wuthering      0     0   38

After log-frequency weighting and length normalization (so each document is a unit vector):
cos(SaS,PaP) ≈ 0.789 × 0.832 + 0.515 × 0.555 + 0.335 × 0.0 + 0.0 × 0.0 ≈ 0.94
cos(SaS,WH) ≈ 0.79
cos(PaP,WH) ≈ 0.69
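The three cosines above can be reproduced from the raw counts; a sketch assuming log10 term-frequency weighting and L2 normalization as described above:

```python
import math

# Term frequencies for the three novels, as in the example above.
counts = {
    "SaS": {"affection": 115, "jealous": 10, "gossip": 2, "wuthering": 0},
    "PaP": {"affection": 58,  "jealous": 7,  "gossip": 0, "wuthering": 0},
    "WH":  {"affection": 20,  "jealous": 11, "gossip": 6, "wuthering": 38},
}
terms = ["affection", "jealous", "gossip", "wuthering"]

def log_tf_vector(doc):
    """Log-frequency weights, one component per term."""
    return [1 + math.log10(tf) if tf > 0 else 0.0 for tf in (doc[t] for t in terms)]

def normalize(vec):
    n = math.sqrt(sum(x * x for x in vec))
    return [x / n for x in vec]

vecs = {name: normalize(log_tf_vector(doc)) for name, doc in counts.items()}

def cosine(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))  # unit vectors: dot product suffices

print(round(cosine(vecs["SaS"], vecs["PaP"]), 2))  # ≈ 0.94
print(round(cosine(vecs["SaS"], vecs["WH"]), 2))   # ≈ 0.79
print(round(cosine(vecs["PaP"], vecs["WH"]), 2))   # ≈ 0.69
```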