Term Weighting
By Abdo Ababor
June, 2021
Bags of Words
Binary Weights
Non-binary weights
Term Frequency
Document Frequency
Inverse Document Frequency
TF*IDF
Similarity Measure
• Euclidean distance
• Inner Product
• Cosine similarity
Recall the previous lecture
What is a crawler?
How does it work?
Crawler
Scale of the Web: trillions of pages distributed among billions of hosts.
Another consideration is the volume and variety of queries that commercial Web search engines receive.
Many pages may change daily or hourly.
Feeds: document feeds are a mechanism for accessing a real-time stream of documents. E.g. a news feed is a constant stream of news stories and updates.
Some content, such as news, blogs, or video, is distributed through web feeds. The reader monitors those feeds and provides new content when it arrives. Radio and television feeds are also used in some search applications, where the "documents" contain automatically segmented audio and video streams, together with associated text from closed captions or speech recognition.
Crawler
Conversion: the documents found by a crawler or provided by a feed are rarely in plain text; they come in formats such as HTML, XML, Adobe PDF, Microsoft Word, PPT, and so on. Search engines require that these documents be converted into a consistent text-plus-metadata format.
Document data are then stored in the search engine's database.
Bags of Words (BOW)
D_i = (w_{d_i,1}, w_{d_i,2}, ..., w_{d_i,n})
Q = (w_{q,1}, w_{q,2}, ..., w_{q,n})
Terms
Terms are usually stems. Terms can also be phrases, such as
"Computer Science", "World Wide Web", etc.
Documents and queries are represented as vectors or
“bags of words” (BOW).
Each vector holds a place for every term in the collection.
Position 1 corresponds to term 1, position 2 to term 2, ..., and position n to term n.
D_i = (w_{d_i,1}, w_{d_i,2}, ..., w_{d_i,n})
Q = (w_{q,1}, w_{q,2}, ..., w_{q,n})
w = 0 if a term is absent.
Documents are represented by binary weights or Non-binary
weighted vectors of terms.
Document Collection
A collection of n documents can be represented in the vector
space model by a term-document matrix.
An entry in the matrix corresponds to the “weight” of a term in
the document; zero means the term has no significance in the
document or it simply doesn’t exist in the document.
      T1   T2   ...  Tt
D1    w11  w21  ...  wt1
D2    w12  w22  ...  wt2
:     :    :         :
Dn    w1n  w2n  ...  wtn
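As a small illustration, here is a minimal Python sketch that builds such a term-document matrix; the toy documents and terms are invented for this example:

    from collections import Counter

    # Toy collection (documents invented for illustration).
    docs = {
        "D1": "information retrieval retrieval system",
        "D2": "database system",
        "D3": "information database",
    }

    # Vocabulary: every distinct term in the collection, in a fixed order.
    vocab = sorted({t for text in docs.values() for t in text.split()})

    # Term-document matrix: one row per document, one column per term.
    # An entry is the raw count of the term in the document (0 = absent).
    matrix = {}
    for name, text in docs.items():
        counts = Counter(text.split())
        matrix[name] = [counts[term] for term in vocab]

    print(vocab)
    for name, row in matrix.items():
        print(name, row)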
Binary Weights
• Only the presence (1) or absence (0) of a term is included in the vector.
• The binary formula gives every word that appears in a document equal relevance.
• It can be useful when frequency is not important.

docs  t1  t2  t3
D1    1   0   1
D2    1   0   0
D3    0   1   1
D4    1   0   0
D5    1   1   1
D6    1   1   0
D7    0   1   0
D8    0   1   0
D9    0   0   1
D10   0   1   1
D11   1   0   1
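A minimal sketch of binary weighting in Python, assuming raw term counts are already available (the numbers below are invented):

    # Raw term counts for a few toy documents (invented for illustration).
    counts = {
        "D1": [2, 0, 1],
        "D2": [1, 1, 0],
        "D3": [0, 3, 1],
    }

    # Binary weights: 1 if the term appears in the document at all, 0 otherwise.
    binary = {doc: [1 if c > 0 else 0 for c in row] for doc, row in counts.items()}

    for doc, row in binary.items():
        print(doc, row)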
Document Frequency
Reduce the tf weight of a term by a factor that grows with its collection frequency.
Inverse Document Frequency (IDF)
IDF measures the rarity of a term in the collection.
IDF is a measure of the general importance of a term: it is the inverse of the document frequency.
It diminishes the weight of terms that occur very frequently in the
collection and increases the weight of terms that occur rarely.
Gives full weight to terms that occur in one document only.
Gives lowest weight to terms that occur in all documents.
Terms that appear in many different documents are less indicative of
overall topic.
idf_i = inverse document frequency of term i = log2(N / df_i), where N is the total number of documents.
Inverse Document Frequency
• E.g. given a collection of 1000 documents and the document frequencies below, compute the IDF of each word:
Word N DF IDF
the 1000 1000 0
some 1000 100 3.322
car 1000 10 6.644
merge 1000 1 9.966
• IDF provides high values for rare words and low values for
common words.
• IDF is an indication of a term’s discrimination power.
• The log is used to dampen the effect relative to tf.
• What is the difference between document frequency and corpus frequency?
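The values in this table can be reproduced with a few lines of Python, using the log2(N/df) formula above:

    import math

    N = 1000  # total number of documents in the collection

    # Document frequencies taken from the table above.
    df = {"the": 1000, "some": 100, "car": 10, "merge": 1}

    for word, df_i in df.items():
        idf = math.log2(N / df_i)
        print(f"{word:6s} df={df_i:4d} idf={idf:.3f}")
    # Prints idf = 0.000, 3.322, 6.644, 9.966 respectively.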
TF*IDF Weighting
The most widely used term-weighting scheme is tf*idf:
w_ij = tf_ij × idf_i = tf_ij × log2(N / df_i)
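A minimal sketch of this weighting scheme in Python (the example numbers are invented):

    import math

    def tf_idf(tf_ij, df_i, N):
        """w_ij = tf_ij * log2(N / df_i)."""
        return tf_ij * math.log2(N / df_i)

    # E.g. a term occurring 3 times in a document and in 10 of 1000 documents:
    print(tf_idf(3, 10, 1000))  # 3 * log2(100) = 3 * 6.644... ≈ 19.93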
Another Example
Consider a document containing 100 words in which the word cow appears 3 times. Now assume we have 10 million documents and cow appears in 1,000 of these.
Using the normalized term frequency, tf = 3/100 = 0.03 and idf = log2(10,000,000 / 1,000) = log2(10,000) ≈ 13.29, so tf*idf ≈ 0.03 × 13.29 ≈ 0.40.
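The same arithmetic as a quick Python check:

    import math

    tf = 3 / 100                          # normalized term frequency of "cow"
    idf = math.log2(10_000_000 / 1_000)   # log2(10,000) ≈ 13.29
    print(tf * idf)                       # ≈ 0.40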
[Figure: documents d4 and d5 plotted as vectors in term space]
Postulate: Documents that are “close together”
in the vector space talk about the same things and are
more similar than others.
Similarity Measure
If d1 is near d2, then d2 is near d1.
If d1 is near d2, and d2 is near d3, then d1 is not far from d3.
Dot Product
Geometrically, the dot product of two vectors is defined as the product of their magnitudes and the cosine of the angle between them: d_j · q = |d_j| |q| cos θ.
Inner Product
The similarity between the document d_j and the query q can be computed as the vector inner product:
sim(d_j, q) = d_j · q = Σ_{i=1}^{n} (w_ij · w_iq)
where wij is the weight of term i in document j and wiq is the
weight of term i in the query q
For binary vectors, the inner product is the number of matched
query terms in the document (size of intersection).
For weighted term vectors, it is the sum of the products of the
weights of the matched terms.
It measures how many terms are matched, but not how many terms are not matched.
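A minimal sketch of the inner-product similarity in Python, assuming both vectors list their weights in the same term order:

    def inner_product(d, q):
        """sim(d, q) = sum over i of w_id * w_iq."""
        return sum(w_d * w_q for w_d, w_q in zip(d, q))

    # For binary vectors this counts the matched query terms:
    print(inner_product([1, 0, 1], [1, 1, 1]))  # 2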
Inner Product -- Examples
Binary weights:
Size of vector = size of vocabulary = 7

   Retrieval  Database  Term  Computer  Text  Manage  Data
D  1          1         1     0         1     1       0
Q  1          0         1     0         0     1       1

sim(D, Q) = 3
• Term Weighted:
Retrieval Database Architecture
D1 2 3 5
D2 3 7 1
Q 0 0 2
sim(D1 , Q) = 2*0 + 3*0 + 5*2 = 10
sim(D2 , Q) = 3*0 + 7*0 + 1*2 = 2
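Running the same computation in Python reproduces these scores (the helper is redefined here for completeness):

    def inner_product(d, q):
        return sum(w_d * w_q for w_d, w_q in zip(d, q))

    D1 = [2, 3, 5]   # weights for Retrieval, Database, Architecture
    D2 = [3, 7, 1]
    Q  = [0, 0, 2]

    print(inner_product(D1, Q))  # 2*0 + 3*0 + 5*2 = 10
    print(inner_product(D2, Q))  # 3*0 + 7*0 + 1*2 = 2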
Inner Product: Example 1
[Figure: documents d1-d7 positioned among keyword regions k1, k2, k3]

     k1  k2  k3   q · dj
d1   1   0   1    2
d2   1   0   0    1
d3   0   1   1    2
d4   1   0   0    1
d5   1   1   1    3
d6   1   1   0    2
d7   0   1   0    1
q    1   1   1
Inner Product: Exercise
[Figure: documents d1-d7 positioned among keyword regions k1, k2, k3]

     k1  k2  k3   q · dj
d1   1   0   1    ?
d2   1   0   0    ?
d3   0   1   1    ?
d4   1   0   0    ?
d5   1   1   1    ?
d6   1   1   0    ?
d7   0   1   0    ?
q    1   2   3
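To check your answers, the same inner-product computation can be run over the rows of the table:

    docs = {
        "d1": [1, 0, 1], "d2": [1, 0, 0], "d3": [0, 1, 1], "d4": [1, 0, 0],
        "d5": [1, 1, 1], "d6": [1, 1, 0], "d7": [0, 1, 0],
    }
    q = [1, 2, 3]

    for name, d in docs.items():
        print(name, sum(w_d * w_q for w_d, w_q in zip(d, q)))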
Cosine similarity
Measures the similarity between two vectors by the cosine of the angle θ between them:

sim(d_j, q) = (d_j · q) / (|d_j| |q|) = Σ_{i=1}^{n} w_{i,j} w_{i,q} / (sqrt(Σ_{i=1}^{n} w_{i,j}²) · sqrt(Σ_{i=1}^{n} w_{i,q}²))

Or, between two documents:

sim(d_j, d_k) = (d_j · d_k) / (|d_j| |d_k|) = Σ_{i=1}^{n} w_{i,j} w_{i,k} / (sqrt(Σ_{i=1}^{n} w_{i,j}²) · sqrt(Σ_{i=1}^{n} w_{i,k}²))
[Figure: document and query vectors plotted in two dimensions; the nearly parallel pair has cos θ2 = 0.98]
Terms      D1     D2     D3
affection  0.996  0.993  0.847
jealous    0.087  0.120  0.466
gossip     0.017  0.000  0.254
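A minimal sketch of cosine similarity in Python, applied to the three document vectors from the table above:

    import math

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

    # Columns of the table: (affection, jealous, gossip) weights per document.
    D1 = [0.996, 0.087, 0.017]
    D2 = [0.993, 0.120, 0.000]
    D3 = [0.847, 0.466, 0.254]

    print(cosine(D1, D2))  # close to 1: D1 and D2 point in nearly the same direction
    print(cosine(D1, D3))  # smaller: D3 leans more toward "jealous" and "gossip"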
Cosine Similarity vs. Inner Product
Cosine similarity measures the cosine of the angle
between two vectors.
It is the inner product normalized by the vector lengths.
CosSim(d_j, q) = (d_j · q) / (|d_j| |q|) = Σ_{i=1}^{t} (w_ij · w_iq) / (sqrt(Σ_{i=1}^{t} w_ij²) · sqrt(Σ_{i=1}^{t} w_iq²))

InnerProduct(d_j, q) = d_j · q
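The difference is easy to see in code: a sketch with an invented example in which one document is a scaled copy of another, so the inner product changes while the cosine does not:

    import math

    def inner(a, b):
        return sum(x * y for x, y in zip(a, b))

    def cosine(a, b):
        return inner(a, b) / (math.sqrt(inner(a, a)) * math.sqrt(inner(b, b)))

    q  = [1, 1, 0]
    d1 = [1, 2, 0]
    d2 = [2, 4, 0]  # same direction as d1, but twice as long

    print(inner(d1, q), inner(d2, q))    # 3 vs 6: the inner product doubles
    print(cosine(d1, q), cosine(d2, q))  # equal: cosine ignores vector length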