Homework: - Let B.R B$u (,1:r) % % Diag (B$D (1:r) ) % % T (B$V (,1:r) )
Homework: - Let B.R B$u (,1:r) % % Diag (B$D (1:r) ) % % T (B$V (,1:r) )
Homework: - Let B.R B$u (,1:r) % % Diag (B$D (1:r) ) % % T (B$V (,1:r) )
• Define
a
loss
func2on
that
compares
two
matrices
(say
mean
square
error)
• b
=
svd(bellcore)
• b2
=
b$u[,1:2]
%*%
diag(b$d[1:2])
%*%
t(b$v[,1:2])
• b3
=
b$u[,1:3]
%*%
diag(b$d[1:3])
%*%
t(b$v[,1:3])
• More
generally,
for
all
possible
r
– Let
b.r
=
b$u[,1:r]
%*%
diag(b$d[1:r])
%*%
t(b$v[,1:r])
• Compute
the
loss
between
bellcore
and
b.r
as
a
func2on
of
r
• Plot
the
loss
as
a
func2on
of
r
IR
Models
• Keywords
(and
Boolean
combina2ons
thereof)
• Vector‐Space
‘‘Model’’
(Salton,
chap
10.1)
– Represent
the
query
and
the
documents
as
V‐
dimensional
vectors
∑ xi ⋅ yi
– Sort
vectors
by
sim(x, y) = cos(x, y) = i
| x |⋅ | y |
• Probabilis2c
Retrieval
Model
– (Salton,
chap
10.3)
Pr(w | rel)
– Sort
documents
by
€ score(d) = ∏
w ∈d Pr(w | rel)
Alterna2ve
IR
models
Instructor: Rada Mihalcea
Some
of
the
slides
were
adopted
from
a
course
tought
at
Cornell
University
by
William
Y.
Arms
Latent
Seman2c
Indexing
Objective
Replace indexes that use sets of index terms by indexes that use concepts.
Approach
Map the term vector space into a lower dimensional space, using singular
value decomposition.
Each dimension in the new space corresponds to a latent concept in the
original data.
Deficiencies
with
Conven2onal
Automa2c
Indexing
Synonymy: Various words and phrases refer to the same concept (lowers recall).
Polysemy: Individual words have more than one meaning (lowers precision)
Independence: No significance is given to two terms that frequently appear together
Latent semantic indexing addresses the first of these (synonymy),
and the third (dependence)
Bellcore’s
Example
h_p://en.wikipedia.org/wiki/Latent_seman2c_analysis
c1 Human machine interface for Lab ABC computer applications
c2 A survey of user opinion of computer system response time
c3 The EPS user interface management system
c4 System and human system engineering testing of EPS
c5 Relation of user-perceived response time to error measurement
m1 The generation of random, binary, unordered trees
m2 The intersection graph of paths in trees
m3 Graph minors IV: Widths of trees and well-quasi-ordering
m4 Graph minors: A survey
Term
by
Document
Matrix
"bellcore"<‐
structure(.Data
=
c(1,
1,
1,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
1,
1,
1,
1,
1,
0,
1,
0,
0,
0,
0,
1,
0,
1,
1,
0,
0,
1,
0,
0,
0,
0,
1,
0,
0,
0,
2,
0,
0,
1,
0,
0,
0,
0,
0,
0,
0,
1,
0,
1,
1,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
1,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
1,
1,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
1,
1,
1,
0,
0,
0,
0,
0,
0,
0,
0,
1,
0,
1,
1),
.Dim
=
c(
12,
9),
.Dimnames
=
list(c("human",
"interface",
"computer",
"user",
"system",
"response",
"@me",
"EPS",
"survey",
"trees",
"graph",
"minors"),
c("c1",
"c2",
"c3",
"c4",
"c5",
"m1",
"m2",
"m3",
"m4")))
help(dump)
help(source)
Query
Expansion
Query:
Find documents relevant to human computer interaction
Simple Term Matching:
Matches c1, c2, and c4
Misses c3 and c5
Large
Correl‐
a2ons
Correla2ons:
Too
Large
to
Ignore
How
to
compute
correla2ons
round(100 * cor(bellcore))
c1 c2 c3 c4 c5 m1 m2 m3 m4
c1 100 -19 0 0 -33 -17 -26 -33 -33
c2 -19 100 0 0 58 -30 -45 -58 -19
c3 0 0 100 47 0 -21 -32 -41 -41
c4 0 0 47 100 -31 -16 -24 -31 -31
c5 -33 58 0 -31 100 -17 -26 -33 -33
m1 -17 -30 -21 -16 -17 100 67 52 -17
m2 -26 -45 -32 -24 -26 67 100 77 26
m3 -33 -58 -41 -31 -33 52 77 100 56
m4 -33 -19 -41 -31 -33 -17 26 56 100
round(100 * cor(t(bellcore)))
human interface computer user system response time EPS survey trees graph minors
human 100 36 36 -38 43 -29 -29 36 -29 -38 -38 -29
interface 36 100 36 19 4 -29 -29 36 -29 -38 -38 -29
computer 36 36 100 19 4 36 36 -29 36 -38 -38 -29
user -38 19 19 100 23 76 76 19 19 -50 -50 -38
system 43 4 4 23 100 4 4 82 4 -46 -46 -35
response -29 -29 36 76 4 100 100 -29 36 -38 -38 -29
time -29 -29 36 76 4 100 100 -29 36 -38 -38 -29
EPS 36 36 -29 19 82 -29 -29 100 -29 -38 -38 -29
survey -29 -29 36 19 4 36 36 -29 100 -38 19 36
trees -38 -38 -38 -50 -46 -38 -38 -38 -38 100 50 19
graph -38 -38 -38 -50 -46 -38 -38 -38 19 50 100 76
minors -29 -29 -29 -38 -35 -29 -29 -29 36 19 76 100
Height
trees
graph
minors
survey
user
response
time
system
Cluster Dendrogram
EPS
computer
human
plot(hclust(as.dist(‐cor(t(bellcore)))))
interface
plot(hclust(as.dist(‐cor(bellcore))))
Cluster Dendrogram
0.4
0.0
m4
Height
c1
-0.4
c3
c4
-0.8
m1
c2
c5
m2
m3
Correc2ng
for
Large
Correla2ons
Thesaurus
Term
by
Doc
Matrix:
Before
&
Aeer
Thesaurus
Singular
Value
Decomposi2on
(SVD)
X = UDVT
txd txm mxm mxd
D VT
X = U
• m is the rank of X < min(t, d)
• D is diagonal
– D2 are eigenvalues (sorted in descending
order)
• U UT = I and V VT = I
– Columns of U are eigenvectors of X XT
– Columns of V are eigenvectors of XT X
• m is the rank of X < min(t, d)
• D is diagonal
– D2 are eigenvalues (sorted in descending order)
• U UT = I and V VT = I
– Columns of U are eigenvectors of X XT
– Columns of V are eigenvectors of XT X
Dimensionality
Reduc2on
txd txk kxk kxd
D VT
^ =
X U
-1.0
-0.8 0.0
Height
Height
trees
survey
computer
human
interface
m4
graph
minors
user
system
EPS
response
time
c1
c3
c4
m1
c2
c5
m2
m3
as.dist(-cor(bellcore)) as.dist(-cor(t(bellcore)))
hclust (*, "complete") hclust (*, "complete")
-1.0
Height
Height
survey
system
human
EPS
interface
user
trees
response
time
computer
graph
minors
-1.0
c2
c5
m4
c4
m1
m2
m3
c1
c3
as.dist(-cor(b2)) as.dist(-cor(t(b2)))
hclust (*, "complete") hclust (*, "complete")
SVD
B
BT
=
U
D2
UT
BT
B
=
V
D2
VT
Doc
Term
Latent
Dimension
Reduc2on
Block
Structure
round(100*cor(bellcore))
c1 c2 c3 c4 c5 m1 m2 m3 m4
c1 100 -19 0 0 -33 -17 -26 -33 -33
c2 -19 100 0 0 58 -30 -45 -58 -19
c3 0 0 100 47 0 -21 -32 -41 -41
c4 0 0 47 100 -31 -16 -24 -31 -31
c5 -33 58 0 -31 100 -17 -26 -33 -33
m1 -17 -30 -21 -16 -17 100 67 52 -17
m2 -26 -45 -32 -24 -26 67 100 77 26
m3 -33 -58 -41 -31 -33 52 77 100 56
m4 -33 -19 -41 -31 -33 -17 26 56 100
> round(100*cor(b2))
c1 c2 c3 c4 c5 m1 m2 m3 m4
c1 100 91 100 100 84 -86 -85 -85 -81
c2 91 100 91 88 99 -57 -56 -56 -50
c3 100 91 100 100 84 -86 -85 -85 -81
c4 100 88 100 100 81 -89 -88 -88 -84
c5 84 99 84 81 100 -44 -44 -43 -37
m1 -86 -57 -86 -89 -44 100 100 100 100
m2 -85 -56 -85 -88 -44 100 100 100 100
m3 -85 -56 -85 -88 -43 100 100 100 100
m4 -81 -50 -81 -84 -37 100 100 100 100
Dimension
Reduc2on
Block
Structure
round(100*cor(t(bellcore)))
human interface computer user system response time EPS survey trees graph minors
human 100 36 36 -38 43 -29 -29 36 -29 -38 -38 -29
interface 36 100 36 19 4 -29 -29 36 -29 -38 -38 -29
computer 36 36 100 19 4 36 36 -29 36 -38 -38 -29
user -38 19 19 100 23 76 76 19 19 -50 -50 -38
system 43 4 4 23 100 4 4 82 4 -46 -46 -35
response -29 -29 36 76 4 100 100 -29 36 -38 -38 -29
time -29 -29 36 76 4 100 100 -29 36 -38 -38 -29
EPS 36 36 -29 19 82 -29 -29 100 -29 -38 -38 -29
survey -29 -29 36 19 4 36 36 -29 100 -38 19 36
trees -38 -38 -38 -50 -46 -38 -38 -38 -38 100 50 19
graph -38 -38 -38 -50 -46 -38 -38 -38 19 50 100 76
minors -29 -29 -29 -38 -35 -29 -29 -29 36 19 76 100
> round(100*cor(t(b2)))
human interface computer user system response time EPS survey trees graph minors
human 100 100 93 94 99 82 82 100 -12 -85 -84 -83
interface 100 100 95 96 100 85 85 100 -7 -82 -80 -80
computer 93 95 100 100 96 98 98 93 26 -59 -57 -56
user 94 96 100 100 97 97 97 94 23 -62 -60 -59
system 99 100 96 97 100 88 88 100 -2 -79 -78 -77
response 82 85 98 97 88 100 100 83 46 -40 -38 -37
time 82 85 98 97 88 100 100 83 46 -40 -38 -37
EPS 100 100 93 94 100 83 83 100 -11 -84 -83 -82
survey -12 -7 26 23 -2 46 46 -11 100 63 65 66
trees -85 -82 -59 -62 -79 -40 -40 -84 63 100 100 100
graph -84 -80 -57 -60 -78 -38 -38 -83 65 100 100 100
minors -83 -80 -56 -59 -77 -37 -37 -82 66 100 100 100
The
term
vector
space
t3
The space has as
many dimensions as
there are terms in
the word list. d1 d2
θ t2
t1
Latent concept
vector space
• term
document
query
--- cosine > 0.9