Lecture 3
[Figure: flow diagram with stages Query Formulation, Query, Selection, Documents, Examination, Documents, and Delivery, and feedback loops labeled system discovery, vocabulary discovery, concept discovery, document discovery, and source reselection]
What is a model?
A model is a construct designed to help us
understand a complex system
A particular way of “looking at things”
Models inevitably make simplifying assumptions
What are the limitations of the model?
Different types of models:
Conceptual models
Physical analog models
Mathematical models
…
The Central Problem in IR
[Figure: diagram with two sides, Information Seeker and Authors. Each side has Concepts and a Representation Function; the remaining labels are Index, Comparison Function, and Hits]
Models
Boolean model
Based on the notion of sets
Documents are retrieved only if they satisfy Boolean
conditions specified in the query
Does not impose a ranking on retrieved documents
Exact match
Vector space model
Based on geometry, the notion of vectors in high
dimensional space
Documents are ranked based on their similarity to the
query (ranked retrieval)
Best/partial match
Probabilistic Model (Language model)
Based on the notion of probabilities and processes for
generating text
Documents are ranked based on the probability that
they generated the query
Best/partial match
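Ranking by "the probability that they generated the query" can be sketched as a unigram query-likelihood score. This is only an illustration under stated assumptions: the toy documents, the function name `log_query_likelihood`, and the add-one (Laplace) smoothing are not from the slides, and other smoothing methods are commonly used in practice.

```python
import math
from collections import Counter

def log_query_likelihood(query, doc_text, vocab_size):
    """Log of P(query | document model) under a unigram model.

    Assumption: add-one smoothing, so unseen query terms get a small
    non-zero probability instead of zeroing out the whole score.
    """
    counts = Counter(doc_text.lower().split())
    doc_length = sum(counts.values())
    return sum(
        math.log((counts[term] + 1) / (doc_length + vocab_size))
        for term in query.lower().split()
    )

# Hypothetical toy collection.
docs = {
    "d1": "quick brown fox",
    "d2": "lazy brown dog",
    "d3": "quick quick fox",
}
vocab = {term for text in docs.values() for term in text.split()}

# Rank documents by decreasing likelihood of generating the query.
ranking = sorted(
    docs,
    key=lambda d: log_query_likelihood("quick fox", docs[d], len(vocab)),
    reverse=True,
)
print(ranking)
```

Note that d2 shares no term with the query yet still receives a score, which is what makes this a best/partial match model rather than an exact match one.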
Representing Text
[Figure: diagram with two sides, Query and Documents. Each side has a Representation Function; the remaining labels are Index, Comparison Function, and Hits]
How do we represent text?
How do we represent the complexities of
language?
Keeping in mind that computers don’t “understand”
documents or queries
Simple, yet effective approach: “bag of words”
Treat all the words in a document as index terms for
that document
Assign a “weight” to each term based on its
“importance”
Disregard order, structure, meaning, etc. of the words
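A minimal sketch of the bag-of-words idea, assuming the simplest possible weight (the raw count of each term) and a naive tokenizer that keeps only runs of letters. The function name `bag_of_words` is hypothetical, and real systems typically use more refined weights.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, split it into runs of letters, and count each term."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)

bag = bag_of_words("The quick brown fox jumped over the lazy dog")
print(bag["the"])  # "the" occurs twice in the sentence

# Word order is discarded: both sentences yield the same bag.
same = bag_of_words("dog bites man") == bag_of_words("man bites dog")
print(same)
```

The last comparison shows what the representation gives up: two sentences with opposite meanings become indistinguishable.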
[Figure: diagram of sets A, B, and C]
Logic Tables
NOT B
B=0: 1
B=1: 0

A OR B
       B=0  B=1
A=0     0    1
A=1     1    1

A AND B
       B=0  B=1
A=0     0    0
A=1     0    1

A NOT B (= A AND NOT B)
       B=0  B=1
A=0     0    0
A=1     1    0
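The logic tables map directly onto set operations over the documents that contain each term. The sketch below assumes a hypothetical three-document collection and a simple in-memory inverted index; it is an illustration, not a full retrieval system.

```python
# Hypothetical toy collection: document id -> text.
docs = {
    1: "quick brown fox",
    2: "lazy brown dog",
    3: "quick lazy fox",
}

# Inverted index: term -> set of ids of documents containing that term.
index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

all_ids = set(docs)
a = index.get("quick", set())  # documents containing term A
b = index.get("lazy", set())   # documents containing term B

a_or_b = a | b        # A OR B: union
a_and_b = a & b       # A AND B: intersection
a_not_b = a - b       # A NOT B: set difference
not_b = all_ids - b   # NOT B: complement within the collection

print(a_or_b, a_and_b, a_not_b, not_b)
```

Every result is an unordered set: a document either satisfies the condition or it does not, which is why this model imposes no ranking.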
Boolean Retrieval
Advantages
Results are predictable, relatively easy to explain
Many different features can be incorporated
Efficient processing since many documents can be
eliminated from search
Disadvantages
Effectiveness depends entirely on user
Simple queries usually don’t work well
Complex queries are difficult
Vector Representation
“Bags of words” can be represented as vectors
Why? Computational efficiency, ease of manipulation
Geometric metaphor: “arrows”
A vector is a set of values recorded in any
consistent order
“The quick brown fox jumped over the lazy dog’s back”
[1 1 1 1 1 1 1 1 2]
1st position corresponds to “back”
2nd position corresponds to “brown”
3rd position corresponds to “dog”
4th position corresponds to “fox”
5th position corresponds to “jump”
6th position corresponds to “lazy”
7th position corresponds to “over”
8th position corresponds to “quick”
9th position corresponds to “the”
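The mapping from sentence to vector can be reproduced in a few lines. This sketch assumes the alphabetical vocabulary listed above and uses a small hand-written lookup table (`NORMALIZE`, hypothetical) in place of a real stemmer, since the slide maps "jumped" to "jump" and the possessive form of "dog" to "dog".

```python
import re

# Vocabulary in the alphabetical order used on the slide.
VOCAB = ["back", "brown", "dog", "fox", "jump", "lazy", "over", "quick", "the"]

# Hypothetical stand-in for a stemmer: surface form -> index term.
NORMALIZE = {"jumped": "jump", "dog's": "dog"}

def to_vector(text):
    """Count how often each vocabulary term occurs, in vocabulary order."""
    tokens = re.findall(r"[a-z']+", text.lower())
    terms = [NORMALIZE.get(token, token) for token in tokens]
    return [terms.count(term) for term in VOCAB]

vector = to_vector("The quick brown fox jumped over the lazy dog's back")
print(vector)  # [1, 1, 1, 1, 1, 1, 1, 1, 2]
```

The only value above 1 is the last position, because "the" is the only term that appears twice.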
Representing Documents
[Figure: example documents, Document 1 and Document 2]
Vector Space Model
Advantages
Simple computational framework for ranking
Any similarity measure or term weighting scheme could
be used
Disadvantages
Assumption of term independence
No predictions about techniques for effective ranking
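As one concrete instance of this ranking framework, the sketch below scores documents against a query with cosine similarity over raw term counts. Both choices are assumptions made for illustration (the slides note that any similarity measure or weighting scheme could be used), and the documents and function names are hypothetical.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(u[term] * v[term] for term in u if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank(query, docs):
    """Return document ids ordered by decreasing similarity to the query."""
    q = Counter(query.lower().split())
    scored = [
        (cosine(q, Counter(text.lower().split())), doc_id)
        for doc_id, text in docs.items()
    ]
    # Sort by score descending, breaking ties by document id.
    return [doc_id for _, doc_id in sorted(scored, key=lambda p: (-p[0], p[1]))]

docs = {
    "d1": "quick brown fox",
    "d2": "lazy brown dog",
    "d3": "quick quick fox",
}
ranking = rank("quick fox", docs)
print(ranking)
```

The dot product treats each term as its own dimension, unrelated to every other term, which is the term-independence assumption listed under the disadvantages.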
Probabilistic Model (Language model)
Robertson (1977)
“If a reference retrieval system’s response to each
request is a ranking of the documents in the collection
in order of decreasing probability of relevance to the
user who submitted the request,
where the probabilities are estimated as accurately as
possible on the basis of whatever data have been
made available to the system for this purpose,
the overall effectiveness of the system to its user will
be the best that is obtainable on the basis of those
data.”
IR as Classification