Chapter 1: Boolean Retrieval
● The meaning of the term information retrieval can be very broad. As an academic field of study, IR is about
finding material (usually documents) of an unstructured nature that satisfies an information need from within
large collections.
● The term unstructured data refers to data that does not have a clear, semantically overt,
easy-for-a-computer structure.
● IR systems can be distinguished by the scale at which they operate:
○ Web search
○ Personal: runs on an individual's own machine, but must handle a broad variety of document types.
○ Enterprise, institutional & domain-specific: Here documents are typically
stored on centralized file systems.
● Grepping through text means doing some type of linear scan through the documents for each query. It
allows useful wildcard pattern matching through the use of regular expressions. However, grepping falls
short when we need:
○ To process large amounts of data quickly.
○ To allow more flexible forms of matching, such as a NEAR operator.
○ To allow ranked retrieval, where the best answers are returned first.
● The way to avoid linearly scanning the texts for each query is to index the documents in
advance.
For each document we record which terms it contains; the terms are the indexed units and are
usually words. Collecting these binary term vectors over all the documents in the collection gives a
term-document incidence matrix.
Once we have the matrix, Boolean queries can be answered by combining the relevant term
vectors with bitwise AND, OR, and NOT, as in the sketch below.
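A minimal sketch of this idea, assuming a toy three-document collection (the documents and the query are illustrative, not from the notes); each matrix row is stored as an integer bitmask so Boolean operators map directly onto bitwise ones:

    # Toy term-document incidence matrix with Boolean query evaluation.
    # Rows are terms; bit i of a row is 1 iff the term occurs in document i.
    docs = [
        "Brutus killed Caesar",         # doc 0
        "Caesar was ambitious",         # doc 1
        "Calpurnia dreamed of Caesar",  # doc 2
    ]

    rows = {}  # term -> integer bitmask over documents
    for i, text in enumerate(docs):
        for term in text.lower().split():
            rows[term] = rows.get(term, 0) | (1 << i)

    all_docs = (1 << len(docs)) - 1  # mask with one bit per document

    # Query: Brutus AND Caesar AND NOT Calpurnia
    answer = rows["brutus"] & rows["caesar"] & (all_docs & ~rows["calpurnia"])
    print([i for i in range(len(docs)) if answer & (1 << i)])  # -> [0]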
● Documents refer to whatever units we have decided to build a retrieval system over.
A collection is the set of all the documents we are going to use; it is sometimes also
referred to as a corpus.
● Ad hoc retrieval: Most standard IR task. Here, we try to provide documents from the
collection that are relevant to an arbitrary user information need, communicated as a
query.
● An information need is a topic about which the user desires to know more, and is
differentiated from a query, which is what the user conveys to the computer in an
attempt to communicate the information need.
● A document is relevant if it is one that the user perceives as containing information of
value with respect to their personal information need.
● Effectiveness of a system is measured by: Precision (what fraction of the returned
results are relevant to the information need?) and Recall (what fraction of the relevant
documents in the collection were returned by the system?). A worked example follows below.
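A quick worked example with made-up numbers (illustrative, not from the notes): if the system returns 20 documents of which 15 are relevant, and the whole collection contains 25 relevant documents, then precision = 15/20 = 0.75 and recall = 15/25 = 0.6.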
● However, for any realistically sized collection such a matrix is far too large to build, and it is
extremely sparse (almost all of its entries are 0). Hence we come to the idea of the inverted index.
○ We keep a dictionary of terms (sometimes also referred to as the vocabulary or
lexicon), and for each term we keep a list that records which documents the
term occurs in. Each item in the list - which records that a term appeared in a
document - is called a posting, and the list itself is called a postings list. A small
construction-and-query sketch follows below.
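A minimal sketch, assuming a toy in-memory collection (the documents and function names are illustrative); postings lists are kept sorted by document ID, and two lists are intersected with the standard merge-style walk:

    # Build a tiny inverted index: term -> sorted postings list of doc IDs.
    docs = {
        1: "new home sales top forecasts",
        2: "home sales rise in july",
        3: "increase in home sales in july",
    }

    index = {}
    for doc_id in sorted(docs):
        for term in set(docs[doc_id].lower().split()):
            index.setdefault(term, []).append(doc_id)  # doc IDs are appended in sorted order

    def intersect(p1, p2):
        # Merge-style intersection of two sorted postings lists.
        answer, i, j = [], 0, 0
        while i < len(p1) and j < len(p2):
            if p1[i] == p2[j]:
                answer.append(p1[i]); i += 1; j += 1
            elif p1[i] < p2[j]:
                i += 1
            else:
                j += 1
        return answer

    # Query: home AND july
    print(intersect(index["home"], index["july"]))  # -> [2, 3]

The merge walk works because both postings lists are sorted, so the intersection takes time linear in the combined length of the two lists.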
-------------------------------------------------------------------------------------------------------------------------------
4. Index Construction
-------------------------------------------------------------------------------------------------------------------------------
5. Index Compression
● Why compression?
○ Increased usage of caching is possible.
○ Faster transfer of data from disk to main memory.
5.1 Statistical properties of terms in information retrieval
● The compression techniques we describe in the remainder of this chapter are lossless,
that is, all information is preserved. Better compression ratios can be achieved with
lossy compression (case folding, stemming, stop word removal), which discards some
information. A small illustration of the lossy side follows below.
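As a small illustration of what the lossy techniques throw away (toy sentence and stop word list, purely illustrative): case folding and stop word removal shrink the vocabulary, but the original casing and the removed words cannot be recovered afterwards.

    # Lossy preprocessing: case folding + stop word removal reduce the number of
    # distinct terms, at the cost of discarding information.
    text = "The Romans and the countrymen heard the news in Rome"
    stop_words = {"the", "and", "in"}  # illustrative stop word list

    raw_terms = text.split()
    normalized = [t.lower() for t in raw_terms if t.lower() not in stop_words]

    print(len(set(raw_terms)), len(set(normalized)))  # distinct terms before vs. after: 9 5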
● Heaps’ law: estimating the number of terms
○ A better way of getting a handle on M (the number of distinct terms, i.e. the vocabulary size)
is Heaps’ law:
M = k * T^b
where,
T = number of tokens in the collection.
Typical parameter values are 30 ≤ k ≤ 100 and b ≈ 0.5.
○ The parameter k is quite variable because vocabulary growth depends a lot on
the nature of the collection and how it is processed. Case-folding and stemming
reduce the growth rate of the vocabulary, whereas including numbers and
spelling errors increase it.
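A quick numeric sketch of Heaps’ law; k = 44 and b = 0.49 below are illustrative values inside the typical ranges quoted above, not figures from the notes:

    # Heaps' law: estimate vocabulary size M from the number of tokens T.
    def heaps_vocab_size(num_tokens: int, k: float = 44.0, b: float = 0.49) -> float:
        return k * num_tokens ** b

    # With b close to 0.5, the vocabulary grows roughly with the square root of T.
    for T in (10_000, 1_000_000, 100_000_000):
        print(f"T = {T:>11,d}  ->  M ~ {heaps_vocab_size(T):,.0f}")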
● Zipf’s law: modeling the distribution of terms
○ A commonly used model of the distribution of terms in a collection is Zipf’s law.
○ It states that, if t1 is the most common term in the collection, t2 is the next most
common, and so on, then the collection frequency cf_i of the i-th most common
term is proportional to 1/i; equivalently, cf_i * i is approximately constant. A small
numeric sketch follows below.
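A short sketch of what Zipf’s law predicts; cf_1 below (the collection frequency of the most common term) is an illustrative number:

    # Zipf's law: the i-th most common term has collection frequency ~ cf_1 / i,
    # so the product cf_i * i stays roughly constant across ranks.
    cf_1 = 1_000_000  # illustrative collection frequency of the most common term

    for rank in (1, 2, 3, 10, 100):
        cf_i = cf_1 / rank
        print(rank, int(cf_i), int(cf_i * rank))  # last column stays constant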
-------------------------------------------------------------------------------------------------------------------------------